1.2 zettabytes of data created in 2010

For the last five years or so, IDC has released an EMC-sponsored study on “The Digital Universe” that looks at how much data is created and replicated around the world. When I last blogged about it back in 2008, the number stood at 281 exabytes per year. Now the latest report is out, and for the first time the amount of data created has surpassed 1 zettabyte! About 1.2 zettabytes were created and replicated in 2010 (that’s 1.2 trillion gigabytes), and IDC predicts that number will grow to 1.8 zettabytes this year. The amount of data is more than doubling every two years!

Here’s what the growth looks like:

How much data is that? Wikipedia has some good answers: exabyte, zettabyte. EMC has also provided some examples to help make sense of the number. 1.8 zettabytes is equivalent in sheer volume to:

  • Every person in Canada tweeting three tweets per minute for 242,976 years nonstop
  • Every person in the world having over 215 million high-resolution MRI scans per day
  • Over 200 billion HD movies (each two hours in length) – would take one person 47 million years to watch every movie 24/7
  • The amount of information needed to fill 57.5 billion 32GB Apple iPads. With that many iPads we could:
    • Create a wall of iPads, 4,005 miles long and 61 feet high extending from Anchorage, Alaska to Miami, Florida
    • Build the Great iPad Wall of China – at twice the average height of the original
    • Build a 20-foot high wall around South America
    • Cover 86 percent of Mexico City
    • Build a mountain 25 times higher than Mt. Fuji
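
Out of curiosity, I ran a quick back-of-the-envelope check on a couple of those figures. Here's a minimal sketch in Python, assuming decimal units (1 ZB = 10^21 bytes) throughout:

```python
# Quick sanity check of two of the 1.8 ZB equivalences above.
# Decimal units assumed: 1 ZB = 10**21 bytes, 1 GB = 10**9 bytes.
ZB = 10**21
GB = 10**9

total_bytes = 1.8 * ZB

# How many 32 GB iPads would it take to hold it all?
ipads = total_bytes / (32 * GB)
print(f"{ipads / 1e9:.1f} billion iPads")  # ~56 billion, close to EMC's 57.5

# 200 billion two-hour HD movies -> years of nonstop viewing
hours = 200e9 * 2
years = hours / 24 / 365.25
print(f"{years / 1e6:.0f} million years")  # ~46 million, close to EMC's 47
```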

That’s a lot of data!

EMC/IDC has produced a great infographic that explains more about the explosion of data – see it here in PDF. One of the things that has always been fuzzy for me is the difference between data we create intentionally (like a document) and data created as a side effect (all the copies and traffic generated when that document is shared). According to IDC, one gigabyte of stored data can generate one petabyte (1 million gigabytes) of transient data!

Cost is one of the biggest factors behind this growth, of course. The cost of creating, capturing, managing, and storing information is now just one sixth of what it was in 2005. Another big factor is that most of us now carry the tools of creation with us at all times, everywhere we go: digital cameras, mobile phones, and so on.

You can learn more about all of this and see a live information growth ticker at EMC’s website.

This seems as good a time as any to remind you to back up your important data! It may be easy to create photos and documents, but it's even easier to lose them. I use a variety of tools to back up data, including Amazon S3, Dropbox, and Windows Live Mesh. The easiest by far though is Backblaze – unlimited storage for $5 per month per computer, and it all happens automagically in the background.
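
If you'd rather roll your own S3 backups, here's a minimal sketch using the boto3 library (the bucket and file names are hypothetical, and AWS credentials are assumed to be configured already):

```python
# Minimal sketch: upload a single file to Amazon S3 for safekeeping.
# Bucket and file names are hypothetical; credentials come from your
# environment or AWS config.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    "photos/2011/vacation.zip",   # local file to back up
    "my-backup-bucket",           # destination bucket (hypothetical)
    "backups/vacation.zip",       # key (path) inside the bucket
)
```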

First look at Canada’s new Open Data portal: data.gc.ca

Yesterday the Government of Canada launched its open data portal at data.gc.ca. Open Data is one of three Open Government Initiatives, the other two being Open Information and Open Dialogue. Stockwell Day, President of the Treasury Board and Minister for the Asia-Pacific Gateway, issued a statement today on the launch:

“Today, I am pleased to announce the next step in our government’s commitment to enhancing transparency and accountability to Canadians. The expansion of open government will give Canadians the opportunity to access public information in more useful and readable formats, enable greater insight into the inner workings of the Government and empower citizens to participate more directly in the decision-making process.”

He goes on in the statement to say that Canada has historically led the way in providing information to citizens. Lately though, we’ve definitely fallen behind. I’m glad to see us moving forward once again. This development is no doubt the result of lots of work by many passionate Canadians, such as David Eaves. Here’s what he posted yesterday:

The launch of data.gc.ca is an important first step. It gives those of us interested in open data and open government a vehicle by which to get more data open and improve the accountability, transparency as well as business and social innovation.

David does a good job in that post of highlighting some of the issues the site currently faces, such as some problematic wording in the licensing, so I won’t repeat that here. Instead, I figured I’d do what I always do when I get new datasets to play with – make some charts!

The open data portal says there are 261,077 datasets currently available. Just 781 of those are “general” datasets; the rest are geospatial. That’s an impressive number of geospatial datasets, but they are somewhat less accessible (and perhaps less interesting) to the average Canadian than the general datasets. It looks like you need to be able to work with an ESRI shapefile to use most of them.
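
If you do want to dig into the geospatial data, a library like pyshp makes reading a shapefile fairly painless. A minimal sketch (the file name is hypothetical):

```python
# Minimal sketch: inspect a shapefile with pyshp (pip install pyshp).
# The file name is a hypothetical stand-in.
import shapefile

sf = shapefile.Reader("road_network.shp")
print(sf.shapeTypeName)                       # e.g. POLYLINE or POLYGON
print([field[0] for field in sf.fields[1:]])  # attribute column names
print(len(sf.shapes()))                       # number of records in the file
```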

There are lots of general datasets you might find interesting, however. For example, here’s the Consumer Price Index by city:

Here’s another dataset I thought was interesting – the number of foreign workers that have entered Canada, by region:

Have you ever wondered how much of each type of milk Albertans consume? You can find that out:

There’s actually a fairly broad range of datasets available, such as weather, agriculture, economics, and much more. As David said, it’s a good first step.
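
In case you're wondering, charts like the ones above don't take much work. Here's a minimal sketch using pandas and matplotlib, assuming you've downloaded one of the general datasets as a CSV (the file and column names are hypothetical):

```python
# Minimal sketch: load a dataset exported as CSV and chart it.
# File and column names are hypothetical stand-ins.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("consumer_price_index_by_city.csv")
df.plot(x="City", y="CPI", kind="bar", legend=False)
plt.ylabel("Consumer Price Index")
plt.tight_layout()
plt.show()
```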

I’m excited to see more ministries get involved, and I hope to see the number of datasets available increase over time. I’d also love to see the licensing change, perhaps by adopting the UK Open Government License as David suggested. Exciting times ahead!

281 exabytes of data created in 2007

I typed the title for this post into Windows Live Writer, and a red squiggly appeared under the word “exabytes”. I just added it to the dictionary, but I can’t help but think that it’ll be in there by default before long.

Either it takes three months to crunch the data or March is just the unofficial “how much did we create last year” month, because researchers at IDC have once again figured out how many bits and bytes of data were created in 2007. You’ll recall that in March of last year, they estimated the figure for 2006 to be 161 exabytes. For 2007, that number nearly doubled, to 281 exabytes (which is 281 billion gigabytes):

IDC attributes accelerated growth to the increasing popularity of digital television and cameras that rely on digital storage. Major drivers of digital content growth include surveillance, social networking, and cloud computing. Visual content like images and video account for the largest portion of the digital universe. According to IDC, there are now over a billion digital cameras and camera phones in the world and only ten percent of photos are captured on regular film.

This is obviously a very inexact science, but I suspect their estimates become more accurate with experience.

Interestingly, this is the first time that we’ve created more data than we have room to store (though one wonders whether that owes more to a lack of historical data than anything else).

Read: ars technica

Searching Wikipedia Sucks!

Have you tried searching Wikipedia lately? Don’t bother, because you probably won’t find what you’re looking for! I am continually amazed at how terrible the Wikipedia search results are. Here’s an example of what I mean. Go to Wikipedia, type “al gor” in the search box, and click the search button. You should see something like this. That’s right, the top results are Al-Merrikh, Cy-Gor, Firouzabad, and Kagame Inter-Club Cup.

Absolutely terrible! If you type the same thing in the search box at Google, not only do you get accurate results, but Google prompts you with “Did you mean: al gore”. Why yes, I did! So why is searching Wikipedia so bad?

Part of the problem is that Wikipedia actually has two search modes: “Go” and “Search”. If you type “Al Gore” (spelled correctly) in the box and click Go, you’re taken right to the entry about Al Gore. If you instead click Search, you’re taken to a list of articles that contain or reference “Al Gore”. You can read more about searching Wikipedia here. So they’ve sort of complicated things by including two buttons instead of just one. The Go button is useful when you know the name of the article you want, but useless otherwise.

The other part of the problem is that the search algorithm just plain sucks. I know they don’t have a lot of resources, but you’d think that one of the most popular websites on the web could have a decent search feature. Matching “al gor” with “al gore” is a problem that has been solved for years, yet Wikipedia doesn’t even come close to accomplishing it!
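
To illustrate just how solved this is, here's a minimal sketch using nothing but Python's standard difflib module (the list of titles is obviously made up):

```python
# Minimal sketch: fuzzy-match a typo against article titles using only
# the standard library. The title list is a made-up stand-in.
import difflib

titles = ["Al Gore", "Al-Merrikh", "Cy-Gor", "Firouzabad", "Kagame Inter-Club Cup"]
matches = difflib.get_close_matches("al gor", [t.lower() for t in titles])
print(matches)  # ['al gore'] -- a few lines of code, and the typo is handled
```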

Wikipedia itself mentions external search engines as a way to find what you’re looking for, but they aren’t really much better. For instance, if you type “al gor” at the special Google search for Wikipedia page, you do get the correct Al Gore entry as the first result, but the rest are not relevant at all.
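
For what it's worth, Wikipedia does expose an OpenSearch API that does prefix matching on article titles. Here's a minimal sketch using the requests library (it handles a prefix like "al gor", though it still won't fix an actual typo the way Google does):

```python
# Minimal sketch: ask Wikipedia's OpenSearch API for title suggestions.
import requests

resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={"action": "opensearch", "search": "al gor", "format": "json"},
)
print(resp.json()[1])  # list of suggested article titles
```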

So here’s where we’re at. Google knows that if you type “al gor” you probably mean “Al Gore”. Wikipedia knows about all of the entries that reference “Al Gore”. What we need is a way to combine the two! Is that really so much to ask?

If you know of a better way to search Wikipedia, please let me know!

The Gatekeepers of Privacy

As you know, I don’t worry that much about online privacy. In fact, I think it’s a huge waste of time to be overly concerned about privacy on the web. I always keep two things in mind:

  1. There is no such thing as private information.
  2. If someone looks at information online and draws a negative impression about me, I have larger problems than privacy to worry about.

So far my strategy has been working fairly well. To my knowledge I haven’t missed out on any opportunities because of information about me found on the web – quite the opposite in fact.

For some reason though, I am fascinated by the worries and concerns of others when it comes to information privacy. And believe me, there are a lot of worriers out there. So many, it seems, that Global TV‘s troubleshooter looked at the security of Facebook and other popular websites last night (unfortunately they haven’t fully embraced the new web, and the video is not available on their site).

They contacted a local “hacking” firm, and asked them to review Facebook, Gmail, and other popular sites. The gentleman they spoke to couldn’t have been more cliché – long hair, super geeky, could be mistaken for a girl, you know the type. Anyway, they apparently spent over 30 hours trying to “hack” into Facebook and couldn’t get in. I just shook my head through all of this. They deemed Facebook “very secure”. Well, problem solved I guess, haha!

Then they spoke to a professor from the UofA (if I remember correctly) who said that living under the assumption that your information is safe is a dangerous thing to do. Finally someone smart! The segment then ended with the anchors asking each other if they were on Facebook (they aren’t, unfortunately). Oh and the suggestion that you should read the privacy policy of every site you visit (yeah, cuz that’s going to happen).

It doesn’t matter how secure Facebook is. Privacy is not about technology. If someone wants to find out something about you, they will. Social engineering, dumpster diving, and many other techniques are far more effective than trying to hack into a site like Facebook. More importantly, there’s no need to – just create your own Facebook account! Chances are, the person you’re interested in hasn’t adjusted their privacy settings anyway.

For its part, Facebook follows two core principles:

  1. You should have control over your personal information.
  2. You should have access to the information others want to share.

A respectable policy, no doubt. Here’s the problem though. Let’s say I give access to certain information only to my brother. No one else (in theory) can see it, right? Wrong. I can give my brother access to the information, but I can’t restrict him from doing something with it.

Technology is just a tool. People are the gatekeepers of privacy.

161 exabytes of data created in 2006

There’s a new report out from research firm IDC that attempts to count up all the zeroes and ones that fly around our digital world. I remember reading about the last such report, from the University of California, Berkeley. That report found that 5 exabytes of data were created in 2003. The new IDC report says the number for 2006 is 161 exabytes! Why the difference?

[The Berkeley researchers] also counted non-electronic information, such as analog radio broadcasts or printed office memos, and tallied how much space that would consume if digitized. And they examined original data only, not all the times things got copied.

In comparison, the IDC numbers ballooned with the inclusion of content as it was created and as it was reproduced – for example, as a digital TV file was made and every time it landed on a screen. If IDC tracked original data only, its result would have been 40 exabytes.

Even still, that’s an incredible increase in just three years. Apparently we don’t even have enough space to store all that data:

IDC estimates that the world had 185 exabytes of storage available last year and will have 601 exabytes in 2010. But the amount of stuff generated is expected to jump from 161 exabytes last year to 988 exabytes (closing in on 1 zettabyte) in 2010.
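
That works out to an implied growth rate of roughly 57% per year. A quick back-of-the-envelope check (my math, not IDC's):

```python
# Implied compound annual growth: 161 EB in 2006 to 988 EB in 2010.
rate = (988 / 161) ** (1 / 4) - 1
print(f"{rate:.0%} per year")  # ~57%, i.e. the total roughly doubles every 18 months
```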

Pretty hardcore, huh? You can read about zettabytes at Wikipedia. I’m not too worried about not having enough space though, even if we were attempting to store all that data (which we aren’t). Hard drives are already approaching the terabyte mark, so who knows how big they’ll be in 2010. Then of course there’s also the ever-falling cost of DVD-like media.

More importantly, I bet a lot of the storage we “have available” right now is totally underutilized. You’d be hard pressed to find a computer that comes with less than 80 GB of storage these days, and I can assure you there are plenty of users who never even come close to filling it up. Heck, even I am only using about 75% of the storage I have available on my computer (420 GB out of 570 GB) and I bet a lot of it could be deleted (I’m a digital pack rat).

Read: Yahoo! News

Podcasting Professor Has Website Suspended

Podcasting and education – I think it’s only a matter of time, once the issues that make educational institutions uneasy are worked out. And to be sure, educators are already experimenting with podcasting, like communication and technology professor Robert Schrag. The problem is that he decided to charge for his podcasts, and NC State University didn’t like that too much (via Podcasting News):

Schrag had made his lectures available to students and the general public online for a fee of $2.50. The University questioned whether this practice was ethical, referring to the inconsistencies in opinion concerning intellectual property and decided to ask Schrag to suspend the Web site until copyright-issue clarifications could be made.

Other than to make a small profit, I don’t know why Schrag was charging for his podcasts. I highly doubt he gave the money to the university to cover his (probably very small) bandwidth costs. Interestingly enough, when he asked his class about the situation, only four students said the podcasts should be free, and no one said the site should have been taken down.

This situation brings up a bunch of questions. As a paying student, is recording what the professor says for my own consumption any different than frantically trying to write everything down? Does the university own the content that the professor delivers, or does the professor himself/herself retain ownership? Why should I as a student have to pay extra to get an audio file of the lecture?

And perhaps most important of all, is podcasting just something universities need to embrace in order to keep up with the times? I think it might be, kind of like replacing blackboards with whiteboards or overhead projectors with digital projectors and computers. Schrag has the right idea:

“I’m not sorry I made the choice and I hope I can get back to giving the information,” Schrag said.

After all, isn’t the primary function of a university to disseminate information? We call it teaching or learning, but really, a university is just a fancy way to spread information and knowledge to the population. Podcasting, then, should be viewed by universities as just another tool to help them spread information.

Read: Technician Online

Visualizing Information

One thing that really interests me is the different ways in which you can visualize information. Most often, text is simply not the best way. A picture really is worth a thousand words! Audio, video, or animation of some sort can also be quite helpful for comprehending something that is difficult to grasp with words alone. A good example is this post put together by Matt at the Signals vs. Noise blog:

Science presents some of the most interesting challenges for information designers. How do you help people grasp sizes, distances, and ratios that are nearly unimaginable?

You’ve got to check out some of the images he found, especially if you’re a teacher or parent. There’s some really interesting stuff! My favorite is the image of our solar system.

Read: Signals vs. Noise

Wikipedia Under Fire

Wikipedia is without a doubt one of my favorite websites. Even though I have only ever made one or two contributions to Wikipedia, I find the site invaluable for research. The vast amount of information immediately available is hard to overlook for research of any sort (there are 848,598 English language articles as of this post). If you have a question about something, you can probably find the answer at Wikipedia.

Called “the self-organizing, self-repairing, hyperaddictive library of the future” by Wired Magazine in March of 2005, Wikipedia has enjoyed much success. The Wired article is just one of many mainstream media articles praising the site, and there are many thousands if not millions of bloggers and others who use and recommend Wikipedia each and every day. The New York Times offers some numbers describing Wikipedia’s success:

The whole nonprofit enterprise began in January 2001, the brainchild of Jimmy Wales, 39, a former futures and options trader who lives in St. Petersburg, Fla. He said he had hoped to advance the promise of the Internet as a place for sharing information.

It has, by most measures, been a spectacular success. Wikipedia is now the biggest encyclopedia in the history of the world. As of Friday, it was receiving 2.5 billion page views a month, and offering at least 1,000 articles in 82 languages. The number of articles, already close to two million, is growing by 7 percent a month. And Mr. Wales said that traffic doubles every four months.

Lately though, despite all of the success and impressive usage numbers, cracks have started to appear. Two questions, both of which have been asked before, have once again been brought into the spotlight – just how reliable is the information found on Wikipedia, and where is the accountability?

Consider what happened to John Seigenthaler Sr.:

ACCORDING to Wikipedia, the online encyclopedia, John Seigenthaler Sr. is 78 years old and the former editor of The Tennessean in Nashville. But is that information, or anything else in Mr. Seigenthaler’s biography, true?

The question arises because Mr. Seigenthaler recently read about himself on Wikipedia and was shocked to learn that he “was thought to have been directly involved in the Kennedy assassinations of both John and his brother Bobby.”

If any assassination was going on, Mr. Seigenthaler (who is 78 and did edit The Tennessean) wrote last week in an op-ed article in USA Today, it was of his character.

Whoever added that false information to the article did so anonymously, so beyond publicly stating the truth, Mr. Seigenthaler really had no recourse. So there’s the issue of false information, and how to stop people from entering it. Wikipedia works on the premise that mistakes are caught by later contributors and by regular users who monitor changes. Clearly, that doesn’t always work.

If reliability and accountability weren’t enough, how about ethics? Should you edit the entry for something you were involved in? The question was raised earlier this week when Adam Curry attempted to make some changes to the entry for Podcasting. Dave Winer explains:

Now after reading about the Seigenthaler affair, and revelations about Adam Curry’s rewriting of the podcasting history — the bigger problem is that Wikipedia is so often considered authoritative. That must stop now, surely. Every fact in there must be considered partisan, written by someone with a conflict of interest. Further, we need to determine what authority means in the age of Internet scholarship. And we need to take a step back and ask if we really want the participants in history to write and rewrite the history. Isn’t there a place in this century for historians, non-participants who observe and report on the events?

Dave makes some very good points. Upon first reading his entry, I thought the question of historians and third-party observers was very obvious and a simple way to resolve these kinds of issues. The more I thought about it though, the less sure I felt. Requiring historians and non-participants to write the entries simply because that’s the way we’ve always done it may not be the best way to move forward. Thanks to Wikipedia and the web in general, we have the ability to turn the conventional wisdom “the winners write the history books” completely upside down. By editing websites like Wikipedia as events are taking place (such as the creation of podcasting), do we not have a better chance of capturing a more realistic view of history? If all sides of an issue can enter their views, do we not have a more accurate and complete entry? Of course, we unfortunately need to deal with flame wars in many of these cases, but maybe that will change as the process matures.

The issues I mentioned above are currently getting a lot of attention, and are pretty natural in the evolution of a system like Wikipedia. I don’t think anyone should be surprised that questions of reliability, accountability and ethics are being asked. And if you really stop and think, you’ll probably realize that the solution to all of these problems has been around for a very long time. As with all websites on the Internet, it is up to the reader to use his or her best judgement in evaluating the accuracy and relevance of the information on a web page. Searching the information available at Wikipedia should be no different than searching the information available in Google – reader/searcher/user beware.