TransitCamp Edmonton: Data for Developers

I’ve been looking forward to this presentation for a long time! As you may know, I’ve been one of the more vocal citizens asking for an API or data dump from Edmonton Transit. I think only positive things will result from giving everyone access to the data! ETS simply doesn’t have the resources to build interfaces for the iPhone, SMS, etc., so releasing the data would enable other people to build them instead!

Today at TransitCamp Edmonton, I’m pleased to share with you that ETS has become the 2nd transit authority in Canada (and 29th in the world) to release their route and schedule information for free in the GTFS format!

Here are the slides from my presentation:

The ETS GTFS data is about 16 MB compressed and 177 MB uncompressed, so it’s quite a bit of data. If you’re looking for some help getting started, I’d suggest checking out the googletransitdatafeed project and the timetablepublisher project.
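To give you a feel for what’s inside, a GTFS feed is just a zip archive of plain-text CSV files (stops.txt, routes.txt, trips.txt, stop_times.txt, and so on). Here’s a minimal Python sketch that lists the routes in the feed; the filename is a placeholder for whatever the ETS download ends up being called:

```python
import csv
import io
import zipfile

# A GTFS feed is a zip of plain-text CSV files.
# "ets_gtfs.zip" is a placeholder filename for the ETS download.
with zipfile.ZipFile("ets_gtfs.zip") as feed:
    print("Files in the feed:", ", ".join(feed.namelist()))

    # routes.txt is one of the required GTFS files; print the first few routes.
    with feed.open("routes.txt") as f:
        reader = csv.DictReader(io.TextIOWrapper(f, encoding="utf-8-sig"))
        for i, route in enumerate(reader):
            print(route.get("route_short_name", ""), "-", route.get("route_long_name", ""))
            if i >= 9:
                break
```

Once you can read the files, the interesting work is joining them together (trips to stop_times to stops) to answer questions like “when does the next bus leave this stop?” The projects mentioned above include tools that do exactly that kind of thing.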

We’re also going to be holding a programming competition, as a little extra incentive for you to build something cool and useful with the data. So far we’ve got three prizes: 6 months of free transit for first place, 4 months for second place, and 2 months for third place (to clarify: that’s 6 months for the team, not for each individual on the team). I don’t have all the details yet, but stay tuned. I’ll be posting more information on the TransitCamp site (and here).

I think this is fantastic. Open cities are the future, and this is a big step in the right direction for the City of Edmonton.

Mountains of data, right at your fingertips

Last week, two announcements caught my eye. The first was from Amazon.com, which announced that there is now more than 1 TB of public data available to developers through its Public Data Sets on AWS project. The second was from the New York Times, which announced its Newswire API, providing access to all NYTimes articles as they are published.

This is a big deal. Never before has so much data been so readily available to anyone. The AWS data is particularly interesting. All of a sudden, any developer in the world has cost-effective access to all publicly available DNA sequences (including the entire Human Genome), an entire dump of Wikipedia, US Census data, and much more. Perhaps most importantly, the data is in machine-readable formats. It’s relatively easy for developers to tap into the data sources for cross-referencing, statistical analysis, and who knows what else.
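As a concrete illustration (not from the announcement itself): the public data sets are published as EBS snapshots, so the basic recipe is to create a volume from a snapshot, attach it to an EC2 instance, and mount it like any other disk. A rough sketch using boto3; the snapshot ID, region, and instance ID below are placeholders, not real values:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create a volume from a public data set snapshot (placeholder ID).
volume = ec2.create_volume(
    SnapshotId="snap-0123456789abcdef0",
    AvailabilityZone="us-east-1a",
)

# Wait until the volume is ready, then attach it to your instance (placeholder ID).
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])
ec2.attach_volume(
    VolumeId=volume["VolumeId"],
    InstanceId="i-0123456789abcdef0",
    Device="/dev/sdf",
)
# After attaching, mount the device on the instance and read the data directly.
```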

The Newswire API is also really intriguing. It’s part of a growing set of APIs that the New York Times has made available. With the Newswire API, developers can get links and metadata for new articles the minute they are published. What will developers do with this data? Again, who knows. Imagination is the only limitation now that everyone can have immediate access.
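Getting at the newest articles is basically just an HTTP GET with an API key. The endpoint URL and field names in this sketch are my assumptions from the public documentation, so double-check them against developer.nytimes.com before relying on them:

```python
import requests

API_KEY = "YOUR_NYT_API_KEY"  # placeholder: register for a key at developer.nytimes.com
URL = "https://api.nytimes.com/svc/news/v3/content/all/all.json"  # assumed endpoint

resp = requests.get(URL, params={"api-key": API_KEY}, timeout=10)
resp.raise_for_status()

# Each result should carry a headline, section, publication date, and canonical URL.
for item in resp.json().get("results", []):
    print(item.get("published_date"), item.get("section"), "-", item.get("title"))
    print("  ", item.get("url"))
```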

Both of these projects remove barriers and will help foster invention, innovation, and discovery. I hope they are part of a larger trend, where simple access to data becomes the norm. Google’s mission might be to organize the world’s information and make it universally accessible and useful, but it’s projects like these that are making that vision a reality. I can’t wait to see what comes next!

Thoughts on backups with MozyPro

At around 1:30am on August 6th, a hard drive in one of our database servers died. It took down our mail server and WordPress blogs, but everything else (such as Podcast Spot) was unaffected. It sucks, but it happens. We’ve had many drives die over the last few years, unfortunately. All you can do is learn from each experience.

In this case, we had a full image of the server backed up. All we had to do was stick in a new hard drive, and deploy the image. That worked fairly well, though it did take some time to complete. The only problem was that the image was about 24 hours old – fine for system files, but not good for the data files we needed. For the most up-to-date data files, we relied on MozyPro.

(I should point out that we generally configure things so that data files are on separate drives from the system. In this case, we had about 250 MB of data files on the system drive. I have since reconfigured that.)

For the most part, Mozy has worked well for us. We’ve had a few bumps along the way, but no major complaints or problems. Until I tried to restore the data files yesterday, that is. The first problem was that I couldn’t use the Windows interface. The Mozy client would not “see” the last backup, presumably because the image was older than the last backup. You’d think it could connect to Mozy and figure that out, but apparently not. So I tried the Web Restore instead. It eventually worked, but it took about four hours to get the files. By that I don’t mean downloading them; I mean the time it took Mozy to make them available for download. Thank goodness it was only about 1,000 files and 250 MB, or it could have taken days!

So I learned that Mozy is reliable, but certainly not quick. If you need to restore something quickly, make sure you have a local backup somewhere. If you’re just looking for reliable, inexpensive, offsite storage then Mozy will probably work fine for you.

My next task is to upgrade this particular server to a RAID configuration, something we had been planning to do anyway. Should have done it sooner!

281 exabytes of data created in 2007

I typed the title for this post into Windows Live Writer, and a red squiggly appeared under the word “exabytes”. I just added it to the dictionary, but I can’t help but think that it’ll be in there by default before long.

Either it takes three months to crunch the data or March is just the unofficial “how much did we create last year” month, because researchers at IDC have once again figured out how many bits and bytes of data were created in 2007. You’ll recall that in March of last year, they estimated the figure for 2006 to be 161 exabytes. For 2007, that number nearly doubled, to 281 exabytes (which is 281 billion gigabytes):

IDC attributes accelerated growth to the increasing popularity of digital television and cameras that rely on digital storage. Major drivers of digital content growth include surveillance, social networking, and cloud computing. Visual content like images and video account for the largest portion of the digital universe. According to IDC, there are now over a billion digital cameras and camera phones in the world and only ten percent of photos are captured on regular film.

This is obviously a very inexact science, but I suspect their estimates become more accurate with experience.

Interestingly, this is the first time that we’ve created more data than we have room to store (though one wonders if that’s due more to a lack of historical data than to anything else).

Read: ars technica

161 exabytes of data created in 2006

There’s a new report out from research firm IDC that attempts to count up all the zeroes and ones that fly around our digital world. I remember reading about the last such report, from the University of California, Berkeley. That report found that 5 exabytes of data were created in 2003. The new IDC report says the number for 2006 is 161 exabytes! Why the difference?

[The Berkeley researchers] also counted non-electronic information, such as analog radio broadcasts or printed office memos, and tallied how much space that would consume if digitized. And they examined original data only, not all the times things got copied.

In comparison, the IDC numbers ballooned with the inclusion of content as it was created and as it was reproduced – for example, as a digital TV file was made and every time it landed on a screen. If IDC tracked original data only, its result would have been 40 exabytes.

Even still, that’s an incredible increase in just three years. Apparently we don’t even have enough space to store all that data:

IDC estimates that the world had 185 exabytes of storage available last year and will have 601 exabytes in 2010. But the amount of stuff generated is expected to jump from 161 exabytes last year to 988 exabytes (closing in on 1 zettabyte) in 2010.

Pretty hardcore, huh? You can read about zettabytes at Wikipedia. I’m not too worried about not having enough space though, even if we were attempting to store all that data (which we aren’t). Hard drives are already approaching the terabyte mark, so who knows how big they’ll be in 2010. Then of course there’s also the ever-falling cost of DVD-like media.
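If you want a feel for those units, the conversion is simple decimal arithmetic (1 EB is a billion GB, and 1,000 EB make 1 ZB). A tiny sketch using the figures quoted above:

```python
# Back-of-the-envelope arithmetic on the IDC figures quoted above
# (decimal units: 1 EB = 10**9 GB, 1 ZB = 1,000 EB).
EB_TO_GB = 10**9

created_2006 = 161   # exabytes created in 2006
created_2010 = 988   # projected exabytes created in 2010
storage_2010 = 601   # projected exabytes of storage available in 2010

print(f"2006 output: {created_2006 * EB_TO_GB:,} GB")
shortfall = created_2010 - storage_2010
print(f"Projected 2010 shortfall: {shortfall} EB ({shortfall / 1000:.3f} ZB)")
```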

More importantly, I bet a lot of the storage we “have available” right now is totally underutilized. You’d be hard pressed to find a computer that comes with less than 80 GB of storage these days, and I can assure you there are plenty of users who never even come close to filling it up. Heck, even I am only using about 75% of the storage I have available on my computer (420 GB out of 570 GB) and I bet a lot of it could be deleted (I’m a digital pack rat).

Read: Yahoo! News