How Not To Generate A CRC

by

Yesterday we had a success disaster. An überpopular mashup album that mixed the Wu Tang Clan and The Beatles was released a few days ago and yesterday it got stumbled upon and boing boinged and tweeted to the four winds. Our downloads per minute went through the roof. We’ve dealt with big releases in the past and our download machines typically respond with a big collective yawn. This time around we had problems. Downloads were slow to start and the downloader processes crashed a bunch. We eventually had to make the album streaming-only for a few hours until we figured it out.

I remember first hearing the term “success disaster” from Chuck Geschke, co-founder of Adobe and certified wonderful guy. I’d been there a few years working on Photoshop and Illustrator. This was back when the apps were just a small percent of Adobe’s business. Adobe made most of its money from licensing PostScript and selling laser printer schematics to other companies. The apps were taking off and Adobe was unprepared to deal with a big, sudden increase of flesh-and-blood customers. Phone support was understaffed, production and fulfillment channels were too small. It’s a success, and a disaster. Yay.

Wu & Co’s download success was our download disaster. What’s interesting about what happened yesterday is that the disaster was made a lot worse by a bug in my code, a really stupid, awesome bug. If the bug hadn’t been there the downloaders would’ve hummed through the day with no more than a lazy stretch. The bug’s been in production for at least six months, annoying the downloaders but staying under the radar until now.

“I work for a music publishing company in Australia, we have a number of artists who currently use bandcamp… I’ve been very impressed with the simplicity of design, flawless coding and speed of downloads. I think that you understand the way people consume music, better than most other companies.”

Flawless coding. Yeah.

This is about how not to generate a stable CRC. We have this very cool system in Bandcamp where tracks are turned into the format you’re downloading – mp3, aac, flac, etc. – on demand. If you’re the first person to download that new bandcamp.com/artists track in ogg vorbis, the downloader takes a small detour and makes it first. Once made, it caches it so that future ogg vorbis downloads happen immediately. There’s an aging algorithm in the cache that keeps active tracks around and garbage collects the inactive ones. In this way we’re able to run a site with hundreds of thousands of tracks and millions of downloads a week on way less hardware than you’d think.

In addition to generating formats on demand, the downloaders also check to see if the metadata in a track is up-to-date before starting a download. Lots of pieces of info get put into an mp3’s metadata: the track title and artist name, ’course, the track number, release date, album name, album artwork, lyrics (which is totally cool on an iPhone), etc. If any of this info was changed since the last time the track was downloaded, a metadata job is fired off to update the mp3 before the download begins. (we put metadata in all the downloadable formats, but I’ll use mp3s as the example here.)


Lyrics.

Here’s the bug. For every file I store a signature of the metadata in our database. The signature’s a simple CRC32 of all the relevant metadata info. The info tends to be a couple hundred bytes so nothing fancier than a 32-bit integer CRC is necessary. Before every download I gather the current metadata info from the database – the track info, the info about the album the track’s on, the band’s info – generate a CRC and compare it to the CRC in the database. If they’re different, it’s time to update the mp3. And then the new CRC gets stored in the database to compare against in future downloads. Storing a CRC is tiny, efficient and much better than storing a snapshot of all the metadata info itself.

But darlin’. The CRC really needs to be stable in order to be useful. One’s CRC algorithm (not mentioning names) is useless if it generates a different number for the same source data. I call the CRC routine today and I get X. Tomorrow I get Y and X != Y even though the metadata hasn’t changed. That’s worse than useless because we’re depending on the CRC to reduce the number of times we have to update the metadata of every track that’s downloaded. If we get this wrong, we may underestimate the number of machines we need in our system dedicated to doing the metadata jobs. They may be underpowered for a large rush of downloads. They may get backed up and downloads may pile up and time out. And then Ethan may have to spend 6 1/2 hours answering support emails. Just sayin’.

Here’s how I did it. I have a routine that collects all the metadata used in an mp3 and puts it into an array of arrays, something like this:

[
['TIT2', {:encoding => 1, :text => 'Do I Move You?'}],
['TPE1', {:encoding => 1, :text => 'Nina Simone'}],
['USLT', {:encoding => 1, :text => '...', :lang => 'eng'}],
...
]

The first element of each array, for example ‘TIT2’, is the official ID3 metadata key (see www.id3.org/d3v2.3.0) and the second element is the metadata itself, for example the track’s title. This array of arrays is handed to a routine that does the (ridiculously hard) work of updating an mp3’s innards with the new metadata.

This blob of track data is also given directly to the CRC routine. Makes sense, right? The CRC is dependent on all the metadata that’s going into the mp3, so just use the same data blob that’s actually used to set the metadata. I simply run a CRC32 algorithm right on the array:

return Zlib.crc32(Marshal.dump(metadata_array))

The stoopid stoopid stoopid part that I’m embarrassed I overlooked is that hashes are not guaranteed to be read in the same order that they are created. If you do this:

h = {:a => 1, :b => 2}
puts h.inspect

You might get this:

{:a => 1, :b => 2}

or this:

{:b => 2, :a => 1}

Depending on the process’s mood. It’s not a mistake. Hashes are unordered sets of data. It almost never matters what order the keys are in; if you want the value of :x you’ll always get 1 and with :y you’ll always get 2. If you want an ordered set then you use an array. Duh. In practice the order of items in a hash will actually be the same as long as the program is running. But if you stop the program and run it again, you might get a different order. (I haven’t looked at the implementation but I imagine it has to do with the values of object ids. Doesn’t Ruby 1.9 do something different?)

So in other words, because the metadata blob given to the CRC routine contains hashes, and because the order of items in a hash is not fixed, the generated CRC will be different from time to time. Run the program once you get X. Run it again, the exact same code, you get Y.

Our downloader processes have great uptime. They’re rarely restarted. So for a given downloader the CRC will be stable. What’s humorous about what happened yesterday is that the downloads were being handled simultaneously by the two downloader machines morbo and linda. Linda would get a request for the Wu’s album, calculate CRCs for every track (and there are 27 of them), update the mp3s, cache those files and save the new CRCs in the database. Subsequent download requests to linda would match and the cached files would be used. Meanwhile morbo was doing the same thing, coming up with its own CRCs, updating the mp3s again, re-caching the files and updating the database. Basically morbo and linda were fighting over the “correct” CRCs, disagreeing in their silicon way, and generating tons of metadata update jobs and file I/O. Oh linda, how cute. You got the CRC wrong. I’ll fix it for you. Morbo you zoo creature! Take that! 27 tracks times mp3-320, mp3-v0, mp3-v2, etc. times thousands of downloads per hour. The metadata jobs backed up and the cache disk write I/O went to hell. Downloads failed. Ethan emailed.

Of course when I first saw these thousands of metadata jobs fly by in the logs my honed engineering intuition led me to blame someone else. I figured the artist was sitting at his site madly tweaking the release date in some ambien ADHD fit. We actually sent him an email saying Lay off the edits! In retrospect I had seen clues of this in logs over the months. I remember noticing that one track had been updated more than a thousand times and thinking what a dick.

I had the fix for this in half an hour. Five lines of code that temporarily converted the hashes to arrays and sorted them by key, ensuring that the order was predictable and stable across all processes.

[
['TIT2', [[:encoding, 1], [:text, 'Do I Move You?']] ],
['TPE1', [[:encoding, 1], [:text, 'Nina Simone']] ],
['USLT', [[:encoding, 1], [:lang, 'eng'], [:text, '...']] ],
...
]

(Note how :lang now comes before :text.)

We pushed this fix to our downloaders a few hours after the success disaster hit and later re-enabled downloads for Wu vs. B. Everything’s humming along fine. When we restart a downloader we have to wait until all of its current downloads dry up. In a glorious final middle finger to me for my grievous sins, linda happened to be in the middle of incorrectly calculating the CRCs and updating the metadata for every track of a fucking song a day for an entire year album. Yeah, you want me to restart? FU. (Righteous album, dubs.)


About these ads

2 Responses to “How Not To Generate A CRC”

  1. nils Says:

    awesome. fun read joe!

  2. blackeyedpreacher Says:

    Bandcamp goes from strength to strength! Excellent work, excellent!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


Follow

Get every new post delivered to your Inbox.

%d bloggers like this: