Be careful how you rsyslog

2015-04-28 by Leigh Dyer

Bandcamp was offline briefly yesterday due to what I like to call an unexpected single point of failure. Good systems design is all about addressing single points of failure, making sure you have redundancies in place, but sometimes you discover single points of failure that you didn’t realise you had.

Yesterday’s problem was caused by maintenance on our central rsyslog server, which we use to collect analytics from our application servers. When that central server went down, it set a chain of events in motion:

  • Remote logging from our app servers blocked, since we have rsyslog configured to use TCP, which attempts to guarantee delivery.
  • Those blocked messages blocked all syslog logging on the app servers, since the default rsyslog configuration puts all logging into a single delivery queue.
  • Within minutes that delivery queue filled up, causing all subsequent logging requests to block, freezing not just our apps but also system services like sshd. So, no logging in.

In the course of responding to the outage we quickly decided it was prudent to reboot the affected servers before continuing to investigate the root cause. We didn’t know at the time that once we restarted our apps, the countdown clock started ticking again. Fortunately we got to the bottom of the problem before the servers froze up again.

Lesson learned! Our rsyslog configuration now uses a dedicated queue for remote logging, and that queue spills over to disk if it fills up, preventing rsyslog from blocking logging if the central server goes offline. Here’s the relevant code:

# Queue parameters apply to the next action (the remote forward below).
$ActionQueueType LinkedList      # dedicated in-memory queue for this action
$ActionQueueFileName apptimer    # naming a queue file makes it disk-assisted: it spills to disk when memory fills
$ActionResumeRetryCount -1       # retry forever rather than discarding messages
$ActionQueueSaveOnShutdown on    # persist any queued messages across restarts
local0.* @@rsyslogserver:10514   # @@ means TCP delivery to the central server

Leigh Dyer is the Lead Engineer of the Systems Team at Bandcamp, Inc.


Expression Evaluators

2010-02-07

Bugs are fun but this time I want to talk about win. Recently I finished work on a recursive descent expression evaluator in the Bandcamp code. (That’s a mouthful and I enjoy slinging that whenever I can—it’s up there with binary space partitions.) An expression evaluator is something that can take a sequence of characters like

2 + 2

and figure out that it means to add two and two and give up four. Expression evaluators are obscure but oddly enough I’ve written at least five in my professional life.

The 1st brush came when I was 25, new at Adobe, and beginning to seriously dig into the C programming language. In the appendix of K&R I found the grammar for the entire language in a formal notation called “modified BNF”. It described, line by line, every element of the language beginning with the highest thing, the mysteriously named translation-unit, and breaking it down from there:

translation-unit:
  external-declaration
  translation-unit external-declaration

external-declaration:
  function-definition
  declaration

function-definition:
  declaration-specifiers_opt declarator declaration-list_opt compound-statement
...

In this grammar a thing is defined in terms of simpler things. And these simpler things are subsequently defined in terms of even simpler things. And so on, downward, until you get to the atoms of a language, basic elements like integer-constants and character-constants. This is a conventional device in Computing.

Can we take a moment to worship at the altar of K&R? In 189 thin pages and eight quick chapters you have everything there is to know about the C programming language. The tutorial that’s chapter 1 is a model of concise, practical writing. Please, for the love of God, everyone who wakes up and thinks “I’ll write a computer book!”: internalize and become one with this chapter. I mean the entire book.

So I wrote a program that understood C based on this grammar. It didn’t evaluate the language, but it could parse it. Meaning it was code that understood the rules for the arrangement of the keywords and symbols in the language. Programming is all about reference and so of course self-reference pops up from time to time—I had written a program in C that understood C.

You think C’s an obscure language that should be put out to pasture? You owe everything you’re doing on a computer right now to C. Your operating system is written in C. Your warm and fuzzy language that’s not C is written in C (and C is written in C—ha!). Photoshop is written in C (okay C++ but we regret ever touching the ++). Apache is written in C. The syntax of your favorite language is heavily influenced by C. The world runs on C. #thatsawesome #fistpump

The 2nd and 3rd times I wrote an expression evaluator were when I wrote some plug-ins for Photoshop a few years later. Photoshop has an external programming interface that programmers can use to write little (and big) plug-ins that live in the app and do algorithmic things to images. When Photoshop invokes a plug-in the code is given the pixels of the image, the code messes with those pixels and voilà, an effect. A butt-simple plug-in could zero the green and blue values of every pixel, effectively putting the image through a red filter. Or it could halve every value, making the image darker. It could peek at nearby pixels; that’s how blurring is done. Once you realize pixels are just numbers you can go crazy with the effects.
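To make the “pixels are just numbers” idea concrete, here’s a toy sketch in Ruby (real plug-ins work on raw pixel buffers in C, but the arithmetic is the same):

# Toy illustration: an image as an array of [r, g, b] triples (0-255).
def red_filter(pixels)
  pixels.map { |r, g, b| [r, 0, 0] }              # zero out green and blue
end

def darken(pixels)
  pixels.map { |r, g, b| [r / 2, g / 2, b / 2] }  # halve every channel
end

image = [[200, 120, 40], [10, 250, 90]]
red_filter(image)   # => [[200, 0, 0], [10, 0, 0]]
darken(image)       # => [[100, 60, 20], [5, 125, 45]]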

Fellow Adobeite Joe Ternasky had the idea of making a meta plug-in—a plug-in for making plug-ins—where you could enter expressions into boxes:

R = [    ]
G = [    ]
B = [    ]

Then when the plug-in runs it applies those expressions to each pixel in the image.

R = r
G = 0
B = 0

would be the red filter.

R = r/2
G = g/2
B = b/2

would be the darkener. All the normal math operators are available, as are several functions to give the pixel’s original values, coordinates, its distance and angle from the center, etc., and the values of a few user-defined sliders. If you were enough of a math-head you could write your own plug-ins to do the canonical resize, rotate, blur, and then go crazy with whatever other effects you imagined. The plug-in was called Filter Factory.

It had a fan club.

Filter Factory needed an expression evaluator so Joe and I wrote one. Then for fun we rewrote it as a set of expression compilers that turned the expressions into raw machine language. Joe did the 68K compiler and I did the PowerPC. The filters consequently ran just as fast as the ones built into Photoshop. It was cool. This is when I began to understand recursive descent expression evaluators.

Fast forward to Bandcamp. The 4th expression evaluator I wrote was last summer for our HTML templates. A template might have lines like this:

{{band.name}}
<a href="{{band.stats_url}}">stats</a>

When the page is served the {{curly brace}} things are filled in by the Liquid template engine we use. Liquid can do conditionals—the stats link on a band’s page for example is shown only if you’re logged in:

{% if user.logged_in %}
<a href="{{band.stats_url}}">stats</a>
{% endif %}

The user never sees the curly braces or the conditionals. The HTML is put together on the server and the custom page with the {{blanks}} filled in and the conditionals figured out is sent down to you. Liquid is too limited for some of the really awesome stuff we do in Bandcamp, but because it’s Ruby and open source we’ve been able to hack on it. I added a full-on expression evaluator. Now we can have smarter conditionals and counters.

{% if user.logged_in && user.admin_level >= 10 %}
...
{% let items_shown = items_shown + 1 %}

This expression evaluator was a breeze to write in Ruby. It’s only a page of code and doesn’t cheat by calling eval().

The 5th expression evaluator. This week I’ve been improving the system we use to distribute files among the servers that together make up Bandcamp. All of the files on Bandcamp (images, audio tracks, etc.) are kept in a big virtual file system called the zstore. It looks like one really big hard drive that’s made from the hard drives of all the machines. The files are spread across all the servers, and no single server has a complete set of all the files. There is redundancy so that if one machine dies we’ve got copies of the files someplace else.

For example, the transcoding servers hold most of the lossless audio files so that they’re immediately available for encoding into whatever format is needed. The downloader machines keep the audio files around for quick-start downloading. The webapp machines covet the public files (cover art mostly, in various sizes and formats). But any file can be anywhere. Our zstore code can act like every file is always available on every machine, and if a file isn’t where it’s needed the zstore copies it first behind the scenes. I wrote the zstore as one of the original integral pieces of Bandcamp, way back in ancient history (2008). I’ve studied distributed file systems and implemented a few before. I was amped to write one for Bandcamp.

Why did I write one instead of using an existing DFS? That’s worth its own post. The bottom line is that distributed file systems are still thumb-sucking toddlers, Computer Science speaking. We’re still figuring out how they should work: dealing with connectivity events, ensuring redundancy, providing the most-recent copy of a file when it’s needed, archiving, versioning, etc. These are hard. By focusing on Bandcamp’s specific needs I was able to write a DFS that does just what we need, and it does it well.

In addition to copying files between machines on demand, there’s code that proactively copies files around to anticipate needs and reduce delays. For example when a user uploads cover art for an album it first goes into the zstore on the webapp machine that handled the upload. The other webapp machines then see it and grab copies for themselves.


I’ll take one.

Me too.

I’m in.

Each machine needs to know what files it should proactively copy. I was reworking this code last week and decided an expression evaluator was the bee’s knees for this. I extended the Liquid evaluator with strings and built-in variables and extended comparisons. Now each zstore machine has an expression like this:

morbo:    tag == 'lossless' && (id % 100) >=  0 && (id % 100) <= 47
linda:    tag == 'lossless' && (id % 100) >= 48 && (id % 100) <= 96
calculon: tag == 'lossless' && (id % 100) >= 97 && (id % 100) <= 99
devil:    public == true
santa:    public == true

Whenever a file is added or modified in the zstore each of the machines looks at its expression and decides if it should make a local copy. What the rules above mean is that morbo, linda and calculon all want the lossless audio files, and each claims some percent of them (because no machine is large enough to hold all of them). The terms tag and id are built-in variables that resolve to attributes of the file. Devil and santa want all the files marked public because they’re webapp machines.

(The part of the expression (id % 100) >= 0 && (id % 100) <= 47 essentially partitions the entire set of files into 100 slots based on the file ids. A file id is a large random integer assigned when a file is created; it’s never changed or re-used. The % operator means modulo. I recently learned about a persistent partitioning scheme that uses a much higher number of slots and then distributes groups of slots to particular machines. This has the advantage of creating less havoc when a machine disappears. TODO.)

Doesn’t that look a bit like Filter Factory?
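To make the partitioning concrete, here’s morbo’s rule written as plain Ruby instead of an expression string (a hypothetical rendering; the real machines evaluate the expression text through the extended Liquid evaluator):

# Hypothetical sketch: morbo's claim on the file space as a Ruby lambda.
morbo_rule = lambda do |file|
  slot = file[:id] % 100   # bucket every file into one of 100 slots by id
  file[:tag] == 'lossless' && slot >= 0 && slot <= 47
end

file = { :id => 982451653, :tag => 'lossless', :public => false }
morbo_rule.call(file)   # => false (982451653 % 100 is 53, which linda claims)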

The code for this latest incarnation of Joe’s Recursive Descent Expression Evaluator is less than 200 lines of Ruby. You can take a look at it here:

bc-expression at google code

Start about halfway down at parse_comparison_op. One of the interesting things about a top-down grammar is that it’s easy to implement that way, too. See if you can follow the code from there to the bottom of the source. (By the way, does anyone want to turn this into a gem? TXTMEOK) I find it fascinating when the same Computing concepts—the expression evaluator in this case—come up over and over again in wildly different contexts. I love this about programming.

You may be asking what’s recursive about all of this? Think about parentheses in an expression. If the topmost item in the grammar is an expression, and you descended the grammar through all the valid syntax, at the very bottom you’d come to a term. A term is the simplest unit that has a value. It’s either a constant (42), a variable (id, tag, public), or… well, take a look:

expression:
  boolean

boolean:
  comparison
  comparison || comparison
  comparison && comparison

comparison:
  additive
  additive == additive
  additive != additive
  additive < additive
  additive > additive

additive:
  multiplicative
  multiplicative + multiplicative
  multiplicative - multiplicative

multiplicative:
  term
  term * term
  term / term
  term % term

term:
  constant
  variable
  ( expression )  # Hey!
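For flavor, here’s how a grammar like that turns into code, one method per rule. This is a compressed toy in the spirit of bc-expression, not the actual source:

# A toy recursive descent evaluator for the grammar above. Each grammar
# rule becomes one method; a parenthesized term recurses back to the top.
class TinyEval
  def initialize(source, vars = {})
    # Tokenize into numbers, names, and operators.
    @tokens = source.scan(/\d+|\w+|&&|\|\||[=!]=|[()<>*\/%+-]/)
    @vars = vars
  end

  def evaluate
    expression
  end

  private

  def expression
    boolean
  end

  def boolean
    v = comparison
    while ['&&', '||'].include?(@tokens.first)
      op, rhs = @tokens.shift, comparison   # (no short-circuit in this toy)
      v = op == '&&' ? (v && rhs) : (v || rhs)
    end
    v
  end

  def comparison
    v = additive
    v = v.send(@tokens.shift, additive) while ['==', '!=', '<', '>'].include?(@tokens.first)
    v
  end

  def additive
    v = multiplicative
    v = v.send(@tokens.shift, multiplicative) while ['+', '-'].include?(@tokens.first)
    v
  end

  def multiplicative
    v = term
    v = v.send(@tokens.shift, term) while ['*', '/', '%'].include?(@tokens.first)
    v
  end

  def term
    tok = @tokens.shift
    if tok == '('
      v = expression      # Hey! -- the recursion
      @tokens.shift       # eat the closing ')'
      v
    elsif tok =~ /\A\d+\z/
      tok.to_i            # constant
    else
      @vars.fetch(tok)    # variable
    end
  end
end

TinyEval.new('(id % 100) < 48 && tag == 1', 'id' => 982451653, 'tag' => 1).evaluate
# => false (982451653 % 100 is 53, and 53 isn't < 48)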

How Not To Generate A CRC

2010-01-21

Yesterday we had a success disaster. An überpopular mashup album that mixed the Wu-Tang Clan and The Beatles was released a few days ago and yesterday it got stumbled upon and boing boinged and tweeted to the four winds. Our downloads per minute went through the roof. We’ve dealt with big releases in the past and our download machines typically respond with a big collective yawn. This time around we had problems. Downloads were slow to start and the downloader processes crashed a bunch. We eventually had to make the album streaming-only for a few hours until we figured it out.

I remember first hearing the term “success disaster” from Chuck Geschke, co-founder of Adobe and certified wonderful guy. I’d been there a few years working on Photoshop and Illustrator. This was back when the apps were just a small percent of Adobe’s business. Adobe made most of its money from licensing PostScript and selling laser printer schematics to other companies. The apps were taking off and Adobe was unprepared to deal with a big, sudden increase of flesh-and-blood customers. Phone support was understaffed, production and fulfillment channels were too small. It’s a success, and a disaster. Yay.

Wu & Co’s download success was our download disaster. What’s interesting about what happened yesterday is that the disaster was made a lot worse by a bug in my code, a really stupid, awesome bug. If the bug hadn’t been there the downloaders would’ve hummed through the day with no more than a lazy stretch. The bug’s been in production for at least six months, annoying the downloaders but staying under the radar until now.

“I work for a music publishing company in Australia, we have a number of artists who currently use bandcamp… I’ve been very impressed with the simplicity of design, flawless coding and speed of downloads. I think that you understand the way people consume music, better than most other companies.”

Flawless coding. Yeah.

This is about how not to generate a stable CRC. We have this very cool system in Bandcamp where tracks are turned into the format you’re downloading – mp3, aac, flac, etc. – on demand. If you’re the first person to download that new bandcamp.com/artists track in ogg vorbis, the downloader takes a small detour and makes it first. Once made, it caches it so that future ogg vorbis downloads happen immediately. There’s an aging algorithm in the cache that keeps active tracks around and garbage collects the inactive ones. In this way we’re able to run a site with hundreds of thousands of tracks and millions of downloads a week on way less hardware than you’d think.
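The shape of that detour, as a hypothetical sketch (names invented; the real downloader also has to handle concurrent requests and the aging pass):

require 'fileutils'

CACHE_DIR = '/tmp/bc-cache'   # hypothetical location

def cached_download(track_id, format)
  path = File.join(CACHE_DIR, "#{track_id}.#{format}")
  unless File.exist?(path)
    transcode(track_id, format, path)   # first request for this format pays the cost
  end
  FileUtils.touch(path)   # bump mtime so the aging pass keeps hot files around
  path
end

def transcode(track_id, format, path)
  # Stand-in for the real encoder invocation (lame, oggenc, flac, ...).
  FileUtils.mkdir_p(File.dirname(path))
  File.write(path, "#{track_id} encoded as #{format}")
end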

In addition to generating formats on demand, the downloaders also check to see if the metadata in a track is up-to-date before starting a download. Lots of pieces of info get put into an mp3’s metadata: the track title and artist name, ’course, the track number, release date, album name, album artwork, lyrics (which is totally cool on an iPhone), etc. If any of this info was changed since the last time the track was downloaded, a metadata job is fired off to update the mp3 before the download begins. (We put metadata in all the downloadable formats, but I’ll use mp3s as the example here.)


Lyrics.

Here’s the bug. For every file I store a signature of the metadata in our database. The signature’s a simple CRC32 of all the relevant metadata info. The info tends to be a couple hundred bytes so nothing fancier than a 32-bit integer CRC is necessary. Before every download I gather the current metadata info from the database – the track info, the info about the album the track’s on, the band’s info – generate a CRC and compare it to the CRC in the database. If they’re different, it’s time to update the mp3. And then the new CRC gets stored in the database to compare against in future downloads. Storing a CRC is tiny, efficient and much better than storing a snapshot of all the metadata info itself.
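Sketched out, the pre-download check looks something like this (invented names; the real code reads metadata from our database and fires a job to rewrite the file):

require 'zlib'

def ensure_fresh_metadata(track_id, metadata, stored_crcs)
  crc = Zlib.crc32(Marshal.dump(metadata))
  return :up_to_date if stored_crcs[track_id] == crc
  # CRC differs: a metadata job rewrites the mp3's ID3 frames, then we
  # remember the new CRC for future downloads.
  stored_crcs[track_id] = crc
  :metadata_job_fired
end

stored_crcs = {}
meta = [['TIT2', { :encoding => 1, :text => 'Do I Move You?' }]]
ensure_fresh_metadata(42, meta, stored_crcs)   # => :metadata_job_fired
ensure_fresh_metadata(42, meta, stored_crcs)   # => :up_to_date (in this process, anyway...)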

But darlin’. The CRC really needs to be stable in order to be useful. One’s CRC algorithm (not mentioning names) is useless if it generates a different number for the same source data. I call the CRC routine today and I get X. Tomorrow I get Y and X != Y even though the metadata hasn’t changed. That’s worse than useless because we’re depending on the CRC to reduce the number of times we have to update the metadata of every track that’s downloaded. If we get this wrong, we may underestimate the number of machines we need in our system dedicated to doing the metadata jobs. They may be underpowered for a large rush of downloads. They may get backed up and downloads may pile up and time out. And then Ethan may have to spend 6 1/2 hours answering support emails. Just sayin’.

Here’s how I did it. I have a routine that collects all the metadata used in an mp3 and puts it into an array of arrays, something like this:

[
['TIT2', {:encoding => 1, :text => 'Do I Move You?'}],
['TPE1', {:encoding => 1, :text => 'Nina Simone'}],
['USLT', {:encoding => 1, :text => '...', :lang => 'eng'}],
...
]

The first element of each array, for example ‘TIT2’, is the official ID3 metadata key (see www.id3.org/id3v2.3.0) and the second element is the metadata itself, for example the track’s title. This array of arrays is handed to a routine that does the (ridiculously hard) work of updating an mp3’s innards with the new metadata.

This blob of track data is also given directly to the CRC routine. Makes sense, right? The CRC is dependent on all the metadata that’s going into the mp3, so just use the same data blob that’s actually used to set the metadata. I simply run a CRC32 algorithm right on the array:

return Zlib.crc32(Marshal.dump(metadata_array))

The stoopid stoopid stoopid part that I’m embarrassed I overlooked is that hashes are not guaranteed to be read in the same order that they are created. If you do this:

h = {:a => 1, :b => 2}
puts h.inspect

You might get this:

{:a => 1, :b => 2}

or this:

{:b => 2, :a => 1}

Depending on the process’s mood. It’s not a mistake. Hashes are unordered sets of data. It almost never matters what order the keys are in; if you want the value of :a you’ll always get 1 and with :b you’ll always get 2. If you want an ordered set then you use an array. Duh. In practice the order of items in a hash will actually be the same as long as the program is running. But if you stop the program and run it again, you might get a different order. (I haven’t looked at the implementation but I imagine it has to do with the values of object ids. Ruby 1.9, for what it’s worth, changes this: its hashes remember insertion order.)

So in other words, because the metadata blob given to the CRC routine contains hashes, and because the order of items in a hash is not fixed, the generated CRC will be different from time to time. Run the program once you get X. Run it again, the exact same code, you get Y.

Our downloader processes have great uptime. They’re rarely restarted. So for a given downloader the CRC will be stable. What’s humorous about what happened yesterday is that the downloads were being handled simultaneously by the two downloader machines morbo and linda. Linda would get a request for the Wu’s album, calculate CRCs for every track (and there are 27 of them), update the mp3s, cache those files and save the new CRCs in the database. Subsequent download requests to linda would match and the cached files would be used. Meanwhile morbo was doing the same thing, coming up with its own CRCs, updating the mp3s again, re-caching the files and updating the database. Basically morbo and linda were fighting over the “correct” CRCs, disagreeing in their silicon way, and generating tons of metadata update jobs and file I/O. Oh linda, how cute. You got the CRC wrong. I’ll fix it for you. Morbo you zoo creature! Take that! 27 tracks times mp3-320, mp3-v0, mp3-v2, etc. times thousands of downloads per hour. The metadata jobs backed up and the cache disk write I/O went to hell. Downloads failed. Ethan emailed.

Of course when I first saw these thousands of metadata jobs fly by in the logs my honed engineering intuition led me to blame someone else. I figured the artist was sitting at his site madly tweaking the release date in some Ambien ADHD fit. We actually sent him an email saying Lay off the edits! In retrospect I had seen clues of this in logs over the months. I remember noticing that one track had been updated more than a thousand times and thinking what a dick.

I had the fix for this in half an hour. Five lines of code that temporarily converted the hashes to arrays and sorted them by key, ensuring that the order was predictable and stable across all processes.

[
['TIT2', [[:encoding, 1], [:text, 'Do I Move You?']] ],
['TPE1', [[:encoding, 1], [:text, 'Nina Simone']] ],
['USLT', [[:encoding, 1], [:lang, 'eng'], [:text, '...']] ],
...
]

(Note how :lang now comes before :text.)
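In outline the fix looks something like this (a sketch with invented names; the actual five lines live in our downloader code):

require 'zlib'

# Recursively turn every hash into an array of [key, value] pairs sorted
# by key, so the marshaled bytes -- and therefore the CRC -- are stable
# no matter what order the hash hands its entries back.
def stabilize(obj)
  case obj
  when Hash  then obj.map { |k, v| [k, stabilize(v)] }.sort_by { |pair| pair[0].to_s }
  when Array then obj.map { |e| stabilize(e) }
  else obj
  end
end

def metadata_crc(metadata_array)
  Zlib.crc32(Marshal.dump(stabilize(metadata_array)))
end

Now every process, on every machine, computes the same CRC for the same metadata.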

We pushed this fix to our downloaders a few hours after the success disaster hit and later re-enabled downloads for Wu vs. B. Everything’s humming along fine. When we restart a downloader we have to wait until all of its current downloads dry up. In a glorious final middle finger to me for my grievous sins, linda happened to be in the middle of incorrectly calculating the CRCs and updating the metadata for every track of a fucking song a day for an entire year album. Yeah, you want me to restart? FU. (Righteous album, dubs.)