Bandcamp was offline briefly yesterday due to what I like to call an unexpected single point of failure. Good systems design is all about addressing single points of failure, making sure you have redundancies in place, but sometimes you discover single points of failure that you didn’t realise you had.
Yesterday’s problem was caused by maintenance on our central rsyslog server, which we use to collect analytics from our application servers. When that central server went down, it set a chain of events in motion:
- Remote logging from our app servers blocked, since we have rsyslog configured to use TCP, which attempts to guarantee delivery.
- Those stalled messages blocked all syslog logging on the app servers, since the default rsyslog configuration funnels all logging into a single delivery queue.
- Within minutes that delivery queue filled up, causing all subsequent logging requests to block, freezing not just our apps but also system services like sshd. So, no logging in.
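For reference, the forwarding rule at the heart of this is a single line — the double `@` selects TCP delivery — and with no queue directives alongside it, rsyslog runs the action through its one main message queue:

```
# TCP forwarding (the "@@" prefix); without any queue directives,
# this action shares rsyslog's single in-memory delivery queue,
# so a stalled remote server eventually blocks all local logging too
local0.* @@rsyslogserver:10514
```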
In the course of responding to the outage we quickly decided it was prudent to reboot the affected servers before continuing to investigate the root cause. We didn’t know at the time that restarting our apps had started the countdown clock ticking again. Fortunately we got to the bottom of the problem before the servers froze up a second time.
Lesson learned! Our rsyslog configuration now uses a dedicated queue for remote logging, and that queue spills over to disk if it fills up, preventing rsyslog from blocking logging if the central server goes offline. Here’s the relevant configuration:
$ActionQueueType LinkedList
$ActionQueueFileName apptimer
$ActionResumeRetryCount -1
$ActionQueueSaveOnShutdown on
local0.* @@rsyslogserver:10514
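One caveat worth noting (the values below are illustrative, not necessarily what we run): setting `$ActionQueueFileName` makes the queue disk-assisted, and its spool files are written under rsyslog’s work directory, which needs to exist and be writable by rsyslog. You can also cap how much disk the queue may consume, so a prolonged outage of the central server can’t fill the filesystem:

```
# where the queue spool files (apptimer.*) are written
$WorkDirectory /var/spool/rsyslog
# cap the on-disk queue so a long outage can't fill the disk
$ActionQueueMaxDiskSpace 1g
```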
Leigh Dyer is the Lead Engineer of the Systems Team at Bandcamp, Inc.