Be careful how you rsyslog

Bandcamp was offline briefly yesterday due to what I like to call an unexpected single point of failure. Good systems design is all about addressing single points of failure, making sure you have redundancies in place, but sometimes you discover single points of failure that you didn’t realise you had.

Yesterday’s problem was caused by maintenance on our central rsyslog server, which we use to collect analytics from our application servers. When that central server went down, it set a chain of events in motion:

  • Remote logging from our app servers blocked, since we have rsyslog configured to use TCP, which attempts to guarantee delivery.
  • Those blocked messages blocked all syslog logging on the app servers, since the default rsyslog configuration puts all logging into a single delivery queue.
  • Within minutes that delivery queue filled up, causing all subsequent logging requests to block, freezing not just our apps but also system services like sshd. So, no logging in.
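
The chain above comes down to a bounded queue whose only consumer has stalled. As a rough illustration only (a toy Python model, not rsyslog's actual implementation; the queue size and the names `main_queue` and `log` are made up for the example), a caller doing a blocking enqueue gets stuck as soon as the shared queue is full, even if its message has nothing to do with remote logging:

```python
import queue

# Toy model: one small shared queue for every log message (size is arbitrary).
main_queue = queue.Queue(maxsize=3)


def log(message, timeout=0.1):
    """Enqueue a message; return False if the caller would have been stuck.

    A truly blocking enqueue would wait forever. The short timeout is only
    here so the demonstration terminates.
    """
    try:
        main_queue.put(message, timeout=timeout)
        return True
    except queue.Full:
        return False


# Nothing drains the queue, standing in for an unreachable central server.
results = [log(f"app message {i}") for i in range(3)]
# An unrelated service shares the same queue, so it gets stuck too.
results.append(log("sshd: login attempt"))
print(results)  # [True, True, True, False]
```

In this model the fourth caller is the one that would hang, which is the position sshd ended up in.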

In the course of responding to the outage we quickly decided it was prudent to reboot the affected servers before continuing to investigate the root cause. What we didn’t know at the time was that restarting our apps also restarted the countdown, since the delivery queue would simply fill up again. Fortunately we got to the bottom of the problem before the servers froze up a second time.

Lesson learned! Our rsyslog configuration now uses a dedicated queue for remote logging, and that queue spills over to disk if it fills up, preventing rsyslog from blocking logging if the central server goes offline. Here’s the relevant code:

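# Give the forwarding action below its own in-memory queue
$ActionQueueType LinkedList
# Naming a queue file makes the queue disk-assisted (it spills to disk when full)
$ActionQueueFileName apptimer
# Keep retrying forever while the central server is unreachable
$ActionResumeRetryCount -1
# Write queued messages to disk when rsyslog shuts down
$ActionQueueSaveOnShutdown on
# Forward local0 messages over TCP (@@) to the central server
local0.* @@rsyslogserver:10514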
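
As a mental model of what "spills over to disk" means, here is a minimal Python sketch. It is an illustration under assumptions, not how rsyslog actually stores its queue files: the `SpillQueue` class, the memory limit of 2, and the spill file name are all hypothetical. The point is that the enqueue call never makes the caller wait.

```python
import collections
import os
import tempfile


class SpillQueue:
    """Toy disk-assisted queue (hypothetical, not rsyslog's on-disk format)."""

    def __init__(self, memory_limit, spill_path):
        self.memory = collections.deque()
        self.memory_limit = memory_limit
        self.spill_path = spill_path

    def enqueue(self, message):
        # Never blocks: once memory is full, messages are appended to a file.
        if len(self.memory) < self.memory_limit:
            self.memory.append(message)
            return "memory"
        with open(self.spill_path, "a", encoding="utf-8") as spill_file:
            spill_file.write(message + "\n")
        return "disk"


spill_path = os.path.join(tempfile.mkdtemp(), "apptimer.spill")
remote_queue = SpillQueue(memory_limit=2, spill_path=spill_path)
placements = [remote_queue.enqueue(f"msg {i}") for i in range(4)]
print(placements)  # ['memory', 'memory', 'disk', 'disk']
```

The sketch leaves out dequeuing and ordering between memory and disk, which a real queue has to handle.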

 

 Leigh Dyer is the Lead Engineer of the Systems Team at Bandcamp, Inc.

One Response to “Be careful how you rsyslog”

  1. Elmer Fud Says:

    Your queue will still block if the disk fills up or is somehow unavailable for writing. You might also want to consider adding the following two options.

    $ActionQueueMaxDiskSpace 512m # limit amount of disk space used
    $ActionQueueTimeoutEnqueue 0 # Drop messages with no wait when they can’t be queued
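
    The trade-off behind an enqueue timeout of zero can be sketched in a few lines of Python. This is a toy model with assumed names (`remote_queue`, `enqueue_or_drop`) and an arbitrary queue size, not rsyslog code: when the queue is full, the message is discarded immediately so the caller never waits.

```python
import queue

# Toy model of a full remote-logging queue (size chosen arbitrarily).
remote_queue = queue.Queue(maxsize=2)


def enqueue_or_drop(message):
    """Try to enqueue without waiting; report whether the message was kept."""
    try:
        remote_queue.put_nowait(message)
        return True
    except queue.Full:
        # The message is lost, but the caller carries on immediately.
        return False


results = [enqueue_or_drop(f"msg {i}") for i in range(5)]
dropped = results.count(False)
print(results, dropped)  # [True, True, False, False, False] 3
```

    Losing some log lines this way is the price of keeping the application and system services responsive.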
