Episode #10

Why You Need Downtime Notifiers

In this episode, I explain why you need downtime notifiers in addition to regular error-tracking systems. Briefly: because regular error-reporting systems fail to even boot when the error is serious enough. Also: total failure demands that you be reached via a more urgent channel, e.g. a phone call or SMS. Next I describe a common blind spot that will remain undetected even when you have a downtime notifier in place: cron jobs that fail to run. However, there is an elegant solution to this problem, in the form of heartbeat monitoring.

September 06, 2020

Show Notes

No notes available for this episode.

Screencast.txt

Transcribed by Rugo Obi

Sometimes your production system goes down completely.

During these times of total failure, the 'normal' systems for alerting you of errors, like exception reporters and so on, will fail to even load.

So, ironically, the more serious the error, the less likely you are to know about it.

This creates the need for downtime monitoring systems that exist outside of your system and are therefore immune to its crashing. And due to the seriousness of a total failure, these downtime monitoring systems need to contact you via a higher-priority medium. For example, they should call or SMS you instead of sending you an email or Slack message you might not see until tomorrow morning.

--

Previously, we covered exception notifiers. I showed you how I use libraries like Rollbar and Sentry in my Ruby or JavaScript code to report exceptions when they happen.
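(As a quick refresher, reporting with one of these libraries usually takes only a few lines. Here's a minimal sketch using the Rollbar Ruby gem; the access token and the charge_customer! method are placeholders for illustration, not code from my actual app.)

    require "rollbar"

    # Configure once at boot. The token below is a placeholder.
    Rollbar.configure do |config|
      config.access_token = "YOUR_ROLLBAR_ACCESS_TOKEN"
    end

    begin
      charge_customer! # hypothetical method that might raise
    rescue => e
      Rollbar.error(e) # report the exception to Rollbar...
      raise            # ...then let it propagate as usual
    end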

Unfortunately, this isn't enough for production systems.

The problem is that if your exception notifier is a Ruby, PHP, or JavaScript library, it depends on those runtimes, i.e. Ruby or JavaScript, being able to run correctly.

If, for some reason, they don't boot, or they boot and then crash immediately, or the network is down, then there is no way for you to be aware of the exceptions.

This is depicted in this picture over here.

Therefore, exception reporting measures are blind to a whole class of fatal errors, and ironically, these are the ones that are probably the most important to keep track of.

Therefore, you must use a system that exists outside of your software, in order to monitor the uptime of your entire system.

The typical solution is a tool that pings a page on your website, say, every five minutes. Maybe it's just the home page; that's a good start if you're running a monolith like I am.

And if your website doesn't respond within x seconds after y retries, then the system will send you an SMS or call you about the production emergency.
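(To make the mechanism concrete, here is a toy sketch of such a checker in Ruby. This is my illustration of the general idea, not DownNotifier's actual internals; the URL, timeout, and retry count are made up.)

    require "net/http"

    TARGET  = URI("https://www.example.com/") # e.g. your home page
    TIMEOUT = 10                              # the "x seconds"
    RETRIES = 3                               # the "y retries"

    def up?(uri, timeout)
      response = Net::HTTP.start(uri.host, uri.port,
                                 use_ssl: uri.scheme == "https",
                                 open_timeout: timeout,
                                 read_timeout: timeout) do |http|
        http.get(uri.request_uri)
      end
      response.is_a?(Net::HTTPSuccess)
    rescue StandardError # timeouts, refused connections, DNS failures...
      false
    end

    def send_sms_alert(message)
      # Stand-in: a real monitor would call an SMS or voice API here.
      warn "ALERT: #{message}"
    end

    # Only raise the alarm once every retry has failed.
    send_sms_alert("#{TARGET} is down") if RETRIES.times.none? { up?(TARGET, TIMEOUT) }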

Personally, I use a service called DownNotifier. I haven't exhaustively checked which one is best, but this one serves my needs and seems to work pretty well.

These are all the websites that I currently monitor. I'm checking them once every minute; they are all currently up, and the uptime for all of them was 99.99%, except my personal blog. I had an incident there, it seems. Let's click into it and take a further look.

Okay, I'm going to go into downtimes here, and we can see that I had a massive 52-hour, 54-minute downtime, with HTTP error 503. They emailed me the particular response at the time. This is what the email they sent me looks like: "Error found for http://www.jackkinsella.ie", blah blah blah.

I believe the issue at the time was that I hadn't upgraded my blog in years and it was no longer supported by the platform. I can't really remember.

What DownNotifier also did is send me an SMS. I might not be checking my email all the time, but I'm much more likely to respond to an SMS.

And with other kinds of websites, for example Oxbridge Notes, it would have been a genuine emergency if the website had gone down. My personal blog? Not such a big deal.

So far so good.

But the software processes that the user interacts with, i.e. the website, the app, the desktop GUI, are not all there is. Lots of software runs in background queues, or on a schedule as cron jobs or the equivalent.

For example, imagine you run a CryptoIndex fund. If business is going well, customers are constantly transferring Ethereum and what-not to your servers.

In order for you to acknowledge this receipt, update the user interfaces, etc., you need to connect to the blockchain and get the latest info.

On a less frequent schedule, you'll also need to rebalance portfolios, say, every week, month or whatever.

This is a project I once actually embarked on. You'll see that the contributor here is me.

What I want to show you now is a slightly simplified rendition of what happens in the cron scheduler in this particular piece of software.

So, every 10 minutes, the following tasks are run: I look for incoming_deposits on the Ethereum blockchain, I update all my crypto_prices... (why, there's a little typo there). And then I refresh the database views; that's not important, just a performance thing.

Then hourly, I reallocate portfolios: that's the actual indexing. And then daily I send out some email marketing messages.
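(For a sense of how this reads in code, here's a simplified sketch in the style of the Ruby whenever gem's schedule DSL. The task names are my paraphrase of what's on screen, not the project's exact code.)

    # config/schedule.rb -- simplified sketch; task names are illustrative.
    every 10.minutes do
      rake "blockchain:check_incoming_deposits" # acknowledge new Ethereum deposits
      rake "prices:update_crypto_prices"        # pull the latest market prices
      rake "db:refresh_materialized_views"      # just a performance thing
    end

    every 1.hour do
      rake "portfolios:reallocate" # the actual indexing/rebalancing
    end

    every 1.day do
      rake "marketing:send_scheduled_emails"
    end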

Some of this, especially this part, is high-responsibility stuff.

If I promise a customer I'll rebalance their portfolio on a certain schedule, but my system fails to do so, then I've broken a promise.

If there's a large move in the market prices, then their investments will have diverged from the ideal of the index. This may have legal consequences for breach of fiduciary duty.

Based on the monitoring tools we've seen so far, exception notifiers and downtime notifiers, I'll know about errors that happen when background jobs run, i.e. errors that occur during the execution of these sorts of cron jobs.

But what happens if some of these don't run at all?

What happens if the system that's supposed to schedule them is down, so that no code runs at all? No exception could possibly be raised.

What's more, the downtime notifier I showed you so far, the one that works by visiting my homepage, can't possibly work here, because there is no homepage to a cron job. It's something that just exists inside the machine.

So, the answer is rather elegant: basically, it inverts regular downtime monitoring. Instead of some external system pinging your server every 10 minutes, your code is responsible for pinging some external system.

This system expects to hear some sort of regular "heartbeat" from your servers. And if a beat is missed, give or take a small grace period, it raises the alarm.

I use a tool called Dead Man’s Snitch to do this. Let me show you the absolute minimum code to make this work. That's it.

All that happens is that I curl some external URL whenever my background processing is done, as a way to indicate that it actually ran.
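(Here is a minimal sketch of that check-in in Ruby. The snitch URL is a placeholder, since Dead Man's Snitch issues a unique check-in URL per snitch, and the job names are hypothetical. In a plain crontab, the common equivalent is just appending the curl to the scheduled command with &&, so it only fires if the job succeeded.)

    require "net/http"

    # Placeholder: each Dead Man's Snitch monitor has its own unique URL.
    SNITCH_URL = URI("https://nosnch.in/your_snitch_token")

    def run_scheduled_jobs
      check_incoming_deposits # hypothetical job names, for illustration
      update_crypto_prices
    end

    run_scheduled_jobs

    # Check in only after the work succeeds: if anything above crashes,
    # no heartbeat is sent, and a missed beat (plus the grace period)
    # is exactly what triggers the alarm.
    Net::HTTP.get(SNITCH_URL)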

Here's what the user interface looks like.

I don't actually use this in Oxbridge Notes at the moment because there’s nothing too important in the cron job.

I just added it for the purpose of this video, and I also deleted it for a while so you can see what the incidents look like.

And you can also see here that it tells you how long the signal went unsent for, and when the last check-in was; that's four minutes ago. That's the curl request you just saw there, etc. etc.

--

Okay, so I have an update for you.

Earlier in the video, which I recorded a couple of weeks ago, I said that there's nothing too important in my cron jobs, and that I therefore don't use Dead Man's Snitch on them.

It turns out there was a serious error in my cron jobs causing all my marketing emails not to send, perhaps for the last couple of months. It's pretty shameful.

And I only realized this after adding in this kind of monitoring. Fixing it caused an immediate bump in revenue, quite a significant one.

And it goes to show that you really should be tracking those things; I had made a mistake by underestimating the importance of that sort of tracking.