Episode #8

Error Tracking and Monitoring: Part II

Continuing on from last week's episode, I talk about how: 1) Being on top of your exceptions can wow customers and clients. 2) The dangers of being inundated with too many exceptions and how to make the numbers manageable with filters. 3) How to connect errors in production to the actual humans you might need to apologize to. 4) The need to provision error tracking in ALL of your microservices — every component needs it. 5) The need to scrub sensitive data before potentially sending it to a 3rd party error-reporting platform. 6) The value of connecting error reporting to your deploys.

August 23, 2020

Show Notes

No notes available for this episode.

Screencast.txt

Transcribed by Rugo Obi

It’s one thing just to have exception reporting in your software, but it’s another one altogether to really use it well.

That includes things like filtering out rubbish noisy errors, scrubbing sensitive data, and so on.

Watch this episode to find out more.

A general point I'd like to make is that being on top of your exceptions makes you look like you have superpowers. I remember, one time, an important client was trialing some software and they hit an error. We were monitoring the exception reports, saw the error, and proactively fixed it, then sent an SMS about three minutes later saying to try again and everything worked. That's responsiveness.

Be warned: if you've been running some production code for a while without any exception notifications, then as soon as you switch them on, you're going to be inundated with errors that first time, assuming you have any non-negligible number of users.

Some of these errors will be real. Real latent errors, ones that you just weren't aware of before. But many of them will just be noise caused by bots or browser extensions or ancient browsers. That kind of thing.

Too much noise in exception reporting is actually a serious problem. You don't want too many false positives. This is the programmer equivalent of the boy who cried wolf effect: that explains the picture here. You end up seeing too many errors that aren't actually serious, and therefore you stop paying attention to your exception reports. This will cause you to eventually miss that one critical error that you really should have seen, the one that ends up annihilating your system. Therefore, some effort is required to prune what kind of errors get reported. That's the topic coming up.

We're back inside my rollbar.js file right now. Highlighted, you see a function shouldIgnore(). This has some logic that tells Rollbar whether or not an exception is worth reporting. It basically calls two other functions. It calls the isBot function, which, as you'd expect, checks if the user agent is a bot. I get this functionality from a third party library you see up here, isbot. I also have another function, unsupportedBrowser. This checks if the browser is internetExplorerUnderVersion10, which I categorically do not support, and I do not want to receive constant exceptions about some sort of browser functionality not being available there. This shouldIgnore function is then passed to the Rollbar config.

Regrettably - and I'm not really sure why this is a thing, but it seems to be a problem with every single one of these exception notification services - you get a lot of noise via browser extensions and third party libraries, that kind of thing. For example, here I've got an AbortError: The play() request was interrupted by a call to pause() and then a link to Google. Okay, whatever. This has happened five times affecting seven IP addresses.

And if I scroll down here... no, I need to go into an individual occurrence actually because I've grouped them together. Okay, if I scroll down here, there's no back trace, wonderful. And then I get the Telemetry, which is all to do with Wistia: a third party service that provides my video plugins. None of this has anything to do with my code, so it's not really my problem in this case, and you can see it goes on and on, look at that.

In this case, I want nothing to do with this error, it doesn't happen that often. I looked it up online, by the way, and it's due to a race condition between someone pressing play and someone pressing pause. They must really dislike my stuff.

Anyway, in order to get rid of this error, in the sense that it is no longer reported to me, I'm going to use the mute functionality. Now, whenever it happens again, I just won't know about it, which I think is acceptable in this case.

Returning to the Rollbar configuration in my backend Ruby code, the bit of code I've got highlighted is a set of Ruby level exceptions that I do not want to report.

The first one is ActiveRecord::RecordNotFound, that's basically when you try to find a database record and it's not there. Normally it throws an exception that leads to a 404 or whatever. This happens pretty often on a website like mine with massive amounts of user generated content, therefore I choose to just ignore it. Ditto for routing errors. I'd rather my exception reporting doesn't get filled up with this kind of stuff that's not super important to me.
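The ignore-list idea can be sketched in plain Ruby. This is only an illustration of the filtering logic, not Rollbar's own API: the constant IGNORED_EXCEPTIONS and the method reportable? are hypothetical names, and the two Rails exception classes are stubbed so that the snippet runs without Rails.

```ruby
# Hypothetical stand-ins for the Rails classes so the sketch runs on its own.
module ActiveRecord
  class RecordNotFound < StandardError; end
end

module ActionController
  class RoutingError < StandardError; end
end

# Names of exception classes that should never reach the error tracker.
IGNORED_EXCEPTIONS = %w[
  ActiveRecord::RecordNotFound
  ActionController::RoutingError
].freeze

# True when neither the exception's class nor any of its ancestors
# appears on the ignore list.
def reportable?(exception)
  ancestor_names = exception.class.ancestors.map(&:name)
  (ancestor_names & IGNORED_EXCEPTIONS).empty?
end
```

Checking ancestors rather than only the exact class means a subclass of an ignored exception is ignored too, which is usually what you want for this kind of noise.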

Another extremely useful convenience available in these exception monitoring tools is the ability to connect the current user of the request, if a current user exists, to that exception.

This means you'll be able to see who is affected by an error, even if that particular exception had nothing to do with the POST parameters having included an email, or your custom exception bubbling up that email. This is very useful for any kind of errors that happen within a normal HTTP route/controller kind of request cycle.

It doesn't work for background jobs, therefore I needed to bubble up my current_user in the payout exception you saw earlier.

So the current_user in my case is just a method that grabs the current_user based on a cookie, nothing really fancy there.
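As a rough sketch of what "connecting the user" amounts to, the error payload simply gains a person section whenever a user is known. The User struct, the build_error_payload method, and the payload keys below are assumptions made for illustration; they are not the format any particular service requires.

```ruby
# Hypothetical user record; a real app would use its own model.
User = Struct.new(:id, :email)

# Builds the data sent to the error tracker. The :person key is only
# added when a current user is known, e.g. inside a normal web request.
def build_error_payload(exception, current_user: nil)
  payload = {
    class_name: exception.class.name,
    message: exception.message
  }
  if current_user
    payload[:person] = { id: current_user.id, email: current_user.email }
  end
  payload
end
```

In a background job there is no request and no cookie, so current_user would have to be passed in explicitly, which matches the workaround described above.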

One thing I've learned over the past decade of maintaining production software is that you need to be thorough about where you report exceptions from. They don't just automatically get reported from whatever software you have. You need to manually configure it and make sure it's configured in a sensible way for that particular platform.

For example, I use a background kind of queue in my Rails app. It's database driven, though it might be Redis-driven in another piece of software. The important thing is that there’s a queue. I need the exceptions from that queue to be reported because very important work gets done there. And in some systems, those exceptions just disappear without being reported.

Another thing you want is to avoid reporting exceptions the first time a background job fails. A lot of the work that I put in my background job queue consists of things that are likely to fail for transient reasons. For example, HTTP requests, sending emails, talking to some external API that might have downtime. Therefore I sort of expect them to fail once or twice and then I try again in a few minutes. I only want the exception to be reported to Rollbar after my job queue has tried the requisite number of times, otherwise I'm kind of getting false alarms for errors that have since been fixed.

That's what this line of code does here: it sets the threshold of reporting to whatever the max attempts of my worker is, which is 3 in the case of this software.

config.dj_threshold = Delayed::Worker.max_attempts
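The behaviour that this threshold gives you can be sketched without any job queue at all. The method run_with_retries and its reporter argument are hypothetical names used only for illustration, and unlike a real queue, this sketch retries immediately instead of waiting a few minutes between attempts.

```ruby
MAX_ATTEMPTS = 3

# Runs the block up to max_attempts times. The reporter is only called
# once every attempt has failed, so transient failures stay silent.
def run_with_retries(max_attempts: MAX_ATTEMPTS, reporter:)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue StandardError => e
    retry if attempts < max_attempts
    reporter.call(e, attempts)
    nil
  end
end
```

A job that fails twice and then succeeds never triggers the reporter; a job that fails all three times triggers it exactly once, on the final attempt.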

Another aspect of thorough coverage is to ensure that each of your micro-services, if you use such things, also has error reporting. A quick aside: It is the need to anticipate and provision all this kind of side instrumentation and infrastructure that often makes micro-services pretty damn inefficient for small teams. This is something I learned the hard way at least.

So here I have a micro-service I use in order to carry out very fragile conversions into different kinds of document formats, for example PDF, image and so on.

It uses LibreOffice under the hood, and when I first created Oxbridge Notes, this particular binary was prone to crashing, so I was unable to put it on the same server as my primary Oxbridge Notes Rails app. It's probably different today, but I've built this and there's no point in changing it. Anyway, I didn't have any exception reporting here, so obviously enough, I didn't hear about any of my errors. I had to include Rollbar here as well, with a little bit of configuration that you can see I've just pasted in.

Another thing you've got to be aware of when you're doing exception reporting is sensitive data. All these tools filter the basic kind of fields that might contain sensitive data, such as the password field when you log in. But if you're using custom fields, like I was in this crypto project I once worked on, you want to make sure that you're scrubbing things like the private key from someone's wallet or some sort of API token, credit card numbers, all that kind of stuff. They can't read your mind, and you really don't want to be sending a third party company your sensitive information, even if they seem trustworthy.

That’s what this scrub fields config variable is for:

config.scrub_fields |= %i[private_key api_token credit_card_number ccv]

It's to list all the fields that you do not want transmitted over to the third party.
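To make the effect of a scrub list concrete, here is a minimal sketch of scrubbing applied to a nested params hash. It is not Rollbar's implementation: the scrub method, the MASK value, and the SCRUB_FIELDS constant are assumptions made purely for illustration.

```ruby
# Field names whose values must never leave the app (illustrative list).
SCRUB_FIELDS = %i[password private_key api_token credit_card_number ccv].freeze
MASK = "******"

# Returns a copy of the data with every listed field masked, walking
# through nested hashes and arrays. Keys may be strings or symbols.
def scrub(value, fields = SCRUB_FIELDS)
  case value
  when Hash
    value.each_with_object({}) do |(key, inner), clean|
      clean[key] = fields.include?(key.to_s.to_sym) ? MASK : scrub(inner, fields)
    end
  when Array
    value.map { |item| scrub(item, fields) }
  else
    value
  end
end
```

The sketch builds a new structure instead of mutating its input, so the original params stay intact for the rest of the request.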

Something else you will probably want to do in your exception monitoring platform is connect it to your deploy system, in the sense that you want to be able to see every single one of your deploys here. And this can help you to distinguish errors that are likely to be transient and over because you fixed them in a previous deploy, versus recurrences. You can see I had this error here, and then I had it again after deploying, which indicates that my attempts to fix it with this particular deploy were unsuccessful.
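The comparison being made by eye in the dashboard boils down to checking whether any occurrence of an error is timestamped later than the deploy that was meant to fix it. A small sketch, where the Deploy struct, the recurred_after? method, and the revision string are hypothetical:

```ruby
# Hypothetical record of a deploy: which revision went out, and when.
Deploy = Struct.new(:revision, :deployed_at)

# True when the error was seen again after the given deploy, which
# suggests that the fix shipped in that deploy did not work.
def recurred_after?(deploy, occurrence_times)
  occurrence_times.any? { |seen_at| seen_at > deploy.deployed_at }
end

fix = Deploy.new("abc123", Time.utc(2020, 8, 20, 12, 0, 0))
before_fix = Time.utc(2020, 8, 20, 11, 0, 0)
after_fix = Time.utc(2020, 8, 20, 13, 0, 0)

puts recurred_after?(fix, [before_fix])            # false: probably fixed
puts recurred_after?(fix, [before_fix, after_fix]) # true: the fix failed
```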

That's all for today, see you next Sunday.