Episode #7

Error Tracking and Monitoring: Part I

In war, no plan survives contact with the enemy. In tech, no code survives contact with real users. In this episode, I discuss systems to monitor and stay alerted about errors in production. By watching me debug a few of my exceptions, you'll learn: 1) How these tools can integrate back-traces of the original source code, rather than any minified or transpiled versions. 2) How to configure your error reporting to include not just back-end errors, but also errors in the front-end and background queues. 3) How these tools can speed up back-end debugging by showing the SQL queries prior to the error. 4) How these tools speed up front-end debugging by using telemetry to list the browser events leading up to the error. 5) How you can use the info persisted in the error tracking tool to salvage mission-critical data you would otherwise have lost.

August 16, 2020

Show Notes

No notes available for this episode.

Screencast.txt

Transcribed by Rugo Obi

If you're running a software company, it's a really good idea to monitor the exceptions and errors happening on your users’ machines or browsers.

This episode is an introduction to how to get the most out of these sorts of tools.

--

In the 1800s, a Prussian field marshal by the name of Moltke the Elder said that "no plan survives contact with the enemy".

He might as well have been a software developer because as far as I'm concerned, no code survives contact with the real world, especially the free-for-all that is the modern web browser environment. The assumptions you made in development will be proven woefully naïve. The real test is whether you can adapt quickly enough to the demands made against your codebase in a production environment.

Over the next few episodes, I'm going to touch on what I see as the five pillars of a responsive approach to production issues.

One is 'Exception Reporting', the topic of this episode, and then there's 'Logging', 'Downtime Monitoring', 'System Monitoring', and 'Financial Tracking'.

The first piece of the puzzle is exception reporting and monitoring.

Basically, whenever there is an exception in your code in production, back-end or front-end, you need to somehow transmit that exception to your team, or to yourself, so you can fix it.

It's also a bonus if you can get all the surrounding execution context and so on.

It's perfectly possible to roll your own and send the exceptions via email or Slack to yourself, but like most people, I prefer to use a tool.
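
As a rough sketch of the roll-your-own approach (not code from this episode), the idea is to catch the exception, bundle it with some context, hand it to whatever delivers the message, and then let the exception continue on its way. The helper name `report_exceptions`, the `notifier` callable and the payload fields are all hypothetical; a real notifier might post to Slack or send an email.

```ruby
require "json"
require "time"

# Hypothetical helper: runs the block, reports any exception, then re-raises.
# `notifier` is any callable that accepts a JSON string.
def report_exceptions(notifier:, context: {})
  yield
rescue StandardError => e
  payload = {
    error_class: e.class.name,
    message: e.message,
    backtrace: (e.backtrace || []).first(10), # keep the report short
    context: context,
    occurred_at: Time.now.utc.iso8601
  }
  notifier.call(JSON.generate(payload))
  raise # the original exception still propagates
end

# Usage: this stand-in notifier just writes the report to stderr.
begin
  report_exceptions(notifier: ->(body) { warn body }, context: { user_id: 42 }) do
    raise ArgumentError, "something broke in production"
  end
rescue ArgumentError
  # Handled here only so the example script keeps running.
end
```

What a dedicated tool adds on top of this is the grouping, the UI and the integrations.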

You can see the most popular current tools here. Sentry’s number one, TrackJS, Bugsnag, Rollbar, etc, etc.

The advantage of using a tool like this is that you have a really handy UI for tracking and grouping all your exceptions. And they have great integrations with Git and so on. I'll show you a bit more later.

By the way, even though many of these are paid offerings, I've personally been able to get by for years with their free packages. Most of these companies offer very generous freebies.

Right now you're inside my Rollbar installation.

I've already redacted the emails and converted them to semicolonandsons@x.com from the original emails that my users might have had.

I want to show you briefly how I did that in the dev tools.

So basically I executed a search and replace that looks like this. And this grabs all the emails and replaces them with EMAIL REDACTED. And you can see everything is changed there. That's how I'm maintaining the privacy for this. I thought that was kind of interesting.
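
The search and replace in the episode was run in the browser dev tools. The same idea can be sketched in Ruby as below. The pattern is an assumption: a deliberately simple approximation of an email address, not the expression used on screen, and `redact_emails` is a hypothetical name.

```ruby
# Simplified email pattern (an assumption; real-world addresses can be stranger).
EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/

# Hypothetical helper: replaces every email-looking substring with a placeholder.
def redact_emails(text, placeholder: "EMAIL REDACTED")
  text.gsub(EMAIL_PATTERN, placeholder)
end

puts redact_emails("Payout for jane@example.com FAILED")
```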

On this page, you see a list of the various exceptions that happened within my codebase over the last 30 days or so.

So the top one here, exception #943, MakePayPalPayoutService::PayoutFailed.

This is the system I use to send payments to the authors on Oxbridge Notes, this seems to be an error there. The last error occurred 14 hours ago and it occurred in Oxbridge Notes. Let's click into that now.

So we're on the page for an individual exception here. We can see a very detailed exception name and message for redacted@example.com.

That will be substituted with the actual author's email in my system. I like to bubble that up to the exception reporting system to make debugging easier. I think this is justified from a privacy point of view because my website doesn't do anything particularly private, and I'm okay with the Rollbar service having that particular information.

Next, there's the actual message: the payout FAILED. I don't know why I needed all caps.

And then I say, "The receiver’s account is locked, or inactive. Any payment made to this account will appear as FAILED. You can contact the recipient to request them to activate their account."

Okay, now I'm understanding. What I'm doing here is bubbling up the message from PayPal, which gave the response "FAILED", and this extra text to my exception reporting system.

If I scroll down here, you can see the code that actually generated that exception: raise PayoutFailed, and then some interpolated strings and so on and so forth.

By the way, bubbling up the email address of the particular user having an issue, I find to be very useful.

Whenever I'm sending money to the authors in my system, they can understandably get very upset when their money doesn't arrive on time.

Therefore when I have all the information about the exception in one convenient location, I can proactively reach out to them and tell them what's happened and advise them what the next steps would be. For example, redacted@example.com should contact PayPal and unlock their account.
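
A minimal sketch of that pattern, raising a custom exception whose message carries both the author's email and the vendor's explanation, might look like this. Only the names `MakePayPalPayoutService` and `PayoutFailed` come from the episode; the method `verify_payout!`, its parameters and the "SUCCESS" status string are assumptions made for illustration.

```ruby
class MakePayPalPayoutService
  class PayoutFailed < StandardError; end

  # Hypothetical method: raises unless the (assumed) status is "SUCCESS".
  # The message includes who was affected and what the vendor said.
  def self.verify_payout!(author_email:, status:, vendor_message:)
    return true if status == "SUCCESS"

    raise PayoutFailed, "Payout to #{author_email} #{status}. #{vendor_message}"
  end
end

begin
  MakePayPalPayoutService.verify_payout!(
    author_email: "redacted@example.com",
    status: "FAILED",
    vendor_message: "The receiver's account is locked or inactive."
  )
rescue MakePayPalPayoutService::PayoutFailed => e
  puts e.message # this is the text that would show up in the error tracker
end
```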

Continuing on here, you'll see I have the exact code from my codebase. This is because I connect the Rollbar exception monitoring tool to GitHub, and you can see the exact code that corresponds to the SHA1 of my current release. This is extremely handy to kind of get an overview of what's going on.

Something that I do not have available in my Ruby Rollbar setup, but I do have in my PHP Sentry setup is this thing called telemetry for SQL. I think they call it "breadcrumbs".

So I can see all the particular SQL queries that got run in that request, and that can be very helpful for debugging things.
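
Conceptually, a breadcrumb trail is just a small, bounded log of recent events that gets attached to the error report. The sketch below illustrates that idea only; it is not how Sentry implements it, and `BreadcrumbTrail` and its methods are hypothetical names.

```ruby
# Hypothetical bounded log of recent events (e.g. SQL queries in a request).
class BreadcrumbTrail
  def initialize(limit: 20)
    @limit = limit
    @entries = []
  end

  # Record an event, dropping the oldest ones once the limit is exceeded.
  def record(category, message)
    @entries << { category: category, message: message, at: Time.now }
    @entries.shift while @entries.size > @limit
  end

  def to_a
    @entries.map { |entry| "[#{entry[:category]}] #{entry[:message]}" }
  end
end

trail = BreadcrumbTrail.new(limit: 2)
trail.record(:sql, "SELECT * FROM users WHERE id = 1")
trail.record(:sql, "SELECT * FROM orders WHERE user_id = 1")
trail.record(:sql, "UPDATE orders SET paid = 1 WHERE id = 7")

begin
  raise "payment failed"
rescue StandardError => e
  # Attach the most recent breadcrumbs to whatever gets reported.
  report = { error: e.message, breadcrumbs: trail.to_a }
  puts report
end
```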

How difficult are these tools to install? Not at all. All I had to do to install Rollbar was to include the library with gem install and then add this quick config.rb file to my Ruby on Rails application. It was similarly easy to install Sentry in my PHP Laravel application.
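
For reference, a minimal Rollbar configuration in Ruby tends to look something like the fragment below. This is a sketch assuming the official `rollbar` gem is installed, not the exact file shown on screen; the environment variable names are assumptions.

```ruby
require "rollbar"

Rollbar.configure do |config|
  # Server-side access token; the variable name here is an assumption.
  config.access_token = ENV["ROLLBAR_ACCESS_TOKEN"]
  # Tag each report with the environment it came from.
  config.environment = ENV.fetch("RAILS_ENV", "development")
end
```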

One thing to remember is that you need to install Rollbar or Sentry or whatever, in two places in a modern web app. Once on the back-end, and then once on the front-end in JavaScript or whatever.

That's because there are wildly different requirements and expectations as to what should happen in each of these programming environments. You even get separate API keys... for a good reason.

Previously when we were in Rollbar, I showed you a back-end exception. This is now a JavaScript exception: Uncaught TypeError: document.querySelector(...).replaceWith is not a function.

This occurred four times, affecting three different IP addresses.

If you scroll down here we get more information about each of these occurrences - crucially, the browser. I can see that Chrome was affected as well as Chrome WebView versions 46 and 43. I can also see that the OS was Android. All this gives me enough information to assume that the bug was due to replaceWith not being available in certain browsers. I probably should have a polyfill in place.

Let's go into one of these individual occurrences and see what we find there.

Now we're inside one of these individual occurrences. The first thing to notice is that we have a nice backtrace here.

This is a little bit special because I minify my JavaScript in Oxbridge Notes, therefore Rollbar has to have access to my source maps. And it can work back from those source maps to figure out what my actual TypeScript was. See here the file is webpack://… etc. This doesn't come for free; you have to set this up in some way.

The function that caused the error was handleMailingListSignup, some of you probably mailed me about that, and you get a full backtrace there.

Another very interesting thing that you only get in the front-end, is this thing called telemetry, or at least this kind of telemetry.

So what this does is give me a list of relevant page events that happened before the error, allowing me to piece together the story and try to recreate whatever went wrong.

For example, the Pageload DOMContentLoaded event fired, and the load event fired, and then there's some user input to the input#email_subscriber_email ID, and then a little bit later the user clicked #save_email_subscriber.

That caused a POST /email-subscribers, and my website responded with 204, then some Facebook pixel warning (which is not important).

And then, and you can assume this is after the response came back from my server, whatever code I used to update the page didn't work because replaceWith is not a function.

Then when I go to visit 'canIuse.com' or whatever it's called, we see that there's a lot of red here, meaning that a lot of browsers do not support this particular function yet.

In total, only 85% of web users’ browsers have replaceWith available, therefore I need to install a polyfill in order for my code to work generally.

Certain actions that users take on your website, like buying things or becoming subscribers have a pretty high financial worth.

Therefore, if there's a bug on your website you want to be able to replay and recover from all these actions.

In this case, a vendor I use, SendInBlue, gave an 'error 400' due to an invalid email address.

Now, I can't tell at this point whether or not someone just filled in a completely fake email address, or if there's something wrong with my code that sends the address to SendInBlue.

Therefore, I'll take a look at this exception. We’ve got a lot of information here, such as the controller and action called, what environment it's in, what version of Rails, blah blah blah.

Here we get the actual POST parameters that we used. The authenticity_token isn't important, neither is the commit message.

This, however, is important. Basically I'm able to see the email that someone used to try to subscribe with. There’s something subtle here and you’ll notice it when I scroll further down. There is a trailing space here. That is suspicious because as far as I know, you can't have an empty space within an email, that's not valid.

So anyway, I can see all these other headers here, like Accept-Encoding and Accept-Language. This can all be very helpful for debugging purposes.

Let me get to the very bottom here. There's a curl request I can use to replay things, but I don't want to do that either. What I do want to do is find the email.

And you can see here that redacted@example.com has a space following it. There should be no spaces here or spaces should at least be escaped within this curl command.

So that looks like my issue.

I'm going to confirm that by opening up a Ruby console and then triggering this underlying code myself. I'm going to add a trailing space here, redacted@example.com, and send that off, and we get the exact same error message. Let's see what happens if I remove that trailing whitespace. And it works perfectly.
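
One possible guard against this class of bug is to strip surrounding whitespace before the address goes to the vendor. The episode does not show the actual fix, so treat the following as a sketch: `normalize_email` and `plausible_email?` are hypothetical helpers, and the pattern is a deliberately loose sanity check rather than full email validation.

```ruby
# Hypothetical helper: remove leading and trailing whitespace from user input.
def normalize_email(raw)
  raw.to_s.strip
end

# Loose sanity check: no whitespace, exactly one "@", and a dot in the domain.
def plausible_email?(email)
  email.match?(/\A[^@\s]+@[^@\s]+\.[^@\s]+\z/)
end

submitted = "redacted@example.com " # note the trailing space
puts plausible_email?(submitted)                  # the raw value fails the check
puts plausible_email?(normalize_email(submitted)) # the stripped value passes
```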

I figured out what the bug was, all thanks to exception reporting.