Episode #13

Website Monitoring Tools II: APMs, HTTP status aggregated stats

When you want really detailed breakdowns of how different endpoints in your code contribute to poor performance, you need an APM. These can pinpoint which functions and database calls are the slowest, which is critical info when deciding what to prioritize when optimizing performance. Another great way to take the pulse of a web-app system is to look at the stats for each class of HTTP code (200s/300s/400s/500s). We'll analyze these stats en masse with awk and friends on the command line. Lastly, I introduce yet another, higher-level monitoring metric: revenue. Even if all your other monitoring stats look healthy, if revenue suddenly drops to zero, suspect an undiagnosed tech problem.

September 20, 2020

Show Notes

No notes available for this episode.

Screencast.txt

Transcribed by Rugo Obi

Continuing our tour of website monitoring tools from last week, we're gonna dive into the nitty gritty of exploring how an APM can be used to figure out which function paths are slow.

After that we're going to zoom way, way out and look at financial monitoring.

This is a level of monitoring that's no longer really about software anymore.

And, of course, along the way, in true sysadmin tradition, we will be ripgrepping some logs.

When you want to go deeper and get some insights into the slower requests, you’d use an APM.

I'm using Scout APM at the moment and that's what you're looking at right now.

Let's get some metrics on my web endpoints by going in here.

Here we see a list of the different controller methods in my system. I can sort them by 'Time Consumed': that's how much of a battering each of these controller methods gives my server, i.e. what's consuming the most CPU and RAM.

Here I have ActiveStorage::RepresentationsController#show. This is something to do with displaying images and files.

Alternatively, I can sort by "Response Time", i.e., how much latency the end-user is experiencing.

The slowest one of all here is PaypalController#finalize, that's the endpoint that gets called after someone has completed a payment on PayPal and comes back to the website.

Let's click into it and see what kind of insights we can get.

So the mean response time is about a second. Now we get a breakdown of what's happening. And if you look here, 38.7% of the time or 408.6 milliseconds is consumed with HTTP. This makes sense actually, because the controller endpoint has to communicate with PayPal in order to double check that the payment actually went through.

By looking at this breakdown, I learned more about what's happening.

The next thing taking up time is AutoInstrument. That's actually something to do with the tools I'm using in order to collect this data. I’ve switched that on at the moment but I could switch that off later. So you can probably ignore that.

The next big thing that's taking up time is User#create. That takes 147 milliseconds. If I click into it, you can see the SQL that's generated. Seems like rather simple SQL, so I am surprised that it takes so long.

There is a bunch of other SQL stuff happening here, like the Order#save. You can see what happens - that's actually relatively involved.

Then there is a background job created. There’s a PaypalTransaction created, and you can see how much time all of this takes and this gives you some ideas for debugging.

Assuming this information is correct, then I would probably look into my users table and also the sellers table here. Maybe I'm missing an index, I don't know.

Next thing I like to do is to take note of my aggregate HTTP status codes.

So I’m back on my system metrics page here, and if I scroll down to "Throughput", then for any moment in time I can see the relative number of each kind of HTTP status code.

Ideally I want no 500s and all 200s. Some 400s are inevitable because they're client errors.

Anyway, if I look at this last timestamp here, I have 1.4K 200s, then I have 1.7K 300s. That seems a bit obscene: why should I have more redirects than successes? We're going to investigate that in a second.

110 400s, and a single 500.

Okay, so I don't want to do any kind of power analysis of my log files within the LogEntries website. So I've downloaded it off screen, and we're going to use the Unix toolkit to analyze it.

Let's start off by taking a look at the size of this file by using du -h (human). So 404 megabytes, not too crazy.

Next let's find the number of lines within this: how many events are there? And... God, I don't even know how many that is so... That looks like 1.3 million.

You notice that wc (word count) printed out the number of lines, and also the file name. For reasons that you'll see later, I only want the number, I don't want the filename.

One trick that works with a lot of commands is, instead of giving the file as an argument, you feed it in via stdin. Therefore the command has no way of accessing the filename.

And bingo.
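A minimal sketch of that stdin trick, using a tiny stand-in file (the real log was a 404 MB download):

```shell
# Create a small sample file to demonstrate with.
printf 'line one\nline two\nline three\n' > app.log

# With the file passed as an argument, wc prints the count AND the filename.
wc -l app.log

# Fed via stdin, wc never sees a filename, so it prints only the bare number.
# That bare number is what we'll want to embed in a larger command later.
wc -l < app.log
```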

The next thing I want to do is scan within that file, using rg, for a 3 followed by any digits. So we're going to run that there.

That's a little bit too much information I think, so we'll condense the results in the next step.

I happen to know that the 10th, 17th and 19th space-separated fields (you can see -d for delimiter there) contain the information that I want: the URL, the protocol, and the status code.

So let's run that again.

Okay, again, that's too much information.

So let's slow that down a bit and just feed into the head command.

Now, that looks usable.
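The pipeline so far has roughly this shape. The log lines and field positions below are fabricated stand-ins (the real LogEntries export used fields 10, 17 and 19), and grep -E stands in for rg so the sketch runs anywhere:

```shell
# Fabricated log lines: field 4 = URL, field 5 = protocol, field 6 = status.
cat > sample.log <<'EOF'
x y z /robots.txt http 301
x y z /index.html https 200
x y z /old-page https 302
EOF

# Keep only lines whose status is 3xx, cut out the three fields
# of interest, and stop after the first screenful.
grep -E ' 3[0-9]{2}$' sample.log | cut -d' ' -f4,5,6 | head
```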

There's something I don't like about the way this output is formatted.

Basically, the first field has a sort of a random length, whereas the other two are basically fixed length.

This makes the output a little hard to read.

So I'm going to modify this command. I'm going to use awk instead of cut, because cut doesn't allow you to rearrange the order of the fields, whereas awk does.

So if I just run this command here, you can see that it's much more neatly formatted.
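Something along these lines: awk can both select and reorder fields, unlike cut. The field numbers and padding width here are illustrative, matching the fabricated log above rather than the real one:

```shell
# Fabricated log lines: field 4 = URL, field 5 = protocol, field 6 = status.
cat > sample.log <<'EOF'
x y z /robots.txt http 301
x y z /a/much/longer/path https 302
EOF

# Print the fixed-width fields (status, protocol) first and the
# variable-length URL last, so the columns line up neatly.
awk '{ printf "%s %-5s %s\n", $6, $5, $4 }' sample.log
```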

I'm going to want a few more lines than just the 10 that head gives by default. So let me tell it to give me 1000 lines, and then feed it into a pager (I'm using bat here) so that I can go back and forth nice and easily.

And page-up and page-down.

So the very first entry here is the /robots.txt path.

That should not be a redirect.

But wait a second, the protocol here is HTTP. I redirect all HTTP requests to HTTPS, therefore this is expected.

The next couple of entries seem like hacking attempts. One is trying to access /wp-admin, the WordPress admin; the next is looking for a .env file that some poor sucker on the internet probably put in their public folder. And it repeats...

Here is something that I recognize from my framework. This is the way the current version of Rails basically lets you look at S3 images or other files.

There are a lot of these redirects and I find this sub-optimal.

The issue is that Rails does not allow you to directly link to a public S3 URL or at least it didn’t at the time.

My next step is to remove those entries, because they're just noise, and see if there are any other redirects that I should be paying attention to.

So I've modified the command here by doing an inverse search with rg: we don't want anything that contains active_storage, nor anything whose protocol is plain http (using the \b word boundary so it doesn't also match https). Then it's fed into head and run again.

Let's take a look at this output.

Now we get some URLs that seem legitimately to be missing for normal reasons or for irregular reasons, I can't really tell you that right now. But I've demonstrated enough of the process for now.
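That inverse filter looks roughly like this. As before, the log lines are fabricated, and grep -Ev stands in for rg's inverse match so the sketch is portable:

```shell
# Fabricated 3xx log lines: field 4 = URL, field 5 = protocol, field 6 = status.
cat > sample.log <<'EOF'
x y z /rails/active_storage/representations/abc https 302
x y z /robots.txt http 301
x y z /interesting-page https 302
EOF

# Drop the ActiveStorage redirects and the plain-http-to-https redirects.
# \bhttp\b matches the bare word "http" but not "https", so only the
# expected redirect noise is filtered out.
grep -Ev 'active_storage|\bhttp\b' sample.log | head
```

What survives is the short list of redirects that actually merit investigation.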

Previously, over on this metrics dashboard, it seemed like 300s made up the majority of my requests.

Let’s calculate a more accurate figure based on the insights we've seen so far.

My plan is to use bc, the basic calculator that's available on the command line.

For example, I could say echo "3 + 5", pipe that into bc, and it'll give me 8.

Now, let me call up that previous command again, stop it, and then run fc (fix command).

This opens up an editor, and will allow me to modify it in a fluid sort of way.

The overall plan is to echo the number of 301s and then divide that by the total number, and then feed that into the bc basic calculator.

So what I want to do first is surround this with echo, and then put this in a sort of sub-expression... where do we want to go to? I think we want to go all the way over here. We don't need this head stuff anymore. What we do need instead is a wc count of the number of lines.

Then we're going to do another sub-expression for wc -l (word count, lines), using the stdin redirect trick from earlier.

And then we're going to feed the whole lot into the bc basic calculator. And we're also going to tell it to show decimal places.

Let’s see what happens.

Okay, it’s taking a while. Let’s hope it actually runs... and bingo. 0.0028. That’s a much better figure.

The last pillar in monitoring your website, in my opinion, is to step completely away from the tech mumbo jumbo and look at a metric or two that really matter, the ones that are your software's reason for being.

With Oxbridge Notes, it’s revenue. Therefore I have this page showing trailing sales numbers. If these go to zero, that's a pretty damn good indicator that something's broken, even if all the other alarms have stopped.

I also have something similar in Google Analytics, if there are no conversions then I get an alarm.

And with that, we've come to the close of the five pillars I use to ensure that my system is up and running.

See you next week.