Episode #9

Production Logs: Getting The Most Out Of Them

Logs are a fundamental part of diagnosing and fixing issues in your production system. I talk about 1) the importance of having a single command to immediately tail all relevant logs; 2) the need to ensure one line of log corresponds to one event (in order to play well with unix tools); 3) how to use the timeless awk command to parse your logs; 4) the value of being comprehensive in what you log, so that you can better diagnose issues and re-run failed requests after a bug has been fixed; 5) my system for recreating entire user session journeys by parsing my logs.

August 30, 2020

Show Notes

No notes available for this episode.

Screencast.txt

Transcribed by Rugo Obi

Proficiency with extracting relevant information from your logs is an absolute game changer in enabling you to rapidly respond to issues in production.

I've maintained my site Oxbridge Notes for 10 years now, and thought a lot about how to make my logs work for me, culminating in a system that converts my logs into a chronological list of things a given user or a given IP address did on my website.

Enjoy.

--

Welcome back. Today I continue our tour of the tools I use to stay responsive to issues in my production systems, such as in this website here.

This stuff is rarely taught in typical courses, yet I see it as a necessity in any kind of high-level professional software work.

My favorite must-have tool is a single command that starts live-tailing the production logs from the command line of your development machine. Only a sucker's workflow involves SSHing into a server, looking for some particular log, hoping you can remember where it's located, and then tailing it from there.

In Oxbridge Notes I have this set up with a command bin/tail_production_logs, and now you can see all the logs from my production machine.
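The script behind that command isn't shown here, but a minimal sketch, assuming a single app server and a known log path (the hostname and paths below are placeholders, not my real setup), could be as simple as:

    #!/bin/sh
    # bin/tail_production_logs (sketch): wrap the SSH-and-tail dance in one
    # memorable command so you never have to remember where the log file lives.
    ssh deploy@app.example.com 'tail -f /var/www/myapp/log/production.log'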

So let's go over here and click on something like 'Classics', for example. And let's see what's happening over in these logs again... does that turn up? Yeah. You can see that there's a log entry for the action I just took on the website. This is perfect for debugging production issues, for poking around in the live website, and looking at the effects in the logs.

Unfortunately, some web frameworks and systems spit out the logs for a single event over many lines. All of this represents a single request to the home page, yet it takes up something like eight lines there. That breaks just about every single unix tool, because most of these tools work best when there's one entry, one event, per line. Therefore it's important to squash all of that info down to a single line, as is done here.

Now, one line represents one event.

This is a much better way to represent your data.

What's more, you'll notice that each type of data, like the method, path, format, controller, action, status, etc., is separated by whitespace.

This gives them a sort of tabular format that a dumb tool which just divides things based on whitespace can easily parse.
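To make that concrete, a single flattened entry looks something like the line below. The values and names here are illustrative, and I've elided most of the fields, but the space-separated key=value shape is the point:

    method=GET path=/classics format=html controller=ProductsController action=show status=200 duration=45.1 ... remote_ip=203.0.113.7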

For example, let's say I wanted to analyze the IP addresses making requests to my system, because I suspect I'm being machine-gunned by some bot.

I've got a minute's worth of data here in this file. You can look at it there.

I'm interested in the remote_ip field; you can see it here. I want to extract that remote_ip field from every single one of my log events. I'm going to do that using the timeless awk tool: I'll pipe those logs into awk and then print the 15th field, like so.
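Concretely, that first attempt is something like this (the filename is just a stand-in for wherever that minute of logs is saved):

    cat minute_of_logs.txt | awk '{ print $15 }'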

Right now, we see a bunch of boilerplate: there's this repeating string, remote_ip=, and then the actual IP address, which is the part we want to extract.

So I'm going to modify my awk command to accommodate that. I've pre-written it here. Basically what it does is take the 15th field, split it on the = sign into the variable a, and then print the second entry of that variable a. (awk arrays are one-indexed rather than zero-indexed.)
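Spelled out, that pre-written command is roughly:

    # split() breaks "remote_ip=203.0.113.7" into a[1]="remote_ip" and a[2]="203.0.113.7"
    cat minute_of_logs.txt | awk '{ split($15, a, "="); print a[2] }'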

Let's run that and see what we get. Excellent. We have all the IP addresses here.

Finally, let's count how many times each unique IP address appears, in order to answer the question of whether or not we're being machine-gunned by some bot, the hypothesis being that a bot would show up as a single IP address making a disproportionate number of requests.

So, the naïve thing to do is to simply pipe this into uniq -c for count.
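In other words, something along these lines:

    cat minute_of_logs.txt | awk '{ split($15, a, "="); print a[2] }' | uniq -c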

Those results look plausible at first, but in fact there's an issue here. This IP address [top] is the same as this IP address [bottom].

The problem is that the command-line uniq tool only collapses consecutive duplicate entries. The fix, therefore, is to sort everything before passing it to uniq. And here we can see that this IP address appeared four times.
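So the working pipeline, in full, looks something like this (the final numeric sort is an optional extra that puts the busiest addresses at the top):

    cat minute_of_logs.txt \
      | awk '{ split($15, a, "="); print a[2] }' \
      | sort \
      | uniq -c \
      | sort -rn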

All in all, it looks like no IP address was over-represented in that one minute. It's quite plausible that one normal user hit four pages. Therefore I'd conclude no bots are attacking my website.

The next important factor with logs is the comprehensiveness of the data you output.

As a general rule, I like to log enough data to be able to recreate the request, should that be necessary (for example, if I had some sort of bug in the code). Obviously sensitive stuff is filtered out beforehand.

So here's the 'Contact Us' form from my website. There's nothing special here, you put in a subject, body, and your email address, and send that off.

This is a public facing contact form and there is no CAPTCHA here. So you'd imagine it receives its fair share of spam.

I use a bunch of heuristics, like negative CAPTCHAs, to prevent that spam. But maybe I want to double-check whether or not I'm accidentally filtering out too much.

Let's take a look at the logs and figure that out.

Previously you saw me live-tailing the logs. However, that doesn't persist them, so I use a third-party service, LogEntries, to store the last month or so of my logs. I do this because my hosting provider doesn't offer that as a service.

Anyway, I'm going to look for all the logs corresponding to that controller, the ContactMessagesController.

I also want to only check the log entries where the method was POST. It's no good looking at the GET requests, because that's just people looking at the contact form instead of sending messages.

And my goal here is to check the sorts of messages that were sent but might have been filtered out by my spam filters.
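By the way, because each log line is a flat set of key=value pairs, the same filter would be easy with plain unix tools if you had the raw files on disk. A rough local equivalent, assuming field names like the ones we saw earlier, would be:

    grep 'controller=ContactMessagesController' production.log | grep 'method=POST'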

Now I'm going to scroll across and find the params, the params which I went out of my way to log. (With many web servers, these aren't logged by default.)

The first one, authenticity_token, isn't of much interest, and neither are timestamp and category (that's just a drop-down box). What I'm really interested in is subject, for example.

Here we get some German stuff about "passive income". I'm glad I filtered that out so it never reached my inbox. "The best online job for retirees", "make your old ages rich"... nope. "I reached out some months ago about how explainer videos help"... no, that's more cold-calling rubbish. "Free from neck pain"...

So basically I could go through all of these and look at what percentage was genuine spam versus what percentage was messages I should have received. But I'm not going to do that now; I just wanted to demonstrate the process.

Sometimes there's a confusing exception in production and not enough info to solve it.

I might know that some particular user faces the issue, so I contact them. But they're unable to, or (as is their full right) unwilling to, assist in answering my debugging questions.

I therefore came up with a system to replay, blow by blow, every step a user took on my website, even if it's 20 steps long, in order to inform myself about what might have triggered the bug.

Let me give you a demo:

First, I need to know my public IP address, that is, how my computer is seen on the internet. You might think the best way to find it is to run ifconfig on the network interface I use to connect to the internet.

So if I do that here, I get an IP address all right, but it's 192.168..., and that range is reserved for private networks. That's no good.

To get my public IP address, I need to do something else. The easiest way I know is to ask a third party service.

So I'm going to curl ifconfig.me and then I see that my public IP address as of this moment is 178.0.238.224.
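For reference, the two approaches side by side (the interface name en0 is a stand-in; yours might be eth0, wlan0, and so on):

    # Local interface address: 192.168.x.x, reserved for private networks
    ifconfig en0 | grep 'inet '

    # Public address, as seen by the outside world
    curl ifconfig.me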

Now, keep that in mind while I revisit Oxbridge Notes and do some browsing. I'm going to log in and log out.

[music]

Okay, now that I’ve browsed around on the website and I know my IP address, let's use my software's ability to intelligently work with logs in order to replay the events that happened, as I often do when involved in serious debugging.

I’m in an interactive Ruby session on my production server. Now I'm going to call this bit of code, the EventReplayer, something I wrote myself, feeding it the IP address that we looked up.

Now you can see here, from top to bottom, all the events that I just carried out in the browser.

I logged in, then I was on the admin page, then some login page again because I went back, then I looked up some law stuff, blah blah blah, added something to the cart, then I logged out.
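I won't walk through the EventReplayer code itself, but because every flattened log line carries that remote_ip field, you can approximate the same idea with nothing but unix tools over the stored logs. Something like the following, with the file and field names assumed as before (the real class also works per user, and formats the output more readably):

    # The log file is appended in time order, so a plain grep
    # preserves the chronology of one visitor's session.
    IP=178.0.238.224
    grep "remote_ip=$IP" production.log | less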

This is super useful for trying to figure out what happened when something went wrong.

That's all for today, see you next Sunday.