System failure is not binary - your application may be slow, or restarting too often, or teetering on the edge of disaster. System monitoring is how you detect these issues - either before or after the fact. I show you how I scan my aggregated logs from multiple components for certain logged error conditions (e.g. hacking attempts). If serious ones crop up, an email alert is triggered. Next I demonstrate how to use Monit to get real-time stats on your system's CPU and RAM usage. Following that I use another tool to graph my RAM usage patterns over the last week. Zooming out again, I introduce response time stats.
September 13, 2020
3xx HTTP status codes are not just redirects.
E.g. a 304 Not Modified status is expected with HTTP caching.
Transcribed by Rugo Obi
Step 1, the service starts off with a minuscule 0.2% CPU usage.
Step 2, I create a CPU-intensive infinite loop.
Step 3, the CPU usage rises to 5.7% and then to 100%. Are you going to know about that?
Your software system’s health is not binary. It's not quite as simple as broken versus working: It might be slow, it might need constant restarting, it might be under attack by hackers, it might be running perfectly fine right now but teetering on the edge of disaster. That's where system monitoring comes in.
This topic is huge. So I'm just going to focus on a small subset of what I found helpful. So I've opened up my logs here again and I'm going to search for fatal database errors using this regex I prepared earlier.
Now, let's look at what turns up.
So, the first few entries are sort of weird. I'm just going to skip them for now, they're not terribly interesting.
Here's an interesting one:
FATAL: password authentication failed for user ‘postgres’
FATAL: no pg_hba.conf entry for host was ‘188.8.131.52’
So, this is basically a hacking attempt on my Postgres database, but the host-based authentication (HBA) file is keeping this foreign host out, as it should.
I use a hosted postgres service, and I trust them. Therefore I do not get email notifications about this particular error.
But if I were running a piece of software that had higher security needs than Oxbridge Notes, then I might want to get emails for every single such attempt.
Each of these errors has the
sql_error_code 28000. So what I'm going to do is copy that and then paste it over here to ensure that I can do a regex filter, just for that.
It takes a second, and sure enough, it's just those particular entries.
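As a rough sketch of what that kind of filter does, here is the same idea in Python. The log lines and their layout are made up for illustration (the real aggregated logs will look different depending on the logging service), and `filter_logs` is a hypothetical helper, not part of any logging tool.

```python
import re

# Hypothetical log lines in an invented format, for illustration only.
SAMPLE_LOGS = [
    "10:00:00 postgres sql_error_code = 28000 FATAL: password authentication failed",
    "10:00:05 web GET /notes 200",
    "10:01:00 postgres sql_error_code = 28000 FATAL: no pg_hba.conf entry for host",
    "10:02:00 postgres sql_error_code = 00000 LOG: connection received",
]

# Match the field name followed by exactly 28000.
# The trailing \b stops the pattern from also matching e.g. 280001.
AUTH_FAILURE = re.compile(r"sql_error_code\s*=\s*28000\b")


def filter_logs(lines, pattern):
    """Return only the lines that the compiled regex matches."""
    return [line for line in lines if pattern.search(line)]


matches = filter_logs(SAMPLE_LOGS, AUTH_FAILURE)
for line in matches:
    print(line)
```

Filtering on the error code rather than on the message text keeps the pattern short and catches both variants of the failed login.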
The reason I might want to do this is that, if I were more security conscious, I'd want to get an email - some sort of alert - every time this thing happens.
So I'm testing my filter here, then I'm going to create a tag and alert, and I'm going to add a new one here, and I can say
DB hack attempt or whatever. I put the pattern in here. And then I say it's
fatal or whatever or... doesn't matter. And then I can set an alert that I get an email or whatever. I can save changes. I'm not actually going to save changes because I don't need this one, but that's how you do something like that, that's how you scan your logs in order to detect issues with your system.
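The tag-and-alert idea boils down to a named pattern plus a notification that fires when the pattern matches. Here is a minimal sketch of that, assuming a hypothetical `AlertRule` type and a `notify` callback standing in for whatever actually sends the email. None of these names come from the logging service shown in the screencast.

```python
import re
from dataclasses import dataclass


@dataclass
class AlertRule:
    """A named log pattern with a severity label (hypothetical structure)."""
    name: str
    pattern: str
    severity: str


def run_rule(rule, lines, notify):
    """Call notify once if any line matches the rule; return the hit count."""
    regex = re.compile(rule.pattern)
    hits = [line for line in lines if regex.search(line)]
    if hits:
        notify(f"[{rule.severity}] {rule.name}: {len(hits)} matching log line(s)")
    return len(hits)


# Collect notifications in a list instead of sending real email.
sent = []
rule = AlertRule(
    name="DB hack attempt",
    pattern=r"sql_error_code\s*=\s*28000\b",
    severity="fatal",
)
hit_count = run_rule(
    rule,
    [
        "sql_error_code = 28000 FATAL: password authentication failed",
        "GET /notes 200",
    ],
    sent.append,
)
print(sent)
```

Passing the notifier in as a function makes it easy to test the rule without sending anything, and to swap in a real mailer later.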
I do however have tags and alerts for other kinds of things that are more important to me. You can see them all here.
And the most important one by far for me is when my memory quota is exceeded - when I’m out of RAM. This is a very common issue whenever you're running a web server.
This brings me on to my next point: you need to be able to check in on your servers' system stats (i.e. RAM and CPU) at any instant.
Depending on the platform you deployed to - like raw Ubuntu, AWS, platforms as a service like Heroku - there will be all sorts of tools that are more or less helpful in order to monitor your system resources and so on.
One of the most popular DIY approaches is Monit, which you're looking at right now.
This can be deployed for free on any server and is then available as a sort of web app on a certain port.
We're looking at my Monit instance for a micro server I have, a micro service that does some sort of document conversion for me.
The main thing you'll notice is the overall system stats. It's okay. There's pretty much zero load on it right now.
The CPU is at 0.2%, memory usage is 10%. Basically this server has not been used, there's no workload being given to it. It's a very bursty sort of workload that it receives.
So, as an experiment, I'm going to create an artificial load on this server.
So first I'm going to SSH in. I have this little handy alias here,
root@oxnotesdocservices, or rather the
oxnotesdocservices is the alias. I don't have to type in the IP address.
Now you can see I'm inside the server. I'm just going to clear it.
Next, I'm going to create an infinite loop, one for every single CPU in the system in order to use up all the resources, to waste the resources.
So let's see how many processors I have with
nproc...only 1. Okay, so that’ll make the script a lot easier.
So I'm just going to do a
while script and make it
do nothing. And let's see what happens.
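The screencast uses a shell `while` loop for this. As a hedged Python equivalent, the sketch below runs one busy loop per CPU. Unlike the infinite loop in the video, it stops after a deadline, which is an addition here so that it cannot run away on a real server. `busy_loop` and `burn_all_cores` are illustrative names.

```python
import multiprocessing
import os
import time


def busy_loop(seconds):
    """Spin on the CPU until the deadline passes; return the iteration count."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations


def burn_all_cores(seconds):
    """Start one busy loop per CPU and wait for all of them to finish."""
    cpu_count = os.cpu_count() or 1  # the Python counterpart of nproc
    workers = [
        multiprocessing.Process(target=busy_loop, args=(seconds,))
        for _ in range(cpu_count)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return cpu_count
```

Calling `burn_all_cores(30)` from a script's `if __name__ == "__main__":` block should be enough to watch the CPU figure climb in a monitoring dashboard. On a single-CPU machine like the one in the video, it starts just one worker.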
So I'm running that there and I'm going to skip back over to Monit and see if I notice anything.
Already the CPU usage is starting to climb upwards, it’s 5.7%. By the way, this
us means user-space stuff, and this is system stuff. Oh! It jumped to 100%.
The load is based on an average over a certain time span.
If this were to continue for much longer, Monit would send me an email (I've given it some sort of email config to Sendgrid or something), and it would tell me that there's a problem with my server, its resources are too heavily used, and I need to scale up.
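The "if this were to continue for much longer" part is the key idea: a single spike is not enough, the high usage has to be sustained. Here is a minimal sketch of that rule. The 90% threshold and the five-sample window are invented for illustration and are not taken from the Monit config in the video.

```python
def should_alert(samples, threshold=90.0, cycles=5):
    """Return True only if the last `cycles` samples all exceed the threshold.

    `samples` is a list of CPU usage percentages, oldest first.
    """
    if len(samples) < cycles:
        return False
    return all(sample > threshold for sample in samples[-cycles:])


# A brief spike: not enough consecutive high readings yet.
print(should_alert([0.2, 5.7, 100.0, 100.0]))

# Sustained load: five high readings in a row.
print(should_alert([0.2, 100.0, 100.0, 100.0, 100.0, 100.0]))
```

Requiring several consecutive readings avoids an email for every short burst, which matters for a bursty workload like this one.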
Anyway, I don't want that to happen so I'm going to cancel this fake workload.
Next I'm going to show you another perspective on system monitoring.
This one is a little more long term. These techniques are not so much about getting alerted when your system is burning up in flames, but rather about getting a general indication of its health.
Think of this as akin to taking your blood pressure or weighing yourself regularly.
You're looking here at the stats over the last one week of my primary Oxbridge Notes server.
For most web developers, I believe memory usage is the biggest issue they’ll face.
RAM is an issue because if you run out, then you have to swap to disk, which is orders of magnitude slower and also really CPU-intensive.
This means that your app is going to drop requests, lose information, bug out during payments, and generally become a heaping pile of useless crap until it’s rebooted and the whole process begins again.
So this is my memory graph over the last week. You see it has a certain pattern that increases slightly then levels out, and then it goes down again. And that happens again and it goes down.
The reason for this is that there's a daily restart. Unfortunately, the CSS is a bit weird here because I’ve zoomed in so much, but this daily restart should actually correspond to the peak here.
Now, the reason memory increases over time is that more parts of my code get loaded and so on, but it doesn't increase to infinity.
That's a good thing. That means I don't have a memory leak. In the past I did have a memory leak, and the graph would look more like increase, increase, increase, increase (I also had much more RAM then, something like two gigabytes), increase, increase, increase and then restart, with very frequent restarts because it kept maxing out.
That's a really bad situation to be in.
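The difference between the two graph shapes (levelling out versus climbing until restart) can be checked with a very rough heuristic. The sketch below compares the average of the last third of the readings taken since a restart with the average of the middle third. The 5% tolerance and the sample numbers are made up, and `still_growing` is a hypothetical helper rather than anything the monitoring tool provides.

```python
from statistics import mean


def still_growing(samples, tolerance=0.05):
    """Return True if memory is still climbing late in the restart cycle.

    `samples` holds positive memory readings (e.g. in MB) taken since the
    last restart, oldest first.
    """
    third = len(samples) // 3
    if third == 0:
        raise ValueError("need at least three samples")
    middle = mean(samples[third:2 * third])
    last = mean(samples[-third:])
    return (last - middle) / middle > tolerance


# Invented readings: one cycle that levels out, one that keeps climbing.
plateau = [200, 300, 350, 380, 390, 395, 398, 400, 400]
leak = [200, 300, 400, 500, 600, 700, 800, 900, 1000]

print(still_growing(plateau))
print(still_growing(leak))
```

This is only a first-pass check: a long warm-up period would also trip it, so the graph itself is still the better judge.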
Scrolling down further we have response time. How long does it take my server to give a response to the browser?
This is a much higher level, more customer-centric metric than RAM. While RAM issues can cause a slow response time, a lot of other things can too, like crappy network interfaces, slow SQL queries, poor caching, etc.
So my median is 39 milliseconds, which I find pretty reasonable, and 95% of requests are completed in under 150 milliseconds.
This isn't amazing but it's good enough for my purposes.
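For anyone who wants to compute these two numbers from their own request logs, here is a small sketch. It uses the nearest-rank method for the percentile, which is one of several common definitions and may differ slightly from what a given monitoring tool reports. The response times are invented sample data.

```python
import math
from statistics import median


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with pct% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical response times in milliseconds.
response_times_ms = [12, 25, 31, 39, 39, 40, 47, 60, 150, 420]

print("median:", median(response_times_ms))
print("p95:", percentile(response_times_ms, 95))
```

The median describes the typical request, while the 95th percentile shows how bad the slow tail is, which is why the two are usually reported together.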
See you next week.