Episode #1

Data Integrity: Null Constraints and Check Constraints

Data is more important than code, therefore the most important job you, as a programmer, have is to design a system that allows for a simple, constrained, and predictable set of data. In this episode, I'll discuss how null constraints can reduce the number of types your program has to deal with, thereby simplifying your code. Then I discuss how check constraints can force data to take a limited (and more useful) range of values. Lastly I'll explain why it's better to carry all this out at the database level rather than at the Ruby/Python/Php/JS level.

June 13, 2020

Show Notes

References

Postgresql Documentation on Constraints - covers check constraints, not null constraints, and more. Are email addresses case sensitive - technically the part before the "@" in an email can be case sensitive, but in practice it almost never is. The &. operator you saw - Ruby's safe navigation operator (&.)

Rails stuff

Attribute Normalizer gem - what I use at the Rails level to remove trailing/leading whitespaces before saving records Lifecyle methods - how Rails allows you to hook in behavior when a record is created, saved, etc.

Screencast.txt

Transcribed by Rugo Obi

I'm always asking myself after a project or after even a feature, what do I know now that if I'd known before I started this feature - would have made me a hell of a lot faster?

One of the all-time biggest factors for me has got to be data integrity. By that I mean having restraint in what shape your data has and enforcing that by leaning on the database - and to some extent validations.

This is the sort of topic that would have been completely invisible to me when I started programming and I would have failed to see the value of it. But what I hope to do in this video is to enable viewers to realize why this is important, perhaps before they would have otherwise.

Not Null Constraints

The first big advantage of using booleang data constraints is that you limit the range of values that your program has to be able to handle. Principal among these, is you limit the number of types that your program has to handle.

For example by removing the ability for null to appear as a type on columns such as booleans.

Let's look at an example here:

So I have this sellers table and if you look at line 15 there — or 15 down at least — you have buy_out and that's a boolean and it has a default value of false.

Just to place that boolean column into a real-life context, it corresponds to a feature on my website where instead of offering people a commission for the notes they upload, I instead buy the complete rights to them. As you can see on this page I'm showing you now.

The difference that that boolean column represents is actually quite huge in how the business is run.

Now you know how booleans work in general, you’d expect there to be two possible values. And I'm going to run a piece of SQL here to count how many of each possible value of buy_out there are in my production database.

I’ve copied this locally by the way.

And here you see something surprising. There are three possible values of buy_out. There is null with 52 values, there's false with 1798 values and there's true with 8 values.

This is confusing because you’d expect there to be only two possible values. The underlying problem here is that I allowed this boolean column to be nullable; for null to be a possible value.

And nowadays when I'm writing database migrations, it's very rare that I allow any column to be null because that means I have to deal with more possible outputs.

Let me give you an example of some of the complications that come when you have null as a possible value.

If I want to get all the sellers who are not bought out, then I have to look for those whose buy_out column is null here. And also those whose buy_out column is false.

So I've got to remember to look for two things instead of one thing.

And here’s how I have to combine those two together in a query.

That's a lot to remember. And imagine you have this buy_out attribute used in many places throughout your codebase, with many different programmers who might not be aware that it could be null. It’s gonna get messy.

So first, I'm going to fix this by setting the buy_out column in all the rows which are currently null to false.

You can see 52 records got updated and now you can see that the distribution is easier to understand.

And with that done, I'm going to alter the table to add a not null constraint.

This is something I should have had initially, but no worries, I can add that later on, like I'm doing here.

What I've shown you so far is just one boolean column. Now we’re looking at my schema.rb file, which is a Ruby representation of my entire database.

This is huge. It has 535 lines as you can see in the bottom right corner.

And let's see how many other boolean columns there are. They’ll be affected by the same issue.

Here we see that there are...you can see in the bottom, 24 of them. Some of them I now have with null: false but some of them also don't have that. Some of the older models.

An improvement to this database design will be to make them null: false. And I'm pretty sure that would reduce the bug count and also simplify the code that it tracks with the database.

Check Constraints

Over my 10 years of running a website, I've noticed that users have a tendency to write their email with weird capitalization, even though email is a lowercase-only form. Or to add leading or trailing whitespace.

Here's an example of what I mean. I've added some space to the left and now I'm capitalizing random characters. Despite this, my system will still log you in.

This was a breakthrough in terms of reducing customer service load when I first added this feature. Because so many customers would enter their email one way in the database and then type it in a different way and hope to enter.

You can see that I'm in the UserSessionsController here, that controls login and logout. And the create method is the one that gets called whenever someone submits the form you just saw.

The naïve way of doing a login would be to exact-match whatever the user enters within that form with what’s in the database.

But as you can see in the commits, the left column shows how the email might be represented in the database, whereas the right column shows what's given in theparams when somebody fills in that user form.

And if you're doing an exact match, then you’ll get failures for all sorts of trivial reasons that are going to confuse end-users.

The first row differs, in that there’s capitalization given in the params, but the user registered the email with lowercase 'j'.

The second row has the exact opposite problem: The email in the database has capital 'J', whereas in the params it is lowercase. And the next two suffer from a combination of leading whitespace and some extra capitalization.

Here’s how you'd write a function to normalize everything under the assumption that the database can have any sort of content in terms of capitalization, or leading or trailing whitespace.

I call the btrim function and the lower function within postgres in order to make the email string all lowercase and for it to be rid of trailing or leading whitespace.

And then theparams are also transformed using the null-safe &. operator.

Again, you can imagine that it could be quite a lot of work to consider all these possibilities throughout the codebase.

A better option would be one where the database can only ever have lowercase strings with the trailing and leading whitespace trimmed. The database query will then be simplified down to one like this. One where you only have to worry about the user params coming in.

This for me, is another example of database constraints. In that I’m not allowing it to contain whole classes of input, therefore I no longer have to think about their existence.

It would effectively delete that left column, because all that’s there is the email address; Jack@example.com. The number of combinations that I'm left to deal with is tiny.

In reality, I already do this at the Ruby level, at the web app level, i.e. before the database with a plugin called attribute_normalizer.

This is called on the email attribute. And if we open up the attribute_normalizer configuration file, you can see that by default it calls the squish method on any of those attributes it’s supposed to normalize, before it's saved to the database.

You can see this in action over here on this little string. The whitespace is removed.

I also have a lifecycle method downcase_email that gets called before a user record is saved in the database.

This approach would be three out of five stars. It ensures data integrity whenever someone uses the Rails API to set things. Or in particular uses the Rails methods that do call the lifecycle methods.

Unfortunately, some methods skip them and they may be accidentally used. And also, someone might change the database manually, and thereby skip all these checks and constraints that are built into the Ruby level.

As you can see here, if I enter the database directly, I'm able to set the email to be uppercase and with leading white space. And you can see that the email now has this crappy form in the database.

This can lead to all sorts of problems: You don't expect it to be malformed in the database.

What we really want to do is use an SQL check constraint to guarantee that the email column is always lowercase and trimmed of trailing and leading whitespace.

With this in place, it is impossible to set the email to something uppercase or with trailing or leading whitespace.

This is guaranteed outside of the Rails app. And that means that no-one, no junior programmer, no broken code, no misuse of the Rails API can damage your data. Your data is sacred, as it should be.

If this seems like a lot of work, I promise you that it isn't.

When you start a project, the database migrations change very dramatically.

But when I looked at the percentage of my commits that touched the database schema, versus the overall number of commits, it was orders of magnitude less. For every commit that changed something in the database, there were 99 - at least 99m- commits that did other things.

Therefore, this is a relatively small amount of work for the amount of protection you get.

With that database constraint in place, the emails represented in the database will always be the same thing. By constraining the database, the code will become simpler because there are far less conditions it has to deal with.

There are many many ways you can use check constraints in your code.

I'm going to list a few common use cases here.

In this first example, I'm showing that the seller state is within a list of possible valid values. This corresponds with a state machine within my code.

I never wanted it to be anything outside that list. And in the past, sometimes there has been something outside of that list and if I had had this constraint in sooner, I would have found that bug more quickly.

The next one checks that some date period makes sense. I use this in the invoicing part of my software and in the seller_invoices table, I check that the start_period comes before the end_period. A little logical check just to prevent dumb mistakes.

I see this stuff as having a parallel with strict typing in a typed language; which is a feature I really appreciate.

And these final two constraints check that some financial numbers in the orders table have the correct positivity / negativity. Is that "polarity"? I'm not really sure in this context.

Anyway, this one ensures that the total is greater than 0, and this one ensures that any discount is less than 0. And that ensures that some other user of this code won't accidentally think that the discount should be measured positively and then applied as a negative number or something like that.

To sum up the value of having these database constraints at the database level as opposed to at the Ruby level, is that you prevent any dangerous backdoors from appearing.

For example, administrators going into your database and changing things manually, leaving your data in some sort of inconsistent or unconstrained form.

This is a way that you as a senior programmer can exert a benevolent kind of control over the shape of your program that overreaches your ability to manually inspect code.

Well that's all I've got time to discuss today.

Next week I'm going to look into other aspects of data integrity.

All in all, I believe that data is more important than code and I hope to convince you of the same over the next couple of episodes.

See you next time .