Episode #2

Data Integrity: Foreign Keys and Uniqueness Constraints

Here I continue on from the last episode (null constraints, etc.) in exploring ways to use an SQL database to ensure data integrity. I'll show how to avoid shooting yourself in the foot by setting non-existent relationships or by deleting rows that are referenced elsewhere in the database and are therefore still needed. I'll show how to lean on foreign keys to build resource-allocation features with practically no backend code. Next I'll demonstrate the perils of relying on uniqueness validations at the Ruby/PHP/Python backend level. Finally, I'll show how to avoid bloat in the pivot tables behind many-to-many relationships.

June 21, 2020

Show Notes

References

PostgreSQL Documentation on Constraints - covers uniqueness constraints and foreign key constraints

Screencast.txt

Transcribed by Rugo Obi

1. Foreign Keys To Prevent References To Non-existent Associations

One annoying bug I had when I was a naïve younger programmer was a foreign key in some database table referring to a record id that no longer exists.

To put that into context, here I have a Product model and you can see that it has a notes_pack_id field, which is set to 60.

And then, thanks to the ORM, I can also instantiate that notes_pack object by just calling notes_pack on that product. And here we see it.

Now, let's imagine I want to set that to a non-existent number. So I use Rails to find the highest notes_pack_id in my system: that's 834.

So I'll just pick an arbitrary number bigger than that, let's say 9999, and then save it. And BOOM. You see that I'm unable to save that, due to a foreign key constraint: the notes_pack ID (9999) is not present.

That is exactly the kind of error I want, and it prevents a whole class of failures from ever happening in my code base.
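If you want to reproduce this outside of Rails, here's a minimal SQL sketch of the idea. The table and column names follow the ones in the episode, but the exact schema is my assumption, not the real one:

    -- Assumed sketch of the relevant schema.
    CREATE TABLE notes_packs (
      id bigserial PRIMARY KEY
    );

    CREATE TABLE products (
      id bigserial PRIMARY KEY,
      notes_pack_id bigint NOT NULL
        REFERENCES notes_packs (id)  -- the foreign key constraint doing the work
    );

    -- Pointing a product at a notes_pack id that doesn't exist is rejected:
    UPDATE products SET notes_pack_id = 9999 WHERE id = 1;
    -- ERROR: insert or update on table "products" violates foreign key constraint (paraphrased)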

2. Foreign Keys To Prevent Problematic Deletions

Let's say that some part of my application, for example some code in my admin area, generates an SQL statement to delete the notes file with ID 17133. Should this operation be allowed to go through?

In general, there may be some sort of constraints and checks within the Ruby world, but depending on how the code is executed, or which workflows and controllers it runs through, these checks may or may not get run.

The ultimate way to prevent overeager deletion of this record, deletion that would leave other records in the lurch, is to use foreign key constraints.

And you can see here that this delete operation failed because a constraint in the products table references this particular notes file.

This is because, as you can see on the website here, each product has a best notes file: something displayed on its product page, and the thing that generates this large PDF sample that you see.

And a product contains many notes files, one of which is elevated to this kind of special place as best notes file.

And in the case of the notes file we just tried to delete, this one was, in fact, the best notes file for the product it was contained within. Therefore we shouldn’t be allowed to delete it without elevating another one to that position, and the database constraint assures us of that.

As you can see here, if we change the best notes file to something else, then the deletion goes through, i.e. the constraint helps us when it's needed, but gets out of the way once its job is done.
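In SQL terms, the setup looks roughly like this. Again, the names are my guesses based on what's shown in the episode, and the error text is paraphrased:

    -- Assumed sketch: products carry a best_notes_file_id referencing notes_files.
    CREATE TABLE notes_files (
      id bigserial PRIMARY KEY
    );

    ALTER TABLE products
      ADD COLUMN best_notes_file_id bigint REFERENCES notes_files (id);
    -- the default referential action (NO ACTION) blocks deletes of referenced rows

    -- Deleting a notes file that is still some product's best notes file fails:
    DELETE FROM notes_files WHERE id = 17133;
    -- ERROR: update or delete on table "notes_files" violates foreign key constraint on "products" (paraphrased)

    -- Point the product at another notes file first, and the delete goes through:
    UPDATE products
      SET best_notes_file_id = 17134   -- 17134 is a stand-in for any other existing notes file
      WHERE best_notes_file_id = 17133;
    DELETE FROM notes_files WHERE id = 17133;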

3. Leaning On Foreign Keys To Manage Resource Allocation

Database constraints can be used to replace the code behind entire features, particularly ones involving resource allocation.

Case in point, here's a PHP application involved in booking medical appointments.

Each appointment requires a ticket, and the ticket should be unused in order for the appointment to be booked.

You can see that an exception is raised if the ticket is already used; otherwise an appointment gets created. And in code that's not shown here, that ticket is marked as used and then saved.

In order to cancel an appointment, then, we have this cancelBooking function, which essentially sets the appointment's ticket->isUsed field to false, saves it, then deletes the appointment. This is necessary to free up the resource.

Unfortunately this adds some complication, because you have to remember to deal with this boolean in every action that uses up or frees that ticket resource.

There is, however, a much simpler way to do this: use a FOREIGN KEY on ticket_id, and make that foreign key unique.
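As a rough sketch of the database side of this (the table definitions are my assumptions, though the constraint name matches the error we'll see in a moment):

    CREATE TABLE tickets (
      id bigserial PRIMARY KEY
    );

    CREATE TABLE appointments (
      id bigserial PRIMARY KEY,
      ticket_id bigint NOT NULL     -- every appointment must consume a ticket...
        REFERENCES tickets (id),    -- ...and it must be a real ticket
      CONSTRAINT appointments_ticket_id_unique UNIQUE (ticket_id)  -- ...used at most once
    );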

Here's the new and improved code for the book method. You'll notice there's no need to manage the ticket_is_used Boolean. In fact that whole column has been deleted from the database... this shorter version of the code does the trick.

Similarly, the cancelBooking method is greatly simplified. In fact, all it does is delete the appointment, because by deleting that appointment, the ticket ID will be freed.

Let's make this a bit more concrete by visiting the database. So I'm going to check out what the first appointment in my database looks like. And you can see it has ticket_id equals 1. Let's remember that and try to reuse that ticket and see what happens.

In fact, first let's attempt to create an appointment without using any ticket, because that's perhaps the first way this could go wrong: an appointment created without a ticket, perhaps through broken code, with the practice ending up shortchanged.

You can see here that, as expected, you get an error about ticket_id violating the NOT NULL constraint. That's good: it means it's impossible to create an appointment without a ticket id.

Next, let's attempt to use, or rather reuse, ticket id number one. This fails, as it should: duplicate key value violates unique constraint “appointments_ticket_id_unique”.

This shows us that it's impossible for someone to reuse the same ticket on another appointment.

And if you watch me delete the original appointment you saw up above, then I can run that code for inserting and creating a new appointment with a ticket_id set to 1 and it will work.
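In raw SQL terms, the three experiments above boil down to something like this, using the assumed schema sketched earlier (error messages paraphrased):

    -- 1. No ticket at all: rejected by the NOT NULL constraint.
    INSERT INTO appointments (ticket_id) VALUES (NULL);
    -- ERROR: null value in column "ticket_id" violates not-null constraint

    -- 2. Reusing ticket 1, which already belongs to an appointment: rejected.
    INSERT INTO appointments (ticket_id) VALUES (1);
    -- ERROR: duplicate key value violates unique constraint "appointments_ticket_id_unique"

    -- 3. Delete the original appointment, and the ticket is freed up again.
    DELETE FROM appointments WHERE ticket_id = 1;
    INSERT INTO appointments (ticket_id) VALUES (1);  -- now succeeds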

For me, this is much more robust and reliable than anything involving the management of the boolean. Not to mention, much less code.

4. Stop Race Conditions From Disrupting Uniqueness Checks In Backend/Framework Code

Validations at the Ruby or the web-app level are rubbish when it comes to uniqueness.

So you can see here that I have a uniqueness validation on invoice_number, which really does need to be unique, otherwise there's going to be a serious irregularity in my accounting.

So let's go into the Rails console and check out what happens when we run that validation; it gets run whenever you save or update a record.

So now we're going to try it with the invoice_number from the previous invoice. The current one is "2020-5-32". So let's just subtract one from that last number and make it 31.

This should fail, and it does. But the more interesting part is the SQL queries that get run. You can see that the first one is Seller Load, we don't care about that one.

But the next one is SellerInvoice Exists?, and what it does is check whether an invoice with the number "2020-5-31" already exists. I.e. an SQL query gets run, and the Ruby app has to wait for the results before deciding whether or not the uniqueness validation passes.
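Roughly, what happens under the hood is a read followed, some time later, by a write. The table name seller_invoices is my assumption from the model name, and the queries are simplified versions of what the ORM actually generates:

    -- The pre-check the validation performs, approximately:
    SELECT 1 FROM seller_invoices WHERE invoice_number = '2020-5-31' LIMIT 1;

    -- ...and only afterwards, the actual write:
    INSERT INTO seller_invoices (invoice_number) VALUES ('2020-5-31');
    -- another process can slip its own write in between these two statements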

Of course, you can have multiple threads or multiple servers acting on the same database, so this is vulnerable to race conditions. Over a 10-year period I have seen this happen multiple times, and these bugs are very surprising and difficult to debug.

The kernel of the problem is that there is a time delay between when you query for uniqueness and when you actually do the write. In that intervening hundred milliseconds, or even a second in some cases, some other process or server may have taken that identifier.

And by the time you write, it will no longer be unique; if your constraint exists only at the programming-language layer, for example in Ruby or Python, then there's no guarantee of actual uniqueness.

You can, however, completely avoid this issue by adding a uniqueness constraint in the database. You can see by looking at my schema that I have already added one. It's the second of the indexes; a UNIQUE index acts as a constraint.
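For reference, such a constraint is just a unique index. A minimal sketch, with an index name following the Rails convention (assumed, not copied from my schema):

    CREATE UNIQUE INDEX index_seller_invoices_on_invoice_number
      ON seller_invoices (invoice_number);
    -- now two competing writes of the same invoice_number cannot both succeed,
    -- no matter how many app servers or threads are racing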

5. Keep (Unwanted) Duplication Out Of Pivot Tables (Many-To-Many Relationships)

Uniqueness constraints are important in many-to-many tables, i.e. in pivot tables.

Without them, you can end up with a lot of duplicate data. That's confusing, and it also causes bloat: the data set gets larger and slows down, because a larger data set takes longer to traverse.

Let's make this concrete with an example from my code-base.

I have a Taxon and a Product model. These two have a many-to-many relationship.

You can see that represented in the Rails world with has_and_belongs_to_many appearing in each of these classes.

This makes sense, a product could be tagged with law, and with undergraduate. And, similarly, a tag for undergraduate can contain many products.

This connection is represented in the products_taxons table, which just contains product_id, taxon_id, and a regular id.

Now I'm just going to grab the first product, and the first taxon. That's arbitrary. And then I'm going to create a connection between them in Ruby, using the shovel operator, and I'm going to repeat this a couple of times, basically pushing the taxon into the product's current list of taxons a few times.

Now that I've run this, I'm going to move to the SQL database, to the Postgres database and have a look at the records that are created.

The id for the product I worked with was 26, and the taxon, 2. So let's check records that have both of them.
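That check is just a query like the following, assuming the join table looks the way it's described above:

    SELECT * FROM products_taxons
    WHERE product_id = 26
      AND taxon_id = 2;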

And as you can see here, there are four records, corresponding to the four times that I pushed that taxon into the product's list of taxons.

Yet this contains no extra information, or at least none that's valuable in this context.

The naive way to avoid these kinds of problems is to check whether a product's list of taxons already includes the taxon you want to add.

The problem with this approach is that you have to remember to do it in every part of the codebase that modifies that list of taxons.

This is error-prone, especially when there are multiple people on the project. A better way to do this is to create a unique constraint that spans those two columns.
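In Postgres terms, that's a unique index spanning both columns. A minimal sketch (the index name here is my own, following the Rails convention):

    -- Duplicate (product_id, taxon_id) pairs are now rejected by the database,
    -- no matter which part of the codebase does the insert.
    CREATE UNIQUE INDEX index_products_taxons_on_product_id_and_taxon_id
      ON products_taxons (product_id, taxon_id);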

Now, once we try adding a taxon to a product's list of taxons a second time, we get an error. Specifically, we get a PG::UniqueViolation: duplicate key value violates unique constraint. That's much better.

That’s all I have to say on this aspect of data integrity.

So, see you next week.