When modifying code that acts on production data test against a good sized sample

This is part of the Semicolon&Sons Code Diary - consisting of lessons learned on the job. You're in the data category.

Last Updated: 2021-05-15

Case 1: I wanted to write some code to standardize the formatting of phone numbers in a DB of 6000 existing numbers. The existing numbers had and all sorts of random characters within:

My initial code was:

<?php

private function formatPhoneNumber($phoneNumber) {
    // Get rid of spaces, dashes etc.
    $phoneNumber = preg_replace('/[^+0-9]/', '', $phoneNumber);

    $countryCode = '+49';

    if (substr($phoneNumber, 0, 2) == $countryCode) {
        return $phoneNumber;
    } else {
        // Add country code, replacing any preceding 0s, if present
        return preg_replace('/^0/', $countryCode, $phoneNumber);
    }
}

I looked at like one record and everything seemed fine so I ran it against the (backed up) production database.

Later I discovered some records were invalid. E.g. The ones that started with +49 (0)30 – 844 164 – 0 became +490308441640. (This is bad because the 0 after the +49 was not stripped, as it should be). I did not foresee this case.

Because of my backup, I was able to fix this soon afterwards. But it's worth thinking how to handle this better upfront.

A better approach would have used sampling - e.g. to eyeball 5% of the data and figure out edge cases from them before running any code.

A higher degree of care (but slower process) would be to download all the old data elsewhere, transform it all, and eyeball the results, checking against certain constraints (e.g. there should never be a 0 after the +49);