Data wrangling should be end-to-end idempotent

This is part of the Semicolon&Sons Code Diary - consisting of lessons learned on the job. You're in the workflows category.

Last Updated: 2024-05-07

I was part of a project that transformed data into JSON events. It consisted of 30 small scripts with haphazard names, each performing a small transformation by stitching together disparate data streams.

Unfortunately, more data was later added to our original load, so we had to run the scripts again. This time we forgot a step, which left blanks and nulls where there shouldn't have been any. Fixing this meant manually re-running the commands in a finicky order, which cost many hours and brain cycles.

Ultimately, it would have been much more resilient (and would have saved time and headaches) to have a single entry point that took the dirty data and filled in as much information as possible in one run, in a way that was safe to re-run.
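
To make the idea concrete, here is a minimal Python sketch of such an entry point. It is not the project's actual code: the field names (city, country, plan, event_type), the lookup table, and the two steps are all hypothetical, chosen only for illustration. Each step fills a field only when it is blank, so running the whole thing a second time changes nothing.

```python
# Hypothetical enrichment steps. Each one only fills a field that is
# missing or blank, so applying it to a complete record is a no-op.
def fill_country(record, lookups):
    if record.get("country") in (None, ""):
        country = lookups["country_by_city"].get(record.get("city"))
        if country is not None:
            record["country"] = country
    return record


def fill_event_type(record, lookups):
    if record.get("event_type") in (None, ""):
        record["event_type"] = "signup" if record.get("plan") else "visit"
    return record


# The order lives in one place instead of in someone's memory.
STEPS = [fill_country, fill_event_type]


def wrangle(records, lookups):
    """Single entry point: run every step, in order, on every record."""
    cleaned = []
    for record in records:
        record = dict(record)  # copy, so the input is never mutated
        for step in STEPS:
            record = step(record, lookups)
        cleaned.append(record)
    return cleaned


if __name__ == "__main__":
    lookups = {"country_by_city": {"Berlin": "DE"}}  # made-up lookup data
    dirty = [
        {"city": "Berlin", "country": None, "plan": "pro"},
        {"city": "Atlantis", "country": "", "plan": None},
    ]
    once = wrangle(dirty, lookups)
    twice = wrangle(once, lookups)
    assert once == twice  # a re-run leaves the data unchanged
```

With this shape, newly arrived data can simply be appended to the load and the same command run again: no step can be forgotten, and records that are already complete pass through untouched.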

Lesson

Data wrangling scripts should be end-to-end idempotent