For context, I work in a Django/React (TS)/React Native codebase on a 20-person team at MorphMarket. I have 17 years experience with web development in other frameworks (Rails/Laravel/Express etc.) and similar ideas are applicable.
Performance
- Page Counts Are a Major (but Surprising) Source of DB Performance Issues at Scale. Giving an accurate "total pages" count (or total "items" count) is a hard problem at scale. This is because the DB needs to count all matching rows—and, often, the queries used here select a great many related records and call many filters. This can potentially require millions of rows to be scanned. I don't have a full solution here. What we have done so far is automatically strip all "annotations" from paginators (extra data used to render items but not necessary for page counts). In some cases for very large tables, we use an estimated count paginator (which uses some DB stats instead of actual queries) instead of actually counting all rows. Lastly, for endpoints where the page count really matters, we use some systems to cache it. For example, if the page count is going to be less than the number on the page (or, more technically, per our "cache block," which is 1,000 items), then we instruct the paginator to not do an extra DB query since it can already infer the count of records by the fact of the block being incomplete. I could go on...
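Two of these tricks can be sketched briefly. This is illustrative pseudof our approach, not our exact code: PostgreSQL's planner statistics give a near-free row estimate, and a short final page lets you infer the total with no extra query at all.

```python
# 1) PostgreSQL keeps a row-count estimate in its statistics; reading it is
#    nearly free compared to COUNT(*) over millions of rows. (The estimate is
#    only refreshed by VACUUM/ANALYZE, so treat it as approximate.)
ESTIMATED_COUNT_SQL = "SELECT reltuples::bigint FROM pg_class WHERE relname = %s"

# 2) If the page that comes back is short, the total can be inferred with no
#    extra query at all.
def inferred_total(page_number, page_size, items_on_page):
    if items_on_page < page_size:
        return (page_number - 1) * page_size + items_on_page
    return None  # page is full: a real (or estimated) count is still needed
```

For example, 12 items on page 3 at 50 per page means there are exactly 112 matching rows, so the paginator can skip its COUNT query entirely.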
- Long Cron-Job Reads Can Cripple Your System. A single inefficient, long-running DB query (often in a cron job for reporting) can halve the performance of your entire system if it shares a DB with your web traffic. Moreover, queries happening outside the web process are often not as effectively tracked in APMs like New Relic, so they can be invisible. The solution here is to treat the performance of these queries as something worth worrying about... or to create a read-only follower of your DB for these purposes so as to insulate your "hot" DB serving web traffic.
- More generally, long queries OR large payloads are the chief troublemakers for systematic performance issues in Redis/PostgreSQL (etc.), so if in doubt about what to optimize, start here.
- RAM Pressure Is Underrated as a Cause for Performance Issues. This cause can be a little insidious because your system will not crash if under moderate RAM pressure; instead, it will swap memory to disk and get much slower.
- Mark Each Deploy in Your APM (e.g., New Relic). This is in order to correlate performance data with code changes. Connecting cause and effect speeds up debugging perf regressions.
- Offloading Work to Background Jobs Is Very Effective. For example, sending emails, push notifications, WebSocket messages, and processing webhooks from payment providers, etc. You want to keep your web processes running fast.
- If Data Is a Tree Structure (e.g., Categories), Use Power Tools for Data Modeling and Querying. Our category hierarchy was highly inefficient when using standard ORM modeling. Theoretically, there are systems for doing this well in SQL. However, in our case, it was easier to move all operations (other than grabbing all categories on boot) off the database completely and cache them as a tree in local RAM. This saves countless hundreds of millions of self-joins in SQL for operations such as "ancestors" or "descendants". It also goes to show: know your data structures.
- Set Timeouts at Various Levels to Prevent Resource Starvation. As a sort of general immune system, set system limits, such as requiring that any web request must take under 25s; otherwise, it will be terminated. While this can create complexity for genuinely long-running ones and require the use of background jobs, it's worth it because of the protection it provides by preventing situations where a performance issue in a single endpoint can cause your entire server to get starved of threads/processes and crash. Another example is when you rely on a third-party API that suddenly experiences performance issues and takes (e.g., 180s) per request. Very quickly, all your parallelism will be used up doing this waiting, and you won't be able to serve "normal" requests and will experience downtime.
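As a sketch of layering these limits (assuming gunicorn and PostgreSQL; the exact numbers are illustrative):

```python
# gunicorn.conf.py: workers serving a request longer than 25s are killed
timeout = 25

# Django settings sketch (assumed PostgreSQL backend): cap any single SQL
# statement so one runaway query cannot hold a worker indefinitely
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "app",
        "OPTIONS": {"options": "-c statement_timeout=20000"},  # milliseconds
    }
}

# Outbound calls to third-party APIs should always pass an explicit timeout,
# e.g. requests.get(url, timeout=(3.05, 10)); never wait 180s for a vendor.
```

The per-statement DB timeout sits below the web-request timeout, which in turn sits below any proxy timeout, so failures surface at the narrowest possible layer.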
- Kill Worker Processes Regularly as Insurance Against Memory Leaks. Each of our servers has about 16 worker processes. We restart these worker processes once they reach a certain number of requests (e.g., 10,000 requests). Obviously, there should be jitter to ensure that the process is killed at a random time within some range of requests; you don't want all workers to suddenly go offline at the same time since this would effectively cause downtime.
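With gunicorn, this recycling-with-jitter policy is two settings (numbers here are the illustrative ones from above):

```python
# gunicorn.conf.py sketch: recycle each worker after ~10,000 requests, with
# jitter so the ~16 workers on a server don't all restart simultaneously.
max_requests = 10_000
max_requests_jitter = 2_000  # each worker adds a random offset in [0, 2000]
```

Other app servers (uWSGI, Puma, etc.) have equivalent knobs.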
- Common Sources of FE Performance Issues (React for Us, but Probably Somewhat Generalizable). For a fairly standard web application, the most common sources of FE performance issues tend to be:
- Complex form elements doing unnecessary work (e.g., re-rendering parts of the form that have not changed, or running validation code too eagerly on each character).
- Having too many elements in the DOM (e.g., populating a select with 5,000 options that all get rendered at once).
- Rendering too often (e.g., due to inefficient or buggy use of `useEffect`).
- Bugs that cause an unnecessary number of requests to be sent to the BE (also often due to issues with `useEffect`).
- Boot-up costs of a React app. This won't affect all companies but will affect some, so I'll mention it anyway. We have a mix of BE-rendered HTML pages and pure React pages. Moving from an HTML page to a React page causes the React app to need to boot up. Once that React app gets large, this can incur a noticeable performance hit.
- Relatedly: unnecessary work done on page load (e.g., indexing a frontend search) rather than doing this "just in time" when someone is about to search.
- Make It Really Easy to Run EXPLAIN on DB Queries. One of my favorite home-grown tools is something that lets me take an ORM query running locally in my REPL and then run the same one (with EXPLAIN) against production, all without actually having to deploy that code into production. It essentially spins up a read-only DB instance in production and runs EXPLAIN for that particular query there on a once-off basis.
- Know Your DB Indexing Options. There is a lot more to DB indexing than "is that column indexed or not?". This includes things like compound indexes, rewriting queries to make index-only scans possible, and being aware of the type of filtering being done to choose an index type or algorithm that matches it. For example, if it's a prefix-only search, a regular B-tree index will work; you'll need a trigram index if you want to match any character contained within.
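A few illustrative PostgreSQL examples of this decision tree (table and column names are hypothetical):

```python
# B-tree handles equality and, with pattern ops, prefix searches (LIKE 'abc%'):
BTREE = "CREATE INDEX ix_users_email ON users (email text_pattern_ops);"

# Matching a substring anywhere (LIKE '%abc%') needs a trigram index instead:
TRIGRAM = (
    "CREATE EXTENSION IF NOT EXISTS pg_trgm; "
    "CREATE INDEX ix_users_email_trgm ON users USING gin (email gin_trgm_ops);"
)

# A compound index can serve a filter and a sort together:
COMPOUND = "CREATE INDEX ix_listings_cat ON listings (category_id, created DESC);"
```

The point is that the shape of the query (prefix vs. substring, filter-plus-sort) determines the index type, not just which column is involved.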
- ORDER BY `id` Instead of `created` (etc.). Certain ORDER BY clauses in SQL yield the same results as others but are less efficient. For example, if you want to order by `created`, why not just order by `id`? The `created` field is immutable and will match the `id` column's ordering anyway... and the `id` column comes indexed for free on all tables.
- Delete Useless Data. Part of the performance problems I have faced is that some tables accumulated ten years of data even though it was really not necessary. Some examples are security logging of things like user agents and IP addresses, push notification records, and messages between users about a transaction nine years ago. We have a script to delete old data, and this makes these trimmed tables much more efficient to query since (obviously) their size is smaller.
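The retention logic can be sketched like this (a hypothetical helper, not our actual script):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention helper: compute the cutoff for a 90-day policy.
def retention_cutoff(days=90, now=None):
    now = now or datetime.now(timezone.utc)
    return now - timedelta(days=days)

# Deleting in small batches keeps transactions short. PostgreSQL has no
# DELETE ... LIMIT, so select the ids first, e.g.:
#   DELETE FROM security_log
#   WHERE id IN (SELECT id FROM security_log WHERE created < %s LIMIT 10000)
# ...looping until 0 rows are affected.
```

Batching matters: one giant DELETE over years of rows can hold locks and bloat the WAL, which is exactly the kind of long query the cron-job section warns about.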
Human
- All Programmers Should Be (Somewhat) Full-Stack Previously, there was an FE/BE/ops distinction on the team, and people stuck to their corners. We moved to making everyone have some full-stack capabilities (but, of course, with some specialty in one area or another). Yes, it took a year for people to retrain, but a) they appreciated the opportunity to expand their skill sets while still getting paid (not a single engineer quit in my five years), and b) once trained, it greatly reduces management coordination and the lead time to get features into users' hands, because an FE programmer can add a field to the DB instead of waiting on a BE engineer to do it.
- Don't Blame Individual Contributors for Errors If an issue manifests in production, this is usually the result of at least three cascading failures. Assigning blame to a single person (in particular, the person who wrote the code) is both disingenuous and short-sighted (since it causes people not to think about higher-leverage, big-picture causes):
- The person who committed the change didn't catch it.
- The person who reviewed the change didn't catch it.
- The person who QAed did not notice it.
- The person or system that decides who to assign the review to was wrong.
- Management assigned the wrong person to the ticket.
- Management is not hiring the right people.
- The software tests had a gap, causing this not to be detected.
- The software architecture allows for this class of errors or is too surprising.
- Create an Employee Expectations Doc For example, saying things like "we value if you solve problems in #bug-discussion before they reach the CTO's desk" will actually steer behavior in that direction. A motivated employee likes clarity on what good work looks like. They cannot read your mind.
- Get Comfortable Talking About Performance With Your Reports I absolutely dreaded this when I first became a manager because I don't like conflict. However, the reality is that it's better to be open with someone that they are not meeting expectations in XYZ ways than to slowly build up resentment and frustration and then give them notice. Everyone is better off if the performance problem can be discussed and addressed early. And, I think I've underestimated people's ability to improve under coaching.
- Tech Debt Fridays There is a tension in tech companies between the CEO's desire for new features and the engineering team's need to work on performance/security/maintainability—the kind of things that ensure the engine runs smoothly and that velocity can be kept high on future features. A reasonable compromise is that the CEO dictates what happens Monday through Thursday, and the CTO on Fridays. This is also a nice quality-of-life perk for engineers, since they get to nerd out on deep problems or create tooling that removes their pain points. And it segues nicely into the weekend.
- Linters Improve Outcomes Introduce linters for formatting and code rules. This is under "people" instead of code quality because, IMO, the primary value is to stop people from giving low-value code reviews by getting distracted by formatting issues instead of focusing on API design and risks.
- Prevent Long PRs Either in LOC or calendar time. There was a point when large features could take 3 months in a divergent PR and accumulate an enormous number of changes. These were impossible to review, difficult to keep "in sync" with master (conflicts), and—moreover—created too much deploy risk. Instead, we asked programmers to break things up into steps and merge into master sooner—ideally every day or two (but at most, once a week). This had the added benefit that more senior programmers could give feedback earlier, which is net more efficient since it prevents the dramatic rewrites that are required when a large batch of code takes a fundamentally wrong turn early in the process.
- Encourage Chattiness in Remote Companies There was no standup when I started, but we started with a daily post about what people worked on. With time, I would ask a lot of questions (even ones I knew the answers to) in order to show people that it's okay to ask questions. I also started doing things like sharing code changes, asking for people's opinions, etc. I would also ask reports to have calls with each other (in order to seed the relationship as much as to solve some technical problem together).
- Give Users a "Feature Requests" Page where they can request changes, discuss among themselves, and upvote/downvote. This provides a powerful, data-driven way for us to ensure we are giving users what they want. Additionally, it helps ensure that programmers have a backlog of bits and pieces to work from if the product team has not finished speccing some major feature.
- One Major Technical Push/Change at a Time When implementing a major organizational change (e.g., introducing e2e tests or linters), get it stable and reliable before starting something else. This is because if a system is not done well, the team won't trust it and won't use it. Also, there's a limit to how much change people can handle at once, so you are wise to take this into account.
- There Should Be at Least Two People That Can Fix Anything—this is the real point of reviews. Do not allow knowledge of a key system to be siloed with one person—otherwise, if they become unavailable (sick, etc.), it can be challenging.
Monitoring
- Treat Your Monitoring Layer as a First-Class Citizen in Terms of Software Testing: A gap here can be devastating because an error will be shipped to production and may be active without your knowledge for weeks. Therefore, do things like adding dummy endpoints to test that various conditions trigger the appropriate alerts, or that an exception in the background job queue will alert, etc.
- Monitor if Your Worker Queue Seems Clogged Up: We've had issues where the worker queue gets backed up by a few hours (e.g., if a highly inefficient new job is added). This causes a lot of problems since jobs that should run fast (e.g., sending the 2FA code via SMS to a user) might take hours. We therefore monitor (and alert) when our queue seems slow.
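The core of the check is simple: how long has the oldest pending job been waiting? (With rq, the enqueue time is available on each job; the names below are illustrative.)

```python
from datetime import datetime, timezone

# Sketch: measure queue lag as "how long has the oldest pending job waited?"
def queue_lag_seconds(oldest_enqueued_at, now=None):
    now = now or datetime.now(timezone.utc)
    return (now - oldest_enqueued_at).total_seconds()

# Alert (e.g., page the on-call) when lag exceeds a few minutes, since even
# "fast" jobs like sending a 2FA SMS are stuck behind the slow ones.
```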
- Get to Inbox Zero (or Close Enough) in Your Exception Reporting Software (e.g., Sentry): This is probably the single most important thing you can do to improve quality at your company. Once you get to this point, your team will treat each new Sentry exception as something actionable—and also be worried that they introduced a new error in whatever change they just shipped. You will need to extensively ignore unactionable or low-value errors (especially in browser JavaScript) to get to this point. The biggest improvements we had in this dimension came from: a) ignoring all errors coming from unsupported browsers (essentially anything over 5 years old); and b) ignoring bugs from browser extensions. We did this by injecting all our FE code with a unique identifier and then auto-ignoring any exception without this. Through this process—and fixing actual bugs—we went from 1,000 exceptions per hour at the start of my tenure to under 1 per hour.
- Have Ways to Monitor and Test Email Deliverability: At scale, some percentage of your users will not be able to receive emails (e.g., emails bouncing due to their inbox being full, or due to them accidentally making spam complaints against you, etc.). They will, of course, blame this on your tech. You need to have tooling and systems to identify these issues—and ideally to notify users through other means (on-site notifications or push notifications) when these issues start.
- Have Ways to Monitor and Test Push Notification Deliverability: Inevitably, some won't be delivered, and if you have over a million users, there will be a steady stream of users reporting bugs about why they did not receive notification XYZ. Therefore, you'll need to track and persist the delivery status/receipt of each message so you can debug issues. You'll also need mechanisms to send "tracer bullet" notifications to customers ad hoc to help them debug.
- Dead Man Switch Monitoring: Exceptions make noise and are easy to get notified about. However, you also must have some way to monitor that certain events occurred (e.g., that your cron system is still online, or that your DB backups are being taken daily). Some event not occurring can be disastrous—but will not loudly explode in the way that an exception for a crashing web request will. Look up "heartbeat monitors".
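A minimal sketch of the staleness check behind a heartbeat monitor (where each successful cron run records a timestamp in Redis, a DB row, or a hosted heartbeat service):

```python
from datetime import datetime, timedelta, timezone

def heartbeat_is_stale(last_beat, max_age=timedelta(hours=25), now=None):
    # 25h window for a daily job: a single missed run trips the alert
    now = now or datetime.now(timezone.utc)
    return (now - last_beat) > max_age
```

Hosted services (the "heartbeat monitors" mentioned above) invert this: your job pings them, and *they* alert when the ping stops arriving, which also covers the case where your whole alerting stack is down.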
- N+1 Queries Are the Single Biggest BE Performance Risk for Code Using ORMs: The problems typically get introduced on major endpoints that are incrementally expanded by many different authors (e.g., to add different filters/parameters). This makes the N+1 problem sporadic (and harder to notice) since it will only occur when a certain set of filters is applied AND when the user making the request has sufficient data for the loop to actually require more than 1 additional request. This greatly complicates debugging. To address this, make sure your APM is configured to detect N+1 queries and track what filter parameters were used, as well as which user was making the request. In general, training your team to flag these in code reviews is high-yield.
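A toy illustration of why the query count scales with the user's data (in Django, the real fix is `select_related`/`prefetch_related`; the fake DB below just counts queries):

```python
class CountingDB:
    def __init__(self):
        self.queries = 0

    def listings(self, n):
        self.queries += 1          # SELECT ... FROM listings
        return list(range(n))

    def seller_for(self, listing_id):
        self.queries += 1          # SELECT ... FROM sellers WHERE id = %s
        return f"seller-{listing_id}"

    def listings_with_sellers(self, n):
        self.queries += 1          # a single JOIN brings sellers along
        return [(i, f"seller-{i}") for i in range(n)]

naive = CountingDB()
_ = [naive.seller_for(l) for l in naive.listings(100)]   # 1 + 100 queries
joined = CountingDB()
_ = joined.listings_with_sellers(100)                    # 1 query
```

A user with 0 listings makes the naive version look fine (1 query), which is exactly why these bugs pass review and QA.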
- Be Mindful of Log Space Usage: Depending on your setup for logging (especially if it's relatively unsophisticated), you may need to monitor log space usage in order to prevent it from filling up and causing issues or cost explosions.
- You Need Tooling to Quickly Identify Which Resource Is Under Pressure When Performance Problems Strike: When faced with a performance problem—either at the system level or at an endpoint level—you should have metrics to know if the issue is:
- Server CPU
- Server memory
- SQL database (this should be broken down into slow queries vs. blocked on locks vs. too many queries)
- Redis database
- Network or other I/O
- Database connections to each database
- Be Especially Scared of Request Queuing: In judging overall system performance, request queuing (from your proxy server/main router to your backend server) is probably the single most important metric to be worried about. On average, this should be minuscule (under 2ms). If requests have to be queued up, this tends to cascade quickly into downtime.
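If your APM doesn't surface this, you can measure it yourself: many proxies (e.g., Heroku's router) stamp an `X-Request-Start` header at arrival time. A sketch, assuming the common `t=<milliseconds since epoch>` format (formats vary by proxy):

```python
import time

def queue_time_ms(x_request_start, now_ms=None):
    """How long did this request wait between the proxy and the worker?"""
    if now_ms is None:
        now_ms = time.time() * 1000
    start_ms = float(x_request_start.lstrip("t="))
    return max(0.0, now_ms - start_ms)  # clamp clock skew to zero
```

Record this per request in middleware and alert on the average; per the rule of thumb above, it should stay under ~2ms.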
- Tell Your Database What Kind of Process Was Connected: We started sending `application_name` to PostgreSQL to understand whether it was a web node (and which one), a worker, or a scheduler node that might be holding a lock for a long time. This is useful when working with `pg_stat_activity`. (And, in general, learning your database introspection tools, including for Redis etc., is high ROI.)
- Production Logs Need a Great UX: They need to be incredibly easy to use. The three must-haves for me are:
- There must be a user interface that will graph the number of times a term appeared in the logs over time. This helps identify clusters of issues, confirm fixes over time, etc.
- There must be some way to save common searches and share them with the team (e.g., "performance issues", "security issues", "rate limits").
- There must be a CLI for searching them (so we can script log parsing or give them to AI agents).
- Add Trip-Wire Alerts for Performance Regressions on Your 20 Most Important Endpoints: BE performance problems (i.e., endpoints becoming slow) tend to creep up on you due to incremental feature development that eventually makes the DB unable to effectively use existing indexes, caches, etc. This is hard to notice during programming or code review. The pragmatic solution here is to put trip-wire alerts on the most important endpoints and alert if they are a few standard deviations above normal. Essentially, this means your team will get info about performance degradations within hours of a deploy that introduced regression problems. Knowing what change caused issues saves 80% of the work.
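The trip-wire itself can be as simple as a standard-deviation check over a recent latency baseline (a sketch; the real thing would pull history from your APM):

```python
from statistics import mean, stdev

def is_regression(baseline_ms, current_ms, sigmas=3.0):
    """Alert when current latency sits `sigmas` std devs above the baseline."""
    return current_ms > mean(baseline_ms) + sigmas * stdev(baseline_ms)
```

With a baseline of [100, 110, 90, 105, 95] ms, a 150 ms reading trips the wire while 110 ms does not, so normal jitter stays quiet.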
- Monitor if Your Cache Gets Full: Like many web application frameworks, Django defaults to using one cache for both ephemeral data and stuff that should be "stickier" like login sessions or (via the rq extension) background jobs. Yes, this is configurable, but if you're in this default situation, it's good to get alerted if your cache gets to 80% full, lest you start losing data you don't want.
- Provide Snappy, Fun-to-Use Business Intelligence Graphs: Our business intelligence was initially a stats email with a CSV that came once a week. When we moved to a dashboard of interactive graphs, many more people kept an eye on key metrics. This allowed us to detect higher-level issues affecting the business (e.g., "Sign-ups dropped by 20%—I wonder if our CAPTCHA is too strict.").
Maintenance
- Avoid Low-Level Dependencies That Are Hard to Install We removed large external dependencies, especially ones that were difficult to install. An example is GDAL and the surrounding GIS stack, which Django's GIS framework relies on for latitude/longitude calculations. These were a nightmare to operationalize, and we realized we could achieve distance calculations that were both accurate and performant enough for our needs with standard Postgres extensions (`cube` and `earthdistance`). We could have saved 200 hours of engineering time by not installing them.
- Limit Low-Value-Add Libraries Similarly, we started asking people to stop including libraries if they could instead just extract the functionality they needed from the library and move it into our codebase; often, it was only 40 lines of code we needed. Not depending on some external library for minor stuff helps you avoid dependency hell when doing big upgrades to things like Python or Django. A big no-no is pulling in vendor libraries for API calls: if you want to call a third-party service, build your own API client and call it directly. That way, you avoid dependencies and get to have unified semantics around requests.
- Observability CLI Tooling Is Your Number One Force Multiplier for AI AI is great at reading text. Guess what else works well with text? CLI tools! Therefore, you can get enormous leverage from giving your AI CLI access to production logs, exception reports, a read-only version of your production database (to investigate bugs), your CI test failure system, and a read-only version of your REPL (or read-write version if local/staging).
- Distinguish Between Technical Problems That Are Stably "Bad" vs. Getting Worse the Longer You Leave Them This provides an honest way to prioritize which tech debt to pay off. For example, a bad data modeling decision can be "contagious" in the codebase; today's 15 areas in need of rewriting might become tomorrow's 150 if not acted upon.
- Having the Same Thing Implemented Multiple Times Is the Number One Sign of a Failure of Technical Leadership If your codebase has issues like multiple different clients for the same payments provider, competing FE components for something that should be reusable like a drawer or select box, or three different files for storing functions related to date times, something is very, very wrong. At scale, consistency and reuse of battle-tested code are the foundations of maintainability. Ultimately this is about making sure there are people who are accountable for overall architecture. This might end up being you if you cannot delegate it yet. One thing that helps here is creating obvious, searchable places for shared functionality (e.g., `common/datetime.py` or `common/geo.py`), flagging any theoretically reusable code for centralization during PRs (just to get people thinking along these lines), and creating a style guide about what is available where and giving it to AI agents.
- Squash Migrations We accumulated something like eight years of database migrations without any squashes. This made CI testing slow and brittle and became a mess to manage, so we essentially deleted all migrations, regenerated them from scratch, did some testing, and that more or less worked out.
- Create Custom Linters We have a specific set of colors, spacings, etc., in our CSS. It was a PITA to keep telling people to stick to them in reviews. The solution: a custom linter that knows these rules (and even suggests the closest "legal" alternative color, etc., as a fix). This helps both people and AI agents.
- Invest in Cutting-Edge Package Managers Take the time to replace/upgrade package managers. When we moved from `pip` to `uv`, we experienced both a reduction in dependency hell headaches and speed-ups across the board (developer experience, deploy script, staging, CI testing, etc.). Ditto for moving from `npm` (a few years ago) to `yarn`, and then from `yarn` to `yarn v4`.
- Prevent Inline Imports Some languages, like Python, have (IMO) a grave language flaw in that they allow inline imports in functions. This causes people to write code with circular dependencies. Eventually, this leads to both architectural rot and production exceptions when a crash occurs because an import has broken. If the import were at the file level, such a crash would be detected during development/CI because the program would not boot.
- Use Service Objects/Modules In web frameworks that emphasize data modeling based on nouns (e.g., `Product`, `User`, `Auction`), the tendency is to keep adding methods to these models, causing them to become large and gnarly. The simplest way to combat this is to encourage the use of objects that orchestrate some process, for example, `PlaceAuctionBidService`, `ProductBillingService`, etc. The idea is to move past nouns and think in terms of verbs.
- Polymorphic DB Relations Are Often (but Not Always) a Mistake The issue is that ORM code becomes exceedingly tricky to write here, and you lose out on DB protections like foreign key constraints. Just add nullable FKs for each related model and some query-level abstractions.
- "Default" DB Queries—e.g., Filters or Orderings—Are ALWAYS a Mistake Some of the most insidious—and repeating—classes of bugs I've seen in my career are where someone instructed the ORM to do something like "order by ID in reverse" or "filter all records to non-deleted ones". The problems here are twofold: when a codebase becomes large (e.g., 1 million LOC), you cannot expect programmer 36 on the team to know about this implicit behavior, and therefore issues occur. Secondly, even something like an ordering clause can create a mess with many ORMs due to them adding the ordering key to the SELECT clause of the generated SQL. This leads to incorrect counts in analytics or user-reporting code.
- All Infrastructure Should Be in Code When I first started, all 27 of our cron jobs were configured ad hoc on the production server. Most did not appear on our staging servers. And there was no concept of continuous cron jobs on our development machines. Moving these to be managed in code was a big win because it made all systems behave more similarly. The next step was to move the configuration of AWS/Heroku/CloudFlare/Sentry, etc., into code (via `terraform`). The advantage here is that it makes otherwise "hidden" configuration context about your system legible to your full team or to AI agents.
- Have Tooling to Easily Get a Critical Fix/Change Into All Branches Let's say an emergency hotfix on master accidentally breaks a unit test. Other people have since merged master into their feature branches and see the failing unit test too, making their branches go from green to red in CI. This creates inefficiency and confusion. If you have a command to `merge_master_into_all_affected_branches`, you can do damage control. This is also valuable outside of error conditions—e.g., the introduction of new lint rules or new deploy systems affecting staging servers, etc.
- Webhooks Are a Hotbed for Race Conditions Nothing against webhooks—they are a fantastic language-agnostic protocol. But be careful because sometimes an action in your code might trigger a webhook that arrives so quickly that the code which triggered the webhook might not have been saved to the DB yet (or not fully). This can lead to the webhook receiver instantiating "stale" state, then saving that to the DB, potentially clobbering good data. The main defenses are delaying webhook-causing actions until as late as possible (e.g., only on commit to the DB—see the standalone section) and limiting which fields are allowed to be updated (instead of calling a general `save` on the whole record).
- Understand When to Use `on_commit` vs. `on_save` The difference is that there are certain actions you only want to perform after you are SURE the data was written to the DB. For example, scheduling a background task dependent on that data (otherwise, it will not find it or will find a stale version) or notifying a user via email. By postponing these until after the DB commits, you avoid the situation where the DB fails to write—e.g., due to some DB constraint failing. Ask yourself: if a DB transaction rolls back and throws an error, will I regret that this code executed? If so, defer it to on commit.
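Django provides `transaction.on_commit` for exactly this. A toy model of its semantics (not Django's actual implementation): deferred actions run only if the transaction commits, and are silently dropped on rollback.

```python
class FakeTransaction:
    def __init__(self):
        self._deferred = []

    def on_commit(self, fn):
        self._deferred.append(fn)

    def commit(self):
        for fn in self._deferred:
            fn()                   # e.g., enqueue the email job only now
        self._deferred = []

    def rollback(self):
        self._deferred = []        # deferred actions never run

sent = []
tx = FakeTransaction()
tx.on_commit(lambda: sent.append("welcome-email"))
# nothing has been sent yet; the write could still fail a constraint
tx.commit()
```

In real Django code the equivalent would look like `transaction.on_commit(lambda: send_welcome_email.delay(user.id))` (task name hypothetical).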
- Use Git Hooks to Automatically Make the Environment Work for a Given Branch A feature branch might change the database schema or add a Python/JS dependency. You don't want your developers—or product/QA teams—to need to manually deal with this. Instead, the Git hooks should detect that a change was made and ensure the environment works for that (e.g., rolling back migrations from the outgoing branch and applying migrations on the incoming branch).
- Lean Extensively on CHECK Constraints It is immeasurably better to prevent problematic data from being written compared to detecting it after the fact. Previously, there was a cron job at our company to send alerts via email whenever corrupt data was found. All of these were replaced by putting in place sophisticated check constraints at the DB level. When I say sophisticated, I mean it. These may consider up to 16 different operations across various columns to ensure only valid states can exist.
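A small hypothetical example of a multi-column CHECK of this kind (our real constraints are considerably more involved):

```python
# Hypothetical DDL: a for-sale listing must have a price, and a listing has a
# sold_at timestamp exactly when its status is 'sold'.
LISTING_STATE_CHECK = """
ALTER TABLE listings ADD CONSTRAINT listing_valid_state CHECK (
    (status <> 'for_sale' OR price IS NOT NULL)
    AND ((status = 'sold') = (sold_at IS NOT NULL))
);
"""
```

Django's `models.CheckConstraint` can express the same conditions at the model layer, which keeps the constraint visible in code review.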
- Require a Factory for Every Database-Backed Record A factory is a function that creates an instantiated version of some record with all the necessary fields filled in, including related attributes. These are invaluable for seeding local machines with data, seeding staging servers, setting up unit tests, or doing performance tests. Think about the developer experience here by doing things like creating characters in your overall seeder (e.g., in our case, we have users such as `buyer`, `seller`, `admin`, `beta_user`, `pro_plan_user`, etc., that have rich data attached). Another pro tip: make sure you give rich, real images in your seed data—otherwise, QA will hate working with this data. And make your factories "stable" by seeding their randomness with a fixed value; this will ensure that running the seeder on your machine will work the same as on your colleague's machine.
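A sketch of the stable-factory idea (field names are illustrative; libraries like factory_boy give you this pattern with less boilerplate):

```python
import random

def user_factory(rng, role="buyer"):
    """Hypothetical factory: every field filled in, randomness fully seeded."""
    n = rng.randint(1000, 9999)
    return {
        "username": f"{role}_{n}",
        "email": f"{role}_{n}@example.com",
        "role": role,
        "avatar": f"seed_images/{role}.jpg",  # rich, real images for QA
    }

def seed_users(seed=42):
    # Seeding with a fixed value makes the seeder deterministic across machines
    rng = random.Random(seed)
    return [user_factory(rng, role) for role in ("buyer", "seller", "admin")]
```

Because the `Random` instance is constructed with a fixed seed, two runs of `seed_users()` produce identical data on any machine.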
Deploys
- Deploys Need to Be Fast—Really Fast This de-risks production errors because they can be quickly fixed with a follow-up deploy. The worst case for us is deploys to the Apple App Store. Deploys here could take up to a full week due to the app review process. To address this, we managed to implement an over-the-air system to deliver about 90% of updates to app code in under 1 minute. This completely transformed our willingness to make changes to our native app because deploys became less terrifying.
- Automate Decisions About What Should Happen During a Deploy For example, if a certain script to re-generate category data should be run post-deploy whenever a `category` file is modified, then the deploy script should know whether to run it. This seems painfully obvious, but because the team that writes the categories feature might be different from the team that maintains the deploy script, these issues can creep in. Distributing deploy responsibility—so as to make deploys multiple people's problem—creates the right incentives to have an all-in-one deploy script that "just works".
- Have a Post-Deploy Smoke Test Against Production This can ensure that a few key pages render correctly (e.g., using Playwright/Cypress, etc.). It is worth its weight in gold because it will immediately flag any serious issue to the deployer (who might have become complacent about checking manually if the last 150 deploys were boring). You can also ensure that the correct commit is live (by surfacing it in the HTML of each page and parsing this).
- Make Your Deploy Script Play Audio at the Key Moment ADHD traits—including hyperfocus (on something that isn't the deploy!)—are common in IT. For that reason, we make our deploy script play an audio file—e.g., a foghorn sound—at the moment the changes are about to go live. This makes sure the deployer does not forget to be on high alert at the appropriate time when action might be needed. It's too easy to get lost in a Slack conversation and forget—especially if your deploy script is still slow (ours was once 28 minutes...).
- Avoid Full-Fledged Maintenance Mode We have something like 400 database tables. If we were to use maintenance mode for every deploy that modified these, our users would flip. Thus, IMO, it's better to default to just letting the related endpoints fail for 15 seconds while the database migration happens than to take the entire website down for all customers for 30 seconds by putting the whole thing into maintenance mode. Obviously, there is a lot of nuance here: if the database migration needs an exclusive lock on a table that gets written to quickly, you simply will not be able to migrate at all without maintenance mode due to the inability to grab that lock. In my experience, this only affects the 3-5 main tables on the website (think a `users` table or `listings` table if you are a site like eBay). I am fully aware that there are ways to break up migrations into multi-step deploys to limit these issues; however, I do not feel the cost to team velocity is justified for the web applications I have worked on.
- Be Aware That Maintenance Mode Is Not Just the Web Process When you legitimately need maintenance mode for difficult DB schema changes, you don't want to just halt the web process: you will also need to stop cron jobs, worker queues, etc. These are often missed, leading to incidents during deploys (e.g., the DB migration taking longer than it should while the code is already on the later version that assumes the migration has already run).
- Split FE and BE Deploys Regarding making deploys faster for typical web applications (React/Django in our case), we got ours down from 28 minutes to under 2 minutes through a combination of:
- Splitting up FE and BE deploys and parallelizing most work (with a syncing step at the end).
- Aggressively optimizing FE compile time (moving to Rust-based systems, disabling heavy, low-value optimizations).
- Running this optimized JS compile on fast machines (our own MacBook M-series laptops were over 10x faster than whatever low-end commodity hardware Heroku uses for compilation).
- Using `aws` CLI power tools instead of half-baked library implementations in your web framework of choice. `aws s3 sync` is likely to blow any of these systems away in speed—and be more stable.
- Postponing any work that can happen post-deploy until then (e.g., uploading source maps to Sentry needs to happen only once the critical changes are live).
- Trimming out unneeded parts of the repo when sent to the deploy server (less data transfer/smaller "slug")—for example, we remove all front-end and native-app code from our monorepo when sending to production servers.
- Deleting "dead" FE assets from S3 buckets. If you have hundreds of thousands of files from old deploys lingering in an S3 bucket, commands like `aws s3 sync` can spend up to 3 minutes figuring out what to sync! We avoid race conditions (ChunkLoadErrors) by ensuring that any asset deleted from S3 is still available in our CDN for another 4 hours. This way, any FE client that has not yet done a page reload and still requires the "old" assets will keep functioning for four hours, during which time it is likely to do a page request.
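The pruning decision itself can be kept as a pure, testable function. A sketch (the four-hour window mirrors the CDN grace period described above; treating "recently modified" keys as still in-grace is a simplification of relying on CDN retention):

```python
from datetime import datetime, timedelta, timezone

# Grace period during which a no-longer-referenced asset must survive,
# so open browser tabs can still load it (assumption: 4 hours).
GRACE = timedelta(hours=4)

def keys_to_delete(bucket_keys, live_keys, now=None):
    """Decide which S3 keys are safe to delete.

    bucket_keys: iterable of (key, last_modified) pairs from the bucket.
    live_keys: set of keys referenced by the current deploy.
    """
    now = now or datetime.now(timezone.utc)
    doomed = []
    for key, last_modified in bucket_keys:
        if key in live_keys:
            continue  # still referenced by the current build
        if now - last_modified < GRACE:
            continue  # too fresh: an open tab may still need it
        doomed.append(key)
    return doomed
```

The real cleanup job would feed this from a bucket listing (e.g., boto3) and issue the deletes in batches.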
- Break Up PRs Smaller PRs equal smaller deploy risk. So we tend to request any PR over 500 lines of non-boilerplate code be split up (not always—it's a judgment thing at the discretion of the code reviewer).
- Mention Every Config Change (Including on Third-Party Services) in an `#operations` Channel in Slack—including leaving a link—so that every team member is aware of what changed... and can quickly undo it if problems are later discovered and the deployer is not available.
- Be Prepared for Problems with Cache Systems During Deploys Other than DB migrations, the most common source of deploy surprises is issues with the cache system (Redis in our case). Two steps that helped here were:
- Move all cache keys to a single file so that any cache item can be quickly located and cleaned out with a single CLI command.
- Make some cache keys (i.e., for data that changes shape often) dependent on the hash of the schema shape—this prevents accidental use of a "stale" cache shape without the need to micro-manage things.
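A sketch of that schema-hash idea (the `{field: type}` description is an assumption; any stable, serializable description of the cached shape works):

```python
import hashlib
import json

def schema_hash(model_fields):
    """Derive a short hash from the shape of the cached data.

    model_fields: a mapping like {"id": "int", "title": "str"}.
    If the shape changes, the hash changes, so stale entries under the
    old key are simply never read again.
    """
    blob = json.dumps(model_fields, sort_keys=True)
    return hashlib.sha1(blob.encode()).hexdigest()[:8]

def cache_key(name, model_fields):
    return f"{name}:v{schema_hash(model_fields)}"
```

Sorting the keys makes the hash insensitive to field ordering, so only real shape changes bust the cache.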
- Create a Team "Playbook" of How to Handle Deploy Issues This contains short instructions on things like killing locked DB connections, clearing cache surgically (without flushing the cache and logging everyone out), or forcing FE assets to be cache-busted by appending a new version URL, etc. Having people trained on this decreases the length of downtime during incidents.
- Don't Deploy Anything Risky on a Friday If there are bugs, they might not manifest until Friday evening or Saturday when you will have low/no staff availability, or risk ruining the plans of some engineers. Small hotfixes or changes to internal tools like admin panels are fine.
- Give Many People the Power to Deploy When I started, only the CTO was allowed to deploy. We changed this to allow five people to deploy, via training and a playbook for handling the 15 most common issues that crop up during deploys. People in "deploy training" are only allowed to deploy if a trained deployer is also online.
Frontend
- Generate a typed FE SDK From the BE API We replaced all ad-hoc front-end API code with a fully typed SDK automatically generated from the Django DRF BE (using Spectacular). This made it easy for programmers to move to TS and also to detect regressions in the FE when the BE changed.
- Front-End Costs Are Often More Driven by Request Count Than Bandwidth From a front-end asset cost-saving perspective, a few learnings:
- Be more aggressive with caching—use content hash caching (and get to the point where you trust it), then cache assets with cache-control headers indefinitely. Simply using ETags is too conservative and will lead to large numbers of requests to the CDN. This can double or even triple the costs per month.
- Similarly, choose your "chunk size" wisely for JS (and other) bundles that are lazily loaded (e.g., sub-routes on React pages).
- This is obvious, but nevertheless it has a way of getting out of hand: only serve image sizes that are sufficient for the screen size and the place where the image will render. Do things like use `srcset` to avoid giving retina images to devices that do not support them, etc.
- And lazy-load images—do not request images below the fold if the user is still above the fold—this wastes too much bandwidth.
- Speed Up the Loop from Editor-Save to Seeing Visual Changes At a million lines of code, this can get very slow. Some things that worked for us: moving from Webpack to a Rust-based equivalent (Rspack), removing Docker (Docker has a fair bit of overhead when working with many small files—this is possibly solvable with serious Docker-fu, but it was easier for us to remove it), using "cheap" source maps locally, using a persistent compile cache locally, using incremental compiling, disabling minification locally (since local bandwidth is a non-issue), and ensuring `node_modules` is excluded from the "watch list" during polling.
- React Context Is Trash Use a state library—e.g., `zustand`.
- Google Translate (and Other Tools) Can Break Your Code They do this by rewriting the HTML, which can introduce load-order (and other) errors. We started detecting when these tools are used and appending this information to Sentry.
- Consider How to Migrate Browser-Local Data If you use systems such as `localStorage`, browsers might be storing data that is incompatible with changes to your JavaScript code. You will therefore need some concept of front-end data migrations to handle this smoothly.
Debugging
- Put the Current User's Username in Every Log Entry This can be done by making the current request a thread-local variable (or an equivalent, safer system depending on your server setup). The default in logs is just an IP address, but this is not good enough because: a) IP addresses rotate a lot, b) it introduces indirection for your debugging, and c) it obscures the story if someone is using multiple devices.
- Communicate to End-Users When They Have Internet Connectivity Issues Many problems are transient and occur due to users simply having network problems. By having your UX update and say "internet connection lost" (generally) or "poor network signal" (e.g., during video streaming), you can avoid a lot of unactionable customer service load (that will end up causing your engineers to do fruitless bug hunts).
- Always Tell Your User What Went Wrong I.e., by surfacing a detailed error name and message. For backend errors, you could show a toast. Typically, this will be a lump of JSON with validation errors, often containing enough information for the user to fix the issue themselves without needing to contact your customer support. Even if the user doesn't understand the error, having them write in with that information will make the customer support process more efficient. For front-end errors (e.g., React page crashes), you can show a user-friendly error page that gives the error message and asks them to screenshot the page and send it to your customer service team for help if the error persists.
- Intercept Third-Party/Vendor Errors and Make Them Comprehensible for Your Users For example, in payment code, instead of surfacing "Payment failed: PayPal error code: 1245", you could show "Payment failed due to an issue with your card's expiry date". This improves the user experience and reduces customer service (CS) load, which can become significant at scale even due to "expected" errors.
- Be Able to Hijack User Accounts Often, a bug depends on the exact data or settings that a particular user has in their account. Therefore, having a way to "hijack" the account that reported the bug can help greatly with reproducing it.
- Have a Way to Emulate the Specific Device/Browser There are services such as BrowserStack that let you simulate specific hardware devices, OSs, or browser versions. We have issues with our native apps that only occur on specific Samsung devices, for example, and would never have been able to debug them without such a service.
- Log Intention to Create Important Side-Effects on External Systems Before Taking Action - Especially for Billing Before we carry out credit card charges or refunds, we log the intent—e.g., "bill user XYZ $100 for Y". Think of this as a simple Write-Ahead Log. This has been a life-saver during a few incidents.
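A minimal sketch of the pattern (the in-memory list stands in for a durable log table or log stream; names are illustrative):

```python
# Write-ahead-style intent log for risky external side effects.
INTENT_LOG = []  # stand-in for a durable, append-only log

def with_intent_log(description, action):
    """Record what we are about to do, do it, then record the outcome."""
    INTENT_LOG.append(("INTENT", description))
    try:
        result = action()
    except Exception as exc:
        INTENT_LOG.append(("FAILED", description, repr(exc)))
        raise
    INTENT_LOG.append(("DONE", description))
    return result
```

During an incident, an "INTENT" with no matching "DONE"/"FAILED" tells you exactly which charges are in an unknown state and need manual reconciliation.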
- Ensure Hard-to-Reach Corners Also Have Exception Monitoring For instance, we initially lacked coverage in our Amazon Lambdas for image conversion—as well as in some cron jobs. This masked some issues for far too long.
- Have a Way to Interweave FE and BE Logs in a Centralized Bucket We did this by providing an endpoint that FE code can hit to add an entry to the "unified" logs. This helps us connect FE actions with BE actions for complex debugging (e.g., live video streaming issues, image upload issues). We do not enable this for all FE logging due to noise.
- Fail Loudly During Errors This is classic advice: don't swallow errors and let issues silently fail. Swallowing errors prevents the original developer, the QA team, and your production exception monitoring from noticing them at the earliest possible point (which is when damage is minimized). So, in general, raise errors and let them bubble up to the user space, sending the error details to your team. You still might want to fail gracefully—but only after sending the error details to logs, Sentry, etc. A pattern I like is: "crash if an error occurs locally; fail gracefully but report it if it happens in production."
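That last pattern is a few lines. A sketch, assuming a `debug` flag and a `report()` stand-in for your Sentry/logging call:

```python
# "Crash locally; report and degrade gracefully in production."
REPORTED = []

def report(exc):
    # Stand-in for e.g. sentry_sdk.capture_exception(exc)
    REPORTED.append(repr(exc))

def fail_loudly(exc, debug):
    if debug:
        raise exc   # local dev: the developer sees the error immediately
    report(exc)     # production: record it first...
    return None     # ...then degrade gracefully
```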
- Gather as Much Data as Humanly Possible Around Exceptions This seems obvious, but in my experience, it is criminally underused. Tools like Sentry and New Relic provide many plugins to send additional info like stack traces, JS logs, source-map conversions, commit hashes, request information, SQL queries, Redis queries, and all sorts of other data. Checking out what's cutting-edge every year pays dividends.
- Create Tools to Quickly Identify All Accounts Affected by a Bug The naive approach to fixing a bug is just repairing the code causing the crash (etc.). However, this may not be enough. Some bugs (e.g., in a method like `conclude_auction`) can lead to lost user intentions or corrupt data. Therefore, a clean-up operation to repair affected accounts might need to be carried out. You can get this info programmatically from Sentry events (etc.).
- If Using an ORM, Make It Trivial to See All SQL Ideally, you should be able to toggle a single ENV variable to interweave SQL into logs or show it in some FE dashboard.
Caching
- Caching Is a Last Resort Caching code is monstrously hard to get right. All things equal, caching should be a last resort after you've exhausted the possibility of fixing poor data modelling decisions, writing more efficient SQL, adding DB indexes, removing N+1 queries, and creating more efficient data structures and algorithms. Over the years, I've found that some of our most severe performance problems were caused by naive/non-scalable attempts to cache that initially worked but eventually became problematic. The general pattern is that a programmer adds something to speed up a (relatively) small set of records (e.g., they might cache the whole set in memory under a single key)... however, this becomes disastrous if that set becomes a few orders of magnitude larger because the time to read from the Redis cache (which is single threaded) can block the whole system for that time.
- Plan for Generic/Public Data vs Personalized Data Any sort of personalization by user in a given endpoint will make it dramatically harder to cache. Let's say you have an endpoint to get all real-estate listings that match a given filter. If that endpoint also returns a boolean for each listing about whether or not the current user has liked that listing, then a simple system that caches the contents of the endpoint based on the URL alone will not be possible. Instead, you'll need a lower-level cache just for the "public data" component and then merge that with freshly fetched data for the user-specific data. Figuring this out on your two to three heaviest endpoints is high ROI. The lowest hanging fruit could be to cache hard at the endpoint level if someone is not logged in.
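A sketch of that split (all names are illustrative): cache the expensive shared payload per URL, and merge in the cheap per-user bit fresh on every request:

```python
# Public payload cached per URL; per-user "liked" flags fetched fresh.
PUBLIC_CACHE = {}

def get_listings(url, fetch_public, fetch_liked_ids, user_id):
    if url not in PUBLIC_CACHE:
        PUBLIC_CACHE[url] = fetch_public(url)  # expensive, shared by all
    listings = PUBLIC_CACHE[url]
    liked = fetch_liked_ids(user_id)           # cheap, user-specific
    return [{**l, "liked": l["id"] in liked} for l in listings]
```

For anonymous users, `fetch_liked_ids` degenerates to an empty set and the whole response becomes cacheable at the endpoint (or CDN) level.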
- Localize in FE Where Possible For example, localization of timestamps (potentially currencies) can sometimes be done there. This helps you cache "higher up" in the cache hierarchy (e.g., full endpoint or even at the CDN level).
- Avoid Your Users Getting Confused by Your Caching A typical web user does not understand caching and will often write in and report a bug because their edit is not immediately visible. One good pragmatic hack is 1) not serving from cache when the current user is looking at their own data. For example, say they are editing a listing and you cache these pages for 10 minutes. You can save a lot of customer-service headache by skipping the cache if `current_user` is the owner of the data. However, this is not always practical in a performant way. The second pragmatic piece of advice is 2) warning users after they click save that "your data may take up to X minutes to be visible on public pages due to caching".
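The owner-bypass hack is tiny. A sketch with illustrative names:

```python
# Owners always get a fresh render of their own data; everyone else may
# be served the cached copy.
def render_listing(listing, current_user_id, cached_html, render_fresh):
    if listing["owner_id"] == current_user_id:
        return render_fresh(listing)  # owners never see stale copies
    return cached_html if cached_html is not None else render_fresh(listing)
```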
- N+1 Can Happen in the Caching Layer Too It's possible to have N+1 issues with Redis too. A typical example is if you track server-side render counts for a page of 24 listings being delivered to the FE, you might hit Redis 24 times if naive. Another is just reading 24 items per page from cache rather than using a pipeline or "read-many" series of functions.
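A sketch of the batched fix, with a fake client standing in for Redis so the round-trip difference is visible:

```python
# FakeCache mimics a Redis client's get/mget semantics and counts
# network round trips (the thing N+1 multiplies).
class FakeCache:
    def __init__(self, data):
        self.data, self.round_trips = data, 0

    def get(self, key):
        self.round_trips += 1
        return self.data.get(key)

    def mget(self, keys):
        self.round_trips += 1
        return [self.data.get(k) for k in keys]

def read_naive(cache, keys):
    return [cache.get(k) for k in keys]  # N round trips

def read_batched(cache, keys):
    return cache.mget(keys)              # 1 round trip
```

With a real redis-py client, the same idea applies via `mget` or a `pipeline()` block.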
- Cache in Local RAM - Not Just Redis Sometimes it's better to cache things in local RAM rather than Redis because that one millisecond extra can add up a lot. So the idea here is that a read might be done on the first request of the data from the database; then it's cached. For example, the result of a function is cached. One example might be the list of current admin users. Another example is our category hierarchy. Be aware though that in a multi-process server, this cached data might be created N times (e.g., 14 times if you have 14 processes). This leads to RAM bloat and -- more dangerously -- potential cache stampede issues when rebooting (e.g., during deploys). Storing these items in RAM shared by all processes helps here (but can be tricky).
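In Python, a process-local memo can be as simple as `functools.lru_cache` (the admin-IDs function is illustrative):

```python
import functools

# Each web process pays the DB cost once, then serves from RAM with no
# Redis round trip at all.
CALLS = []  # instrumentation so the test can see how often the body runs

@functools.lru_cache(maxsize=1)
def get_admin_ids():
    CALLS.append(1)  # stands in for the real DB query
    return frozenset({1, 7, 42})
```

The caveats above still apply: with N processes this runs N times, and you need a way to bust it on change (here, `get_admin_ids.cache_clear()`).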
- Lean on the "Updated" Timestamp Using the `updated` timestamp of a record as part of a cache key (e.g., for an ETag) is an effective way to avoid excess computation. For example, we have an endpoint `/me` that gets all the data for the current user—settings, unseen message counts, etc. This data is strewn across about 5 data models. When any of these is changed, we bump the `updated` timestamp of the `UserProfile` object and thus prevent a stale cache. Considering we fetch this record on every request when logged in, we skip a DB query.
- Anticipate the Thundering Herd Problem Be mindful of the thundering herd problem. For example, if you cache some heavily used resource—but the act of writing to/setting that cache is slow (e.g., 5s)—then naive expiration of that cache might cause (e.g.) 200 requests for that resource to come in within a few seconds. If each of these takes 5s, it can cause serious performance degradation or even downtime for a few minutes. This problem is most apparent for resources that are cached post-deploy (e.g., where the cache key depends on the commit hash).
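The `updated`-timestamp trick reduces to folding that timestamp into the cache key, so any bump automatically invalidates every derived entry (sketch; the key layout is an assumption):

```python
# Cache key / ETag for a per-user payload, derived from the owning
# record's `updated` timestamp (passed here as a Unix timestamp).
def me_cache_key(user_id, profile_updated_ts):
    return f"me:{user_id}:{int(profile_updated_ts)}"
```

No explicit invalidation code is needed: writes bump the timestamp, reads compute a new key, and the old entry simply ages out of Redis via its TTL.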
- Use Contextual Information to Calculate Optimal TTLs Example: we have code for auction data. The ideal TTL is the auction end-time since we know the cache also gets manually expired if a new bid comes in (etc.). This beats hard-coding some value, like 10m/1hr or whatever.
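A sketch of deriving the TTL from context, clamped to sane bounds (the six-hour cap is an arbitrary illustration):

```python
from datetime import datetime, timedelta, timezone

def auction_ttl(auction_ends_at, now=None, max_ttl=timedelta(hours=6)):
    """TTL for cached auction data: cache until the auction ends,
    but never longer than max_ttl (manual expiry on new bids still
    applies, as described above)."""
    now = now or datetime.now(timezone.utc)
    remaining = auction_ends_at - now
    if remaining <= timedelta(0):
        return timedelta(0)  # already ended: don't cache at all
    return min(remaining, max_ttl)
```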
- Make Sure a Single Setting Works to Enable/Disable Caching Locally Large code-bases tend to accumulate various caching systems: endpoint-level caching of full responses in Redis, browser caching using HTTP headers, and in-RAM caching of key data. Providing a single variable `CACHE_ENABLED` that toggles all of these helps when debugging challenging cache issues.
Testing and QA
- End-to-End (Browser/App) Tests > Integration Tests (That Hit the DB) > Unit Tests While it's useful to write tests for individual backend endpoints, etc., you will get 10x more bang for your buck with E2E tests that exercise these endpoints in conjunction with the frontend. Your users only care that your full system works, so this should be your priority too. But do not end-to-end test everything: E2E tests come at a high cost in maintainability and CI slowdown. For me, the compromise is having one E2E test for the happy path of every major feature (e.g., "can sign up," "can buy something," "can create and edit a listing"), and then covering the details by fanning out with a layer of integration or unit tests, which are faster and easier to maintain.
- Find Ways to Automatically Add Tests to Cookie-Cutter Features For example, in Django, we have a system to test about 300 of our admin pages that list/edit objects. This works by parsing the name of the admin page, finding the associated factory, creating a DB record with it, and then checking if the admin page with that record works. I am considering rolling out the same approach as a fallback for any CRUD API endpoint without an existing test.
- Have Many Staging Servers Having just a single one will create coordination overhead and slow down team velocity. A good rule of thumb is one staging server for every two programmers, or having a way to spin up arbitrarily many at will.
- Make Sure Your Product Team Can Self-Serve Staging a PR We added a GitHub Actions (GHA) flow so they can stage any PR at will instead of asking us, which also cuts down on coordination overhead.
- Ensure Rich Artifacts Are Available for Test Failures on CI By this, I mean things like screenshots of the failing page, trace files for tools like Playwright (that include network requests, JS errors, etc.), and logs at various levels (BE, Playwright, FE, SQL). If you have all this data, you can ask AI tools to download it all via CLI tools and use it to debug failing tests.
- Create Obvious Conventions Linking Features to Their Associated Tests As code scales, it can be difficult to know what tests need to be run to ensure a particular feature was not broken. While it's true that the CI suite will run all tests, this adds significant lag and slows down feedback loops. Therefore, one convention I've adopted to help people/AI discover related tests pre-CI is mentioning "Tested via: X" in a code comment above the Python class, React page, or BE route as an index. I would like to explore more sophisticated ideas here (e.g., maintaining some general index that can be referred to).
- Isolate Tests Completely to Avoid Brittleness Brittleness damages team morale for testing and is annoying. Stamping it out is harder than it seems. Some dimensions that have tripped me up over the years:
- Anything involving time, especially the current time, can cause trouble. Mock or freeze time.
- Ensure that all sources of randomness are removed by using a fixed seed. This may need to happen at multiple levels or in multiple libraries.
- Never allow external web requests to occur in (non-E2E) tests. Instead, mock them using libraries like VCR (which allow you to re-record them as needed).
- For the caching layer, provide a separate Redis DB or namespace for testing (and auto-clear it before each test).
- Do not allow your test system to read even a single variable from your terminal environment (or your `.env` file). This will lead to "but it worked on my machine" situations, or worse yet, accidental actions against real resources (like email servers or payment systems).
- Use a separate DB for E2E versus unit tests (due to the typical inability of E2E tests to work inside DB transactions).
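Two of the isolation dimensions above (time and randomness) reduce to injecting a clock and a seeded RNG instead of reaching for globals. A sketch with illustrative names:

```python
import random
from datetime import datetime, timezone

# Tests pass a frozen "now" and a seeded Random; production code passes
# nothing and gets real time / real randomness.
FROZEN_NOW = datetime(2024, 1, 1, tzinfo=timezone.utc)

def make_token(rng, now=None):
    """Generate a date-stamped token; deterministic when both the clock
    and the RNG are injected."""
    now = now or datetime.now(timezone.utc)
    return f"{now:%Y%m%d}-{rng.randint(0, 9999):04d}"
```

Libraries like freezegun and pytest's seeding plugins do the same thing at a larger scale, but the principle is identical: no hidden sources of nondeterminism.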
- Only Mock External Objects Mocks are a useful tool IF AND ONLY IF used in the way they were intended: you should only mock external objects. Do not mock internal methods of the current object. Ideally, too, the mock should be used for "expensive" external objects. The idea is that these external objects already have tests for the interfaces being mocked, and thus are covered elsewhere. That being said, if you can rewrite the test to work with the original object without serious tradeoffs in test execution speed (or side effects), then this is much better.
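A sketch of the rule in practice: mock the (hypothetical) external payment gateway, assert on the behavior of the object under test, and verify the outbound call:

```python
from unittest import mock

class Checkout:
    """Object under test; `gateway` is the external collaborator."""
    def __init__(self, gateway):
        self.gateway = gateway

    def pay(self, user_id, cents):
        charge_id = self.gateway.charge(user_id, cents)  # external call
        return {"ok": True, "charge_id": charge_id}

# Only the external boundary is mocked; Checkout's internals run for real.
gateway = mock.Mock()
gateway.charge.return_value = "ch_123"
result = Checkout(gateway).pay(7, 500)
```

Note that nothing inside `Checkout` is patched, so the test still exercises the real logic of the object under test.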
- Use CI to Check for Annoyances as Well as Errors Our CI system also checks if: a) warnings were introduced into various systems (such as test suites or package manager install steps), b) the SDK is out of sync with the BE endpoints, and c) migrations were not generated in Django, etc.
Security
- Understanding What an Attacker Can Potentially Gain from Attacking Your System Is the Starting Point. Most attackers have motivations (often financial). Thinking in these terms is the most direct path to understanding your security needs and what trade-offs are going to be acceptable. There is no one-size-fits-all approach. For us, attackers take over legitimate accounts in order to scam other users, create thousands of new accounts in order to spam other users with phishing links, or DDoS us.
- Each of these had a targeted core fix:
- Account takeover protections: Mandatory 2FA for all users (prevents leaked passwords from being used on our site to take over accounts).
- Phishing spam: Captchas on sign-up, preventing sign-ups from suspicious usernames (e.g., calling themselves "[COMPANY NAME] Customer Service"), and spam filters flagging messages from users that contain phishing language, external URLs, and so on.
- DDoS protections: Cloudflare anti-bot protection + internal nginx rate limits (but make exceptions for external webhooks!) + caching core endpoints for damage control if they do get hammered.
- Invest in Systems to Detect Multiple Accounts from the Same User. This is usually correlated with trouble—whether they are full-on hacking attempts or other bad behavior, such as creating fake reviews. IP addresses are weak alone, but if you detect VPN usage and prevent certain actions (like sign-ups) while a VPN is active, that can give you a better chance at capturing the real IP address. Dropping cookies in the user's browsers seems like it could never work, but in my experience, it is surprisingly effective because scammers often slip up here. All in all, browser fingerprinting is probably the cutting edge.
- Native Apps Are More Secure. In general, it's a lot harder for an attacker to carry out an attack using an iPhone or iOS app. There are more ways to identify someone in these circumstances. Additionally, scripting an attack using a mobile app would require much higher technical sophistication compared to attacking a normal website. Therefore, if security is a big deal, I would recommend any company require the native app for people creating an account. I believe it's no surprise that the latest generation of online banking services is native-app first.
- Log Any "Security Event". For example: a successful login attempt (along with the IP address), a failed login attempt, account creation, password reset, primary email change, 2FA device changes, social account connection, and any other relevant events. This is invaluable for damage control in the event of an attack.
- Rate-Limit Sign-Ups and Login-Related Actions. This should be independent of simple, general rate limits at nginx. For example, the rate limit for logins should look at multiple factors, such as the IP address or the username that people are attempting to access. Additionally, there should probably be system-level rate limits (e.g., enforced at the Cloudflare level). There's no harm in redundancy here.
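A minimal in-memory sliding-window sketch keyed on both dimensions (the limits are illustrative; in production this state would live in Redis or at the edge):

```python
import time
from collections import defaultdict, deque

# 5 attempts per 5 minutes, tracked per source IP AND per target
# username, so neither spraying usernames from one IP nor hammering one
# account from many IPs slips through.
WINDOW, LIMIT = 300, 5
_attempts = defaultdict(deque)

def allow_login_attempt(ip, username, now=None):
    now = time.time() if now is None else now
    keys = (("ip", ip), ("user", username))
    for key in keys:
        q = _attempts[key]
        while q and now - q[0] > WINDOW:
            q.popleft()  # drop attempts outside the window
        if len(q) >= LIMIT:
            return False  # one exhausted dimension blocks the attempt
    for key in keys:
        _attempts[key].append(now)
    return True
```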
- Give Customer Support Teams Admin Tools to Tweak Security Rules. For example: triggers in user profiles or user-sent messages that will cause alerts or auto-bans, lists of domain names to whitelist for allowing links in messages, lists of disallowed usernames, etc. Changing something in the admin panel while under attack will allow you to react many times faster than asking the engineering team to change and deploy something.
Thanks for reading. If you're interested in chatting about these topics or working together, reach out to jack.kinsella@gmail.com.