Sankalp Jonna

Postgres triggers saved me from a world of pain

JUNE 25, 2026

Almost every product has a number on the screen that has to be both right and instant. In our app it is the little open-ticket counts in the sidebar, the ones sitting next to every channel, every teammate, and every tag. Your support team lives by these numbers. They glance at them all day, and the moment one of them assigns a ticket or closes a conversation, they expect the count to move.

The naive way to keep a number like this correct is wonderfully simple: every time someone loads the page, count. Run a COUNT over the open conversations and show the result. It is one query, it is always right, and for a long time it is completely fine.

Imagine you are a doorman standing outside a club, and every few seconds someone walks up and asks you how many people are inside. The honest way to answer is to walk in, count every single head, walk back out, and tell them. It works perfectly when the club has twelve people in it. It is a disaster when the club is packed and a hundred people are asking. That COUNT on every page load is you, walking the room, over and over, for thousands of clubs at once.

This is the story of how I stopped recounting the room. It is also, quietly, a story about doing something I genuinely could not have done a year ago.

I knew the answer on day one

The obvious way to stop recounting the room every time

Here is the thing I want to be honest about up front: the clever fix was never a mystery to me. Anyone who has thought about this for thirty seconds arrives at the same place. You do not recount the whole room to answer "how many people are inside". You keep a running tally. You give the doorman a little clicker, and every time someone walks in he clicks it up by one, and every time someone walks out he clicks it down by one. When a person asks for the count, the doorman just reads the number off the clicker. No walking the room. No counting heads.

In database terms, instead of computing COUNT from scratch on every read, you keep the count maintained, and you nudge it by +1 or -1 whenever something changes. Reads become free. The number is always sitting there, ready.

So if the idea is this obvious, why did our app spend years recounting the room? Because the idea was never the hard part. Reliably capturing every single click was the hard part. And that turns out to be a much deeper hole than it looks.

Why capturing every change is the hard part

The doorman with a clicker, and all the ways he miscounts

In the application layer, the natural tool for "run a bit of my code whenever a record changes" is a Django signal. You listen for a conversation being saved, and in that handler you nudge the count. On paper, this is the clicker. In practice, this is a doorman who cannot quite be trusted with the clicker, and every flaw in him is a different way for the count to silently go wrong.

The first problem is the phantom click. Your signal fires when the conversation is saved, but a save inside a transaction is not final. The transaction can still roll back. So the doorman sees someone step toward the door, clicks them in, and then that person changes their mind and walks back out. You have now counted somebody who never actually entered. The count drifts up, and nothing tells you it happened.

The second problem is the lost update. Two people walk through the door at the exact same instant. Two requests, running concurrently, both read the count as 40, both add one, both write back 41. The real answer is 42. One of those entries just vanished into thin air. To stop this you would reach for a lock, so that only one doorman can touch the tally at a time. But now everyone is queuing up to write to the same row, and you have introduced contention on your database, which is precisely the load you were trying to get rid of. Every safeguard you add to make the count correct costs you the very thing you were trying to save.

The third problem is the worst, because it is invisible. It is the side door. Django signals only fire when you save a model instance one at a time. The moment anyone runs a bulk update, a .update() on a queryset, a bulk close, a raw SQL statement, no signal fires at all. A whole tour group walks in through a side door the doorman is not even watching, and the count does not budge. And bulk operations are not some rare edge case. Closing a hundred tickets at once is a Tuesday.

So this is the trap. The clicker idea is correct and obvious. But the application-layer version of the doorman misses rollbacks, races against himself, and is blind to half the doors in the building. To make him trustworthy you have to fight rollbacks, fight concurrency, and somehow watch every side door, all without piling on the database load that was the whole reason you wanted a tally in the first place.

Why I kept recounting instead

The clean fix lived one floor down, in a layer I had never opened

There is a place where all of these problems simply disappear, and I knew where it was. It is not in the application layer with the doorman standing outside. It is inside the room, bound to the door frame itself, down in the database.

If the clicker lives in the database, the phantom click cannot happen, because the click and the entry are part of the same transaction. They commit together or they roll back together. The lost update cannot happen, because the database is already the thing serializing these changes. And the side door cannot happen, because the database sees every row that changes no matter how it changed, whether it was one save or a bulk update of ten thousand rows.

I knew this. I had always known this. The fix lived in Postgres triggers, in PL/pgSQL, in transition tables and isolation levels. And that was a layer I had simply never operated in.

My whole career has been in the application layer. Django models, serializers, views, querysets. The database, to me, was something I queried. It was a well-behaved box I sent SQL into and got rows back from. It was emphatically not something I programmed. Writing a trigger that would run on every write to one of our busiest tables, on a production database serving thousands of merchants, and getting the isolation and the edge cases right enough to trust it with numbers people stare at all day? The distance between "I know triggers exist" and "I will confidently ship one to prod" was enormous. The cost of crossing it alone was more than the problem seemed to justify.

So I did what I had always done. I gave up on the tally and went back to recounting the room. I kept the full COUNT, and I tried to make the COUNT cheaper.

The first thing I tried was the old mistake

Still recounting the room, just trying to recount it faster

This is the part of the story I find most interesting in hindsight, because it shows how much the limits of your past quietly shape the moves you make in the present.

When I came back to this problem, my very first instinct was not the tally. It was to make the recount faster. I rewrote the count path to use leaner raw SQL instead of the ORM, convinced that a tighter query was the win. It benchmarked beautifully on a single account. I shipped it.

And on production, across every brand at once, it fell over. A query that looks cheap for one account becomes a real, sustained load when it is firing for thousands of them on every page load. The database load became unpredictable, and I reverted the change the same day.

I was still thinking like the version of me who could not build the tally. I was optimising the recount because, somewhere in the back of my head, the recount was the only tool I believed I had. The failure was the thing that finally shook me out of it. The lesson was not "write a faster query". The lesson was that I should not be recounting the room at all. The tally was always the right answer. I just, for the first time, had a way to actually capture it.

The clicker that cannot miss

A Postgres trigger fires inside the transaction, so the click and the change are one

This is where this project became possible for me, and I want to be precise about why. I did not type "build me a real-time counting system" and walk away. What changed is that I could finally operate one full layer below where I had ever been comfortable, because I had something alongside me that knew that layer cold. I could describe what I wanted in plain language, have a real back-and-forth about transition tables and isolation levels, and understand every line before it went anywhere near production. Claude did not replace the understanding. It collapsed the distance to it.

And in the database, the doorman finally becomes trustworthy. A Postgres trigger is a small function that the database runs automatically whenever a row changes. The crucial part is that it runs inside the same transaction as the change that triggered it. So when a conversation is opened, closed, assigned, or tagged, the trigger fires, and because it sees both the old and the new version of the row, it knows exactly which way to click: +1 or -1.

Every problem from earlier evaporates:

  • No phantom clicks. If the change rolls back, the trigger's work rolls back with it. The doorman literally cannot count someone who did not finish walking in.
  • No lost updates. The database is already the thing that serializes these writes. We are no longer fighting concurrency from the outside; we are riding the mechanism that exists to handle it.
  • No side door. A statement-level trigger fires on the change regardless of how the rows changed. One save or a bulk update of ten thousand rows, the trigger sees them all.

A simplified version of the trigger looks like this. When a conversation's state changes, work out whether it just became "open" or just stopped being "open", and record the nudge:

CREATE FUNCTION capture_open_count_change() RETURNS trigger AS $$
BEGIN
  -- a conversation that just became open: +1
  IF NEW.status = OPEN AND OLD.status <> OPEN THEN
    INSERT INTO open_count_outbox (account_id, bucket, delta)
    VALUES (NEW.account_id, NEW.channel_id, +1);

  -- a conversation that just stopped being open: -1
  ELSIF OLD.status = OPEN AND NEW.status <> OPEN THEN
    INSERT INTO open_count_outbox (account_id, bucket, delta)
    VALUES (NEW.account_id, OLD.channel_id, -1);
  END IF;

  RETURN NULL;
END;
$$ LANGUAGE plpgsql;

Note: the best part is the price. I was braced for triggers to slow down our writes, because they run on the hot path of every change. So we measured it carefully on production, and the added load was too small to even detect. It makes sense once you say it out loud: the database is already touching that row to change it. Clicking a counter while it is right there is almost free. The doorman is not slowing down the door, because the doorman is the door now.

The notepad: an outbox

The trigger does not touch the tally. It scribbles the change on a notepad.

You might have noticed the trigger above does not update the count itself. It writes a row into a table called open_count_outbox. This is deliberate, and it is the second key idea.

If the trigger tried to update the live tally directly, we would be right back in lost-update territory, with every write fighting over the same counter. So the trigger does the smallest, cheapest thing it possibly can: it scribbles a note on a pad and moves on. One tiny, append-only row per change. Account A, this bucket, plus one. Account B, that bucket, minus one. Appending a row is something databases are extremely good at, and there is no shared counter to contend over.

The notepad fills up with these little signed notes, in order, faster than anyone is reading them. The door stays fast. Nobody waits.

The person reading the notepad: the drainer

The full path: a change becomes a click, a note, a tally, and finally a live number on the screen

The notepad is useless until someone reads it. That someone is a small background worker we called the drainer.

The drainer's whole job is to read the notes off the pad in order and apply them to the actual running tally, which lives in Redis where reads are instant. It walks the outbox from where it last left off, adds up the deltas, and updates the numbers. It keeps a watermark of how far it has read, so it never applies the same note twice and so it can pick up exactly where it stopped if it ever restarts. There is a single drainer doing this, which sounds like a bottleneck but is actually the point: one reader applying notes in order is what makes the whole thing impossible to double-count.

Then, once a tally moves, the drainer pushes the new number straight to the browser over a websocket. This is the part the support team feels. A teammate assigns a ticket, and the count in everyone else's sidebar ticks down a heartbeat later, without anyone refreshing anything. The number stopped being something we compute when you ask, and became something that is simply always current.

So the full path is: a conversation changes, a trigger turns that into a signed click, the click lands on the notepad, the drainer reads the notepad and updates the tally, and the tally is pushed live to the screen. Each piece does one small thing, and the expensive thing, recounting the room, happens nowhere in it.

But is it correct? A simulation that hunts for drift

Keep a running tally, but keep recounting in the shadows, just to check it never lies

Here is the uncomfortable truth about a running tally: a count that is fast but subtly wrong is far more dangerous than a count that is slow but right. The entire scheme rests on a single claim, that the deltas, applied faithfully, always equal the truth. If that claim is wrong even occasionally, the numbers rot quietly and you do not find out until a customer does.

I was not willing to take that on faith, and this is the part of the project I am proudest of. Before trusting the tally with anything, I built a simulation that ran the real drainer against live production traffic in a sandbox, and then, on a schedule, did the thing the tally was supposed to replace. It walked the room. It ran a full, honest COUNT from scratch and compared it, number for number, against what the tally said. It was looking for a single mismatch.

Then I let it soak. Not for an hour, for days. It ran clean through 260 of these checks without a discrepancy, and a continuous version later ran across roughly thirty thousand cycles with zero mismatches. That is the kind of evidence that lets you actually sleep.

And it earned its keep, because it caught a real bug. Early on, the periodic recount would occasionally disagree with the tally by a hair. The cause was subtle: the recount was reading the table without a consistent snapshot, so it was seeing the room during a change rather than at one frozen instant. It was taking a panorama when it needed a photograph. The fix was to run that recount at a stricter isolation level so it saw exactly one moment in time. I would not have found that by reading the code. The simulation found it by doing the boring, diligent thing thousands of times until the one-in-a-thousand showed itself. This is, again, where having Claude alongside mattered: standing up a faithful copy of the drainer, a sandbox harness, and an adversarial check-it-constantly loop was the kind of work that would have taken me ages alone, and I could interrogate every correctness assumption as I went.

But is it fast enough? Benchmarking the drainer

The lag was not the drainer being slow. It was something else holding the door.

Correct is one half. The other half is whether the drainer can actually keep up. If notes pile onto the notepad faster than the drainer reads them, the tally falls behind reality, and "real-time" quietly becomes "a few minutes late".

So I instrumented the drainer to report three things on every cycle. How long a cycle took. How many unread notes were sitting on the pad, the backlog. And how stale the oldest unread note was, the lag. For the most part these were boring in the best way: cycles measured in single-digit milliseconds, a backlog near zero, lag you could barely see.

But every so often, the lag would spike. And chasing down why led somewhere I never expected, somewhere that turned out to be more valuable than the counts themselves.

The drainer was not slow during those spikes. It was being held up. Certain long-running database transactions, elsewhere in the system, entirely unrelated to counts, were holding a database snapshot open, and that quietly stalls work waiting behind them. When I dug into those transactions, the root cause was a real and somewhat alarming pattern: some of them were staying open far too long because they were making network and I/O calls inside the transaction. They had opened a transaction, then sat there waiting on the outside world before closing it, holding the door for everyone else the whole time.

We would never have gone looking for this. It was not causing any obvious symptom on its own. It took building a real-time pipeline that was sensitive to exactly this kind of stall to surface a class of slow transaction that had been hiding in the system all along. The benchmarking did not just prove the drainer was fast. It handed us a thread to pull on that we did not know was there.

The payoff

Less load on the database, and a number that is now instant

When the dust settled, the scoreboard read like this.

The count that used to cost a database aggregate on every single page load now costs a single read from Redis. One of the heavier count endpoints went from around 693 milliseconds to around 112. The load that the old counting queries put on the database effectively dropped to zero, because we stopped asking the database to count at all. And the numbers became genuinely real-time, moving the instant a teammate acts rather than on your next refresh.

That is the rare outcome where you do not have to trade one thing for another. We took load off the database and made the product better at the same time. The database is doing less work, and the support team is looking at numbers that are both faster and more alive than before.

What actually changed for me

The idea was never the blocker. The layer was.

I want to end on the part that I think matters more than the triggers.

A year ago, this project does not happen. And the reason is not that I could not think of the tally, because, as I said at the start, the tally was obvious from day one. The reason is that the gap between "I have application-layer experience" and "I am going to put PL/pgSQL triggers on a production database and bet thousands of merchants' sidebars on them" was simply too wide for me to cross casually. So I never crossed it. I stayed where I was comfortable and recounted the room.

What changed is not that an AI wrote the code for me. It is that the gap collapsed. I could step one layer below my own ceiling, into the database itself, and still move carefully, verify obsessively, and ship something I genuinely understand line by line. The understanding is still mine. The diligence, the simulation, the benchmarking, all of that is still the work. But the layer stopped being off-limits.

That is the thing I would tell another developer who has spent their career in one layer. The interesting shift is not "AI writes code". It is that you can now reach below your current ceiling, safely, and pull off the obvious solution you always knew was right but never had a way to actually build.

Closing notes

If any of this is new to you and you want to go deeper, these are the things worth reading:

Fun fact

The internal name for this whole pipeline, the thing the team says when they want me to pull up the numbers, is "the scoreboard". Which means that after all of this, the running tally, the triggers, the notepad, the drainer, the simulation, I built a doorman with a very expensive clicker and then named him after the sport he was always secretly playing.

Have a nice day and happy coding!