The Set data structure saved me from a world of pain

When we were taught the Set theory at school, I wondered why they dedicated an entire chapter in math to teach us about how to group items into categories. Isn’t it obvious that entities that have similar properties get clubbed together?

More than a decade later I realised that when they teach you something, maybe the entire topic is not as important as that one small nugget of information about this topic that might come in handy down the line.

For me, this nugget of information about Set theory was this.

‍

Why did this seemingly insignificant fact become the key to my mental sanity? And more importantly how did it help me prevent my codebase from turning into spaghetti? ‍

The problem

I am building a customer support tool where the CS agent can solve customer queries raised from multiple channels of communication.

So by its very design we required a UI element which indicated how many conversations are left unattended for each of the channels.

Open conversations in every support channel

‍

As you may have guessed, this information is being retrieved on every refresh of the browser page and therefore running a count query through the conversation database for each support channel on every refresh is not very efficient.

Sure, it may not be a big deal at first but if we scale to even half of what we plan to hit with this product, a query like this will bite me in the rear sooner than I would want it to.

The obvious solution

The solution that occurred to me was to cache the count of open conversations for each support channel. This made sense at first but it quickly turned into a nightmare when I gave it a little more thought.

Okay so I plan to save a count of the open conversations in the cache, which means every time a conversation is opened I increment the count and every time a conversation is closed, I decrement the count?

Alright that doesn’t sound so bad but how do I trigger this increment/decrement operation? I suppose I could do it whenever a conversation gets updated in the database.

And the problems begin

But wait, I can’t trigger it every single time a conversation is updated, what if a conversation is open but some other data in the conversation gets updated while the status is still open?

Do I have to write an if-else condition to run the increment/decrement operation only when status changes from open to closed or from closed to open?

What happens if a conversation update finished running but the increment/decrement operation failed for some reason or vice versa? How do I re-run an update without corrupting the data?

In case you haven’t guessed where I was heading with this, here is a nice visual aid.

The elegant solution

So how did Set theory get me out of this mess? What if I maintained a Set of IDs that belong to open conversations for each support channel?

All I have to do now is this: On every conversation update, if status is open, I add that conversation ID to the Set and I remove it from the Set if status is closed.

This Set would be stored on the RAM of course and yes it would take up more memory than storing just a single count for every support channel but all we are doing is storing IDs and not the entire data object so it would hardly cost much in terms of memory.

How does this help with the problem? Well, if you want the count of open conversations just get the number of elements in the Set, the cardinal number of the Set if you will.

Why is this elegant?

Because a Set cannot have duplicates. It does not matter how many times I add an item to a Set, a unique conversation ID can be added to it only once and similarly once a remove operation is performed, subsequent remove operations on the same conversation ID won’t have an affect on the data. The data has become idempotent.

This means that I don’t need an if-else condition when a conversation update operation is performed. All I need to do is perform an add/remove operation based on the status of the conversation post update.

It doesn’t matter if there is an explicit status update or not. Every update can trigger this logic and the data remains incorrupt. If any of the operations fail midway, the conversation update can just be re-run.

The code now looks like this

‍
How to implement a Set data structure in your application

The most obvious way to do this is to use a JSON dictionary data structure like this for every support channel.

If a conversation needs to be added to the Set, update the dictionary with the key as the conversation id and value as true or 1. For remove operations first check if the conversation id is already in the dictionary and remove It from the dictionary. Do nothing if it’s not already present.

This dictionary can now be stored in a variable in memory and used throughout the application. To get the count of open conversations for a support channel, just get the number of keys present in the dictionary.

Skip the trouble and use Redis instead

Clearly somebody else figured out the importance of a Set as a data structure way before I did, because Redis - an elegant In memory database used for caching data already offers the Set data structure along with operations that let you add, remove and find number of items in a Set.

I would highly recommend using this in your application instead of maintaining a dictionary in a variable because the good folks at Redis already solve a host of other problems that might occur as a result of maintaining an in-memory cache.

‍

Conclusion

Perhaps there is some merit in revisiting some of the concepts we learnt as children and see if they are still relevant and can somehow be applied to the real world.

At the very least this would give us closure and instil hope that maybe everything we were taught that seemed useless at the time has a certain aspect to it that might come in handy at some point or the other in our lives and we just have to keep our eyes, ears and most importantly our minds open.