
What Happens Behind a Push-Notification

A story of managing services and millions of records.

Notifications are the heart of media companies. Once journalists write the news, we want to make sure users are always up to date with the latest and most important stories about what’s happening around them. Yet users should not be disturbed by irrelevant news outside their interests. On the operational side, the team should be able to monitor the status of each notification they send: whether it was sent, how users engaged with it, and who received it.

To send notifications instantly and efficiently, the right data structures and system architecture are the primary considerations. Product concerns such as future features, an instant sending process, and real-time userbase updates are equally important. To achieve those, my team and I decided to make the services independent of each other. As we know, microservices come with costs such as operational complexity and harder consistency. In spite of that, we engineers love to take on that challenge.

Creating The Userbase

At my current employer, we group notifications by function. There are five notification types, and each type has its own way of collecting its userbase. For example, in the segmented-channel notification, the userbase is generated from users who have shown interest in certain channels or entities. A channel is generally a simple taxonomy such as a category, whereas entities are generated by machine learning models. If you read or interact with an article in the Sports channel, your user ID will be marked as interested in the “Sports” channel. Likewise, if you occasionally read or interact with (comment on, like) articles about Barcelona FC, you will be marked as interested in the “Barcelona FC” entity. Whenever a journalist sends a notification that is either about sports or tagged with the entity “Barcelona FC”, you’ll simply get the notification.
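A minimal sketch of how a segmented-channel userbase could be derived, assuming interactions arrive as hypothetical `(user_id, channel, entities)` tuples. The real pipeline runs as batch jobs over warehouse tables rather than in-memory dicts; this only illustrates the “channel OR entity” matching described above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical interaction record: (user_id, channel, entities found in the article)
Interaction = Tuple[int, str, List[str]]


def build_interest_index(
    interactions: Iterable[Interaction],
) -> Tuple[Dict[str, Set[int]], Dict[str, Set[int]]]:
    """Map each channel and each entity to the set of users who interacted with it."""
    by_channel: Dict[str, Set[int]] = defaultdict(set)
    by_entity: Dict[str, Set[int]] = defaultdict(set)
    for user_id, channel, entities in interactions:
        by_channel[channel].add(user_id)
        for entity in entities:
            by_entity[entity].add(user_id)
    return by_channel, by_entity


def segmented_userbase(
    by_channel: Dict[str, Set[int]],
    by_entity: Dict[str, Set[int]],
    channel: str,
    entities: List[str],
) -> Set[int]:
    """Users interested in the notification's channel OR in any of its entities."""
    users = set(by_channel.get(channel, set()))
    for entity in entities:
        users |= by_entity.get(entity, set())
    return users
```

A set union keeps each user once even when they match both the channel and an entity.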

Other notification types generally collect their userbase through a subscribe button or member invitations. To keep these userbases up to date, we have many CronJobs that run periodically in our Kubernetes cluster. Because of the large number of users and their rapid activity, we chose an eventually consistent strategy: the userbase is built and updated outside the OLTP database to avoid putting unnecessary load on it.

Basically, the userbase generator job aggregates data from various sources, sometimes from different databases, to create the userbase. In most cases, I only need to join several tables, transform the result into CSV files, and land them on Google Cloud Storage before loading them into BigQuery. I used to use streaming inserts, but then I realized that loading CSV files into BigQuery is free, while streaming inserts cost $0.010 per 200 MB. When I load a large amount of data, a while-true loop combined with a generator function is my go-to strategy. Why? Because I don’t have to load all the data into memory at once.
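The while-true-plus-generator pattern can be sketched with the standard DB-API `fetchmany`. This uses an in-memory SQLite database and hypothetical `users`/`interests` tables as stand-ins for the real sources; each yielded CSV string would become one file on Google Cloud Storage.

```python
import csv
import io
import sqlite3
from typing import Iterator, List, Sequence, Tuple

DEMO_QUERY = (
    "SELECT u.id, i.channel FROM users u "
    "JOIN interests i ON i.user_id = u.id ORDER BY u.id"
)


def fetch_in_batches(
    conn: sqlite3.Connection, query: str, batch_size: int = 1000
) -> Iterator[List[Tuple]]:
    """Yield query results batch by batch instead of loading them all at once."""
    cursor = conn.execute(query)
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break  # result set exhausted
        yield rows


def to_csv_chunks(batches: Iterator[List[Tuple]], header: Sequence[str]) -> Iterator[str]:
    """Turn each batch into a standalone CSV document (one file per batch)."""
    for rows in batches:
        buffer = io.StringIO()
        writer = csv.writer(buffer)
        writer.writerow(header)
        writer.writerows(rows)
        yield buffer.getvalue()


def make_demo_db(n_users: int) -> sqlite3.Connection:
    """Hypothetical demo schema: every user is interested in 'Sports'."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY)")
    conn.execute("CREATE TABLE interests (user_id INTEGER, channel TEXT)")
    conn.executemany("INSERT INTO users (id) VALUES (?)", [(i,) for i in range(n_users)])
    conn.executemany(
        "INSERT INTO interests VALUES (?, ?)", [(i, "Sports") for i in range(n_users)]
    )
    return conn
```

Only one batch of rows lives in memory at a time; the caller can upload each chunk as it is produced.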

Filtering Out and Deduplicating Users

Before we send a notification, we first need to check whether the user has already read the post. Therefore, there’s a filter-out mechanism that reads data from Google Cloud Bigtable to check it. Additionally, we need to make sure each user receives only one notification per device. Sometimes one device has many OneSignal device IDs (probably because the user reinstalled the app, which generated a new device ID, or simply logged in on many devices), so we make sure to keep only the latest active device.
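A minimal sketch of both checks, assuming a hypothetical `Device` record with `last_active` and `active` fields and a set of user IDs who already read the post (in production that lookup goes to Bigtable, here it is a plain set):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, Set


@dataclass(frozen=True)
class Device:
    # Hypothetical shape of a device registration record
    user_id: str
    device_id: str
    last_active: datetime
    active: bool


def latest_active_device(devices: Iterable[Device]) -> Dict[str, str]:
    """Keep only the most recently active, still-active device ID per user."""
    latest: Dict[str, Device] = {}
    for device in devices:
        if not device.active:
            continue  # skip inactive or stale registrations
        current = latest.get(device.user_id)
        if current is None or device.last_active > current.last_active:
            latest[device.user_id] = device
    return {user_id: d.device_id for user_id, d in latest.items()}


def recipients(devices: Iterable[Device], read_user_ids: Set[str]) -> Dict[str, str]:
    """Drop users who already read the post, then map each remaining user to one device."""
    return {
        user_id: device_id
        for user_id, device_id in latest_active_device(devices).items()
        if user_id not in read_user_ids
    }
```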

Previously, the filter process took around 5 minutes, including the query time and I/O against Google Cloud Bigtable. After some improvements, I managed to reduce the filter-out time to ~50 seconds, thanks to Celery, which now runs the filter-out process in parallel.
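The chunk-and-parallelize idea can be sketched with the standard library. Note this swaps in `concurrent.futures.ThreadPoolExecutor` for Celery so the example runs without a broker; `has_read` is a hypothetical callable standing in for the Bigtable lookup. Threads fit here because the work is I/O-bound.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence, size: int) -> Iterator[Sequence]:
    """Split a list of user IDs into fixed-size chunks."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def filter_unread(user_ids: Sequence, has_read: Callable[[object], bool]) -> List:
    """One unit of work: keep users who have not read the post."""
    return [u for u in user_ids if not has_read(u)]


def parallel_filter(
    user_ids: Sequence,
    has_read: Callable[[object], bool],
    chunk_size: int = 1000,
    workers: int = 8,
) -> List:
    """Run filter_unread on every chunk concurrently and merge the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(filter_unread, chunk, has_read)
            for chunk in chunked(user_ids, chunk_size)
        ]
        unread: List = []
        for future in futures:  # submission order, so output order matches input
            unread.extend(future.result())
    return unread
```

With Celery, each chunk would instead become a task (for example inside a `group`), so the work can spread across worker machines rather than threads in one process.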

After filtering out users, as per the product requirements, we need to remove duplicate device IDs. The `set` data structure comes to the rescue: in Python, I simply convert the list to a set to remove duplicates.
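A quick illustration with made-up IDs. A `set` drops duplicates but does not keep the original order; if order matters (for example, to chunk requests deterministically), `dict.fromkeys` deduplicates while preserving first-seen order.

```python
device_ids = ["a1", "b2", "a1", "c3", "b2"]

# Order-agnostic deduplication
unique_ids = set(device_ids)

# Order-preserving deduplication (dicts keep insertion order since Python 3.7)
ordered_unique = list(dict.fromkeys(device_ids))
```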

And now, we’re ready to send the notification.

Sending the Notification

Basically, sending a notification is quite easy. But is it really that easy? The first consideration is technical: we must be able to detect a bug and tell in which pipeline it occurred, since our services are segregated and mostly communicate through a message broker (Google Cloud Pub/Sub). The second is operational: the notification statistics. The operations team must be able to know who sent the notification, through which notification channel, who received it, who failed to receive it, how many users opened or ignored it, and so on.
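One way to make a bug traceable across segregated services is to stamp every message with the same notification ID plus the pipeline stage that emitted it. A sketch under that assumption; the attribute names (`notification_id`, `stage`, `emitted_at`) are hypothetical. Pub/Sub attribute values must be strings, which this respects.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, Tuple


def build_message(
    notification_id: str, stage: str, payload: Dict[str, Any]
) -> Tuple[bytes, Dict[str, str]]:
    """Return (data, attributes) for a Pub/Sub message that can be traced end to end."""
    data = json.dumps(payload).encode("utf-8")
    attributes = {
        "notification_id": notification_id,  # same ID across every pipeline stage
        "stage": stage,  # which service emitted this message
        "emitted_at": datetime.now(timezone.utc).isoformat(),
    }
    return data, attributes
```

With the `google-cloud-pubsub` client, the pair could be published as `publisher.publish(topic_path, data, **attributes)`; logging the same attributes in each service lets you follow one notification through the pipelines.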

Before sending a notification request to OneSignal (the push service my current employer uses), we split the list of device IDs into chunks, given how large the list is and how much OneSignal can process per request. While sending, I fetch the users for each device ID and store them in Redis as a set, to be queried later by the notification history service.
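A sketch of the chunking and the Redis bookkeeping. It assumes OneSignal’s legacy `include_player_ids` field, which its documentation has capped at 2,000 IDs per request; check the current API reference before relying on either. The Redis key format is hypothetical, and `InMemorySetStore` is a stand-in that mimics only redis-py’s `sadd`/`smembers`, so the example runs without a server.

```python
from typing import Dict, Iterator, List, Set

MAX_IDS_PER_REQUEST = 2000  # assumption: verify against OneSignal's current limit


def build_payloads(
    app_id: str, device_ids: List[str], heading: str, message: str,
    chunk_size: int = MAX_IDS_PER_REQUEST,
) -> Iterator[dict]:
    """Yield one OneSignal request body per chunk of device IDs."""
    for start in range(0, len(device_ids), chunk_size):
        yield {
            "app_id": app_id,
            "include_player_ids": device_ids[start:start + chunk_size],
            "headings": {"en": heading},
            "contents": {"en": message},
        }


class InMemorySetStore:
    """Stand-in for a redis.Redis client (only SADD and SMEMBERS)."""

    def __init__(self) -> None:
        self._data: Dict[str, Set[str]] = {}

    def sadd(self, name: str, *values: str) -> int:
        bucket = self._data.setdefault(name, set())
        before = len(bucket)
        bucket.update(values)
        return len(bucket) - before  # number of newly added members, like SADD

    def smembers(self, name: str) -> Set[str]:
        return set(self._data.get(name, set()))


def store_recipients(client, notification_id: str, user_ids: List[str]) -> int:
    """Record who was targeted, for the notification history service."""
    if not user_ids:
        return 0  # SADD needs at least one member
    return client.sadd(f"notification:{notification_id}:users", *user_ids)
```

With a real `redis.Redis` client the same `store_recipients` call works; note that `smembers` returns bytes unless the client is created with `decode_responses=True`.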

Once the request has been sent, OneSignal’s response tells us which device IDs failed. We collect the failed and successful device IDs and transport them to our data warehouse on BigQuery for analytics. To store data in BigQuery efficiently, I prefer to write all the data to Google Cloud Storage as CSV first, then load it into BigQuery in one go.
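A sketch of turning a response into CSV rows, assuming the legacy OneSignal response shape in which invalid IDs appear under `errors["invalid_player_ids"]` (treat this as an assumption; `errors` can also come back as a plain list of messages, which is handled by marking nothing as failed). Column names are hypothetical.

```python
import csv
import io
from typing import List, Tuple


def classify_results(sent_ids: List[str], response: dict) -> List[Tuple[str, str]]:
    """Label each device ID as 'sent' or 'failed' based on the API response."""
    errors = response.get("errors") or {}
    failed = set(errors.get("invalid_player_ids", [])) if isinstance(errors, dict) else set()
    return [(device_id, "failed" if device_id in failed else "sent") for device_id in sent_ids]


def rows_to_csv(notification_id: str, rows: List[Tuple[str, str]]) -> str:
    """Serialize results as CSV with a header row, ready to upload to GCS."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["notification_id", "device_id", "status"])
    for device_id, status in rows:
        writer.writerow([notification_id, device_id, status])
    return buffer.getvalue()
```

Each CSV could then be uploaded with `google-cloud-storage` (`bucket.blob(name).upload_from_string(csv_text, content_type="text/csv")`) and loaded in one batch with `bigquery.Client().load_table_from_uri(uri, table_id, job_config=bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.CSV, skip_leading_rows=1))`, where `uri` can be a wildcard such as a hypothetical `gs://your-bucket/notification_results/*.csv`.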

At last, we send a message to the operations team’s Slack channel to let them know the notification has been sent successfully.
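Posting to Slack can be as small as one JSON request to an incoming webhook. This sketch only builds the request with the standard library; the webhook URL and message format are hypothetical, and the actual send would be `urllib.request.urlopen(req)`.

```python
import json
import urllib.request


def summary_text(notification_id: str, sent: int, failed: int) -> str:
    """Hypothetical one-line summary for the channel."""
    return f"Notification {notification_id} sent: {sent} delivered, {failed} failed."


def build_slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Build a POST to a Slack incoming webhook with a JSON {"text": ...} body."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The same helper can serve the journalist-facing status message at the end of the pipeline, with only the webhook URL and text changing.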

Notification History

In the notification history service, to write millions of documents I use the Elasticsearch bulk helper to write in parallel. Ten thousand documents can be written in roughly one second. Since we don’t store the documents forever, I partition the index name by day, such as notification_history_2020_11_30, so a CronJob can easily delete any index older than a certain number of days.
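The daily-index scheme can be sketched in plain Python: one function names the index for a day, one produces bulk actions lazily, and one decides which indices a cleanup job should drop. The document fields and retention period are hypothetical.

```python
from datetime import date, datetime, timedelta
from typing import Iterable, Iterator, List

INDEX_PREFIX = "notification_history_"


def index_for(day: date) -> str:
    """Daily index name, e.g. notification_history_2020_11_30."""
    return f"{INDEX_PREFIX}{day:%Y_%m_%d}"


def build_actions(notification_id: str, user_ids: Iterable[str], day: date) -> Iterator[dict]:
    """Lazily produce bulk actions so millions of docs never sit in memory at once."""
    index = index_for(day)
    for user_id in user_ids:
        yield {"_index": index, "_source": {"notification_id": notification_id, "user_id": user_id}}


def expired_indices(index_names: Iterable[str], today: date, retention_days: int) -> List[str]:
    """Return daily indices older than the retention window; ignore unrelated names."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in index_names:
        if not name.startswith(INDEX_PREFIX):
            continue
        try:
            day = datetime.strptime(name[len(INDEX_PREFIX):], "%Y_%m_%d").date()
        except ValueError:
            continue  # not a date-suffixed index
        if day < cutoff:
            expired.append(name)
    return expired
```

With the official Python client, the actions could be written via `elasticsearch.helpers.parallel_bulk(es, build_actions(...), thread_count=4)`, which is itself lazy and must be iterated to actually send anything. The cleanup CronJob could list indices matching `notification_history_*` and call `es.indices.delete(index=name)` for each expired one.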

Finally

The last step is to send a notification log to the journalist’s Slack channel to let them know whether their push notification was sent successfully.

So, that’s it. It’s a long process to get a notification sent. Here’s what I’ve learned so far:

  • Use a generator when you want to load and process a large amount of data. Don’t load all the data into memory at once.
  • Store the data in storage that matches your needs. If your data is relatively small and read-heavy, store it in SQL; if you only need to store it temporarily, use in-memory storage like Redis; if you need high write throughput, use a column-oriented database, etc.
  • Process in parallel when possible (using background workers like Celery, multiprocessing, or anything similar).
  • Choose the correct data structure and leverage it.
  • If you have a large amount of data, don’t forget to partition it. For example, in my case, I partition the Elasticsearch documents by day.
  • Fewer joins and denormalized columns tend to give you better query performance.

That’s all. Thanks for reading. Don’t forget to follow me on Medium and LinkedIn.