How to Build a Reliable Kafka Data Processing Pipeline, Focusing on Contention, Uptime and Latency
10 Feb 2024
System Overview
- The company sent around 13 billion push notifications daily, with a team of 10 engineers managing the backend.
- The original system used synchronous PostgreSQL writes, blocking the HTTP request until the write completed.
- Traffic spikes occurred at specific times (hourly and half-hourly) due to customers scheduling notifications.
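To put the daily volume in perspective, the short calculation below converts the "around 13 billion" figure into an average per-second rate. It is only a back-of-the-envelope sketch: it describes notifications sent, not API write requests, and because traffic clusters on the hour and half hour, the real peaks would sit above this average by an amount the summary does not state.

```python
# Average send rate implied by the daily volume quoted in the overview.
NOTIFICATIONS_PER_DAY = 13_000_000_000  # "around 13 billion" per day
SECONDS_PER_DAY = 24 * 60 * 60          # 86,400 seconds

average_per_second = NOTIFICATIONS_PER_DAY / SECONDS_PER_DAY

# Roughly 150,000 notifications per second when spread evenly over a day.
print(f"average: about {average_per_second:,.0f} notifications per second")
```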
Solution Implementation
- Introduced a layer of queuing using Apache Kafka, making the system asynchronous.
- Kafka is a distributed streaming platform that uses topics to logically group messages.
- Each message has a numerical ID called an offset, which starts at zero and increases as messages are appended. (In Kafka, offsets are assigned per partition rather than across the whole topic.)
- Consumers pull messages from Kafka topics and process them.
- Each topic is split into numbered partitions. Every partition is an independent log of messages, so multiple instances of a consumer can read different partitions in parallel.
- Subpartition processing fans the messages from each partition out to several in-memory queues inside the consumer, so concurrency is no longer limited by the number of partitions.
- Created more queues, keyed by row, so that updates to different rows are processed concurrently while updates to the same row stay in order.
- Added a cap on the number of messages each consumer instance can hold in memory to prevent overloading.
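The subpartition queuing and memory cap described above can be sketched as follows. This is a minimal illustration, not the team's implementation: it uses no Kafka client (messages are handed in through a hypothetical `submit` method where a real consumer would have a poll loop), and the names `SubpartitionProcessor`, `queue_index`, `NUM_QUEUES`, and `MAX_IN_MEMORY` are assumptions made for the example. Offset tracking and commits are left out entirely.

```python
import queue
import threading
import zlib

NUM_QUEUES = 4       # hypothetical number of in-memory queues per consumer
MAX_IN_MEMORY = 100  # hypothetical cap on messages held in memory at once


def queue_index(row_id: str, num_queues: int = NUM_QUEUES) -> int:
    """Map a row ID to a queue; the same row always maps to the same queue."""
    return zlib.crc32(row_id.encode("utf-8")) % num_queues


class SubpartitionProcessor:
    """Fan messages out to per-row queues, one worker thread per queue."""

    def __init__(self, handler, num_queues=NUM_QUEUES, max_in_memory=MAX_IN_MEMORY):
        self.handler = handler
        self.errors = []
        self.queues = [queue.Queue() for _ in range(num_queues)]
        # One slot per in-flight message; submit() blocks when none are free.
        self.slots = threading.BoundedSemaphore(max_in_memory)
        self.workers = [
            threading.Thread(target=self._run, args=(q,), daemon=True)
            for q in self.queues
        ]
        for worker in self.workers:
            worker.start()

    def submit(self, row_id, payload):
        """Called by the polling loop; applies backpressure at the cap."""
        self.slots.acquire()
        index = queue_index(row_id, len(self.queues))
        self.queues[index].put((row_id, payload))

    def _run(self, q):
        while True:
            item = q.get()
            if item is None:  # shutdown sentinel
                break
            row_id, payload = item
            try:
                self.handler(row_id, payload)
            except Exception as exc:  # keep the worker alive; record the failure
                self.errors.append((row_id, exc))
            finally:
                self.slots.release()

    def close(self):
        """Drain every queue, then stop the workers."""
        for q in self.queues:
            q.put(None)
        for worker in self.workers:
            worker.join()
```

Because each queue has a single worker and a given row ID always hashes to the same queue, updates to one row are applied in arrival order, while rows that hash to different queues proceed in parallel. The flip side is that a row receiving a large share of all traffic serializes behind one worker, which is the failure mode the next section describes.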
Issue Identification and Resolution
- Observed high consumer lag together with low CPU usage, which contradicted expectations: consumers that are falling behind would normally be busy.
- Implemented centralized logging to gain more observability.
- Discovered that a single customer (Closely) was dominating the updates, with a single row ID receiving a constant stream of updates that could not be processed concurrently.
- Identified that the updates were related to the "set email" method in the SDK, which was causing 4.8 million user updates to be mirrored to a single record.
- Updates to the Closely app admin record were skipped, and limits were implemented to prevent customers from linking too many records together.
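The diagnosis and the two mitigations above can be sketched in a few lines. This is an illustrative example under stated assumptions, not the team's tooling: the input is assumed to be a list of row IDs extracted from the centralized logs, and `top_row_share`, `SKIPPED_ROW_IDS`, `MAX_LINKED_RECORDS`, and the sample IDs are hypothetical names and values chosen for the sketch.

```python
from collections import Counter

# Hypothetical values; the summary does not give the real ID or limit.
SKIPPED_ROW_IDS = {"admin-record"}
MAX_LINKED_RECORDS = 1000


def top_row_share(row_ids):
    """Return (row_id, share) for the row receiving the most updates."""
    counts = Counter(row_ids)
    if not counts:
        return None, 0.0
    row_id, hits = counts.most_common(1)[0]
    return row_id, hits / sum(counts.values())


def should_process(row_id):
    """Skip updates aimed at rows on the skip list."""
    return row_id not in SKIPPED_ROW_IDS


def can_link(current_links):
    """Refuse new links once a record reaches the configured limit."""
    return current_links < MAX_LINKED_RECORDS


# Usage: a sample in which one row receives 90% of the updates.
sample = ["admin-record"] * 90 + [f"user-{i}" for i in range(10)]
hot_row, share = top_row_share(sample)
print(hot_row, f"{share:.0%}")
```

A share this lopsided explains the symptoms: with per-row ordering, nearly every message waits behind the same queue, so lag grows while most workers sit idle.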
Lessons Learned
- Shifting intensive API workloads to asynchronous workers reduces operational burden.
- Subpartition queuing increases consumer concurrency.
- Centralized observability is crucial in tracking down issues.
- Customers can be more creative than engineering, design, and product teams in finding unexpected use cases.