How Netflix Ensures Highly-Reliable Online Stateful Systems
12 Mar 2024 (6 months ago)
Reliable Stateful Systems
- Reliable stateful systems ensure successful read and write operations with expected consistency models, within designated latency service level objectives, and always in a reliable state.
- Different failure modes in stateful services require different solutions.
- Netflix focuses on building reliable stateful servers, pairing them with reliable stateful clients, and designing APIs to use reliability techniques.
Netflix's Techniques for Reliable Stateful Services
- Netflix uses a combination of techniques to achieve reliability for its stateful services, including:
- Decoupling the stateful process from the OS kernel.
- Using snapshot restoration to move state between instances quickly.
- Monitoring drives and JVMs for potential problems.
- Limiting the frequency of maintenance operations.
- Using in-place imaging to roll out software changes faster.
- Netflix also uses caching to improve the reliability of its stateful services.
- Caches are treated as materialized view engines and are highly reliable.
- Caches are placed in front of services to protect them from load.
- Netflix has developed a technique called total near caching where the source of truth data store is not involved in the read path.
- Netflix uses a number of techniques to make its stateful clients reliable, including:
- Signaling service level objectives per namespace and access pattern.
- Hedging requests.
- Using exponential backoff.
- Load unbalancing.
- Setting concurrency limits.
- Compression reduces bite scent, adds useful properties like checksumming, and improves reliability by reducing the chances of SLO (Service Level Objective) busters.
- SLOs define target and maximum latency objectives for services, and communicate concurrency limits to clients.
- SLOs can be tuned based on namespace, client, and observed average latency.
- Hedging involves sending multiple requests to different servers to improve reliability and meet SLOs.
- Dynamic hedging adjusts the hedging strategy based on whether the client is likely to get a positive result.
- Concurrency limiting prevents too much load from going to backend services.
- GC-tolerant timers are used to prevent incorrect timeouts caused by garbage collection pauses.
- Load balancing strategies like choice of two and weighted choice of N are used to avoid slow servers and improve latency.
- Weighted choice of N exploits a priori knowledge about networks in the cloud to route requests to the closest replica.
Resilient Stateful APIs at Netflix
- The video discusses techniques for building resilient stateful APIs at Netflix.
- The key techniques mentioned are hedging, retrying, and breaking down work into smaller units.
- Item potency tokens are used to ensure that mutable APIs are safe to retry.
- Different types of item potency tokens are discussed, including client monotonic clocks, regional isolated tokens, and global isolated tokens.
- The reliability and consistency trade-offs of different item potency token types are explained.
- Real-world examples of how these concepts are applied in Netflix's key-value and time-series services are provided.
- The importance of measuring and understanding the behavior of clocks in distributed systems is emphasized.
- The video concludes by discussing how these techniques are implemented in Netflix's stateful APIs, including the use of paginated APIs and dynamic concurrency control for scans.
- Item potency tokens are used to ensure idempotent writes and avoid duplicate operations.
- Large responses are broken down into multiple pages to improve throughput and meet service level objectives (SLOs).
Additional Techniques
- Decommissioned instances are detached but not terminated immediately to prevent them from re-entering the fleet.
- Pre-flight checks help reject degraded hardware from re-entering the fleet.
- Netflix targets an 80/20 split, with 80% of services using abstraction layers and 20% accessing storage engines directly.
- Netflix provides a 25-page memo to users who access storage engines directly, outlining best practices for item potency, data store backups, and avoiding data loss.
- Netflix offers data store client libraries in all major supported languages.
- Netflix encourages users to use APIs with built-in item potency and resiliency techniques.
- Netflix sometimes contributes resiliency techniques back to the open-source community.