Mastering Superfast Data Planes: Boosting Cloud Performance for Millions of Packets per Second
23 Sep 2024
- 100-gigabit network interfaces are now common and carry approximately 10 million packets per second. (30s)
- That works out to a time budget of about 100 nanoseconds per packet, or roughly 300 CPU clock cycles to process each one (see the calculation below). (38s)
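As a quick check of these numbers (the roughly 3 GHz clock is implied by the 100 ns / 300-cycle pair rather than stated explicitly):

$$
\frac{1}{10^{7}\ \text{packets/s}} = 100\ \text{ns per packet},
\qquad
100\ \text{ns} \times 3\times10^{9}\ \text{cycles/s} = 300\ \text{cycles per packet}.
$$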
Packet Processing Device
- A simplified packet-processing device is presented that matches on packet headers, looks up a rewrite policy, and rewrites specific header fields. (4m19s)
- Its per-packet processing function is called approximately 10 million times per second (a sketch of such a function follows this list). (6m15s)
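Below is a minimal sketch, in Rust, of what such a per-packet path could look like. The FlowKey and RewritePolicy types, the rewrite_table field, and the byte offsets are hypothetical; the talk only describes the stages (match on headers, look up a rewrite policy, rewrite header fields).

```rust
use std::collections::HashMap;

/// Hypothetical flow key: just the header fields the match stage needs.
#[derive(Hash, PartialEq, Eq, Clone, Copy)]
struct FlowKey {
    dst_ip: u32,
    dst_port: u16,
}

/// Hypothetical rewrite policy: the new values to write into the header.
#[derive(Clone, Copy)]
struct RewritePolicy {
    new_dst_ip: u32,
    new_dst_port: u16,
}

struct Device {
    /// Match table: flow key -> rewrite policy.
    rewrite_table: HashMap<FlowKey, RewritePolicy>,
}

impl Device {
    /// Called roughly 10 million times per second, so every cycle counts.
    fn process_packet(&self, pkt: &mut [u8]) {
        // 1. Match: parse just enough of the header to build the key
        //    (the offsets are placeholders, not a real protocol layout).
        let key = FlowKey {
            dst_ip: u32::from_be_bytes([pkt[16], pkt[17], pkt[18], pkt[19]]),
            dst_port: u16::from_be_bytes([pkt[22], pkt[23]]),
        };

        // 2. Look up the rewrite policy for this flow.
        if let Some(policy) = self.rewrite_table.get(&key) {
            // 3. Rewrite the matched header fields in place.
            pkt[16..20].copy_from_slice(&policy.new_dst_ip.to_be_bytes());
            pkt[22..24].copy_from_slice(&policy.new_dst_port.to_be_bytes());
        }
    }
}
```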
Optimization Techniques
- Using the "inline" keyword, specifically the "always inline" attribute in C, can improve performance by eliminating function call overhead. (7m30s)
- Utilizing vector instructions, such as the VP instruction for logical AND operations on 256-bit vectors, can significantly enhance performance by processing multiple data elements simultaneously. (11m1s)
- Intel Intrinsics are functions provided by Intel that make it easier to use low-level vector instructions. (12m30s)
- AVX 512, the next iteration of vector instructions, introduces Ternary Logic Operations, which allow binary logic between three arguments simultaneously. (13m42s)
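A minimal sketch of these ideas in Rust (the talk's inline example is in C; #[inline(always)] is the Rust counterpart of always_inline, and _mm256_and_si256 is the intrinsic behind the 256-bit VPAND instruction). Applying a 256-bit mask to a header is an illustrative use case, not one taken from the talk.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Force inlining so this tiny helper adds no call overhead on the hot path
/// (the Rust counterpart of C's __attribute__((always_inline))).
#[inline(always)]
fn header_matches(field: u32, value: u32, mask: u32) -> bool {
    (field & mask) == (value & mask)
}

/// AND a 256-bit mask into 32 header bytes with a single VPAND,
/// instead of masking 4 or 8 bytes at a time in a scalar loop.
///
/// Safety: the caller must ensure the CPU supports AVX2
/// (e.g. checked via is_x86_feature_detected!("avx2")).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn mask_header_avx2(header: &[u8; 32], mask: &[u8; 32]) -> [u8; 32] {
    // Unaligned loads of the two 256-bit operands.
    let h = _mm256_loadu_si256(header.as_ptr() as *const __m256i);
    let m = _mm256_loadu_si256(mask.as_ptr() as *const __m256i);

    // One instruction ANDs all 256 bits at once.
    let r = _mm256_and_si256(h, m);

    let mut out = [0u8; 32];
    _mm256_storeu_si256(out.as_mut_ptr() as *mut __m256i, r);
    out

    // AVX-512 goes further: the VPTERNLOG instruction (intrinsic
    // _mm512_ternarylogic_epi32) evaluates any three-input Boolean function,
    // selected by an 8-bit truth table, in a single instruction.
}
```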
Swiss Table Data Structure
- A Swiss table, a hash-table design developed by Google, splits the hash into two parts: H1 selects the group (bucket) to probe, while H2, a short tag stored in a separate metadata array, lets the lookup filter candidate slots before touching the full entries. (15m53s)
- Because the metadata bytes are packed contiguously, an entire group can be compared against the tag with a single vector instruction, minimizing the time spent probing entries. (18m21s)
- Switching to a Swiss table, with a comparable hash function, improves performance from roughly 400 clock cycles per packet to 300 (see the probe sketch below). (19m5s)
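A minimal sketch of the probe described above, using 128-bit SSE2 intrinsics (always available on x86_64). Real implementations such as Rust's hashbrown use the same idea but also handle growth, deletions, and probing across multiple groups; the field names and the 16-slot group size here follow the common layout but are simplified.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

const GROUP_SIZE: usize = 16; // one SSE register's worth of metadata bytes

struct SwissTable<V> {
    /// One byte per slot: the 7-bit H2 tag, or a sentinel with the high bit
    /// set for empty/deleted slots.
    metadata: Vec<u8>,
    keys: Vec<u64>,
    values: Vec<V>,
}

/// Split the 64-bit hash: H1 picks the group, H2 is a 7-bit tag kept in metadata.
#[inline(always)]
fn split_hash(hash: u64, num_groups: usize) -> (usize, u8) {
    let h1 = (hash >> 7) as usize % num_groups;
    let h2 = (hash & 0x7F) as u8;
    (h1, h2)
}

impl<V> SwissTable<V> {
    /// Probe one group: compare all 16 metadata bytes against the H2 tag at
    /// once, then check full keys only for slots whose tag matched.
    ///
    /// Safety: SSE2 is always available on x86_64; the load stays in bounds
    /// because metadata.len() is a multiple of GROUP_SIZE in this sketch.
    #[cfg(target_arch = "x86_64")]
    unsafe fn find(&self, key: u64, hash: u64) -> Option<&V> {
        let num_groups = self.metadata.len() / GROUP_SIZE;
        let (group, h2) = split_hash(hash, num_groups);
        let base = group * GROUP_SIZE;

        // Load the group's 16 metadata bytes and broadcast the tag.
        let meta = _mm_loadu_si128(self.metadata[base..].as_ptr() as *const __m128i);
        let tag = _mm_set1_epi8(h2 as i8);

        // One compare produces a 16-bit bitmask of candidate slots.
        let mut candidates = _mm_movemask_epi8(_mm_cmpeq_epi8(meta, tag)) as u16;

        // Usually zero or one bits are set, so few entries are probed fully.
        while candidates != 0 {
            let slot = base + candidates.trailing_zeros() as usize;
            if self.keys[slot] == key {
                return Some(&self.values[slot]);
            }
            candidates &= candidates - 1; // clear the lowest set bit
        }
        None
    }
}
```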
Interleaving and Prefetching
- Interleaving overlaps the memory accesses one packet needs with the processing of other packets, minimizing the time spent stalled on memory. (21m20s)
- Because the programmer knows exactly which memory each packet will touch, explicit interleaving can hide latency more effectively than relying solely on the CPU's own out-of-order execution. (22m49s)
- Instead of processing individual packets, a burst of 20 packets is processed at a time. (23m14s)
- Prefetching asks the CPU to load data into the cache before it is actually needed, so the later access does not miss. (23m54s)
- Prefetching the Swiss-table metadata used for the packet lookup reduces processing time from 300 clock cycles per packet to 80 (see the burst-processing sketch below). (23m57s)
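A minimal sketch of the burst-plus-prefetch pattern in Rust. The burst size of 20 comes from the talk; compute_hash, metadata_addr, and lookup_and_rewrite are placeholders standing in for the real hashing and Swiss-table steps, and the prefetch wrapper uses the x86 _mm_prefetch intrinsic.

```rust
const BURST_SIZE: usize = 20; // packets processed per burst, as in the talk

struct Packet {
    bytes: Vec<u8>,
}

/// Hint the CPU to pull one cache line into L1 ahead of time.
#[cfg(target_arch = "x86_64")]
#[inline(always)]
fn prefetch(ptr: *const u8) {
    use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
    // SAFETY: prefetch is only a hint and cannot fault.
    unsafe { _mm_prefetch::<{ _MM_HINT_T0 }>(ptr as *const i8) }
}

/// No-op on other architectures.
#[cfg(not(target_arch = "x86_64"))]
#[inline(always)]
fn prefetch(_ptr: *const u8) {}

/// Placeholder for hashing the packet's flow key.
fn compute_hash(pkt: &Packet) -> u64 {
    pkt.bytes.iter().fold(0u64, |h, &b| h.wrapping_mul(31).wrapping_add(b as u64))
}

/// Placeholder: address of the Swiss-table metadata group this hash will probe.
fn metadata_addr(metadata: &[u8], hash: u64) -> *const u8 {
    let group = hash as usize % (metadata.len() / 16);
    metadata[group * 16..].as_ptr()
}

/// Placeholder for the actual lookup and header rewrite.
fn lookup_and_rewrite(_pkt: &mut Packet, _hash: u64) {}

/// Process a burst: issue all the prefetches first, then do the lookups, so
/// each packet's metadata load overlaps with work on the other packets.
fn process_burst(packets: &mut [Packet; BURST_SIZE], metadata: &[u8]) {
    let mut hashes = [0u64; BURST_SIZE];

    // Pass 1: compute hashes and prefetch the metadata each lookup will need.
    for (i, pkt) in packets.iter().enumerate() {
        hashes[i] = compute_hash(pkt);
        prefetch(metadata_addr(metadata, hashes[i]));
    }

    // Pass 2: lookups now hit warm cache lines instead of stalling on DRAM.
    for (i, pkt) in packets.iter_mut().enumerate() {
        lookup_and_rewrite(pkt, hashes[i]);
    }
}
```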
Loop Unrolling
- Loop unrolling further optimizes packet processing by reducing loop overhead (counter updates and branches) and exposing more instruction-level parallelism. (26m12s)
- Unrolling cuts the cost from 80 to 65 clock cycles per packet, a further improvement of roughly 20 percent (a sketch follows below). (29m15s)
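A minimal sketch of manual unrolling in Rust, using a 4x unroll and a simple checksum-style loop as the workload (both illustrative, not from the talk). Compilers will often unroll automatically, but writing it out shows where the savings come from.

```rust
/// Scalar baseline: the loop counter update and branch are paid per element.
fn sum_words(words: &[u32]) -> u32 {
    let mut sum = 0u32;
    for &w in words {
        sum = sum.wrapping_add(w);
    }
    sum
}

/// Unrolled by 4: loop overhead is paid once per four elements, and the four
/// independent accumulators expose more instruction-level parallelism.
fn sum_words_unrolled(words: &[u32]) -> u32 {
    let (mut s0, mut s1, mut s2, mut s3) = (0u32, 0u32, 0u32, 0u32);

    let chunks = words.chunks_exact(4);
    let tail = chunks.remainder();
    for c in chunks {
        s0 = s0.wrapping_add(c[0]);
        s1 = s1.wrapping_add(c[1]);
        s2 = s2.wrapping_add(c[2]);
        s3 = s3.wrapping_add(c[3]);
    }

    // Fold the accumulators and handle the leftover elements.
    let mut sum = s0.wrapping_add(s1).wrapping_add(s2).wrapping_add(s3);
    for &w in tail {
        sum = sum.wrapping_add(w);
    }
    sum
}
```

In the packet-processing context the same idea applies to the burst loop: several packets' lookups are issued per iteration instead of one.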
Optimization Trade-offs
- While techniques like inlining and loop unrolling can enhance performance, they can also increase code size, potentially leading to more instruction cache misses and reduced performance. (30m30s)
- Excessive prefetching of memory, especially into the small L1 cache, can result in cache eviction, where prefetched data is replaced before being used, negatively impacting performance. (31m11s)
Rust Programming Language
- The Rust programming language's default hashmap implementation utilizes a Swiss table data structure. (35m12s)
Optimization Considerations
- When optimizing code, it is important to consider the trade-off between impact and complexity, with techniques falling into quadrants of easy/low impact, easy/high impact, hard/low impact, and hard/high impact. (35m26s)
- Intel VTune is a powerful profiling tool for identifying memory stalls during performance benchmarking. (39m0s)
- Developers should use both micro-benchmarks, for rapid iteration, and large-scale performance tests, for end-to-end validation, to catch performance issues. (41m24s)
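A minimal micro-benchmark sketch using the third-party criterion crate (an assumption; the talk does not name a specific benchmarking tool). std::hint::black_box keeps the compiler from optimizing away the work being measured.

```rust
// benches/process_packet.rs, with criterion added as a dev-dependency (assumed).
use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

/// Stand-in for the hot per-packet routine being tuned.
fn process_packet(pkt: &mut [u8]) {
    for b in pkt.iter_mut() {
        *b ^= 0x5A;
    }
}

fn bench_process_packet(c: &mut Criterion) {
    let mut pkt = vec![0u8; 64];
    c.bench_function("process_packet", |b| {
        b.iter(|| process_packet(black_box(pkt.as_mut_slice())))
    });
}

criterion_group!(benches, bench_process_packet);
criterion_main!(benches);
```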
Programming Language Selection
- When selecting a programming language for a performance-critical project, it's essential to choose a language that provides fine-grained control over optimization, such as Rust, which allows direct access to Intel intrinsics. (42m30s)
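A minimal sketch of the kind of control meant here: Rust exposes the Intel intrinsics through std::arch and lets a program choose the fast path at runtime with is_x86_feature_detected!. The any_nonzero workload is illustrative, not from the talk.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// AVX2 path: OR all 64 bytes together with two loads and one OR, then test.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn any_nonzero_avx2(data: &[u8; 64]) -> bool {
    let a = _mm256_loadu_si256(data.as_ptr() as *const __m256i);
    let b = _mm256_loadu_si256(data.as_ptr().add(32) as *const __m256i);
    let or = _mm256_or_si256(a, b);
    // _mm256_testz_si256 returns 1 only when every bit of (or AND or) is zero.
    _mm256_testz_si256(or, or) == 0
}

/// Portable scalar fallback.
fn any_nonzero_scalar(data: &[u8; 64]) -> bool {
    data.iter().any(|&b| b != 0)
}

/// Pick the implementation based on what the running CPU actually supports.
fn any_nonzero(data: &[u8; 64]) -> bool {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: the feature check above guarantees AVX2 is available.
            return unsafe { any_nonzero_avx2(data) };
        }
    }
    any_nonzero_scalar(data)
}
```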
Premature Optimization
- Premature optimization should be avoided, and benchmarking should be performed early and continuously throughout the development process to identify actual bottlenecks and prevent wasted effort on unnecessary optimizations. (44m34s)