Effective Performance Engineering at Twitter-Scale
25 Jun 2024
- Traditional performance engineering approaches are no longer sufficient to handle the complexity of modern software applications.
- Performance engineering used to be easier because general-purpose hardware improved steadily from one generation to the next; it is harder now that gains increasingly come from specialized hardware.
- Modern software applications are highly complex, making it challenging to identify and optimize performance issues.
- Systems thinking provides a language and framework for modeling and understanding complex relationships within a system, enabling performance engineers to identify and address bottlenecks and inefficiencies.
- Performance engineering should be viewed as a counting exercise on top of a system model, where resources are counted and analyzed to determine their utilization and impact on performance.
- Having a system model in place allows performance engineers to count resources at the appropriate granularity to accurately assess and optimize performance.
- At Twitter's scale, performance engineering involves building a model of the system and collecting metrics at a fine granularity.
- Custom samplers and systems are used to collect low-level telemetry with minimal overhead.
- The data infrastructure built for performance engineering is also used to solve real problems, such as optimizing GC intervals and identifying underutilized resources.
- The data generated is also used by other teams for capacity planning and service optimization.
- Twitter's performance engineering team developed a data aggregation pipeline to analyze trace data and gain insights into the performance of their distributed systems.
- They used tracing and profiling techniques to capture the interactions between different services and understand how they contribute to overall performance.
- They built a service dependency explorer to visualize the connectivity and load amplification between services, allowing them to identify bottlenecks and optimize resource allocation.
- The team also developed a model called "LatenSeer" to perform causal reasoning and determine the root causes of latency issues.
- The data engineering efforts enabled the team to answer complex performance-related questions that were not possible with other data sets.
- The team's work extended beyond performance engineering to include data privacy analysis, where they leveraged the system dependency data to identify sensitive information and access patterns.
- Performance engineering doesn't fit neatly into traditional organizational structures due to its cross-functional nature.
- To be effective, performance engineers need to align their work with either top-line or bottom-line objectives.
- Building a strong network of advocates and supporters within the organization is crucial for performance engineers to gain influence and drive change.
- Having a dedicated performance engineering team carries symbolic weight: it signals that the organization takes performance optimization seriously.
- Performance engineers should encourage and support other engineers to contribute to performance work, rather than trying to centralize all performance-related tasks.
- To build a successful performance engineering team, you need a diverse team with different skill sets.
- Start with a small team of people who are much better than you at some things.
- Do a lot of odd jobs and favors to justify your existence and build trust.
- Make mistakes, learn from them, and move on.
- Write down your vision and methodology once you have a bit of trust and some small wins.
- As your team grows, intentionally brand yourselves as the performance people.
- Invest in talks, publishing, and writing papers to share your knowledge.
- Once you're mature, start creating platforms and products that allow others to do similar things.
- Design the team to fit the organization's structure rather than copying what others do.
- Outreach and adoption are serious work and should be treated as such.
- People make work happen and their strengths and personalities differ, so it's important to respect that and seek diversity in skills and perspectives.
- Embrace chance and avoid a predetermined mindset: as long as everyone is going in the right direction, the particular path is not that important.
- Software engineering is a social enterprise, so to succeed in the long term, it's important to be helpful, generous, and make friends.
- Understanding the structure of a service and the distributed system is valuable for performance optimization.
- eBPF is a powerful tool for gathering metrics, including ones that are traditionally obtained in-process, often at lower cost and with better performance.
- Traces are stored in a real-time data pipeline with different indices depending on the level of information needed.
- Queries on traces may have a five-minute delay, but they can answer questions that no other data sources can.
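
The load amplification shown by the service dependency explorer can be illustrated with a small sketch. This is not Twitter's implementation: the `Span` record, its field names, and the `load_amplification` helper are all hypothetical, and amplification is assumed here to mean the average number of calls to a downstream service per span handled by the calling service.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """Hypothetical, minimal trace span; real tracing systems carry far more fields."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None for the root span of a trace
    service: str


def load_amplification(spans):
    """Return {(caller, callee): average downstream calls per caller span}."""
    by_id = {(s.trace_id, s.span_id): s for s in spans}
    edge_calls = defaultdict(int)     # calls observed on each service-to-service edge
    service_spans = defaultdict(int)  # spans handled by each service
    for s in spans:
        service_spans[s.service] += 1
        parent = by_id.get((s.trace_id, s.parent_id))
        # Count only cross-service calls; root spans have no parent.
        if parent is not None and parent.service != s.service:
            edge_calls[(parent.service, s.service)] += 1
    return {edge: n / service_spans[edge[0]] for edge, n in edge_calls.items()}


# Two made-up traces: api -> timeline -> cache, with a different cache fan-out in each.
SPANS = [
    Span("t1", "a", None, "api"),
    Span("t1", "b", "a", "timeline"),
    Span("t1", "c", "b", "cache"),
    Span("t1", "d", "b", "cache"),
    Span("t1", "e", "b", "cache"),
    Span("t2", "a", None, "api"),
    Span("t2", "b", "a", "timeline"),
    Span("t2", "c", "b", "cache"),
]

amplification = load_amplification(SPANS)
print(amplification)
```

With these sample spans, each `timeline` span issues two `cache` calls on average, so that edge has an amplification of 2.0, while the `api` to `timeline` edge stays at 1.0. Edges with high amplification are the ones where a small change in upstream traffic turns into a large change in downstream load.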