Ana Medina on Chaos Engineering, Game Days, and Learning
02 Oct 2024
Gremlin's Status Checks
Gremlin, where Medina works as a Senior Chaos Engineer, has launched a feature called status checks, which verifies the health of a system before a chaos experiment runs. (2m46s)
Status checks can be integrated with tools like Datadog, New Relic, and PagerDuty, and users can also create their own using API endpoints. (3m50s)
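The idea of gating an experiment on a health check can be sketched in a few lines of Python. This is a minimal illustration, not Gremlin's API: the `StatusCheck` class, the `run_if_healthy` function, and the probe callables are all hypothetical names, and each probe is assumed to return an HTTP status code and a latency in seconds from whatever endpoint a team chooses to poll.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StatusCheck:
    """Hypothetical pre-experiment check; not a Gremlin type."""
    name: str
    # Assumed probe contract: returns (HTTP status code, latency in seconds).
    probe: Callable[[], Tuple[int, float]]
    max_latency_s: float = 1.0

    def passes(self) -> bool:
        try:
            status, latency = self.probe()
        except Exception:
            # An unreachable endpoint counts as unhealthy.
            return False
        return 200 <= status < 300 and latency <= self.max_latency_s


def run_if_healthy(
    checks: List[StatusCheck], experiment: Callable[[], None]
) -> Tuple[bool, List[str]]:
    """Run the experiment only if every check passes.

    Returns (ran, names_of_failing_checks).
    """
    failing = [c.name for c in checks if not c.passes()]
    if failing:
        return False, failing
    experiment()
    return True, []
```

In practice a probe would wrap a call to a monitoring or paging tool's endpoint; here it is left as a plain callable so the gating logic can be read and tested without a network.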
Impact of Complex Systems
Complex systems are impacted by many factors, including world events like pandemics. (6m17s)
The pandemic highlighted the difference between organizations that were prepared for high traffic and those that were not. (7m30s)
Game Days and Chaos Engineering Workshops
Gremlin has resources for running game days, but a fully developed remote game day runbook has not yet been created. (10m33s)
Successful virtual game days can be run with proper planning, communication, and collaboration tools like Zoom and Google Docs. (11m40s)
Assigning specific roles, such as commander, note-taker, observer, and tester, helps participants focus on their tasks and contributes to a more successful game day experience. (12m30s)
Gremlin's chaos engineering workshops incorporate hands-on experiments in a cloud infrastructure environment, using Kubernetes, monitoring tools, and a microservice demo environment, to provide practical experience. (13m14s)
Benefits of Chaos Engineering
Chaos engineering can reveal inaccuracies in architecture diagrams by demonstrating how an entire application can break down when traffic to a single service or container is blocked. (16m17s)
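A toy model shows why blocking one service can contradict the diagram. The sketch below is an assumption-laden illustration: the service names, the two dependency maps, and the `impacted` helper are invented for this example, and a service is assumed to fail whenever any of its hard dependencies is down.

```python
from typing import Dict, Set


def impacted(deps: Dict[str, Set[str]], blocked: str) -> Set[str]:
    """Return every service that goes down when `blocked` is unreachable.

    `deps` maps a service to the set of services it cannot work without.
    """
    down = {blocked}
    changed = True
    while changed:
        changed = False
        for service, needs in deps.items():
            if service not in down and needs & down:
                down.add(service)
                changed = True
    return down


# What the architecture diagram claims (hypothetical services).
diagram = {
    "frontend": {"cart"},
    "cart": {"redis"},
    "ads": set(),
}

# What an experiment might reveal: the frontend also hard-depends on ads.
observed = {
    "frontend": {"cart", "ads"},
    "cart": {"redis"},
    "ads": set(),
}

predicted = impacted(diagram, "ads")
actual = impacted(observed, "ads")
surprises = actual - predicted  # services the diagram did not warn about
```

Here the diagram predicts that blocking `ads` affects only `ads`, while the observed dependencies take `frontend` down as well; that gap is the kind of inaccuracy an experiment surfaces.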
When implementing chaos engineering, it is recommended to prioritize testing critical, high-impact services (tier zero and tier one) to maximize the return on investment. (18m26s)
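As a rough sketch of that prioritization, a team could keep a service inventory with a tier per service and derive the experiment order from it. The `Service` type, the `experiment_order` function, and the sample inventory are hypothetical; the only idea taken from the episode is that tier zero and tier one come first.

```python
from typing import List, NamedTuple


class Service(NamedTuple):
    name: str
    tier: int  # 0 is the most critical tier


def experiment_order(services: List[Service], max_tier: int = 1) -> List[str]:
    """Keep services at or below `max_tier`, most critical first."""
    critical = [s for s in services if s.tier <= max_tier]
    ordered = sorted(critical, key=lambda s: (s.tier, s.name))
    return [s.name for s in ordered]


# Hypothetical inventory for illustration only.
inventory = [
    Service("recommendations", 2),
    Service("checkout", 0),
    Service("search", 1),
    Service("auth", 0),
]
order = experiment_order(inventory)
```

Sorting by name within a tier is an arbitrary tie-break; a real inventory might rank by revenue impact or incident history instead.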
Past incidents, documented in a blameless postmortem format, provide valuable insights for chaos engineering experiments by highlighting system vulnerabilities and areas for improvement. (19m9s)
Importance of Training and Ethics
There is a lack of focus on training and ethics in software engineering despite the potential for technology to cause harm. (22m41s)
Organizations should ideally begin planning 3-6 months in advance for important dates like Cyber Monday to ensure system resilience. (24m51s)
Code freezes are a warning sign: they suggest that teams are not equipped to ship changes safely during incident-heavy periods and that their practices need to change. (26m23s)
Gremlin's Chaos Conf will be held virtually on October 6-8, featuring tracks on reliability practices, completing the DevOps loop, and data-driven reliability culture. (28m10s)
The best way to contact Ana Medina is through her Twitter handle, Anna _ Medina. (29m23s)