DOP-404R Amazon’s approach to high-availability deployment – Key Takeaways
The Key
The two most important things in large-scale distributed deployments: automation and learning from errors
Automation
You cannot manually manage hundreds of thousands of teams and deployments, so use monitoring tools and rule enforcement to set a standard: teams stay autonomous but conform to the same standard
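As a minimal sketch of what automated rule enforcement could look like (not Amazon's actual tooling): each deployment is checked against one shared rule set, so teams keep their own pipelines but are measured against the same standard. The `Deployment` fields and the rule names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    # Hypothetical attributes a monitoring tool might report per deployment.
    team: str
    has_alarms: bool
    has_rollback: bool
    staged_rollout: bool


# Hypothetical organization-wide rules: rule name -> check function.
RULES = {
    "alarms-configured": lambda d: d.has_alarms,
    "rollback-enabled": lambda d: d.has_rollback,
    "staged-rollout": lambda d: d.staged_rollout,
}


def violations(deployment):
    """Return the names of the rules this deployment fails, in rule order."""
    return [name for name, check in RULES.items() if not check(deployment)]
```

A team whose deployment has alarms and a staged rollout but no rollback would get back a single violation, `rollback-enabled`, which can then be reported or used to block the release.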
Learn from errors
Errors will always happen, so have a formal process for learning from them and preventing them from happening again
Have your data with you when you do an analysis; get the data from monitoring
If an error can be remedied automatically, do it, e.g. roll back and trigger an alarm
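The automatic remedy in the last point can be sketched as follows. This is an illustration under stated assumptions, not a description of Amazon's deployment system: `deploy_with_rollback`, the `error_rate_after` callback, the 1% threshold, and the `alarm` hook are all hypothetical names and values.

```python
def deploy_with_rollback(current_version, new_version, error_rate_after,
                         threshold=0.01, alarm=print):
    """Return the version that should stay live after deploying new_version.

    error_rate_after: callable taking a version and returning its observed
    error rate (assumed to come from monitoring).
    alarm: callable taking a message; called only when a rollback happens.
    """
    rate = error_rate_after(new_version)
    if rate > threshold:
        # Remedy first (keep the old version), then notify a human.
        alarm(f"rollback: {new_version} error rate {rate:.2%} exceeds {threshold:.2%}")
        return current_version
    return new_version
```

The point of the design is that the rollback does not wait for a person: the alarm is raised so that someone can investigate afterwards, using the monitoring data that triggered it.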
The Takeaways
Correction of Errors (COE) is the format Amazon uses to describe a failure
What happened
Root cause analysis (RCA)
Supporting data and metrics
Lessons learned
Customer impact
Corrective actions
Tools
Improvements
Best practices
Customer impact > generate COE > COE action items sent to Service Team
Goals: prevention, shorter response time, smaller impact
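To make the COE structure concrete, here is a rough sketch of a COE as a record whose corrective actions are routed to service teams. The field names mirror the items listed above; the class layout and the `open_actions_for` helper are assumptions for illustration, not Amazon's internal template.

```python
from dataclasses import dataclass, field


@dataclass
class ActionItem:
    description: str
    owner_team: str  # the service team the action item is sent to
    done: bool = False


@dataclass
class COE:
    what_happened: str
    root_cause: str
    customer_impact: str
    supporting_data: list = field(default_factory=list)     # metrics, graphs, logs
    lessons_learned: list = field(default_factory=list)
    corrective_actions: list = field(default_factory=list)  # ActionItem objects

    def open_actions_for(self, team):
        """Return the unfinished action items assigned to the given team."""
        return [a for a in self.corrective_actions
                if a.owner_team == team and not a.done]
```

Keeping action items inside the record, each with an owning team, reflects the flow noted above: a customer impact produces a COE, and the COE's action items go to the service team responsible.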
Amazon favors autonomous two-pizza teams
Independent operation, decentralized ownership, local decisions, but
They need consistent standards and identical tools / platforms
Small teams mean a huge number of teams, which requires ways to keep them under control