DOP-404R Amazon’s approach to high-availability deployment – Key Takeaways

The Key

  • The two most important things in large-scale distributed deployments: automation and learning from errors
  • Automation
    • You cannot manually manage hundreds of thousands of teams and deployments, so use monitoring tools and rule enforcement to set a standard: teams stay autonomous but conform to the same bar
  • Learn from errors
    • Errors will always happen; have a formal process for learning from them and preventing them from happening again
    • Have your data with you when you do an analysis; get it from monitoring
    • If an error can be remediated automatically, do it, e.g. roll back and trigger an alarm (see the sketch after this list)
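
A minimal sketch of the two habits combined, assuming hypothetical rollback, page_team, and open_coe helpers (none of these names come from the talk): when an alarm fires, apply a known-safe remedy automatically, and always capture the monitoring data in a correction-of-error record (described under the takeaways below).

```python
# Hypothetical alarm handler: remediate automatically where possible, and
# always keep the monitoring data so the failure can be analyzed later.

def rollback(deployment_id: str) -> None:
    """Placeholder: revert the deployment to the last known-good version."""
    raise NotImplementedError


def page_team(team: str, message: str) -> None:
    """Placeholder: alert the owning team."""
    raise NotImplementedError


def open_coe(team: str, alarm: dict) -> None:
    """Placeholder: open a correction-of-error record with the alarm's data."""
    raise NotImplementedError


def handle_alarm(alarm: dict) -> None:
    team = alarm["team"]
    deployment_id = alarm["deployment_id"]
    if alarm.get("auto_remediable", False):
        # Known failure mode: remediate without waiting for a human.
        rollback(deployment_id)
        page_team(team, f"Auto-rollback of {deployment_id}: {alarm['metric']}")
    else:
        page_team(team, f"Manual action needed on {deployment_id}: {alarm['metric']}")
    # Either way the data is preserved; it feeds the error-learning process.
    open_coe(team, alarm)
```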

The Takeaways

  • A correction of error (COE) is the document Amazon writes to describe a failure; it covers
    • What happened
    • Root cause analysis (RCA)
    • Supporting data and metrics
    • Lessons learned
    • Customer impact
    • Corrective actions
      • Tools
      • Improvements
      • Best practices
  • Customer impact → generate COE → COE action items sent to the service team
    • Goals: prevention, shorter response time, smaller impact
  • Amazon favors autonomous two-pizza teams
    • Independent operation, decentralized ownership, local decisions, but
    • Need consistent standards and identical tools/platforms
    • Small teams mean a huge number of them, which requires ways to keep them under control
  • Tools: audit + enforcement
    • Audit
      • Explicitly requires multiple stages of testing
        • Unit, smoke, integration, load, regional configuration, soak, canary, wave 1, wave 2
      • Health monitor and alarm
    • Enforcement
      • Automated control with rules (see the rule-check sketch after this list)
  • Anomaly detection
    • Application metrics: faults, traffic, errors
    • System metrics: CPU, disk, memory
    • Runtime metrics: heap, GC, threads
    • Roll back on threshold breach (see the rollback sketch after this list)
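
A sketch of the audit-plus-enforcement idea: a rule check that blocks a team's pipeline unless it declares every required test stage and a health alarm. The stage names mirror the audit list above; the pipeline shape and function names are assumptions, not Amazon's actual tooling.

```python
# Hypothetical pipeline audit: every team's pipeline must declare the
# required test stages and a health alarm before it is allowed to deploy.

REQUIRED_STAGES = [
    "unit", "smoke", "integration", "load",
    "regional configuration", "soak", "canary", "wave 1", "wave 2",
]


def audit_pipeline(pipeline: dict) -> list[str]:
    """Return the rule violations found in one team's pipeline definition."""
    violations = []
    declared = set(pipeline.get("stages", []))
    for stage in REQUIRED_STAGES:
        if stage not in declared:
            violations.append(f"missing required stage: {stage}")
    if not pipeline.get("health_alarm"):
        violations.append("no health monitor/alarm configured")
    return violations


def enforce(pipelines: dict[str, dict]) -> None:
    """Block non-conforming pipelines; conforming teams stay fully autonomous."""
    for team, pipeline in pipelines.items():
        violations = audit_pipeline(pipeline)
        if violations:
            pipeline["blocked"] = True
            print(f"[{team}] deployments blocked:", "; ".join(violations))
```

Run against every team's pipeline definition, a check like this is what lets a huge number of small, autonomous teams stay at the same standard without central manual review.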
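
And a sketch of rollback on threshold breach: watch the application, system, and runtime metrics listed above during a deployment and roll back on the first breach. The metric names, threshold values, and the fetch_metric/rollback helpers are illustrative assumptions, not figures from the talk.

```python
# Hypothetical anomaly detector: compare each watched metric against its
# threshold during a deployment wave and roll back on the first breach.

THRESHOLDS = {
    # application metrics
    "fault_rate": 0.01, "error_rate": 0.01,
    # system metrics
    "cpu_utilization": 0.90, "disk_utilization": 0.85, "memory_utilization": 0.90,
    # runtime metrics
    "heap_utilization": 0.90, "gc_pause_ms": 500, "thread_count": 1000,
}


def fetch_metric(deployment_id: str, name: str) -> float:
    """Placeholder: read the latest value of one metric from monitoring."""
    raise NotImplementedError


def rollback(deployment_id: str) -> None:
    """Placeholder: revert the deployment to the previous version."""
    raise NotImplementedError


def watch_deployment(deployment_id: str) -> bool:
    """Return True if healthy; roll back and return False on any breach."""
    for name, limit in THRESHOLDS.items():
        value = fetch_metric(deployment_id, name)
        if value > limit:
            rollback(deployment_id)
            print(f"rolled back {deployment_id}: {name}={value} > {limit}")
            return False
    return True
```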