Site Reliability Engineer
Key Responsibilities:
Design, implement, and maintain highly available and scalable systems across cloud and hybrid environments.
Develop automated monitoring, alerting, and self-healing systems to proactively address reliability issues.
Build and manage CI/CD pipelines to improve deployment frequency and reduce change failure rates.
Collaborate with software engineering teams to improve application performance, observability, and incident response.
Perform root cause analysis, drive resolution of production incidents with minimal downtime, and contribute to a postmortem culture.
Define and monitor SLIs/SLOs/SLAs to ensure system reliability standards are met.
Develop and maintain Infrastructure as Code (IaC) using tools like Terraform, CloudFormation, or Pulumi.
Contribute to capacity planning, load testing, and performance tuning efforts.
Work with security teams to ensure systems are compliant and secure.