Senior Site Reliability Engineer
Job Description
Role Overview
We are looking for a skilled and proactive Site Reliability Engineer (SRE) to take end-to-end ownership of production...
We are building for reliability, speed, and impact. MyOperator values ownership, critical thinking, and execution. This is a high-expectation, high-learning environment where engineers are empowered to solve complex problems and build systems that directly affect customer outcomes.

Key Responsibilities
Own production reliability, uptime, latency, and error budgets across critical services.
Design and manage production-grade monitoring using Grafana, VictoriaMetrics (Prometheus), and AWS CloudWatch.
Define and enforce SLIs, SLOs, and SLA thresholds for AI communication systems (voice bots, WhatsApp APIs, call routing).
Build real-time operational dashboards for incident response, capacity planning, and leadership visibility.
Implement end-to-end distributed tracing using OpenTelemetry (OTEL Collector).
Design and maintain centralized logging with strong correlation between logs, metrics, and traces.
Create SLO-based alerting systems with minimal noise and fast incident detection.
Lead the incident response lifecycle: alert triage, mitigation, RCA documentation, and preventive improvements.
Drive MTTR reduction through structured monitoring, automation, and reliability engineering practices.
Monitor and troubleshoot AWS EKS (Kubernetes) production workloads.
Instrument and monitor LLM API integrations, AI inference pipelines, and messaging systems.
Analyze logs using OpenSearch / ELK for anomaly detection and root cause identification.
Automate operational workflows using Python or Bash to eliminate manual toil.
Drive performance optimization, scalability improvements, and capacity planning.
Collaborate with engineering teams to instrument new services from day one.

Required Skills & Qualifications
3–6 years of experience in Site Reliability Engineering, DevOps, or Platform Engineering roles.
Hands-on experience with VictoriaMetrics / Prometheus (time-series monitoring), Grafana dashboards and visualization, and PromQL for writing complex queries and alerts.
Experience implementing distributed tracing using OpenTelemetry (mandatory).
Strong experience with centralized logging systems (ELK / OpenSearch / Loki).
Experience with alerting frameworks such as Alertmanager or Grafana Alerts.
Strong understanding of SLIs, SLOs, SLA design, and reliability engineering principles.
Hands-on experience managing AWS production workloads (EC2, RDS, ELB, CloudWatch, IAM).
Experience with Kubernetes (AWS EKS preferred).
Good understanding of Linux systems, networking, and cloud infrastructure.
Experience handling production incidents and participating in on-call rotations.
Ability to automate operational tasks using Python or Bash.

Good to Have
Experience with OpenSearch / ELK log pipelines and anomaly detection.
Kubernetes monitoring (pod health, node metrics, autoscaling behavior).
CI/CD observability integration (Jenkins, GitHub Actions).
Experience monitoring LLM APIs and AI inference pipelines.
Familiarity with MLOps or AI observability tools (Arize, WhyLabs, etc.).
Service mesh exposure (Istio).
Infrastructure as Code (Terraform, CloudFormation).
Experience with chaos engineering or load testing tools.
Multi-cluster or multi-region architecture exposure.

Key Expectations
Ownership of production systems and high availability.
Strong troubleshooting and debugging skills.
Focus on automation and reliability improvements.
Proactive approach to incident prevention.
Ability to reduce alert noise and improve signal quality.
Data-driven approach to reliability engineering.

This Role Is Not For
Candidates with purely development experience and no production ownership.
Candidates without real incident response or on-call experience.
Freshers or candidates with less than 3 years of experience.