Posted 05 June, 2026

Senior Site Reliability Engineer

MyOperator

Salem, TN, IN

Full Time

Reference: cf21ba80efc59ee5

Job Description

Role Overview

We are looking for a skilled and proactive Site Reliability Engineer (SRE) to take end-to-end ownership of production reliability, observability, and performance engineering across MyOperator’s AI-powered communication infrastructure.

This role is not operational-only — it requires strong system design thinking, deep troubleshooting ability, and a production ownership mindset. You will define reliability standards, build observability frameworks, lead incident response, and drive SLO-based engineering practices across distributed AWS and Kubernetes environments.

About MyOperator

MyOperator is a Business AI Operator platform that enables businesses, teams, and AI agents to work together seamlessly for customer operations such as Sales, Support, Escalations, Feedback, and Refund processes. With 12,000+ businesses using our platform, we operate at meaningful scale and power mission-critical communication workflows including voice bots, WhatsApp automation, and intelligent call routing. We are building for reliability, speed, and impact. MyOperator values ownership, critical thinking, and execution. This is a high-expectation, high-learning environment where engineers are empowered to solve complex problems and build systems that directly affect customer outcomes.

Key Responsibilities

Own production reliability, uptime, latency, and error budgets across critical services.
Design and manage production-grade monitoring using Grafana, VictoriaMetrics (Prometheus), and AWS CloudWatch.
Define and enforce SLIs, SLOs, and SLA thresholds for AI communication systems (voice bots, WhatsApp APIs, call routing).
Build real-time operational dashboards for incident response, capacity planning, and leadership visibility.
Implement end-to-end distributed tracing using OpenTelemetry (OTEL Collector).
Design and maintain centralized logging with strong correlation between logs, metrics, and traces.
Create SLO-based alerting systems with minimal noise and fast incident detection.
Lead the full incident response lifecycle: alert triage, mitigation, RCA documentation, and preventive improvements.
Drive MTTR reduction through structured monitoring, automation, and reliability engineering practices.
Monitor and troubleshoot AWS EKS (Kubernetes) production workloads.
Instrument and monitor LLM API integrations, AI inference pipelines, and messaging systems.
Analyze logs using OpenSearch / ELK for anomaly detection and root cause identification.
Automate operational workflows using Python or Bash to eliminate manual toil.
Drive performance optimization, scalability improvements, and capacity planning.
Collaborate with engineering teams to instrument new services from day one.
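
To illustrate the error-budget and SLO-based alerting work listed above, here is a minimal Python sketch of the arithmetic behind a burn-rate alert. It is not MyOperator's implementation. The function names, the 99.9% target, and the 14.4 paging threshold are assumptions chosen for the example. The threshold is a commonly cited fast-burn value for a one-hour window against a 30-day SLO.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates for this request volume."""
    return (1.0 - slo_target) * total_requests


def burn_rate(slo_target: float, failed: int, total: int) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the budget is being spent exactly as fast as the
    SLO permits; higher values mean it will run out early.
    """
    if total == 0:
        return 0.0  # no traffic in the window, so nothing was burned
    allowed_error_rate = 1.0 - slo_target
    return (failed / total) / allowed_error_rate


def should_page(slo_target: float, failed: int, total: int,
                threshold: float = 14.4) -> bool:
    """Page only when the budget burns faster than the (assumed) threshold."""
    return burn_rate(slo_target, failed, total) >= threshold


if __name__ == "__main__":
    # Hypothetical one-hour window for a service with a 99.9% availability SLO.
    slo = 0.999
    print(round(error_budget(slo, 1_000_000)))   # failures tolerated per 1M requests
    print(round(burn_rate(slo, 200, 10_000), 2)) # 2% errors against a 0.1% allowance
    print(should_page(slo, 200, 10_000))
```

Alerting on burn rate rather than on raw error counts is one way to keep noise low: brief blips that barely touch the budget do not page anyone, while sustained failures do.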

Required Skills & Qualifications

3–6 years of experience in Site Reliability Engineering, DevOps, or Platform Engineering roles.
Hands-on experience with VictoriaMetrics / Prometheus (time-series monitoring), Grafana (dashboards and visualization), and PromQL (writing complex queries and alerts).
Experience implementing distributed tracing using OpenTelemetry (Mandatory).
Strong experience with centralized logging systems (ELK / OpenSearch / Loki).
Experience with alerting frameworks such as Alertmanager or Grafana Alerts.
Strong understanding of SLIs, SLOs, SLA design, and reliability engineering principles.
Hands-on experience managing AWS production workloads (EC2, RDS, ELB, CloudWatch, IAM).
Experience with Kubernetes (AWS EKS preferred).
Good understanding of Linux systems, networking, and cloud infrastructure.
Experience handling production incidents and participating in on-call rotations.
Ability to automate operational tasks using Python or Bash.

Good to Have

Experience with OpenSearch / ELK log pipelines and anomaly detection.
Kubernetes monitoring (pod health, node metrics, autoscaling behavior).
CI/CD observability integration (Jenkins, GitHub Actions).
Experience monitoring LLM APIs and AI inference pipelines.
Familiarity with MLOps or AI observability tools (Arize, WhyLabs, etc.).
Service mesh exposure (Istio).
Infrastructure as Code (Terraform, CloudFormation).
Experience with chaos engineering or load testing tools.
Multi-cluster or multi-region architecture exposure.

Key Expectations

Ownership of production systems and high availability.
Strong troubleshooting and debugging skills.
Focus on automation and reliability improvements.
Proactive approach to incident prevention.
Ability to reduce alert noise and improve signal quality.
Data-driven approach to reliability engineering.

This Role Is Not For

Candidates with purely development experience and no production ownership.
Candidates without real incident response or on-call experience.
Freshers or candidates with less than 3 years of experience.

This listing expired on 08 Jun. Applications are no longer accepted.
