Posted 05 June, 2026

Senior Site Reliability Engineer

MyOperator
Salem, TN, IN
Full Time
Reference: cf21ba80efc59ee5

Job Description

Role Overview

We are looking for a skilled and proactive Site Reliability Engineer (SRE) to take end-to-end ownership of production reliability, observability, and performance engineering across MyOperator’s AI-powered communication infrastructure.


This role is not operational-only — it requires strong system design thinking, deep troubleshooting ability, and a production ownership mindset. You will define reliability standards, build observability frameworks, lead incident response, and drive SLO-based engineering practices across distributed AWS and Kubernetes environments.


About MyOperator

MyOperator is a Business AI Operator platform that enables businesses, teams, and AI agents to work together seamlessly for customer operations such as Sales, Support, Escalations, Feedback, and Refund processes. With 12,000+ businesses using our platform, we operate at meaningful scale and power mission-critical communication workflows including voice bots, WhatsApp automation, and intelligent call routing. We are building for reliability, speed, and impact. MyOperator values ownership, critical thinking, and execution. This is a high-expectation, high-learning environment where engineers are empowered to solve complex problems and build systems that directly affect customer outcomes.


Key Responsibilities

  • Own production reliability, uptime, latency, and error budgets across critical services.
  • Design and manage production-grade monitoring using Grafana, VictoriaMetrics (Prometheus-compatible), and AWS CloudWatch.
  • Define and enforce SLIs, SLOs, and SLA thresholds for AI communication systems (voice bots, WhatsApp APIs, call routing).
  • Build real-time operational dashboards for incident response, capacity planning, and leadership visibility.
  • Implement end-to-end distributed tracing using OpenTelemetry (OTEL Collector).
  • Design and maintain centralized logging with strong correlation between logs, metrics, and traces.
  • Create SLO-based alerting systems with minimal noise and fast incident detection.
  • Lead the incident response lifecycle: alert triage, mitigation, RCA documentation, and preventive improvements.
  • Drive MTTR reduction through structured monitoring, automation, and reliability engineering practices.
  • Monitor and troubleshoot AWS EKS (Kubernetes) production workloads.
  • Instrument and monitor LLM API integrations, AI inference pipelines, and messaging systems.
  • Analyze logs using OpenSearch / ELK for anomaly detection and root cause identification.
  • Automate operational workflows using Python or Bash to eliminate manual toil.
  • Drive performance optimization, scalability improvements, and capacity planning.
  • Collaborate with engineering teams to instrument new services from day one.


Required Skills & Qualifications

  • 3–6 years of experience in Site Reliability Engineering, DevOps, or Platform Engineering roles.
  • Hands-on experience with VictoriaMetrics / Prometheus (time-series monitoring), Grafana (dashboards and visualization), and PromQL (writing complex queries and alerts).
  • Experience implementing distributed tracing using OpenTelemetry (Mandatory).
  • Strong experience with centralized logging systems (ELK / OpenSearch / Loki).
  • Experience with alerting frameworks such as Alertmanager or Grafana Alerts.
  • Strong understanding of SLIs, SLOs, SLA design, and reliability engineering principles.
  • Hands-on experience managing AWS production workloads (EC2, RDS, ELB, CloudWatch, IAM).
  • Experience with Kubernetes (AWS EKS preferred).
  • Good understanding of Linux systems, networking, and cloud infrastructure.
  • Experience handling production incidents and participating in on-call rotations.
  • Ability to automate operational tasks using Python or Bash.


Good to Have

  • Experience with OpenSearch / ELK log pipelines and anomaly detection.
  • Kubernetes monitoring (pod health, node metrics, autoscaling behavior).
  • CI/CD observability integration (Jenkins, GitHub Actions).
  • Experience monitoring LLM APIs and AI inference pipelines.
  • Familiarity with MLOps or AI observability tools (Arize, WhyLabs, etc.).
  • Service mesh exposure (Istio).
  • Infrastructure as Code (Terraform, CloudFormation).
  • Experience with chaos engineering or load testing tools.
  • Multi-cluster or multi-region architecture exposure.


Key Expectations

  • Ownership of production systems and high availability.
  • Strong troubleshooting and debugging skills.
  • Focus on automation and reliability improvements.
  • Proactive approach to incident prevention.
  • Ability to reduce alert noise and improve signal quality.
  • Data-driven approach to reliability engineering.


This Role Is Not For

  • Candidates with purely development experience and no production ownership.
  • Candidates without real incident response or on-call experience.
  • Freshers or candidates with less than 3 years of experience.

This listing expired on 08 Jun. Applications are no longer accepted.
