Posted 04 June, 2026

Senior Site Reliability Engineer (AI)

OneTrust
Bengaluru, India | Full Time
Reference: 102_700149_7954029

The Challenge

  • Own production services end-to-end, including reliability, scalability, and operational excellence
  • Participate in the on-call rotation and lead incident response

Your Mission

Engage and partner with various Engineering, Operations, and Product teams to design, deliver, and maintain a highly available and performant application platform.

  • Collaborate with different functional groups to identify gaps and to prioritize and resolve issues
  • Define, implement, and maintain SLIs and SLOs aligned with customer experience
  • Design and instrument SLIs such as latency, error rates, and availability across critical services
  • Manage and enforce error budgets to balance system reliability with product feature velocity
  • Improve alert quality by reducing noise and focusing on actionable, high-signal alerts
  • Embed with product teams to review architectures and catch reliability risks early
  • Share your knowledge and experience with the Engineering organization
  • Share your findings with technical leadership and senior management
  • Build scripts in Python, Bash, Java, or Ruby for operational automation and incident response

You Are

A hands-on engineer who is familiar with running production services and can design solutions to monitor and automate those services appropriately.

Your Experience Includes

  • Bachelor's degree in Computer Science, Engineering, or a related technical or business field
  • 4+ years of application development experience with Java or an equivalent language
  • Experience with the Spring environment
  • Experience in cloud-based infrastructure (Azure, AWS, GCP, etc.)
  • Experience with the factors influencing the performance of software applications at multiple layers (database, network, CPU utilization, JVM tuning, memory analysis, thread management, query performance, etc.)
  • An understanding of the importance of centralized logging, metrics dashboards, and alerting, and the ability to discuss some of the tools used for these tasks
  • A good understanding of databases (ideally SQL/NoSQL)
  • Hands-on experience with observability tools (Datadog, Prometheus, Grafana, etc.)
  • Familiarity with CI/CD pipelines and infrastructure-as-code (Terraform, Helm, Jenkins, GitLab)
  • Experience building and operating AI-assisted incident response systems (root cause analysis, log summarization, anomaly triage)
  • Experience developing or integrating LLM-based tools to reduce MTTR and improve alert quality
  • Experience applying machine learning techniques for anomaly detection, capacity prediction, or failure pattern analysis
  • Experience deploying AI systems in production (not just experimentation)
  • Familiarity with vector databases, embeddings, or RAG architectures for operational intelligence
  • Strong understanding of prompt engineering and evaluation of LLM outputs in reliability workflows
  • Experience with Kubernetes and container orchestration (EKS/AKS/GKE)
  • Experience with distributed systems at scale
  • Familiarity with service meshes and microservices architecture

Nice to Have

  • Experience with chaos engineering tools (Gremlin, Chaos Monkey)
  • Background in product-facing services with high traffic scale
  • Knowledge of incident management platforms (PagerDuty, Datadog alerts)
