Posted 04 June, 2026

Lead Site Reliability Engineer

Concentrix
Mangalore, KA, IN
Full Time
Reference: 00793db910971109

Job Description

About the Role:

As a Lead Site Reliability Engineer, you will own the reliability and availability of our production systems. You will champion SRE principles across engineering teams — defining SLOs, managing error budgets, and leading a culture of blameless incident response. This is a hands-on leadership role where you will partner closely with product and engineering teams to balance the pace of innovation with the stability our customers depend on.

  • Title: Lead Site Reliability Engineer
  • Shift: General/UK shift
  • Location: India, remote (any location near CNX offices)

Responsibilities:

Reliability Ownership

  • Define, implement, and own Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets across critical services.
  • Use error budget policies to drive data-informed conversations between engineering and product on release velocity vs. reliability trade-offs.
  • Conduct capacity planning and proactive risk assessments to prevent incidents before they occur.

Incident Management

  • Lead incident response as incident commander: coordinating teams, driving resolution, and maintaining clear stakeholder communication during outages.
  • Facilitate thorough, blameless postmortems and ensure action items are tracked, prioritized, and resolved.
  • Develop and continuously improve runbooks, escalation paths, and on-call practices to reduce MTTD and MTTR.

Observability & Monitoring

  • Design and maintain observability strategies using modern tooling (Prometheus, Grafana, OpenTelemetry, ELK) to ensure full visibility into system health.
  • Define alerting that is actionable and minimizes alert fatigue.
  • Drive adoption of distributed tracing and structured logging across services.

Toil Reduction & Automation

  • Identify and measure toil across the engineering organization and lead initiatives to eliminate it through automation.
  • Build internal tooling and self-service capabilities that improve developer productivity and system reliability.

Infrastructure & Platform Reliability

  • Collaborate with platform and infrastructure teams on cloud-native patterns for fault tolerance, auto-scaling, and disaster recovery.
  • Provide SRE input into CI/CD pipelines and deployment strategies (e.g., canary releases, blue/green deployments) to minimize production risk.
  • Manage infrastructure using Infrastructure as Code (IaC) practices (Terraform or equivalent) with a focus on reliability and consistency.

Leadership & Culture

  • Mentor and grow junior SREs, fostering a culture of ownership, curiosity, and continuous improvement.
  • Act as an SRE advocate across engineering, embedding reliability thinking into the software development lifecycle.
  • Partner with key stakeholders to align SRE strategy with broader organizational goals.
  • Conduct regular 1:1s with direct reports and participate in team rituals.

AI Expectations

As with all engineers at our organization, this role requires an AI-native mindset. Specifically, you will be expected to:

  • Embed AI tools and practices into how we build and run our platform, deploying AI-powered capabilities and shipping real AI features into production.
  • Support engagement and solutioning for AI-powered offerings, translating technical capabilities into tangible business value.
  • Collaborate with cross-functional partners (including Product, Data, Security, and Legal) to ensure AI is delivered safely, effectively, and in compliance with relevant standards.

Skills you will need:

  • 7+ years of experience in SRE, platform engineering, or a related discipline.
  • Proven experience defining and managing SLOs, SLIs, and error budgets in a production environment.
  • Strong incident management experience, including leading postmortems and driving reliability improvements.
  • Hands-on experience with observability tooling (Prometheus, Grafana, OpenTelemetry, or similar).
  • Solid understanding of cloud platforms (AWS, Azure, or GCP) and containerized environments (Kubernetes).
  • Proficiency in at least one scripting or programming language (Python, Go, or Bash).

Nice to Have

  • Experience with chaos engineering tools (e.g., Chaos Monkey, Gremlin, LitmusChaos).
  • Familiarity with IaC tooling such as Terraform or Pulumi.
  • Knowledge of DevSecOps practices and security tooling.
  • Experience with GitOps workflows and CI/CD pipelines.
  • Bilingual proficiency (English & Spanish).

General Expectations

  • Complete all assigned, mandatory training within the timeframe provided.
  • Conduct and/or participate in regularly scheduled 1:1 meetings with your direct manager and/or direct reports.
