Posted 02 June, 2026
Lead SRE
NR Consulting - India
Chennai, Hyderabad & Bangalore, IN
Full Time
Reference: 26-05673-2220-1
Title: Lead SRE
Location: Chennai, Hyderabad & Bangalore
Experience: 8-14 years
Job Description:
Proactive Reliability Engineering
· Identify patterns, trends, and signals to prevent incidents before they occur
· Continuously improve alert quality, reduce noise, and increase signal fidelity
· Partner with engineering teams to enhance system resilience and reliability
Automation & Toil Reduction
· Eliminate manual work by automating operational tasks, ticket handling, and repetitive workflows
· Build and improve tooling across incident response, observability, and operations
· Leverage AI-assisted development tools (e.g., Cursor, Claude) where they provide clear value
Platform & Systems Support
Troubleshoot across a hybrid ecosystem including:
· On-prem VMs (Linux & Windows; VMware)
· Cloud platforms (AWS, GCP, Azure)
· Containerized environments (Kubernetes clusters)
Diagnose and resolve issues across:
· Networking (connectivity, latency, DB access interruptions)
· Kubernetes (ingress, environment variables, cluster-level issues)
· CDN and traffic management layers (Akamai, waiting rooms, and more)
Required Technical Skills & Experience
Core Engineering & Operations
· Strong experience in incident management and triage in production environments
· Proven ability to troubleshoot complex distributed systems under pressure
· Solid understanding of Linux systems administration (including performance, networking, NTP, etc.)
Cloud & Infrastructure
· Hands-on experience with AWS core services (S3, Lambda, Load Balancers, ECS, EC2)
· Familiarity with GCP and/or Azure environments
· Experience operating in multi-cloud and hybrid environments
Containers & Orchestration
· Experience troubleshooting Kubernetes clusters (pods, ingress, configuration issues)
· Understanding of containerized application architectures
DevOps & CI/CD
· Strong knowledge of DevOps practices and CI/CD pipelines
Hands-on experience with:
· Harness
· GitHub and/or GitLab
Application & Technology Stack Awareness
Working knowledge of:
· Java, Node.js, React-based applications
Understanding of database connectivity and dependencies across:
· Oracle, MariaDB, MSSQL (no DBA ownership, but strong troubleshooting awareness required)
Networking
Strong foundational knowledge of:
· TCP/IP, DNS, HTTP(S)
· Load balancing and network troubleshooting
· Diagnosing connectivity issues between services and databases
Preferred Qualifications
· Experience in large-scale enterprise (Fortune 500) environments supporting mission-critical applications
· Prior experience as an Incident Commander or similar leadership role during outages
· Familiarity with Akamai CDN and traffic management tools
· Experience in high-volume, high-availability production environments