AI-Native SRE Architect / Reliability Transformation Lead
Job Description
Experience - 15 to 20 years
Location - Bengaluru/Chennai/Hyderabad/Pune/Noida
We're Hiring: AI-Native SRE Architect / Reliability Transformation Lead
Are you passionate about building resilient, scalable, and automation-first enterprise platforms?
We are looking for an experienced SRE leader to drive enterprise-wide reliability transformation across cloud, digital, SaaS, industrial, AI-enabled, and mission-critical platforms. This role combines technical expertise with strategic transformation leadership to modernize operations through SRE, AI-driven observability, autonomous operations, platform engineering, and intelligent automation practices.
As a trusted advisor and hands-on transformation leader, you will partner with engineering, operations, cloud, platform, architecture, security, and AI-engineering teams to institutionalize modern Site Reliability Engineering practices at scale.
Key Responsibilities
- Define and drive enterprise SRE strategy, governance, and reliability standards.
- Establish SLOs, SLIs, Error Budgets, and reliability KPIs.
- Define modern observability standards leveraging OpenTelemetry, AI-powered analytics, event intelligence, and intelligent monitoring platforms.
- Drive autonomous operations through AI-assisted troubleshooting, predictive incident prevention, intelligent remediation, self-healing, and operational copilots.
- Enable automation-first operations using Infrastructure as Code (IaC), GitOps, CI/CD, platform engineering, and policy-driven operational workflows.
- Partner with platform engineering teams to improve developer experience through self-service platforms, deployment patterns, golden paths, and reliability guardrails.
- Drive reliability engineering practices for AI-enabled platforms, including operational resilience, observability, governance, scalability, and performance optimization for LLM-based and intelligent systems.
- Lead incident management initiatives including intelligent RCA, operational analytics, resiliency engineering, and continuous reliability improvement programs.
- Coach and mentor engineering and operations teams on modern reliability engineering principles.
- Collaborate with cloud, platform, architecture, and security teams on enterprise modernization initiatives.
Requirements
- Strong expertise gained in SRE, Platform Engineering, Production Systems Engineering, DevOps, or Cloud Engineering leadership roles.
- Hands-on experience with cloud platforms, observability, automation, and operational excellence.
- Expertise with modern observability and telemetry platforms such as Dynatrace, Splunk, OpenTelemetry, and comparable monitoring tools.
- Experience in incident management, RCA, resiliency engineering, and automation.
- Strong stakeholder management, mentoring, and communication skills.
- Experience across enterprise ecosystems including Azure, Java/.NET, SAP, Salesforce, SaaS/COTS, and legacy platforms is a plus.
- Exposure to reliability and operational governance for AI/ML or LLM-enabled systems is highly desirable.
Leadership Attributes
- Enterprise transformation mindset with the ability to influence cross-functional engineering and operations organizations.
- Strong balance of strategic thinking and hands-on technical depth.
- Ability to drive cultural transformation toward reliability ownership, automation-first engineering, and platform-centric operating models.
- Passion for modern engineering practices, AI-enabled operations, developer productivity, and operational innovation.