Posted 24 June, 2026
Technical Lead - Site Reliability Engineer
Mumba Technologies, Inc.
Tirunelveli, TN, IN
Full Time
Reference: 4097d693af748c85
Job Description
About your role:
Technical Leadership & Architecture
- Own and drive the technical direction for your team's infrastructure systems, making architectural decisions that balance reliability, scalability, and cost.
- Design systems of moderate to high complexity using distributed systems best practices; anticipate future use cases and minimize technical debt.
- Conduct architectural reviews and advance design patterns across the organization.
- Identify and implement improvements to existing software architecture; define and expand design patterns to solve common platform problems.
- Define and enforce security best practices across team-owned systems; proactively surface gaps to senior leadership.
Reliability & Operational Excellence
- Own the reliability posture of team-owned services — establish SLOs, monitor SLAs, and hold the team accountable to them.
- Lead incident response for complex, multi-service issues; systematically debug, identify root causes, and ensure issues do not recur.
- Establish standards for logging, monitoring, and operationalization across all team-owned systems.
- Foresee potential operational issues and implement preventative measures to safeguard the customer experience.
- Participate in and help lead the on-call rotation; ensure production systems are appropriately instrumented.
Project & Delivery Ownership
- Act as DRI (Directly Responsible Individual) for medium-to-large SRE projects spanning months and involving cross-team collaboration.
- Partner with Engineering Managers and Product Managers to scope roadmap initiatives, break down work into actionable increments, and commit to delivery plans.
- Negotiate scope effectively when required, ensuring adjustments align with customer needs and project goals.
- Proactively identify and resolve project risks — dependencies, architectural drift, and staffing blockers — before they impact delivery.
What We Are Looking For
Required Experience
- 7-10 years of experience in Site Reliability Engineering, DevOps, or Platform Engineering in a production cloud environment.
- 5+ years of hands-on experience with AWS cloud services across compute, networking, storage, and security.
- 5+ years managing Linux-based production environments at scale.
- 5+ years using Infrastructure-as-Code (Terraform, CDK, CloudFormation) and/or GitOps best practices.
- 3+ years operating and troubleshooting production Kubernetes environments.
- 3+ years applying AWS Well-Architected Framework principles across reliability, security, performance, and cost pillars.
- 3+ years applying cloud security best practices, including IAM, secrets management, network security, and compliance.
- 3+ years working with PostgreSQL in production: performance tuning, replication, backup, and recovery.
- Demonstrated track record of leading multi-person technical projects from scoping through delivery.
Technical Skills
- Strong general programming skills; comfort writing automation scripts and tooling in Python, Go, or similar.
- Deep knowledge of observability tooling — metrics, logging, distributed tracing — and how to use them to drive reliability.
- Solid understanding of data retention, backup, and recovery processes across cloud-native systems.
- Experience with CI/CD pipelines, release management, and deployment automation.
- Familiarity with service mesh, API gateway patterns, and microservices architectures.
AI Fluency
- Experience using AI-assisted workflows across the SDLC, with an emphasis on production reliability, operability, and maintainability of large-scale systems (design, deployment, monitoring, incident response).
- Hands-on experience integrating LLMs or AI systems into production environments, with a focus on reliability, latency, observability, and failure handling (e.g., automated triage, incident copilots, runbook automation).
- Familiarity with agent-based or workflow automation systems applied to operational use cases such as alert triage, remediation loops, system diagnostics, or automated runbook execution.
- Demonstrated ability to apply AI tools to improve system reliability, reduce MTTR, automate operational workflows, and enhance observability and alerting systems.
- Working knowledge of LLMs, embeddings, RAG, and their operational constraints in production systems (latency, cost, drift, safety, and observability).
- Ability to identify opportunities where AI can meaningfully improve system reliability, on-call efficiency, incident response, and infrastructure automation.
Nice to have (SRE):
- Experience handling model degradation, fallback strategies, and cost anomalies.
Leadership & Collaboration
- Proven ability to lead technical discussions, drive alignment across engineering and product, and communicate decisions clearly to stakeholders.
- Experience mentoring junior and mid-level engineers in both technical skills and professional development.
- Able to operate independently with minimal supervision; comfortable making final technical decisions as DRI.
- Strong communication skills in English — written and verbal — with experience influencing cross-functional partners.
Thank you!