Posted 23 June, 2026

Technical Lead - Site Reliability Engineer

Mumba Technologies, Inc.
Kolkata, WB, IN Full Time
Reference: 06bd89e2f64d4b36

Job Description

About your role:

Technical Leadership & Architecture
Own and drive the technical direction for your team's infrastructure systems, making architectural decisions that balance reliability, scalability, and cost.
Design systems of moderate to high complexity using distributed systems best practices; anticipate future use cases and minimize technical debt.
Conduct architectural reviews and advance design patterns across the organization.
Identify and implement improvements to existing software architecture; define and expand design patterns to solve common platform problems.
Define and enforce security best practices across team-owned systems; proactively surface gaps to senior leadership.

Reliability & Operational Excellence
Own the reliability posture of team-owned services: establish SLOs, monitor SLAs, and hold the team accountable to them.
Lead incident response for complex, multi-service issues; systematically debug, identify root causes, and ensure issues do not recur.
Establish standards for logging, monitoring, and operationalization across all team-owned systems.
Foresee potential operational issues and implement preventative measures to safeguard the customer experience.
Participate in and help lead the on-call rotation; ensure production systems are appropriately instrumented.

Project & Delivery Ownership
Act as DRI (Directly Responsible Individual) for medium-to-large SRE projects spanning months and involving cross-team collaboration.
Partner with Engineering Managers and Product Managers to scope roadmap initiatives, break down work into actionable increments, and commit to delivery plans.
Negotiate scope effectively when required, ensuring adjustments align with customer needs and project goals.
Proactively identify and resolve project risks (dependencies, architectural drift, and staffing blockers) before they impact delivery.

What We Are Looking For

Required Experience
7-10 years of experience in Site Reliability Engineering, DevOps, or Platform Engineering in a production cloud environment.
5+ years of hands-on experience with AWS cloud services across compute, networking, storage, and security.
5+ years managing Linux-oriented production environments at scale.
5+ years using Infrastructure-as-Code (Terraform, CDK, CloudFormation) and/or GitOps best practices.
3+ years operating and troubleshooting production Kubernetes environments.
3+ years applying AWS Well-Architected Framework principles across reliability, security, performance, and cost pillars.
3+ years in cloud security best practices including IAM, secrets management, network security, and compliance.
3+ years working with PostgreSQL in production: performance tuning, replication, backup, and recovery.
Demonstrated track record of leading multi-person technical projects from scoping through delivery.

Technical Skills
Strong general programming skills; comfort writing automation scripts and tooling in Python, Go, or similar.
Deep knowledge of observability tooling (metrics, logging, distributed tracing) and how to use them to drive reliability.
Solid understanding of data retention, backup, and recovery processes across cloud-native systems.
Experience with CI/CD pipelines, release management, and deployment automation.
Familiarity with service mesh, API gateway patterns, and microservices architectures.

AI Fluency
Experience using AI-assisted workflows across the SDLC, with an emphasis on production reliability, operability, and maintainability of large-scale systems (design, deployment, monitoring, incident response).
Hands-on experience integrating LLMs or AI systems into production environments, with a focus on reliability, latency, observability, and failure handling (e.g., automated triage, incident copilots, runbook automation).
Familiarity with agent-based or workflow automation systems applied to operational use cases such as alert triage, remediation loops, system diagnostics, or automated runbook execution.
Demonstrated ability to apply AI tools to improve system reliability, reduce MTTR, automate operational workflows, and enhance observability and alerting systems.
Working knowledge of LLMs, embeddings, RAG, and their operational constraints in production systems (latency, cost, drift, safety, and observability).
Ability to identify opportunities where AI can meaningfully improve system reliability, on-call efficiency, incident response, and infrastructure automation.

Nice to have (SRE):
Experience handling model degradation, fallback strategies, and cost anomalies.

Leadership & Collaboration
Proven ability to lead technical discussions, drive alignment across engineering and product, and communicate decisions clearly to stakeholders.
Experience mentoring junior and mid-level engineers in both technical skills and professional development.
Able to operate independently with minimal supervision; comfortable making final technical decisions as DRI.
Strong communication skills in English, written and verbal, with experience influencing cross-functional partners.

Thank you!
