Site Reliability Engineer (SRE)
Role: Site Reliability Engineer (SRE)
Position Type: Contract
Duration: 12 months
Location: Gurgaon (ONSITE) - Arya Samaj Rd, Durga Colony, Sector 39, Gurugram, Haryana 122003
Shift Timing: 12:00 PM - 8:00 PM IST
PRIMARY DUTIES:
We are seeking a proactive and detail-oriented Site Reliability Engineer (SRE) with 3+ years of experience to ensure high availability, reliability, and performance of production systems.
This role focuses on automation, observability, incident management, and cross-team coordination to drive operational excellence.
Key Responsibilities
- Maintain reliable, scalable, and secure production environments.
- Implement and manage monitoring, alerting, and logging solutions.
- Contribute to defining and tracking SLIs/SLOs and support error budget practices.
- Automate operational tasks to improve efficiency and reduce manual effort.
- Perform troubleshooting and Root Cause Analysis (RCA) for production incidents.
- Optimize system performance, availability, and capacity.
- Maintain runbooks, SOPs, and incident documentation in Confluence.
- Adhere to change management, deployment governance, and disaster recovery standards.
- Support incident response for critical production services.
Collaboration & Tools
- Coordinate with external vendors and internal cross-functional teams.
- Work closely with Engineering, Product Owners, and Operations teams.
- Manage incidents and changes using ServiceNow & JIRA.
- Collaborate through Slack and structured communication channels.
Technical Skills
Systems & Cloud
- Strong knowledge of Windows and Linux/Unix systems.
- Solid understanding of networking fundamentals (DNS, TCP/IP, Load Balancing, Firewalls).
- Experience with at least one cloud platform (AWS, Azure, or GCP).
Automation & CI/CD
- Proficiency in at least one scripting/programming language (Python, Go, Bash, PowerShell, or Java).
- Understanding of CI/CD pipelines and automation practices.
Containers & Observability
- Hands-on experience with Docker and Kubernetes.
- Experience with monitoring and dashboarding tools such as Grafana or Power BI.
- Ability to analyze logs, metrics, and traces for troubleshooting.
ITSM & Documentation
- Experience with ServiceNow & JIRA (incident/change/problem workflows).
- Working knowledge of Confluence for technical documentation and knowledge management.
Additional Experience (Preferred)
- Background in DevOps, Cloud Engineering, or Platform Engineering.
- Understanding of security best practices and compliance standards.
- Familiarity with AI-assisted engineering tools (Claude Code, Jellyfish, GitHub Copilot).
- Exposure to large-scale or production-grade systems.
Interested candidates may send their updated resumes to:
[email protected] / [email protected]