Posted 21 May, 2026
DevOps Observability Engineer
ARTech
Hyderabad
Full Time
Reference: 365_625640_26-17094
Job Title: Observability Engineer / Site Reliability Engineer (SRE)
Location: Hyderabad Only
Job Description:
We are looking for an experienced Observability / Site Reliability Engineer (SRE) with strong expertise in monitoring, cloud-native technologies, and automation. The ideal candidate has hands-on experience with Kubernetes environments, observability platforms, distributed tracing, and proactive incident management, and applies it to improve system reliability and performance.
Required Experience:
- 10+ years of overall IT Infrastructure experience.
- At least 8 years of experience in Observability, Monitoring, or Site Reliability Engineering (SRE) roles.
Required Skills:
- Strong expertise in Kubernetes and containerized environments.
- Hands-on experience with monitoring and observability tools such as Prometheus, Grafana, Datadog, and Dynatrace.
- Experience with distributed tracing tools like Jaeger and OpenTelemetry.
- Strong scripting and automation skills using Python or Go.
- Experience with logging and log analytics tools such as Splunk, ELK Stack, Fluentd, and Loki.
- Strong understanding of observability concepts, including metrics, logs, and traces.
- Experience working with cloud platforms such as AWS, Azure, or GCP and integrating observability solutions in cloud-native environments.
- Familiarity with databases such as MySQL and PostgreSQL.
- Hands-on experience with Infrastructure as Code (IaC) and deployment tooling such as Terraform or Helm.
Key Responsibilities:
- Design, implement, and maintain enterprise observability and monitoring solutions.
- Drive self-healing mechanisms, intelligent monitoring, and proactive incident response strategies.
- Collaborate with SRE, DevOps, Infrastructure, and Development teams to improve system reliability and operational efficiency.
- Implement monitoring, logging, tracing, and alerting solutions for cloud-native applications and Kubernetes platforms.
- Automate operational tasks, incident management workflows, and infrastructure monitoring processes.
- Perform root cause analysis (RCA), troubleshooting, and performance optimization activities.
- Ensure high availability, scalability, and reliability of enterprise applications and infrastructure environments.