Posted 21 May, 2026

DevOps Observability Engineer

ARTech
Hyderabad, Full Time
Reference: 365_625640_26-17094

Job Title: Observability / SRE Engineer

Location: Hyderabad Only


Job Description:

We are looking for an experienced Observability / Site Reliability Engineer (SRE) with strong expertise in monitoring, cloud-native technologies, and automation. The ideal candidate should have hands-on experience in Kubernetes environments, observability platforms, distributed tracing, and proactive incident management to improve system reliability and performance.


Required Experience:

  • 10+ years of overall IT Infrastructure experience.
  • Minimum 8 years of experience in Observability, Monitoring, or Site Reliability Engineering (SRE) roles.

Required Skills:

  • Strong expertise in Kubernetes and containerized environments.
  • Hands-on experience with monitoring and observability tools such as Prometheus, Grafana, Datadog, and Dynatrace.
  • Experience with distributed tracing tools like Jaeger and OpenTelemetry.
  • Strong scripting and automation skills using Python or Go.
  • Experience with logging and log analytics tools such as Splunk, ELK Stack, Fluentd, and Loki.
  • Strong understanding of observability concepts including metrics, logging, and tracing.
  • Experience working with cloud platforms such as AWS, Azure, or GCP and integrating observability solutions in cloud-native environments.
  • Familiarity with databases such as MySQL and PostgreSQL.
  • Hands-on experience with Infrastructure as Code (IaC) and deployment tools such as Terraform or Helm.

Key Responsibilities:

  • Design, implement, and maintain enterprise observability and monitoring solutions.
  • Drive self-healing mechanisms, intelligent monitoring, and proactive incident response strategies.
  • Collaborate with SRE, DevOps, Infrastructure, and Development teams to improve system reliability and operational efficiency.
  • Implement monitoring, logging, tracing, and alerting solutions for cloud-native applications and Kubernetes platforms.
  • Automate operational tasks, incident management workflows, and infrastructure monitoring processes.
  • Perform root cause analysis (RCA), troubleshooting, and performance optimization activities.
  • Ensure high availability, scalability, and reliability of enterprise applications and infrastructure environments.
