Posted 12 June, 2026

Senior Site Reliability Engineer (HPC/Cloud)

Virtusa
Chennai, Tamil Nadu, India
Full Time
Reference: 55_537753_77396

Key Responsibilities:

  • Respond to and resolve operational incidents, identify root causes for critical issues, and implement strategies to prevent recurrence and improve platform resiliency.

  • Proactively create and manage monitoring, logging, and alerting systems to ensure high availability, performance, and visibility across all services.

  • Take a Site Reliability Engineering approach to our services, improving deployment, monitoring, and incident response end to end.

  • Solve complex technical problems involving SCP applications, infrastructure, and end users' use of the services.

  • Administer platform tools like Ansible, Vault, Consul, Prometheus, and Grafana to support core functions like configuration management, secrets management, monitoring, and observability.

  • Mentor and coach junior engineers in the team, fostering a collaborative and high-performing culture.

  • Drive automation for deployment and management processes using GitOps workflows as well as CI/CD pipelines.

Essential Knowledge, Skills, and Experience:

  • Experience administering, maintaining, and troubleshooting Linux environments

  • Competent in automation and Bash scripting

  • Highly customer-focused; able to explain technical IT concepts in a way that non-IT experts can understand

  • Hands-on experience working in a DevOps team and using agile methodologies

Plus some of the following areas of expertise:

  • Hands-on knowledge of a range of scientific and HPC applications such as simulation software, bioinformatics tools or 3D data visualization packages

  • Experience administering and optimizing Slurm

  • Experience deploying and administering OpenStack

  • Experience with configuration automation and infrastructure as code (e.g. Ansible, HashiCorp Terraform, AWS CloudFormation, AWS Cloud Development Kit)

  • Experience deploying infrastructure and code to public cloud, especially AWS

  • Experience with software distribution frameworks such as EasyBuild or Spack

  • Familiarity with container runtimes such as Docker, Singularity or enroot

  • Experience with frameworks for regression testing and benchmarking of HPC applications, such as ReFrame
