Posted 24 June, 2026
Service Operations Specialist
StarZen
Haryāna, HR, IN
Full Time
Reference: d1d59f5372489f86
Job Description
JOB OVERVIEW
JOB TITLE
Service Operations Architect
LOCATION
Offshore
ENGAGEMENT TYPE
Multiple concurrent enterprise engagements across healthcare, public-sector and AI/cloud platforms
GENERAL JOB DESCRIPTION
Operational excellence and SRE lead for hybrid environments. Owns service reliability across traditional 3-tier stacks (IIS / Tomcat / MS SQL), cloud-native Kubernetes microservices, and GPU-accelerated AI inference workloads. Drives ITIL v4 and SRE practices, SLA adherence, DR readiness and observability.
DUTIES & RESPONSIBILITIES
- Define and govern incident, problem, change and release management aligned to client SLAs.
- Own DR plan execution and failover/failback runbooks using VMware Site Recovery Manager, Velero and multi-region cloud strategies.
- Design observability across application, database, integration, microservices and GPU tiers (Grafana, Prometheus, ELK, Zabbix, Azure Monitor).
- Lead post-incident reviews and continuous improvement.
- Ensure compliance with security baselines, patching cadence and access controls.
- Drive automation (Ansible, Terraform, PowerShell, Bash) of operational tasks and routine maintenance.
- Coordinate Level-1, Level-2 and Level-3 teams across concurrent engagements.
- Produce monthly operations and SLA reports.
SKILLS & ABILITIES
- Strong ITIL v4 and SRE practice expertise.
- Hands-on with IIS, Tomcat, Windows Server and Linux administration.
- Strong Kubernetes operations (AKS / EKS / GKE / on-prem); Rafay or other managed-K8s platforms a plus.
- Observability stacks: Grafana, Prometheus, ELK, Zabbix, Azure Monitor / App Insights.
- Familiarity with GPU operations: NVIDIA DCGM, GPU scheduling, inference service reliability.
- DR / BCP design including VMware vSphere + Site Recovery Manager.
- Infrastructure-as-Code (Terraform, Ansible).
POTENTIAL BACKGROUND
- Bachelor's degree in CS / Engineering or a related field.
- 8+ years in service operations / SRE, including at least 3 years leading a 24x7 service.
- Exposure to healthcare, public-sector or AI/cloud platforms preferred.
- ITIL v4 certification preferred; CKA/CKAD a plus.
PREFERRED TOOLS / SOFT SKILLS
Preferred tools:
- ServiceNow, Jira Service Management
- Grafana, Prometheus, ELK, Zabbix, Azure Monitor
- VMware vSphere, Site Recovery Manager
- Kubernetes (AKS / EKS / GKE), Helm
- Rafay (managed K8s), NVIDIA DCGM
- Terraform, Ansible, PowerShell, Bash
Soft skills:
- Calm under pressure
- Structured incident commander
- Strong written and verbal reporting
- Customer-obsessed mindset