Posted 24 June, 2026

Service Operations Specialist

StarZen
Gurugram, HR, IN Full Time
Reference: 49c58ec7955c6ba9

Job Description

JOB OVERVIEW
JOB TITLE
Service Operations Architect
LOCATION
Offshore
ENGAGEMENT TYPE
Multiple concurrent enterprise engagements across healthcare, public-sector and AI/cloud platforms
GENERAL JOB DESCRIPTION
Operational Excellence and SRE lead for hybrid environments. Owns service reliability across traditional 3-tier stacks (IIS / Tomcat / MS SQL), cloud-native Kubernetes microservices, and GPU-accelerated AI inference workloads. Drives ITIL v4 and SRE practices, SLA adherence, DR readiness and observability.
DUTIES & RESPONSIBILITIES
Define and govern incident, problem, change and release management aligned to client SLAs.
Own DR plan execution and failover/failback runbooks using VMware Site Recovery Manager, Velero and multi-region cloud strategies.
Design observability across application, database, integration, microservices and GPU tiers (Grafana, Prometheus, ELK, Zabbix, Azure Monitor).
Lead post-incident reviews and continuous improvement.
Ensure compliance with security baselines, patching cadence and access controls.
Drive automation (Ansible, Terraform, PowerShell, Bash) for operational tasks and routine maintenance.
Coordinate Level-1, Level-2 and Level-3 teams across concurrent engagements.
Produce monthly operations and SLA reports.
SKILLS & ABILITIES
Strong ITIL v4 and SRE practice expertise.
Hands-on with IIS, Tomcat, Windows Server and Linux administration.
Strong Kubernetes operations (AKS / EKS / GKE / on-prem); Rafay or other managed-K8s platforms a plus.
Observability stacks: Grafana, Prometheus, ELK, Zabbix, Azure Monitor / App Insights.
Familiarity with GPU operations: NVIDIA DCGM, GPU scheduling, inference service reliability.
DR / BCP design including VMware vSphere + Site Recovery Manager.
Infrastructure-as-Code (Terraform, Ansible).
POTENTIAL BACKGROUND
Bachelor's degree in CS / Engineering or a related field.
8+ years in service operations / SRE with at least 3 years leading a 24x7 service.
Exposure to healthcare, public-sector or AI/cloud platforms preferred.
ITIL v4 certification preferred; CKA/CKAD a plus.
PREFERRED TOOLS / SOFT SKILLS
Preferred tools:
ServiceNow, Jira Service Management
Grafana, Prometheus, ELK, Zabbix, Azure Monitor
VMware vSphere, Site Recovery Manager
Kubernetes (AKS / EKS / GKE), Helm
Rafay (managed K8s), NVIDIA DCGM
Terraform, Ansible, PowerShell, Bash
Soft skills:
Calm under pressure
Structured incident commander
Strong written and verbal reporting
Customer-obsessed mindset
