Posted 24 June, 2026

Service Operations Specialist

StarZen

Haryāna, HR, IN Full Time

Reference: d1d59f5372489f86

Job Description

JOB OVERVIEW

JOB TITLE

Service Operations Architect

LOCATION

Offshore

ENGAGEMENT TYPE

Multiple concurrent enterprise engagements across healthcare, public-sector and AI/cloud platforms

GENERAL JOB DESCRIPTION

Operational Excellence and SRE lead for hybrid environments. Owns service reliability across traditional 3-tier stacks (IIS / Tomcat / MS SQL), cloud-native Kubernetes microservices, and GPU-accelerated AI inference workloads. Drives ITIL v4 + SRE practices, SLA adherence, DR readiness and observability.

DUTIES & RESPONSIBILITIES

Define and govern incident, problem, change and release management aligned to client SLAs.
Own DR plan execution and failover/failback runbooks using VMware Site Recovery Manager, Velero and multi-region cloud strategies.
Design observability across application, database, integration, microservices and GPU tiers (Grafana, Prometheus, ELK, Zabbix, Azure Monitor).
Lead post-incident reviews and continuous improvement.
Ensure compliance with security baselines, patching cadence and access controls.
Drive automation (Ansible, Terraform, PowerShell, Bash) for operational tasks and routine maintenance.
Coordinate Level-1, Level-2 and Level-3 teams across concurrent engagements.
Produce monthly operations and SLA reports.

SKILLS & ABILITIES

Strong ITIL v4 and SRE practice expertise.
Hands-on with IIS, Tomcat, Windows Server and Linux administration.
Strong Kubernetes operations (AKS / EKS / GKE / on-prem); Rafay or other managed-K8s platforms a plus.
Observability stacks: Grafana, Prometheus, ELK, Zabbix, Azure Monitor / App Insights.
Familiarity with GPU operations: NVIDIA DCGM, GPU scheduling, inference service reliability.
DR / BCP design including VMware vSphere + Site Recovery Manager.
Infrastructure-as-Code (Terraform, Ansible).

POTENTIAL BACKGROUND

Bachelor's degree in CS, Engineering or a related field.
8+ years in service operations / SRE with at least 3 years leading a 24x7 service.
Exposure to healthcare, public-sector or AI/cloud platforms preferred.
ITIL v4 certification preferred; CKA/CKAD a plus.

PREFERRED TOOLS / SOFT SKILLS

Preferred tools:

ServiceNow, Jira Service Management
Grafana, Prometheus, ELK, Zabbix, Azure Monitor
VMware vSphere, Site Recovery Manager
Kubernetes (AKS / EKS / GKE), Helm
Rafay (managed K8s), NVIDIA DCGM
Terraform, Ansible, PowerShell, Bash

Soft skills:

Calm under pressure
Structured incident commander
Strong written and verbal reporting
Customer-obsessed mindset
