Posted 24 June, 2026

Service Operations Specialist

StarZen
Haryāna, HR, IN | Full Time
Reference: d1d59f5372489f86

Job Description

JOB OVERVIEW

JOB TITLE

Service Operations Architect

LOCATION

Offshore

ENGAGEMENT TYPE

Multiple concurrent enterprise engagements across healthcare, public-sector and AI/cloud platforms

GENERAL JOB DESCRIPTION

Operational excellence and SRE lead for hybrid environments. Owns service reliability across traditional 3-tier stacks (IIS / Tomcat / MS SQL), cloud-native Kubernetes microservices, and GPU-accelerated AI inference workloads. Drives ITIL v4 and SRE practices, SLA adherence, DR readiness and observability.

DUTIES & RESPONSIBILITIES

  • Define and govern incident, problem, change and release management aligned to client SLAs.
  • Own DR plan execution and failover/failback runbooks using VMware Site Recovery Manager, Velero and multi-region cloud strategies.
  • Design observability across application, database, integration, microservices and GPU tiers (Grafana, Prometheus, ELK, Zabbix, Azure Monitor).
  • Lead post-incident reviews and continuous improvement.
  • Ensure compliance with security baselines, patching cadence and access controls.
  • Drive automation (Ansible, Terraform, PowerShell, Bash) of operational tasks and routine maintenance.
  • Coordinate Level-1, Level-2 and Level-3 teams across concurrent engagements.
  • Produce monthly operations and SLA reports.

SKILLS & ABILITIES

  • Strong ITIL v4 and SRE practice expertise.
  • Hands-on with IIS, Tomcat, Windows Server and Linux administration.
  • Strong Kubernetes operations (AKS / EKS / GKE / on-prem); Rafay or other managed-K8s platforms a plus.
  • Observability stacks: Grafana, Prometheus, ELK, Zabbix, Azure Monitor / App Insights.
  • Familiarity with GPU operations: NVIDIA DCGM, GPU scheduling, inference service reliability.
  • DR / BCP design including VMware vSphere + Site Recovery Manager.
  • Infrastructure-as-Code (Terraform, Ansible).

POTENTIAL BACKGROUND

  • Bachelor's degree in CS, Engineering or a related field.
  • 8+ years in service operations / SRE with at least 3 years leading a 24x7 service.
  • Exposure to healthcare, public-sector or AI/cloud platforms preferred.
  • ITIL v4 certification preferred; CKA/CKAD a plus.

PREFERRED TOOLS / SOFT SKILLS

Preferred tools:

  • ServiceNow, Jira Service Management
  • Grafana, Prometheus, ELK, Zabbix, Azure Monitor
  • VMware vSphere, Site Recovery Manager
  • Kubernetes (AKS / EKS / GKE), Helm
  • Rafay (managed K8s), NVIDIA DCGM
  • Terraform, Ansible, PowerShell, Bash

Soft skills:

  • Calm under pressure
  • Structured incident commander
  • Strong written and verbal reporting
  • Customer-obsessed mindset

