Associate Director - GTS Run
Roles & responsibilities
Own end-to-end engineering, reliability, availability, scalability, performance, and capacity planning for mission-critical Audit portfolio platforms, ensuring enterprise-grade operational excellence.
Provide 24x7x365 production support, including weekends and holidays, participating in a global follow-the-sun on-call model and ensuring rapid incident response and service restoration.
Ensure at least 99.0% availability (target: 99.9% or higher) through SRE best practices, including SLIs/SLOs, error budgets, proactive monitoring, intelligent alerting, and automated remediation.
Architect and operate secure, scalable, and resilient Azure cloud environments leveraging AKS, App Services, VM Scale Sets, Azure SQL, Data Lake, Azure Storage, and Microsoft Fabric for large-scale data and analytics workloads.
Implement and manage Infrastructure as Code using Terraform, ensuring consistent, repeatable, and compliant infrastructure provisioning across environments.
Drive DevOps and platform engineering practices using Azure DevOps, CI/CD pipelines, GitOps, and release automation, enabling faster and more reliable deployments.
Design and manage containerized and microservices architectures using AKS, including scaling, networking, security, service mesh integration, and zero-downtime deployments.
Implement deep observability using Azure Monitor, Application Insights, Log Analytics, KQL, and Azure Monitor managed service for Prometheus, enabling full-stack monitoring, distributed tracing, and performance insights.
Proactively monitor production systems to detect early signals of failures, prevent performance degradation, and eliminate capacity bottlenecks using predictive and AI-driven insights.
Build and integrate AIOps and AI-powered automation, including anomaly detection, predictive alerting, automated incident triage, and self-healing infrastructure.
Lead major incident management, root cause analysis (RCA), and problem management, ensuring blameless postmortems and continuous reliability improvements.
Design and validate high availability and disaster recovery architectures, including multi-region deployments, failover strategies, backup/restore, and RTO/RPO adherence.
Plan and execute ITDR drills, disaster recovery testing, and audit readiness activities, including evidence collection and compliance validation.
Manage the environment lifecycle, including infrastructure upgrades, secure deployments, vulnerability remediation, patching, and end-of-life transitions.
Implement strong security and secrets management using Azure Key Vault, managed identities, RBAC, and zero-trust architecture principles.
Ensure compliance with enterprise and regulatory standards through policy enforcement, audit controls, and governance frameworks.
Optimize cloud spend using FinOps practices, including cost allocation, tagging, rightsizing, reserved instances, and continuous cost-performance optimization.
Manage and optimize data platforms including Data Lake, Azure Storage, and Fabric, ensuring high availability, scalability, data integrity, and performance.
Establish advanced capacity planning and forecasting models, leveraging historical telemetry and AI-driven predictions.
Automate operational workflows using scripting (PowerShell, Python, Bash) and orchestration tools to minimize manual intervention and reduce MTTR.
Drive resilience engineering practices including chaos engineering, failure testing, and system hardening to improve overall platform reliability.
Collaborate with engineering, architecture, cloud, security, and business teams to design and operate scalable, secure, and compliant solutions.
Build and maintain comprehensive technical documentation, runbooks, and operational playbooks for consistent and efficient support.
Act as a senior technical leader and individual contributor, influencing architecture decisions, driving innovation, and mentoring engineers across teams, with no direct people-management responsibility.
This role is for you if you have the following:
Educational qualifications
Bachelor's degree in Computer Science
Work experience
10+ years of experience