Posted 31 May, 2026
AgenticOps Platform Engineer Lead
Bridge AI
Varanasi, UP, IN
Full Time
Reference: 732e3f177d64434e
Job Description
We are looking for a senior, hands-on AgentOps Platform Engineer to design, build, and operate the cloud-native infrastructure that powers our AI agents at scale.

This is a lead-by-example role:
You write the Terraform
You build the pipelines
You own the platform in production

GCP is your primary environment, but you will design with multi-cloud in mind (AWS, Azure), ensuring portability, resilience, and long-term flexibility. This role sits at the intersection of DevOps, MLOps, and AgentOps, with deep responsibility for reliability, security, observability, and cost.

KEY RESPONSIBILITIES

Platform & Infrastructure Ownership
Design, build, and operate production-grade infrastructure for AI agents and LLM services
Own Terraform-based Infrastructure as Code for all environments (dev, uat, prod)
Lead infrastructure decisions through hands-on implementation, not diagrams
Build scalable foundations for: agent orchestration, inference services, RAG pipelines, vector stores
Optimise cloud resources for performance and cost efficiency

AgentOps & AI Platform Enablement
Enable safe, continuous operation of autonomous agents
Design agent runtime environments with: isolation and sandboxing, failover and recovery strategies, controlled rollout mechanisms
Support prompt/version management, agent configuration, and tool/plugin lifecycle
Work closely with Agentic RAG engineers to operationalise research into production

CI/CD & Automation
Build and maintain CI/CD pipelines for: infrastructure, agent services, prompt and config changes, model/version rollouts
Automate workflows for: vector DB updates, RAG index refreshes, agent memory stores, tool registration and validation
Reduce manual ops toil aggressively through automation

Observability & Production Readiness
Design and implement deep observability for agent systems: platform health; agent execution metrics; latency, cost, and throughput; failure modes and retries
Build dashboards, alerts, and telemetry using: Prometheus, Grafana, OpenTelemetry (or equivalent)
Enable visibility into agent decision traces and runtime behavior

Security, Safety & Reliability
Implement secure cloud architecture and IAM best practices
Own production reliability, incident response, and recovery
Enforce operational guardrails and safety controls for agent APIs
Support responsible AI practices from an infrastructure and runtime perspective

Collaboration & Technical Leadership
Work closely with: Agentic RAG engineers, AI engineers, Product & CTO Office
Define SLOs, reliability targets, and operational metrics
Set the technical bar for AgentOps at BridgeAI
Mentor engineers by example and code, not process overhead

REQUIRED SKILLS & EXPERIENCE

Core Platform & DevOps
5+ years in DevOps, Platform Engineering, SRE, or MLOps
Strong, hands-on experience with GCP: GKE / Compute Engine; Cloud Run / Functions; Cloud Storage, Pub/Sub; Vertex AI (or equivalent)
Deep experience with Terraform (mandatory)

Containers, CI/CD & Automation
Docker, Kubernetes, Helm
CI/CD tooling (GitHub Actions, Jenkins, ArgoCD)
Python and Bash for automation and platform glue code

Agentic & AI Systems
Experience supporting LLM-based systems in production
Understanding of: prompt/version management, context handling and caching, model rollout strategies
Hands-on experience with vector databases (Weaviate, FAISS, Pinecone)
Familiarity with RAG pipelines and agent execution patterns

Observability & Security
Monitoring and telemetry using Prometheus, Grafana, OpenTelemetry
Strong understanding of cloud security, IAM, and operational safety

NICE TO HAVE

Multi-cloud experience (AWS, Azure)
Exposure to agent frameworks (LangChain, LangGraph, AutoGen, CrewAI)
Event-driven systems (Temporal, Airflow)
Experience with responsible AI operations or safety monitoring

WHAT SUCCESS LOOKS LIKE

Infrastructure is reproducible, observable, and boring (in a good way)
Agent failures are visible, debuggable, and recoverable
Cloud costs are understood and controlled
Engineers trust the platform and move faster because of it
You are the go-to authority for AgentOps at BridgeAI

WHAT THIS ROLE IS (AND IS NOT)

Deeply hands-on
Terraform-first
Production ownership
Sets standards by building
Not a people-manager role
Not a ticket-based ops role
Not a “just keep the lights on” job