Posted 17 May, 2026

Executive Manager

PepsiCo
Hyderabad, TS, IN, 500075 Full Time
Reference: 97_638873_2026-452883

Overview

The AI Observability Senior Engineer is a seasoned individual contributor who partners with Jr. AI Observability Architects to deliver high-quality, production-grade observability capabilities across the enterprise AI platform. This role brings deeper technical experience and greater independent execution capability - owning delivery across one or two specialization tracks with reduced need for day-to-day direction - while working as a genuine peer within the team rather than in a supervisory or mentorship capacity.

The Sr. Engineer is expected to be a strong, self-sufficient technical contributor who can take a complex observability requirement from design through implementation and into production operation within their assigned tracks. They bring cross-track awareness that helps the team as a whole, contribute to engineering standards discussions, and collaborate with peer architects on solving shared technical challenges - all without requiring supervisory authority.

Responsibilities

1. Observability Platform Engineering & OTEL Integration (25%)

  • Design and implement OpenTelemetry (OTEL) instrumentation within one or two assigned agent frameworks or platforms - including custom exporters, span enrichers, semantic convention tagging, and distributed trace context propagation - with the ability to work independently from requirements through to production deployment.
  • Build and maintain telemetry pipeline components (collectors, processors, exporters) that reliably route metrics, logs, traces, and semantic signals to observability backends - owning the full lifecycle of assigned pipeline components including testing, deployment, and on-call support.
  • Contribute to the integration of OTEL with enterprise agentic platforms (Salesforce AgentForce, ServiceNow, Microsoft Agent 365, or internal frameworks) within the assigned scope - implementing instrumentation to architecture patterns established by the L11.
  • Develop and maintain observability dashboards, alerting rules, and SLO/SLA definitions for assigned sub-domains - validating signal quality and tuning alert thresholds to achieve low false-positive rates.
  • Participate in on-call rotations and production incident response - contributing to RCA documentation, runbook authoring, and post-incident improvement actions.
  • Write comprehensive unit, integration, and end-to-end tests for all telemetry components owned; maintain >80% test coverage across assigned services and proactively identify gaps in existing coverage.

2. Safety, Security & Red Teaming Observability (15%)

  • Implement safety-critical signal capture within assigned telemetry pipelines - building reliable instrumentation for guardrail trigger rates, policy violation events, adversarial detection flags, hallucination indicators, and trust boundary crossing alerts.
  • Build observability components that support red team exercises - instrumenting assigned agent systems to capture adversarial test events, behavioral deviations, and attack surface signals in a measurable, repeatable way.
  • Implement secure trace handling patterns within assigned pipelines - applying data masking, PII redaction, and audit-log retention configurations as specified by the security architecture.
  • Contribute to the Security Observability Playbook - documenting assigned instrumentation patterns, updating escalation procedures based on observed incidents, and maintaining accuracy of the playbook sections within scope.
  • Monitor agent-to-agent protocol traffic (A2A, UCP, AP2) within the assigned domain for anomalous patterns - flagging deviations for review in a timely manner with sufficient diagnostic context.

3. Responsible AI (RAI) & Governance Signal Instrumentation (10%)

  • Implement RAI signal collectors within assigned agent workflows - building reliable pipelines that capture fairness indicators, bias detection outputs, explainability scores, and content safety classifications with validated data quality.
  • Maintain RAI telemetry pipelines within scope - ensuring completeness, accuracy, and timeliness of governance signals that feed into compliance dashboards, and resolving data quality issues proactively.
  • Ensure all AI decision traces within the assigned domain include required governance metadata and comply with retention policies - contributing to the audit-readiness of the observability platform.
  • Identify and document RAI signal coverage gaps within the assigned scope - reporting findings to the L11 with sufficient detail to inform remediation planning.

4. Quality Engineering for Agentic Solutions - Post Go-Live & Continuous QE (15%)

  • Build and maintain quality gate components within CI/CD pipelines for assigned agent services - implementing regression detection logic, performance degradation alerts, and SLA breach notifications using production observability data.
  • Instrument and monitor Skill Evaluations (evals) across assigned Memory, Skills, and MCP harness components - collecting eval telemetry, tracking pass/fail trends over time, and alerting on regression thresholds with appropriate context.
  • Implement continuous quality monitoring for post-go-live agentic solutions within scope - tracking agent success rates, tool-call fidelity, latency distributions, and user outcome proxies against defined baselines.
  • Execute structured testing of new agent capabilities using standardized eval harnesses - documenting results clearly, flagging anomalies, and contributing findings to quality improvement cycles.
  • Build and maintain automated quality reports and metric dashboards for assigned areas - ensuring stakeholders have timely, accurate visibility into agent behavioral quality and trend direction.

5. Memory, Skills, MCP & Harness Engineering Observability (10%)

  • Instrument agent memory operations within the assigned scope - building reliable monitoring of read/write latency, cache hit rates, memory staleness, and semantic drift across episodic, semantic, and working memory backends.
  • Add trace instrumentation to MCP server interactions within assigned components - implementing OTEL semantic tagging for tool registrations, skill invocations, context injections, and result returns.
  • Capture telemetry for self-evolving harness and RL system components as assigned - implementing signal capture for reward distributions, policy update events, environment state transitions, and convergence indicators.
  • Monitor eval harness execution within assigned scope - building detection for flaky eval environments, setup failures, and result inconsistencies that could obscure real capability regressions.

6. Python Engineering & Data Science Observability (10%)

  • Write production-quality Python for assigned observability components - custom OTEL exporters, signal aggregators, data transformation pipelines, and anomaly detection logic - consistently meeting team engineering standards for code quality, testing, and documentation.
  • Apply data science methods to assigned telemetry data - time-series analysis, statistical threshold tuning, distribution characterization - to improve signal accuracy and reduce alerting noise within the assigned domain.
  • Contribute to shared Python SDK and library components - implementing well-tested, documented additions that improve OTEL onboarding experience for agent developers.
  • Actively participate in code reviews - both receiving feedback from peers and contributing constructive technical review of peer engineers' pull requests within areas of expertise.

7. Agent Fleet, Physical AI & Multi-Modal Observability (5%)

  • Implement telemetry for agent fleet coordination components as assigned - building signal capture for spawn/termination events, inter-agent message traces, load distribution metrics, and fleet-level health indicators.
  • Contribute to observability instrumentation for physical AI or multi-modal pipelines within the assigned scope - focusing on latency, data quality, and reliability signals as directed by the L11 architecture.
  • Document instrumentation patterns for fleet, physical AI, and multi-modal components - ensuring observability approaches are reproducible and transferable to other team members.

8. Agentic Marketplace, Registry & Agent Protocol Observability (5%)

  • Instrument assigned Agentic Marketplace and Agent Registry components with usage telemetry - building signal capture for agent invocations, capability health, adoption patterns, and dependency relationships within scope.
  • Implement protocol observability for assigned A2A, UCP, and AP2 communication flows - capturing message latency, error rates, retry patterns, and trust boundary events with sufficient granularity for incident diagnosis.
  • Contribute to Marketplace Observability Dashboard development - building data connectors, metric calculations, and visualization components for assigned areas as directed.

9. Peer Collaboration, Standards Contribution & Continuous Learning (5%)

  • Collaborate actively and constructively with peer Jr. AI Observability Architects - sharing technical knowledge, co-designing solutions to shared problems, and contributing to a high-quality, high-trust team environment.
  • Contribute to engineering standards discussions - bringing informed technical perspectives on OTEL conventions, instrumentation patterns, and telemetry design decisions based on hands-on experience in assigned tracks.
  • Participate fully in agile ceremonies - sprint planning, stand-ups, retrospectives - contributing accurate estimates, early identification of blockers, and transparent delivery status updates.
  • Stay current with evolving OTEL specifications, agent communication protocols, AI safety research, and observability tooling - proactively applying new knowledge to improve the quality and coverage of assigned work.
  • Contribute to internal documentation, engineering wikis, and instrumentation guides - ensuring that the approaches used in assigned tracks are clearly documented and accessible to the broader team.

Qualifications

  • Bachelor's or Master's degree in Computer Science, Software Engineering, AI/ML, Data Science, or a related technical field.
  • 10+ years of professional experience in software and AI/ML engineering or platform engineering, including at least 2 years of hands-on work in observability, distributed systems monitoring, or telemetry pipeline development.
  • Demonstrated experience delivering production-grade software end-to-end - from design through deployment and on-call operation - in a collaborative team environment.
  • Experience working in or adjacent to AI/ML platform, data engineering, or cloud infrastructure roles; exposure to agentic AI systems or LLM pipelines is a strong plus.
    • Technical Proficiency: Implement monitoring for agent failure modes: tool-call failures, infinite loops, timeouts, hallucination risk signals, retrieval misses, and degraded response quality. Create alerts aligned to operational SLOs (availability, latency, tool reliability) and AI-specific indicators (cost spikes, loop bursts, retrieval anomalies). Support guardrail observability: policy blocks, content filtering events, and safety classifier outcomes (where applicable). Build onboarding automation (IaC, templates, CI checks) that makes observability "default-on" for all agentic services.
    • Problem-Solving: Ability to translate business challenges into technical solutions.
    • Collaboration Skills: Effective at working within cross-functional teams.
    • Agility: Flexibility to adapt to changing requirements and new technologies.
    • Communication Skills: Capable of explaining complex technical concepts to non-technical stakeholders.
    • Observability & OpenTelemetry: Solid hands-on proficiency with OpenTelemetry (OTEL) SDK instrumentation - custom exporters, collector configuration, semantic conventions, and distributed trace propagation. Able to independently instrument a service and validate signal quality end-to-end.
    • Python Engineering: Strong Python development skills - clean, well-tested, production-ready code. Familiarity with async patterns, type hints, testing frameworks (pytest), and CI/CD integration. Able to build and maintain Python-based telemetry tooling with minimal guidance.
    • Distributed Systems: Good working knowledge of microservices, event streaming (Kafka or equivalent), REST/gRPC API design, and containerized deployment (Docker, Kubernetes). Able to reason about distributed failure modes and their observability implications.
    • Cloud Platforms: Hands-on experience with at least one major cloud provider (Azure, AWS, or GCP) - managed services, IAM, storage, and cost awareness sufficient to make responsible deployment decisions.
    • Data Analysis Applied to Telemetry: Ability to query, analyze, and interpret time-series and log data using Grafana, Datadog, Prometheus, Splunk, or equivalent - including threshold tuning and basic statistical interpretation of signal distributions.
    • CI/CD & DevOps: Working experience with CI/CD pipelines, GitOps practices, automated testing, and infrastructure-as-code concepts sufficient to contribute to and extend existing pipeline configurations.
    • AI/ML Awareness: Familiarity with LLM-based workflows, agentic AI concepts, and common agent patterns (tool/function calling, RAG, memory, multi-step planning) - sufficient to understand observability requirements without needing deep ML expertise.
    • Safety & Security Fundamentals: Basic understanding of AI safety concepts (guardrails, policy enforcement, prompt injection) and data security practices (PII handling, access control, audit logging) as applied to telemetry systems.
    • Quality Engineering Basics: Familiarity with software quality concepts - regression detection, eval frameworks, test harnesses - and the ability to implement quality gate components within CI/CD pipelines using observability data.
    • RAI Awareness: Working knowledge of Responsible AI principles - fairness, explainability, bias - sufficient to implement signal capture pipelines for RAI governance requirements as specified.
    • Direct experience with agentic AI frameworks such as LangChain, LangGraph, AutoGen, Semantic Kernel, CrewAI, or Bedrock Agents.
    • Familiarity with MCP (Model Context Protocol), A2A, UCP, or AP2 agent communication protocols.
    • Exposure to reinforcement learning concepts, RL training infrastructure, or self-supervised learning pipelines.
    • Experience contributing to or consuming developer-facing Python SDKs or observability libraries.
    • Background with vector databases (Pinecone, Weaviate, pgvector) or semantic search in the context of RAG pipeline observability.
    • Contributions to open-source observability or AI tooling projects.
    • Familiarity with AI safety frameworks, adversarial ML concepts, or red team tooling applied to LLM systems.
Employment Type: FULL_TIME
