Posted 31 May, 2026
Kafka Senior Software Engineer - Pune - EAIS
ClifyX
India
Full Time
Reference: 365_594563_26-02771
| Description: Sr. Software Engineer - Kafka, Microservices, AWS & Cloud (5+ to 12 years of experience)
Required Skills & Experience: RSA Kafka/Axon JD (4 demands)

Role Overview
The Site Reliability Engineer (SRE) for the Axon / Kafka Platform is responsible for ensuring the reliability, availability, scalability, and operational excellence of Client's enterprise event streaming platform. Axon is a fully managed, multi-tenant, Kafka-based platform (Platform as a Service) that supports mission-critical, high-volume workloads across regions and environments. This role blends software engineering, distributed systems, and production operations, with a strong focus on incident management, observability, automation, and continuous reliability improvement. The scope and impact of responsibilities increase with job level, from hands-on execution to platform-level ownership and technical leadership.

Key Responsibilities

Platform Reliability & Availability
• Ensure high availability, fault tolerance, and performance of Kafka clusters and Axon platform services.
• Operate and improve reliability mechanisms for brokers, partitions, replicas, Schema Registry, and replication services.
• Define, track, and improve Service Level Indicators (SLIs) and Service Level Objectives (SLOs).

Incident Management & Root Cause Analysis
• Participate in on-call rotations and provide hands-on support during production incidents.
• Lead or contribute to incident triage, mitigation, and recovery for Kafka- and Axon-related issues.
• Perform root cause analysis (RCA) and drive corrective and preventive actions to closure.
• Partner with application, infrastructure, and security teams during high-severity incidents.

Monitoring, Alerting & Observability
• Design, implement, and maintain monitoring, alerting, and dashboards for Kafka and Axon services.
• Ensure incidents are detected proactively through alerts rather than through customer impact.
• Continuously tune alerts to reduce noise and improve signal quality.

Change, Release & Operational Governance
• Support production changes, maintenance activities, and platform upgrades (e.g., broker patching, certificate renewals, Schema Registry upgrades).
• Review change requests (CRQs), deployment plans, and validation steps to ensure operational readiness.
• Assess risk and ensure rollback and recovery plans are defined and tested.

Automation & Toil Reduction
• Automate repetitive operational tasks, health checks, and validation workflows.
• Improve operational efficiency through scripting, tooling, and platform enhancements.
• Reduce manual intervention and improve mean time to recovery (MTTR).

Platform Enablement & Collaboration
• Partner with application teams to support onboarding, scaling, and operational best practices.
• Provide guidance on Kafka usage patterns, consumer group behavior, partitioning, and resiliency.
• Create and maintain runbooks, SOPs, and operational documentation.
• Share learnings through post-incident reviews and knowledge-sharing forums.

Required Qualifications

Technical Skills
• Strong understanding of distributed systems and production operations.
• Hands-on experience with Apache Kafka or large-scale messaging/streaming platforms.
• Experience with monitoring, logging, and alerting tools (metrics, logs, dashboards).
• Proficiency in at least one scripting or programming language (e.g., Python, Bash, Java).
• Solid knowledge of Linux, networking fundamentals, and system troubleshooting.

Professional Experience
• Experience supporting mission-critical production systems with on-call responsibility.
• Proven ability to troubleshoot complex, cross-system issues under pressure.
• Experience working in environments with strong change management and operational governance.

Soft Skills
• Strong ownership mindset and accountability for production stability.
• Clear and effective communication during incidents and cross-team engagements.
• Ability to work collaboratively across globally distributed teams.
• Continuous improvement mindset with a focus on automation and reliability.

Preferred Qualifications
• Experience operating enterprise Kafka platforms (multi-cluster, multi-region).
• Familiarity with Schema Registry, replication, and security/encryption mechanisms.
• Experience supporting platforms with high availability and regulatory requirements.
• Exposure to cloud or hybrid infrastructure environments. | |||
|---|---|---|---|
|
558508 | ||
|
MAH | PUNE | ||
|
(No Value) | ||
|
(No Value) | ||
|
(No Value) | ||
|
Hari Prakash Yadlapalli | ||
|
5+ years | ||
|
Java, SB, MS, AWS, CI/CD | ||
|
Java, Spring Boot, Microservices, AWS, cloud technologies, Any AI tools for code generation | ||
|
Banking | ||
|
PreOB | ||
|
5+ to 12+ years | ||
|
NA | ||
|
NA | ||
|
Face to Face | ||
|
Hybrid | ||
|
9 AM to 6 PM | ||
|
General |