Kubernetes & Bare Metal Engineer, ISS
Kubernetes & Bare Metal Engineer - Member of Technical Staff
About Infrastructure Shared Services (ISS)
Infrastructure Shared Services (ISS) is responsible for Everpure's engineering infrastructure, development environments, and production-adjacent services across our global data centers and public cloud environments. We partner with internal engineering teams to deliver reliable, secure, and scalable platforms so they can focus on building high-quality products.
Within ISS, the Bare Metal Kubernetes Platform team designs, builds, and operates large-scale Kubernetes environments on bare metal servers, backed by Everpure arrays and Portworx, and integrated with ISS's observability, CI/CD, and multi-tenancy frameworks.
SHOULD YOU ACCEPT THIS CHALLENGE...
As a Kubernetes & Bare Metal Engineer, you will be a senior individual contributor responsible for designing, deploying, and operating large-scale bare-metal Kubernetes clusters and platform services in our on-prem data centers.
You will:
- Lead technical design and implementation for new cluster features and capabilities
- Own critical areas of the platform (e.g., cluster lifecycle, networking, storage, observability, or multi-tenancy)
- Drive reliability, performance, and security of the Kubernetes platform used by multiple business units
- Mentor other engineers and influence best practices across ISS and partner teams
Key Responsibilities
Platform Design & Architecture
- Design and evolve bare-metal Kubernetes architectures including control plane, worker nodes, networking, and storage integrations (Portworx on FlashArray/FlashBlade).
- Define standards for cluster lifecycle management (provisioning, upgrades, decommissioning) using tools like Kubespray, Foreman, and internal CD pipelines.
- Contribute to the design of multi-tenant, secure clusters including RBAC, OIDC/SSO, namespace isolation, and quota/limit strategies.
Implementation & Operations
- Deploy, operate, and continuously improve large-scale bare-metal Kubernetes clusters across multiple data centers (dev, stg, prod).
- Implement and maintain cluster networking: CNI (e.g., Cilium), BGP, load balancers, ingress, and multi-rack/ToR topologies.
- Build and maintain GitOps-based workflows (e.g., ArgoCD) and CI/CD pipelines to manage cluster add-ons, platform services, and tenant workloads.
- Ensure observability of the platform (metrics, logs, traces) using Prometheus, Elastic stack, Grafana, and related tooling; define SLOs and alerts with SRE teams.
- Participate in the "follow the sun" on-call rotation for production systems; lead or contribute to incident management and incident postmortems.
Reliability, Security & Compliance
- Own and improve reliability and performance of clusters and platform components; lead root cause analysis and long-term fixes for complex incidents.
- Implement and enforce security best practices for Kubernetes, including secure defaults, RBAC policies, network policies, and secrets management.
- Collaborate with SRE, Security, and Network Engineering to meet agreed SLIs/SLOs and support models for on-prem Kubernetes.
Collaboration & Leadership
- Partner closely with BU engineering teams (e.g., GitHub Actions runners, ELK, KubeVirt workloads) to onboard and run production use cases on the bare-metal clusters.
- Provide technical leadership on cross-team projects: lead design reviews, write design docs, and drive decisions that balance reliability, cost, and user experience.
- Mentor junior and mid-level engineers, sharing best practices in Kubernetes, automation, and production operations.
WHAT YOU'LL NEED TO BRING TO THIS ROLE...
- 6+ years of experience in infrastructure, SRE, or platform engineering roles, including at least 3 years running Kubernetes in production, with significant experience on bare metal.
- Strong proficiency in Linux systems administration, networking, performance tuning, and security hardening.
- Deep understanding of Kubernetes internals (API server, etcd, controllers, scheduler, kubelet) and key concepts (Pods, Deployments, Services, Ingress, ConfigMaps, Secrets, HPA).
- Hands-on experience with Kubernetes networking: CNI plugins (preferably Cilium), Services/Ingress, NetworkPolicies, and L4/L7 load balancing.
- Proficiency with Infrastructure as Code (IaC) and automation tools such as Ansible, Terraform, or equivalent.
- Strong experience with observability stacks (e.g., Prometheus, Elastic/ELK, Grafana, Fluentd/Fluent Bit) for cluster and workload monitoring.
- Solid scripting or programming skills (e.g., Python, Go, or similar) for automation, tooling, and integration work.
- Excellent communication and documentation skills, with the ability to collaborate effectively across distributed teams and write clear technical documentation and runbooks.
MINIMUM QUALIFICATIONS (EDUCATION & EXPERIENCE)
- Experience building or operating KubeVirt or other virtualization solutions on top of Kubernetes.
- Prior work with on-prem GitHub Actions runners or similar CI/CD runners on Kubernetes (cloud or bare metal).
- Familiarity with Portworx and Everpure Storage arrays as persistent storage for Kubernetes clusters.
- Experience in multi-tenant platform design: authentication via OIDC/Okta, RBAC design, tenant isolation, and self-service onboarding flows.
- Background in data center networking (BGP, MLAG, ECMP, spine-leaf architectures) and how it interacts with Kubernetes networking at scale.
- Hands-on experience with OpenStack in production (Nova, Neutron, Cinder) and integration patterns between OpenStack, Kubernetes, and on-prem infrastructure.