Senior Staff System Engineer, GPU Fleet
Company Introduction
We exist to wow our customers. We know we're doing the right thing when we hear our customers say, "How did I ever live without Coupang?" Born out of an obsession to make shopping, eating, and living easier than ever, we're collectively disrupting the multi-billion-dollar e-commerce industry from the ground up. We are one of the fastest-growing e-commerce companies and have established an unparalleled reputation as a dominant and reliable force in South Korean commerce.
We are proud to have the best of both worlds: a startup culture with the resources of a large global public company. This fuels us to continue our growth and launch new services at the pace we have maintained since our inception. We are all entrepreneurs surrounded by opportunities to drive new initiatives and innovations. At our core, we are bold and ambitious people who like to get our hands dirty and make a hands-on impact. At Coupang, you will see yourself, your colleagues, your team, and the company grow every day.
Our mission to build the future of commerce is real. We push the boundaries of what's possible to solve problems and break traditional trade-offs. Join Coupang now to create an epic experience in this always-on, high-tech, and hyper-connected world.
Role Overview
We are seeking a Sr Staff System Engineer, GPU Fleet for our Coupang Intelligent Cloud (CIC) team, to serve as the senior technical owner for our hyperscale GPU compute infrastructure. In this role, you will define fleet architecture, drive reliability and automation at scale, and lead the operation and evolution of GPU systems supporting large-scale AI training and inference workloads. This is a hands-on, staff-level individual contributor role with broad technical ownership, high operational impact, and significant cross-functional influence across hardware, infrastructure, and datacenter operations.
CIC builds the infrastructure for abundant intelligence. We partner with leading AI labs, governments, and enterprises to deliver hyperscale GPU compute with high reliability, performance, and efficiency. Our infrastructure supports some of the most demanding AI training and inference workloads in production today.
We operate with urgency, deep ownership, and a strong bias toward execution. Reliability, operational excellence, and rigorous systems engineering are core to our business.
What You Will Do
As a Sr Staff System Engineer, GPU Fleet, you will be the senior technical owner for CIC's large-scale GPU compute infrastructure. This is a hands-on senior individual contributor role with fleet-level responsibility and broad cross-functional influence.
You will define the technical direction for how GPU fleets are architected, operated, automated, and evolved across multiple generations of hardware. Your work will directly affect fleet reliability, operating efficiency, scalability, and customer success.
This role does not involve people management, but it carries principal-level scope, autonomy, and decision-making authority across infrastructure, hardware, and operations.
Key Responsibilities:
Fleet Architecture & Technical Ownership
- Own the end-to-end technical architecture of hyperscale GPU fleets, including hardware platform selection, firmware strategy, OS configuration, drivers, networking, and observability.
- Define and enforce technical standards and best practices for fleet reliability, availability, performance, and operability.
- Lead major fleet-wide initiatives such as new GPU platform bring-ups, multi-generation hardware transitions, and architectural redesigns.
- Evaluate trade-offs across cost, performance, reliability, and time-to-deploy, and make technically sound decisions under ambiguity.
Reliability, Availability & Performance
- Set and drive fleet-level reliability, availability, and performance objectives.
- Lead root-cause analysis and resolution of complex, systemic failures affecting large portions of the fleet or multiple datacenters.
- Identify recurring failure patterns and drive long-term fixes spanning hardware, software, automation, and operational processes.
- Work directly with hardware vendors and partners to resolve platform-level issues and influence future hardware designs.
Automation & Systems Engineering
- Design and build large-scale automation systems for:
  - GPU fleet provisioning and lifecycle management
  - GPU health validation, diagnostics, and certification
  - Automated remediation, recovery, and replacement workflows
- Eliminate manual operational toil through durable, well-designed tooling that scales with fleet growth.
- Ensure all fleet systems are observable, testable, and resilient under failure conditions.
Operational Leadership
- Act as a senior escalation point for critical production incidents impacting GPU availability or customer workloads.
- Participate in on-call rotations with a strong emphasis on preventing future incidents, not just responding to them.
- Lead high-severity post-incident reviews and ensure learnings are translated into concrete engineering and process improvements.
Technical Influence & Mentorship
- Provide technical mentorship and guidance to system and infrastructure engineers across the organization.
- Serve as a trusted technical partner to platform engineering, networking, datacenter operations, and leadership teams.
- Influence CIC's long-term infrastructure roadmap through strong technical judgment and data-driven recommendations.
Basic Qualifications
- 12+ years of overall experience, including 8+ years in Linux systems engineering, infrastructure engineering, or datacenter operations, operating production environments with strict uptime and performance requirements.
- Deep, hands-on expertise in Linux system internals, including process scheduling, memory management, filesystem behavior, networking, kernel behavior, and system performance analysis.
- Demonstrated experience operating hardware-intensive infrastructure in production, including bare-metal servers at scale.
- Proven ability to debug complex issues across multiple system layers, including hardware components, firmware/BIOS, kernel drivers, OS configuration, and userspace services.
- Extensive experience writing production-grade automation using Python and Bash for provisioning, configuration management, diagnostics, remediation, and fleet operations.
- Strong understanding of how to design systems that are observable, resilient, and safe under failure, rather than reliant on manual intervention.
Preferred Qualifications
- Direct experience operating large-scale GPU fleets supporting AI/ML training and/or inference workloads in production.
- Familiarity with modern GPU platforms and ecosystems, including GPU drivers, CUDA, NCCL, and high-performance compute workloads.
- Experience with high-speed interconnects and datacenter networking, such as NVLink, InfiniBand, RDMA, and high-throughput Ethernet.
- Prior ownership of fleet-wide or platform-wide initiatives, such as new hardware bring-ups, major architectural changes, or reliability transformations.
- Experience partnering directly with hardware vendors or manufacturers to troubleshoot systemic issues or influence future platform designs.
- Strong intuition for failure modes at scale, including cascading failures, correlated faults, and second-order effects across systems.
- History of acting as a technical authority or escalation point for ambiguous, high-impact production problems.
- Ability to mentor engineers through design reviews, technical problem solving, and modelling strong operational ownership.
- Experience participating in on-call rotations and responding to high-severity production incidents with clear ownership, urgency, and technical leadership.
- Strong written and verbal communication skills, including clear post-incident reviews and technical documentation.
Type of work model
Hybrid
Details to consider
- Those eligible for employment protection (recipients of veteran's benefits, the disabled, etc.) may receive preferential treatment for employment in accordance with applicable laws.
Privacy Notice
- Your personal information will be collected and managed by Coupang as stated in the Application Privacy Notice located below. https://privacy.coupang.com/en/land/jobs/