Posted 09 June, 2026

GPU Infrastructure & Networking Architect

Jio
Mumbai, Maharashtra, IN
Full Time
Reference: 6a280346c6f19b5e0a20000d

The GPU Infrastructure & Networking Architect is responsible for the Low-Level Design (LLD) and Day-0 implementation of the physical GPU compute and DPU-centric networking fabric within the CloudXP Sovereign AI Cloud. This role owns the end-to-end design of the GB300 NVL72 bare-metal compute layer, NVIDIA BlueField-3 DPU deployment in DPF mode, Spectrum-X GPU TAN, Spectrum-3 ancillary fabric, and F5/Netris internet egress. The architect translates the High-Level Design (HLD) into deployable, validated configurations aligned to the NVIDIA Reference Architecture.


Key Responsibilities


1. Design Infrastructure LLD & Day-0 Configuration

Produce rack-level and node-level LLD for GB300 NVL72 bare-metal GPU nodes including cabling, power, and cooling topology

Design and document BF-3 DPU deployment in DPF mode across all node types (GB300 and ancillary); validate DOCA DPF Operator configuration and lifecycle

Define NVIDIA NICo zero-trust enrollment workflow for GB300 nodes; document pre-boot attestation sequences and BMC integration

Produce IP addressing schemes for GPU TAN (Spectrum-X), ancillary compute fabric (Spectrum-3), OOB management, and storage planes

Author Day-0 runbooks for Spectrum-X spine-leaf bring-up, rail-optimised RoCE configuration, and MTU/ECN tuning for GB300 workloads


2. Design Networking & Egress

Design distributed BGP egress using FRR DaemonSet with /32 EIP injection; configure eBGP peering between AS 65100 (compute) and AS 65000 (Spectrum-3 SN4600C spine)

Own F5 AWAF hardware egress configuration as the sole internet exit path; define F5 BNK + DPF co-deployment on BF-3 for north-south traffic

Design OVN-Kubernetes overlay for ancillary nodes and DPF Host-Trusted mode integration; validate SF+VF coexistence on BF-3

Produce VXLAN segment maps, VRF isolation design, and HBN VRF scaling analysis for ZoneVPC DaemonSet model

Define RDMA/RoCE network policies and lossless fabric requirements for GB300 scale-out training workloads

Implement Netris-based cloud virtual functions and


3. Design Storage Connectivity

Design VAST NFS mount architecture for GB300 nodes; validate data-path performance at scale and define NFS tuning parameters

Design NetApp ONTAP block storage connectivity for ancillary KubeVirt VMs; document iSCSI/NVMe-oF path configuration

Produce storage LLD covering StorageGRID object connectivity, zone affinity, and multi-tenancy namespace isolation


4. Implement Validation & NVIDIA Alignment

Execute hardware bring-up validation against NVIDIA GB300 NVL72 Dual-Plane Networking Reference Architecture

Coordinate with NVIDIA field engineering on DTS Prometheus port accessibility in DPF mode and UFM 6.x metric naming compatibility

Produce test plans and acceptance criteria for network fabric, GPU TAN, and egress path; participate in NVIDIA NCP validation reviews


Required Skills & Experience


Must-Have

10+ years in data-centre infrastructure architecture with 3+ years on GPU/AI cluster deployments at scale (100+ nodes)

Hands-on experience with NVIDIA BlueField DPUs (BF-2 or BF-3); knowledge of DOCA SDK, DPF Operator, and OVN-K integration

Deep expertise in BGP (eBGP/iBGP), RoCE/RDMA networking, and lossless Ethernet fabric design (PFC, ECN, DCQCN)

Proficiency with NVIDIA Spectrum switches (UFM, SHARP, rail-optimised topology) or comparable InfiniBand/Ethernet AI fabrics

Experience with F5 BIG-IP (hardware AWAF / BIG-IP Next for Kubernetes); familiarity with BIG-IP as Kubernetes ingress/egress

Strong Linux networking background: VXLAN, VRF, VLAN, OVN/OVS, kernel datapath, SR-IOV, VF/SF configuration

Experience with Netris implementation and customization

Proficiency in Python and/or Go for automation scripts and infrastructure-as-code tooling; Ansible/Terraform for Day-0 provisioning


Nice-to-Have

Familiarity with NVIDIA NICo zero-trust bare-metal enrollment and attestation workflows

Experience with NetApp ONTAP and VAST Data NFS storage platforms in high-performance compute environments

Knowledge of NVIDIA UFM (Unified Fabric Manager) 6.x and Telemetry Streaming for GPU fabric observability

Prior engagement with NVIDIA Cloud Partner (NCP) programme or Sovereign AI Cloud deployments

Understanding of Kubernetes CNI plugins (OVN-Kubernetes, Cilium) and their interaction with DPU offload
