GPU Infrastructure & Networking Architect
The GPU Infrastructure & Networking Architect is responsible for the Low-Level Design (LLD) and Day-0 implementation of the physical GPU compute and DPU-centric networking fabric within the CloudXP Sovereign AI Cloud. This role owns the end-to-end design of the GB300 NVL72 bare-metal compute layer, NVIDIA BlueField-3 (BF-3) DPU deployment in DPF mode, Spectrum-X GPU TAN, Spectrum-3 ancillary fabric, and F5/Netris internet egress, translating the High-Level Design (HLD) into deployable, validated configurations aligned to the NVIDIA Reference Architecture.
Key Responsibilities
1. Design Infrastructure LLD & Day-0 Configuration
Produce rack-level and node-level LLD for GB300 NVL72 bare-metal GPU nodes, including cabling, power, and cooling topology
Design and document BF-3 DPU deployment in DPF mode across all node types (GB300 and ancillary); validate DOCA DPF Operator configuration and lifecycle
Define NVIDIA NICo zero-trust enrollment workflow for GB300 nodes; document pre-boot attestation sequences and BMC integration
Produce IP addressing schemes for GPU TAN (Spectrum-X), ancillary compute fabric (Spectrum-3), OOB management, and storage planes
Author Day-0 runbooks for Spectrum-X spine-leaf bring-up, rail-optimised RoCE configuration, and MTU/ECN tuning for GB300 workloads
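As an illustration of the addressing work described above, the following is a minimal sketch of carving one supernet into non-overlapping blocks, one per network plane. The supernet 10.64.0.0/14, the plane names, and the helper carve_planes are hypothetical and chosen only for the example; real ranges and sizes would come from the HLD.

```python
import ipaddress

# Hypothetical supernet and plane names, for illustration only.
SUPERNET = ipaddress.ip_network("10.64.0.0/14")
PLANES = ["gpu-tan", "ancillary", "oob-mgmt", "storage"]


def carve_planes(supernet, planes):
    """Split the supernet into equal-sized blocks, one per plane."""
    # Smallest number of extra prefix bits that yields enough blocks.
    extra_bits = (len(planes) - 1).bit_length()
    blocks = supernet.subnets(prefixlen_diff=extra_bits)
    # zip stops at the shorter input, so surplus blocks stay unallocated.
    return dict(zip(planes, blocks))


plan = carve_planes(SUPERNET, PLANES)
for name, net in plan.items():
    print(f"{name:10s} {net} ({net.num_addresses} addresses)")
```

Equal-sized blocks keep the sketch simple. In practice the planes would likely need different sizes, so a production scheme would allocate per-plane prefix lengths explicitly.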
2. Design Networking & Egress
Design distributed BGP egress using an FRR DaemonSet with /32 EIP injection; configure eBGP peering between AS 65100 (compute) and AS 65000 (Spectrum-3 SN4600C spine)
Own F5 AWAF hardware egress configuration as the sole internet exit path; define F5 BNK + DPF co-deployment on BF-3 for north-south traffic
Design OVN-Kubernetes overlay for ancillary nodes and DPF Host-Trusted mode integration; validate SF+VF coexistence on BF-3
Produce VXLAN segment maps, VRF isolation design, and HBN VRF scaling analysis for the ZoneVPC DaemonSet model
Define RDMA/RoCE network policies and lossless fabric requirements for GB300 scale-out training workloads
Implement Netris-based cloud virtual functions and
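To make the /32 EIP injection idea in this section concrete, here is a small sketch that expands an egress IP pool into the host routes a node-local BGP speaker would announce. The AS numbers are the ones named above; the pool 203.0.113.0/28 is a documentation range used as a placeholder, and the helpers session_type and eip_host_routes are hypothetical names, not part of FRR or any NVIDIA tooling.

```python
import ipaddress
from itertools import islice

COMPUTE_AS = 65100  # compute side, as stated in the responsibilities
SPINE_AS = 65000    # Spectrum-3 spine side, as stated in the responsibilities


def session_type(local_as, peer_as):
    """A BGP session between different AS numbers is external (eBGP)."""
    return "eBGP" if local_as != peer_as else "iBGP"


def eip_host_routes(pool, count):
    """Return the first `count` usable addresses of the pool as /32 prefixes."""
    net = ipaddress.ip_network(pool)
    return [ipaddress.ip_network(f"{addr}/32") for addr in islice(net.hosts(), count)]


# Placeholder pool from the TEST-NET-3 documentation range.
routes = eip_host_routes("203.0.113.0/28", 3)
print(session_type(COMPUTE_AS, SPINE_AS))
for route in routes:
    print(route)
```

Announcing each EIP as its own /32 lets the spine route return traffic to whichever node currently holds that address, which is the point of distributing the BGP speakers across nodes.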
3. Design Storage Connectivity
Design VAST NFS mount architecture for GB300 nodes; validate data-path performance at scale and define NFS tuning parameters
Design NetApp ONTAP block storage connectivity for ancillary KubeVirt VMs; document iSCSI/NVMe-oF path configuration
Produce storage LLD covering StorageGRID object connectivity, zone affinity, and multi-tenancy namespace isolation
4. Implement Validation & NVIDIA Alignment
Execute hardware bring-up validation against the NVIDIA GB300 NVL72 Dual-Plane Networking Reference Architecture
Coordinate with NVIDIA field engineering on DTS Prometheus port accessibility in DPF mode and UFM 6.x metric naming compatibility
Produce test plans and acceptance criteria for network fabric, GPU TAN, and egress path; participate in NVIDIA NCP validation reviews
Required Skills & Experience
Must-Have
10+ years in data-centre infrastructure architecture, including 3+ years on GPU/AI cluster deployments at scale (100+ nodes)
Hands-on experience with NVIDIA BlueField DPUs (BF-2 or BF-3); knowledge of DOCA SDK, DPF Operator, and OVN-K integration
Deep expertise in BGP (eBGP/iBGP), RoCE/RDMA networking, and lossless Ethernet fabric design (PFC, ECN, DCQCN)
Proficiency with NVIDIA Spectrum switches (UFM, SHARP, rail-optimised topology) or comparable InfiniBand/Ethernet AI fabrics
Experience with F5 BIG-IP (hardware AWAF / BIG-IP Next for Kubernetes); familiarity with BIG-IP as Kubernetes ingress/egress
Strong Linux networking background: VXLAN, VRF, VLAN, OVN/OVS, kernel datapath, SR-IOV, VF/SF configuration
Experience with Netris implementation and customisation
Proficiency in Python and/or Go for automation scripts and infrastructure-as-code tooling; Ansible/Terraform for Day-0 provisioning
Nice-to-Have
Familiarity with NVIDIA NICo zero-trust bare-metal enrollment and attestation workflows
Experience with NetApp ONTAP and VAST Data NFS storage platforms in high-performance compute environments
Knowledge of NVIDIA UFM (Unified Fabric Manager) 6.x and Telemetry Streaming for GPU fabric observability
Prior engagement with the NVIDIA Cloud Partner (NCP) programme or Sovereign AI Cloud deployments
Understanding of Kubernetes CNI plugins (OVN-Kubernetes, Cilium) and their interaction with DPU offload