AI Inference Junior Engineer WFH
Job Description
In that case please do not apply as it will get auto-rejected.
Note: this job requires working late nights, until 4 AM India time, to overlap with US working hours. Do not apply if this timing doesn't work for you.
Salary depends on experience and current verifiable compensation (paychecks).
Junior candidates with 2 years of experience are suitable.

About Qubrid AI

Qubrid AI is building the next-generation AI infrastructure platform that enables organizations to deploy, scale, and monetize AI workloads across cloud, on-premises, and hybrid environments. Our platform combines GPU cloud infrastructure, inference APIs, model deployment services, RAG pipelines, fine-tuning capabilities, and AI orchestration software into a unified AI stack.
We are seeking an experienced and hands-on AI Inference Engineer to design, optimize, and scale large-scale AI inference systems supporting thousands of concurrent users and enterprise AI workloads.

Role Overview

As an AI Inference Engineer, you will be responsible for deploying, optimizing, and operating open-source and commercial AI models across NVIDIA GPU infrastructure.
You will work at the intersection of machine learning, distributed systems, GPU optimization, and cloud infrastructure to deliver low-latency, high-throughput AI services.

This is a highly technical role requiring deep expertise in LLM serving, GPU performance tuning, model optimization, inference frameworks, and large-scale production deployments.

Responsibilities

AI Model Deployment & Serving

Deploy and manage Large Language Models (LLMs), multimodal models, vision models, speech models, and embedding models in production.
Build and optimize inference pipelines for enterprise and public AI workloads.
Implement scalable serving architectures using modern inference frameworks.
Support model versioning, rollbacks, canary deployments, and A/B testing.

GPU Performance Optimization

Optimize GPU utilization, memory allocation, throughput, and latency.
Apply reduced-precision and quantization techniques and formats, including FP16, BF16, INT8, GPTQ, AWQ, and GGUF.
Tune inference workloads across NVIDIA H100, H200, B300, B200, A100, L40S, and other accelerator platforms.
Analyze bottlenecks using NVIDIA profiling and monitoring tools.

AI Infrastructure Engineering

Design scalable inference clusters using Kubernetes and containerized workloads.
Implement auto-scaling, load balancing, and fault-tolerant architectures.
Build GPU scheduling and resource allocation strategies.
Optimize multi-tenant AI serving environments.

Inference Framework Expertise

Deploy and optimize models using:
vLLM
NVIDIA TensorRT-LLM
Triton Inference Server
SGLang
TGI (Text Generation Inference)
Ollama
Ray Serve
OpenAI-compatible serving stacks
NVIDIA Dynamo

Model Optimization

Implement batching, continuous batching, speculative decoding, KV cache optimization, and context caching.
Optimize token throughput and cost efficiency.
Evaluate emerging inference technologies and frameworks.
Benchmark models across performance, accuracy, and cost metrics.

Platform Development
Develop APIs and backend services supporting AI inference workloads.
Integrate authentication, billing, token metering, and usage tracking.
Work closely with platform engineering teams to improve reliability and scalability.
Contribute to Qubrid's AI Model Studio and AI Compute Platform.

Required Qualifications

Bachelor's or Master's degree in Computer Science, Engineering, AI/ML, or related field.
2+ years of software engineering experience.
2+ years of production AI/ML infrastructure experience.
Strong Python programming expertise.
Deep understanding of transformer architectures and modern LLMs.
Experience deploying models such as Llama, DeepSeek, Qwen, Mistral, Gemma, and other open-source models.
Strong Linux systems administration skills.
Experience with Docker and Kubernetes.
Experience with distributed systems and cloud-native architectures.

Technical Skills

AI & ML

PyTorch
Hugging Face Transformers
Model quantization
Fine-tuning workflows
Embedding models
RAG architectures
Vector databases

GPU & Infrastructure

NVIDIA CUDA
TensorRT
NCCL
NVLink
NVSwitch
Multi-GPU optimization
GPU monitoring and profiling

Cloud & DevOps

Kubernetes
Docker
Terraform
CI/CD pipelines
AWS, Azure, GCP, or private cloud environments

Databases & Backend

PostgreSQL
MongoDB
Redis
REST APIs
gRPC
Event-driven architectures

Preferred Qualifications

Experience building AI API platforms similar to OpenAI, Anthropic, Together AI, Fireworks, or DeepInfra.
Experience operating large-scale inference clusters with hundreds or thousands of GPUs.
Knowledge of GPU virtualization and multi-tenancy.
Experience with distributed training and fine-tuning.
Familiarity with NVIDIA DGX, HGX, and enterprise GPU environments.
Contributions to open-source AI infrastructure projects.

What Success Looks Like

Deliver highly optimized AI inference platform services with industry-leading latency and throughput.
Improve GPU utilization and reduce
infrastructure costs.
Scale AI services reliably across cloud and on-premises environments.
Enable customers to deploy and consume AI models through Qubrid's unified AI platform.
Drive innovation in AI inference, model optimization, and GPU infrastructure.

Why Join Qubrid AI

Build the future of AI infrastructure.
Work on cutting-edge NVIDIA GPU platforms.
Influence the architecture of a rapidly growing AI platform.
Solve challenging problems in inference, scale, performance, and distributed systems.
Help democratize access to AI infrastructure globally.

Qubrid AI is an equal opportunity employer and welcomes applicants passionate about building the future of AI infrastructure.