AI Inference Junior Engineer WFH
Job Description
Read everything carefully. The requirements and screening questions are critical: applications that do not answer them correctly and satisfactorily are auto-rejected, which wastes your time.
- Work from Home.
- This is a full-time role. If you plan to hold two or more jobs at the same time, or want to do this part-time, that won't work for us. In that case, please do not apply, as your application will be auto-rejected.
- Note: this job requires working late at night, India time, until 4 AM to overlap with USA working hours. Do not apply if this timing doesn't work for you.
- Salary depends on experience and current compensation, verifiable through paychecks.
- Junior candidates with 2 years of experience are suitable.
Qubrid AI is building the next-generation AI infrastructure platform that enables organizations to deploy, scale, and monetize AI workloads across cloud, on-premises, and hybrid environments. Our platform combines GPU cloud infrastructure, inference APIs, model deployment services, RAG pipelines, fine-tuning capabilities, and AI orchestration software into a unified AI stack.
We are seeking an experienced and hands-on AI Inference Engineer to design, optimize, and scale large-scale AI inference systems supporting thousands of concurrent users and enterprise AI workloads.
As an AI Inference Engineer, you will be responsible for deploying, optimizing, and operating open-source and commercial AI models across NVIDIA GPU infrastructure. You will work at the intersection of machine learning, distributed systems, GPU optimization, and cloud infrastructure to deliver low-latency, high-throughput AI services.
This is a highly technical role requiring deep expertise in LLM serving, GPU performance tuning, model optimization, inference frameworks, and large-scale production deployments.
- Deploy and manage Large Language Models (LLMs), multimodal models, vision models, speech models, and embedding models in production.
- Build and optimize inference pipelines for enterprise and public AI workloads.
- Implement scalable serving architectures using modern inference frameworks.
- Support model versioning, rollbacks, canary deployments, and A/B testing.
- Optimize GPU utilization, memory allocation, throughput, and latency.
- Implement model quantization techniques including FP16, BF16, INT8, GPTQ, AWQ, and GGUF.
- Tune inference workloads across NVIDIA H100, H200, B300, B200, A100, L40S, and other accelerator platforms.
- Analyze bottlenecks using NVIDIA profiling and monitoring tools.
- Design scalable inference clusters using Kubernetes and containerized workloads.
- Implement auto-scaling, load balancing, and fault-tolerant architectures.
- Build GPU scheduling and resource allocation strategies.
- Optimize multi-tenant AI serving environments.
- Deploy and optimize models using:
  - vLLM
  - NVIDIA TensorRT-LLM
  - Triton Inference Server
  - SGLang
  - TGI (Text Generation Inference)
  - Ollama
  - Ray Serve
  - OpenAI-compatible serving stacks
  - NVIDIA Dynamo
- Implement batching, continuous batching, speculative decoding, KV cache optimization, and context caching.
- Optimize token throughput and cost efficiency.
- Evaluate emerging inference technologies and frameworks.
- Benchmark models across performance, accuracy, and cost metrics.
- Develop APIs and backend services supporting AI inference workloads.
- Integrate authentication, billing, token metering, and usage tracking.
- Work closely with platform engineering teams to improve reliability and scalability.
- Contribute to Qubrid's AI Model Studio and AI Compute Platform.
- Bachelor's or Master's degree in Computer Science, Engineering, AI/ML, or related field.
- 2+ years of software engineering experience.
- 2+ years of production AI/ML infrastructure experience.
- Strong Python programming expertise.
- Deep understanding of transformer architectures and modern LLMs.
- Experience deploying models such as Llama, DeepSeek, Qwen, Mistral, Gemma, and other open-source models.
- Strong Linux systems administration skills.
- Experience with Docker and Kubernetes.
- Experience with distributed systems and cloud-native architectures.
- PyTorch
- Hugging Face Transformers
- Model quantization
- Fine-tuning workflows
- Embedding models
- RAG architectures
- Vector databases
- NVIDIA CUDA
- TensorRT
- NCCL
- NVLink
- NVSwitch
- Multi-GPU optimization
- GPU monitoring and profiling
- Kubernetes
- Docker
- Terraform
- CI/CD pipelines
- AWS, Azure, GCP, or private cloud environments
- PostgreSQL
- MongoDB
- Redis
- REST APIs
- gRPC
- Event-driven architectures
- Experience building AI API platforms similar to OpenAI, Anthropic, Together AI, Fireworks, or DeepInfra.
- Experience operating large-scale inference clusters with hundreds or thousands of GPUs.
- Knowledge of GPU virtualization and multi-tenancy.
- Experience with distributed training and fine-tuning.
- Familiarity with NVIDIA DGX, HGX, and enterprise GPU environments.
- Contributions to open-source AI infrastructure projects.
- Deliver highly optimized AI inference platform services with industry-leading latency and throughput.
- Improve GPU utilization and reduce infrastructure costs.
- Scale AI services reliably across cloud and on-premises environments.
- Enable customers to deploy and consume AI models through Qubrid's unified AI platform.
- Drive innovation in AI inference, model optimization, and GPU infrastructure.
- Build the future of AI infrastructure.
- Work on cutting-edge NVIDIA GPU platforms.
- Influence the architecture of a rapidly growing AI platform.
- Solve challenging problems in inference, scale, performance, and distributed systems.
- Help democratize access to AI infrastructure globally.
Qubrid AI is an equal opportunity employer and welcomes applicants passionate about building the future of AI infrastructure.