AI Tech Lead
Job Description
We are building production-grade AI systems for capital markets, including an AI-powered investing assistant that runs on cloud-native infrastructure and integrates with regulated trading and research platforms. We are hiring an AI Tech Lead to own the architecture, model strategy, and technical direction of one or more LLM products.
This is a hands-on technical leadership role — not a step away from the keyboard. You are expected to be the strongest debugger in the room, contribute meaningful code every week, and set the technical bar by example. If you have stopped coding, this role is not for you.
Experience
- 8–14 years in software engineering, with 3+ years leading AI/ML or LLM-based products in production.
- Track record of owning an AI system end-to-end — from model selection through evaluation, deployment, and operation at scale.
- Experience leading small to mid-sized engineering teams (4–10 engineers) without losing technical depth.
Required Skills
1. LLM Hosting & Serving
- Hands-on experience hosting LLMs for testing, evaluation, and production inference at scale.
- Deep working knowledge of inference servers and runtimes: vLLM, TGI (Text Generation Inference), TensorRT-LLM, Ollama, llama.cpp. Able to tune throughput vs latency and explain the trade-offs.
- Experience deploying open-weight models (Llama, Mistral, Qwen, Nemotron, GPT-OSS, DeepSeek, etc.) on GPU instances, with a strong grasp of quantization (GPTQ, AWQ, GGUF, FP8), continuous batching, and KV-cache management (including PagedAttention).
- Production experience with managed model hosting platforms: Amazon Bedrock, Amazon SageMaker, Azure OpenAI, Vertex AI, or equivalent.
- Owns the host-vs-API decision — must be able to defend it with cost models, latency budgets, throughput projections, and data-residency constraints.
2. LLM Evaluation & Testing
- Has designed and operated case-specific test suites for LLM-based applications in production — not just experimented with generic benchmarks.
- Builds and maintains eval datasets from production traffic, edge cases, and adversarial prompts. Treats evals as a first-class engineering artifact.
- Deep hands-on experience with evaluation frameworks: Langfuse, Promptfoo, DeepEval, RAGAS, OpenAI Evals, LM-Eval-Harness, or equivalents.
- Has built LLM-as-a-judge pipelines and personally diagnosed their failure modes (judge bias, position bias, verbosity bias).
- Sets the team's regression-testing discipline for prompts, models, and tool chains.
- Defines what "production quality" means quantitatively (faithfulness, groundedness, answer relevance, tool-selection accuracy, hallucination rate, latency percentiles, token cost per query) and refuses to ship below the bar.
3. LLM Frameworks & Orchestration
- Strong working knowledge of LangChain, LangGraph, LlamaIndex, Haystack, or equivalent orchestration frameworks — including their internals and failure modes.
- Deep experience with agentic patterns: ReAct, ReWOO, Reflexion, Plan-and-Execute, multi-agent workflows. Able to choose between them based on the problem.
- Experience with MCP (Model Context Protocol) and tool calling: designing tool schemas, handling tool-selection failures, and building recovery loops for malformed tool calls.
- Comfortable working outside the Python ecosystem: building LLM applications in Go, Java/Kotlin, TypeScript/Node, or custom in-house frameworks. Does not assume Python is the right answer for production services.
- Experience with streaming responses (SSE, WebSockets), session management, and graceful handling of long-running agentic loops.
4. Retrieval & Context Engineering
- Hands-on experience with embedding models, vector databases (pgvector, OpenSearch, Pinecone, Weaviate, Milvus), and hybrid search (BM25 + dense).
- Owns retrieval design end-to-end: chunking strategies, re-ranking (Cohere Rerank, cross-encoders), query rewriting, retrieval evaluation.
- Knows when RAG is the wrong answer — and is willing to say so.
5. Technical Leadership
- Leads design reviews and writes the design documents others reference.
- Sets coding and evaluation standards by example, not by edict.
- Mentors senior engineers — raises the team's depth on inference, retrieval, and agentic systems.
- Partners credibly with product, security, compliance, and data-governance teams.
- Communicates trade-offs to engineering leadership with numbers, not adjectives.
6. Good-to-Have
- Fine-tuning / instruction-tuning / LoRA / QLoRA / DPO on open-weight models.
- RLHF or RLAIF exposure.
- Prompt distillation, model routing, and cost optimization at scale.
- Guardrails: PII redaction, jailbreak detection, output validation (Guardrails AI, NeMo Guardrails, Llama Guard).
- Experience with multimodal models (vision, audio, ASR/TTS).
- Contributions to open-source AI/ML projects.
Responsibilities
- Own the end-to-end architecture of one or more LLM products: model selection, orchestration pattern (ReAct / ReWOO / multi-agent), retrieval design, evaluation strategy, and deployment topology.
- Drive build-vs-buy and host-vs-API decisions with cost models, latency budgets, and compliance constraints — present these to engineering leadership and defend them.
- Set the bar for evaluation: define what "production quality" means for each product, build the eval infrastructure, and refuse to ship below the bar.
- Lead technical design reviews; mentor senior engineers; raise the team's depth on inference, retrieval, and agentic systems.
- Partner with security, compliance, and data-governance teams to embed PII redaction, audit logging, data residency, and access control into AI systems from day one.
- Stay current with the model landscape (open and closed) and bring back specific, costed recommendations — not just paper summaries.
- Contribute code regularly. Lead by example in the codebase, not just in documents.
- Drive the roadmap with Product; own technical risk and communicate it upward clearly.
What We Are Not Looking For
- Architecture astronauts who no longer code.
- Candidates who cannot quantitatively defend a model choice (cost per 1M tokens, p95 latency, eval scores).
- Candidates who treat frameworks as black boxes and cannot describe what happens at the HTTP, tokenizer, or attention layer.
- Candidates who defer all hard technical calls to "the team."
- Candidates uncomfortable working outside Python — we ship production AI in Go and Java as well.