Principal AI Engineer
We are building Cognium - an enterprise-grade platform for building, deploying, orchestrating, and governing AI agents at scale. As Principal Architect, you will own the end-to-end system design and make every foundational technical decision, from the distributed architecture (Kubernetes, Temporal, Kafka, NATS) to the AI-native patterns (RAG pipelines, LLM routing, multi-agent orchestration, guardrail frameworks). This is not a slide-deck architect role - you will write code, review every critical PR, and personally build the hardest subsystems.

You will be the technical conscience of the platform. When the team debates Temporal vs. a custom state machine, Kafka vs. Pulsar, pgvector vs. Weaviate, or monolith vs. microservices - your judgment settles it, backed by hands-on prototyping, not just theory.

System Architecture & Design

Own the end-to-end architecture of the Cognium platform across all layers: API Gateway (Envoy), Agent Orchestration (Temporal), Agent Runtime, LLM Router, RAG Engine, Tool Gateway, Policy Engine, Observability, and Infrastructure.
Design and maintain the logical architecture layers: Presentation, API Gateway, Security Pipeline, Orchestration, Runtime, LLM/RAG/Tools, Persistence, Infrastructure.
Define the Control Plane vs. Data Plane separation: global metadata (CockroachDB) vs. per-region execution (PostgreSQL Citus, Redis, pgvector).

Hands-On Technical Leadership

Personally design and implement the most complex subsystems: LLM Router (smart routing, fallback chains, A/B testing), multi-agent orchestration engine (supervisor pattern, handoff protocol, shared scratchpad), and the security pipeline (prompt injection defense, guardrail framework).
Write production code in Go (performance-critical services), Java/Spring Boot (business logic services), and Python (ML pipelines, RAG engine).
Expected contribution: 40-50% hands-on coding in the first 12 months.

AI/ML Architecture

Design the RAG pipeline architecture: document processing, chunking, embedding, hybrid retrieval (pgvector, Elasticsearch BM25), RRF fusion, re-ranking, citation building.
Own the RAGAS quality framework (Faithfulness, Relevancy, Precision, Recall).
Architect the LLM Router for model-agnostic operation: unified invocation interface across Anthropic, OpenAI, Google, Mistral, and self-hosted models (vLLM/TGI).
Design routing rules, fallback chains, and A/B testing infrastructure ...