AI Evaluation Engineer
Job Description
About the Role
CK-12 (www.ck12.org) is on the lookout for talented, creative, and dedicated people to join our mission to provide great education to students around the world. We are looking for candidates to join our office in Bangalore as an AI Evaluation Engineer!
We have a strong education platform that has served more than 352 million users, has had more than 2.88 billion questions answered, and hosts more than 350 thousand customized Flexbooks. We have embarked on an exciting journey to build an AI-powered student tutor and teacher assistant for the next generation of learning platforms.
We are hiring an AI Eval Engineer to build evaluation systems, tools, and workflows for AI-powered education products.
This is a hands-on technical role for someone who enjoys working with LLMs, writing scripts, analyzing model behavior, and building practical tools to improve AI quality. The role combines Python-based development, LLM workflow testing, rubric design, data analysis, and education-domain judgment.
You will work on AI systems used for tutoring, teacher support, content generation, assessments, feedback, and student guidance. The goal is to measure and improve whether AI outputs are accurate, reliable, curriculum-aligned, pedagogically useful, age-appropriate, safe, and ready for real users.
Key Responsibilities
- Build and maintain evaluation pipelines for LLM-powered education workflows across student-facing and teacher-facing products.
- Write Python scripts, notebooks, and lightweight tools for batch testing, output analysis, model comparison, regression checks, and quality tracking.
- Create test sets, rubrics, golden datasets, annotation guidelines, and review workflows for evaluating AI outputs at scale.
- Evaluate and debug AI workflows involving prompts, structured outputs, retrieval, tool use, and multi-step reasoning.
- Analyze outputs to identify hallucinations, reasoning gaps, weak explanations, grade-level mismatches, regressions, and recurring failure patterns.
- Work with engineering, product, curriculum, QA, and SME teams to integrate evals into prompt iteration, workflow improvement, and product releases.
Required Skills & Qualifications
- 3+ years of experience in software development, applied AI, AI evaluation, data analysis, QA automation, edtech, curriculum/content quality, or related areas.
- Strong proficiency in Python or another scripting language for automation, data processing, analysis, and evaluation workflows.
- Experience with LLM APIs, prompt-based systems, JSON, structured outputs, datasets, notebooks, and API-based workflows.
- Ability to convert qualitative expectations into measurable rubrics, scoring criteria, test cases, and review processes.
- Strong debugging mindset, with the ability to inspect AI outputs, trace failure patterns, compare versions, and recommend improvements.
- Understanding of education quality factors such as curriculum alignment, learning objectives, grade-level appropriateness, scaffolding, misconceptions, feedback quality, and instructional clarity.
Desired Skills
- Experience building internal tools, dashboards, annotation pipelines, QA automation systems, or evaluation infrastructure for AI/LLM applications.
- Familiarity with automated evals, LLM-as-judge workflows, human-in-the-loop review, regression testing, and experiment tracking.
- Familiarity with RAG, context management, tool-calling workflows, JSON schemas, structured generation, and multi-step LLM systems.
- Experience with educational products, tutoring systems, assessments, curriculum development, instructional design, or learning science.
- Experience comparing outputs across models, prompts, retrieval strategies, languages, curricula, grade bands, or product workflows.
- Familiarity with Git, notebooks, dashboards, lightweight web tools, or iterative product development practices.
What Success Looks Like
Success in this role means building evaluation systems that make AI quality measurable, repeatable, and easier to improve over time.
You will help teams detect regressions, compare models, understand failure patterns, and improve prompts, workflows, retrieval strategies, and product behavior using evidence from evaluations.
This role is a strong fit for someone with a developer mindset who enjoys building practical tools, debugging AI systems, and improving learning experiences through reliable evaluation.
Passionate about education? Join us at CK-12!
About the Company
CK-12's mission is to provide free access to open-source content and technology tools that empower students as well as teachers to enhance and experiment with different learning styles, resources, levels of competence, and circumstances.
To achieve this noble and ambitious vision, we at CK-12 are challenging the traditional model of education to transform it dramatically. Technology has opened up lots of opportunities to revolutionize education for the benefit of students, teachers and parents.
We have chosen to be non-profit so that we can effectively realize our mission and do the right thing! It also provides us with the ability to experiment with big and bold ideas. CK-12 is backed by Vinod Khosla, a renowned technology venture capitalist.
At CK-12, you'll experience the benefits of working in a dynamic, entrepreneurial, innovative, and non-bureaucratic environment where you will get more cool things done than you ever imagined! We are a small group of passionate folks who are determined to disrupt the current form of education. We came together from companies such as Apple, Amazon, McGraw-Hill, and startups.
Technology is key to scaling education, and we deeply believe in it. Come develop great solutions on our AI-first platform delivering rich and interactive content.
Do our mission, people, and technologies excite you? If the answer is YES and you are a great technologist who will challenge the status quo by innovating (no order takers, please!), come join us! Together, we will change the world!
Benefits: Comprehensive Medical and Accident Insurance, Complimentary lunch orders
Check out our latest product offering - Introducing Flexi 2.0
Flexi, our AI-powered Student Tutor -
AI-powered Teacher Assistant -
Location: