Imagine building an endless stream of perfectly accurate, curriculum-aligned math questions for every student in grades K-12, automatically. That was the bold vision behind LearnWithAI, an in-house EdTech venture at Trilogy (a large software conglomerate) that uses AI to create adaptive learning experiences for kids.
The challenge
LearnWithAI set out to generate multiple-choice questions covering every Common Core math standard across K-12. However, the initial question generations were not representative of the math curriculum and were plagued with hallucinations. To be marketable, the content needed to be:
- Mathematically accurate: At the time, available LLMs struggled with math and routinely hallucinated answers and logic.
- Aligned with the curriculum: Content had to mirror school exams in format and difficulty, and many standards required integrated graphs, tables, or diagrams.
- Fully automated: The system needed to be adaptive and effectively infinite, meaning content created entirely by LLMs with no human intervention.
The approach
Trilogy brought in the DeepRails team (then a group of contractors) to design a robust, automated content engine with purposeful guardrails. They built a multi-stage workflow that combined precision prompting, rigorous evaluations, and programmatic visuals:
- A quality evaluation framework: For every question type, the team built a customized evaluation process for the LLM-generated content. Questions were created with a generator prompt, then fed through a series of evaluation prompts checking correctness, difficulty, and completeness; questions that failed any check were regenerated (a minimal sketch of this loop appears after this list). To calibrate outputs and iterate on the framework, final question outputs were also manually graded against a detailed rubric covering correctness, difficulty, curriculum alignment, and, for stimulus-based questions, the quality of the stimulus.
- Programmatic stimuli: For standards requiring visuals, DeepRails created Python drawing functions with Matplotlib for each stimulus type. The LLM outputs included the exact data needed to render matching graphs, tables, or diagrams, ensuring the visuals aligned with the question content (see the rendering sketch below).
- Smarter reasoning: Careful model selection prioritized LLMs with advanced reasoning. The team also used the Assistants API with the Wolfram plugin to strengthen mathematical rigor.
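To make the generate-evaluate-regenerate loop concrete, here is a minimal sketch in Python. It assumes a generic `call_llm` helper and a convention where each evaluator prompt ends its response with PASS or FAIL; the prompt templates, helper names, and retry limit are illustrative, not the production implementation.

```python
from typing import Optional

# Illustrative prompt templates (placeholders, not the real prompts).
GENERATOR_PROMPT = "Write one multiple-choice math question for standard {standard}."
EVALUATION_PROMPTS = {
    "correctness": "Check that the answer key and solution logic are mathematically correct. "
                   "End with PASS or FAIL.\n{question}",
    "difficulty": "Check that the difficulty matches grade-level expectations for {standard}. "
                  "End with PASS or FAIL.\n{question}",
    "completeness": "Check that the stem, answer choices, and explanation are complete. "
                    "End with PASS or FAIL.\n{question}",
}

def call_llm(prompt: str) -> str:
    """Placeholder for whatever LLM client the pipeline actually uses."""
    raise NotImplementedError

def passes(evaluation_response: str) -> bool:
    # Assumes evaluator prompts are instructed to end their response with PASS or FAIL.
    return evaluation_response.strip().upper().endswith("PASS")

def generate_question(standard: str, max_attempts: int = 3) -> Optional[str]:
    for _ in range(max_attempts):
        question = call_llm(GENERATOR_PROMPT.format(standard=standard))
        results = {
            name: call_llm(template.format(standard=standard, question=question))
            for name, template in EVALUATION_PROMPTS.items()
        }
        if all(passes(r) for r in results.values()):
            return question  # passed every evaluation prompt
    return None  # flag for review after repeated failures
```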
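The programmatic stimuli followed a similar contract: the generator emitted structured data, and a per-stimulus-type drawing function rendered it with Matplotlib. The sketch below assumes a simple JSON bar-chart spec; the field names and example data are hypothetical, not the production schema.

```python
import json
import matplotlib.pyplot as plt

def render_bar_chart(spec_json: str, out_path: str) -> None:
    """Render a bar-chart stimulus from a JSON spec emitted by the generator."""
    spec = json.loads(spec_json)
    fig, ax = plt.subplots(figsize=(4, 3))
    ax.bar(spec["categories"], spec["values"])
    ax.set_title(spec["title"])
    ax.set_xlabel(spec["x_label"])
    ax.set_ylabel(spec["y_label"])
    fig.tight_layout()
    fig.savefig(out_path)
    plt.close(fig)

# Example: a spec the LLM might return alongside a grade 3 data question.
render_bar_chart(json.dumps({
    "title": "Books Read This Month",
    "x_label": "Student",
    "y_label": "Books",
    "categories": ["Ana", "Ben", "Chloe", "Dev"],
    "values": [4, 7, 5, 9],
}), "stimulus.png")
```

Because the same data drives both the question text and the rendered image, the visual can never contradict the answer key.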
Why it worked
Several elements came together to push LLMs beyond their typical limits:
- Exemplars for few-shot learning: High-quality sample questions that matched the desired format, difficulty, and coverage were sourced or created for every standard to guide the models (see the prompt-assembly sketch after this list).
- Expert prompt design: Years of prompt engineering informed clear, concise, and comprehensive prompts, with techniques like chain-of-thought reasoning and decomposed prompting used extensively.
- Iteration at scale: The entire workflow was stress-tested and refined until it consistently delivered top-tier results.
- Strict production guardrails: The evaluation and regeneration framework developed during prompt iteration was carried into production, ensuring every question served to students was mathematically correct and of high educational quality.
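As one illustration of how exemplars and chain-of-thought instructions can combine, the sketch below assembles few-shot examples and a step-by-step reasoning instruction into a single generator prompt. The exemplar, standard code, and wording are illustrative only, not the prompts used on the project.

```python
# Hypothetical exemplar bank; production exemplars were vetted per standard.
EXEMPLARS = [
    {
        "standard": "3.OA.A.3",
        "question": "A baker puts 6 muffins in each box. How many muffins are in 4 boxes?\n"
                    "A) 10  B) 18  C) 24  D) 28",
        "answer": "C",
    },
]

def build_generator_prompt(standard: str, exemplars: list[dict]) -> str:
    # Few-shot block: show the model the desired format and difficulty.
    shots = "\n\n".join(
        f"Standard: {ex['standard']}\nQuestion:\n{ex['question']}\nCorrect answer: {ex['answer']}"
        for ex in exemplars
    )
    # Chain-of-thought instruction: reason about the standard before writing.
    return (
        "You write multiple-choice math questions aligned to Common Core standards.\n"
        "First reason step by step about what the standard requires, then write the question.\n\n"
        f"Examples of the desired format and difficulty:\n{shots}\n\n"
        f"Now write a new question for standard {standard}."
    )

print(build_generator_prompt("3.OA.A.3", EXEMPLARS))
```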
The business impact
- Accuracy that sells: Question accuracy jumped from an average of 20% to 99% (a 395% relative increase), transforming AI content from a liability into a marketable advantage.
- Lower cost per item: Automation reduced reliance on manual writing and QA, compressing production cycles and protecting margins.
- Enterprise trust: Exam-format parity, difficulty control, and programmatic visuals built confidence with districts, parents, and partners.