Agentic AI: Why Learnosity is using generative AI to build the future of grading
How do you know which approach is right when laying the foundations for the next wave of AI innovations in assessment?
When your MO is to solve difficult problems, healthy debate is an ally.
In software development, you might even consider it a necessity. It’s by challenging “sacred cows”—old orthodoxies and conventional ways of thinking—that you can adapt, reinvent, and keep your footing as the landscape changes.
And how things have changed.
It’s in the large shadow of AI that the most lively tech debate is now concentrated. This is true in edtech too, where a new topic has entered the chat:
Which method is best for AI grading: generative AI or classification AI?
Let's start with the basics
Broadly speaking, the difference between the two approaches goes as follows:
Classification AI works by sorting responses into predefined categories. Models are trained to assign the response a score, picking the most probable one from the predefined list based on the large labeled dataset on which they’re trained. Classification models are explainable and consistent (and therefore deemed trustworthy), but limited when student responses don’t map neatly to those predefined categories.
Generative AI, by contrast, analyzes the context and nuance in language, so it can score open-ended responses more flexibly, even when learners phrase things in unexpected ways. This makes it especially valuable for essay grading and short-response scoring, where creativity and reasoning don’t always fit rigid categories.
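To make the contrast concrete, here’s a minimal sketch. It’s illustrative only (the function names and fixed score list are assumptions, not Learnosity’s implementation): a classification grader can only ever return one of its predefined labels, while a generative grader carries the rubric in its prompt and leans on language understanding.

```python
SCORE_LABELS = [0, 1, 2, 3, 4]  # the predefined categories a classifier is limited to

def classify_score(probabilities: list) -> int:
    """Classification AI: return the predefined score with the highest
    model probability -- nothing outside SCORE_LABELS is possible."""
    best = max(range(len(SCORE_LABELS)), key=lambda i: probabilities[i])
    return SCORE_LABELS[best]

def generative_grading_prompt(rubric: str, response: str) -> str:
    """Generative AI: the rubric travels with the request, so unexpected
    phrasings are handled by the model's language understanding."""
    return (
        f"Grade the following response against this rubric:\n{rubric}\n\n"
        f"Response:\n{response}\n\n"
        "Return a score and a short justification."
    )
```

The design point: the classifier’s output space is frozen at training time, while the generative grader’s “instructions” are just runtime inputs.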
Of course it’s not quite as cut and dried as that. Generative models (LLMs in this case) aren’t without their flaws when it comes to grading. The most frequently cited of these are:
- They’re prone to hallucination
- They’re inconsistent in their output
- It’s not always easy to understand why they’re wrong
- They risk exposing data that may be used to train future model versions
Looks like a problem for the AI team
To successfully apply AI’s potential to assessment, these are all areas Learnosity’s developers had to analyze closely after the company decided to go big on AI back in 2023. The deep well of domain expertise to draw from gave us a rolling start. But tackling those challenges required scaled investment—which led to the creation of AI Labs, a large, dedicated team of AI specialists.
“Generative AI like LLMs are an underlying technology,” says Kate Hake, a Product Manager based in Learnosity’s Dublin office who oversees the AI team’s product output and roadmap. “Placing a narrow focus on the risks of generative AI mistakes the technology for the solution and doesn’t account for the agency of those building with it.”
“The vision we had for AI Labs was to assemble a multi-disciplinary team with diverse areas of expertise,” she continues. “Whenever we introduce new features we step back and ask: what does a high-quality outcome look like for this specific task?”
“Placing a narrow focus on the risks of generative AI mistakes the technology for the solution and doesn’t account for the agency of those building with it.”
“From there, we run everything through a rigorous evaluation pipeline. The team defines the right metrics, ensures we have the proper datasets, and tests different models and techniques. The goal is always the same: consistently high-quality results. Once features are live, we continue to monitor performance closely to catch and prevent regressions.”
This rigorous approach is evident in the quality of Learnosity’s multi-award-winning AI essay grading engine, Feedback Aide, which lets product builders add high-quality AI scoring to their products via a lightweight API integration while taking care of mission-critical performance areas like reliability, accessibility, and scalability.
Human-level AI scoring with generative AI—a hallucination?
But let’s get back to the specifics. While generative AI’s tendency to hallucinate might be laughed off as meme-worthy in certain industries, in education—and especially assessment—it’s no laughing matter. Finding solutions to overcome the problem is critically important.
“Hallucinations stem from uncertainty, but it’s important to note they’re not unique to LLMs,” says Sean McCrossan, a data scientist on the AI Labs team. “Similar errors occur in fine-tuned classifiers and manifest as overconfident mislabelling.”
“At Learnosity, we address it using a three-pronged approach:
- Prompt design: we craft prompts to minimize hallucinations.
- Evaluation: we test multiple prompting strategies and model combinations to measure hallucination rates.
- Confidence metric: as part of our agentic system, we propose a confidence score that helps the human reviewer contextualize the feedback."
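The confidence metric in that third prong can be pictured with a short sketch. All names, the threshold value, and the routing logic here are illustrative assumptions, not Learnosity’s actual system: the idea is simply that a score arrives alongside a confidence value, and low-confidence results are flagged for the human reviewer rather than auto-accepted.

```python
from dataclasses import dataclass

@dataclass
class GradedResponse:
    score: int
    confidence: float  # 0.0-1.0, emitted alongside the score by the grading system

def route(graded: GradedResponse, threshold: float = 0.8) -> str:
    """Route low-confidence gradings to a human reviewer so they can
    contextualize the feedback before it reaches the learner."""
    return "auto_accept" if graded.confidence >= threshold else "human_review"
```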
The result of this systematic approach is measurable, with Feedback Aide achieving a QWK (Quadratic Weighted Kappa) score of 0.91—on par with human grading for accuracy, and higher than most consistency scores reported in high-stakes grading contexts.
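For readers unfamiliar with the metric: QWK measures agreement between two raters on an ordinal scale, penalizing disagreements more heavily the further apart the two scores are, and correcting for chance agreement. A perfect score is 1.0. A compact reference implementation (a standard textbook formulation, not Learnosity’s code):

```python
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Quadratic Weighted Kappa between two lists of integer scores.
    1.0 = perfect agreement; 0.0 = chance-level agreement."""
    n = max_score - min_score + 1
    # Confusion matrix of observed score pairs
    observed = np.zeros((n, n))
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score, b - min_score] += 1
    # Expected matrix under independence (outer product of the marginals)
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    # Quadratic weights: penalty grows with the squared score distance
    weights = np.array([[(i - j) ** 2 for j in range(n)]
                        for i in range(n)]) / (n - 1) ** 2
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()
```

Equivalent to `sklearn.metrics.cohen_kappa_score(a, b, weights="quadratic")`, which is the usual choice in practice.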
However, for Kate Hake, it’s the last point that’s worth highlighting in the context of assessment grading.
"That combination of smart automation and thoughtful human oversight is what makes our approach so effective.”
“Of course human review remains a critical part of the process,” she says. “Every workflow is designed with a human reviewer who has the final say. That combination of smart automation and thoughtful human oversight is what makes our approach so effective.”
I’ll explain everything: Making essay grading interpretable
Taking extensive measures to achieve consistent, high-quality scoring is one thing, but explaining to graders and learners where the mark came from is another.
“LLMs are criticized for being ‘opaque’,” explains McCrossan. “The claim is that you can’t understand why they’re wrong, while classification models are thought to be more interpretable. But that’s not really accurate.”
“Powerful classification systems are also a black box—most likely built on neural networks or transformers, since simpler methods like regression or decision trees wouldn’t achieve sufficient accuracy for essay grading. So any claim of transparency is overstated.”
“Our approach to explainability is different,” he continues. “We use an agentic system that grades each rubric trait individually, making it clear where marks were lost.
Additional agents handle tasks like checking whether the essay is off-topic, analyzing citations, or evaluating the introduction. Together, these provide a transparent breakdown of why a score was given, rather than a single opaque output.”
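The per-trait idea McCrossan describes can be sketched in a few lines. This is a deliberately simplified illustration under stated assumptions: each “agent” here is a stub function with a placeholder heuristic, whereas the real agents would be LLM-backed, and the function names are invented for the example.

```python
def grade_trait(trait: str, response: str) -> dict:
    """Stub standing in for an LLM-backed agent that grades ONE rubric
    trait. The word-count heuristic is a placeholder, not a real grader."""
    score = min(4, len(response.split()) // 50)
    return {"trait": trait, "score": score,
            "rationale": f"Placeholder rationale for '{trait}'."}

def grade_essay(response: str, rubric_traits: list) -> dict:
    """Grade each rubric trait individually, so the final report shows
    exactly where marks were gained or lost -- a transparent breakdown
    rather than a single opaque output."""
    breakdown = [grade_trait(trait, response) for trait in rubric_traits]
    return {"total": sum(item["score"] for item in breakdown),
            "breakdown": breakdown}
```

In a fuller version, additional agents (off-topic checks, citation analysis, introduction evaluation) would each contribute their own entry to the breakdown.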
Monet Slinowsky, who works as the Product Manager for Feedback Aide, goes further, explaining that generative AI offers additional capabilities that deepen the learning opportunities for grading.
“Personalized feedback is a key aspect of our offering,” she says. “It isn’t possible to provide that with classification models, which can only classify inputs into categories. We use generative AI tech to call out specific elements in the response—and can do so more consistently and explicitly.”
The data dilemma
AI cannot exist without a supply of data. In order to continuously improve, AI models need to capture and absorb prodigious amounts of high-quality material, which provokes fierce—and justifiable—debates around hot-button issues such as privacy, copyright, and security.
As an ISO-certified company that’s committed to personal data protection, how does Learnosity handle the data dilemma?
“We don’t store or train models with customer data,” says McCrossan. “But some companies using classification models do in order to ‘learn’ your style, which could be a privacy violation in some places. So even that objection isn’t all on one side.”
“We use Azure and Bedrock specifically so providers like OpenAI can’t use customer data for training. We also offer a plethora of options—like marking style, structure and citation checks, learner level, various models—so graders can customize to their use case while preserving privacy.”
Generative AI v Classification AI: Have we a winner?
As a pioneer of digital assessment and co-chair of ATP's AI Subcommittee (in addition to being EVP of Business Relations at Learnosity), John Kleeman has a keen eye for breakthroughs that can upend the status quo. However, when it comes to generative versus classification models, he believes it is not a binary debate: there is a place for both models.
“The true answer here is that there are pros and cons of LLMs and classification models, both have value, and both have a role in the present,” he explains. “But in my opinion, the future is likely to belong to LLMs—they are progressing rapidly and getting better all the time. That will only continue.”
His confidence in the future of generative AI is shared by McCrossan.
“Generative AI offers more predictive power than classification models,” he says, offering an example.
“If an essay question and rubric were to change significantly—let’s say the essay is now about Winston Churchill instead of Abraham Lincoln, with a different rubric—our solution can easily grade this without losing much predictive power, whereas a custom classification model would likely have to be retrained on that new essay prompt.”
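McCrossan’s point can be shown structurally: with a prompt-driven grader, the essay question and rubric are runtime inputs rather than training data, so swapping Lincoln for Churchill changes nothing about the model itself. A minimal sketch (the request shape and field names are assumptions for illustration; a real system would send this to an LLM):

```python
def build_grading_request(question: str, rubric: str, essay: str) -> dict:
    """Build a grading request. Changing the question or rubric only
    changes the inputs -- no retraining, unlike a per-prompt classifier."""
    return {
        "system": "You are an essay grader. Apply the rubric strictly.",
        "user": f"Question: {question}\nRubric: {rubric}\nEssay: {essay}",
    }

# Same grader, different prompts -- no new training run required.
lincoln = build_grading_request("Assess Lincoln's legacy", "4-point rubric", "...")
churchill = build_grading_request("Assess Churchill's legacy", "4-point rubric", "...")
```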
That additional flexibility makes generative AI the more exciting option in laying the foundations for future waves of innovation.
“Generative AI models form the basis of an agentic system, which is more capable of handling high-complexity tasks, resulting in greater autonomy and less friction for customers.”
The real questions to ask, then, may not be so much which AI model you use as how well you can use it. How much can you invest in testing, refining, integrating, and scaling? How well can you sustain the pace of improvement?
When looked at in this way, it all seems to boil down to a far older debate—one familiar and recurring in all tech circles: build or buy?