What the Research Shows About Blended Assessment Approaches
Ask most educators whether generative AI will take over the work of giving feedback and you’ll get one of two reflexive answers: a hopeful “finally, something to ease the marking load,” or a defensive “a machine can’t replace a teacher.” The interesting question isn’t whether AI or humans should give feedback. It’s which parts of feedback each does best, and where the human hand has to stay firmly on the wheel.
Recent studies make a strong case for blended assessment. Let AI do what it’s genuinely good at, keep humans where their judgement is irreplaceable, and never let the machine quietly take over the decisions that matter.
Half Your Students Are Already There
The starting point is scale. In the largest study to date, a cross-sectional survey of 6,960 students across four major Australian universities, Henderson and colleagues found that use of generative AI for feedback was almost exactly evenly split. Roughly half the cohort had sought feedback from AI, and half had not.1 Strikingly, of those who had not, more than a quarter simply did not realise it was possible.
In other words, this is no longer an emerging behaviour to plan for. It is current practice, happening with or without institutional design around it. The relevant question is not whether students will use AI for feedback. It is how educators shape that use so it strengthens rather than undermines learning.
What Students Value AI For, And What They Don’t
The same study gives an unusually clear picture of why students turn to AI, and where it falls short.1
On helpfulness, the two sources were close. Around 84% of students found AI feedback somewhat or very helpful, against 82% for teacher feedback. However, the texture differed. Only about a quarter rated AI feedback as very helpful, compared with over 40% for their teachers. Students valued AI for its accessibility, speed, and sheer volume. They could get feedback at any time, in digestible language, and with far less interpersonal risk. Asking an AI carries none of the anxiety of exposing a weakness to a teacher you respect. This point is echoed in the feedback-seeking research by Zhou, Carless, and Nieminen, where students described the absence of emotional cost as a genuine draw.2
Then comes the gap that matters most for high-stakes assessment. On trustworthiness, the picture diverged sharply. About 90% of students rated teacher feedback as trustworthy, against just 60% for AI, a large and statistically robust difference.1 Students consistently described teacher feedback as more relevant, more contextualised, more expert, and more credible. AI was easier and more comfortable, but the teacher was trusted.
Henderson’s team drew the obvious conclusion. AI and teacher feedback are valued for different reasons and serve different needs. As a result, they are complementary rather than interchangeable, and AI feedback should not be seen as a replacement for the teacher.1
The Finding That Should Stop “Let AI Decide” In Its Tracks
If the trust gap is the headline, the most important result for assessment design comes from a smaller but far more rigorous study. Weidlich and colleagues ran a randomised, blinded field experiment. They compared teacher, peer, and AI (LLM) feedback on the same student work, with the source hidden from recipients so that judgements reflected the feedback itself rather than any bias about where it came from.3
The results are a cautionary tale about confusing perception with quality. Students rated the teacher feedback as less fair and harder to accept, and they said they were less willing to revise their work based on it. Yet when researchers measured actual improvement in the revised work, teacher feedback produced the strongest gains in scientific argumentation and formal quality. AI feedback produced the smallest improvement of the three.3
Read that again. The feedback students found least comfortable was the feedback that helped them most, while the feedback that felt smooth and agreeable moved the needle least. As the authors put it, technology can augment but not replace the pedagogical expertise of human instructors.3 This matters most when the goal is complex, higher-order skill, which is precisely what clinical competence demands.
This is the empirical core of the case for keeping decisions human. If student-perceived ease and fairness do not track real learning gains, then they cannot be trusted as a proxy for a competence judgement. An AI optimising for what feels acceptable to a learner is optimising for the wrong target. Therefore, the result has to be calculated, weighed, and owned by a human who can tell the difference between feedback that flatters and feedback that develops.
Why A Blended Assessment Model Wins
None of this is an argument against AI. It is an argument for using it where it is strong, and reserving for humans what only humans can do. The research is increasingly explicit about how to draw that line, and it points firmly towards a blended assessment model.
Banihashem and colleagues set out a pedagogical framework for hybrid intelligent feedback. It describes a spectrum running from fully human-generated feedback to fully AI-generated, with the productive middle ground sitting between.4 Their three blended models are human-led with AI support, adaptive human-AI collaboration, and AI-led with human enrichment. All three share one feature throughout. Human involvement is always retained for enrichment and oversight. Their guiding principle is complementarity. AI delivers timely, scalable, data-driven input on routine aspects, while humans bring context, judgement, creativity, and emotional sensitivity. Crucially, they describe the educator’s role as reviewing and adjusting AI-generated feedback rather than blindly implementing it.4
That oversight is not a formality. In one study reported within that framework, an AI model was iteratively trained to produce feedback. Even so, around 30% of its statements still needed significant modification by an instructor before they were ready to reach students.4 In short, the AI drafts, and the human edits and approves. The sign-off is the human’s. Even the design of the AI’s output matters. An experimental comparison of directive, metacognitive, and hybrid AI-generated feedback found that the hybrid design, which pairs clear guidance with reflective prompts, drove the most student revision. However, it worked only because human designers had deliberately engineered it around established pedagogical principles.5
The Human Stays In Charge Of Judgement
Corbin, Tai, and Flenady give this a deeper rationale through their recognition-based framework.6 Effective feedback, they argue, rests on mutual recognition between teacher and student. This relationship of shared vulnerability, trust, and respect shapes how feedback is received and used. AI lacks any capacity for genuine recognition, so it sits outside that relationship. They call its contribution extra-recognitive. The practical implication is elegant. Let AI handle the extra-recognitive, scalable work, such as initial technical comments, drafting, and a low-stakes space to build confidence. Educators can then invest their time in the recognitive interactions that actually form a learner’s identity and judgement. In their phrase, this is integration without undermining.
Chen and Carless show this working in the classroom.7 In their study of teacher feedback dilemmas, AI helped mitigate the perennial problems of scale and timeliness. Even so, the teacher remained the orchestrator, designing the assessment criteria, crafting the prompts, and structuring the reflection. AI was an actant in the process, but never the author of the verdict. Both they and the hybrid-feedback researchers flag the same risk if that balance is lost. Over-reliance on AI erodes the very critical thinking, metacognition, and self-regulation that assessment is meant to develop.
Two Non-Negotiables For A Defensible Blended Model
Pulling the evidence together, a responsible blended approach to clinical and competency assessment rests on two lines that AI does not cross.
Decisions and results stay human. The Weidlich experiment shows that students are not good judges of which feedback helps them most. That is precisely why the final result should not be handed to an AI, which cannot reliably tell genuine quality apart from feedback that simply goes down well.3 Scoring, standard-setting, borderline judgements, and the determination of competence are all acts of contextual, expert evaluation. This is exactly the territory where human feedback proved most credible1 and most effective.3 AI can surface patterns, flag inconsistencies, and support psychometric analysis, but the decision is made and signed by a person who is accountable for it.
Feedback is signed off by a human. AI is an excellent first drafter and a tireless scaler of formative comment. However, its output is a starting point, not a finished product. The 30% revision rate is a reminder that unreviewed AI feedback can be inaccurate, generic, or contextually wrong.4 The educator reads it, corrects it, contextualises it to the learner and the station, and then releases it under their own authority. The recognition that makes feedback land, the sense that someone who understands the work has engaged with it, only exists when a human stands behind the words.6
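For teams building this into a marking platform, the two rules above can be expressed as a hard constraint rather than a policy note. The following is a minimal sketch in Python, not a description of any existing system: the names `FeedbackItem`, `sign_off`, `release`, and `modification_rate` are hypothetical, and the example text and reviewer are invented for illustration. It assumes one simple rule, that nothing reaches a learner until a named educator has reviewed it.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


class NotSignedOffError(Exception):
    """Raised when feedback is released without a human sign-off."""


@dataclass
class FeedbackItem:
    """One piece of feedback: an AI draft plus the educator's review (hypothetical model)."""
    learner_id: str
    ai_draft: str                          # AI output, treated as a starting point only
    final_text: Optional[str] = None       # text the educator approved
    signed_off_by: Optional[str] = None    # named, accountable human
    signed_off_at: Optional[datetime] = None

    def sign_off(self, educator: str, edited_text: Optional[str] = None) -> None:
        """Record the educator's review; they may keep the draft or rewrite it."""
        if not educator.strip():
            raise ValueError("a named human reviewer is required")
        self.final_text = edited_text if edited_text is not None else self.ai_draft
        self.signed_off_by = educator
        self.signed_off_at = datetime.now(timezone.utc)

    @property
    def was_modified(self) -> bool:
        """True if the approved text differs from the AI draft."""
        return self.final_text is not None and self.final_text != self.ai_draft

    def release(self) -> str:
        """Return the text for the learner, refusing if no human has signed off."""
        if self.signed_off_by is None or self.final_text is None:
            raise NotSignedOffError(f"feedback for {self.learner_id} has not been signed off")
        return self.final_text


def modification_rate(items: List[FeedbackItem]) -> float:
    """Share of reviewed items the educator changed before approving."""
    reviewed = [item for item in items if item.signed_off_by is not None]
    if not reviewed:
        return 0.0
    return sum(item.was_modified for item in reviewed) / len(reviewed)


# Usage with invented example data
edited = FeedbackItem("L-001", "Good structure.")
edited.sign_off("Dr Example", "Good structure; check the patient's allergies earlier.")
kept = FeedbackItem("L-002", "Clear handover.")
kept.sign_off("Dr Example")  # draft approved unchanged
print(modification_rate([edited, kept]))
```

The design choice worth noting is that `release` fails loudly rather than falling back to the AI draft, so an unreviewed comment cannot slip through by default. Tracking `modification_rate` gives a local counterpart to the published revision figure, showing how often drafts actually need a human edit in a given setting.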
What This Means For Clinical Assessment
For OSCEs, MMIs, and workplace-based assessment, the path forward is encouraging rather than threatening. AI can do real work that has long strained educators. It can generate and refine station material, produce readable formative feedback at volume, and support evidence-based, data-rich analysis of performance. Done well, a blended assessment approach frees examiners from the parts of the workload that scale poorly. It lets them concentrate on the judgement, the conversation, and the relationship that determine whether a learner can be trusted with patients.
The future of feedback, as Weidlich’s team conclude, is an ecosystem in which human judgement, peer collaboration, and artificial intelligence complement one another.3 The evidence does not show AI replacing the educator. Instead, it shows AI making the educator’s irreplaceable contribution more sustainable, provided the institution holds firm on the two things that must remain human: the decision, and the sign-off.
Blended is the way forward, and the human stays in the loop on what matters most.
1. Henderson, M., Bearman, M., Chung, J., Fawns, T., Buckingham Shum, S., Matthews, K. E., & de Mello Heredia, J. (2025). Comparing generative AI and teacher feedback: student perceptions of usefulness and trustworthiness. Assessment & Evaluation in Higher Education. https://doi.org/10.1080/02602938.2025.2502582
2. Zhou, H., Carless, D., & Nieminen, J. H. (2025). Students’ motivations for feedback seeking: the value of combined monitoring and inquiry. Assessment & Evaluation in Higher Education. https://doi.org/10.1080/02602938.2025.2596351
3. Weidlich, J., Gotsch, F., Schudel, K., Marusic-Würscher, C., Mazzarella, J., Bolten, H., Bütler, D., Luger, S., Wohlfender, B., & Maag Merki, K. (2025). Teacher, peer, or AI? Comparing effects of feedback sources in higher education. Computers and Education Open, 9, 100300. https://doi.org/10.1016/j.caeo.2025.100300
4. Banihashem, S. K., Noroozi, O., Khosravi, H., Schunn, C. D., & Drachsler, H. (2025). Pedagogical framework for hybrid intelligent feedback. Innovations in Education and Teaching International, 63(2), 554-570. https://doi.org/10.1080/14703297.2025.2499174
5. Alsaiari, O., Baghaei, N., Lodge, J. M., Noroozi, O., Gašević, D., Boden, M., & Khosravi, H. (2026). Directive, metacognitive, or a blend of both? A comparison of AI-generated feedback types on student engagement, confidence, and outcomes. Computers and Education: Artificial Intelligence, 10, 100553. https://doi.org/10.1016/j.caeai.2026.100553
6. Corbin, T., Tai, J., & Flenady, G. (2025). Understanding the place and value of GenAI feedback: a recognition-based framework. Assessment & Evaluation in Higher Education, 50(5), 718-731. https://doi.org/10.1080/02602938.2025.2459641
7. Chen, S. (Cindy), & Carless, D. (2026). Teacher feedback dilemmas and the use of GenAI: challenges or opportunities? Innovations in Education and Teaching International. https://doi.org/10.1080/14703297.2026.2648040




























