You cannot detect your way out of AI: the assessment redesign playbook

A professor pastes a student essay into an AI detector and gets back a number: 100% AI-generated. The problem is that the student wrote every word. This is not a rare malfunction. It is the predictable behavior of a tool that was never capable of doing the job it was bought to do, and the professor is now stuck either making an accusation they cannot prove or quietly ignoring a result they paid for. Either way, the detector has made the situation worse.

This is the trap most institutions walk into first, because detection feels like the natural response to a cheating problem. It is the wrong response, and the case against it is now overwhelming. The honest summary is short. You cannot detect your way out of AI. You can only design your way out. This piece is about how to do the second thing. In the DREO framework from the previous post, this is the Redesign step, examined up close, with the model and the concrete moves that make an assignment hold up.

Detection is a losing game

Start with why the first instinct fails, because until detection is off the table, nobody invests seriously in redesign.

The accuracy is not there, and the failures are not random. A widely cited Stanford study found that seven major detectors flagged 61% of essays by non-native English speakers as AI-generated, while flagging native writers at a far lower rate, and that bias has proven durable rather than a first-generation glitch. Later peer-reviewed work widened the picture, finding that neurodivergent writers are among the groups most likely to be wrongly flagged as well. This is systematic bias that turns a detector into a machine for accusing the students already most exposed in an integrity process: international, multilingual, and neurodivergent.

The institutions and researchers closest to the tools have drawn the obvious conclusion, and recent evidence has only hardened it. Vanderbilt University disabled Turnitin's AI detector back in 2023, noting that even the vendor's claimed 1% false-positive rate, applied across roughly 75,000 papers a year, would wrongly flag about 750 student papers. That early call has aged well. A 2025 University of Chicago Booth working paper by Jabarian and Imas, which tested detectors across nearly 4,000 human and AI texts generated by four frontier models, argued that detector use should be governed by a strict false-positive cap, precisely because even a small error rate produces large absolute harm once you multiply it across a real student body. A February 2026 study in the International Journal for Educational Integrity by Hadra, Cambridge, and Mesbah put hard numbers on it: across a balanced set of student, professional, AI, and mixed texts, the two leading commercial detectors managed only 69% and 61% overall accuracy, performed worst of all on the hybrid human-and-AI writing that is now everywhere, and the authors concluded the tools are unsuitable as the sole basis for a misconduct decision. Turnitin's own current guidance states that its AI detection should not be the sole basis for action against a student, and institutions including MIT Sloan and the University of Kansas tell staff not to treat a detector score as standalone evidence. The lived result, as The New York Times documented in 2025, is a new burden on honest students, who now have to prove a negative, that they did not use AI, often with no way to do it.
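The arithmetic behind the Vanderbilt figure is worth seeing on its own, because it shows how a rate that sounds negligible becomes hundreds of wrongful flags. What follows is a minimal sketch, not anyone's official method. The function name expected_false_flags is ours, the calculation makes the same simplifying assumption as the estimate quoted above (every paper counted is human-written), and the cohort sizes in the loop are hypothetical.

```python
def expected_false_flags(human_written_papers: int, false_positive_rate: float) -> float:
    """Expected number of human-written papers a detector wrongly flags as AI.

    Simplifying assumption: every paper counted here was written by a human,
    so each flag on one of them is a false accusation.
    """
    if human_written_papers < 0:
        raise ValueError("human_written_papers must be non-negative")
    if not 0.0 <= false_positive_rate <= 1.0:
        raise ValueError("false_positive_rate must be between 0 and 1")
    return human_written_papers * false_positive_rate


# Figures quoted above: roughly 75,000 papers a year, claimed 1% false-positive rate.
vanderbilt_estimate = expected_false_flags(75_000, 0.01)
print(f"Expected wrongful flags per year: {vanderbilt_estimate:.0f}")

# Hypothetical cohort sizes, for illustration only.
for papers in (500, 5_000, 50_000):
    flags = expected_false_flags(papers, 0.01)
    print(f"{papers:>6} papers -> about {round(flags)} wrongful flags")
```

Even at the vendor's own claimed rate, the count of wrongly flagged papers grows in step with volume, which is the reasoning behind a strict false-positive cap.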

There is one more detail that should end the conversation. Several detector vendors also sell tools that rewrite AI text to evade detection. The same companies profit from both sides of the arms race. Any strategy built on staying ahead in that race is a strategy you will lose, because the people you are racing are better funded than you and are sometimes the same people selling you the detector.

Detection is not worthless as a private signal that prompts a conversation. It is worthless as proof, and dangerous as a basis for sanctions. It cannot be the strategy. The strategy has to be design.

The real problem is validity, not cheating

Here is the reframe that makes redesign make sense, and it comes from the people who study assessment for a living.

Phillip Dawson, who co-directs the Centre for Research in Assessment and Digital Learning at Deakin University, makes a point that sounds academic and is actually the whole game: cheating is just one threat to assessment validity, and fixating on catching it pulls attention away from the real question, which is whether the assessment still measures what it claims to measure. For decades, a take-home essay was a valid proxy for a student's thinking because producing it required thinking. Generative AI severed that link. The essay can now be produced without thinking, which means the essay is no longer a valid measure of thinking, whether or not any individual student cheated. The validity problem exists even in a class of perfectly honest students.

This is why the popular institutional response, writing a policy that tells students when they may and may not use AI, does so little. In a 2025 paper with the blunt title "Talk is cheap," Dawson and Danny Liu of the University of Sydney argue that traffic-light policies and syllabus statements share one fatal limitation: they communicate rules rather than change the mechanics of the assessment. A rule that says "do not use AI on this essay" does not make the essay any harder to fake. It just moves the institution from a design problem to an enforcement problem, which lands you right back at the detector. You do not get out of this by telling students what not to do. You get out by building assessments where the instruction is beside the point.

Note: This article was researched and written by Justice Jones with AI assistance, then reviewed and edited by our team. External studies and sources are credited to their original authors. Examples from our own work reflect our organizational practice.

The two-lane model

The most useful framework for doing that comes from the University of Sydney and Dawson's group, and it is simple enough to run a whole institution on. Sort every assessment into one of two lanes.

The secured lane is the assessment of learning. Its job is to verify, under conditions you control, that a student can actually do something on their own. The defining features, in Dawson's terms, are authentication (you know it is the student) and control of circumstances (you know what help they had). This is where you put the things a degree certifies: the core capabilities a graduate must genuinely possess. Crucially, secured does not mean a return to nothing but proctored exams. An oral defense of the student's own work is secured. An in-class analysis written in the room is secured. A live demonstration, a practical, a viva on a project the student built over a term: all secured, and all far more authentic than a fill-in-the-bubble exam.

The open lane is the assessment for learning. Here, AI use is permitted, expected, and part of the point, because the skill being developed includes working well with the tools. You are not trying to keep AI out. You are trying to see the student's judgment in how they use it. The artifact alone cannot carry the grade in this lane because it is fakeable, so the assessment shifts to the process, the decisions, and the justifications around it.

The power of the model is that it stops the impossible argument. You are no longer trying to make every assignment both AI-proof and AI-embracing, which cannot be done. You decide, per outcome, which lane it belongs in, and then you design honestly for that lane.

The redesign playbook

Once an assessment is in a lane, the moves get concrete. This is the pattern library. None of it requires a detector.

For the secured lane, move the evidence to where you can see it.

Shift the weight to in-class work. Make the graded thinking happen in the room: timed writing, problem-solving, and analysis of a case handed out that day. The take-home can still exist as practice, but the assurance of capability comes from the supervised piece.

Add an oral component. A short viva, five to ten minutes in which the student explains and defends their submission, is the single most efficient AI-resilient move available. A student who outsourced the work cannot defend it. A student who did the work can defend it easily. It also scales better than people fear when used as a targeted check rather than a full re-examination.

Assess the process you witnessed. Staged drafts with in-class check-ins, a project built and reviewed over weeks, and a lab notebook create a chain of evidence that a last-minute generation cannot fake.

For the open lane, grade the judgment, not the output.

Require the working, not just the answer. Ask for the prompt history, the AI transcript, and a short reflection on what the student accepted, rejected, and changed. The authors of the AI Assessment Scale put it well: if you permit AI, grade the student's decisions, checks, and justifications, not their ability to push buttons. The deliverable becomes the thinking around the tool, which the tool cannot supply.

Anchor the task in the student's own context. An assignment that requires this week's specific class discussion, a student's own field placement, local data, or a personal position they must defend is hard for a general model to complete well because the model lacks the necessary context. This is not an arms-race trick. It is just a well-situated assessment.

Make AI the object of critique. Have students generate something using AI, then find its errors, challenge its reasoning, or improve it. Now the AI output is the raw material, and the assessed skill is the evaluation of it, which is exactly the capability you want graduates to have.

The thread running through both lanes is one principle: assess things AI cannot do for the student, or assess the judgment the student applies to AI. Everything in the playbook is a version of that sentence.

Where this goes wrong

Three honest objections, because a redesign sold without them does not survive contact with a real faculty.

It is not just exams in disguise. The most common pushback is that the secured lane is a nostalgia trip back to the invigilated hall. It is not, and selling it that way kills it. Oral defenses, in-class application, and authentic demonstrations are secured and often more valid than the essay they replace, because they assess the capability directly rather than through a proxy. If your secured lane is only proctored exams, you have done it badly.

It does not scale on heroics. Redesigning assessment course by course, on top of a full teaching load, is more than most faculty can carry, and a large share of teaching is done by contingent staff with no time funded for this. The research is clear-eyed about this: a 2025 study by Dawson and colleagues found that redesigning assessment well takes a "village," a collaborative team spanning the discipline, AI capability, and assessment-design expertise. That is an argument for program-wide redesign with real support, not for asking individual instructors to be heroes.

The secured lane has its own equity questions. Proctoring tools carry documented bias and accessibility problems, and oral assessment can disadvantage some students if designed carelessly. Secured does not automatically mean fair. The lane still has to be designed inclusively, with accommodations and varied formats, or it recreates the very harm we dropped the detector to escape.

Make it a program, not a heroic act

The reason most assessment redesign efforts stall is that they get dumped on individual faculty as a side quest, and the validity problem is too big for side quests. The institutions that get this right treat it as program-wide work, sequenced and supported, the Redesign step of a real readiness program rather than a memo asking everyone to fix their own courses by fall.

This is the work we do with institutions. The program-wide redesign that the research calls a "village" is exactly the co-design model 24/7 Teach uses: we bring the AI and assessment-design expertise, your faculty brings the discipline, and we rebuild the highest-stakes assessments together, then hand the capability off so your people run it without us. Across more than 50 organizations we have supported, the redesigns that hold are the ones the institution's own people built, with us, on their real courses. Human-led, AI-facilitated, applied to the assessment itself.

You cannot detect your way out of this. You cannot policy your way out of it either. You design your way out, one lane at a time, and you do it as a program rather than a pile of individual emergencies. The institutions that start now will have assessments they can stand behind in two years. The ones still buying detectors will have a stack of false accusations and a validity problem they never addressed.

About the author

Justice Jones is a Learning & Development Leader, Director of AI Integration, former K-12 principal, and the co-founder and CSO of 24/7 Teach. He built the company to close the gap between what schools teach and what teens and professionals need to succeed, and he leads AI strategy at its sister company, Naomi-AI, a K-8 classroom platform. Through 24/7 Teach, he and his team have supported more than 50 organizations and placed more than 600 adults in new careers.