We abandoned trial by ordeal because we decided that judgment required a human behind it. Eight centuries later, we are being asked to reverse the lesson.
In medieval Europe, when a court could not determine guilt, it outsourced the decision to God.
The method was called trial by ordeal. The accused would grasp a bar of red-hot iron and carry it nine paces. Their hand would be bandaged and inspected three days later. If the burns were healing cleanly, God had spoken: innocent. If the wound had festered, guilt was established. Variations involved boiling water, freezing rivers, consecrated bread. The specifics varied. The premise did not: remove fallible human judgment from the equation and delegate the verdict to a process that was beyond manipulation, beyond bias, beyond the frailties of the people involved.
The ordeal was objective. It was also monstrous — not because it was inaccurate, but because accuracy was beside the point. What it missed was that judgment is not a test result. It is a relationship between the person being judged and the person doing the judging. The ordeal severed that relationship. It placed the defendant before a process that could not hear them, could not be moved by their circumstances, and could not be held to account if it was wrong.
In 1215, the Fourth Lateran Council prohibited clergy from participating in ordeals. The decision was theological — the Church decided that asking God to perform on demand was presumptuous — but its practical consequence was enormous. Without clerical sanction, the ordeal lost its legitimacy. Courts across Europe were forced, over the following century, to develop something new: systems of human judgment. Juries. Examining magistrates. Rules of evidence. The infrastructure of people judging people, accountable to other people, within a framework that could be challenged, reformed, and held to its own standards.
It took eight centuries to build that infrastructure. We are now being asked to consider whether machines should replace it.
The question deserves a serious answer.
I. The Honest Case
The argument for algorithmic judgment is not trivial, and pretending otherwise weakens the response.
Human judges are inconsistent. Sentencing disparities between judges in the same jurisdiction, for the same offense, with the same facts, can span years of prison time. Racial bias in judicial decision-making is extensively documented in criminal justice research — researchers actively dispute magnitude and mechanism, but the directional evidence is not seriously contested.
Algorithms, the argument goes, are immune to mood and unconscious prejudice. They process the same inputs the same way every time. They don’t have bad days. They don’t carry the baggage of personal experience into the courtroom. If the goal is consistency — and consistency is a precondition of fairness — then a well-trained model should outperform the average human judge on the metrics that matter.
This case is strongest in domains where the decision is genuinely mechanical: bail calculations based on flight risk, scheduling, case routing, flagging procedural errors. There are places in the legal system where human judgment adds friction without adding value, and automating those places raises different and narrower questions.
But the case does not stop there. Its most ambitious version holds that the final judgment itself — guilty or not guilty, five years or ten, custody or freedom — should be made by a system that is demonstrably more consistent than the humans currently making it.
That version needs a different kind of answer.
II. The Consistency Trap
The answer is not that machines are biased too, though they are. It is that consistency and fairness are not the same thing, and the conflation of the two is where the argument breaks down.
An algorithm trained on historical sentencing data does not transcend the biases in that data. It inherits them — launders them through mathematics — and presents the result wearing the costume of objectivity. When a model designed to predict reoffending assigns higher risk scores to Black defendants because the training data reflects decades of disproportionate policing, it is not being neutral. It is encoding the past’s injustices and calling them patterns.
The more sophisticated response is to argue for better data — cleaner proxies, more careful variable selection, datasets stripped of the markers that correlate with race. This is not a frivolous argument. Methodological care does reduce some forms of measurable bias. But it runs into a problem that cannot be engineered away: the question of what the model is actually trying to predict. A machine learning model is trained by feeding it historical examples and telling it what outcome to learn from — this is called the training target. For these risk tools, the target is typically something like reoffending rates, failure-to-appear rates, or past sentence lengths, as those outcomes were recorded by a system that was already skewed. Cleaning the inputs doesn’t fix this. If the outcome you’re learning from was shaped by decades of discriminatory policing and sentencing, the model doesn’t transcend that history — it learns it. Cleaner inputs fed into a skewed target produce cleaner-looking bias, not less of it. The historical injustice is not noise in the data. It is the signal the model is learning.
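To make this concrete, here is a minimal, purely hypothetical sketch: invented numbers, a toy logistic regression, no real dataset or deployed tool. The inputs are nominally race-blind and the underlying offending rate is identical across groups, yet the model still scores one group as higher risk, because the recorded target (rearrest) reflects who was being watched.

```python
# Hypothetical sketch: even with race excluded from the inputs, a model trained on
# an outcome recorded by a skewed system reproduces the skew. All numbers invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 50_000

# Protected attribute (never shown to the model).
group = rng.integers(0, 2, n)                       # 1 = heavily policed group

# "Clean" inputs: prior arrests and neighborhood, both shaped by policing intensity.
policing = np.where(group == 1, 2.0, 1.0)           # surveillance multiplier
priors = rng.poisson(0.5 * policing)                # arrests, not offenses
neighborhood = (rng.random(n) < 0.3 + 0.3 * group).astype(float)

# Underlying offending rate is identical across groups ...
offended = rng.random(n) < 0.2
# ... but the *recorded* target (rearrest) depends on being watched.
rearrested = offended & (rng.random(n) < 0.4 * policing)

X = np.column_stack([priors, neighborhood])         # race is not a feature
model = LogisticRegression().fit(X, rearrested)

scores = model.predict_proba(X)[:, 1]
print("mean risk score, group 0:", scores[group == 0].mean().round(3))
print("mean risk score, group 1:", scores[group == 1].mean().round(3))
# Group 1 comes out "riskier" even though true offending is equal:
# the model learned the recording process, not the behavior.
```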
But the deeper issue is not the bias itself — human judges carry bias too, and at least the algorithm’s bias can, in theory, be examined. The deeper issue is what happens to the possibility of challenge.
When a human judge delivers a sentence that seems disproportionate, there is a human to confront. The defendant’s lawyer can argue that the judge misweighed a factor, misread the circumstances, or carried an assumption that doesn’t apply. The appellate court can review the reasoning. The judge can be questioned, publicly, about why — and the answer must be given in terms that another human being can evaluate and contest.
When an algorithm delivers a sentence, there is no why in the human sense. There is a model, a dataset, and an output. The defendant is told, in effect, that a process they cannot understand has determined their fate on the basis of patterns they cannot see in data they cannot examine. The bias doesn’t disappear. It becomes invisible — and invisibility makes bias not just dangerous but structurally unchallengeable.
A biased judge can be reformed. A biased algorithm looks, from the outside, like it’s working.
III. The Shaking Hands
Consider what judgment actually requires.
A man in his late twenties stands before a sentencing hearing. He has been convicted of armed robbery — his second offense. The facts are not in dispute. Under the sentencing guidelines, the range is four to eight years.
A machine, processing these inputs, would weigh the prior conviction, the weapon, the financial loss to the victim, the statistical likelihood of reoffense, and produce a number within the range. Consistently, every time, for every defendant with this profile. That is the feature.
A human judge works differently. She reads the pre-sentencing report — a document prepared by probation officers that summarizes the defendant’s history, circumstances, and background — and notices that the defendant’s first offense occurred at nineteen, three months after aging out of the foster care system. She reads the letter from his employer — a warehouse supervisor who took a chance on him after prison, who says he showed up every day for two years before relapsing into old networks when the warehouse closed during a regional economic downturn. She listens to the defendant’s statement, which is halting and inarticulate and clearly unrehearsed, and she watches his hands, which are shaking.
She also reads the victim impact statement. The convenience store clerk who had a gun pointed at her face. Who now locks her doors three times before sleeping. Who wants the maximum sentence and has every right to want it.
The judge holds both of these realities simultaneously — not averaging them, not optimizing between them, but sitting in the tension between mercy and accountability long enough to make a decision she will have to defend, in writing, to an appellate court — a higher court that reviews whether the sentence was lawful and proportionate — and to her own conscience.
She sentences him to five years, with a recommendation for vocational training. She explains her reasoning from the bench: the seriousness of the offense, the harm to the victim, the mitigating circumstances of the defendant’s history, and the court’s judgment that this sentence serves both punishment and the possibility of rehabilitation.
She may be wrong. The five years may be too few — he may reoffend. The five years may be too many — they may destroy the fragile structure of his life in ways that guarantee reoffense. She knows this. She makes the call anyway, and she signs her name to it.
This is not inefficiency. This is the act of judging. It requires sitting with another human being in a moment of consequence and making a decision that is irreducible to its inputs — because the inputs do not contain the meaning. The prior conviction is a data point. What it means that a nineteen-year-old with no family committed a crime and then spent two years trying to build a life before the ground collapsed again — that is not in the data. It is in the interpretation, and interpretation requires someone who has lived a human life and can read the situation with the full weight of that experience.
No model has this.
IV. The Score in the Room
Before turning to accountability, there is a version of this argument that deserves its own answer — because it is more sophisticated than the full-replacement case, and more dangerous.
No jurisdiction is currently proposing that an algorithm replace a judge. What is being deployed, widely, is something subtler: a risk score. A number, generated by a model like COMPAS or the Public Safety Assessment — tools used in bail and sentencing hearings across dozens of U.S. jurisdictions — that appears in the pre-sentencing report alongside the facts of the case. The judge reads it. The judge then decides.
The human is still in the chair. The bench is not empty. This, the argument runs, addresses the concern about answerability — a person is still deciding, still signing their name.
But the score is in the room.
Researchers who study how people respond to algorithmic recommendations have documented a consistent pattern they call automation bias: the tendency to defer to a machine’s output even when independent judgment would yield a different result. It is not unique to judges — it shows up in radiologists, pilots, financial analysts. The score does not mandate an outcome. It does not need to. It anchors one. A judge who might have sentenced at five years reads a “high risk” classification and gravitates toward seven. The reasoning she writes still sounds like her reasoning — she still cites the prior conviction, the weapon, the harm to the victim. The score does not appear in the written judgment. It simply adjusts the field of gravity inside the decision.
This is subtler and more insidious than full replacement. Full replacement at least makes the question explicit — a defendant can point to the algorithm and demand to know its methodology. When the algorithm sits in a report that a human judge read before rendering a decision, neither the defendant nor the appellate court has clean access to what happened. The human judgment is now partially downstream of a black box, but it is the human judgment that gets reviewed.
The hybrid model preserves the appearance of human accountability while quietly eroding its substance. It is the ordeal’s logic applied with more patience: not to remove the judge, but to give the judge a number and wait.
The line between tool and judge is not always crossed in one step. It is sometimes crossed in increments, each one defensible, until the crossing is complete.
V. The Accountability Void
Invisible bias cannot be challenged. But there is a second failure, distinct from the first: when algorithmic judgment goes wrong, it may not just be unchallengeable — it may be uncorrectable, because the architecture that allows correction has quietly dissolved.
When the judge is wrong — and judges are wrong, regularly, consequentially — there is a person who answers for it.
The defendant appeals. The appellate court examines the reasoning. If the sentence was based on an incorrect application of law, it is reversed. If the judge demonstrated bias, they can be censured, retrained, or removed from the bench. The system is imperfect, slow, and often fails to correct its own errors. But the architecture contains the mechanism for correction, because at every stage there is a human who made a decision and can be asked to justify it.
Automate the judgment and that architecture collapses.
When an algorithm produces a wrongful conviction, who is responsible? The engineers who built the model? They will say they built a general tool, not a judge. The company that deployed it? They will point to the procurement process and the government’s decision to use it. The officials who approved its adoption? They will note that they relied on the vendor’s accuracy metrics. The investigation remains open.
This is a structural problem, not an accident. It is not solved by better contracts, clearer liability clauses, or more careful procurement. The problem is that the act of judgment has been distributed across a pipeline in which no single node bears enough of the decision to bear the responsibility. Everyone contributed. No one decided.
The defendant, meanwhile, is in prison.
AI doesn’t just automate decisions. It automates the diffusion of blame — and the result is a life determined by patterns in data that no one in the room can explain and no one outside it can challenge.
And when that determination is wrong, the error compounds — because the system that made it was never designed to notice.
VI. The Asymmetry
The accountability void described above might be tolerable if the errors it conceals were minor and evenly distributed. They are neither. A wrongful conviction is not a service disruption. It is the destruction of a human life — and it disproportionately falls on those who already have the least power to contest it. The people most likely to be misclassified by a predictive model are the people whose lives are most poorly represented in the training data: the poor, the marginalized, the people whose circumstances don’t fit neatly into the patterns the model learned from a dataset shaped by decades of structural inequality.
The asymmetry gets worse at scale. A human judge who consistently delivers unjust sentences will eventually be noticed — by defense attorneys, by appellate courts, by journalists. The inconsistency itself is the signal. An algorithm that is systematically wrong in ways that are statistically invisible doesn’t look broken. It looks like it’s working.
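The statistical invisibility can be shown with a toy simulation, again with invented numbers and no claim about any real system: overall accuracy looks respectable while the false-positive rate for one group is three times higher than for the other.

```python
# Hypothetical illustration: aggregate accuracy can look healthy while error
# is concentrated on one group. All numbers are invented for the sketch.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
group = rng.integers(0, 2, n)          # 1 = poorly represented in training data

# True outcome: same base rate in both groups.
truth = rng.random(n) < 0.2

# A scorer that is accurate for group 0 but over-flags group 1.
flag_prob = np.where(truth, 0.75, np.where(group == 1, 0.30, 0.10))
flagged = rng.random(n) < flag_prob

accuracy = (flagged == truth).mean()
fpr_0 = flagged[(group == 0) & ~truth].mean()
fpr_1 = flagged[(group == 1) & ~truth].mean()
print(f"overall accuracy: {accuracy:.2f}")            # ~0.79, looks fine
print(f"false positive rate, group 0: {fpr_0:.2f}")   # ~0.10
print(f"false positive rate, group 1: {fpr_1:.2f}")   # ~0.30
# The single headline number hides a threefold disparity in wrongful flags.
```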
You cannot appeal to a system that believes it is correct.
VII. What Cannot Be Automated
The argument against machine judges is not that machines are insufficiently accurate. Accuracy is the wrong frame entirely. The argument is that justice is not an optimization problem.
Justice is a social act. A verdict is not merely a classification — guilty, not guilty, a number of years. It is a message from a community to one of its members, mediated by a person who bears the weight of that community’s authority and can be held accountable for how they wield it. The defendant stands before a human being who represents the society that will punish or release them. That human being must look at them, hear them, and decide — and the decision must be given in terms the defendant can understand, contest, and appeal.
When a machine delivers a verdict, this contract is quietly broken. The defendant stands before a process that cannot see them, cannot be moved by anything they say — not because it has weighed their words and found them insufficient, but because it is structurally incapable of being moved. The verdict becomes a product rather than a judgment: consistent, efficient, and morally hollow.
The legitimacy of a legal system does not rest on its accuracy rate. It rests on the premise that a human being decided, that the decision can be explained, and that the person who decided can be held to account. Remove any of these and you do not have a more efficient justice system. You have a processing pipeline.
A machine that surfaces precedents, flags inconsistencies, or models sentencing ranges is a tool. A machine that delivers a binding verdict is something else entirely. The distance between the two is not a spectrum. It is a line — and crossing it is not a technical limitation that better models will resolve. It is a structural violation of what it means to judge.
VIII. The Question Underneath
The drive to automate judicial decision-making reveals something uncomfortable about how we think about justice.
If we believed that judging was a valuable human act — difficult, consequential, worthy of investment — we would be talking about how to make human judges better: better training, better data, better working conditions, better accountability mechanisms, better representation on the bench. We would be asking why judges are overworked, why public defenders are underfunded, why the system produces inconsistency not because judgment is inherently flawed but because we have systematically under-resourced the conditions that good judgment requires.
Instead, we are talking about replacing them. And the pattern of investment makes the values legible. In jurisdictions across the United States, algorithmic risk tools have been funded, procured, and integrated into court systems while public defender offices remain catastrophically under-resourced — with caseloads so high that meaningful individualized review is impossible for a large fraction of defendants. The technology that simulates judgment is funded. The human infrastructure that judgment actually requires is not. When a system consistently underfunds the conditions for good human judgment and then points to the resulting inconsistency as evidence that human judgment should be replaced, it is not solving a problem. It is completing one.
The drive to automate judgment is not a sign that we have too much faith in machines. It is a sign that we have lost faith in the act of judging itself — in the proposition that a human being, sitting with the full weight of another person’s fate, examining evidence, listening to arguments, and making a call they must defend and live with, is doing something that matters. Something that cannot be done faster, cheaper, and at scale.
In 1215, the Church told Europe that it could no longer ask God to judge on command. The courts, forced to improvise, invented something harder: human beings judging human beings, within rules, under scrutiny, accountable for their errors. It was slow, inconsistent, and expensive. It was also the foundation of every legal system that followed.
Eight centuries later, the temptation returns — dressed differently, speaking the language of accuracy and bias reduction rather than divinity, but structurally identical. Delegate the verdict to a process that is beyond human frailty. Remove the fallible person from the chair. Trust the output.
We tried that once. The process was objective. The burns were examined after three days. If they festered, you were guilty.
The bench should not be empty. Not because we have solved the problem of human bias — we have not, and may never. But because the alternative — a verdict delivered by something that cannot be questioned, cannot be moved, and cannot be held to account — is not justice made faster.
It is the ordeal, returned.
The question is never whether machines are fast enough, consistent enough, or accurate enough. The question is whether justice can survive being given to something that cannot answer for what it decided.