The Record

The Calibrator, Before and After

A public record of the v3 rebuild. Part one (Before) written 2026-06-11, while the old calibrator was still running. Part two (After) written 2026-06-11, the day v3 shipped.

The Rising Compass is changing how it reads songs. Not what it values, and not the tenets, which stay word for word. What changes is the procedure: the steps the reading agent walks through, who decides the final number, and how repeat readings get combined. A change that deep should not happen quietly. This document captures how the calibrator worked on the day before the rebuild began, because once the new one exists, the old one is gone and the "before" can never be reconstructed honestly. The "after" half gets appended at deploy, and the two halves together are the record.

This is written for a reader with no technical background. Where the methodology page explains what the compass believes, this page explains what the machinery did.

The v2 calibrator, as it ran on 2026-06-11

Before

How a song was read

Every reading started the same way: the agent was handed lyrics with the instruction to forget everything it knows about the artist. No reputation, no chart history, no cultural baggage. The page is all there is.

From there it walked a fixed sequence. First it named the song's dominant arc, the thing the lyrics are fundamentally about. Then it argued both directions from a starting position of zero: a case for moving the song up, citing specific lines that match the higher tenets, and a case for moving it down, citing lines that match the lower ones. If the upward case rested on personal growth, a transformation check asked whether the narrator actually changed by the end of the song or merely wished to. A contamination check then looked for isolated dark moments inside otherwise clean songs. A summary check tightened the public one-line description. Last came the verdict: a tier, a number, and a one-sentence reason.

The order matters, and it is the part v3 changes most. The old procedure went straight from reading into analysis. There was no step where the agent recorded its first impression before the reasoning started. That gap turned out to be the source of the calibrator's strangest failures, which part two will return to.

How the charge was produced

The agent itself named both the tier (the color band) and the charge (the precise number from -100 to +100). To keep the number honest it was given fifteen reference points, audited example placements like "-42: self-deprecation ratified as identity" or "+80: cosmic oneness with calm certainty," and asked to place each new song relative to them.

Then the server stepped in, but only as a janitor. If the agent's number fell outside the range of the tier it had named, the server silently clamped the number back inside the band. A song called Decent with a charge of -40 became a song called Decent with a charge of -24. Whatever tension existed between the tier and the number, the clamp erased it before anyone could see it. Disagreement between the two was actually useful information, a sign the reading was strained, and the system threw it away.
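
For readers who want to see the janitor at work, the clamp can be sketched in a few lines of Python. This is an illustration, not the server's actual code: the function name is made up, and the upper edge of the Decent band (+24) is an assumption. Only the lower edge of -24 comes from the example above.

```python
def clamp_to_band(charge: int, band_low: int, band_high: int) -> int:
    """Force a charge back inside the band of the tier the agent named."""
    return max(band_low, min(band_high, charge))


# Assumed band for Decent. Only the -24 edge appears in the record;
# the +24 edge is a guess made for the sake of the example.
DECENT_BAND = (-24, 24)

agent_charge = -40
published = clamp_to_band(agent_charge, *DECENT_BAND)

# The gap between what the agent said and what was published is the
# "tension" the old system discarded without logging it anywhere.
lost_tension = agent_charge - published
```

Nothing in this sketch records `lost_tension`, and that is the point: in the old design the disagreement existed for an instant and then vanished.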

How the tier was chosen

Two different answers, depending on the path. On a single fresh reading, the tier was whatever the agent said it was. On the consensus path, where multiple readings of the same song get merged, the tier was derived mathematically from the merged charge. Two sources of truth for the same fact. They usually agreed. When they did not, nobody was watching the seam.

There was one more quiet failure here: if the agent ever returned a tier name the system did not recognize, the song silently defaulted to Decent, the middle of the scale. No error, no flag for review. Just a shrug rendered as a verdict.

How repeat readings combined

Every reading of a song was logged as a run, and a song's public charge drifted toward the combined view of its runs. The combining method was a weighted average, where each run's weight was the agent's own self-reported confidence. The agent graded its own homework, and the grade carried real power.

Two problems lived in that design. An average lets one strange reading drag the result, so a single outlier run could pull a settled song toward a tier boundary. And self-reported confidence turned out to be nearly worthless as a quality signal: the calibrator's worst documented misread shipped with a confidence of roughly 0.9. The agent was most sure of itself precisely when it had reasoned its way into the wrong answer.
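
The arithmetic behind outlier drag is easy to show. The sketch below assumes the simplest form of a confidence-weighted average; the run values are invented for illustration and are not taken from the audited record.

```python
def weighted_charge(runs: list[tuple[float, float]]) -> float:
    """Confidence-weighted mean of (charge, confidence) pairs."""
    total_weight = sum(conf for _, conf in runs)
    if total_weight == 0:
        raise ValueError("no usable runs")
    return sum(charge * conf for charge, conf in runs) / total_weight


# Hypothetical runs: three settled readings near +30, each with
# modest confidence, and one strange reading with high confidence.
settled = [(30.0, 0.6), (32.0, 0.6), (28.0, 0.6)]
outlier = (-60.0, 0.9)

before = weighted_charge(settled)              # roughly +30
after = weighted_charge(settled + [outlier])   # roughly 0
```

One confident outlier cancels three agreeing readings, because the weight it carries is the weight it awarded itself.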

Older songs also carried "seed" runs, snapshots of whatever the song's reading was before run-logging existed. Some of those snapshots came from the early chart-reading era and were stale, yet they kept voting in the average forever.

The failure modes, named plainly

These are the documented ways v2 got things wrong, the case file that motivated the rebuild.

The hatch. A song with a sympathetic surface (heartbreak, a party, a love story) and a harmful payload running through every chorus could slip the payload into the "contamination" footnote while keeping a kind tier. The payload should have set the tier. The agent found a hatch in the rules and the structure let it.
Confident self-deception. On certain songs the agent's first instinct was correct, and its long chain of careful reasoning then talked it out of the right answer. Because nothing recorded the first instinct, the drift was invisible. The misread shipped with high confidence attached.
The route error. A song reaching a high tier through encouragement, or witness, or collective protest was sometimes judged by the wrong checklist, like capping an encouragement song for failing a personal-transformation test that was never meant to apply to it. The rules had carve-outs for this stacked into a maze of scope disclaimers, and the maze occasionally lost.
The silent green default. Described above. Unrecognized output became Decent instead of becoming a flag for human review.
The clamp. Also described above. Tier and number could disagree, and the system hid it.
The wasted reading. A song submitted with no lyrics still triggered a full reading request, an expensive round trip whose only possible answer was "cannot calibrate."
Outlier drag and stale seeds. The averaging problems from the previous section.

None of these were daily events. The calibrator's readings were adjudicated by a human every single day, and most verdicts held up. But each failure in this list happened at least once in the audited record, and each one traces back to the same root: the procedure let deliberation run unchecked, and the final number belonged to the wrong party.

What the charge meant, before

Through all of this, the meaning of the published number never wavered: -100 is all-out war, with the self or with others, and +100 is all-out peace. That meaning survives the rebuild untouched. What changes is how the number gets made.

The v3 calibrator, as it shipped on 2026-06-11

After

The rebuild kept every value the compass holds and rewrote the procedure around them. The tenets read the same, word for word. What follows is what changed underneath them.

How a song is read now

The reading still opens the same way, with the agent handed lyrics and told to forget the artist. The first real change comes one step later, and it is the most important change in the whole rebuild.

Before any argument starts, the agent now records its gut read: a single honest first number, the impression the song leaves before reasoning gets a vote. That number is written down and kept. Everything after it is checked against it. When the careful reasoning ends up far from the gut, more than a quarter of the scale away, the system treats that gap as a warning rather than a result. The reading is reconciled, and if the gap holds it gets escalated for a harder look instead of shipped. The old calibrator let deliberation run wherever it wanted because nothing remembered where it started. The new one remembers, and the memory is load-bearing.
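
The divergence check itself is small. The sketch below assumes that "a quarter of the scale" means 50 points on the 200-point range from -100 to +100; the names are illustrative, and the real threshold and escalation logic may differ.

```python
SCALE_WIDTH = 200                    # charges run from -100 to +100
DIVERGENCE_LIMIT = SCALE_WIDTH / 4   # assumed reading of "a quarter of the scale"


def needs_reconciliation(gut_read: float, reasoned_charge: float) -> bool:
    """True when the reasoning has wandered too far from the first impression."""
    return abs(reasoned_charge - gut_read) > DIVERGENCE_LIMIT


# A gut read of -40 followed by a reasoned +20 is a 60-point gap,
# so this reading would be reconciled rather than shipped.
flagged = needs_reconciliation(gut_read=-40, reasoned_charge=20)
```

The check only works because the gut read is stored before the reasoning begins. Computed afterward, it would be one more conclusion of the same argument.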

From there the agent walks a clearer path than before. It makes an explicit decision about how the song reaches whatever tier it reaches: through encouragement, through witness, through collective protest, through personal transformation, and so on down a fixed set of routes. Naming the route up front is what closes the old "wrong checklist" failure, because the route decides which tests even apply. An encouragement song is no longer quietly failed for not being a transformation song.

Then it reads the song along two axes at once, and it is required to say whether any harm it finds is isolated or runs through the whole song. That single required call, pervasive or not, is the fix for the worst old failure. A harmful message carried by every chorus can no longer be filed away as a footnote while the song keeps a kind tier. Pervasive harm sets the tier, every time, with no exceptions and no discretion.

Last, instead of inventing a number against a handful of example placements, the agent places the song against a standing table of precedents, audited reference readings the compass has already committed to. It does not copy a precedent's number. It argues that the new song sits above this one, below that one, between these two. The number falls out of where it lands.

How the charge is produced now

This is the cleanest break from the old design. The agent no longer names the final number at all.

The agent reports its pieces: the gut read, the route, the harm and whether it is pervasive, a center point for the charge, and four small adjustments that nudge the number up or down within the band. The server takes those pieces and computes the published charge with fixed arithmetic. The center is the anchored read; the four adjustments are bounded so they can fine-tune the position but can never cross into another tier or overpower the center. The math is the same every time and belongs to no one's opinion.
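
This record does not publish the exact arithmetic, so what follows is only one way such a composition could work. The per-adjustment cap, the function name, and the band passed in are all assumptions. Note that the bounding here applies to the adjustments, not to a number the agent guessed.

```python
MAX_NUDGE = 3  # assumed cap on each adjustment; the real bound is not stated


def compose_charge(center: int, adjustments: list[int],
                   band: tuple[int, int]) -> int:
    """Build the published charge from a center and four bounded nudges."""
    if len(adjustments) != 4:
        raise ValueError("expected exactly four adjustments")
    low, high = band
    if not low <= center <= high:
        raise ValueError("center must sit inside its own band")
    # Each nudge is limited on its own, so no single one can dominate.
    total = sum(max(-MAX_NUDGE, min(MAX_NUDGE, a)) for a in adjustments)
    # The combined nudge may use the room left inside the band, no more.
    return max(low, min(high, center + total))


# Hypothetical reading: center +10 in an assumed band of -24 to +24.
charge = compose_charge(10, [2, -1, 5, 0], band=(-24, 24))
```

Whatever the real numbers are, the same inputs always give the same output, and no party in the process gets to pick the result directly.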

The old janitor-clamp is gone with it. There is nothing left to silently clamp, because the number is built from the parts rather than guessed and then corrected.

How the tier is chosen now

One answer now, on every path. The server derives the tier from the composed charge, using one tier function and only that function. The agent never names the tier. The two-sources-of-truth seam from the old design, where a fresh reading trusted the agent and a merged reading trusted the math, is closed because only the math decides.

The silent green default is gone too. If a reading comes back malformed or unreadable, it no longer collapses into Decent and ships. It is held back as a reading that needs a human, which is what it always should have been.
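
In code, the difference between the old shrug and the new hold comes down to what happens on the failure branch. The sketch below is hypothetical: the field name `center` and the result type are invented for illustration, and the real validation is surely more thorough.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReadingResult:
    charge: Optional[float]   # None when the reading cannot be used
    needs_human: bool
    reason: str = ""


def accept_reading(raw: dict) -> ReadingResult:
    """Accept a well-formed reading, or hold it for a human. Never default."""
    center = raw.get("center")
    # bool is a subclass of int in Python, so rule it out explicitly.
    if isinstance(center, bool) or not isinstance(center, (int, float)):
        return ReadingResult(None, True, "center missing or not a number")
    if not -100 <= center <= 100:
        return ReadingResult(None, True, "center outside the scale")
    return ReadingResult(float(center), False)


held = accept_reading({"center": "green"})   # held for review, not shipped
```

There is no branch in this function that returns a mid-scale value on failure. An unusable reading has no charge at all until a person looks at it.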

How repeat readings combine now

A song's public charge still moves toward the combined view of its logged readings, but the combining method is now the median of the live readings instead of a weighted average. The median ignores a lone outlier, so one strange reading can no longer drag a settled song toward a tier boundary. The agent's self-reported confidence no longer carries any weight in the result. It is still recorded, and it still helps decide whether a reading should be escalated, but the agent is no longer allowed to grade its own homework. The stale snapshots from the early chart-reading era are excluded once a song has enough fresh readings, so they no longer keep voting forever.
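
A minimal sketch of the new combining step, using Python's standard `statistics.median`. The threshold for "enough fresh readings" is not given in this record, so the value of 3 below is an assumption, as are the function and parameter names.

```python
from statistics import median

MIN_FRESH = 3  # assumed threshold for "enough fresh readings"


def consensus_charge(fresh: list[float], seeds: list[float]) -> float:
    """Median of the live readings; seeds drop out once fresh ones suffice."""
    live = fresh if len(fresh) >= MIN_FRESH else fresh + seeds
    if not live:
        raise ValueError("no readings to combine")
    return median(live)


# Hypothetical song: three settled readings, one outlier, one stale seed.
combined = consensus_charge([30.0, 32.0, 28.0, -60.0], seeds=[5.0])
```

With these invented numbers the outlier at -60 barely moves the result, and the seed does not vote at all, because four fresh readings are more than the assumed minimum.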

The failure modes, closed

The same case file from part one, each item and how the rebuild closes it.

The hatch. Closed by the required pervasiveness call. A payload running through every chorus is pervasive by definition, and pervasive harm sets the tier. There is no footnote to hide in.
Confident self-deception. Closed by the recorded gut read plus the divergence check. The first instinct is now written down, and a long argument that walks away from it gets caught and reconciled instead of shipped.
The route error. Closed by the explicit route step. The route is chosen up front and decides which tests apply, so an encouragement song is never graded as a transformation song.
The silent green default. Closed. Unreadable output becomes a flag for a human, not a shrug rendered as Decent.
The clamp. Gone. The number is composed from parts, so there is no agent guess to clamp and no hidden disagreement to erase.
The wasted reading. Closed. A song submitted with no lyrics now short-circuits before any reading request is made, so the expensive round trip whose only possible answer was "cannot calibrate" never happens.
Outlier drag and stale seeds. Closed by the median and by excluding the early chart-reading snapshots once fresh readings exist.

What the era boundary means

A change this deep raises an honest question about everything read before it. The day v3 shipped, the compass drew a line. Every reading logged under the old calibrator, more than four hundred of them, was marked as belonging to the previous era, so that the new consensus math never mixes old-procedure readings with new-procedure ones.

What this did not do is rewrite the past in place. The published verdict on every already-calibrated song, more than a thousand of them, was left exactly as it stood. Nothing flipped tier overnight. What changed is that those songs are now open to be read again under the new procedure, and when they are, the fresh reading is what counts. The compass did not pretend the old readings never happened, and it did not pretend they were produced the new way. It retired them honestly and let the new procedure earn each verdict back.

What the charge means, after

Unchanged, as promised. Minus one hundred is all-out war, with the self or with others, and plus one hundred is all-out peace. The rebuild never touched what the number means. It only changed how the number gets made, and it moved the final say from the party most able to talk itself into the wrong answer to a procedure that cannot.