Cross-Abstractive Alignment in Fact-Checking

Case Study (2025)

A follow-up analysis to FActScore (Min et al., 2023). I find a recurring pattern in where automated fact-checkers go wrong, which I call cross-abstractive alignment, and relate it back to system-vs-human agreement in a fact-checking setting.

NLP, DeBERTa, PyTorch

Figure: 2x2 matrix of fact-checking error categories with example facts in each cell

Overview

LLMs are increasingly the first stop for people seeking information, and they are fluent enough to mix supported claims with confident hallucinations in the same paragraph. FActScore (Min et al., 2023) gave the field a way to measure this: break a generation into atomic facts and report the fraction supported by a knowledge source. For ChatGPT biographies, human evaluators judge only about 58% of atomic facts to be supported.

The goal here was to characterize where automated fact-checkers get it wrong, not just how often. Working through FActScore's released dev split (31 ChatGPT-generated biographies, 221 human-labeled atomic facts, BM25-retrieved Wikipedia passages) surfaces a recurring failure mode I call cross-abstractive alignment.

How to fact check

The task: given an atomic fact ("Marianne McAndrew is a singer") and up to five BM25-retrieved Wikipedia passages, label it Supported or Not Supported.

A lexical baseline asks how many of the fact's content words appear in the passage, TF-IDF-weighted so common words count for less than rare ones. It reaches 79.2% accuracy. The intuition is that if a passage shares the fact's salient vocabulary, the fact is probably supported. That's workable until the fact and the passage say the same thing in different words, or until a single token decides truth ("Under-23" vs. "U-20").
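As a rough illustration of the baseline described above, here is a minimal sketch in Python. The tokenizer, the stopword list, the smoothed IDF formula, and the `cutoff` value are all assumptions made for illustration; they are not the settings behind the 79.2% figure.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "was", "of", "in", "for", "and", "to", "at", "on", "by"}

def content_words(text):
    """Lowercase, split on alphanumerics, drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]

def idf_table(passages):
    """Return a function giving a smoothed IDF weight for any word."""
    n = len(passages)
    df = Counter()
    for p in passages:
        df.update(set(content_words(p)))
    # Unseen words get df = 0 and therefore the largest weight.
    return lambda w: math.log((1 + n) / (1 + df[w])) + 1.0

def weighted_overlap(fact, passage, idf):
    """Share of the fact's IDF mass that also appears in the passage."""
    fact_words = set(content_words(fact))
    passage_words = set(content_words(passage))
    total = sum(idf(w) for w in fact_words)
    if total == 0:
        return 0.0
    return sum(idf(w) for w in fact_words if w in passage_words) / total

def predict_supported(fact, passages, idf, cutoff=0.7):
    """Supported if any passage clears the (assumed) overlap cutoff."""
    best = max((weighted_overlap(fact, p, idf) for p in passages), default=0.0)
    return best >= cutoff

# Toy passages, invented for illustration only.
toy_passages = [
    "The player joined the Under-23 national team in 2010.",
    "The museum opened a new wing for modern art.",
]
toy_idf = idf_table(toy_passages)
print(weighted_overlap("The player joined the U-20 national team", toy_passages[0], toy_idf))
```

In the toy example most of the fact's vocabulary matches even though the one token that decides truth does not, which is the weakness of a purely lexical score.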

The entailment pipeline below swaps bag-of-words for a model trained to recognize when one sentence logically implies another. It reaches 85.5% accuracy.

Windowing. Each passage is split into single sentences and adjacent two-sentence pairs, normalized (NFKC, diacritic stripping), and lemmatized with spaCy. Each window becomes a candidate premise: a chunk small enough for the NLI model to handle.
Top-k pruning. Each window gets a cheap composite score against the fact, $0.6 \cdot \text{recall} + 0.3 \cdot \text{Jaccard} + 0.1 \cdot \text{bigram}$, and only the top 24 pass to NLI. A passage often has dozens of candidate windows; the pre-score filters out distractors before the slow model sees them.
DeBERTa-v3-MNLI entailment. Each surviving window runs as the premise against the fact as the hypothesis, in batches. The model outputs probabilities for entail, neutral, and contradict. Model: MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli.
Threshold + win-gap decision. Predict S only if $\text{entail} \geq 0.55$ AND $\text{entail} - \max(\text{neutral}, \text{contradict}) \geq 0.15$. The threshold enforces a minimum confidence; the win gap stops 0.55-entail / 0.45-neutral coin-flip calls from being dressed up as confidence.
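The windowing, pruning, and decision steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the pipeline's actual code: the regex sentence splitter stands in for spaCy, lemmatization is skipped, and the exact definitions of recall, Jaccard, and bigram overlap are my guesses at reasonable ones. The NLI call itself is left out; `decide` takes one window's three probabilities as plain floats.

```python
import re
import unicodedata

def normalize(text):
    """NFKC-normalize, strip diacritics, lowercase."""
    text = unicodedata.normalize("NFKC", text)
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()

def tokens(text):
    return re.findall(r"[a-z0-9]+", normalize(text))

def windows(passage):
    """Single sentences plus adjacent two-sentence pairs (naive splitter)."""
    sents = [s.strip() for s in re.split(r"(?<=[.!?])\s+", passage) if s.strip()]
    pairs = [f"{a} {b}" for a, b in zip(sents, sents[1:])]
    return sents + pairs

def bigrams(toks):
    return set(zip(toks, toks[1:]))

def prescore(fact, window):
    """Composite lexical score; component definitions are assumptions."""
    f, w = tokens(fact), tokens(window)
    fs, ws = set(f), set(w)
    if not fs or not ws:
        return 0.0
    recall = len(fs & ws) / len(fs)        # fact tokens found in the window
    jaccard = len(fs & ws) / len(fs | ws)  # penalizes long, noisy windows
    fb = bigrams(f)
    bigram = len(fb & bigrams(w)) / len(fb) if fb else 0.0
    return 0.6 * recall + 0.3 * jaccard + 0.1 * bigram

def top_k(fact, passage, k=24):
    """Keep the k highest-scoring windows for the NLI model."""
    return sorted(windows(passage), key=lambda w: prescore(fact, w), reverse=True)[:k]

def decide(entail, neutral, contradict, threshold=0.55, gap=0.15):
    """Supported only if entailment clears the threshold and the win gap."""
    return entail >= threshold and entail - max(neutral, contradict) >= gap
```

How per-window decisions are combined into one label per fact is not specified above, so the sketch stops at the single-window rule.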

Results

The error asymmetry is more interesting than the headline. Every false negative failed the absolute entailment threshold; none cleared the threshold and then failed the win gap. The model wasn't borderline-uncertain on its misses: it was confidently wrong, often assigning entailment under 0.05 when the gold label was S. False positives split roughly in half between high-confidence errors on detail mismatches (entailment ≥ 0.95) and threshold-skirting calls just above 0.55.
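One way to see why the win gap so rarely decides a case, assuming the three class probabilities come from a softmax and therefore sum to one (an assumption about the model head; $p_e$, $p_n$, $p_c$ are my shorthand for entail, neutral, contradict):

$$p_n + p_c = 1 - p_e \;\Rightarrow\; \max(p_n, p_c) \leq 1 - p_e \;\Rightarrow\; p_e - \max(p_n, p_c) \geq 2p_e - 1$$

$$2p_e - 1 \geq 0.15 \iff p_e \geq 0.575$$

Under that assumption the win gap can only veto a prediction when $0.55 \leq p_e < 0.575$; everywhere else the threshold alone decides. A band that narrow fits the observation that no false negative was lost to the win gap.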

Confusion matrix: 92 true positives, 97 true negatives, 12 false positives, 20 false negatives. 85.5% accuracy, balanced F1 (0.852 Supported / 0.858 Not Supported).

85.5% per-fact accuracy
+6.3 pp over TF-IDF baseline
3.6 pp FActScore gap vs. human

FActScore framing

FActScore measures system-vs-human agreement on percent-supported across a generation. On this dev subset, the classifier puts ChatGPT at 47.1% supported vs. human-derived 50.7% - a 3.6 percentage-point gap. The paper reports under 2 percentage points on the full test set using LLM-as-judge instead of NLI. The goal here wasn't to beat that number; it was to do the error analysis the single number hides.
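The headline numbers can be re-derived from the four confusion-matrix counts alone. The sketch below pools all 221 facts, which is an assumption about how the percentages were aggregated (the original metric is defined per generation and then averaged); the pooled values match the figures reported here.

```python
# Confusion-matrix counts reported for the entailment pipeline.
tp, tn, fp, fn = 92, 97, 12, 20
total = tp + tn + fp + fn  # 221 atomic facts

accuracy = (tp + tn) / total
f1_supported = 2 * tp / (2 * tp + fp + fn)
f1_not_supported = 2 * tn / (2 * tn + fp + fn)

# Percent-supported according to the classifier vs. the human labels.
predicted_supported = (tp + fp) / total
gold_supported = (tp + fn) / total
gap_pp = 100 * (gold_supported - predicted_supported)

print(f"accuracy            {accuracy:.1%}")             # 85.5%
print(f"F1 (Supported)      {f1_supported:.3f}")         # 0.852
print(f"F1 (Not Supported)  {f1_not_supported:.3f}")     # 0.858
print(f"classifier S rate   {predicted_supported:.1%}")  # 47.1%
print(f"human S rate        {gold_supported:.1%}")       # 50.7%
print(f"gap                 {gap_pp:.1f} pp")            # 3.6 pp
```

Because false negatives outnumber false positives (20 vs. 12), the classifier undercounts supported facts, which is why its percent-supported sits below the human figure.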

Takeaways

Cross-abstractive alignment. Sorting the 32 errors by hand surfaces a single 2x2 (pictured above): does the fact need to be compressed to a few salient tokens, or to a summarized concept? The same question applies to the support. The off-diagonals, where fact and passage live at different abstraction levels, are where things break. Example: "Gerhard Fischer is best known for inventing a metal detector". The support establishes it across distant sentences, but no single window covers the claim. Entailment = 0.023; false negative. The pattern: an accurate fact-checker has to compress both sides to comparable levels of detail before comparing them. This reframes "build a better NLI model" as "align granularities first".

NLI is the bottleneck, not retrieval. Every false negative happens after NLI sees a relevant window - the model just scores it low, usually under 0.05. The cross-abstractive pattern explains why: NLI handles same-granularity comparisons cleanly and falls off cliffs at cross-granularity ones.
Win-gap beats threshold alone. Requiring entailment to beat $\max(\text{neutral}, \text{contradict})$ by 0.15 removes the kind of false positive where the model is 0.55 entail / 0.45 neutral. The decision rule needs more than one number.
Some of the errors aren't errors. About a third of the 32 misclassifications look like data quality issues: facts mislabeled by annotators ("Cleveland is in Ohio" labeled NS) or passages that don't mention the relevant entity. Against the gold labels as released, the accuracy ceiling is probably around 90%.

Parting thoughts

Fact-checking matters because LLMs are now embedded in workflows where users treat their output as authoritative. The same property that makes them useful - synthesizing across context - also makes them prone to plausible-sounding errors that are hard to spot without verification. Automated fact-checking is one piece of the broader effort to keep these models accountable to ground truth; the cross-abstractive failure mode points at where current pipelines need work. Three things I'd try next:

Sentence-embedding pre-pruning. Run sBERT or E5 alongside the lexical composite score. Catches the FGM-style false negatives that get killed by zero lexical overlap before NLI sees them.
Per-fact threshold calibration. Detail mismatches and summarization-style facts almost certainly want different operating points. The fact that every false negative is a threshold miss suggests the threshold is wrong for some facts, more than the model is.
LLM-as-judge in place of NLI. This is the paper's actual automated pipeline. It would likely narrow the FActScore gap and handle the multi-sentence-synthesis cases NLI breaks on. The real reproduction of FActScore lives here.
