Syntactic Negation Probing

Case Study (2025)

A dependency-parser-driven analysis of ELECTRA-small's reliance on negation cue location in SNLI.

NLP · SNLI · PyTorch

Overview

If 'The room is never empty without music', then is 'The room is empty with music' true?

State-of-the-art NLI models routinely lean on shallow lexical heuristics - antonyms, bag-of-words overlap, and the presence of a single negation word - to clear benchmarks like SNLI. Negation is therefore a common pitfall for these models: I found that a fine-tuned ELECTRA-small drastically over-predicts contradiction when negation appears in the premise, among other notable gaps.

This project asks the narrow question: where in a sentence does negation appear, and does that location change how the model behaves?

I fine-tuned a baseline ELECTRA-small discriminator on SNLI, then built a dependency-parse-driven analyzer (NPAS) that locates each negation cue inside the sentence's syntactic structure and assigns it to one of seven regions: root, subject, object, locative, attribute, modal, or quantifier. Slicing predictions by region - alongside lexical overlap and label distribution - surfaced a premise-side contradiction shortcut that a two-stage intervention (contrast-set augmentation plus slice-aware loss reweighting) could mitigate, but not remove.

Headline findings

Three results drove the rest of the project:

Negation has a location, and the location matters. Slicing by dependency region - not just by whether a negation word appears - exposes failures invisible to coarser analysis. Premise-root and premise-subject negation accuracy stayed in the 58-59% range while overall accuracy held at 86%.
Premise negation is vanishingly rare in SNLI. Only ~0.6% of validation examples contain a negation cue in the premise - captions from Flickr30k almost never assert what is not happening.
The model overgeneralizes 'negation -> contradiction', but only on the premise side. Hypothesis-negation reflects a real dataset prior (NRI ≈ 0.025); premise-negation injects an unwarranted bias (NRI ≈ 0.099) where gold labels don't actually skew toward contradiction. Augmentation reduced this on rare slices but pushed the bias up on PREM_ONLY (NRI ≈ 0.147) - the shortcut is structural, not just under-sampled.

The experiment loop

Vertical flow diagram showing SNLI feeding ELECTRA-small fine-tuning, slice analysis along six axes, diagnosis surfacing premise-negation failure, two intervention boxes (contrast augmentation and slice-aware reweighting) feeding back into a retraining step

The full project as a loop: fine-tune, slice, diagnose, intervene, retrain. Slice analysis runs along six axes - the negation pattern, lexical overlap, NPAS regions, label distributions, NRI, and per-slice confusion matrices.

Locating negation with dependency parses

The diagnostic engine for the whole project is a small pipeline I call NPAS (Negation-Perturbation Analysis with Slices). Most negation analyses in NLI ask a binary question: does this sentence contain a negation word? NPAS asks a structural one instead - where does the negation attach, and what is it modifying?

For each premise and hypothesis, NPAS runs a spaCy dependency parse, then detects cues using three parallel strategies. Token rules catch single-word negators ('not', 'n't', 'never', 'no'), negative pronouns ('nobody', 'nothing'), negative verbs ('deny', 'refuse', 'lack', 'fail'), the preposition 'without', and the particle verb 'rule out'. Phrase patterns match multi-token cues like 'never ever' and 'in no way' via spaCy's Matcher. Sentence-level rules handle 'neither…nor' coordination.
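A minimal sketch of the three detection strategies, assuming a plain regex tokenizer in place of spaCy so that it runs without a parser or model download. The cue lists contain only the examples named above, the function names are hypothetical, and the particle verb 'rule out' is left out because recognising it reliably needs the parse. This is an illustration of the idea, not the NPAS implementation.

```python
import re

# Cue inventories copied from the examples in the text; the real lists are
# presumably longer. Verbs are matched in base form only in this sketch.
TOKEN_CUES = {
    "not", "n't", "never", "no",        # single-word negators
    "nobody", "nothing",                # negative pronouns
    "deny", "refuse", "lack", "fail",   # negative verbs
    "without",                          # negative preposition
}
PHRASE_CUES = [("never", "ever"), ("in", "no", "way")]


def tokenize(sentence):
    """Lowercase, split on non-letters, and separate the n't clitic."""
    text = re.sub(r"n't\b", " n't", sentence.lower())
    return re.findall(r"n't|[a-z]+", text)


def detect_cues(sentence):
    """Return sorted (token_index, cue_text, strategy) triples."""
    tokens = tokenize(sentence)
    cues, covered = [], set()

    # Strategy 2 runs first so a phrase is not also counted word by word.
    for pattern in PHRASE_CUES:
        n = len(pattern)
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) == pattern:
                cues.append((i, " ".join(pattern), "phrase"))
                covered.update(range(i, i + n))

    # Strategy 1: single-token rules.
    for i, tok in enumerate(tokens):
        if i not in covered and tok in TOKEN_CUES:
            cues.append((i, tok, "token"))

    # Strategy 3: sentence-level rule for neither ... nor coordination.
    if "neither" in tokens:
        start = tokens.index("neither")
        if "nor" in tokens[start:]:
            cues.append((start, "neither...nor", "sentence"))

    return sorted(cues)


print(detect_cues("The room is never empty without music."))
```

On the running example this reports two token-rule cues, 'never' at index 3 and 'without' at index 5. A string-level detector like this cannot say what each cue modifies, which is the part the dependency parse supplies.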

Each cue's anchor - the token it grammatically modifies - is then classified into one of seven regions by inspecting its dependency label, part-of-speech tag, and children: root for the main predicate, subject for negated noun-phrase subjects, object for direct/indirect objects and complements, locative for prepositional phrases, attribute for adjectival predicates, modal for auxiliary verbs, and quantifier for determiners and numerals. Each cue's scope is taken as the anchor's subtree, clipped to the sentence. This gives per-pair feature flags like prem_neg_subject and hyp_neg_root that the rest of the analysis slices on.
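The anchor, region, and scope steps can be sketched on a hand-written parse. Everything here is an assumption made for illustration: the `Tok` structure, the label-to-region table (written with spaCy-style dependency labels), the rule that a preposition anchors on its object, and the parse of the example sentence itself, which was typed in by hand rather than produced by a parser.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    text: str
    dep: str   # dependency label, spaCy-style
    pos: str   # coarse part-of-speech tag
    head: int  # index of the syntactic head (own index for the root)


# Hypothetical label-to-region table; the actual NPAS rules also inspect
# children and are not reproduced here.
DEP_TO_REGION = {
    "nsubj": "subject", "nsubjpass": "subject",
    "dobj": "object", "iobj": "object", "ccomp": "object", "xcomp": "object",
    "pobj": "locative",
    "acomp": "attribute", "attr": "attribute",
    "aux": "modal",
    "det": "quantifier", "nummod": "quantifier",
}


def anchor_of(tokens, cue):
    """A preposition anchors on its object; other cues anchor on their head."""
    if tokens[cue].dep == "prep":
        for j, tok in enumerate(tokens):
            if tok.head == cue and tok.dep == "pobj":
                return j
    return tokens[cue].head


def region_of(tokens, i):
    """Climb toward the root until a token with a mapped label is found."""
    while True:
        tok = tokens[i]
        if tok.dep == "ROOT" or tok.head == i:
            return "root"
        if tok.pos == "NUM":
            return "quantifier"
        if tok.dep in DEP_TO_REGION:
            return DEP_TO_REGION[tok.dep]
        i = tok.head


def scope_of(tokens, anchor):
    """Indices of the anchor's subtree, in sentence order."""
    members, changed = {anchor}, True
    while changed:
        changed = False
        for j, tok in enumerate(tokens):
            if tok.head in members and j not in members:
                members.add(j)
                changed = True
    return sorted(members)


def neg_flags(tokens, cue_indices, side):
    """Per-pair feature flags such as prem_neg_root."""
    regions = {region_of(tokens, anchor_of(tokens, c)) for c in cue_indices}
    return sorted(f"{side}_neg_{r}" for r in regions)


# Hand-annotated parse of the running example (an assumption, not parser output).
SENT = [
    Tok("The", "det", "DET", 1), Tok("room", "nsubj", "NOUN", 2),
    Tok("is", "ROOT", "AUX", 2), Tok("never", "neg", "ADV", 2),
    Tok("empty", "acomp", "ADJ", 2), Tok("without", "prep", "ADP", 2),
    Tok("music", "pobj", "NOUN", 5),
]

print(neg_flags(SENT, [3, 5], "prem"))
```

With this parse, 'never' anchors on 'is' and 'without' anchors on 'music', giving the flags prem_neg_root and prem_neg_locative. A different parse of the same sentence would change the flags, which is why parser quality matters so much for this kind of analysis.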

NPAS applied to a single sentence

Diagram showing the sentence 'The room is never empty without music' tokenized with dependency labels and NPAS region assignments for every token. Negation cues 'never' and 'without' are highlighted in coral with arcs connecting them to their syntactic anchors 'is' and 'music'. The region labels under 'is' and 'music' are coral; the labels under other tokens are faded gray. Below, a panel shows the three cue-detection strategies, with token-rule matching highlighted as the one that fired. The output is the feature flags prem_neg_root and prem_neg_locative.

Walking the pipeline on 'The room is never empty without music.' Two cues fire (never and without), each anchored on a different token via different dependency relations, lighting up two of the seven syntactic regions.

Baseline accuracy by negation pattern

Table showing accuracy across negation patterns and lexical overlap buckets

Out of 10,000 SNLI validation examples, only 208 contain explicit negation in the premise. Accuracy on premise-negation examples drops to 75.4% - well below the 85.4% overall baseline.
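Slicing by negation pattern amounts to bucketing each pair by which side carries a cue and computing accuracy per bucket. The sketch below assumes that shape; PREM_ONLY and BOTH are the slice names used in this write-up, while HYP_ONLY, NONE, the function names, and the toy rows are hypothetical.

```python
from collections import defaultdict


def negation_pattern(prem_has_neg, hyp_has_neg):
    """Bucket a premise/hypothesis pair by which side contains a negation cue."""
    if prem_has_neg and hyp_has_neg:
        return "BOTH"
    if prem_has_neg:
        return "PREM_ONLY"
    if hyp_has_neg:
        return "HYP_ONLY"
    return "NONE"


def accuracy_by_pattern(rows):
    """rows: iterable of (prem_has_neg, hyp_has_neg, gold, pred).

    Returns {pattern: (accuracy, n_examples)}.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for prem_neg, hyp_neg, gold, pred in rows:
        key = negation_pattern(prem_neg, hyp_neg)
        totals[key] += 1
        hits[key] += int(gold == pred)
    return {key: (hits[key] / totals[key], totals[key]) for key in totals}


# Toy rows for illustration only; these are not SNLI results.
rows = [
    (False, False, "entailment", "entailment"),
    (False, False, "neutral", "neutral"),
    (False, True, "contradiction", "contradiction"),
    (True, False, "neutral", "contradiction"),
    (True, False, "entailment", "entailment"),
    (True, True, "contradiction", "contradiction"),
]

for pattern, (acc, n) in sorted(accuracy_by_pattern(rows).items()):
    print(f"{pattern:10s} n={n} acc={acc:.2f}")
```

Reporting the slice size next to each accuracy is worth the extra column: with only a few hundred premise-negation examples, a handful of errors moves the number by whole percentage points.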

Negation Reliance Index

Counting accuracy by slice tells you where the model fails, not why. To quantify how much the model overuses negation as a cue, I defined a Negation Reliance Index (NRI):

$\mathrm{NRI}(\ell) = \Delta_{\mathrm{pred}}(\ell) - \Delta_{\mathrm{gold}}(\ell)$

where $\Delta(\ell)$ is the change in the probability of label $\ell$ when negation is present vs. absent, computed from the model's predictions for $\Delta_{\mathrm{pred}}$ and from the gold labels for $\Delta_{\mathrm{gold}}$. NRI > 0 means the model leans on negation as evidence for label $\ell$ more strongly than the dataset itself does. This is the key separator: real dataset bias (which the model is supposed to learn) vs. spurious model bias (which it shouldn't).
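As a worked illustration, NRI can be computed from three parallel lists: predicted labels, gold labels, and a per-example negation flag. The sketch assumes that "label probability" means the empirical frequency of the label within a slice (it could equally be the mean softmax probability); the function names and toy data are hypothetical.

```python
def label_rate(labels, mask, target):
    """Fraction of the masked examples whose label equals target.

    Raises ZeroDivisionError if the mask selects nothing, so very rare
    slices need a guard in real use.
    """
    selected = [lab for lab, keep in zip(labels, mask) if keep]
    return sum(lab == target for lab in selected) / len(selected)


def delta(labels, has_neg, target):
    """Change in the rate of target when negation is present vs. absent."""
    absent = [not flag for flag in has_neg]
    return label_rate(labels, has_neg, target) - label_rate(labels, absent, target)


def nri(preds, golds, has_neg, target):
    """Negation Reliance Index: predicted shift minus gold shift."""
    return delta(preds, has_neg, target) - delta(golds, has_neg, target)


# Toy data: C = contradiction, N = neutral, E = entailment.
has_neg = [True, True, True, True, False, False, False, False]
golds = ["C", "C", "N", "E", "C", "N", "E", "E"]
preds = ["C", "C", "C", "E", "C", "N", "E", "E"]

print(delta(golds, has_neg, "C"))       # gold shift:      2/4 - 1/4 = 0.25
print(delta(preds, has_neg, "C"))       # predicted shift: 3/4 - 1/4 = 0.5
print(nri(preds, golds, has_neg, "C"))  # 0.5 - 0.25 = 0.25
```

In the toy data the gold labels already skew toward contradiction under negation (a shift of 0.25), and the model's predictions skew twice as much, so NRI isolates the extra 0.25 that the data does not justify.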

NRI for CONTRADICTION: hypothesis vs. premise

NRI table comparing hypothesis-negation and premise-negation reliance for the contradiction label

Hypothesis-negation NRI is small (0.025) - the model mirrors a real dataset coupling. Premise-negation NRI is roughly four times larger (0.099) and points the wrong way: gold labels for premise-negation are roughly balanced, yet the model predicts contradiction 40% of the time.

Intervention: contrast sets + slice-aware reweighting

The diagnosis pointed at two targets - under-coverage of premise-negation in the training distribution, and an over-strong shortcut that the model had already internalized. The intervention attacked both, deliberately keeping the architecture frozen so any improvement could be attributed to data-side changes alone.

First, I hand-built ~150 contrastive examples seeded from SNLI premises with explicit negation, generating hypotheses that (i) preserve the core event, (ii) cover under-represented region patterns identified by NPAS, and (iii) deliberately produce NEUTRAL or ENTAILMENT labels - directly attacking the 'premise negation = contradiction' shortcut. Examples were split 70/30 into train and validation after a deterministic shuffle.
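A deterministic 70/30 split can be sketched as a seeded shuffle followed by a cut. The seed value and function name are assumptions; the point is only that a fixed seed makes the split reproducible across runs.

```python
import random


def split_contrast_set(examples, train_frac=0.7, seed=13):
    """Shuffle with a fixed seed, then cut into train and validation lists."""
    shuffled = list(examples)              # copy so the input order is untouched
    random.Random(seed).shuffle(shuffled)  # private RNG, no global state
    cut = round(train_frac * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# 150 placeholder IDs standing in for the hand-built examples.
train, val = split_contrast_set(range(150))
print(len(train), len(val))
```

Using a private `random.Random` instance rather than `random.seed` keeps the split independent of any other randomness in the training script.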

Second, after a dry training run to recompute slice error rates, I applied a margin-dependent reweighting in the cross-entropy loss: examples in slices where the baseline performed worse than overall accuracy (premise root, premise subject, premise quantifier) were upweighted; slices the baseline already handled well were left alone. Optimizer, batch size, and epoch count stayed identical to the baseline.
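One way to read "margin-dependent reweighting" is that a slice's weight grows with the gap between overall accuracy and slice accuracy, and stays at 1 when the slice is at or above the overall figure. The sketch below assumes exactly that linear form; the `strength` constant, the rule for examples that fall in several slices, and the accuracy values are all hypothetical placeholders.

```python
def slice_weights(slice_acc, overall_acc, strength=2.0):
    """Weight = 1 + strength * (overall - slice accuracy), floored at 1."""
    return {
        name: 1.0 + strength * max(0.0, overall_acc - acc)
        for name, acc in slice_acc.items()
    }


def example_weight(flags, weights):
    """An example in several slices takes the largest applicable weight."""
    return max((weights[f] for f in flags if f in weights), default=1.0)


def weighted_mean_loss(losses, example_weights):
    """Weighted average of per-example cross-entropy losses."""
    total = sum(w * loss for w, loss in zip(example_weights, losses))
    return total / sum(example_weights)


# Placeholder accuracies for illustration, not measured results.
weights = slice_weights(
    {"prem_neg_root": 0.60, "prem_neg_subject": 0.70, "hyp_neg_root": 0.90},
    overall_acc=0.85,
)

batch_flags = [["prem_neg_root"], ["hyp_neg_root"], []]
batch_losses = [1.2, 0.3, 0.1]
batch_weights = [example_weight(flags, weights) for flags in batch_flags]
print(batch_weights)
print(weighted_mean_loss(batch_losses, batch_weights))
```

In a PyTorch training loop the per-example losses would typically come from `torch.nn.functional.cross_entropy(logits, labels, reduction="none")`, with the weights applied before averaging. Normalising by the sum of weights keeps the loss scale comparable to the unweighted baseline, so the learning rate does not need retuning.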

Results: baseline vs. augmented

Comparison of baseline and augmented model accuracy across negation patterns

Augmentation improved overall accuracy (85.4% -> 86.2%) and dramatically improved the BOTH slice (50% -> 85%), but PREM_ONLY accuracy fell from 75.7% to 66.7% on a larger, harder evaluation slice. The premise-side shortcut survived.

Takeaways

Dependency-parse slicing surfaces failures that bag-of-words analysis hides. 'Does the sentence contain a negation word?' is too coarse - the same word in subject vs. root vs. locative position produces measurably different model behavior. The seven-region taxonomy turns out to be the right granularity for this dataset.
Slice-aware augmentation can stabilize aggregate accuracy while exposing rare patterns - the augmented model didn't regress on the easy majority slice, which isn't guaranteed.
Lexical overlap is not where the model fails most. The baseline handled high-overlap examples (92%) better than low-overlap (85%). The interesting failure mode is structural, not surface.
'Hypothesis-negation -> contradiction' is mostly a real dataset prior. The model isn't inventing it; it's learning what SNLI teaches. NRI is the tool that separates real coupling from spurious shortcut.
Premise-negation is the spurious one - and dataset-side fixes don't fully resolve it. Even after augmentation and reweighting, premise-root and premise-subject negation accuracy stayed in the 58-59% range. Architectural interventions (residual debiasing heads, ensemble-based artifact experts) are likely needed for real movement here.
Building NPAS was 60% of the value. Once you can ask 'where does negation live?' as a structured query rather than a regex, every other diagnostic in the project comes nearly free.

Limitations and what I'd do next

Three honest caveats. First, every result is tied to SNLI; the contrast sets are small and adversarial by construction, not a sample of natural premise-negation. Second, NPAS region assignment relies on spaCy's parser plus hand-written rules - parser errors and dependency-vs-semantic-scope mismatches are the dominant noise source, and complex sentences are likely mis-categorized at non-trivial rates. Third, the project deliberately froze the model architecture to isolate dataset-side effects; prior work (residual correction heads, artifact-expert ensembles, debiased fine-tuning) suggests the ceiling on premise-negation slices is much higher with model-side interventions.

If I picked this up again I'd swap the rule-based scope detection for a learned negation-scope model, scale the contrast set with templated generation rather than hand-editing, and pair the reweighting scheme with a small residual debiasing head - partly to test whether NRI on premise-negation can actually be driven to zero, and partly because the diagnostic toolkit deserves a stronger model to evaluate.
