The Hidden Cost of False Positives in Payment Processing

Most fraud team OKRs are written around recall. Detect 95% of fraud. Catch the bad transaction before it clears. Block the card testing campaign. These are valid objectives — but optimizing exclusively for recall drives fraud teams toward a false positive rate that quietly erodes revenue, degrades customer experience, and generates significant secondary cost that rarely shows up in the same budget line as fraud losses.

The hidden cost framing isn't new, but the accounting is rarely done correctly. When a legitimate transaction gets declined, most processors record it as a declined authorization and move on. The downstream costs — customer churn, support ticket volume, lost lifetime value — don't appear in the fraud P&L. This post attempts to put concrete numbers on each cost category, using publicly available data and plausible modeling for a mid-market payment processor.

The precision-recall tradeoff is not symmetric

In a textbook classification problem, moving the decision threshold trades precision against recall, and both kinds of error are often treated as equally costly. Lower the fraud score cutoff and recall improves (you catch more fraud) but precision drops (more false positives block legitimate transactions). The asymmetry in payments is that the cost of a false positive and the cost of a false negative are not equal, and neither cost is fixed.

A false negative (a fraudulent transaction that clears) costs the transaction amount, plus chargeback fees, plus card network fines if the merchant's chargeback ratio rises above network thresholds. For card-not-present fraud, average ticket size in the US runs around $150–$180 based on Federal Reserve Payments Study data. Chargeback fees typically run $15–$25 per incident. So the expected loss per undetected fraud event is roughly $165–$205, before any network fines.

A false positive — a legitimate transaction blocked — has a more complex cost structure. The immediate lost revenue is only the starting point.

Decomposing the false positive cost

Published research from Javelin Strategy & Research and the CFPB's consumer complaint data consistently shows that falsely declined cardholders don't simply retry the same card — a significant percentage abandon the purchase or, worse, switch to a competitor. The breakdown of false positive cost components looks like this:

Cost Component	Estimated Per-Incident Cost	Source Basis
Lost transaction revenue (immediate)	$85–$180 (average ticket)	Federal Reserve Payments Study 2022
Customer support cost (declined users who call)	~$8 per resolved ticket	Industry average for card-issuer support cost
Customer churn uplift (declined users who don't return)	$40–$120 in lifetime value impact at ~15% abandonment rate	Modeled; consistent with Javelin 2023 card fraud survey data
Reputational/NPS damage	Difficult to quantify; excluded from conservative model	—

Aggregating the quantifiable components produces a mid-range estimate of approximately $32 per false positive incident. Those components are immediate lost revenue at median ticket, support cost, and churn-adjusted LTV impact. Each is weighted by the share of falsely declined customers it actually affects rather than counted in full. The figure aligns with similar modeling published by Worldpay and Verisk in industry reports. It is a defensible conservative estimate for a processor with average ticket sizes in the $100–$150 range.
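The $32 figure depends on how each component is weighted. The sketch below shows one way the table's components could combine. It assumes lost revenue and churn apply only to the share of declined customers who abandon (the table's ~15%), and support cost applies only to those who call. The 10% support-contact rate and the use of range midpoints are hypothetical inputs, not figures from the source data.

```python
def expected_false_positive_cost(
    avg_ticket: float,
    abandonment_rate: float,
    ltv_impact: float,
    support_contact_rate: float,
    support_cost_per_ticket: float,
) -> float:
    """Probability-weighted cost of one false decline (illustrative model only)."""
    # Revenue is lost only when the customer abandons instead of retrying
    lost_revenue = abandonment_rate * avg_ticket
    # Assumption: the abandoning customers are the ones whose lifetime value is at risk
    churn = abandonment_rate * ltv_impact
    # Support cost applies only to the share of declined customers who contact support
    support = support_contact_rate * support_cost_per_ticket
    return lost_revenue + churn + support


cost = expected_false_positive_cost(
    avg_ticket=(85 + 180) / 2,    # midpoint of the table's lost-revenue range
    abandonment_rate=0.15,        # the table's ~15% abandonment rate
    ltv_impact=(40 + 120) / 2,    # midpoint of the table's LTV-impact range
    support_contact_rate=0.10,    # hypothetical: share of declined users who call
    support_cost_per_ticket=8.0,  # the table's ~$8 per resolved ticket
)
print(f"${cost:.2f}")
```

Under these assumptions the total lands at roughly $33, close to the $32 mid-range estimate. Treat that as a check on the structure of the calculation, not as the exact model behind the published figure. Parameterizing it lets a processor substitute its own ticket size, abandonment, and churn data.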

We're not saying the $32 figure applies uniformly to every processor — average ticket size varies enormously across merchant verticals (fuel at $45, electronics at $320), and churn rates depend on how many payment alternatives the customer has available. But the structure of the calculation holds: false positives are substantially more expensive than most fraud P&L accounting reflects.

What the false positive rate looks like at production scale

For a processor handling 100,000 transactions per day, a false positive rate (FPR) of 1% generates 1,000 falsely declined legitimate transactions daily. At $32 per incident, that's $32,000 in daily cost — roughly $11.7M annualized — that doesn't appear anywhere on the fraud loss ledger. Meanwhile, a fraud rate of 0.15% on the same volume generates 150 fraudulent transactions per day; at $175 average loss, that's $26,250 in daily fraud cost. In this scenario, the false positive problem is materially larger than the fraud problem — and it's invisible to the team optimizing purely for detection rate.
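The scale arithmetic above can be parameterized so a processor can plug in its own volume, rates, and per-incident costs. This is a minimal sketch, and the function name is hypothetical. One small refinement: it applies the FPR to legitimate transactions only, since FPR is defined over actual negatives. That gives 998.5 false declines per day instead of 1,000, a negligible difference at a 0.15% fraud rate.

```python
def daily_costs(volume, fpr, fp_cost, fraud_rate, fraud_loss):
    """Return (false-positive cost, gross fraud exposure) per day."""
    fraud_txns = volume * fraud_rate
    legit_txns = volume - fraud_txns
    # FPR is the share of legitimate transactions that get declined
    fp_cost_daily = legit_txns * fpr * fp_cost
    # Gross exposure: what the fraud would cost if none of it were caught
    fraud_cost_daily = fraud_txns * fraud_loss
    return fp_cost_daily, fraud_cost_daily


fp_daily, fraud_daily = daily_costs(
    volume=100_000, fpr=0.01, fp_cost=32.0, fraud_rate=0.0015, fraud_loss=175.0
)
print(f"False positives: ${fp_daily:,.0f}/day, ${fp_daily * 365 / 1e6:.1f}M/year")
print(f"Fraud exposure:  ${fraud_daily:,.0f}/day, ${fraud_daily * 365 / 1e6:.1f}M/year")
```

With the inputs from the scenario above, this gives about $31,952 per day (about $11.7M a year) in false-positive cost. Gross fraud exposure is $26,250 per day (about $9.6M a year).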

This is the argument for framing fraud scoring around dollar-weighted recall rather than raw recall. Dollar-weighted recall asks: what share of total fraud dollars did the model detect? A model that catches 90% of fraud dollars while maintaining a 0.3% FPR is more valuable than a model that catches 95% of fraud events while driving a 1.5% FPR, even though the second model has the higher raw detection rate. The math favors precision.
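Event recall and dollar-weighted recall diverge whenever fraud amounts are uneven. The model comparison also depends on both recall and FPR. The sketch computes both recall metrics on invented toy data. It then prices the two hypothetical models using the 100,000-transactions-per-day scenario: $32 per false positive, $175 per missed fraud, and a 0.15% fraud rate. One assumption is labeled in the code: the text gives only event recall for the second model, so the sketch treats its dollar-weighted recall as 95% too.

```python
def event_recall(is_fraud, blocked):
    """Share of fraudulent transactions blocked (raw recall)."""
    caught = sum(1 for f, b in zip(is_fraud, blocked) if f and b)
    return caught / sum(is_fraud)


def dollar_weighted_recall(amounts, is_fraud, blocked):
    """Share of fraud dollars blocked."""
    total = sum(a for a, f in zip(amounts, is_fraud) if f)
    caught = sum(a for a, f, b in zip(amounts, is_fraud, blocked) if f and b)
    return caught / total


# Invented toy data: the model blocks three small frauds and misses the large one
amounts = [20.0, 30.0, 40.0, 910.0]
is_fraud = [True, True, True, True]
blocked = [True, True, True, False]
print(event_recall(is_fraud, blocked))                     # 0.75
print(dollar_weighted_recall(amounts, is_fraud, blocked))  # 0.09


def daily_expected_cost(dollar_recall, fpr, volume=100_000, fraud_rate=0.0015,
                        fraud_loss=175.0, fp_cost=32.0):
    """Missed-fraud loss plus false-decline cost per day."""
    fraud_txns = volume * fraud_rate
    missed_fraud = fraud_txns * fraud_loss * (1 - dollar_recall)
    false_declines = (volume - fraud_txns) * fpr * fp_cost
    return missed_fraud + false_declines


model_a = daily_expected_cost(dollar_recall=0.90, fpr=0.003)
# Assumption: model B's dollar-weighted recall equals its 95% event recall
model_b = daily_expected_cost(dollar_recall=0.95, fpr=0.015)
```

Model A comes to about $12,211 per day and Model B to about $49,241 per day. Model B's extra recall avoids about $1,300 of fraud loss but adds roughly $38,000 in false declines.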

Where most scoring systems fail on precision

The precision problem in payment fraud scoring has several structural causes:

Rules-based systems have no probability output. A rule fires or doesn't. There is no confidence score to calibrate. The operator's only lever is threshold adjustment, which is coarse and doesn't account for feature interactions. A rule that catches 98% of card testing attempts at a 2% FPR across legitimate traffic as a whole might still have a 15% FPR on legitimate high-velocity users. Those users include international travelers, loyalty program participants, and corporate card users making recurring purchases. Rules encode the average case; they fail on variance.
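Spotting this requires breaking a rule's FPR out by customer segment rather than reporting one global number. This is a minimal sketch. It assumes each transaction record carries a segment label and a confirmed fraud outcome. The velocity rule, field names, and data are all hypothetical.

```python
from collections import Counter


def fpr_by_segment(transactions, rule):
    """Per-segment FPR: the share of each segment's legitimate transactions the rule blocks."""
    legit, blocked = Counter(), Counter()
    for tx in transactions:
        if tx["is_fraud"]:
            continue  # FPR is computed over legitimate transactions only
        legit[tx["segment"]] += 1
        if rule(tx):
            blocked[tx["segment"]] += 1
    return {seg: blocked[seg] / n for seg, n in legit.items()}


def velocity_rule(tx):
    """Hypothetical card-testing rule: block 5 or more attempts on a card within an hour."""
    return tx["txns_last_hour"] >= 5


# Invented toy data: most customers are low-velocity, travelers are not
transactions = [
    {"segment": "typical", "txns_last_hour": 1, "is_fraud": False},
    {"segment": "typical", "txns_last_hour": 2, "is_fraud": False},
    {"segment": "typical", "txns_last_hour": 1, "is_fraud": False},
    {"segment": "typical", "txns_last_hour": 3, "is_fraud": False},
    {"segment": "traveler", "txns_last_hour": 6, "is_fraud": False},
    {"segment": "traveler", "txns_last_hour": 2, "is_fraud": False},
    {"segment": "typical", "txns_last_hour": 12, "is_fraud": True},
]
print(fpr_by_segment(transactions, velocity_rule))  # {'typical': 0.0, 'traveler': 0.5}
```

On this toy data the rule blocks 1 of 6 legitimate transactions overall. That splits into 0% for typical customers and 50% for travelers, so the aggregate hides exactly the variance the rule fails on.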

Models trained on imbalanced data without proper calibration often produce miscalibrated probabilities, especially when class weights or resampling are used to counter the imbalance. On a dataset where 0.2% of transactions are fraudulent, a model that predicts "not fraud" for everything achieves 99.8% accuracy while catching no fraud at all. Its recall is zero and its precision is undefined, so accuracy says nothing useful. Without Platt scaling or isotonic regression calibration, the fraud probability output of an XGBoost or LightGBM model trained on imbalanced transaction data isn't a reliable estimate of actual fraud probability. The model may rank transactions correctly (high AUC-ROC) while still being systematically overconfident or underconfident at specific score ranges. As a result, naive threshold-setting produces a higher FPR than the model's discrimination ability would imply.
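Isotonic calibration fits a non-decreasing step function from raw model score to the fraud rate actually observed at that score, using held-out labeled data. The sketch implements the pool-adjacent-violators algorithm in plain Python to show the mechanics. In practice, scikit-learn's `IsotonicRegression` or `CalibratedClassifierCV(method="isotonic")` does the same job. The scores and labels are toy values.

```python
from bisect import bisect_left
from collections import defaultdict


def fit_isotonic(scores, labels):
    """Pool-adjacent-violators: fit a non-decreasing map from raw score to observed fraud rate."""
    # Pool tied scores first so equal scores always get the same calibrated value
    totals = defaultdict(lambda: [0.0, 0])
    for s, y in zip(scores, labels):
        totals[s][0] += y
        totals[s][1] += 1
    blocks = []  # each block: [fraud count, transaction count, highest score in block]
    for s in sorted(totals):
        frauds, n = totals[s]
        blocks.append([frauds, n, s])
        # Merge backwards while an earlier block has a higher fraud rate (a monotonicity violation)
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            frauds, n, hi = blocks.pop()
            blocks[-1][0] += frauds
            blocks[-1][1] += n
            blocks[-1][2] = hi
    upper_edges = [b[2] for b in blocks]
    rates = [b[0] / b[1] for b in blocks]
    return upper_edges, rates


def calibrate(score, upper_edges, rates):
    """Look up the calibrated fraud probability for a raw model score."""
    i = bisect_left(upper_edges, score)
    return rates[min(i, len(rates) - 1)]


scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
labels = [0, 0, 1, 0, 1, 1]
edges, rates = fit_isotonic(scores, labels)
print(rates)                          # [0.0, 0.0, 0.5, 1.0, 1.0]
print(calibrate(0.35, edges, rates))  # 0.5
```

The raw scores 0.3 and 0.4 are pooled because the higher score had the lower observed fraud rate. After calibration, a threshold on the output can be read directly as an estimated fraud probability.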

Static models don't adapt to merchant-specific base rates. A processor handling both a gaming platform (fraud rate: 0.8%) and a utility payment portal (fraud rate: 0.02%) needs different precision/recall operating points for each merchant profile. A single global model threshold applied uniformly overblocks on the low-fraud merchants and underblocks on high-fraud ones. Merchant-level calibration is standard at Stripe Radar and large acquirers; it's rarely implemented at mid-market processors because it requires per-merchant feature stores and calibration pipelines that are expensive to build without a dedicated platform.
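One simple form of merchant-level adjustment is a prior-shift correction. If a global model is calibrated to the pooled fraud rate, Bayes' rule can re-weight its probability for a merchant whose base rate differs. This sketch assumes the merchant's transactions differ from the training population only in base rate, a label-shift assumption that rarely holds exactly. The 0.2% pooled training rate is a hypothetical input. A production pipeline would instead calibrate on each merchant's own labeled outcomes.

```python
def adjust_for_base_rate(p, train_rate, merchant_rate):
    """Re-weight a calibrated fraud probability from the training base rate to a merchant's."""
    # Bayes' rule under label shift: scale fraud and legitimate weights by the prior ratios
    fraud_weight = p * (merchant_rate / train_rate)
    legit_weight = (1 - p) * ((1 - merchant_rate) / (1 - train_rate))
    return fraud_weight / (fraud_weight + legit_weight)


TRAIN_RATE = 0.002  # hypothetical pooled fraud rate the global model was calibrated on
score = 0.10        # the same global-model probability seen at two different merchants
gaming = adjust_for_base_rate(score, TRAIN_RATE, merchant_rate=0.008)
utility = adjust_for_base_rate(score, TRAIN_RATE, merchant_rate=0.0002)
print(f"gaming: {gaming:.3f}, utility: {utility:.3f}")
```

A 10% global score becomes roughly 31% at the gaming merchant and roughly 1.1% at the utility portal. The same raw score therefore warrants very different decisions at the two merchants, which a single global threshold cannot express.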

The practical case for precision-first scoring

The argument for investing in precision isn't that fraud detection doesn't matter — it's that the optimal fraud score threshold for any given merchant is the one that maximizes the expected value of the decision, where the expected value accounts for both false negative cost (fraud loss + chargeback) and false positive cost (lost revenue + churn). At current average ticket sizes and churn rates in US card-not-present, that optimum threshold typically sits at a false positive rate of 0.3%–0.8%, not 1.5%–3% where most rules-based systems operate.
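For a calibrated fraud probability p, the expected-value rule has a closed form. Approving costs p × (false negative cost) in expectation, and blocking costs (1 - p) × (false positive cost). Blocking therefore wins above p = C_FP / (C_FP + C_FN). This minimal sketch uses the $32 and $175 per-incident figures from earlier and treats both costs as fixed per incident. In practice both scale with ticket size, which is one reason thresholds differ by merchant.

```python
def block_threshold(cost_fp, cost_fn):
    """Fraud probability above which blocking has the lower expected cost."""
    # Approving costs p * cost_fn in expectation; blocking costs (1 - p) * cost_fp.
    # Blocking wins when p * cost_fn > (1 - p) * cost_fp, i.e. p > cost_fp / (cost_fp + cost_fn).
    return cost_fp / (cost_fp + cost_fn)


def decide(p_fraud, cost_fp, cost_fn):
    """Expected-value decision for one transaction; p_fraud must be a calibrated probability."""
    return "block" if p_fraud > block_threshold(cost_fp, cost_fn) else "approve"


threshold = block_threshold(cost_fp=32.0, cost_fn=175.0)
print(f"{threshold:.3f}")
print(decide(0.20, 32.0, 175.0), decide(0.10, 32.0, 175.0))
```

With these costs the cutoff is about 0.155. A transaction needs roughly a 15.5% calibrated fraud probability before blocking it is cheaper than approving it. The FPR that cutoff produces depends on how many legitimate transactions score above it, so it has to be measured per merchant.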

Getting there requires a probability-calibrated model with merchant-level threshold setting — not global rules. It also requires instrumenting false positive cost separately from fraud loss so the feedback signal is visible. Processors that don't track false positive rate as a first-class metric will consistently over-decline because the fraud loss is visible (chargebacks appear as costs) and the false positive cost is invisible (it's recorded as a declined authorization, not a loss event).

The first step is measurement. Query your authorization logs for soft declines on transactions that were subsequently attempted and approved on a retry or alternate card — this is a partial proxy for your false positive rate. If the retry success rate on declined transactions is above 40%, your model or rules are generating substantial false positives that are currently invisible to your fraud P&L.
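A minimal sketch of that measurement, assuming authorization log rows can be reduced to a customer ID, merchant ID, timestamp, and approved flag. The field names and the 30-minute retry window are hypothetical choices. The function counts every declined attempt it receives, so filter the log to the decline types you care about (such as soft declines) plus all approvals before calling it. Keying on the customer rather than the card lets approvals on an alternate card count as recoveries.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class AuthAttempt:
    customer_id: str  # hypothetical schema: map these to your own log fields
    merchant_id: str
    timestamp: datetime
    approved: bool


def retry_success_rate(attempts, window=timedelta(minutes=30)):
    """Share of declines followed by an approval for the same customer and merchant within `window`."""
    sequences = {}
    for a in sorted(attempts, key=lambda x: x.timestamp):
        sequences.setdefault((a.customer_id, a.merchant_id), []).append(a)
    declines = recovered = 0
    for seq in sequences.values():
        for i, attempt in enumerate(seq):
            if attempt.approved:
                continue
            declines += 1
            # Any later approval (same or alternate card) inside the window counts as a recovery
            if any(later.approved and later.timestamp - attempt.timestamp <= window
                   for later in seq[i + 1:]):
                recovered += 1
    return recovered / declines if declines else 0.0


t0 = datetime(2024, 1, 1, 10, 0)
log = [
    AuthAttempt("c1", "m1", t0, False),
    AuthAttempt("c1", "m1", t0 + timedelta(minutes=5), True),  # recovered
    AuthAttempt("c2", "m1", t0, False),                        # never retried
    AuthAttempt("c3", "m1", t0, False),
    AuthAttempt("c3", "m1", t0 + timedelta(hours=2), True),    # outside the window
]
print(f"{retry_success_rate(log):.0%}")  # 33%
```

On the toy log, one of three declines is recovered within the window, a 33% retry success rate. The measurement only catches customers who persisted, so it undercounts false positives, which is why it is only a partial proxy.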

For more on the model architecture that enables precision-first scoring, see Fraudhalo's Model Card and the How It Works page for threshold calibration detail. For processors dealing specifically with high false positive rates from card testing rules, see Card Testing Detection.