Latency vs Accuracy: The Tradeoff Every Risk Engineer Faces

Every fraud scoring system that operates in the authorization path faces the same fundamental constraint: the card network authorization window. Visa and Mastercard expect an authorization response within approximately 1–3 seconds at the network level; actual acquirer processing SLAs commonly require a fraud decision in 150–300ms to avoid affecting authorization rates. Push beyond that ceiling and you start seeing timeout-based approvals, partial authorization failures, or — worst outcome — the issuer declining on a network timeout code that looks like a payment failure to the cardholder.

The latency constraint is therefore not optional. It's an architectural boundary condition that every fraud scoring system has to design around. The engineering question is: given a hard latency budget of, say, 100ms p99, how much accuracy can you actually achieve, and which features do you sacrifice when you're forced to shed computation time?

Where time goes in a fraud scoring pipeline

The p99 latency budget gets consumed across several distinct stages. Understanding where the time goes is the prerequisite for making informed tradeoffs. In a typical production fraud scoring pipeline:

Pipeline Stage	Typical p50 (ms)	Typical p99 (ms)	Primary Optimization Lever
API ingestion & schema validation	1–2	4–8	Protobuf over JSON; async validation
Feature store lookup (Redis Cluster)	3–6	12–25	Topology; replica reads; connection pooling
Real-time feature computation (velocity, ratios)	2–5	8–18	Pre-aggregation; sliding window approximations
Model inference (GBM, e.g. LightGBM)	4–8	12–22	Tree depth; feature count; quantization
Response serialization & return	1–2	3–6	Minimal; connection reuse
Total (happy path)	11–23	39–79	n/a

The numbers above represent a well-optimized pipeline with a moderate feature set (~40 features, primarily velocity and device signals). They're consistent with what we observe in Fraudhalo's production environment. The p99 blow-up, where a 23ms p50 becomes a 79ms p99, is almost always a feature store tail latency event: a Redis replica failover, a sudden keyspace hotspot from a BIN attack flooding the same key prefix, or a JVM GC pause in an older infrastructure setup.
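The totals in the table can be reproduced with a few lines of arithmetic. The sketch below sums the per-stage ranges and compares the worst case with the 100ms p99 budget used as an example earlier. The `STAGES` name and tuple layout are illustrative, and adding per-stage p99 values gives a planning figure rather than a measured end-to-end p99, because stages do not necessarily hit their tails on the same request.

```python
# Per-stage latency ranges in ms, copied from the table above:
# (p50_low, p50_high, p99_low, p99_high). Names and layout are illustrative.
STAGES = {
    "api_ingestion": (1, 2, 4, 8),
    "feature_store_lookup": (3, 6, 12, 25),
    "realtime_features": (2, 5, 8, 18),
    "model_inference": (4, 8, 12, 22),
    "response_serialization": (1, 2, 3, 6),
}

P99_BUDGET_MS = 100  # example budget from the text, not a universal constant


def column_totals(stages):
    """Sum each of the four columns across all stages."""
    return tuple(sum(row[i] for row in stages.values()) for i in range(4))


p50_low, p50_high, p99_low, p99_high = column_totals(STAGES)
headroom_ms = P99_BUDGET_MS - p99_high

print(f"p50 total: {p50_low}-{p50_high} ms")
print(f"p99 total: {p99_low}-{p99_high} ms")
print(f"worst-case headroom against budget: {headroom_ms} ms")
```

Under these figures the feature store lookup is the largest single p99 line item, which is why it is the first place to look when headroom disappears.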

The accuracy cost of feature pruning

When you're over budget on latency, the standard engineering instinct is to remove features. Cut the slow ones, ship a leaner model. The problem is that accuracy degradation from feature removal is non-linear: removing the 5 slowest features doesn't reduce accuracy by 5%; it might reduce accuracy by 20% if those 5 features happen to be the most predictive ones. This tradeoff is often made badly: engineers optimize for latency without quantifying the precision/recall cost of each removed feature.

The discipline here is feature importance × latency profiling done simultaneously. Before removing any feature from the scoring pipeline, measure:

The feature's contribution to model AUC-ROC on the held-out evaluation set
The feature's marginal p99 latency contribution (what does p99 improve by if this feature is removed?)
The precision@N impact: for the top-N scored transactions per day, how much of the fraud in that set depends on this feature to be ranked correctly?

The ratio of precision impact to latency improvement is the tradeoff score. Features with low precision impact and high latency contribution are safe to remove. Features with high precision impact and moderate latency contribution should be optimized (precomputed, approximated) rather than removed. The worst outcome is removing a high-precision feature to save 8ms p99 when that feature is responsible for correctly scoring 15% of the confirmed fraud in your top-0.5% flagged transactions.
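The ratio described above can be sketched as a small ranking helper. This is a minimal illustration, not a prescribed method: the `FeatureProfile` fields, the feature names, and every number in the sample list are hypothetical placeholders, and in practice both inputs would come from ablation runs on a held-out set.

```python
from dataclasses import dataclass


@dataclass
class FeatureProfile:
    name: str
    precision_impact: float  # drop in precision@N when the feature is removed
    p99_saving_ms: float     # p99 improvement when the feature is removed


def tradeoff_score(profile: FeatureProfile) -> float:
    """Precision lost per millisecond of p99 saved; lower means safer to remove."""
    if profile.p99_saving_ms <= 0:
        # Removing the feature buys no latency, so it is never a removal candidate.
        return float("inf")
    return profile.precision_impact / profile.p99_saving_ms


def removal_candidates(profiles):
    """Order features from safest to riskiest to remove."""
    return sorted(profiles, key=tradeoff_score)


# Placeholder numbers for illustration only, not measurements.
profiles = [
    FeatureProfile("device_card_link_depth", precision_impact=0.15, p99_saving_ms=8.0),
    FeatureProfile("card_count_15m", precision_impact=0.04, p99_saving_ms=4.0),
    FeatureProfile("legacy_geo_lookup", precision_impact=0.005, p99_saving_ms=10.0),
]

for p in removal_candidates(profiles):
    print(f"{p.name}: {tradeoff_score(p):.5f} precision lost per ms saved")
```

Features at the top of the list are removal candidates; those at the bottom are better served by precomputation or approximation.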

Graph features (cross-account linkage counts, device-to-card association depths) are typically both the most time-consuming and the most precision-critical for certain attack types, such as ring fraud and coordinated account takeover. Removing them to hit a latency target tends to degrade detection of exactly the highest-damage attacks.

Approximation techniques that preserve accuracy

The right architectural response to latency pressure is approximation before elimination. Several techniques are well-established in production fraud scoring systems:

Pre-aggregated velocity features. Instead of computing card_count_15m from raw event logs at query time, maintain a sliding window counter in Redis that is updated asynchronously on each transaction event. At scoring time, the feature lookup is a single Redis GET rather than a log scan. The counter is slightly stale (typically <100ms) but accurate enough for fraud detection purposes — a 15-minute window count that's 1 second old still carries its predictive value.
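The write-path/read-path split can be sketched with an in-memory stand-in for the Redis counter. This is an assumption-laden illustration: the `SlidingWindowCounter` class, the 60-second bucket size, and the key format are all hypothetical, and a real deployment would keep this state in Redis rather than in process memory. Timestamps are assumed to arrive in non-decreasing order.

```python
import time
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Approximate per-key event count over a trailing window, using fixed time buckets."""

    def __init__(self, window_s=900, bucket_s=60):
        self.window_s = window_s
        self.bucket_s = bucket_s
        # key -> deque of [bucket_start, count], oldest first
        self._buckets = defaultdict(deque)

    def _evict(self, buckets, now):
        # Drop buckets that ended before the start of the window.
        cutoff = now - self.window_s
        while buckets and buckets[0][0] + self.bucket_s <= cutoff:
            buckets.popleft()

    def record(self, key, now=None):
        """Write path: called asynchronously for each transaction event."""
        now = time.time() if now is None else now
        bucket_start = int(now // self.bucket_s) * self.bucket_s
        buckets = self._buckets[key]
        if buckets and buckets[-1][0] == bucket_start:
            buckets[-1][1] += 1
        else:
            buckets.append([bucket_start, 1])
        self._evict(buckets, now)

    def count(self, key, now=None):
        """Read path: cheap lookup at scoring time, no raw log scan."""
        now = time.time() if now is None else now
        buckets = self._buckets.get(key)
        if not buckets:
            return 0
        self._evict(buckets, now)
        return sum(c for _, c in buckets)


counter = SlidingWindowCounter(window_s=900, bucket_s=60)
counter.record("card:1234", now=0)
counter.record("card:1234", now=30)
counter.record("card:1234", now=600)
print(counter.count("card:1234", now=800))   # all three events fall inside the window
print(counter.count("card:1234", now=1000))  # the first bucket has aged out
```

Bucketing trades a little precision at the window edge (up to one bucket's worth of older events may be counted) for constant-size state per key.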

Count-min sketches. For high-cardinality velocity features (BIN-level counts across the full BIN space), a count-min sketch provides approximate counts in O(1) query time for a fixed sketch size. The error is one-sided (the sketch can overestimate a count but never underestimate it) and bounded by the sketch dimensions. At the query volumes typical of payment processing, the approximation error is typically below 1%, which is acceptable for a feature that feeds into a probability model rather than a hard threshold rule.
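A count-min sketch is small enough to show in full. The version below is a minimal sketch for illustration, assuming string keys and using salted BLAKE2b hashes from the standard library as the row hash functions; the class name, default dimensions, and key format are hypothetical. In the standard analysis, a width of about e/ε and a depth of about ln(1/δ) keep the overestimate within ε times the total count with probability at least 1 − δ.

```python
import hashlib


class CountMinSketch:
    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _indexes(self, key):
        # One hash per row, made distinct by salting with the row number.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                key.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        for row, col in self._indexes(key):
            self.table[row][col] += count

    def estimate(self, key):
        # Collisions only inflate cells, so the minimum across rows never undercounts.
        return min(self.table[row][col] for row, col in self._indexes(key))


sketch = CountMinSketch()
for _ in range(120):
    sketch.add("bin:411111")
sketch.add("bin:550000", 3)
print(sketch.estimate("bin:411111"))  # never less than the true count of 120
```

Memory use is fixed at width × depth counters regardless of how many distinct keys are seen, which is the property that makes it attractive for the full BIN space.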

Two-stage scoring. Run a fast, lightweight model (10–15 features, sub-20ms p99) as a first pass. Only escalate to the full model for transactions above a low-risk score threshold. Transactions scoring very low on the fast model are approved without full model evaluation. This reduces full-model inference load by 60–80% while preserving accuracy on the high-risk tail — where it matters most. The accuracy tradeoff is: you accept a slightly higher false negative rate on transactions that the fast model scores low but the full model would score high. Empirically, this population is small if the fast model is reasonably calibrated.

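def score_transaction(fast_features, full_features):
    # Stage 1: the lightweight model scores every transaction.
    fast_score = fast_model.predict(fast_features)
    if fast_score < FAST_APPROVE_THRESHOLD:  # e.g. score < 15
        return {"decision": "allow", "score": fast_score, "stage": "fast"}
    # Stage 2: only escalated transactions pay for full-model inference.
    full_score = full_model.predict(full_features)
    return {"decision": decide(full_score), "score": full_score, "stage": "full"}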

The p99 target is not always 100ms

It's worth being specific about which transactions require sub-100ms fraud scoring. Not all payment products have the same latency requirements. In-store POS transactions (EMV contact or contactless) operate on a tighter authorization window — typically 500ms total processing time including network round-trip — which means the fraud scoring component needs to be sub-50ms p99 to leave headroom. Card-not-present e-commerce transactions operate on a more generous window; 150–200ms p99 is typically acceptable when the checkout UI includes a processing indicator.

ACH, RTP, and FedNow payments have different latency profiles entirely. FedNow and RTP transactions are typically evaluated on a pre-initiation scoring model where the latency budget is 500ms–5 seconds, enough time to run graph features and deep behavioral models without approximation. The fraud architecture for real-time account-to-account payments is therefore different from card authorization scoring: you have more time but less transaction data (no card network signals, no MCC, no merchant ID in the traditional sense).

The point is that payment rails should not all be treated the same way in fraud scoring architecture. The p99 target should be derived from the specific payment rail's authorization protocol, not set as a universal constant.
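One way to keep the target rail-specific is to make it an explicit lookup rather than a shared constant. The mapping below is only a sketch: the rail keys are hypothetical, and where the text gives a range the lower bound is used as a conservative assumption.

```python
# p99 scoring budgets in ms, taken from the figures quoted above.
# Rail keys and the use of lower bounds for ranges are assumptions.
RAIL_P99_BUDGET_MS = {
    "card_present_pos": 50,   # sub-50 ms to leave headroom in a 500 ms window
    "card_not_present": 150,  # text gives 150-200 ms; lower bound used
    "instant_a2a": 500,       # text gives 500 ms to 5 s; lower bound used
}


def p99_budget_ms(rail: str) -> int:
    """Look up the scoring budget for a rail; fail loudly rather than default."""
    try:
        return RAIL_P99_BUDGET_MS[rail]
    except KeyError:
        raise ValueError(f"no latency budget defined for rail {rail!r}") from None


def within_budget(rail: str, observed_p99_ms: float) -> bool:
    return observed_p99_ms <= p99_budget_ms(rail)


print(within_budget("card_present_pos", 79))  # False: 79 ms exceeds the 50 ms target
print(within_budget("card_not_present", 79))  # True: 79 ms fits within 150 ms
```

Raising on an unknown rail is a deliberate choice here: a silent fallback to a default budget would reintroduce the universal constant the section argues against.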

What to measure, and when to escalate

For risk engineers running production fraud scoring, the operational metrics to monitor continuously are: p50 latency (catches gradual degradation), p99 latency (catches tail events), feature store hit rate (cache misses are the primary p99 blow-up cause), and model inference throughput (transactions per second at current latency budget). When p99 exceeds the threshold, the diagnostic starts at the feature store, not the model — in our experience, the feature store is the latency culprit in approximately 70% of p99 regression events.
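The triage order described above can be sketched as a small check over a window of latency samples. Everything here is illustrative: the function names, the nearest-rank percentile method, and the 0.99 hit-rate floor are assumptions, not values from a production system.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: smallest sample with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def hit_rate(hits, misses):
    total = hits + misses
    return hits / total if total else 1.0


def diagnose(latencies_ms, store_hits, store_misses, p99_budget_ms, min_hit_rate=0.99):
    """Return a short status string; thresholds are illustrative."""
    p50 = percentile(latencies_ms, 50)
    p99 = percentile(latencies_ms, 99)
    if p99 <= p99_budget_ms:
        return f"ok (p50={p50}ms, p99={p99}ms)"
    # Over budget: start at the feature store, not the model.
    if hit_rate(store_hits, store_misses) < min_hit_rate:
        return f"p99={p99}ms over budget: check feature store first (low hit rate)"
    return f"p99={p99}ms over budget: hit rate healthy, profile store tail latency, then inference"


samples = list(range(1, 101))  # synthetic latencies of 1..100 ms
print(diagnose(samples, store_hits=990, store_misses=10, p99_budget_ms=100))
print(diagnose(samples, store_hits=90, store_misses=10, p99_budget_ms=80))
```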

For the specific latency benchmarks Fraudhalo maintains in production and the feature computation architecture that achieves sub-80ms p99 scoring, see How It Works. For the model architecture tradeoffs between speed and accuracy on specific fraud vectors, see the Model Card.