Case Study: Curing a Document False-Positive Class with HADES v1.4.4
TL;DR
A customer scanning a 20 KB Microsoft Excel spreadsheet on HADES Community tier saw it flagged Score 84 HIGH. The file was a school class roster, structurally clean, no macros, no embedded content. We took the report seriously, traced the FP across the engine, found four structural defects plus one bonus YARA-filter gap, shipped a layered cure, and added a 100-document business-document regression corpus so this class of FP can't recur silently. We also caught and closed a separate PII-redaction gap discovered during the investigation: the FP report itself would have exposed personal email addresses in customer SIEM events.
What happened
A customer-facing instance of HADES v1.4.3 was scanning a Microsoft Excel file (a school class roster .xlsx). The Community-tier scan reported Score 84 HIGH. The file was:
- 20 KB on disk.
- A standard OOXML zip wrapper, 12 members, no vbaProject.bin, no embedded executables.
- Created with Excel 2016 and round-tripped through Google Sheets.
- Containing 18 mailto: hyperlinks to legitimate personal email addresses.
A second customer report on a different file (HADES Whitepaper v1.1.0 PDF, scanned on Pro tier) showed Score 80 HIGH with the YARA noise filter masking which rules fired (YARA: filtered 5/5 noisy matches for PDF format). Two FPs, two file formats, two license tiers. The shape of the bug was bigger than either sample on its own.
What we investigated (and what was easy to miss)
The natural triage was "find the YARA rule firing on the xlsx and silence it." That turned out to be wrong on three counts.
First, the source CLI scored the xlsx at 16 LOW, not 84 HIGH. The customer was running the deployed Nuitka binary or the demo container; the source HEAD at v1.4.3 already carried a partial fix that had not propagated to the customer's build. We documented this gap explicitly so the fix would address both paths.
Second, the FP wasn't a single bug. It was four structural defects stacked, plus a fifth that surfaced only when we built a synthetic 100-doc test corpus:
- The polyglot detector iterated every file format for OOXML zips, hitting PK\x03\x04 from the zip wrapper at offset 0 and flagging the file as a ZIP-format polyglot of itself.
- The MIME validator had no entries for DOCX, XLSX, or PPTX. When libmagic returned application/zip for an xlsx (structurally accurate), validation reported a 5.0 threat score.
- The base64 pattern scanner fired on legitimate Office metadata values (revision GUIDs, Google Sheets roundtripDataChecksum, hash digests) without any context check on the field name.
- The format detection iterated DOCX first in the file-signatures table, so any xlsx tied on signature got classified as DOCX, cascading into wrong MIME comparisons.
- (Bonus) The pre-existing YARA noise filter for compressed formats only fired when the file was larger than 100 KB. Real business documents are typically smaller: the 35 KB generated contracts in the corpus fell below the size gate, so multiple noisy rules fired on every clean DOCX and PPTX at Pro tier.
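The first and fourth defects are easy to reproduce in isolation. The sketch below is illustrative only: the signature table, its ordering, and the function names are assumptions made for the example, not HADES internals. It shows that any OOXML container starts with the ZIP magic at offset 0, and that a first-match signature table makes table order decide the format.

```python
import io
import zipfile

ZIP_MAGIC = b"PK\x03\x04"

# Hypothetical signature table. DOCX is listed first, as in the fourth defect.
SIGNATURES = [
    ("docx", ZIP_MAGIC),
    ("xlsx", ZIP_MAGIC),
    ("pptx", ZIP_MAGIC),
    ("zip", ZIP_MAGIC),
]


def make_minimal_xlsx() -> bytes:
    """Build a tiny zip with xlsx-style member names (not a valid workbook)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr("xl/workbook.xml", "<workbook/>")
    return buf.getvalue()


def naive_detect(data: bytes) -> str:
    """First-match detection: table order breaks every signature tie."""
    for name, magic in SIGNATURES:
        if data.startswith(magic):
            return name
    return "unknown"


def naive_polyglot_hits(data: bytes, detected: str) -> list[str]:
    """Flag every other format whose magic also appears at offset 0."""
    return [n for n, m in SIGNATURES if n != detected and data.startswith(m)]


sample = make_minimal_xlsx()
detected = naive_detect(sample)
hits = naive_polyglot_hits(sample, detected)
print(detected, hits)
```

The xlsx-shaped sample is classified as docx and simultaneously reported as a polyglot of its own sibling formats, which is the stacking effect described above.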
Third, the FP was leaking PII. The xlsx carried 18 personal email addresses. HADES findings would include those addresses in the description and evidence fields, which then flowed to customer SOC dashboards, portal submissions, and Splunk events. A customer scanning a HIPAA-regulated file (medical roster, insurance roll, financial export) would have HADES leak the PII into their incident-response pipeline. We promoted PII redaction from "nice to have" to "hard prerequisite" for the v1.4.4 ship.
What we shipped
Three surgical patches in the detection path
- An OOXML-aware skip set in the polyglot detector so zip-wrapped containers don't iterate their own sibling formats and don't whole-body-scan their compressed XML for random magic bytes.
- DOCX, XLSX, PPTX, and JAR MIME-type entries in both the heuristic and parser MIME maps, plus extension-aware OOXML format disambiguation, plus a "treat the OOXML family as MIME-interchangeable" rule that accepts application/zip from libmagic.
- A two-path context gate on the long-base64 pattern: skip the score when the containing field is a known base64-bearing field (Office revision GUID, document ID, roundtrip checksum) OR when the value structurally matches a canonical hash shape.
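A minimal sketch of the second and third patches, under stated assumptions: the function names, the field-name set (apart from roundtripDataChecksum, which is named above), and the exact hash shapes are hypothetical. The MIME strings are the standard registered types for these formats.

```python
import re

OOXML_FAMILY = {"docx", "xlsx", "pptx"}

EXPECTED_MIME = {
    "docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "jar": "application/java-archive",
}
OOXML_MIMES = {EXPECTED_MIME[f] for f in OOXML_FAMILY}


def mime_is_consistent(fmt: str, reported: str) -> bool:
    """Accept the exact type, or any family type / plain zip for OOXML."""
    if reported == EXPECTED_MIME.get(fmt):
        return True
    if fmt in OOXML_FAMILY:
        return reported == "application/zip" or reported in OOXML_MIMES
    return False


# Assumed field names; the real list is not published.
BASE64_BEARING_FIELDS = {"roundtripDataChecksum", "revisionGuid", "documentId"}

# Canonical hash shapes: hex digests of 16, 20, or 32 bytes ...
HEX_DIGEST = re.compile(r"^(?:[0-9a-fA-F]{32}|[0-9a-fA-F]{40}|[0-9a-fA-F]{64})$")
# ... or padded base64 encodings of 16, 20, or 32 bytes.
B64_DIGEST = re.compile(
    r"^(?:[A-Za-z0-9+/]{22}==|[A-Za-z0-9+/]{27}=|[A-Za-z0-9+/]{43}=)$"
)


def should_score_long_base64(field_name: str, value: str) -> bool:
    """Two-path gate: either path alone is enough to skip scoring."""
    if field_name in BASE64_BEARING_FIELDS:  # path 1: trusted field context
        return False
    if HEX_DIGEST.match(value) or B64_DIGEST.match(value):  # path 2: hash shape
        return False
    return True
```

Either path clears the match independently, so a checksum in an unfamiliar field still passes on shape, and an oddly shaped value in a known field still passes on context.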
One emergent bonus patch in the YARA noise filter
Drop the 100 KB size gate for the OOXML and JAR families (always zip-wrapped at all sizes) so noisy rules are suppressed on real business documents regardless of file size. Add the hidden-archive-in-metadata rule to the OOXML-only suppression list because the rule was designed for multi-archive families but mis-fires on pure OOXML wrappers. Non-OOXML formats keep the 100 KB threshold unchanged.
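The format-aware gate can be sketched as follows. The rule names are placeholders (the real names are not published), the non-OOXML format set is an assumption, and whether the threshold is counted as 100 × 1024 or 100 × 1000 bytes is also assumed.

```python
OOXML = {"docx", "xlsx", "pptx"}
ALWAYS_ZIP_WRAPPED = OOXML | {"jar"}
OTHER_COMPRESSED = {"zip", "gz", "7z"}  # assumed members, for illustration
SIZE_GATE_BYTES = 100 * 1024  # assumed interpretation of "100 KB"

NOISY_RULES = {"noisy_rule_a", "noisy_rule_b"}  # placeholder names
OOXML_ONLY_RULES = {"hidden_archive_in_metadata"}  # placeholder name


def rules_to_suppress(fmt: str, size_bytes: int) -> set[str]:
    """Return the YARA rule names to suppress for this file."""
    suppressed: set[str] = set()
    if fmt in ALWAYS_ZIP_WRAPPED:
        # No size gate: these formats are zip-wrapped at every size.
        suppressed |= NOISY_RULES
    elif fmt in OTHER_COMPRESSED and size_bytes > SIZE_GATE_BYTES:
        # Other compressed formats keep the original threshold.
        suppressed |= NOISY_RULES
    if fmt in OOXML:
        suppressed |= OOXML_ONLY_RULES
    return suppressed


print(sorted(rules_to_suppress("docx", 35 * 1024)))
print(sorted(rules_to_suppress("zip", 35 * 1024)))
```

A 35 KB DOCX now gets the full suppression set, while a 35 KB plain zip is still scanned with every rule active.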
One architectural addition (the WWCD piece)
A deterministic structural-fingerprint fast-path that short-circuits analysis before the engine runs the IOC and heuristic stages, when a document matches an "obviously clean" predicate (whitelisted member list, no macros, no embedded executables, no http/s external Targets outside known schema hosts, well-formed XML, valid PDF xref with single %%EOF). This mirrors our existing v1.16.1 SHA-256 hash allowlist and v1.9.0 magic-byte-format ML feature pattern: the deterministic allowlist is the trust boundary, ML is the second opinion, raw heuristics are the third. Acquirer-grade by design: detection-engine vendors can plug their cloud reputation services into this layer cleanly.
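To make the shape of the fast-path concrete, here is a deliberately simplified sketch for xlsx only. The allowed prefixes, forbidden suffixes, and function names are invented for illustration and are not the proprietary predicate set; the external-Target and PDF checks are omitted. A production check would also need a hardened XML parser and decompression limits.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Hypothetical allowlist of member-name prefixes for an xlsx package.
ALLOWED_PREFIXES = ("[Content_Types].xml", "_rels/", "docProps/", "xl/")
# Hypothetical markers for macros and embedded executables.
FORBIDDEN_SUFFIXES = (".bin", ".exe", ".dll")


def build_zip(members: dict[str, str]) -> bytes:
    """Helper for the example: build an in-memory zip from name -> body."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, body in members.items():
            zf.writestr(name, body)
    return buf.getvalue()


def is_obviously_clean_xlsx(data: bytes) -> bool:
    """Deterministic predicate: every check must pass, no scoring involved."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            for name in zf.namelist():
                if not name.startswith(ALLOWED_PREFIXES):
                    return False
                if name.lower().endswith(FORBIDDEN_SUFFIXES):
                    return False
                if name.endswith((".xml", ".rels")):
                    ET.fromstring(zf.read(name))  # raises on malformed XML
    except (zipfile.BadZipFile, ET.ParseError):
        return False
    return True


def scan(data: bytes, run_heuristics) -> int:
    """Fast-path first; fall through to the heuristic stages otherwise."""
    if is_obviously_clean_xlsx(data):
        return 0
    return run_heuristics(data)


clean = build_zip({"[Content_Types].xml": "<Types/>", "xl/workbook.xml": "<workbook/>"})
macro = build_zip({"[Content_Types].xml": "<Types/>", "xl/vbaProject.bin": "x"})
print(scan(clean, lambda d: 84), scan(macro, lambda d: 84))
```

The predicate fails closed: anything it cannot positively recognize falls through to the normal IOC and heuristic stages, so the fast-path can only reduce false positives on documents it fully understands.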
Two architectural interface stubs
- A result.verdict field separating the customer-facing answer from the engine-internal detection signal, so a future verdict layer can apply prevalence, publisher-trust, and tenant-policy overrides without touching scoring math.
- A PrevalenceOracle interface returning "unknown" by default, so a future cloud reputation service can plug in without changing call sites.
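The two stubs might look roughly like this. Only the names result.verdict and PrevalenceOracle come from the text above; the method name lookup, the ScanResult fields, and the attach_verdict helper are assumptions for the sketch.

```python
from dataclasses import dataclass
from typing import Protocol


class PrevalenceOracle(Protocol):
    """Anything with a matching lookup() satisfies the interface."""

    def lookup(self, sha256: str) -> str: ...


class NullPrevalenceOracle:
    """Default stub: knows nothing, so it never influences the verdict."""

    def lookup(self, sha256: str) -> str:
        return "unknown"


@dataclass
class ScanResult:
    score: int         # engine-internal detection signal
    level: str         # engine-internal severity label
    verdict: str = ""  # customer-facing answer, filled in separately


def attach_verdict(result: ScanResult, oracle: PrevalenceOracle, sha256: str) -> ScanResult:
    """Set the verdict without touching score or level."""
    prevalence = oracle.lookup(sha256)
    if prevalence != "unknown":
        # Override policy belongs to a future verdict layer.
        raise NotImplementedError("prevalence overrides are not implemented")
    result.verdict = result.level  # no override: verdict mirrors the engine
    return result


r = attach_verdict(ScanResult(score=16, level="LOW"), NullPrevalenceOracle(), "0" * 64)
print(r.verdict)
```

Because call sites already pass an oracle, swapping the null stub for a real reputation client later would not change their signatures.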
One mandatory cross-cutting hardening
A PII redactor at the result-serialization boundary that masks emails, phones, SSNs, Luhn-valid credit cards, and user-home filesystem paths, while preserving SHA-256 hashes, MAC addresses, UUIDs, ISO timestamps, system paths, and Luhn-invalid CC-shaped tracking numbers. A new --ir-mode CLI flag opts out for SOC incident-investigation workflows.
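A reduced sketch of a boundary redactor covering emails, SSNs, Luhn-valid card numbers, and user-home paths. The patterns, placeholder tokens, and function names are assumptions; phone numbers and the full preservation list are left out, so this is a starting point and not the shipped redactor.

```python
import re

EMAIL = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")
# Keep the home-directory prefix, mask only the user name.
HOME_PATH = re.compile(r"(/home/|/Users/|[A-Za-z]:\\Users\\)[^\s/\\]+")


def luhn_valid(digits: str) -> bool:
    """Standard Luhn checksum over a string of decimal digits."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def _mask_card(match: re.Match) -> str:
    digits = re.sub(r"\D", "", match.group())
    # Luhn-invalid numbers (e.g. tracking numbers) are left untouched.
    return "[CARD]" if luhn_valid(digits) else match.group()


def redact(text: str) -> str:
    """Mask PII in a finding string before it is serialized."""
    text = EMAIL.sub("[EMAIL]", text)
    text = SSN.sub("[SSN]", text)
    text = CARD.sub(_mask_card, text)
    text = HOME_PATH.sub(r"\1[USER]", text)
    return text


print(redact("mailto:jane.doe@example.com in /home/alice/roster.xlsx"))
```

Applying the redactor once, at serialization, means individual detectors do not each need to remember to mask their evidence fields.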
One regression gate
A 100-document business-document clean corpus (25 each XLSX, DOCX, PDF, PPTX) generated deterministically by a committed script. The regression test asserts zero HIGH or CRITICAL on both Community and Pro tiers across all 100 documents plus both real-world customer-reported FP samples. Auto-generated on demand so the test "just works" in local dev and CI.
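The gate reduces to two properties: the generator is deterministic, and no generated document scores HIGH or CRITICAL. The sketch below illustrates both for zip-based documents only; the member names, the fixed timestamp, and the scan callable are hypothetical stand-ins, and PDF generation is omitted.

```python
import hashlib
import io
import zipfile

FIXED_TIME = (2020, 1, 1, 0, 0, 0)  # pinned so repeated runs are byte-identical
KINDS = ("xl", "word", "ppt")       # stand-ins for XLSX, DOCX, PPTX


def make_doc(kind: str, index: int) -> bytes:
    """Generate one synthetic zip-wrapped document deterministically."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        members = (
            ("[Content_Types].xml", "<Types/>"),
            (f"{kind}/doc{index}.xml", f"<doc n='{index}'/>"),
        )
        for name, body in members:
            info = zipfile.ZipInfo(name, date_time=FIXED_TIME)
            info.compress_type = zipfile.ZIP_DEFLATED
            zf.writestr(info, body)
    return buf.getvalue()


def build_corpus(per_kind: int = 25) -> dict[str, bytes]:
    return {
        f"{kind}_{i:02d}": make_doc(kind, i)
        for kind in KINDS
        for i in range(per_kind)
    }


def corpus_digest(corpus: dict[str, bytes]) -> str:
    """Single hash over the whole corpus, in sorted name order."""
    h = hashlib.sha256()
    for name in sorted(corpus):
        h.update(name.encode())
        h.update(corpus[name])
    return h.hexdigest()


def assert_no_high(corpus: dict[str, bytes], scan) -> None:
    """scan is any callable returning a severity label for raw bytes."""
    failures = [n for n, d in corpus.items() if scan(d) in {"HIGH", "CRITICAL"}]
    assert not failures, f"false positives: {failures}"


corpus = build_corpus()
assert corpus_digest(corpus) == corpus_digest(build_corpus())
assert_no_high(corpus, lambda data: "SAFE")  # stub scanner for the example
```

In a real test the stub scanner would be replaced by the engine entry point and run once per license tier.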
What we explicitly did NOT do
WWCD architectural discipline means knowing when to STOP. We deferred several items to v1.5+ rather than entangle them with the FP cure:
- Document engine as a separate class from the executable engine. This is the architecturally clean separation, but it is a multi-week refactor that risks scoring-path regressions. It ships as a standalone PR with its own G5 corpus replay before AND after.
- Allowlist-as-signed-JSON policy bundles. Customer-extensible policy hot-swap is right when a tenant asks for it. None has. Code-as-allowlist is fine for v1.4.4.
- A two-path benign-verdict pattern (fingerprint AND publisher AND prevalence all agree to clear). Needs the interface stubs to have real implementations. Ships in v1.5.
- Tier 2 ML schema-v9 features and the v1.17 retrain. Pairs naturally with the next ML retrain cycle, not with a detection-engine patch release.
- Customer portal Tier 4 extension for document-class FP submissions with a PII-attestation checkbox. Pairs with v1.5 portal work.
The discipline of stopping at the right boundary is itself the WWCD lesson: every fix is an architectural investment, not a tactical unblock, but every investment is sized to what the signal supports.
What we learned
- Real-world clean documents must be in the regression gate. Until we built the 100-doc business corpus, we had no test that asserted "100 clean DOCX, XLSX, PPTX, and PDF score zero HIGH or CRITICAL." That's exactly the gate the v1.3.0 retrospective said we needed for malicious detection; we extended it symmetrically to benign detection.
- YARA noise filters need size-class awareness, not just size thresholds. A 100 KB size gate that excludes 35 KB business documents is a gate that doesn't fire when it's most needed. For inherently-compressed formats (OOXML, JAR, ZIP, archives in general) the gate should be format-aware, not byte-counted.
- Customer-visible findings must never carry raw PII. This is HIPAA and GDPR baseline regardless of detection accuracy. The audit log can carry originals (encrypted at rest, role-gated); the SIEM event, the portal submission, and the customer report cannot.
Numbers
| Metric | Pre-v1.4.4 | Post-v1.4.4 |
|---|---|---|
| Customer xlsx, Community tier | 84 HIGH (binary) / 16 LOW (source) | 0 SAFE |
| Customer xlsx, Pro tier | not reproduced | 0 SAFE |
| 100-doc business corpus HIGH/CRIT (Community) | 0 / 100 | 0 / 100 |
| 100-doc business corpus HIGH/CRIT (Pro) | 50 / 100 | 0 / 100 |
| Customer-visible PII leak in findings | yes (18 emails in xlsx) | redacted |
| MWB 3,271-file detection rate | 98.66% | 98.7% |
| Contagio 11,890-file detection rate | 99.89% | 99.9% |
Trade-secret hygiene
The exact predicate set inside the deterministic structural fingerprint and the YARA rule names in the suppression sets are proprietary detection logic. This case study describes the architectural pattern (deterministic allowlist + ML + heuristics in priority order; OOXML-family size-class awareness; context-gated PII patterns) but does not publish the specific allowlist members, rule names, or threshold values. Customers and acquirer due-diligence teams can validate the cure via the public regression corpus and the public detection metrics; the detection-engine internals remain in the proprietary source tree.
What's next
- v1.4.5: PDF whitepaper FP cure in the deep-format analyzer (the deferred half of this investigation); macro-detection regex tightening on .bin member catch-all, sample-set validated; Windows x86_64 Nuitka binary.
- v1.5: Real VerdictEngine behind the interface stub; LocalAllowlistOracle implementation; document-engine class refactor as standalone PR; ML schema-v9 retrain; portal Tier 4 document FP submission flow.
- v1.6+: PortalPrevalenceOracle once the customer portal has accumulated enough cross-tenant signal.
- v2.0: Tenant-curated allowlist manager with per-tenant signed policy bundles.
We will not wait until v2.0 to listen to customer FP reports. Every FP submission is sized at the time it lands; v1.4.4 is the proof that we ship the fix at the right altitude for the report.
Tested May 2026 • HADES v1.4.4 • Customer-reported samples plus synthetic 100-document business corpus