RealWork — Procurement & Vendor-Risk Intelligence from Live Web Data

The PDF Black Hole — Live

LA County publishes 1,652 homeless-service provider invoices as scanned PDFs under the LA Alliance settlement. They're public record, but completely unqueryable. No CSV. No API. An auditor who wants to know what Del Amo Hospital billed last month has to open PDFs one by one. We built a pipeline to fix that, and it is working toward all 1,652 documents.

Pipeline running — dataset growing

864

invoices extracted

$211M

in billing structured

145

distinct providers

of 1,652 total documents · dataset grows as pipeline runs · ledger.json on GitHub ↗

⚠️

Risk Analyst Finding

Gemini 2.5 Pro with Google Search grounding identified billing patterns in the extracted invoices consistent with known Medi-Cal billing violations. Findings are under private review prior to any disclosure.

Before & After — real invoices, real extractions

Before

LAFH_SupportiveServices_Oct2024.pdf
1.8 MB · 64 pages · scanned

[unreadable billing summary grid]
[handwritten corrections]
[rotated signature page]

→ to query: open manually

After — Gemini 2.5 Flash, 11 seconds

{
  "vendor": "OH-HELP, INC.",
  "invoice_date": "2024-10-01",
  "billed_amount": 1078638.00,
  "deliverables": [
    "Interim housing beds",
    "Supportive services",
    "Case management"
  ],
  "confidence": "high"
}
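Before a record like the one above joins the dataset, it can be checked for basic sanity. The following is a minimal Python sketch, assuming the field names shown in the sample. The `Invoice` class, the `parse_invoice` helper, and the set of allowed confidence labels are illustrative assumptions, not the pipeline's actual code.

```python
import json
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Invoice:
    vendor: str
    invoice_date: date
    billed_amount: float
    deliverables: tuple
    confidence: str


# Assumed label set: only "high" appears in the sample above.
ALLOWED_CONFIDENCE = {"high", "medium", "low"}


def parse_invoice(raw: str) -> Invoice:
    """Parse one extracted record and reject it if a basic check fails."""
    data = json.loads(raw)
    amount = float(data["billed_amount"])
    if amount < 0:
        raise ValueError("billed_amount must be non-negative")
    if data["confidence"] not in ALLOWED_CONFIDENCE:
        raise ValueError(f"unexpected confidence label: {data['confidence']!r}")
    return Invoice(
        vendor=data["vendor"].strip(),
        # date.fromisoformat raises ValueError on a malformed date string.
        invoice_date=date.fromisoformat(data["invoice_date"]),
        billed_amount=amount,
        deliverables=tuple(data["deliverables"]),
        confidence=data["confidence"],
    )


sample = """{
  "vendor": "OH-HELP, INC.",
  "invoice_date": "2024-10-01",
  "billed_amount": 1078638.00,
  "deliverables": ["Interim housing beds", "Supportive services", "Case management"],
  "confidence": "high"
}"""
invoice = parse_invoice(sample)
print(invoice.vendor, invoice.invoice_date, invoice.billed_amount)
```

Rejecting a record at parse time keeps malformed dates or negative amounts out of the billing totals instead of surfacing them later as anomalies.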

Latest extractions

Provider	Date	Billed	Confidence

Extracted from publicly-available LA County invoices · updated as pipeline runs · full dataset on GitHub ↗

The Tracking Gap

California's Grant Information Act (AB-132, Gov Code §8333) requires the state to track post-award disbursement. We pulled both available fiscal years from data.ca.gov's CKAN API and ran the audit. The field is empty for every single record. This is the systemic gap that makes a tool like this necessary.

26,907 / 26,907

Grant records across FY 22-23 + FY 23-24 with no centralized spend tracking populated. 100% of both datasets.
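The empty-field audit comes down to counting records where one field has no value. A rough Python sketch follows, assuming the records have already been downloaded as a list of dictionaries. The field name `disbursed_amount` and the sample rows are hypothetical stand-ins, not the real column name or data.

```python
def empty_field_share(records, field):
    """Return (empty_count, total) for `field` across a list of dict records."""

    def is_empty(value):
        # Treat missing keys, None, and whitespace-only strings as unpopulated.
        return value is None or str(value).strip() == ""

    empty = sum(1 for record in records if is_empty(record.get(field)))
    return empty, len(records)


# Hypothetical rows standing in for grant records.
sample_records = [
    {"grant_id": "A-1", "disbursed_amount": None},
    {"grant_id": "A-2", "disbursed_amount": ""},
    {"grant_id": "A-3"},
]
empty, total = empty_field_share(sample_records, "disbursed_amount")
print(f"{empty} / {total} records have no value for disbursed_amount")
```

Counting missing keys, nulls, and blank strings together matters here, because a field can be "empty" in any of those three ways depending on how the export was produced.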

$36.5 Billion

Total award value with zero state-level disbursement record. No centralized state dataset shows whether the money was spent as awarded.

What the Tool Surfaces

The pipeline ran against 50,000 DGS purchase orders and surfaced six threshold-edge patterns matching documented procurement-fraud signatures. The strongest example is below. Note the framing: we are not accusing anyone. We are showing what the audit tool flags and how it honestly weakens those flags when contextual evidence emerges — exactly the calibration the State Auditor's office needs.

THRESHOLD-CEILING PATTERN · CONTEXT-DEPENDENT $599,400

The $49,950 Pattern

12 purchase orders at exactly $49,950 each — $50 below the competitive-bidding threshold

All 12 amounts identical to the dollar — $49,950, not approximate

5 of 6 most recent contracts signed by a single named procurement officer at Cal Fire

No subcontractors disclosed on any PO (rules out DBE pass-through fraud variant)

Charitable context confirmed: Cal Fire was actively battling the King Fire (Aug 14–18) and Dillon Fire (Aug 28+) during this window

The honest read: California's State Contracting Manual prohibits "splitting purchases" to evade competitive bidding (Public Contract Code §§ 10301-10340 require formal bidding above $50,000), and the 2019 Caltrans bid-rigging prosecution (USA v. Yong, Miller, Opp — 49-78 month sentences, $3M restitution) was built on a similar fact pattern.

However, the dates fall during a real wildfire emergency. Cal Fire ramped from ~55 personnel on August 14 to ~1,760 personnel at the Dillon Fire's September 5 peak. Multiple food orders to one caterer in tight succession during active fire response is plausible, and emergency procurement has expedited rules.

What the fire context does not explain: why every PO came to exactly $49,950. Genuine emergency procurement scales with actual crew size: 256 firefighters need less food than 1,760 do. Identical-amount POs across a span where crew size varied roughly 7× look more like "use the threshold as a default PO ceiling" than "this is what the food actually cost."

We now frame this as a threshold-ceiling pattern warranting a State Auditor review of two questions: (1) is exactly $49,950 the default PO size for this procurement officer, regardless of actual need, and (2) what was the per-meal unit cost across the 12 POs? Answering either definitively requires line-item access that the public Power BI dashboard does not currently expose.
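A threshold-edge screen of this kind can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the pipeline's actual heuristic: the function name, the `band` and `min_count` parameters, and the sample vendors and amounts are all hypothetical.

```python
from collections import Counter, defaultdict


def threshold_edge_clusters(pos, threshold, band, min_count):
    """Keep vendors with at least `min_count` POs priced in [threshold - band, threshold)."""
    by_vendor = defaultdict(list)
    for po in pos:
        if threshold - band <= po["amount"] < threshold:
            by_vendor[po["vendor"]].append(po["amount"])

    clusters = {}
    for vendor, amounts in by_vendor.items():
        if len(amounts) >= min_count:
            modal_amount, modal_count = Counter(amounts).most_common(1)[0]
            clusters[vendor] = {
                "count": len(amounts),
                "total": sum(amounts),
                "modal_amount": modal_amount,
                # 1.0 means every near-threshold PO carries the same amount.
                "identical_share": modal_count / len(amounts),
            }
    return clusters


# Synthetic purchase orders, not real data.
sample_pos = [{"vendor": "Vendor A", "amount": 49_950}] * 4 + [
    {"vendor": "Vendor A", "amount": 12_300},
    {"vendor": "Vendor B", "amount": 49_800},
    {"vendor": "Vendor B", "amount": 51_200},
]
clusters = threshold_edge_clusters(sample_pos, threshold=50_000, band=500, min_count=3)
print(clusters)
```

Reporting `identical_share` separately from the count keeps the two signals distinct: many POs near a threshold is one flag, and many POs at the same exact amount is a stronger one.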

Five more vendors with the same pattern

REPEATED THRESHOLD-EDGE POs 28 POs

McKesson Medical-Surgical

$4,943-$4,957 to Correctional Health Care Services — 93% buyer concentration

28 purchase orders clustered just under the $5,000 micro-purchase threshold, 93% to a single buyer, with a repeating "medical supplies" description. Plausible legitimate explanation: recurring supply orders. The pattern still warrants a first-pass review.

FUEL ORDERS AT THRESHOLD 13 POs

Falcon Fuels Inc

~$4,999 to Cal Fire (85% concentration) — "fuel for dept vehicles"

13 fuel purchase orders just under the $5K threshold, same buyer, same description. This could be emergency fuel during fire response, or it could be the same pattern as the Panini Time POs at smaller dollar amounts. Worth a closer look.

IDENTICAL AMOUNTS IN 3 DAYS 3 × $4,999

Progressive Medical, Inc.

3 POs at exactly $4,999 within 3 days — single buyer, identical scope

3 purchase orders at precisely $4,999 issued over a 3-day window, 100% to Correctional Health Care, all with description "negative pressure wound therapy rental." The exact-amount-plus-tight-timing combination is the strongest split-contract signature we found at this threshold.

CYLINDER RENTAL CLUSTER 6 POs

Airgas USA LLC

6 POs at ~$4,999 to Air Resources Board — "cylinder rental"

3 of the POs fall in a single 7-day window, all at the threshold edge, with the same description. The pattern matches, but the vendor is large and well established, so this is more likely a benign procurement habit than intentional threshold avoidance.
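Two of the signals used in the cards above, buyer concentration and tight-timing clusters, are simple to compute. The sketch below is a hedged Python illustration: both helper names are hypothetical, the 26-of-28 split is an assumed breakdown that happens to round to 93% (the source does not state the exact counts), and the dates are invented.

```python
from collections import Counter
from datetime import date, timedelta


def buyer_concentration(buyers):
    """Return (top_buyer, share): the most frequent buyer and its fraction of the POs."""
    if not buyers:
        return None, 0.0
    top_buyer, count = Counter(buyers).most_common(1)[0]
    return top_buyer, count / len(buyers)


def max_pos_in_window(dates, window_days):
    """Largest number of PO dates that fall inside any span of `window_days` consecutive days."""
    ordered = sorted(dates)
    best = 0
    start = 0
    for end, current in enumerate(ordered):
        # Move the left edge forward until the span fits inside the window.
        while current - ordered[start] >= timedelta(days=window_days):
            start += 1
        best = max(best, end - start + 1)
    return best


# Assumed split: 26 of 28 POs to one buyer.
buyers = ["Buyer X"] * 26 + ["Buyer Y"] * 2
top_buyer, share = buyer_concentration(buyers)
print(top_buyer, f"{share:.0%}")

# Invented dates: three POs on consecutive days, one much later.
po_dates = [date(2024, 1, 8), date(2024, 1, 9), date(2024, 1, 10), date(2024, 3, 1)]
print(max_pos_in_window(po_dates, window_days=3))
```

The sliding window runs in a single pass over the sorted dates, so it stays cheap even across tens of thousands of purchase orders.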

What the Tool Surfaces — Nonprofit Track

The pipeline cross-referenced 500 California nonprofit state-grant recipients against their publicly-filed IRS Form 990 returns via ProPublica. EIN-match sanity validation dropped wrong-EIN false positives (a tiny dormant local chapter being matched against a national org's name). 36 organizations survived as HIGH PRIORITY anomalies. Every number below comes from a public 990 filing. The State Auditor's office decides what warrants follow-up — the pipeline just makes the queue tractable.

$4.75M STATE / $556K REVENUE $4,750,157

Finish First Academy

State grants 8.5× reported total revenue

Form 990 reports $556K total revenue against $4.75M in state grants

Officer compensation grew 77% YoY ($56K → $100K)

Total expenses grew 190% YoY ($210K → $609K)

An organization reporting $556K in revenue received $4.75M in state grants. Officer compensation and total expenses both spiked sharply in the same year. Charitable explanations exist (multi-year grants disbursed over time, fiscal sponsorship arrangements). The pattern itself warrants verification by an entity with subpoena authority.

$338K OFFICER COMP, 25% OF EXPENSES $4,456,234

Land Together

$338K officer compensation at a $1.34M-expense organization

25.2% of total expenses going to one officer

Officer comp grew 55% YoY ($218K → $338K)

$4.46M in CA grants against $1.93M reported revenue

Sector median for officer compensation as a share of expenses is roughly 8-10%. Land Together's 25% ratio is more than 2.5× the top of that range. The org also received $4.46M in state grants while reporting $1.93M total revenue. Multiple flags surviving validation simultaneously is the strongest signal in our Track B (nonprofit) audit.

$722K OFFICER COMPENSATION $1,634,722

Veterans Transition Center of California

$722K officer comp, year-over-year spike, $1.6M state grants

$722,477 officer compensation

YoY spike during state-funding year

Triple-flag: HIGH_OFFICER_COMP_SMALL_ORG, HIGH_OFFICER_COMP_RATIO, OFFICER_COMP_YoY_SPIKE. The signature pattern is state grants flowing in and executive compensation spiking the same fiscal year. Every flag survived our wrong-EIN sanity check.

$2.07M FOR ONE OFFICER $4,549,765

Golden Gate National Parks Conservancy

Officer compensation more than doubled in one year

Officer compensation: $900,777 → $2,071,526 (+130%)

A large, well-known nonprofit. The 990 filing shows compensation to a single officer of $2.07M in 2023, up from $900K the prior year. May be entirely lawful — large public-facing nonprofit CEOs sometimes earn at this level — but a $1.17M YoY increase coinciding with $4.5M in state grants is exactly the pattern Track B is designed to detect.

EXPENSES GREW 5× $2,395,659

CityServe Network

Total expenses jumped 384% YoY ($6.7M → $32.5M)

An organization that spent $6.7 million in one fiscal year and $32.5 million the next. Some growth is plausible — large grants do scale organizations — but a 5× increase in a single year warrants verification of where the money went.

+355% OFFICER COMP YoY $2,725,000

Community Action Partnership of Kern

Officer compensation: $451K → $2.05M YoY

$451,111 to $2,053,022 in officer compensation in a single fiscal year. This is the largest YoY officer-compensation spike in our entire validated dataset. The org received $2.7M in state grants in the same period.

30 more validated anomalies are in data/track_b/validated_report.md in the repo. Every flag references a public 990 filing. The pipeline that surfaced them is reproducible.
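The ratios and year-over-year figures quoted in this section can be reproduced with basic arithmetic. The Python sketch below uses the dollar amounts given in the cards above; where a card gives only a rounded figure (for example $556K revenue or $1.34M expenses), the rounded value is used, so results match the cards only to the stated precision. The helper names are illustrative.

```python
def ratio(numerator, denominator):
    """Plain ratio with a guard against a zero or negative base."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return numerator / denominator


def yoy_growth_pct(prior, current):
    """Year-over-year change as a percentage of the prior-year value."""
    return (ratio(current, prior) - 1) * 100


# Finish First Academy: state grants against reported revenue (revenue rounded to $556K).
grants_to_revenue = ratio(4_750_157, 556_000)

# Land Together: officer compensation as a share of total expenses (both rounded).
comp_share = ratio(338_000, 1_340_000)

# Golden Gate National Parks Conservancy and Community Action Partnership of Kern.
ggnpc_growth = yoy_growth_pct(900_777, 2_071_526)
kern_growth = yoy_growth_pct(451_111, 2_053_022)

print(f"grants / revenue: {grants_to_revenue:.1f}x")
print(f"officer comp share: {comp_share:.1%}")
print(f"GGNPC officer comp YoY: {ggnpc_growth:+.0f}%")
print(f"Kern officer comp YoY: {kern_growth:+.0f}%")
```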

What a Complete Dossier Looks Like

The patterns the tool surfaces are screening signals. Converting one into an actionable lead means writing a tight, cited, defensible dossier that an oversight body can act on. This is the artifact the State Auditor's confidential channel actually wants — not a webpage, not a tweet, a single document where every claim ties to a public record. We wrote one as a worked example.

METHODOLOGY DEMONSTRATION — ANOMALY WARRANTING REVIEW $834K vs $500K

Trybe, Inc. — Worked Dossier

Officer compensation reported on Form 990 exceeds the entire state grant by approximately 67%

$834,009 in officer compensation reported on Form 990 (tax year 2023)

$500,000 in California state grants received in the same fiscal year

Both Gemini 2.5 Pro and GPT-4o (via AI/ML API) independently flagged this pattern

Dossier lists 4 charitable explanations that must be ruled out before any further claim

The complete dossier — at data/dossiers/DOSSIER_trybe_inc.md in the repo — names the entity, cites every claim to a public record, lists the four charitable explanations that the State Auditor's subpoena power could rule out (multi-year grant amortization, fiscal sponsorship, comparable-position salary justification, board-approval process), and recommends the specific oversight channel: submission to the California State Auditor's confidential hotline using one of three pre-drafted tip letter templates we shipped (STATE_AUDITOR_TIP_TEMPLATES.md). This is what depth looks like.

The honest answer to "did you find fraud": we found patterns that warrant review by an entity with subpoena authority. The State Auditor's office is exactly that entity. The dossier above is the artifact you hand to them. WHAT_DEPTH_LOOKS_LIKE.md in the repo gives the 9-step methodology that converts an aggregate-pattern flag into a State Auditor-ready dossier.

Pre-Drafted State Auditor Tip Templates →

How the Tool Works

A six-stage forensic audit pipeline. ETL → heuristic flagging → external verification → primary LLM synthesis → cross-model ensemble validation → transparent reporting. Built with Bright Data, Gemini, AI/ML API, and ProPublica. Hard budget cap, JSONL cost ledger, dead-end log, reproducible end-to-end.
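A hard budget cap paired with a JSONL cost ledger might look like the following Python sketch. The `CostLedger` class, the entry fields, the stage labels, and the dollar amounts are assumptions for illustration; the source does not describe the ledger's actual schema.

```python
import io
import json


class BudgetExceeded(RuntimeError):
    """Raised when a charge would push total spend past the cap."""


class CostLedger:
    def __init__(self, cap_usd, sink):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0
        self.sink = sink  # any writable text stream, e.g. a file opened in append mode

    def charge(self, stage, cost_usd):
        """Record a cost, refusing it up front if the cap would be exceeded."""
        if self.spent_usd + cost_usd > self.cap_usd:
            raise BudgetExceeded(f"{stage}: cap of ${self.cap_usd:.2f} would be exceeded")
        self.spent_usd += cost_usd
        entry = {"stage": stage, "cost_usd": cost_usd, "total_usd": self.spent_usd}
        # One JSON object per line keeps the ledger appendable and easy to diff.
        self.sink.write(json.dumps(entry) + "\n")


buffer = io.StringIO()  # stands in for a ledger file in this demo
ledger = CostLedger(cap_usd=1.00, sink=buffer)
ledger.charge("verification", 0.25)
ledger.charge("llm_synthesis", 0.50)
try:
    ledger.charge("ensemble_validation", 0.50)
    cap_hit = False
except BudgetExceeded:
    cap_hit = True
print(buffer.getvalue(), cap_hit)
```

Checking the cap before the call is made, rather than after, is what makes the cap "hard": a stage that would overspend never runs.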

1

Multi-Source ETL

26,907 grant records (CA Grants Portal), 50,000 purchase orders (DGS), 500 nonprofit 990s (ProPublica). All normalized, deduplicated, persisted to SQLite.
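Normalizing, deduplicating, and persisting to SQLite can be sketched with Python's standard `sqlite3` module. The table layout, the `normalize_vendor` rule, and the PO numbers below are hypothetical; the source does not describe the real schema.

```python
import sqlite3


def normalize_vendor(name):
    """Uppercase, drop commas and periods, and collapse runs of whitespace."""
    cleaned = name.upper().replace(",", " ").replace(".", " ")
    return " ".join(cleaned.split())


def load_purchase_orders(conn, rows):
    """Insert (po_number, vendor, amount) rows, skipping repeated PO numbers."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS purchase_orders ("
        "po_number TEXT PRIMARY KEY, vendor TEXT NOT NULL, amount_cents INTEGER NOT NULL)"
    )
    conn.executemany(
        # INSERT OR IGNORE leaves the first copy of a PO number in place.
        "INSERT OR IGNORE INTO purchase_orders VALUES (?, ?, ?)",
        [(po, normalize_vendor(vendor), round(amount * 100)) for po, vendor, amount in rows],
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM purchase_orders").fetchone()[0]


conn = sqlite3.connect(":memory:")
rows = [
    ("PO-1", "Falcon Fuels Inc", 4999.00),
    ("PO-1", "FALCON FUELS, INC.", 4999.00),  # duplicate PO number
    ("PO-2", "Airgas USA LLC", 4999.00),
]
stored = load_purchase_orders(conn, rows)
print(stored, normalize_vendor("FALCON FUELS, INC."))
```

Storing amounts as integer cents avoids floating-point drift when later stages test for exact-amount matches.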

2

Heuristic Flagging

Per-source anomaly detection: just-under-threshold PO clusters, repeating exact amounts, buyer-vendor concentration, nonprofit overhead ratios, exec-comp YoY spikes.

3

Bright Data Verification

Web Unlocker bypasses anti-bot on bizfileonline.sos.ca.gov, CCLD, ProPublica, SAM.gov. SERP API runs 5-variant queries across all flagged entities. Scraping Browser drives the state's public Power BI procurement dashboard — which is where the pipeline surfaced a single named procurement officer signing 5 of 6 just-under-threshold contracts within 17 days. Hard budget cap throughout.

4

LLM Synthesis

Gemini 2.5 Pro distinguishes real anomalies from data quality issues and DBA-trap false positives. Cleared cases get CLEARED with reasoning; survivors get WARRANTS INVESTIGATION.

5

Validation Pass

EIN-match sanity checks drop false positives (e.g., MADD flagged when we matched a $5K-revenue local chapter, not the national org). 102 of 104 nonprofit flags survived; 36 reached HIGH priority.
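One way to catch a wrong-EIN match of the kind described above is to compare the matched filer's reported revenue against the grant total. The Python sketch below is one possible check, not necessarily the one the pipeline uses: the function name, the 1% cutoff, and the $1,000,000 grant figure in the first example are assumptions.

```python
def plausible_ein_match(grant_total, matched_revenue, min_revenue_share=0.01):
    """Return False when the matched filer looks far too small to be the grant recipient.

    `min_revenue_share` is an assumed cutoff: the matched organization's reported
    revenue must be at least this fraction of the grants attributed to it.
    """
    if grant_total <= 0:
        return True  # nothing to compare against
    return matched_revenue >= grant_total * min_revenue_share


# A $5K-revenue local chapter matched against a hypothetical $1,000,000 in grants.
chapter_ok = plausible_ein_match(grant_total=1_000_000, matched_revenue=5_000)

# Finish First Academy's figures: the gap is large, but not a matching error.
finish_first_ok = plausible_ein_match(grant_total=4_750_157, matched_revenue=556_000)

print(chapter_ok, finish_first_ok)
```

The cutoff is deliberately loose: it should drop only matches that are implausible on their face, so that genuine grants-exceed-revenue anomalies still reach the review queue.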

6

Transparent Reporting

Auto-generated markdown reports per round. Every finding cites public records. Dead-end log documents what was cleared, so future investigators don't repeat our work.

What We Explicitly Do Not Claim

✕ We do not claim fraud has been proven
✕ We do not claim any individual entity has acted illegally
✕ We do not use "fraud" as a conclusion — only as a pattern label
✓ We surface patterns present in public records
✓ We frame every finding as "warrants investigation by oversight bodies"
✓ We publish the methodology so anyone can verify or extend it

The path from "anomaly warranting investigation" to confirmed fraud runs through the State Auditor, the DOJ Procurement Collusion Strike Force, and the courts — not through a hackathon submission. Our role is to ship the tool that surfaces candidates and reduces the cost of investigation.

Built With

Bright Data: Web Unlocker · SERP API · Scraping Browser

Gemini 2.5 Flash: PDF extraction · document intelligence

Gemini 2.5 Pro + grounding: risk analysis

AI/ML API: GPT-4o ensemble cross-validation

ProPublica · Socrata · CA CKAN API

This tool documents patterns that fraud detection systems flag. Those patterns are derived from publicly available sources: the California State Contracting Manual, the filings in the 2019 Caltrans procurement-fraud prosecution, the State Auditor's published risk frameworks, and academic fraud-detection literature. They are well-known to fraud examiners.

Publishing detection methodology supports oversight; it does not create new evasion opportunities. The defender's advantage is that detection systems cross-reference many signals simultaneously — evading all of them is harder than evading any one.

This tool is intended for use by oversight bodies — the California State Auditor, the DOJ Procurement Collusion Strike Force, qui tam attorneys, accountability journalists, and other entities whose mandate is public integrity. It is not intended for use by parties who would game procurement.