The Cross-Reference Trick That Turned F-Grade Data into A-Grade
I had 18,000 Houston pre-foreclosure records and more than a quarter of them were missing the one thing an investor actually needs: the property address. No street, no city, no zip. Just a name and a case number. That's an F-grade dataset — technically large, practically useless.
Six hours later, 2,160 of those records had real addresses, assessed values, and enough information to act on. The address coverage went from 71.6% to 85.0%. I didn't buy anything, didn't call anyone, and didn't scrape a single web page. I cross-referenced one free public dataset against another one.
This is the move that most data businesses skip, and it's the one that matters most.
Why foreclosure records are broken by design
A pre-foreclosure notice is a legal filing. It's written for the court, not for you. The document names the borrower, cites the loan, describes the property in legal terms — "Lot 22, Block 3, GUARANTY INVESTORS SUBDIVISION, HARRIS COUNTY, TEXAS" — and sets an auction date.
What it usually does not include is a street address. The legal description is sufficient for the court's purposes. The court doesn't need to know it's 4215 Maple Drive; it needs to know which parcel in which subdivision is being foreclosed on.
But an investor needs 4215 Maple Drive. Without it, the filing is noise.
This is why most foreclosure "data products" are quietly terrible. The vendor scraped the filings, counted the rows, put "18,000 pre-foreclosures" on the landing page — and buried the fact that 5,000 of them can't be located on a map. The row count sells; the coverage rate doesn't.
The join
Here's the fix, and it's embarrassingly simple in concept.
The county appraisal district maintains a separate database: the appraisal roll. Every parcel in the county, with the owner's name and the property's situs (physical) address. Harris County's roll has 1.6 million records. I downloaded it for free from hcad.org — a ZIP file, a couple of flat files, no email required.
The foreclosure notice has the owner's name. The appraisal roll has the owner's name tied to a street address. Match the names, pull the address across.
That's the entire trick. A join on owner name between two free public datasets.
Why it's harder than it sounds
If it were as clean as "match the names," everyone would do it. The reason they don't is the normalization step.
Foreclosure filings write names however the attorney felt like writing them that day:
- "MARTINEZ, EVA" in one filing
- "Eva Martinez" in another
- "Martinez, Eva M." in a third
- "MARTINEZ EVA & MARTINEZ ROBERTO" when there's a co-borrower
The appraisal roll has its own conventions:
- "MARTINEZ EVA M" (no comma, middle initial)
- "MARTINEZ PROPERTIES LLC" (different entity entirely)
- "MARTINEZ EVA MARIE ETAL" (heirs)
To match these, you normalize both sides: uppercase everything, strip suffixes (LLC, Inc, Et Al, Trust, Living, Revocable), remove punctuation, collapse whitespace. Then you do an exact match on the normalized strings.
This is deliberately conservative. I could fuzzy-match and catch more, but fuzzy matching on common names like Martinez risks linking the wrong Eva Martinez to the wrong property — and a false match is worse than no match. The investor who drives to the wrong house wastes more than the one who skips a record.
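A minimal sketch of that normalization step in Python, for illustration only. The function name, the token list, and the handling of "ET AL" are assumptions built from the examples above, not the production code.

```python
import re

# Terms the text says to strip. The exact list is an assumption; tune it per county.
STRIP_TOKENS = {"LLC", "INC", "ETAL", "TRUST", "LIVING", "REVOCABLE"}


def normalize_owner_name(raw: str) -> str:
    """Uppercase, drop punctuation, strip entity terms, collapse whitespace."""
    name = raw.upper()
    name = re.sub(r"[^\w\s]", " ", name)         # punctuation becomes a space
    name = re.sub(r"\bET\s+AL\b", "ETAL", name)  # fold "ET AL" into one token
    tokens = [t for t in name.split() if t not in STRIP_TOKENS]
    return " ".join(tokens)                      # split/join collapses whitespace


print(normalize_owner_name("Martinez, Eva M."))         # MARTINEZ EVA M
print(normalize_owner_name("MARTINEZ EVA MARIE ETAL"))  # MARTINEZ EVA MARIE
print(normalize_owner_name("Eva Martinez"))             # EVA MARTINEZ
```

Under exact matching, "Martinez, Eva M." now lines up with the roll's "MARTINEZ EVA M", while "Eva Martinez" (first name first) and co-borrower strings joined with "&" still match nothing. That is the conservative trade-off in practice: those records stay in the gap instead of being guessed at.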
The results, honestly
Harris County (Houston):
- 18,386 pre-foreclosure records total
- 5,264 were missing property addresses
- After cross-referencing against 1.6M HCAD parcels: 2,160 addresses filled
- Coverage: 71.6% to 85.0%
- Average assessed value of matched properties: $287,000
- Remaining gap: 3,104 records. Of those, 743 have no owner name at all (just case numbers, so there is nothing to match against). The other 2,361 are names that didn't match HCAD (truncated names, entities, spelling variations).
Travis County (Austin):
- 88 addresses filled from 485,000 TCAD parcels
- Smaller number because Austin had fewer gaps to begin with (96% coverage pre-enrichment)
Dallas County:
- 24 addresses filled from 860,000 DCAD parcels
- Small yield because 84% of Dallas pre-foreclosure records have no owner name at all — the county clerk's filings include only case numbers. You can't cross-reference what doesn't exist.
Cash buyers (three counties combined):
- 354 records enriched with property addresses
- Dallas cash-buyer address coverage: 63% to 92%
Total across all tables and counties: 2,626 records enriched. Every one of them went from a name and a case number that couldn't be located to a record with a real street address and assessed value.
Why this is the moat
The cross-reference is simple. A first-year data analyst could write the query. So why don't the big vendors do it?
Because they'd have to admit their data had gaps.
A vendor whose pitch is "18,000 pre-foreclosures" has a marketing problem the moment they say "but 5,000 of them don't have addresses." The incentive is to count the rows and let the customer discover the gaps after they've paid.
When you build the pipeline yourself — starting from the raw county filings and the raw appraisal roll — you see every gap from the start. You can't hide from it, and you don't need to. You just fix what's fixable, document what's not, and hand the customer a dataset where the coverage numbers are honest.
That honesty is the product. A dataset with 85% real addresses and a clear explanation of why the other 15% are missing is more valuable than a dataset with "18,000 records" and a silent coverage rate of 71%.
How to do it yourself
The recipe, if you want to replicate it:
- Get the foreclosure filings: county clerk's office, usually scrapeable or available through the clerk's public records search.
- Get the appraisal roll: free download (Harris, Tarrant, Collin) or a one-email public information request, or PIR (Travis, Dallas, Bexar). My earlier post has the exact template.
- Normalize both sides: uppercase, strip suffixes, remove punctuation, collapse whitespace. The normalization function is about fifteen lines of code.
- Join on normalized owner name: exact match only. No fuzzy matching unless you're willing to manually verify every match.
- Pull the situs address and assessed value across: that's your enrichment.
- Document the gap: how many records couldn't be matched, and why (no owner name, entity names, truncated names). This is the part that makes you trustworthy.
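The join and gap-documentation steps above can be sketched roughly like this in Python. The record layout (`owner_norm`, `address`, `situs`, `assessed`) and the toy rows are made up for illustration, and the names are assumed to be normalized already. Skipping owners who hold more than one parcel is my own assumption about how to keep exact matching safe, not something the pipeline above is stated to do.

```python
from collections import Counter, defaultdict


def enrich(filings, roll):
    """Fill missing addresses by exact match on a pre-normalized owner name.

    Mutates the filing dicts in place and returns a coverage/gap report.
    """
    # Index the appraisal roll by normalized owner name.
    by_owner = defaultdict(list)
    for parcel in roll:
        by_owner[parcel["owner_norm"]].append(parcel)

    gaps = Counter()
    for rec in filings:
        if rec.get("address"):
            continue                       # already has an address
        owner = rec.get("owner_norm")
        if not owner:
            gaps["no_owner_name"] += 1     # nothing to match against
            continue
        parcels = by_owner.get(owner, [])
        if len(parcels) == 1:
            rec["address"] = parcels[0]["situs"]
            rec["assessed"] = parcels[0]["assessed"]
        elif not parcels:
            gaps["name_not_in_roll"] += 1
        else:
            gaps["ambiguous_owner"] += 1   # assumption: several parcels, so skip

    covered = sum(1 for rec in filings if rec.get("address"))
    coverage = covered / len(filings) if filings else 0.0
    return {"coverage": coverage, "gaps": dict(gaps)}


# Toy rows, invented for this example.
filings = [
    {"case": "A-1", "owner_norm": "MARTINEZ EVA M", "address": None},
    {"case": "A-2", "owner_norm": "", "address": None},
    {"case": "A-3", "owner_norm": "NGUYEN TRAN", "address": None},
    {"case": "A-4", "owner_norm": "SMITH JOHN", "address": "100 OAK ST"},
]
roll = [{"owner_norm": "MARTINEZ EVA M", "situs": "4215 MAPLE DR", "assessed": 250000}]

report = enrich(filings, roll)
print(report)  # {'coverage': 0.5, 'gaps': {'no_owner_name': 1, 'name_not_in_roll': 1}}
```

The report is the part worth keeping: it turns "couldn't match" into counted reasons you can publish next to the coverage number.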
Two free datasets. One join. The F-grade becomes a B, and with some geocoding on top, it reaches an A.
The data-quality game is won in the cross-reference, not the scrape. If you're building on public data — or evaluating someone else's — the newsletter is where I share exactly what the numbers look like before and after.