Parsing Every Weird CAD Format in Texas (Tilde-Delimited, 10K-Wide Fixed-Width, and the One That Was Just an Excel File)
Every Texas county publishes its property data. Almost none of them publish it the same way.
I've now loaded twelve counties into a single database — 8.1 million parcels — and the formats ranged from "clean spreadsheet with column headers" to "9,716-character fixed-width lines with values stored in cents and no documentation." Each one was a small puzzle, and a few were genuinely funny in their hostility toward the person trying to use them.
Here are the war stories, in order of increasing weirdness.
Level 1: The civilized counties
Cameron County (Brownsville) handed me an Excel file. Real columns. Real headers. Owner name, situs address, market value — all labeled, all structured. I loaded 222,000 parcels in about forty seconds. I almost didn't trust it because nothing had gone wrong.
Dallas County (DCAD) uses quoted CSV — ACCOUNT_INFO.CSV for ownership and addresses, ACCOUNT_APPRL_YEAR.CSV for values. Standard quoted fields, comma-delimited, headers in the first row. The only quirk: you have to join the two files on the account number to get a complete record. 860,000 parcels, clean.
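A two-file join like this can be sketched with the standard csv module. This is a minimal illustration, not the actual loader: the column names (ACCOUNT_NUM, OWNER_NAME, TOT_VAL) and the sample rows are assumptions, so check the real headers before using it.

```python
import csv
import io

def join_on_account(info_file, value_file, key="ACCOUNT_NUM"):
    """Merge rows from two header-bearing CSV streams on a shared key column."""
    # Index the value rows by account number so each file is read only once.
    values = {row[key]: row for row in csv.DictReader(value_file)}
    for row in csv.DictReader(info_file):
        # Accounts with no matching value row keep only their own columns.
        yield {**row, **values.get(row[key], {})}

# Tiny in-memory stand-ins for the two files (made-up data and column names).
info = io.StringIO('"ACCOUNT_NUM","OWNER_NAME"\n"001","DOE JOHN"\n"002","ROE JANE"\n')
vals = io.StringIO('"ACCOUNT_NUM","TOT_VAL"\n"001","250000"\n')
records = list(join_on_account(info, vals))
```

Holding one file in a dictionary costs memory but keeps the join to a single pass over each file.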
Collin County publishes a straightforward CSV download. 351,000 parcels. Nothing to report. It just works.
These are the counties that make you think "this isn't so hard." They're lying to you. They're the tutorial level before the game starts.
Level 2: Tab-delimited flat files (Harris County)
Harris County (HCAD) — 1.6 million parcels, the third-largest county in the country — publishes its data as tab-delimited flat files. Two key files: owners.txt and real_acct.txt.
Tab-delimited sounds simple, and it mostly is, except:
- The files are big. owners.txt is 1.9 million rows. real_acct.txt is 1.6 million. You can't casually open these in Excel.
- The situs address is split across three columns: site_addr_1 (street), site_addr_2 (city), site_addr_3 (zip). Concatenating them sounds trivial until you discover that site_addr_2 is blank for about 40% of records, and you need to decide whether to default to "HOUSTON" or leave it empty.
- You need to join owners.txt to real_acct.txt on the account number to get owner + address in one record.
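The blank-city decision is easier to live with if it is a parameter rather than a hard-coded choice. A rough sketch, assuming the files have a header row with the three site_addr columns; the acct column name and the sample rows are made up for illustration.

```python
import csv
import io

def situs_address(row, default_city=""):
    """Join the three situs columns, skipping any that end up blank."""
    street = row.get("site_addr_1", "").strip()
    city = row.get("site_addr_2", "").strip() or default_city
    zip_code = row.get("site_addr_3", "").strip()
    return " ".join(part for part in (street, city, zip_code) if part)

# In-memory stand-in for a tab-delimited file (hypothetical rows).
sample = io.StringIO(
    "acct\tsite_addr_1\tsite_addr_2\tsite_addr_3\n"
    "0001\t4215 MAPLE DR\tHOUSTON\t77002\n"
    "0002\t99 OAK ST\t\t77005\n"
)
# QUOTE_NONE keeps stray quote characters in the data from swallowing tabs.
rows = list(csv.DictReader(sample, delimiter="\t", quoting=csv.QUOTE_NONE))
```

Calling situs_address(row, default_city="HOUSTON") fills the gap, while the default leaves the city out entirely.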
The real lesson from Harris isn't the format — it's the size. When you're processing 1.6 million rows, every inefficiency in your parser multiplies. A regex that takes 1ms per row takes 27 minutes over the full file. I learned to normalize first, match second, and never iterate twice when once will do.
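The arithmetic behind that 27 minutes, plus one way to keep per-row cost down, might look like this. It is only a sketch: the cleanup shown (strip, uppercase, collapse whitespace) is a stand-in for whatever per-row work a real parser does.

```python
import re

ROWS = 1_600_000

def total_minutes(ms_per_row, rows=ROWS):
    """Wall-clock minutes for a fixed per-row cost applied to every row."""
    return ms_per_row * rows / 1000 / 60

# Compile the pattern once, outside the loop, and do all per-row work in one pass.
WHITESPACE = re.compile(r"\s+")

def clean_rows(lines):
    """Yield each line stripped, uppercased, and with whitespace runs collapsed."""
    for line in lines:
        yield WHITESPACE.sub(" ", line.strip().upper())
```

Because clean_rows is a generator, it can be fed a file object directly and never holds the full 1.6 million rows in memory.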
Level 3: Fixed-width (Travis County)
Travis County (TCAD) is where the fun starts. Their bulk export is a file called PROP.TXT — 528MB zipped, 17.7 gigabytes uncompressed — and it has no delimiters at all. No tabs, no commas, no pipes. Just one long unbroken string per line, exactly the same width, every line.
This is a fixed-width format. Each field occupies a specific byte range. Owner name might be bytes 608 through 647. Situs street number might be bytes 1038 through 1043. The only way to know where one field ends and another begins is the layout document — a PDF from the Texas Department of Community Affairs that maps every field position.
The process:
- Open the layout PDF.
- Find the field you want (say, "owner name").
- Note the start position (608) and length (40).
- Write code that slices each line at line[607:647] (zero-indexed).
- Repeat for every field you need.
- Pray you didn't miscount a byte.
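The steps above can be sketched as a small table-driven slicer. The two positions are the illustrative ones mentioned earlier (owner name at 608 for 40, situs street number at 1038 for 6), not a verified layout, and the field names are made up.

```python
# (name, 1-based start position, length), the way a layout document lists them.
# Both entries are illustrative values from the text, not a verified layout.
LAYOUT = [
    ("owner_name", 608, 40),
    ("situs_num", 1038, 6),
]

def parse_fixed_width(line, layout=LAYOUT):
    """Slice one fixed-width line into a dict of stripped field values."""
    record = {}
    for name, start, length in layout:
        # Layouts count from 1; Python slices count from 0 and exclude the end.
        record[name] = line[start - 1 : start - 1 + length].strip()
    return record

# Synthetic line with known values planted at the expected offsets.
line = (
    " " * 607
    + "MARTINEZ EVA".ljust(40)
    + " " * (1037 - 647)
    + "4215".ljust(6)
    + " " * 50
)
```

Keeping the start positions 1-based in the table means they can be copied straight from the layout document, with the subtraction done in exactly one place. If the layout counts bytes, opening the file with encoding="latin-1" keeps byte positions and character positions identical.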
I miscounted a byte on the first try. The situs city field started bleeding into the zip code field, and I spent twenty minutes wondering why half of Austin's properties were in a city called "AUSTIN787" before I found the off-by-one.
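A cheap check on the first few thousand parsed records would catch this kind of bleed before a full load. A minimal sketch, assuming each record is a dict with a situs_city key (a hypothetical field name):

```python
import re

def looks_bled(city):
    """True when a city value contains digits, a hint that the offsets are off."""
    return bool(re.search(r"\d", city))

def bleed_rate(records, field="situs_city"):
    """Fraction of records whose field fails the digit check."""
    records = list(records)
    if not records:
        return 0.0
    return sum(looks_bled(r[field]) for r in records) / len(records)
```

A rate anywhere near one half, as in the "AUSTIN787" case, points at a layout mistake rather than dirty data.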
492,000 total records. After filtering out personal property, mineral rights, and utility accounts: 422,000 real-property parcels with owner, address, and value.
Level 4: Extreme fixed-width (Galveston County)
If Travis was fixed-width, Galveston is fixed-width turned up to eleven. Their bulk export uses the TDCA APPRAISAL_INFO format where each line is 9,716 characters wide.
Let that sink in. A single line of data — one property — is nearly ten thousand characters. Open the file in a text editor and each "row" wraps across your entire screen several times over.
The field positions are spread across this enormous line width:
- Account number: somewhere near the front
- GeoID (the geographic identifier): position varies
- Owner name: starts at position 608
- Situs street address: position 1038
- Situs city: position 1100
- Market value: position 1920
And the market value is stored in cents. Not dollars. Cents. So a property worth $250,000 appears in the file as 00000025000000. Miss that one detail in the layout document and every property in the county appears to be worth a hundred times its actual value.
I built a custom parser that maps twelve field positions, slices each 9,716-character line at the right offsets, divides the value by 100, and strips the padding. 209,924 parcels. It works perfectly now, and I hope I never have to touch it again.
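The cents conversion and a line-width guard can be sketched as follows. This is not the production parser: position 1920 comes from the list above, but the 14-character field length is an assumption inferred from the example value.

```python
from decimal import Decimal

LINE_WIDTH = 9716  # characters per record, excluding the line ending

def check_width(line):
    """True if the line has the expected fixed width."""
    return len(line.rstrip("\r\n")) == LINE_WIDTH

def cents_to_dollars(raw):
    """Convert a zero-padded cents field to a Decimal dollar amount."""
    digits = raw.strip()
    if not digits:
        return None  # blank field: no value on record
    return Decimal(int(digits)) / 100

def parse_market_value(line, start=1920, length=14):
    """Slice the market value (1-based start; length is an assumption)."""
    return cents_to_dollars(line[start - 1 : start - 1 + length])

# Synthetic 9,716-character line with $250,000 stored as cents.
line = " " * 1919 + "00000025000000" + " " * (LINE_WIDTH - 1919 - 14)
```

Decimal avoids floating-point rounding on currency, and rejecting any line that fails check_width catches truncated or corrupted records before they are sliced.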
Level 5: The tilde file (El Paso County)
El Paso is the format that made me stop and laugh.
Their bulk data export uses tildes as delimiters. Not tabs. Not commas. Not pipes. Tildes. The ~ character. I've been writing parsers for twenty years and I've never seen tildes used as a field separator in production data.
But it gets better. The data isn't in one file — it's in three:
- Properties: parcel IDs, legal descriptions
- Owners: owner names, mailing addresses
- Values: appraised values, land values, improvement values
Each file is tilde-delimited, and they're all keyed on a Property_dbId field. To get one complete parcel record, you have to join all three files on that ID — essentially reconstructing the SQL Server database they exported from.
The process: parse each tilde-delimited file into a dictionary keyed on Property_dbId, then merge the three dictionaries into a single record per parcel. 452,000 parcels total.
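That dictionary merge might look roughly like this. It assumes each file has a header row and at most one row per Property_dbId (a parcel with several owners would need a list instead); every column name other than Property_dbId is invented for the example.

```python
import csv
import io

def load_keyed(f, key="Property_dbId"):
    """Read a tilde-delimited stream into a dict keyed on the ID column."""
    reader = csv.DictReader(f, delimiter="~", quoting=csv.QUOTE_NONE)
    return {row[key]: row for row in reader}

def merge_parcels(properties, owners, values):
    """Left-join owners and values onto properties by ID."""
    merged = {}
    for pid, prop in properties.items():
        merged[pid] = {**prop, **owners.get(pid, {}), **values.get(pid, {})}
    return merged

# In-memory stand-ins for the three files (made-up columns and rows).
props = io.StringIO("Property_dbId~Legal\n1~LOT 4 BLK 2\n2~LOT 9 BLK 1\n")
owners = io.StringIO("Property_dbId~OwnerName\n1~MARTINEZ EVA\n")
values = io.StringIO("Property_dbId~Appraised\n1~250000\n2~90000\n")
parcels = merge_parcels(load_keyed(props), load_keyed(owners), load_keyed(values))
```

Parcels missing from the Owners or Values file still come through, just with fewer fields, which makes gaps easy to count afterward.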
The first bulk load hit a server error at 360,000 records. The Supabase API returned a 500, the batch died, and I was looking at 360K loaded and 92K orphaned. The fix was row-level retry with idempotent upserts: each record uses ON CONFLICT DO UPDATE, so re-running the load from the start harmlessly rewrites what's already there and fills in the rest. The full 452,000 loaded clean on the second pass.
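The retry-plus-upsert idea can be demonstrated with the standard library. This sketch swaps in an in-memory SQLite database as a stand-in for Supabase/Postgres (SQLite accepts the same ON CONFLICT ... DO UPDATE clause); the table and column names are hypothetical.

```python
import sqlite3
import time

UPSERT = """
INSERT INTO parcels (parcel_id, owner, market_value)
VALUES (?, ?, ?)
ON CONFLICT(parcel_id) DO UPDATE SET
    owner = excluded.owner,
    market_value = excluded.market_value
"""

def upsert_with_retry(conn, row, attempts=3, delay=0.0):
    """Upsert one row, retrying on operational errors with a growing delay."""
    for attempt in range(1, attempts + 1):
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(UPSERT, row)
            return
        except sqlite3.OperationalError:
            if attempt == attempts:
                raise
            time.sleep(delay * attempt)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE parcels (parcel_id TEXT PRIMARY KEY, owner TEXT, market_value REAL)"
)
rows = [("1", "MARTINEZ EVA", 250000.0), ("2", "DOE JOHN", 90000.0)]
for _ in range(2):  # run the whole load twice to show it is idempotent
    for row in rows:
        upsert_with_retry(conn, row)
count = conn.execute("SELECT COUNT(*) FROM parcels").fetchone()[0]
```

Running the load twice still leaves two rows, which is the property that makes "just start over" a safe recovery strategy.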
Level 6: Nueces County (surprise format)
Nueces County (Corpus Christi) publishes a free download from nuecescad.net. The file uses yet another fixed-width format — PACS, not TDCA — with its own field positions for everything. By this point I had the pattern down: find the layout, map the positions, write the slicer. 219,000 parcels. Twenty minutes of work. The repetition pays off.
What I learned (the meta-lesson)
After twelve counties and twelve formats, here's what compounds:
The first county is the hardest. Travis took me a full day: understanding fixed-width, debugging off-by-one errors, building the pipeline. County twelve (Cameron, the spreadsheet) took forty minutes. Not only because Cameron was easier (it was), but because by that point I'd already solved every variant of the core problem.
The formats are hostile, but not random. Most Texas counties use one of three underlying systems: TDCA standard export, TrueAutomation/Prodigy, or their own legacy format. Once you've parsed one TDCA county, the next one is mostly the same — different field positions, same structure.
Normalize everything, early. Owner names come in as "MARTINEZ, EVA M." or "Eva Martinez" or "MARTINEZ EVA & ROBERTO" depending on the county. Street addresses come as "4215 MAPLE DR" or "4215 Maple Drive" or "4215 MAPLE DR." The normalization step (uppercase, strip suffixes, collapse whitespace, standardize abbreviations) needs to happen before you do anything else. I wrote the normalizer once and it's the same fifteen-line function across all twelve counties.
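A normalizer in that spirit could look like the sketch below. It is not the fifteen-line function described above, just an illustration of the same steps, and the abbreviation table is a small made-up sample.

```python
import re

# A small illustrative abbreviation table; a real one would be much longer.
ABBREVIATIONS = {"DRIVE": "DR", "STREET": "ST", "AVENUE": "AVE", "ROAD": "RD"}

def normalize(text):
    """Uppercase, drop periods and commas, collapse whitespace, abbreviate."""
    text = text.upper()
    text = re.sub(r"[.,]", " ", text)  # "DR." -> "DR ", "MARTINEZ," -> "MARTINEZ "
    words = text.split()               # split() also collapses runs of whitespace
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)
```

With this, "4215 Maple Drive" and "4215 MAPLE DR." both come out as "4215 MAPLE DR". It does not reorder names, so "Eva Martinez" and "MARTINEZ, EVA M." would still need a separate matching step.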
The moat is in the ugly middle. Anyone can download the Harris County file — it's sitting right there on hcad.org. Very few people will write the parser for El Paso's tilde-delimited three-file join, or Galveston's 9,716-character lines with values in cents. The willingness to deal with the ugly formats is the barrier to entry, and it's a barrier that doesn't go away — because the counties aren't going to redesign their exports for you.
If parsing tilde-delimited property data sounds like your idea of a good time, the newsletter is where I share the next county's war story — and the playbooks for turning these files into revenue.
Everything in this blog is built on the same data that powers Texas Signals: 8M+ property records, pre-foreclosures, tax delinquencies, and distress signals across 12+ Texas counties. Free 7-day trial.