The data model changed twice before we shipped anything. Week 1.
Living Archive is a RAG system for community archives, and the pilot is giving us the first real test: a collection of oral histories, photographs, event records, and handwritten notes spanning forty years, held by a community organisation with no technical infrastructure and very limited capacity. Forty years of a community's memory sitting in boxes and folders, and we're trying to make it findable without flattening it.
The question we kept hitting was deceptively simple. How do you structure documents for retrieval without losing the relational context that makes the archive meaningful? A transcript is not just a transcript. It connects to a person, a date, a neighbourhood, a set of other documents. The first data model treated everything as a flat chunk. The second tried to nest relationships explicitly. Both were wrong in different ways, and I'm still sitting with why.
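The tension between the two models is easiest to see as data shapes. A minimal sketch, with illustrative field names rather than the actual schemas:

```python
from dataclasses import dataclass, field
from typing import Optional

# Shape 1: everything is a flat chunk. Retrieval is simple, but the
# chunk no longer knows it came from an interview with a person,
# about a neighbourhood, in a particular year.
@dataclass
class FlatChunk:
    chunk_id: str
    text: str
    source_doc: str

# Shape 2: relationships nested explicitly into the record. Every
# chunk now drags its full context along, the schema hardens around
# one archive's categories, and a second archive that models
# "person" or "place" differently breaks the pipeline.
@dataclass
class Person:
    name: str

@dataclass
class NestedChunk:
    chunk_id: str
    text: str
    source_doc: str
    speaker: Optional[Person] = None
    date: Optional[str] = None          # ISO date of the recording or event
    neighbourhood: Optional[str] = None
    related_docs: list[str] = field(default_factory=list)
```

Neither shape is the answer; the sketch just makes the trade-off concrete. The flat model loses the relations, and the nested model bakes one community's categories into the type system.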
The ingestion pipeline is clean, which is the good news. We can take a PDF scan, OCR it, extract structure, and get it into the vector store without manual intervention for most document types. Downstream is where it gets interesting. Chunking strategies that preserve semantic coherence across handwritten documents? Not solved. Metadata schemas that work for one community archive break on the next one. We thought we'd have a queryable prototype by end of week. We have a working ingestion layer and a much better understanding of what we're actually building.
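To make "chunking that preserves semantic coherence" concrete, here is one simple strategy of the kind we're experimenting with: split on paragraph boundaries first, then pack paragraphs into chunks up to a size budget rather than cutting at arbitrary character offsets. This is a sketch of the general technique, not our production chunker, and it says nothing about the genuinely hard case of handwritten material.

```python
def chunk_paragraphs(text: str, max_chars: int = 800) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_chars.

    A paragraph is never split across two chunks; a paragraph that
    alone exceeds the budget is emitted as its own oversized chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        if len(para) > max_chars:
            # Oversized paragraph: flush what we have, emit it alone.
            if current:
                chunks.append(current)
                current = ""
            chunks.append(para)
        elif len(current) + len(para) + 2 <= max_chars:
            # Fits in the current chunk (+2 for the paragraph break).
            current = f"{current}\n\n{para}" if current else para
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

The design choice worth noting is that the unit of coherence is the paragraph, which OCR on typed documents gives you for free; handwritten notes often don't have recoverable paragraph structure at all, which is exactly why the problem stays open.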
I'm writing this series because the public record of building software usually starts after the messy part. You see the demo, not the two data model rewrites. This is the two rewrites. If you're building something similar, whether that's community archival tech, RAG over unstructured historical records, or anything that requires domain knowledge to chunk sensibly, these notes will be specific enough to be useful. Or at least honest enough to save you one of the wrong turns.