Privacy-Aware Corpus Intelligence Pipeline
Privacy-safe intelligence from large text archives.
A local-first corpus analysis platform that separates personal, regulated, and review-worthy conversational records from public-safe knowledge candidates. The project focuses on cost-conscious privacy screening, multi-model validation, and auditable data products without sending the full archive through bulk LLM inference.
The system ingests a large exported conversation corpus, normalizes it into classification units, applies deterministic privacy policy gates, cross-checks the result with independent non-LLM detectors, and writes JSON plus Markdown evidence for review.
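The stage ordering above can be sketched as a tiny orchestration loop. Every name below (`Unit`, `normalize_units`, `deterministic_gate`, `run_pipeline`) is an illustrative assumption, not the project's actual API, and the gate shown is a deliberately toy identifier check:

```python
import json
from dataclasses import dataclass

# Hypothetical names throughout; the real pipeline's module layout is not shown here.
@dataclass
class Unit:
    unit_id: str
    title: str
    text: str

def normalize_units(raw_records):
    """Normalize exported records into classification units."""
    return [Unit(r["id"], r.get("title", ""), r["text"]) for r in raw_records]

def deterministic_gate(unit):
    """Cheap deterministic first pass: an obvious identifier pattern forces 'private'."""
    return "private" if "@" in unit.text else "public"

def run_pipeline(raw_records):
    """Gate every unit and emit a JSON evidence artifact for review."""
    results = [
        {"unit_id": u.unit_id, "label": deterministic_gate(u)}
        for u in normalize_units(raw_records)
    ]
    return json.dumps(results)

out = run_pipeline([
    {"id": "u1", "text": "contact me at a@b.com"},
    {"id": "u2", "text": "exponential backoff with jitter"},
])
```

The real gates are far richer (regex banks, domain taxonomies, NER), but the contract is the same: deterministic input, labeled output, serialized evidence.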
What inspired this build
We often hear that conversations may be recorded for audit, quality, or training purposes. I wanted to go behind the curtain and build the data engineering version of that problem: how do you remove identity, sensitive context, and private identifiers from a large text archive while still preserving useful knowledge? This project became a real-time privacy filtering pipeline that validates multiple non-LLM approaches and chooses an ensemble route that is practical, inspectable, and affordable in 2026.
Platform Overview
The platform was designed for the uncomfortable middle ground between raw text archives and publishable or reusable knowledge. The objective is not to summarize everything. The objective is to build a governed filter that can identify private identity markers, sensitive domains, and ambiguous records before any downstream writing, analytics, indexing, or knowledge-base process touches the content.
Dataset Information
The source material is described only as a large exported conversational corpus. The page intentionally does not disclose personal topics, raw excerpts, names, account details, or source-file identifiers. All public examples below are synthetic but shaped like the real pipeline contracts.
End-to-End Data Flow
The flow is built to avoid token burn. Cheap local passes do most of the work. Expensive semantic review is reserved for disagreement rows, review queues, or manually selected candidates.
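That escalation rule can be sketched as a small routing function. The function name, labels, and precedence are assumptions based on the description (escalate on disagreement, explicit review votes, or manual selection; never spend tokens on a unanimous cheap verdict):

```python
def route_unit(cheap_labels, manually_flagged=False):
    """Decide whether a unit needs expensive semantic review.

    cheap_labels: verdicts from the local, non-LLM passes for one unit.
    Hypothetical sketch; the project's actual routing logic is not shown here.
    """
    if manually_flagged:
        return "semantic_review"          # manually selected candidates always escalate
    if "review" in cheap_labels:
        return "semantic_review"          # a cheap pass itself asked for review
    if len(set(cheap_labels)) > 1:
        return "semantic_review"          # detectors disagree
    return cheap_labels[0]                # unanimous cheap verdict: no token spend
```

Under this rule the expensive model only ever sees the disagreement rows, which is where most of the cost savings come from.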
Live Preview of the Classification Passes
This section shows generic source payloads and curated outputs using the same language as the implementation. The examples are harmless and synthetic: a public technical note, an identity-heavy private note, and a borderline review item.
def classify_unit(unit):
    identifier_hits = pii_detector.scan(unit.text)
    sensitive_hits = domain_policy.scan(unit.title, unit.text)
    topic_scores = topic_classifier.score(unit.text)
    if identifier_hits.hard_private or sensitive_hits.private_domain:
        return Decision(label="private", reason=sensitive_hits.primary_reason)
    if topic_scores.best_score < MIN_PUBLIC_SIGNAL:
        return Decision(label="skip", reason="low_signal")
    return Decision(label="public", topic=topic_scores.best_topic)
def strict_detector(unit):
    direct_identifiers = regex_bank.find_identifiers(unit.text)
    sensitive_domains = strict_taxonomy.find_private_context(unit.text)
    if direct_identifiers or sensitive_domains.high_confidence:
        return "private"
    if sensitive_domains.partial or mixed_private_public(unit):
        return "review"
    return "public"
def semantic_score_classifier(unit):
    private_score = weighted_terms(unit.text, private_vocab)
    public_score = weighted_terms(unit.text, public_vocab)
    margin = abs(public_score - private_score)
    if private_score >= HARD_PRIVATE_SCORE:
        return "private"
    if margin < REVIEW_MARGIN:
        return "review"
    return "public" if public_score > private_score else "private"
def advanced_non_llm_pass(unit):
    presidio_entities = presidio_analyzer.analyze(text=unit.text, language="en")
    spacy_entities = nlp(unit.text).ents
    custom_hits = custom_recognizers.scan(unit.text)
    return AdvancedSignals(
        hard_private=has_account_email_phone_secret(presidio_entities, custom_hits),
        review_signals=repeated_person_location_or_org(spacy_entities),
        explanation=build_evidence_trace(presidio_entities, spacy_entities, custom_hits),
    )
Schema and Source Contract
Classification Unit Contract
Classifier Result Contract
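The contracts themselves are not reproduced on this page. The sketch below shows one plausible shape for each, with synthetic values; every field name here is an assumption, not the project's actual schema:

```python
# Hypothetical contract shapes; field names are illustrative, not the real schema.
classification_unit = {
    "unit_id": "synthetic-0001",
    "title": "Notes on retry backoff",
    "text": "Exponential backoff with jitter avoids thundering herds.",
    "source_hash": "sha256:<redacted>",  # source-file identifiers are never exposed directly
}

classifier_result = {
    "unit_id": "synthetic-0001",
    "label": "public",                   # one of: public | private | review | skip
    "reason": "technical_topic",
    "classifier": "policy_v1",
    "evidence": ["no identifier hits", "topic score above MIN_PUBLIC_SIGNAL"],
}
```

Keeping the unit and result contracts separate lets every classifier in the comparison write results against the same unit without touching the source payload.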
Model Comparison and Ensemble Results
The system does not trust one classifier blindly. It compares multiple independent views of the same corpus: a production policy classifier, a strict privacy detector, a semantic scoring classifier, Presidio, spaCy NER, and a final ensemble route. The large full-corpus pass establishes the production baseline. The advanced NLP sample adds a free, state-of-the-art non-LLM validation layer.
Agreement Snapshot
Why the ensemble works
The policy classifier enforces the product rule: private domains and identifiers override usefulness. The strict detector challenges that decision with a narrower safety lens. The semantic classifier catches topic intent. Presidio and spaCy provide independent entity recognition. The ensemble protects the corpus by preserving clear public candidates, excluding high-confidence private records, and routing ambiguous items to review.
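A minimal sketch of how such an ensemble route could combine the independent verdicts. The precedence below is an assumption inferred from the description (private overrides, disagreement routes to review), not the project's exact voting rule:

```python
def ensemble_route(policy, strict, semantic, advanced_hard_private=False):
    """Combine independent classifier verdicts into one route (hypothetical sketch).

    Assumed precedence:
    1. Any high-confidence private signal excludes the record.
    2. Any explicit 'review' vote or disagreement goes to the review queue.
    3. Only a unanimous 'public' verdict survives as a public candidate.
    """
    votes = [policy, strict, semantic]
    if advanced_hard_private or "private" in votes:
        return "private"
    if "review" in votes or len(set(votes)) > 1:
        return "review"
    return "public"
```

Because exclusion takes precedence over publication, a single detector can veto a record but no single detector can publish one.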
Quality and Privacy Gates
The pipeline treats privacy as a data quality problem. Every unit receives evidence, not just a label. That evidence can be audited, sampled, validated, and improved without exposing the raw corpus.
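One way to make "evidence, not just a label" concrete is a per-unit audit record. The structure below is a hypothetical sketch of such a record, not the project's actual output format:

```python
import json

def build_audit_record(unit_id, label, signals):
    """Attach auditable, per-detector evidence to a decision (hypothetical shape).

    signals: list of (detector_name, finding) pairs from the independent passes.
    """
    return {
        "unit_id": unit_id,
        "label": label,
        "evidence": [
            {"detector": name, "finding": finding} for name, finding in signals
        ],
    }

record = build_audit_record(
    "synthetic-0002",
    "review",
    [("strict_detector", "partial sensitive-domain match"),
     ("semantic_score", "margin below REVIEW_MARGIN")],
)
print(json.dumps(record, indent=2))
```

Sampling these records is how the gates get validated and improved without anyone re-reading the raw corpus.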
Engineering Toolchain
Implemented Stack
Expansion Path
Project Knowledge Bank
The notes below document the engineering thinking behind the project: how the privacy filter is shaped, why the classifiers are separated, how validation works, and how the platform can mature into a larger corpus governance system.