Privacy-Aware Corpus Intelligence Pipeline

Privacy-safe intelligence from large text archives.

A local-first corpus analysis platform that separates personal, regulated, and review-worthy conversational records from public-safe knowledge candidates. The project focuses on cost-conscious privacy screening, multi-model validation, and auditable data products without sending the full archive through bulk LLM inference.

Local privacy pipeline implemented

The system ingests a large exported conversation corpus, normalizes it into classification units, applies deterministic privacy policy gates, cross-checks the result with independent non-LLM detectors, and writes JSON plus Markdown evidence for review.
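That flow can be sketched as a small driver loop. The stage functions (`normalize`, `policy_gate`, `cross_check`) and field names here are hypothetical stand-ins, not the project's actual API:

```python
import json
from pathlib import Path

def run_pipeline(export_path, out_dir, normalize, policy_gate, cross_check):
    """Ingest -> normalize -> deterministic policy gate -> detector
    cross-check -> JSON + Markdown evidence files for review."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    results = []
    for unit in normalize(Path(export_path).read_text()):
        decision = policy_gate(unit)        # deterministic privacy rules
        checks = cross_check(unit)          # independent non-LLM detectors
        results.append({"unit_id": unit["id"], "label": decision,
                        "cross_checks": checks})
    (out / "decisions.json").write_text(json.dumps(results, indent=2))
    # Markdown evidence trail for human reviewers
    lines = [f"- {r['unit_id']} -> {r['label']} ({r['cross_checks']})"
             for r in results]
    (out / "evidence.md").write_text("\n".join(lines))
    return results
```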

What inspired this build

We often hear that conversations may be recorded for audit, quality, or training purposes. I wanted to go behind the curtain and build the data engineering version of that problem: how do you remove identity, sensitive context, and private identifiers from a large text archive while still preserving useful knowledge? This project became a real-time privacy filtering pipeline that validates multiple non-LLM approaches and chooses an ensemble route that is practical, inspectable, and affordable in 2026.

Platform Overview

The platform was designed for the uncomfortable middle ground between raw text archives and publishable or reusable knowledge. The objective is not to summarize everything. The objective is to build a governed filter that can identify private identity markers, sensitive domains, and ambiguous records before any downstream writing, analytics, indexing, or knowledge-base process touches the content.

Dataset Information

The source material is described only as a large exported conversational corpus. The page intentionally does not disclose personal topics, raw excerpts, names, account details, or source-file identifiers. All public examples below are synthetic but shaped like the real pipeline contracts.

End-to-End Data Flow

The flow is built to avoid token burn. Cheap local passes do most of the work. Expensive semantic review is reserved for disagreement rows, review queues, or manually selected candidates.
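One plausible shape for that routing, shown as a sketch (the function names and label vocabulary are assumptions, not the project's code): run every cheap local pass, and only spend on an expensive semantic review when the cheap passes disagree or flag the row.

```python
def route_unit(unit, cheap_passes, expensive_review):
    """Run all cheap local passes; escalate only on disagreement.

    `cheap_passes` is a list of callables returning "public",
    "private", or "review"; `expensive_review` (e.g. an LLM call)
    runs only on disagreement rows, keeping token spend low.
    """
    votes = {fn.__name__: fn(unit) for fn in cheap_passes}
    labels = set(votes.values())
    if len(labels) == 1 and "review" not in labels:
        return labels.pop(), votes               # unanimous cheap verdict
    return expensive_review(unit, votes), votes  # disagreement: escalate
```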

Live Preview of the Classification Passes

This section shows generic source payloads and curated outputs using the same language as the implementation. The examples are harmless and synthetic: a public technical note, an identity-heavy private note, and a borderline review item.

# Pass 1: production policy classifier (deterministic privacy gates)
def classify_unit(unit):
    identifier_hits = pii_detector.scan(unit.text)
    sensitive_hits = domain_policy.scan(unit.title, unit.text)
    topic_scores = topic_classifier.score(unit.text)

    # Privacy overrides usefulness: hard identifiers or private domains win.
    if identifier_hits.hard_private or sensitive_hits.private_domain:
        return Decision(label="private", reason=sensitive_hits.primary_reason)
    if topic_scores.best_score < MIN_PUBLIC_SIGNAL:
        return Decision(label="skip", reason="low_signal")
    return Decision(label="public", topic=topic_scores.best_topic)


# Pass 2: strict privacy detector (narrower safety lens)
def strict_detector(unit):
    direct_identifiers = regex_bank.find_identifiers(unit.text)
    sensitive_domains = strict_taxonomy.find_private_context(unit.text)

    if direct_identifiers or sensitive_domains.high_confidence:
        return "private"
    if sensitive_domains.partial or mixed_private_public(unit):
        return "review"
    return "public"


# Pass 3: semantic scoring classifier (topic intent via weighted vocabularies)
def semantic_score_classifier(unit):
    private_score = weighted_terms(unit.text, private_vocab)
    public_score = weighted_terms(unit.text, public_vocab)
    margin = abs(public_score - private_score)

    if private_score >= HARD_PRIVATE_SCORE:
        return "private"
    if margin < REVIEW_MARGIN:
        return "review"
    return "public" if public_score > private_score else "private"


# Pass 4: advanced non-LLM validation (Presidio + spaCy NER + custom recognizers)
def advanced_non_llm_pass(unit):
    presidio_entities = presidio_analyzer.analyze(text=unit.text, language="en")
    spacy_entities = nlp(unit.text).ents
    custom_hits = custom_recognizers.scan(unit.text)

    return AdvancedSignals(
        hard_private=has_account_email_phone_secret(presidio_entities, custom_hits),
        review_signals=repeated_person_location_or_org(spacy_entities),
        explanation=build_evidence_trace(presidio_entities, spacy_entities, custom_hits),
    )

Schema and Source Contract

Classification Unit Contract



Classifier Result Contract


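The published contract tables are not reproduced here, but the preview code implies roughly the following shapes. Every field name below is an assumption inferred from the synthetic examples, not the actual schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ClassificationUnit:
    """One normalized record from the exported corpus (assumed fields)."""
    unit_id: str
    title: str
    text: str

@dataclass
class ClassifierResult:
    """Decision plus auditable evidence for one unit (assumed fields)."""
    unit_id: str
    label: str                    # "public" | "private" | "skip" | "review"
    reason: Optional[str] = None  # e.g. "low_signal" or a policy rule id
    topic: Optional[str] = None   # best public topic when label == "public"
    evidence: list = field(default_factory=list)
```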

Model Comparison and Ensemble Results

The system does not trust one classifier blindly. It compares multiple independent views of the same corpus: a production policy classifier, a strict privacy detector, a semantic scoring classifier, Presidio, spaCy NER, and a final ensemble route. The large full-corpus pass establishes the production baseline. The advanced NLP sample adds a free, state-of-the-art non-LLM validation layer.

Agreement Snapshot

Why the ensemble works

The policy classifier enforces the product rule: private domains and identifiers override usefulness. The strict detector challenges that decision with a narrower safety lens. The semantic classifier catches topic intent. Presidio and spaCy provide independent entity recognition. The ensemble protects the corpus by preserving clear public candidates, excluding high-confidence private records, and routing ambiguous items to review.
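One plausible combination rule consistent with that description, written as a sketch rather than the project's exact logic: any confident private signal vetoes publication, unanimous public agreement passes, and everything else lands in the review queue.

```python
def ensemble_route(policy_label, strict_label, semantic_label):
    """Combine three independent labels into one route.

    A simplified majority-with-veto rule: 'private' from any
    classifier wins, unanimous 'public' passes, the rest is reviewed.
    """
    labels = [policy_label, strict_label, semantic_label]
    if "private" in labels:
        return "private"   # privacy overrides usefulness
    if all(label == "public" for label in labels):
        return "public"    # clear public candidate preserved
    return "review"        # ambiguous: route to human review
```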

Quality and Privacy Gates

The pipeline treats privacy as a data quality problem. Every unit receives evidence, not just a label. That evidence can be audited, sampled, validated, and improved without exposing the raw corpus.
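An evidence record in that spirit might look like the sketch below (a hypothetical shape, not the project's format): each label carries the detector hits that produced it, recorded as entity types and offsets only, so auditors can sample decisions without reading the raw corpus.

```python
def build_evidence(unit_id, detector_hits):
    """Attach evidence spans (detector name, entity type, offsets)
    so a label can be audited without exposing the underlying text.

    `detector_hits` is an iterable of (detector, entity_type, start, end).
    """
    return {
        "unit_id": unit_id,
        "evidence": [
            {"detector": d, "entity_type": t, "start": s, "end": e}
            for (d, t, s, e) in detector_hits
        ],
    }
```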

Engineering Toolchain

Implemented Stack

Expansion Path

Project Knowledge Bank

The notes below document the engineering thinking behind the project: how the privacy filter is shaped, why the classifiers are separated, how validation works, and how the platform can mature into a larger corpus governance system.