Personal Health Lakehouse

Local health telemetry, governed like a warehouse.

A privacy-first local data platform that ingests Apple Health export ZIP baselines and ongoing Health Auto Export JSON updates into PostgreSQL medallion layers. The system preserves raw evidence, tracks lineage, deduplicates repeated payloads, validates quality, and publishes BI/agent-facing marts without moving sensitive health data into hosted analytics systems.

Local-first health data platform

The architecture treats raw files as evidence and transformed tables as rebuildable products. ZIP exports provide complete historical baselines; Health Auto Export JSON provides freshness; SQL transforms, quality checks, and local monitoring hooks make the pipeline inspectable without depending on a cloud warehouse.

5 warehouse schemas
2 ingest paths
0 cloud warehouse dependency

System Architecture

The project solves a practical data engineering problem: Apple Health exports are complete but manual, while app-based syncs are fresher but can be incomplete or delayed. The lakehouse combines both sources into one local governed platform with immutable raw archives, deterministic hashing, medallion transforms, and monitoring.

Apple Health export.zip ─┐
                         ├─ landing folder ─ stable-file wait ─ immutable archive ─ bronze
Health Auto Export JSON ─┘
Health Auto Export REST ─ FastAPI intake ─ spool ─ queue ─ worker ─ bronze

bronze raw evidence → silver conformed facts → gold rollups → health_mart BI + agent views

Ingestion Design

Baseline ZIP intake

Full Apple Health exports are copied into an immutable archive before parsing. The worker waits for file stability, fingerprints the payload with SHA-256, registers lineage in metadata tables, parses large XML exports, and preserves unknown fields or raw payload fragments for rebuildability.
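
The stable-file wait and SHA-256 fingerprint described above can be sketched as follows. This is a minimal illustration, not the repository's code: the function names, the poll interval, and the "unchanged size and mtime for N polls" rule are assumptions chosen for the example.

```python
import hashlib
import time
from pathlib import Path


def wait_until_stable(path: Path, checks: int = 3, interval: float = 1.0,
                      timeout: float = 300.0) -> bool:
    """Return True once size and mtime stay unchanged for `checks` polls in a row.

    Hypothetical helper: the stability rule is an assumption for this sketch.
    """
    deadline = time.monotonic() + timeout
    last = None
    unchanged = 0
    while time.monotonic() < deadline:
        stat = path.stat()
        current = (stat.st_size, stat.st_mtime_ns)
        if current == last:
            unchanged += 1
            if unchanged >= checks:
                return True
        else:
            # The file changed (or this is the first poll): restart the count.
            unchanged = 0
            last = current
        time.sleep(interval)
    return False


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so a large export never has to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A worker would call wait_until_stable first and fingerprint the file only once it returns True, so a half-copied export is never hashed or archived.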

Ongoing JSON intake

Health Auto Export JSON can arrive by folder drop or authenticated REST POST. The pipeline spools inbound payloads, deduplicates repeated app sends, and supports rolling overlap so late or edited health samples can be safely absorbed without double counting.
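
One way to absorb overlapping windows without double counting is to upsert each sample under a natural key, so a re-sent or edited sample replaces the earlier row instead of adding a second one. The sketch below assumes a key of metric, start, end, and source; the source does not state how the pipeline identifies a sample, and the Sample type and absorb function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    metric: str
    start: str   # ISO 8601 timestamp
    end: str     # ISO 8601 timestamp
    source: str
    value: float


def natural_key(sample: Sample) -> tuple:
    # Assumed identity of a sample; the real pipeline may key differently.
    return (sample.metric, sample.start, sample.end, sample.source)


def absorb(store: dict, batch: list[Sample]) -> dict[str, int]:
    """Upsert a batch into `store` and report what happened to each sample."""
    counts = {"inserted": 0, "updated": 0, "unchanged": 0}
    for sample in batch:
        key = natural_key(sample)
        existing = store.get(key)
        if existing is None:
            counts["inserted"] += 1
        elif existing.value != sample.value:
            counts["updated"] += 1      # late edit replaces the earlier value
        else:
            counts["unchanged"] += 1    # plain resend, nothing to do
        store[key] = sample
    return counts
```

With this shape, sending a three-day window every day is safe: totals are computed from the store, which holds one row per key regardless of how many times a sample arrived. An edit that changes a sample's start or end would produce a new key, so a real implementation would need a rule for retiring the old row.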

Warehouse Layers

health_meta

Ingest runs, file catalog, lineage state, metric catalog, quality events, and operational evidence.

health_bronze

Raw ZIP, XML, GPX, ECG, clinical JSON, and app payload evidence retained for rebuilds.

health_silver

Conformed metrics, sleep, workouts, routes, blood pressure, and normalized event records.

health_gold

Daily, weekly, and monthly rollups for activity, sleep, cardio, workout, and coverage analytics.

health_mart

BI and agent-facing views designed for compact summaries rather than raw personal data exposure.

Runtime and Transform Flow

FastAPI receiver
Role: Receives Health Auto Export payloads.
Engineering signal: Token-authenticated REST intake with raw payload spooling.

Worker service
Role: Watches landing/spool folders and loads warehouse tables.
Engineering signal: Stable-file waits, queueing, SHA-256 dedupe, parser isolation.

Makefile/runtime commands
Role: Start services, initialize the database, trigger source ingestion, and run tests.
Engineering signal: Simple local-first operation with explicit commands instead of hidden managed infrastructure.

SQL transform layer
Role: Builds conformed and analytical layers.
Engineering signal: Rebuildable silver/gold transformations with medallion separation and mart views.
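
The receiver's token check and raw payload spooling could look roughly like the sketch below. It uses only the standard library and leaves out the FastAPI wiring; token_is_valid, spool_payload, and the file naming are hypothetical, and the write-then-rename step is one common way to keep a folder watcher from seeing a partial file, not a confirmed detail of this project.

```python
import hmac
import os
import tempfile
import uuid
from pathlib import Path


def token_is_valid(presented: str, expected: str) -> bool:
    """Compare tokens in constant time to avoid leaking matches via timing."""
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))


def spool_payload(spool_dir: Path, body: bytes) -> Path:
    """Write the raw request body to the spool, then rename it into place.

    The worker only ever sees complete *.json files because the rename
    happens after the bytes are flushed to disk.
    """
    spool_dir.mkdir(parents=True, exist_ok=True)
    final_path = spool_dir / f"{uuid.uuid4().hex}.json"
    fd, tmp_name = tempfile.mkstemp(dir=spool_dir, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(body)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_name, final_path)
    except BaseException:
        # Clean up the partial file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return final_path
```

Spooling the body before any parsing keeps the receiver fast and means a parser failure in the worker never loses the original payload.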

Quality and Reliability Model

Data quality gates

The quality suite checks populated bronze/silver/gold tables, non-null metric keys, nonnegative core quantity metrics, chronological sleep/workout intervals, required Apple Health metrics, and reasonable date coverage in the daily summary.
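
Three of these gates (non-null metric keys, nonnegative quantities, chronological intervals) are simple row-level predicates. The sketch below shows them over plain dictionaries; the column names metric_key, quantity, start, and end, and the QualityEvent record, are assumptions for illustration rather than the project's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class QualityEvent:
    check: str
    row_index: int
    detail: str


def check_rows(rows: list[dict]) -> list[QualityEvent]:
    """Run row-level gates and return one event per violation."""
    events = []
    for i, row in enumerate(rows):
        # Gate 1: every row must carry a metric key.
        if not row.get("metric_key"):
            events.append(QualityEvent("non_null_metric_key", i, "missing metric_key"))
        # Gate 2: core quantities may not be negative.
        quantity = row.get("quantity")
        if quantity is not None and quantity < 0:
            events.append(QualityEvent("nonnegative_quantity", i, f"quantity={quantity}"))
        # Gate 3: an interval must not end before it starts.
        start, end = row.get("start"), row.get("end")
        if start is not None and end is not None and end < start:
            events.append(QualityEvent("chronological_interval", i, f"{end} < {start}"))
    return events
```

Returning events rather than raising lets a run record every violation in one pass, which matches the idea of quality events staying visible in metadata.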

Rebuild posture

Raw data is archived first and never overwritten. Transformed tables are treated as rebuildable products. Duplicate ZIPs and JSON payloads are ignored by hash, while ingest runs and quality events stay visible in metadata tables.
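
Ignoring duplicates by hash can be reduced to a uniqueness constraint on the content digest. The sketch below uses an in-memory SQLite database as a stand-in for PostgreSQL so it runs without a server; the file_catalog table, its columns, and register_payload are hypothetical names, though the ON CONFLICT ... DO NOTHING clause is also valid PostgreSQL syntax.

```python
import hashlib
import sqlite3


def open_catalog() -> sqlite3.Connection:
    """Create a throwaway catalog; SQLite stands in for PostgreSQL here."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE file_catalog ("
        " sha256 TEXT PRIMARY KEY,"
        " source_name TEXT NOT NULL)"
    )
    return conn


def register_payload(conn: sqlite3.Connection, source_name: str, payload: bytes) -> bool:
    """Record a payload by content hash; return False if it was seen before."""
    digest = hashlib.sha256(payload).hexdigest()
    cur = conn.execute(
        "INSERT INTO file_catalog (sha256, source_name) VALUES (?, ?) "
        "ON CONFLICT (sha256) DO NOTHING",
        (digest, source_name),
    )
    conn.commit()
    # rowcount is 1 for a new row and 0 when the conflict clause skipped it.
    return cur.rowcount == 1
```

Because the key is the content hash rather than the file name, a renamed copy of the same export is still recognized as a duplicate and skipped before any parsing work starts.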

Observability

The monitoring posture is local and operationally explicit: the repository includes Prometheus, Grafana, Loki, and exporter configuration so pipeline health, warehouse row counts, service logs, and dashboard-ready indicators can be inspected without exposing raw health data externally.

Prometheus: pipeline, container, and database metrics
Grafana: health insight and log dashboards
Loki: service, container, and optional syslog streams

Boundary

This is a data engineering and personal analytics system, not a medical device or diagnostic tool. Its purpose is governed ingestion, trend visibility, data quality, local observability, and structured personal insight from privately held health telemetry.