Surya Vaddhiparthy | Data Engineering, AI Engineering, Cloud Data Platforms

Professional Summary

I am a Data Engineer with eight years of experience designing, building, and operating large-scale data platforms that power analytics and decision-making. My work focuses on turning complex, multi-source data into reliable, governed systems that teams can trust, with measurable improvements in performance, quality, and cost.

I specialize in end-to-end data engineering across batch and streaming systems, with deep experience in Python, SQL, Airflow, AWS, Snowflake, dbt, Kafka, and Spark. I have led initiatives that improved time-to-insight by approximately 42 percent, made queries nearly three times faster, reduced data incident resolution time by 35 percent, and lowered warehouse compute costs by around 20 percent through disciplined FinOps and optimization practices.

My approach emphasizes strong engineering fundamentals: automated testing and validation, monitoring and observability, CI/CD and infrastructure-as-code, and reusable pipeline patterns that make data platforms easier to scale and maintain. I work closely with analytics and business stakeholders to establish clear data models, consistent KPI definitions, and semantic layers that prevent metric drift and accelerate confident decision-making.

I am driven by building data systems that are not only fast and scalable, but also dependable, cost-efficient, and easy for organizations to grow on.

Data Engineering Skills

Core Data Engineering: Python, SQL; ELT/ETL pipelines, reusable ingestion patterns, incremental loads, backfills, auditability.
Modeling & Analytics: Dimensional modeling (Star/Snowflake schema), curated marts, governed metrics, semantic layers.
Orchestration & Transformation: Airflow, dbt (models/macros/tests/docs), DAG scheduling, dependency management, reliability patterns.
Streaming & Distributed Compute: Kafka, Spark, PySpark (Batch + Streaming fundamentals).
Cloud & Warehousing: Snowflake; AWS (S3, Glue, Lambda, ECS/Fargate, Redshift); Databricks/Delta Lake.
Quality & Observability: Great Expectations, Monte Carlo; SLAs, lineage, monitoring, incident runbooks.
BI & Semantics: Power BI, Tableau; KPI design and governance.
Dev & Automation: Git, CI/CD, Docker, Terraform; testing, code review, repeatable deployments.
Intelligent Systems Engineering: assistant design, retrieval workflows (LangChain/LangGraph), vector databases (pgvector/Milvus), and response quality evaluation.

Active R&D and Engineering Projects

FinLens: Regulatory-Grade Banking Data Platform

End-to-end banking analytics platform ingesting FDIC, FRED/ALFRED, QBP, and NIC public datasets into raw, curated, and analytical layers with dbt-style modeling, Airflow orchestration, Great Expectations validation, FastAPI health surfaces, and Streamlit dashboards. The project demonstrates production data engineering patterns: connector readiness checks, source traceability, quality gates, warehouse-ready marts, structured documentation, and deployment scaffolding across Docker, Terraform, S3, Snowflake, and DuckDB.

Technologies: Python, AWS S3, Snowflake, DuckDB, dbt, Great Expectations, Airflow, FastAPI, Streamlit, LangChain, Chroma, Terraform, GitHub Actions, Docker.
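
To illustrate the quality-gate idea in the FinLens description, here is a minimal sketch in plain Python. It does not use the Great Expectations API and is not the project's code; the field names `cert` and `total_assets`, the rules, and the `max_failure_rate` threshold are all assumptions chosen for the example.

```python
def check_row(row):
    """Return a list of rule failures for one record (hypothetical rules)."""
    failures = []
    if row.get("cert") is None:
        failures.append("cert is null")
    assets = row.get("total_assets")
    if not isinstance(assets, (int, float)) or assets < 0:
        failures.append("total_assets missing or negative")
    return failures


def quality_gate(rows, max_failure_rate=0.0):
    """Decide whether a batch may be promoted from raw to curated."""
    failed = []
    for index, row in enumerate(rows):
        failures = check_row(row)
        if failures:
            failed.append((index, failures))
    rate = len(failed) / len(rows) if rows else 0.0
    return {
        "passed": rate <= max_failure_rate,
        "failure_rate": rate,
        "failed_rows": failed,
    }


batch = [
    {"cert": 1, "total_assets": 100.0},
    {"cert": None, "total_assets": -5},
]
result = quality_gate(batch)
```

With this sample batch the gate fails, because one of the two rows breaks both rules and the assumed threshold tolerates no failures.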

Privacy-Aware Corpus Intelligence Pipeline

Local-first corpus classification pipeline for separating public-safe knowledge from private or sensitive text in large exported conversation archives. The system streams split JSON exports, recovers safe chunks from mixed conversations, detects hard identifiers, applies sensitive-domain routing, scores public topic families, and writes Markdown/JSON review artifacts. Optional validation compares the primary policy classifier against strict rules, semantic scoring, Presidio, spaCy, and ensemble routes so privacy decisions remain auditable and reproducible without sending raw corpus text to hosted models.

Technologies: Python, JSON Streaming, PII Detection, Privacy Policy Rules, Presidio, spaCy, Semantic Scoring, Markdown/JSON Artifacts, Local-First Processing, Regression Tests.
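
The hard-identifier step described above can be sketched roughly as follows. This is a simplified illustration using only the standard library, not the project's classifier and not Presidio or spaCy; the three regex patterns, the route names `private` and `public_candidate`, and the function names are assumptions.

```python
import re

# Hypothetical, deliberately small pattern set; a real policy would cover far more.
HARD_IDENTIFIER_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def find_hard_identifiers(text):
    """Return the sorted names of every pattern that matches the text."""
    return sorted(
        name
        for name, pattern in HARD_IDENTIFIER_PATTERNS.items()
        if pattern.search(text)
    )


def route_chunks(chunks):
    """Lazily route each chunk; any hard identifier sends it to the private route."""
    for chunk in chunks:
        hits = find_hard_identifiers(chunk)
        yield {
            "route": "private" if hits else "public_candidate",
            "identifiers": hits,
            "length": len(chunk),
        }


chunks = [
    "How does dbt handle incremental models?",
    "Reach me at jane@example.com from 10.0.0.7",
]
decisions = list(route_chunks(chunks))
```

The generator yields one decision per chunk without holding the whole archive in memory, and the decision record stores the matched identifier types rather than the raw text, so review artifacts stay safe to share.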

Privacy Preserving Authentication Audit Data Platform

Local-first authentication-event ingestion platform with SQS-compatible queue intake, versioned event contracts, HMAC tokenization for IP/device identifiers, deterministic replay-safe event IDs, PostgreSQL curated storage, quarantine tables, and batch-level audit evidence. The demo/API surface exposes health, flow, contract, transform preview, and table-preview endpoints, while the implementation proves reproducible local execution through LocalStack, Docker Compose, validation tests, offline dataset adapters, and compact evidence publishing without hosting full sensitive datasets.

Technologies: Python, FastAPI, PostgreSQL, Docker Compose, HMAC Tokenization, Privacy Engineering, Data Contracts, Structured Logging, unittest.
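
As an illustration of HMAC tokenization and deterministic, replay-safe event IDs, here is a minimal sketch using Python's standard `hmac`, `hashlib`, and `json` modules. It is not the platform's implementation: the event fields (`event_type`, `occurred_at`, `ip`, `device_id`), the function names, and the demo key are assumptions.

```python
import hashlib
import hmac
import json


def tokenize(value: str, key: bytes) -> str:
    """Keyed one-way token: the same value and key always give the same token."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def curate_event(raw: dict, key: bytes) -> dict:
    """Replace raw identifiers with tokens, then derive a deterministic ID."""
    curated = {
        "event_type": raw["event_type"],
        "occurred_at": raw["occurred_at"],
        "ip_token": tokenize(raw["ip"], key),
        "device_token": tokenize(raw["device_id"], key),
    }
    # Canonical JSON (sorted keys, fixed separators) keeps the hash stable.
    canonical = json.dumps(curated, sort_keys=True, separators=(",", ":"))
    curated["event_id"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return curated


key = b"demo-key-not-for-production"  # assumption: real keys come from a secret store
raw = {
    "event_type": "login_failed",
    "occurred_at": "2024-01-01T00:00:00Z",
    "ip": "203.0.113.9",
    "device_id": "device-42",
}
first = curate_event(raw, key)
replayed = curate_event(dict(raw), key)
```

Because the ID is computed from the tokenized payload, replaying the same message produces the same `event_id`, so a unique constraint on that column can absorb duplicates. In this sketch two distinct events with identical fields would also collide, so a real contract would likely include a source message identifier in the hashed payload.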

Failure-Aware Metric Realignment for Post-Hoc Dense Retrieval

Retrieval research workspace for evaluating whether post-hoc retrieval corrections can improve legal and contract search without retraining base embedding models. The project includes dataset loaders, dense and sparse retrieval baselines, BM25/dense blending, reciprocal rank fusion, query rewrite branches, HyDE expansion, CPU reranking, checkpointed long-run execution, saved Parquet/JSON artifacts, notebook validation inputs, and dashboard/report surfaces for comparing Recall@K, MRR, latency, and method-specific failure behavior across CUAD, Legal RAG Bench, and MTEB Bar Exam QA.

Technologies: Python, NumPy, DuckDB, Pandas, Sentence Transformers, Hugging Face Datasets, Retrieval Metrics, Vector-Space Transformations, JSONL/CSV/Parquet Artifacts, Local Evaluation Pipelines.
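
For readers unfamiliar with the fusion and metric terms above, here is a minimal sketch of reciprocal rank fusion together with Recall@K and the per-query reciprocal rank that MRR averages. It is a pure-Python illustration, not the workspace's code; the document IDs, the function names, and the constant `k=60` (a commonly used default) are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc IDs by summing 1 / (k + rank) per list."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def recall_at_k(ranked, relevant, k):
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant hit, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


bm25_ranking = ["d3", "d1", "d2"]    # hypothetical sparse results
dense_ranking = ["d1", "d4", "d3"]   # hypothetical dense results
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
# fused == ["d1", "d3", "d4", "d2"]
```

MRR is the mean of `reciprocal_rank` over all evaluation queries. Rank fusion needs only the orderings, not comparable scores, which is why it suits a post-hoc setting where BM25 and dense scores live on different scales.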

Agentic Planning and Execution Intelligence Platform

Local-first AI operations planning service that turns scoped goals, retained feedback, scheduled planning runs, and model output into traceable execution plans. The current implementation includes FastAPI routes, a browser console, PostgreSQL/SQLite storage, prompt registry, model router, guardrail checks, static evaluation harness, scheduler controls, feedback capture, run-history persistence, and public demo endpoints that show planning iteration playback without requiring live model credentials.

Technologies: Python, FastAPI, PostgreSQL, SQLite, Pydantic, Async Scheduling, Prompt Registry, Model Router, Guardrails, Evaluation Harness, Browser UI.

Contact

Visit the Contact page for live chat, email, scheduling, or the professional contact form.