Building the Lakehouse

Abhijeet Kumar Pandit

Data Engineer | Databricks Certified

Engineering Data for Impact.

I am a Software Engineer at TCS Digital with a strong focus on Data Engineering. I specialize in designing and implementing robust data pipelines, data lakes, and lakehouses using modern cloud architectures.

With deep expertise in the Databricks ecosystem, Apache Spark, and cloud platforms, I transform raw data into actionable insights, ensuring high performance, scalability, and data quality across complex ETL processes.


The Lakehouse Flow.

The technologies I use at each layer of the Medallion Architecture:

  • Data Sources: APIs, Logs, Databases (Kafka, Azure Event Hubs)
  • Bronze Layer: Raw Ingestion (Delta Lake, Auto Loader)
  • Silver Layer: Cleaned & Filtered (PySpark, dbt, SQL)
  • Gold Layer: Business Level (Aggregated Delta Tables)
  • Analytics: BI & ML Serving (Power BI, MLflow, Databricks SQL)
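
The bronze, silver, and gold hops can be illustrated with a minimal sketch in plain Python. This is not the production pipeline and uses no Spark or Delta Lake; the record fields (`order_id`, `region`, `amount`) and the function names are hypothetical, chosen only to show what each layer is responsible for.

```python
from collections import defaultdict

# Hypothetical raw events as they might arrive from an API or log source.
raw_events = [
    {"order_id": "1", "region": "EU", "amount": "20.5"},
    {"order_id": "2", "region": "eu", "amount": "bad"},   # unparseable amount
    {"order_id": "1", "region": "EU", "amount": "20.5"},  # duplicate of order 1
    {"order_id": "3", "region": "US", "amount": "10"},
]

def to_bronze(events):
    """Bronze: keep every record as received, only tagging its arrival order."""
    return [{**e, "_ingest_seq": i} for i, e in enumerate(events)]

def to_silver(bronze):
    """Silver: cast types, normalize values, drop invalid rows and duplicates."""
    seen, clean = set(), []
    for row in bronze:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # invalid amount: excluded from silver
        if row["order_id"] in seen:
            continue  # duplicate order: keep the first occurrence only
        seen.add(row["order_id"])
        clean.append({
            "order_id": row["order_id"],
            "region": row["region"].upper(),
            "amount": amount,
        })
    return clean

def to_gold(silver):
    """Gold: business-level aggregate, here the total amount per region."""
    totals = defaultdict(float)
    for row in silver:
        totals[row["region"]] += row["amount"]
    return dict(totals)

bronze = to_bronze(raw_events)
silver = to_silver(bronze)
gold = to_gold(silver)
```

In a real lakehouse each function would typically be a job writing to its own table, so that the raw bronze data stays available for reprocessing when silver rules change.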

Journey & Experience.

Software Engineer (Data Engineering)

TCS Digital
2021 - Present

Developing scalable data pipelines and implementing Databricks Lakehouse architecture.

  • Architected and deployed end-to-end ETL pipelines using PySpark and Databricks.
  • Migrated legacy data warehouses to Delta Lake, improving query performance by 40%.
  • Orchestrated workflows using Apache Airflow and Azure Data Factory.
  • Optimized Spark jobs and managed Databricks cluster configurations to reduce compute costs.
Databricks, PySpark, Delta Lake, Azure, Airflow

Data Engineering Intern

Previous Company
2020 - 2021

Assisted in building data ingestion frameworks.

  • Developed Python scripts for automated data extraction from APIs.
  • Created SQL views for reporting dashboards.
Python, SQL, AWS

Technical Arsenal.

Technologies and tools I use to build scalable data lakehouses and robust ETL pipelines.

Data Engineering

  • Databricks
  • Apache Spark
  • PySpark
  • Delta Lake
  • Snowflake
  • Kafka

Cloud Platforms

  • Azure Data Factory
  • AWS Glue
  • Azure Blob
  • AWS S3
  • Synapse Analytics

Languages

  • Python
  • SQL
  • Scala
  • Bash

Tools & Orchestration

  • Apache Airflow
  • Git
  • Docker
  • Terraform
  • CI/CD

Credentials & Certifications.

Validated expertise in building modern data architectures and lakehouses.

Databricks Certified Data Engineer Professional

Databricks, 2024

Databricks Certified Data Engineer Associate

Databricks, 2023

Featured Pipelines.

A selection of data engineering projects, architectures, and robust pipelines I've built.

Architecture

Enterprise Data Lakehouse

Designed a multi-hop medallion architecture (Bronze, Silver, Gold) using Databricks and Delta Lake for streaming and batch data processing.

Databricks, Delta Lake, PySpark, Azure Data Factory
Pipeline

Real-time Streaming Pipeline

Built a high-throughput streaming pipeline using Kafka and Spark Structured Streaming to process millions of events per hour.

Apache Kafka, Spark Streaming, Python, AWS
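
The core of a pipeline like this is usually a windowed aggregation over event time. The sketch below shows only that grouping logic, in batch form and plain Python; a streaming engine would compute it incrementally and handle late data. The event shape (`ts` in seconds, `type`) and the 60-second window are assumptions made for illustration.

```python
from collections import Counter

WINDOW_SECONDS = 60  # assumed tumbling-window size

def window_start(ts, size=WINDOW_SECONDS):
    """Return the start of the tumbling window that contains timestamp ts."""
    return ts - (ts % size)

def count_by_window(events, size=WINDOW_SECONDS):
    """Count events per (window start, event type)."""
    counts = Counter()
    for event in events:
        counts[(window_start(event["ts"], size), event["type"])] += 1
    return dict(counts)

# Hypothetical events with event-time timestamps in seconds.
events = [
    {"ts": 5, "type": "click"},
    {"ts": 42, "type": "click"},
    {"ts": 61, "type": "view"},
    {"ts": 119, "type": "click"},
]

window_counts = count_by_window(events)
```

Here the first two clicks fall into the window starting at 0, and the remaining two events into the window starting at 60.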
Tooling

Automated Data Quality Framework

Developed a comprehensive data quality and validation framework to monitor data drift and ensure integrity across the pipeline.

Python, SQL, Great Expectations, Airflow
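
To give a flavor of what such checks look like, here is a minimal sketch in plain Python. It does not use the Great Expectations API and is not the framework itself; the function names, the `amount` column, and the 20% drift tolerance are hypothetical.

```python
from statistics import mean

def check_not_null(rows, column):
    """Fail if any row has a missing value in the column."""
    failed = [i for i, r in enumerate(rows) if r.get(column) is None]
    return {"check": f"not_null({column})", "passed": not failed, "failed_rows": failed}

def check_between(rows, column, low, high):
    """Fail if any non-null value falls outside [low, high]."""
    failed = [
        i for i, r in enumerate(rows)
        if r.get(column) is not None and not (low <= r[column] <= high)
    ]
    return {"check": f"between({column})", "passed": not failed, "failed_rows": failed}

def mean_drift(baseline, current, column, tolerance=0.2):
    """Flag drift when the column mean moves more than `tolerance` (relative).

    Assumes the baseline mean is non-zero and both batches have non-null values.
    """
    base = mean(r[column] for r in baseline if r.get(column) is not None)
    cur = mean(r[column] for r in current if r.get(column) is not None)
    return {"baseline_mean": base, "current_mean": cur,
            "drifted": abs(cur - base) / abs(base) > tolerance}

# Hypothetical batches: a trusted baseline and a new load to validate.
baseline_batch = [{"amount": 10.0}, {"amount": 12.0}]
current_batch = [{"amount": 15.0}, {"amount": None}, {"amount": -3.0}]

null_result = check_not_null(current_batch, "amount")
range_result = check_between(current_batch, "amount", 0, 1000)
drift_result = mean_drift(baseline_batch, current_batch, "amount")
```

Returning a small result dictionary per check, rather than raising on the first failure, makes it easy for an orchestrator task to collect all results and decide whether to stop the downstream load.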