Building the Lakehouse

Abhijeet Kumar Pandit

Data Engineer | Databricks Certified

Engineering Data for Impact.

I am a Software Engineer at TCS Digital with a strong focus on Data Engineering. I specialize in designing and implementing robust data pipelines, data lakes, and lakehouses using modern cloud architectures.

With deep expertise in the Databricks ecosystem, Apache Spark, and cloud platforms, I transform raw data into actionable insights, ensuring high performance, scalability, and data quality across complex ETL processes.

Years of Experience

Pipelines Built

TB Processed

Certifications

The Lakehouse Flow.

The technologies I use at each layer of the Medallion Architecture, from source systems through to analytics.

Data Sources: APIs, Logs, Databases (Kafka, Azure Event Hubs)

Bronze Layer: Raw Ingestion (Delta Lake, Auto Loader)

Silver Layer: Cleaned & Filtered (PySpark, dbt, SQL)

Gold Layer: Business Level (Aggregated Delta Tables)

Analytics: BI & ML Serving (Power BI, MLflow, Databricks SQL)
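
As a rough illustration of the Bronze, Silver, Gold progression listed above, here is a minimal plain-Python sketch. It is not taken from any of my production pipelines: the sample records, the field names, and the helper functions `to_silver` and `to_gold` are hypothetical, and in a real lakehouse each step would typically read from and write to Delta tables with PySpark rather than in-memory lists.

```python
from collections import defaultdict

# Hypothetical raw events as they might land in a Bronze table (illustrative data only).
bronze = [
    {"order_id": "1", "region": "EU", "amount": "10.50"},
    {"order_id": "2", "region": "eu ", "amount": "4.50"},
    {"order_id": "2", "region": "eu ", "amount": "4.50"},  # duplicate delivery
    {"order_id": "3", "region": "US", "amount": None},     # missing amount
    {"order_id": "4", "region": "US", "amount": "20.00"},
]


def to_silver(rows):
    """Clean and filter: drop rows without an amount, normalize types, dedupe by order_id."""
    seen = set()
    cleaned = []
    for row in rows:
        if row["amount"] is None or row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        cleaned.append({
            "order_id": int(row["order_id"]),
            "region": row["region"].strip().upper(),
            "amount": float(row["amount"]),
        })
    return cleaned


def to_gold(rows):
    """Aggregate to a business-level view: total amount per region."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'EU': 15.0, 'US': 20.0}
```

The point of the layering is that Bronze keeps the data exactly as received, so the Silver and Gold steps can be re-run from it whenever a cleaning rule or an aggregate definition changes.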

Journey & Experience.

Software Engineer (Data Engineering)

TCS Digital

2021 - Present

Developing scalable data pipelines and implementing Databricks Lakehouse architecture.

▹ Architected and deployed end-to-end ETL pipelines using PySpark and Databricks.
▹ Migrated legacy data warehouses to Delta Lake, improving query performance by 40%.
▹ Orchestrated workflows using Apache Airflow and Azure Data Factory.
▹ Optimized Spark jobs and managed Databricks cluster configurations to reduce compute costs.

Databricks, PySpark, Delta Lake, Azure, Airflow

Data Engineering Intern

Previous Company

2020 - 2021

Assisted in building data ingestion frameworks.

▹ Developed Python scripts for automated data extraction from APIs.
▹ Created SQL views for reporting dashboards.

Python, SQL, AWS

Technical Arsenal.

Technologies and tools I use to build scalable data lakehouses and robust ETL pipelines.

Data Engineering

Databricks
Apache Spark
PySpark
Delta Lake
Snowflake
Kafka

Cloud Platforms

Azure Data Factory
AWS Glue
Azure Blob Storage
Amazon S3
Azure Synapse Analytics

Languages

Python
SQL
Scala
Bash

Tools & Orchestration

Apache Airflow
Git
Docker
Terraform
CI/CD

Credentials & Certifications.

Validated expertise in building modern data architectures and lakehouses.

Databricks Certified Data Engineer Professional

Databricks • 2024

Databricks Certified Data Engineer Associate

Databricks • 2023

Featured Pipelines.

A selection of data engineering projects, architectures, and robust pipelines I've built.

Architecture

Enterprise Data Lakehouse

Designed a multi-hop medallion architecture (Bronze, Silver, Gold) using Databricks and Delta Lake for streaming and batch data processing.

Databricks, Delta Lake, PySpark, Azure Data Factory

Pipeline

Real-time Streaming Pipeline

Built a high-throughput streaming pipeline using Kafka and Spark Structured Streaming to process millions of events per hour.

Apache Kafka, Spark Structured Streaming, Python, AWS
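
To give a sense of the per-hour processing mentioned above, here is a minimal plain-Python sketch of tumbling one-hour windows. It is an illustration only, not the pipeline itself: the event shape (`ts` in epoch seconds, `type`) and the helpers `window_start` and `hourly_counts` are assumptions made for the example. A streaming engine would apply the same grouping continuously to an unbounded stream and would also have to handle late-arriving events, which this sketch ignores.

```python
from collections import Counter

WINDOW_SECONDS = 3600  # one-hour tumbling windows


def window_start(ts):
    """Return the start (epoch seconds) of the one-hour window containing ts."""
    return ts - (ts % WINDOW_SECONDS)


def hourly_counts(events):
    """Count events per (window start, event type)."""
    counts = Counter()
    for event in events:
        counts[(window_start(event["ts"]), event["type"])] += 1
    return counts


# Hypothetical events; timestamps are epoch seconds chosen for readability.
events = [
    {"ts": 0, "type": "click"},
    {"ts": 1800, "type": "click"},
    {"ts": 3599, "type": "click"},
    {"ts": 3600, "type": "click"},
    {"ts": 4000, "type": "view"},
]

counts = hourly_counts(events)
for (start, event_type), n in sorted(counts.items()):
    print(start, event_type, n)
```

Because the windows do not overlap, every event belongs to exactly one window, so the counts per window always sum to the number of events received.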

Tooling

Automated Data Quality Framework

Developed a comprehensive data quality and validation framework to monitor data drift and ensure integrity across the pipeline.

Python, SQL, Great Expectations, Airflow
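
The kind of checks such a framework runs can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the framework's actual code and not the Great Expectations API: the function names (`null_rate`, `mean_drift`, `run_checks`), the thresholds, and the sample data are all hypothetical.

```python
from statistics import mean


def null_rate(rows, column):
    """Fraction of rows where the column is missing or None."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


def mean_drift(baseline, current):
    """Relative change of the current mean against a baseline mean."""
    base = mean(baseline)
    if base == 0:
        raise ValueError("baseline mean is zero; relative drift is undefined")
    return abs(mean(current) - base) / abs(base)


def run_checks(rows, column, baseline, max_null_rate=0.1, max_drift=0.2):
    """Run both checks on one batch and report pass/fail per check."""
    values = [row[column] for row in rows if row.get(column) is not None]
    return {
        "null_rate_ok": null_rate(rows, column) <= max_null_rate,
        "drift_ok": mean_drift(baseline, values) <= max_drift,
    }


# Hypothetical baseline and two incoming batches.
baseline_amounts = [10, 12, 11, 9]
good_batch = [{"amount": 10}, {"amount": 11}, {"amount": 12}, {"amount": 9}]
bad_batch = [{"amount": 10}, {"amount": 11}, {"amount": None}, {"amount": 30}]

print(run_checks(good_batch, "amount", baseline_amounts))
print(run_checks(bad_batch, "amount", baseline_amounts))
```

In an orchestrated pipeline, a failed check would usually stop the downstream task or raise an alert instead of just printing a result.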