Stop Ignoring Data Pipelines: ETL vs ELT Explained Using a Real ML Workflow

I write about data science, machine learning, and generative AI. I share practical insights from what I learn and build.

Most of us love building machine learning models.

We tune hyperparameters, try different algorithms, and chase better accuracy.

But there’s one part we quietly ignore:

How the data actually gets to the model.

And here’s the truth:

A bad data pipeline will break your model long before your algorithm does.


TL;DR

  • ETL = Transform data before storing

  • ELT = Store data first, transform later

  • ETL works well for smaller, structured data

  • ELT is better for large-scale, flexible ML workflows

If you’ve ever cleaned a dataset before training a model… you’ve already used ETL.


The Problem We Don’t Talk About

Let’s say you’re building a model.

Your data:

  • Comes from multiple sources

  • Has missing values

  • Uses inconsistent formats

Before training anything, you need to answer:

How do I turn this messy data into something usable?

That process is your data pipeline.


ETL: What You’re Probably Already Doing

ETL stands for:

  • Extract → Collect data (CSV, APIs, databases)

  • Transform → Clean, filter, preprocess

  • Load → Store or feed into your model

In most ML projects:

  • You load a dataset

  • Clean it (handle nulls, encode features)

  • Train your model

That’s ETL.

We just call it:

“data preprocessing”
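
Here is a minimal sketch of those three steps, assuming a tiny made-up customer dataset. The column names, the `clean_customers` table, and the mean-fill strategy are all illustrative choices, not a recommendation. It uses only the Python standard library so it runs anywhere; in a real project you would likely reach for pandas.

```python
import csv
import io
import sqlite3
from statistics import mean

# Hypothetical raw export; in practice this would come from a file, API, or database.
RAW_CSV = """age,city,income
34,Pune,52000
,Delhi,61000
29,pune,
41,Delhi,75000
"""

def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: fill missing numbers with the column mean, normalize city casing."""
    ages = [float(r["age"]) for r in rows if r["age"]]
    incomes = [float(r["income"]) for r in rows if r["income"]]
    age_fill, income_fill = mean(ages), mean(incomes)
    cleaned = []
    for r in rows:
        cleaned.append({
            "age": float(r["age"]) if r["age"] else age_fill,
            "city": r["city"].strip().title(),
            "income": float(r["income"]) if r["income"] else income_fill,
        })
    return cleaned

def load(rows, conn):
    """Load: store only the cleaned rows; the raw values are not kept."""
    conn.execute("CREATE TABLE clean_customers (age REAL, city TEXT, income REAL)")
    conn.executemany(
        "INSERT INTO clean_customers VALUES (:age, :city, :income)", rows
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
clean_rows = conn.execute("SELECT age, city, income FROM clean_customers").fetchall()
```

Notice that only the cleaned rows reach storage. If you later decide mean-filling was the wrong call, you have to go back to the source and run the whole thing again.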


ELT: The Shift for Modern Data

Now imagine your dataset is massive.

Transforming everything before storing it becomes slow and restrictive.

So we flip the process:

  • Extract → Collect raw data

  • Load → Store it immediately

  • Transform → Process it later when needed

This is ELT.

Instead of committing to one transformation early, you keep raw data flexible.
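
A rough sketch of the same idea, assuming an in-memory SQLite database stands in for your warehouse or data lake. The `raw_events` table, the two view names, and the sample rows are hypothetical. The point is the order: the data lands untouched first, and each transformation is defined later on top of it.

```python
import sqlite3

# Hypothetical raw events, stored exactly as they arrived (strings, blanks and all).
raw_events = [
    ("u1", "2024-01-05", "19.99"),
    ("u2", "2024-01-05", ""),
    ("u1", "2024-01-06", "5.00"),
    ("u3", "2024-01-07", "42.50"),
]

conn = sqlite3.connect(":memory:")

# Extract + Load: land the data untouched in a raw table.
conn.execute("CREATE TABLE raw_events (user_id TEXT, event_date TEXT, amount TEXT)")
conn.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", raw_events)

# Transform (later, inside the store): each view is one interpretation of the raw data.
conn.execute("""
    CREATE VIEW spend_per_user AS
    SELECT user_id, SUM(CAST(amount AS REAL)) AS total_spend
    FROM raw_events
    WHERE amount <> ''
    GROUP BY user_id
""")
conn.execute("""
    CREATE VIEW events_per_day AS
    SELECT event_date, COUNT(*) AS n_events
    FROM raw_events
    GROUP BY event_date
""")

spend = dict(conn.execute("SELECT user_id, total_spend FROM spend_per_user"))
per_day = dict(conn.execute("SELECT event_date, n_events FROM events_per_day"))
raw_count = conn.execute("SELECT COUNT(*) FROM raw_events").fetchone()[0]
```

Here the blank `amount` is skipped by one view and still counted by the other, and the raw table keeps all four rows. Changing your mind about a transformation means rewriting a view, not re-ingesting the data.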


ETL vs ELT at a Glance

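| Feature          | ETL              | ELT                 |
|------------------|------------------|---------------------|
| Order            | Transform → Load | Load → Transform    |
| Flexibility      | Limited          | High                |
| Speed (Big Data) | Slower           | Faster              |
| Best For         | Structured data  | Large-scale systems |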

How This Fits Into a Real ML Workflow

Let’s map this to what you already do.

ETL-style workflow:

  • Collect data

  • Clean and preprocess immediately

  • Train model

ELT-style workflow:

  • Store raw data in a data lake

  • Transform based on use case

  • Train multiple models with different transformations

If you’ve ever:

  • Tried multiple preprocessing techniques

  • Reused the same dataset for different models

You’ve already felt the need for ELT.
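
To make that concrete, here is a small sketch, assuming one raw feature column shared by two experiments. The values, the function names, and the pairing of scalers with model types are illustrative assumptions. The raw list plays the role of the data lake and is never modified.

```python
from statistics import mean, pstdev

# Hypothetical raw feature values, kept unmodified (a stand-in for a data lake).
RAW_INCOME = [52000.0, 61000.0, 75000.0, 48000.0]

def min_max_scale(values):
    """Rescale to [0, 1]; often used for distance-based models.

    Assumes the values are not all identical (otherwise hi - lo is zero).
    """
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift to zero mean and scale to unit variance; often used for linear models.

    Assumes the values are not all identical (otherwise sigma is zero).
    """
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Each experiment names the transformation it wants; the raw data is shared.
TRANSFORMS = {"knn_features": min_max_scale, "linear_features": standardize}
feature_sets = {name: fn(RAW_INCOME) for name, fn in TRANSFORMS.items()}
```

Adding a third experiment is one more entry in `TRANSFORMS`. Nothing upstream has to be re-collected or re-cleaned.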


Scaling This: Where Tools Come In

When data grows, your local machine starts struggling.

That’s where tools like Apache Spark come in.

They allow you to:

  • Process large datasets

  • Run transformations at scale

  • Build flexible ELT-style pipelines

You don’t need to master these tools right now.

Just understand:

They make large-scale transformations practical, and that is what ELT depends on.


When Should You Use ETL vs ELT?

Use ETL when:

  • Data is small to medium

  • Transformations are fixed

  • You want a predictable pipeline with a fixed output schema

Use ELT when:

  • Data is large or growing

  • You want flexibility in experiments

  • You don’t want to lose raw data


Why This Matters (Especially for ML Engineers)

Here’s something I learned:

We often spend hours improving models by 1–2%.

But sometimes, the real improvement comes from fixing how data flows into them.

Understanding ETL and ELT helps you:

  • Experiment faster

  • Avoid repeated preprocessing

  • Build more reliable ML systems


Final Thought

Most people focus on models.

Better engineers focus on systems.

And better systems start with better data pipelines.

Because in the end, better data flow beats a better model.