Data Pipeline Generation With AI
Data Eng
AI can scaffold Airflow DAGs and Spark jobs. It doesn't know your schema, SLAs, or 'what happens when the source is 2 hours late?'
Data Arch
AI suggests pipeline patterns. Data governance, lineage, and 'who owns this dataset?' — you own that.
ML Eng
AI generates training pipelines. Feature store design, monitoring drift, and productionization — human territory.
TL;DR
- AI is good at generating ETL skeletons: Airflow DAGs, Spark jobs, dbt models, basic connectors.
- AI struggles with schema design, data quality rules, SLAs, and "what happens when things go wrong?"
- Use AI for boilerplate. You own semantics, reliability, and lineage.
Data pipelines are structured. They're also full of business logic, edge cases, and "our source system does weird things." AI handles the structure. You handle the weird.
Where AI Helps
ETL Skeletons
Prompt: "Create an Airflow DAG that runs daily, pulls from PostgreSQL, and loads to BigQuery."
What you get: A working DAG. Operators. Scheduling. Maybe basic error handling.
What you add: Connection IDs. Schema mapping. Incremental vs. full load logic. Retries and backfill strategy. Your actual table names.
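Those additions are the part worth sketching. Below is a minimal example of what you layer on top of the generated skeleton, assuming Airflow 2.x with the Postgres provider; the connection ID, table, and column names are placeholders, and the retry and backfill choices are yours, not defaults the model can know.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

default_args = {
    "retries": 3,                           # your retry policy, not the model's default
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    "orders_daily",
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=True,                           # backfilling on enable is a decision, not a given
    default_args=default_args,
) as dag:
    extract = PostgresOperator(
        task_id="extract_orders",
        postgres_conn_id="your_prod_conn",  # placeholder connection ID
        # Incremental load: only the scheduled day's rows, so backfills stay bounded.
        sql="SELECT * FROM orders WHERE created_at::date = '{{ ds }}'",
    )
```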
dbt and SQL Transformation
Prompt: "Create a dbt model that joins orders and customers."
What you get: A SQL model. Maybe staging layers. Reasonable structure.
What you add: Your actual column names. Grain. Deduplication logic. "We have 3 customer records per person because of legacy systems" — AI doesn't know that.
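To make that kind of rule concrete, here is a "latest record wins" dedup sketched in pandas so the logic is visible; in an actual dbt model the same idea is a window function over the staging table. The column names and the "latest wins" rule are assumptions, and your source-of-truth rule may differ.

```python
import pandas as pd

# Hypothetical legacy extract: multiple rows per person from different source systems.
customers = pd.DataFrame({
    "customer_id":   [101, 101, 101, 102],
    "source_system": ["crm", "billing", "legacy_erp", "crm"],
    "updated_at":    pd.to_datetime(["2024-05-01", "2024-06-01", "2023-01-10", "2024-04-15"]),
})

# Grain decision: one row per customer_id, keeping the most recently updated record.
# The "latest wins" rule is business knowledge the model cannot infer from the schema.
deduped = (
    customers.sort_values("updated_at", ascending=False)
             .drop_duplicates(subset="customer_id", keep="first")
             .sort_values("customer_id")
)
print(deduped)  # one row for 101 (billing, 2024-06-01) and one for 102
```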
Spark and Batch Jobs
Prompt: "Write a Spark job to aggregate clicks by user."
What you get: DataFrame operations. GroupBy. Maybe partitioning hints.
What you add: Your cluster config. Memory tuning. "We have late-arriving data — how do we handle it?" AI gives generic patterns.
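A minimal batch sketch of the generic pattern plus the part you add for late data: overwrite only the affected day's partition so a reprocess does not double-count. The paths, column names, and hard-coded date are placeholders.

```python
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("clicks_by_user")
    # Replace only the partitions being rewritten, not the whole output path.
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .getOrCreate()
)

run_date = "2024-06-01"                       # in practice: passed in by the orchestrator
clicks = spark.read.parquet("s3://your-bucket/clicks/")

daily_clicks = (
    clicks
    .filter(F.col("event_date") == run_date)
    .groupBy("user_id", "event_date")
    .agg(F.count("*").alias("click_count"))
)

# Idempotent write: rerunning this day (e.g. after late-arriving data) replaces it cleanly.
(daily_clicks
    .write.mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3://your-bucket/agg/clicks_by_user/"))
```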
Where AI Falls Short
Schema and Semantic Understanding
- "What's the grain of this table?" — AI doesn't know. You do. Wrong grain = wrong metrics.
- "Which column is the source of truth for customer ID?" — Data modeling. Human.
- "We have 5 systems that all have 'status' — they mean different things." — Data dictionary. AI can't maintain it.
Data Quality and SLAs
- "If this pipeline fails, who gets paged?" — Operational context. AI doesn't know your on-call.
- "We need this table by 6am for the exec dashboard." — SLA. AI doesn't prioritize.
- "What if the source sends malformed JSON?" — Validation rules. You design them.
Lineage and Governance
- "Where does this column come from?" — Lineage. AI can't maintain it across your org.
- "Is this PII? Do we need to mask it?" — Governance. Human policy.
- "Can marketing use this dataset?" — Access control. Org rules.
Failure Modes and Backfill
- "The source was down for 4 hours. Do we backfill? Skip? Alert?" — Incident playbook. You write it.
- "We replayed data. Are downstream tables idempotent?" — Design question. AI won't ask it.
How to Use AI for Data Pipelines
- Generate structure, inject semantics. Use AI for the scaffold. Add your schema, your transforms, your error handling.
- Always validate output. Run the pipeline in dev. Check row counts against known-good runs (see the sketch after this list). AI can generate SQL that parses and runs but is logically wrong.
- Document what AI can't know. Schema decisions, SLA rationale, failure playbooks. That documentation is your institutional memory.
- Own orchestration and monitoring. AI writes the job. You own scheduling, alerting, and "what do we do when it breaks?"
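A small example of the "compare to known-good runs" check, so a wrong-but-plausible transform gets caught before anyone trusts the numbers. The baseline, tolerance, and counts are illustrative; in practice they come from your dev warehouse and a trusted prior run.

```python
def check_row_count(current: int, baseline: int, tolerance: float = 0.10) -> None:
    """Fail loudly if the new load drifts too far from a known-good baseline."""
    drift = abs(current - baseline) / max(baseline, 1)
    if drift > tolerance:
        raise ValueError(
            f"Row count {current} is {drift:.0%} off baseline {baseline}; "
            "inspect the generated SQL before promoting it."
        )

check_row_count(current=98_500, baseline=100_000)   # within 10%: passes silently
```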
Quick Check
AI generates an Airflow DAG that pulls from PostgreSQL and loads to BigQuery. What does AI typically NOT know?
```python
# AI generates the structure. You add:
# - Your connection IDs
# - Schema mapping (your column names)
# - Incremental vs. full load logic
# - Retries, backfill strategy, SLA
from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator
# The BigQuery load operator and its import depend on your Google provider version.

with DAG("my_pipeline", schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False) as dag:
    extract = PostgresOperator(
        task_id="extract",
        postgres_conn_id="your_prod_conn",                        # You add
        sql="SELECT * FROM your_table",                           # Your schema
    )
    load = BigQueryOperator(
        task_id="load",
        destination_dataset_table="project.dataset.your_table",  # You add
        write_disposition="WRITE_TRUNCATE",                       # Or append — you decide
    )
    extract >> load
```

Do This Next
- Generate one pipeline skeleton with AI (Airflow, dbt, or Spark). How much would you keep? What would you change?
- List three "AI doesn't know" facts about your data. Schema quirks, SLA requirements, or failure modes. That's your pipeline design input.