Azure for Data Engineers: Data Factory, Synapse Analytics, and the Microsoft Cloud Data Stack

Azure is not a novelty cloud. It is where many of the largest enterprises run their core systems, and that reality shapes the data stack. If you are a data engineer working in a Microsoft-heavy company, you do not get to avoid Azure. Your identity provider is Azure AD. Your collaboration stack is Microsoft 365 and Teams. Your finance team has a long-term agreement with Microsoft, and the procurement process is already baked in. That context matters when you design a data platform because it explains why Azure gets chosen even when a different cloud might be more developer-friendly on paper.

This post is a practitioner guide to the Azure data stack. It is not a product tour. It is the set of services that actually show up in production, how they fit together, and the tradeoffs you should be aware of before you commit. I will cover ADLS Gen2, Azure Data Factory, Synapse Analytics, Event Hubs, Microsoft Fabric, and Azure Databricks, plus the parts of dbt that make this stack real. If you are coming from AWS or GCP, I will map each part to the equivalent services and explain where the match is clean and where it is not.

Why Azure Matters

The biggest reason Azure matters is Microsoft enterprise dominance. I see Fortune 100 and global healthcare organizations on Azure not because it is the coolest, but because Microsoft already owns the identity, security, procurement, and productivity layers. When your data platform integrates with Azure AD, Purview, Power BI, and the existing compliance tooling, the political friction is lower. That matters more than most architects admit.

The second reason is Azure Databricks. Microsoft took the best Spark platform in the market and made it a first-class citizen in Azure. If you care about Spark at scale, Azure Databricks is a strong default. In practice, it is better than Synapse Spark pools and far more reliable under real workloads. You get managed clusters, Delta Lake, Unity Catalog, and a mature notebook and jobs experience.

The third reason is integration with Teams and Office. This sounds fluffy until you watch how real executives use data. Power BI dashboards embedded in Teams channels are not just a convenience, they are how decisions get made. Azure is optimized for that flow. If your company lives in Teams, Azure tends to win by default.

Azure Data Lake Storage Gen2 (ADLS)

ADLS Gen2 is Azure's S3. It is the durable, low-cost object storage layer that everything else builds on. The key feature is the hierarchical namespace, which makes the storage behave more like a file system than a flat object store. That gives you directory-level ACLs, rename operations that are not painfully expensive, and a structure that data engineers find intuitive.

From a lakehouse perspective, ADLS Gen2 integrates cleanly with Delta Lake. Azure Databricks and Synapse can both read and write Delta tables in ADLS, and you can use the same layout patterns you would use on S3 or GCS. Cost-wise, ADLS is competitive with S3 and GCS. The bigger difference is operational: Microsoft storage account quotas, networking configuration, and the need to think through RBAC, ACLs, and private endpoints. The security model is powerful, but it is also more complex than most teams expect.

Treat ADLS as the permanent system of record. Raw data, staged tables, and curated datasets should all live there, with clear prefixes and lifecycle policies. If you do that, you can swap compute engines without re-architecting the storage. That is the same advantage that S3 gives AWS teams, and it is just as real on Azure.

The operational detail that trips teams up is access control. ADLS supports both Azure RBAC and POSIX-style ACLs, and you need to decide which is authoritative. In production, I prefer RBAC for coarse-grained access and ACLs for data-domain boundaries. If you do not plan that boundary layer, you end up with storage accounts that are either too open or too locked down. The time to solve it is early, before every pipeline depends on a fragile permission chain.

abfss://curated@mydatalake.dfs.core.windows.net/finance/fact_transactions/
abfss://raw@mydatalake.dfs.core.windows.net/events/app_clicks/2026/03/26/

Azure Data Factory (ADF)

ADF is the orchestration backbone for most Azure data platforms. You define pipelines, datasets, and linked services. Pipelines contain activities like copy, data flow, notebook execution, or stored procedure calls. Datasets define the shape and location of data. Linked services are the connection definitions. It is highly visual and heavily integrated with Azure services, which is why non-engineer stakeholders often like it more than Airflow.

If you have used AWS Glue crawlers, ADF pipelines feel familiar but more visual. Glue is more code-driven and leans on Spark; ADF leans on UI-first orchestration with optional code. Dagster is more developer-friendly, and Airflow is more flexible. But in Microsoft environments, ADF wins because it plugs into everything with minimal setup and because the security and networking story is simpler in Azure-native shops.

The tradeoff is that complex ADF pipelines become hard to version. You can integrate with Git, but the JSON definitions are verbose and not pleasant to review. My pattern is to keep ADF for ingestion and hand off heavy transforms to Synapse or Databricks, with dbt handling the SQL models. That division keeps pipelines readable and reduces the amount of orchestration code you have to debug.

The integration runtime is another practical consideration. If you are pulling from on-prem sources, you will likely use a self-hosted integration runtime, which becomes its own operational surface. Monitor it, patch it, and treat it as a production dependency. On the cloud side, managed identities are your friend. Use them to avoid storing secrets in ADF, and make sure you budget time to wire up network rules, private endpoints, and firewall allow lists.

{
  "name": "ingest_sales_api",
  "properties": {
    "activities": [
      {
        "name": "CopyFromApi",
        "type": "Copy",
        "inputs": [{ "referenceName": "sales_api", "type": "DatasetReference" }],
        "outputs": [{ "referenceName": "raw_sales_json", "type": "DatasetReference" }],
        "typeProperties": {
          "source": { "type": "RestSource" },
          "sink": { "type": "JsonSink" }
        }
      }
    ],
    "annotations": ["ingestion", "api"]
  }
}

Azure Synapse Analytics

Synapse is Microsoft's official analytics warehouse. It bundles serverless SQL pools, dedicated SQL pools, Spark pools, and data integration tooling. In practice, most teams use either serverless SQL or a dedicated pool, not both. The decision is cost and performance: serverless is pay-per-query and great for ad hoc or dbt-style transforms. Dedicated pools are provisioned and better for high concurrency BI or predictable workloads.

Synapse integrates cleanly with ADLS Gen2, which means you can query data in-place using external tables or serverless SQL. That makes it feel similar to BigQuery or Redshift Spectrum. For teams coming from AWS, the mental model is close to Athena for serverless and Redshift for dedicated. The mapping is not perfect because Synapse has its own quirks, but the concept is the same.

dbt works with Synapse via the dbt-synapse adapter. It is a viable production option if you control concurrency and keep models relatively straightforward. I treat serverless SQL pools as the default and only move to dedicated when the performance or concurrency demands are obvious. That ordering keeps costs under control and avoids paying for idle compute.

Dedicated pools are worth it when you need predictable performance for dashboards with high concurrency. They are also the only option when you rely heavily on materialized data in a warehouse-style environment. But if your workload is bursty and mostly batch, the serverless pool plus external tables in ADLS is more cost-efficient. Most teams I see start with serverless, then add a dedicated pool only for the handful of datasets that power mission-critical BI.

CREATE EXTERNAL TABLE curated.fact_orders
WITH (
  LOCATION = 'curated/fact_orders/',
  DATA_SOURCE = [adls_data_source],
  FILE_FORMAT = [parquet_format]
)
AS SELECT * FROM staging.fact_orders;

Azure Event Hubs

Event Hubs is Azure's managed streaming backbone. It is Kafka protocol compatible, which means you can use Kafka clients directly without changing much code. Under the hood, it uses a partition model, consumer groups, and offsets that feel familiar if you have operated Kafka. The biggest difference is operational: you are not managing brokers or Zookeeper, and the scaling knobs are simpler.

The capture feature is one of the most useful parts. You can configure Event Hubs to automatically write events to ADLS or Blob Storage in Avro or Parquet, which is a clean way to land streaming data for batch processing. If you are building a lakehouse, capture gives you a low-effort bridge from real-time ingestion to your storage layer.

The Kafka compatibility is not full parity, and there are limits on throughput and retention you should plan for. But for most teams, it is the easiest way to get streaming data into Azure without running Kafka yourself. It is the Kafka equivalent in the Azure ecosystem.

Endpoint=sb://my-namespace.servicebus.windows.net/;
SharedAccessKeyName=send-rule;
SharedAccessKey=***REDACTED***;
EntityPath=events-clickstream

Microsoft Fabric

Microsoft Fabric launched in 2023 as a unified analytics platform. It aims to bring together data engineering, data science, real-time analytics, and BI under a single SaaS-style umbrella. The core idea is OneLake, a single logical data lake that Fabric services share. You get Lakehouse storage, notebooks, pipelines, and Power BI in one workspace with a single capacity model.

Fabric is still evolving, but it matters because it signals where Microsoft is heading. For greenfield teams that want a managed, consolidated experience, Fabric can simplify a lot of the Azure sprawl. For established teams, Fabric is more of a strategic consideration: do you build around Synapse and ADF, or do you plan a path toward Fabric's unified model? Today, Fabric and Synapse overlap, and that ambiguity is part of the Azure tax.

My pragmatic take is to treat Fabric as a platform choice rather than a point service. If your organization is already standardized on Power BI and wants tighter integration with notebooks and pipelines, Fabric is worth serious evaluation. If you already have a mature lakehouse on ADLS with Databricks and dbt, Fabric is more of a long-term roadmap conversation.

Azure Databricks

Azure Databricks is still the best-in-class Spark platform on Azure. It brings the Databricks runtime, Delta Lake, Delta Live Tables, and Unity Catalog into the Microsoft ecosystem with strong Azure-native security integration. If you are doing heavy Spark processing, Databricks is the most reliable option, and it avoids many of the pitfalls of Synapse Spark pools.

Delta Live Tables are particularly useful for structured streaming and curated pipelines. They give you declarative pipelines with built-in quality checks, incremental processing, and lineage. Unity Catalog is the other big reason teams adopt Databricks; it provides a centralized governance layer that spans notebooks, jobs, and data access. If governance and auditability matter, Unity Catalog is a material advantage over a pure ADLS plus Synapse stack.

The tradeoff is cost and governance complexity. Databricks is not cheap, and it introduces another control plane you have to manage. But if Spark is critical to your workload, Azure Databricks is the path of least resistance. It is the place where Spark feels like a product rather than an infrastructure project.

CREATE TABLE curated.fact_sessions
USING DELTA
LOCATION 'abfss://curated@mydatalake.dfs.core.windows.net/sessions/'
TBLPROPERTIES ('delta.autoOptimize.optimizeWrite' = 'true');

Building a Complete Azure Data Stack

A clean, production-grade Azure stack looks like this: ADLS Gen2 for storage, ADF for ingestion and orchestration, Synapse serverless SQL for ad hoc and dbt transformations, and dbt as the modeling layer. You can swap Synapse for Databricks if you need heavy Spark, but the core pattern stays the same. Keep your raw data in ADLS, expose it via external tables, and let dbt build the curated layer.

If you want to compare this to other clouds, here is the closest translation. Azure: ADLS + ADF + Synapse + dbt. AWS: S3 + Glue + Athena + dbt. GCP: GCS + Dataflow + BigQuery + dbt. Each cloud has its own quirks, but the architecture is similar: object storage, orchestration, SQL compute, and a transformation layer.

The reason this pattern works is that it minimizes lock-in while still taking advantage of managed services. ADLS is open storage. dbt models are portable. Synapse serverless gives you pay-per-query. You can add Databricks for Spark without rewriting your entire pipeline. The stack is modular, which is how you stay flexible as requirements change.

# dbt profile for Synapse (dbt-synapse adapter)
my_project:
  target: prod
  outputs:
    prod:
      type: synapse
      driver: 'ODBC Driver 18 for SQL Server'
      server: my-synapse.sql.azuresynapse.net
      database: analytics
      schema: dbt
      authentication: sql
      user: dbt_runner
      password: ${DBT_SYNAPSE_PASSWORD}
      encrypt: true
      trust_cert: false
      threads: 8

When to Choose Azure

Azure is the default when your company already lives in Microsoft 365, Teams, and Azure AD. Identity and access management is a non-negotiable foundation, and Azure AD makes that part painless. If your analysts live in Power BI, the Azure ecosystem feels more cohesive than anything else because the tooling was designed to work together.

Azure is also common in regulated industries like healthcare and finance. Microsoft has long-standing enterprise contracts, and the compliance and security features are mature. If your legal and procurement teams are already aligned with Microsoft, you will save months of negotiation by staying inside the Azure umbrella. That is not a technical advantage, but it is often the deciding factor.

I recommend Azure when the organization cares more about predictable enterprise integration than cutting-edge developer experience. If your team is small and needs maximum velocity, AWS or GCP might feel lighter. If your company is large, Microsoft-heavy, and already paying for Azure, it is hard to justify going elsewhere.

Practical Callouts

First, dbt-synapse setup is straightforward but picky about drivers and authentication. Use the Microsoft ODBC driver, make sure encryption is enabled, and test connectivity in a CI environment early. The adapter is good enough for production, but you need to watch concurrency and query costs on serverless pools.

Second, cost models are different from BigQuery in subtle ways. BigQuery charges per byte scanned with a predictable rate. Synapse serverless also charges per TB processed, but you pay for data movement and external table metadata operations that can surprise you. Dedicated pools are provisioned and will happily burn money if left running. Treat capacity planning as a real responsibility, not an afterthought.

Third, the Azure tax is real. Azure services are powerful but often require more configuration: resource groups, storage accounts, RBAC, private endpoints, managed identities, and region pairing decisions. If you do not plan for this, you will spend weeks on infrastructure plumbing before you ever run a transformation. The upside is that once it is wired, it is stable. The downside is that the learning curve is steep.

# Example dbt-synapse model config
{{
  config(
    materialized = 'table',
    distribution = 'hash',
    dist = 'customer_id',
    index = 'clustered columnstore'
  )
}}

select
  customer_id,
  sum(order_total) as lifetime_value
from {{ ref('fct_orders') }}
group by 1

Closing

Azure is not a single service you adopt. It is a full ecosystem that tends to work best when you embrace the Microsoft way of doing things. ADLS gives you durable storage, ADF handles ingestion, Synapse gives you SQL analytics, and Databricks is the Spark engine that makes large-scale processing realistic. Fabric is the wildcard: a unifying platform that might simplify things over time but is still evolving.

If you are a data engineer in a Microsoft-heavy company, the right move is to learn this stack deeply. The services are not perfect, but they are proven. When you understand where Azure shines and where it is awkward, you can build platforms that are reliable, cost-aware, and aligned with the reality of enterprise tech stacks.

Keep Reading

If you want to go deeper on the open lakehouse and dbt patterns, these two posts pair well with this one.

Apache Iceberg and the Open Lakehouse dbt in Production

Questions or pushback on any of this? Find me on LinkedIn.

Share on X Share on LinkedIn

Ryan Kirsch is a senior data engineer with 8+ years building data infrastructure at media, SaaS, and fintech companies. He specializes in Kafka, dbt, Snowflake, and Spark, and writes about data engineering patterns from production experience. See his full portfolio.