Azure Data Factory
Introduction
Azure Data Factory is a cloud-based data integration service offered by Microsoft. It provides a platform for creating, scheduling, and managing data workflows, referred to as pipelines. These data pipelines enable the efficient extraction of data from diverse sources, necessary transformation, and subsequent loading into various data stores.
What are the Main Components in Azure Data Factory?
Azure Data Factory has five main components:
- Author & Monitor: The user interface where we create, schedule, and monitor data pipelines.
- Data Pipelines: Define the data movement and transformation activities.
- Data Sets: Represent the data structure of tables or files, acting as inputs and outputs of data pipelines.
- Linked Services: Contain the connection information for data sources and destinations, such as databases and storage accounts.
- Integration Runtime (IR): The compute infrastructure used by Azure Data Factory to provide various data integration capabilities. There are three main types:
- Self-hosted Integration Runtime: Installed on an on-premises machine, facilitating connections and data movement between on-premises data sources and cloud services.
- Azure Integration Runtime: Managed by Azure, designed for cloud data integration, and connecting various Azure services.
- SSIS Integration Runtime: Used for executing SQL Server Integration Services (SSIS) packages in Azure, facilitating ETL (Extract, Transform, Load) processes.
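These components fit together through simple cross-references: a dataset points at a linked service, and a pipeline's activities point at datasets. The sketch below illustrates those references as trimmed Python dictionaries mirroring the JSON definitions the service stores; the names (BlobStore, SalesCsv, SalesTable, CopySalesPipeline) are hypothetical and many required properties are omitted.

```python
# Trimmed, illustrative sketch of how ADF component definitions reference each other.
# Names are hypothetical; real definitions carry additional required properties.

# A linked service holds connection information for a data store.
linked_service = {
    "name": "BlobStore",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {"connectionString": "<connection string or key vault reference>"},
    },
}

# A dataset describes the shape and location of data and points at a linked service.
dataset = {
    "name": "SalesCsv",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {"referenceName": "BlobStore", "type": "LinkedServiceReference"},
        "typeProperties": {
            "location": {"type": "AzureBlobStorageLocation", "container": "raw", "fileName": "sales.csv"}
        },
    },
}

# A pipeline groups activities; activities consume datasets as inputs and outputs.
pipeline = {
    "name": "CopySalesPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopySales",
                "type": "Copy",
                "inputs": [{"referenceName": "SalesCsv", "type": "DatasetReference"}],
                # "SalesTable" would be a second dataset, defined like "SalesCsv".
                "outputs": [{"referenceName": "SalesTable", "type": "DatasetReference"}],
            }
        ]
    },
}
```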
Activities in Azure Data Factory
Within Azure Data Factory, activities are designed to execute various data processing tasks.
Copy Activity
The Copy Activity transfers data from a source data store to a destination (sink). Key functionalities include:
- Column Mapping: Mapping columns from the source to the destination dataset for renaming, reordering, or omitting columns.
- Data Type Conversion: Performing basic data type conversions, such as converting a string to a date.
- Data Filtering: Applying filters to select relevant rows based on specific conditions.
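As a rough illustration of how these options surface in a Copy activity definition, here is a trimmed sketch written as a Python dictionary mirroring the pipeline JSON. The dataset names (OrdersSource, OrdersSink) and column names are hypothetical; the translator section handles column mapping, while row filtering is pushed into the source query.

```python
# Hedged sketch of a Copy activity definition (Python dict mirroring the pipeline JSON).
copy_activity = {
    "name": "CopyOrders",
    "type": "Copy",
    "inputs": [{"referenceName": "OrdersSource", "type": "DatasetReference"}],
    "outputs": [{"referenceName": "OrdersSink", "type": "DatasetReference"}],
    "typeProperties": {
        # Row filtering can be pushed to the source, e.g. with a reader query.
        "source": {
            "type": "AzureSqlSource",
            "sqlReaderQuery": "SELECT OrderId, OrderDate, Amount FROM Orders WHERE Amount > 0",
        },
        "sink": {"type": "AzureSqlSink"},
        # Column mapping: rename, reorder, or omit columns between source and sink.
        "translator": {
            "type": "TabularTranslator",
            "mappings": [
                {"source": {"name": "OrderId"}, "sink": {"name": "order_id"}},
                {"source": {"name": "OrderDate"}, "sink": {"name": "order_date"}},
                # Leaving a source column out of the list drops it from the sink.
            ],
        },
    },
}
```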
Data Flow Activity
Enables data transformation using a visual interface. It supports extracting, transforming, and loading data from various sources to customized destinations.
Web Activity
Used to make HTTP requests to external web services, APIs, or websites. It allows data extraction, sending data to external systems, or triggering actions in other services.
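A minimal sketch of a Web activity definition, again as a Python dictionary mirroring the JSON; the endpoint URL is hypothetical, and the body shows the dynamic-expression style (system variables such as @pipeline().RunId) commonly passed to external services.

```python
# Hedged sketch of a Web activity that posts a notification to a hypothetical endpoint.
web_activity = {
    "name": "NotifyDownstream",
    "type": "WebActivity",
    "typeProperties": {
        "url": "https://example.com/api/notify",  # hypothetical endpoint
        "method": "POST",
        "headers": {"Content-Type": "application/json"},
        # String values starting with @ are evaluated as expressions at run time.
        "body": {"pipeline": "@pipeline().Pipeline", "runId": "@pipeline().RunId"},
    },
}
```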
HDInsight Spark Job Activity
Executes Apache Spark jobs on HDInsight clusters, enabling big data processing within Azure Data Factory workflows.
Databricks Notebook Activity
Executes Databricks notebooks as part of the data pipeline. Ideal for advanced data transformation, analysis, and machine learning tasks.
Stored Procedure Activity
Executes stored procedures in relational databases like Azure SQL Database. It facilitates database actions but does not produce outputs for subsequent activities.
Lookup Activity
Queries and retrieves data from datasets such as databases or files. Returns up to 5000 rows or 4 MB of data for use in pipeline logic.
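A hedged sketch of a Lookup activity reading a hypothetical control table, followed by the expression style downstream activities typically use to consume its output.

```python
# Hedged sketch of a Lookup activity definition (Python dict mirroring the pipeline JSON).
lookup_activity = {
    "name": "LookupWatermark",
    "type": "Lookup",
    "typeProperties": {
        "source": {
            "type": "AzureSqlSource",
            "sqlReaderQuery": "SELECT TableName, WatermarkValue FROM dbo.Watermarks",
        },
        "dataset": {"referenceName": "ControlDb", "type": "DatasetReference"},
        "firstRowOnly": False,  # False returns the full result set, subject to the limits above
    },
}

# Downstream activities reference the result with expressions such as:
#   @activity('LookupWatermark').output.value                      -> row array (firstRowOnly = false)
#   @activity('LookupWatermark').output.firstRow.WatermarkValue    -> single field (firstRowOnly = true)
```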
If Condition Activity
Implements conditional logic within pipelines. Executes one set of activities if a condition is true and another if false.
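A minimal sketch of an If Condition definition; the activity names are hypothetical, the condition reuses the Lookup output from the earlier sketch, and the expression must evaluate to a boolean.

```python
# Hedged sketch of an If Condition activity (Python dict mirroring the pipeline JSON).
if_activity = {
    "name": "CheckRowCount",
    "type": "IfCondition",
    "typeProperties": {
        "expression": {
            "value": "@greater(activity('LookupWatermark').output.count, 0)",
            "type": "Expression",
        },
        "ifTrueActivities": [
            {"name": "WaitBeforeLoad", "type": "Wait", "typeProperties": {"waitTimeInSeconds": 1}}
        ],
        "ifFalseActivities": [],
    },
}
```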
ForEach Activity
Iterates over a collection of items, executing the same set of actions for each. Useful for repetitive tasks like processing multiple files.
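A hedged ForEach sketch that iterates over the rows returned by the earlier hypothetical Lookup activity; inside the loop, @item() refers to the current element.

```python
# Hedged sketch of a ForEach activity (Python dict mirroring the pipeline JSON).
foreach_activity = {
    "name": "ForEachTable",
    "type": "ForEach",
    "typeProperties": {
        "isSequential": False,  # run iterations in parallel
        "batchCount": 10,       # maximum number of parallel iterations
        "items": {"value": "@activity('LookupWatermark').output.value", "type": "Expression"},
        "activities": [
            # Inside the loop, @item() is the current element,
            # e.g. @item().TableName for that row's TableName column.
            {"name": "ProcessOneTable", "type": "Wait", "typeProperties": {"waitTimeInSeconds": 1}}
        ],
    },
}
```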
Delete Activity
Removes files or objects from a data store for data cleanup or maintenance tasks.
Wait Activity
Introduces a delay in the pipeline, useful for scheduling or synchronizing activities.
Get Metadata Activity
Retrieves metadata (e.g., file names, sizes, modification dates) from datasets or files, enabling data-driven decisions in pipelines.
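A trimmed Get Metadata sketch; the dataset name is hypothetical, and fieldList selects which metadata fields to return.

```python
# Hedged sketch of a Get Metadata activity listing folder contents and modification time.
get_metadata_activity = {
    "name": "GetFolderMetadata",
    "type": "GetMetadata",
    "typeProperties": {
        "dataset": {"referenceName": "LandingFolder", "type": "DatasetReference"},
        "fieldList": ["childItems", "lastModified"],  # which metadata fields to return
    },
}

# A downstream ForEach could then iterate over
#   @activity('GetFolderMetadata').output.childItems
```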
Set Variable Activity
Assigns values to variables for use in subsequent activities or expressions within the pipeline.
Switch Activity
Implements conditional branching. Directs the workflow to activities based on matching conditions. If no conditions match, executes default activities (if defined).
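A minimal Switch sketch branching on a hypothetical pipeline parameter; each case holds its own activity list, and defaultActivities runs when nothing matches.

```python
# Hedged sketch of a Switch activity (Python dict mirroring the pipeline JSON).
switch_activity = {
    "name": "SwitchOnEnvironment",
    "type": "Switch",
    "typeProperties": {
        "on": {"value": "@pipeline().parameters.environment", "type": "Expression"},
        "cases": [
            {"value": "dev",
             "activities": [{"name": "WaitDev", "type": "Wait",
                             "typeProperties": {"waitTimeInSeconds": 1}}]},
            {"value": "prod",
             "activities": [{"name": "WaitProd", "type": "Wait",
                             "typeProperties": {"waitTimeInSeconds": 1}}]},
        ],
        "defaultActivities": [],  # runs when no case matches; leave empty for a no-op
    },
}
```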
Transformations in Azure Data Factory
Azure Data Factory Data Flow provides several transformations for data processing. Here's a short list of some common transformations and their usages:
Source Transformation
A source transformation configures the data source for the data flow. Every data flow requires at least one source transformation, but multiple sources can be added and combined using transformations like join, lookup, or union.
- To add a source, select the "Add Source" box in the data flow canvas.
- Each source transformation is associated with one dataset or linked service, defining the shape and location of the data.
- File-based datasets can use wildcards and file lists to process multiple files simultaneously.
Select Transformation
Use the Select Transformation to rename, drop, or reorder columns. This transformation doesn’t modify row data but controls which columns are propagated downstream.
- Supports fixed mappings, rule-based mapping, and auto-mapping.
- Columns not matching the defined mappings are dropped automatically.
Derived Column Transformation
Use the Derived Column Transformation to create new columns or modify existing ones.
- Create or update columns based on rules or column patterns.
- Column patterns match columns using metadata and apply transformations dynamically.
Filter Transformation
The Filter Transformation filters rows based on a condition, including only those that match. It functions similarly to a WHERE clause in SQL.
Aggregate Transformation
The Aggregate Transformation groups data and performs calculations such as SUM, MIN, MAX, AVG, and COUNT.
- Use the Expression Builder to define aggregation rules.
- Group by existing or computed columns for flexible aggregation.
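Because mapping data flows execute on Spark, the Aggregate transformation's behavior can be pictured in PySpark terms. The snippet below is an equivalence illustration with made-up columns, not ADF's own data flow syntax.

```python
# Aggregate transformation semantics sketched in PySpark (mapping data flows run on Spark).
# Equivalence illustration only; column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sales = spark.createDataFrame(
    [("EU", 100.0), ("EU", 250.0), ("US", 75.0)],
    ["region", "amount"],
)

# Group by 'region' and compute SUM/AVG/COUNT, as an Aggregate transformation would.
aggregated = sales.groupBy("region").agg(
    F.sum("amount").alias("total_amount"),
    F.avg("amount").alias("avg_amount"),
    F.count("*").alias("row_count"),
)
aggregated.show()
```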
Join Transformation
The Join Transformation combines data from two sources or streams. Various join types are supported:
- Inner Join: Includes rows with matching values in both sources.
- Left Outer Join: Includes all rows from the left source and matched rows from the right, with unmatched right rows set to NULL.
- Right Outer Join: Includes all rows from the right source and matched rows from the left, with unmatched left rows set to NULL.
- Full Outer Join: Includes all rows and columns from both sources, with unmatched values set to NULL.
- Custom Cross Join: Combines streams based on a custom condition or expression. Non-equi joins and Cartesian products are supported with careful optimization to avoid performance issues.
Note: Custom cross joins should include columns from both sources. Using static values may lead to inefficient full dataset scans.
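For intuition, the same join types can be sketched in PySpark (the engine mapping data flows run on); this is an equivalence illustration with hypothetical data, not ADF data flow syntax.

```python
# Join transformation semantics sketched in PySpark; not ADF data flow script.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
orders = spark.createDataFrame([(1, "A"), (2, "B"), (3, "C")], ["order_id", "customer"])
payments = spark.createDataFrame([(1, 9.99), (3, 4.50), (4, 1.00)], ["order_id", "paid"])

inner = orders.join(payments, on="order_id", how="inner")   # matching rows only
left  = orders.join(payments, on="order_id", how="left")    # unmatched right side -> NULL
right = orders.join(payments, on="order_id", how="right")   # unmatched left side -> NULL
full  = orders.join(payments, on="order_id", how="full")    # all rows, NULLs where unmatched
cross = orders.crossJoin(payments)                          # Cartesian product (custom cross join)
full.show()
```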
Flowlet Transformation
Use the Flowlet transformation to run a previously created mapping data flow flowlet.
Flatten Transformation
Use the Flatten transformation to take array values inside hierarchical structures such as JSON and unroll them into individual rows. This process is known as denormalization.
Alter Row Transformation
The Alter Row transformation in Azure Data Factory allows defining row-level policies for insert, update, delete, and upsert operations based on specified conditions. It is particularly useful for handling incremental data loading and performing complex data transformations.
External Call Transformation
The External Call transformation enables data engineers to call out to external REST endpoints row-by-row to add custom or third-party results into data flow streams.
Exists Transformation
The Exists transformation is used to filter rows in a data flow based on matching rows in another dataset. It essentially checks if a row in the source dataset exists in the target dataset based on specified conditions. It is commonly used for scenarios like upsert operations, where you need to determine if a record already exists in the target dataset before deciding whether to update or insert the record.
Cast Transformation
Use the Cast transformation to easily modify the data types of individual columns in a data flow. The Cast transformation also enables an easy way to check for casting errors.
Sort Transformation
The Sort transformation allows sorting the rows in the data stream. Individual columns can be selected and sorted in ascending or descending order.
Conditional Split Transformation
This transformation directs data rows to different outputs based on specified conditions. It evaluates expressions and routes the data to the appropriate stream according to the results of these evaluations.
Lookup Transformation
The Lookup Transformation is used to look up data from another source within a data flow stream and return results based on a specified condition. It behaves similarly to a left outer join, where all rows from the primary stream are included in the output stream, along with additional columns from the lookup stream.
Pivot Transformation
The Pivot transformation in Azure Data Factory is used to convert rows into multiple columns. It involves aggregating data by selecting group-by columns and generating pivot columns using aggregate functions.
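Again for intuition only, the group-by / pivot-column / aggregate pattern of the Pivot transformation sketched in PySpark with made-up data, not ADF's own syntax:

```python
# Pivot transformation semantics sketched in PySpark; not ADF data flow syntax.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sales = spark.createDataFrame(
    [("EU", 2023, 100.0), ("EU", 2024, 250.0), ("US", 2023, 75.0)],
    ["region", "year", "amount"],
)

# Group by 'region', turn each distinct 'year' into a column, aggregate with SUM.
pivoted = sales.groupBy("region").pivot("year").agg(F.sum("amount"))
pivoted.show()
```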
New Branch
Add a new branch to perform multiple sets of operations and transformations against the same data stream. Adding a new branch is useful when you want to use the same source for multiple sinks or for self-joining data. A new branch can be added from the transformation list, like other transformations. New Branch will only be available as an action when there's an existing transformation following the transformation you're attempting to branch.
Sink Transformation
After finishing transforming your data, write it into a destination store using the Sink transformation. Every data flow requires at least one Sink transformation, but you can write to as many sinks as necessary to complete your transformation flow. To write to additional sinks, create new streams via new branches and conditional splits. Each Sink transformation is associated with exactly one dataset object or linked service. The Sink transformation determines the shape and location of the data you want to write to.
Window Transformation
The Window transformation is where you define window-based aggregations of columns in your data streams. In the Expression Builder, you can define different types of aggregations based on data or time windows (SQL OVER clause) such as LEAD, LAG, NTILE, CUMEDIST, RANK, etc. A new field will be generated in the output that includes these aggregations. Optional group-by fields can also be included.
Rank Transformation
Use the Rank transformation to generate an ordered ranking based on sort conditions specified by the user.
Union Transformation
The Union transformation in Azure Data Factory combines multiple data streams or datasets into one unified dataset, integrating sources from different origins or existing transformations within the data flow. You can merge any number of streams in the settings table by selecting the '+' icon next to each configured row.
Unpivot Transformation
The Unpivot transformation in Azure Data Factory is utilized to convert columns into rows, serving as the inverse operation of the Pivot transformation. This transformation is frequently employed for data normalization in reporting or analytics scenarios, especially when aggregating or analyzing data distributed across multiple columns.
Surrogate Key Transformation
The Surrogate Key transformation in ADF assigns unique identifier values to records in a dataset, typically for data warehouse purposes. It helps maintain historical data, simplifies record tracking, and supports efficient data management in ETL processes.
HDInsight Spark Transformation
Execute custom Spark code for advanced data processing.
Partitions in Azure Data Factory
In Azure Data Factory, partitions are a way to divide data into smaller, more manageable chunks for processing. They are commonly used in data loading, transformation, and copying operations to improve performance and parallel processing.
Azure Data Factory offers five main partitioning options:
- Round Robin: Distributes data evenly across partitions. Use round robin when you don't have good key candidates for a solid, smart partitioning strategy. You can set the number of physical partitions.
- Hash: The service produces a hash of the selected columns to create uniform partitions, so rows with similar values fall in the same partition. When using the Hash option, test for possible partition skew. You can set the number of physical partitions.
- Dynamic Range: Uses Spark dynamic ranges based on the columns or expressions that you provide. You can set the number of physical partitions.
- Fixed Range: Builds an expression that provides a fixed range for values within the partitioned data columns. To avoid partition skew, you should have a good understanding of the data before using this option. The values entered for the expression are used as part of a partition function. You can set the number of physical partitions.
- Key: If you have a good understanding of the cardinality of your data, key partitioning might be a good strategy. Key partitioning creates a partition for each unique value in your column. You can't set the number of partitions because the number is based on unique values in the data.
These partitioning methods enable better resource utilization and faster data processing in Azure Data Factory.
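Since data flows execute on Spark, the round robin and hash options can be pictured in PySpark terms; the snippet below illustrates the underlying behavior and is not ADF's configuration syntax (the column name is hypothetical).

```python
# Partitioning behavior sketched in PySpark (data flows run on Spark); not ADF configuration syntax.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "customer_id")

round_robin = df.repartition(8)                 # Round Robin: even spread, no key needed
hashed      = df.repartition(8, "customer_id")  # Hash: equal key values land in the same partition
print(round_robin.rdd.getNumPartitions(), hashed.rdd.getNumPartitions())
```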