Wednesday, August 12, 2020

Overview of Azure Data Factory Components

Azure Data Factory Components

On the left side of the Author page, you will see your factory resources. In this example, we have already created one pipeline, two datasets, and one data flow:

Screenshot of the Author page in Azure Data Factory, with one Pipeline, two Datasets, and one Data Flow already created

Let’s go through each of these Azure Data Factory components and explain what they are and what they do.

Pipelines

Pipelines are the things you execute or run in Azure Data Factory, similar to packages in SQL Server Integration Services (SSIS). This is where you define your workflow: what you want to do and in which order. For example, a pipeline can first copy data from an on-premises data center to Azure Data Lake Storage, and then transform the data from Azure Data Lake Storage into Azure Synapse Analytics (previously Azure SQL Data Warehouse).
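
If you are curious what a pipeline looks like outside the graphical interface, here is a minimal sketch of defining and running one with the azure-mgmt-datafactory Python SDK. The subscription, resource group, factory, and dataset names are hypothetical placeholders, the referenced datasets and linked services are assumed to already exist, and exact model signatures can vary between SDK versions.

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource, CopyActivity, DatasetReference, BlobSource, BlobSink
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# A single Copy activity that moves data from a source dataset to a sink dataset
copy_activity = CopyActivity(
    name="CopyRawData",
    inputs=[DatasetReference(reference_name="SourceDataset")],
    outputs=[DatasetReference(reference_name="SinkDataset")],
    source=BlobSource(),
    sink=BlobSink(),
)

# The pipeline is the ordered collection of activities...
pipeline = PipelineResource(activities=[copy_activity])
adf_client.pipelines.create_or_update("my-rg", "my-data-factory", "CopyPipeline", pipeline)

# ...and the pipeline is also the thing you execute
run = adf_client.pipelines.create_run("my-rg", "my-data-factory", "CopyPipeline")
print(run.run_id)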

Screenshot of the Author page in Azure Data Factory, with a Pipeline open in the user interface

When you open a pipeline, you will see the pipeline authoring interface. On the left side, you will see a list of all the activities you can add to the pipeline. On the right side, you will see the design canvas with the properties panel underneath it.

Activities

Activities are the individual steps inside a pipeline, where each activity performs a single task. You can chain activities or run them in parallel. Activities can either control the flow inside a pipeline, move or transform data, or perform external tasks using services outside of Azure Data Factory.

Screenshot of the Author page in Azure Data Factory, with a Pipeline open and the Activities highlighted

You add an activity to a pipeline by dragging it onto the design canvas. When you click on an activity, it will be highlighted, and you will see the activity properties in the properties panel. These properties will be different for each type of activity.
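
If you prefer to see chaining expressed outside the design canvas, here is a minimal sketch using the azure-mgmt-datafactory Python SDK. The Wait activities are only there to keep the example short, the names are made up, and an activity without depends_on simply runs in parallel with the others.

from azure.mgmt.datafactory.models import PipelineResource, WaitActivity, ActivityDependency

step_1 = WaitActivity(name="Step1", wait_time_in_seconds=1)
step_2 = WaitActivity(
    name="Step2",
    wait_time_in_seconds=1,
    # Step2 only starts after Step1 finishes successfully
    depends_on=[ActivityDependency(activity="Step1", dependency_conditions=["Succeeded"])],
)

pipeline = PipelineResource(activities=[step_1, step_2])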

Data Flows

Data Flows are a special type of activity for creating visual data transformations without having to write any code. There are two types of data flows: mapping and wrangling.

Screenshot of the Author page in Azure Data Factory, with a Mapping Data Flow open

Datasets

If you are moving or transforming data, you need to specify the format and location of the input and output data. Datasets are like named views that represent a database table, a single file, or a folder.
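
As a rough sketch of what such a "named view" looks like outside the user interface, this is a dataset built with the azure-mgmt-datafactory Python SDK; the linked service name, folder, and file are placeholders, and signatures can vary between SDK versions.

from azure.mgmt.datafactory.models import (
    DatasetResource, AzureBlobDataset, LinkedServiceReference
)

themes_dataset = DatasetResource(
    properties=AzureBlobDataset(
        linked_service_name=LinkedServiceReference(reference_name="ABLB_mystorage"),
        folder_path="lego",       # the folder the dataset points to
        file_name="themes.csv",   # a single file; omit this to represent the whole folder
    )
)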

Screenshot of the Author page in Azure Data Factory, with a Dataset open

Linked Services

Linked Services are like connection strings. They define the connection information for data sources and services, as well as how to authenticate to them.
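
For example, this is roughly what a storage linked service looks like when defined with the azure-mgmt-datafactory Python SDK instead of through the user interface; the connection string values are placeholders.

from azure.mgmt.datafactory.models import (
    LinkedServiceResource, AzureStorageLinkedService, SecureString
)

storage_linked_service = LinkedServiceResource(
    properties=AzureStorageLinkedService(
        connection_string=SecureString(
            value="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
        )
    )
)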

Screenshot of the Author page in Azure Data Factory, with Connections open and Linked Services highlighted

Integration Runtimes

Integration runtimes specify the infrastructure to run activities on. You can create three types of integration runtimes: Azure, Self-Hosted, and Azure-SSIS. Azure integration runtimes use infrastructure and hardware managed by Microsoft. Self-Hosted integration runtimes use hardware and infrastructure managed by you, so you can execute activities on your local servers and in your own data centers. Azure-SSIS integration runtimes are clusters of Azure virtual machines running the SQL Server Integration Services (SSIS) engine, used for executing SSIS packages in Azure Data Factory.
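
A linked service chooses which integration runtime to connect through. As a hedged sketch with the azure-mgmt-datafactory Python SDK, this is how a SQL Server linked service could be pointed at a hypothetical Self-Hosted integration runtime named OnPremIR; names and the connection string are placeholders.

from azure.mgmt.datafactory.models import (
    LinkedServiceResource, SqlServerLinkedService, SecureString, IntegrationRuntimeReference
)

onprem_sql = LinkedServiceResource(
    properties=SqlServerLinkedService(
        connection_string=SecureString(value="Server=onprem-sql01;Database=Lego;Integrated Security=True"),
        # Route the connection through the Self-Hosted integration runtime
        connect_via=IntegrationRuntimeReference(reference_name="OnPremIR"),
    )
)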

Screenshot of the Author page in Azure Data Factory, with Connections open and Integration Runtimes highlighted

Triggers

Triggers determine when to execute a pipeline. You can execute a pipeline on a wall-clock schedule, in a periodic interval, or when an event happens.
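
As an illustration, here is a hedged sketch of an hourly schedule trigger defined with the azure-mgmt-datafactory Python SDK; the pipeline name and start time are placeholders, and model names may differ slightly between SDK versions.

from datetime import datetime, timezone
from azure.mgmt.datafactory.models import (
    TriggerResource, ScheduleTrigger, ScheduleTriggerRecurrence,
    TriggerPipelineReference, PipelineReference
)

hourly_trigger = TriggerResource(
    properties=ScheduleTrigger(
        recurrence=ScheduleTriggerRecurrence(
            frequency="Hour",   # run once every hour
            interval=1,
            start_time=datetime(2020, 8, 12, tzinfo=timezone.utc),
        ),
        pipelines=[
            TriggerPipelineReference(
                pipeline_reference=PipelineReference(reference_name="CopyPipeline")
            )
        ],
    )
)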

Screenshot of the Author page in Azure Data Factory, with Triggers open

Templates

Finally, if you don’t want to create all your pipelines from scratch, you can use the pre-defined templates by Microsoft, or create custom templates.

Screenshot of the Author page in Azure Data Factory, with Templates open

Summary

In this post, we went through the Author page in more detail and looked at the different Azure Data Factory components. I like to illustrate and summarize these in a slightly different way:

Illustration of all the Azure Data Factory components and how they relate to each other

You create pipelines to execute one or more activities. If an activity moves or transforms data, you define the input and output format in datasets. Then, you connect to the data sources or services through linked services. You can specify the infrastructure and location where you want to execute the activities by creating integration runtimes. After you have created a pipeline, you can add triggers to automatically execute it at specific times or based on events. Finally, if you don’t want to create your pipelines from scratch, you can start from pre-defined or custom templates.

Sunday, April 19, 2020

Squash commits into one with Git

In Git you can merge several commits into one with the powerful interactive rebase. It's a handy tool I use quite often; I usually tidy up my working space by grouping together several small intermediate commits into a single lump to push upstream.

Step 1: choose your starting commit

The first thing to do is to invoke git to start an interactive rebase session:
git rebase --interactive HEAD~N

Or, shorter:

git rebase -i HEAD~N

where N is the number of commits you want to join, starting from the most recent one. For example, this is a hypothetical list of commits taken from the git log command, while I'm working on a generic feature Z:

871adf OK, feature Z is fully implemented      --- newer commit
0c3317 Whoops, not yet...
87871a I'm ready!
643d0e Code cleanup
afb581 Fix this and that
4e9baa Cool implementation
d94e78 Prepare the workbench for feature Z
6394dc Feature Y                               --- older commit
And this is what I would like to do:
871adf OK, feature Z is fully implemented      --- newer commit --┐
0c3317 Whoops, not yet...                                         |
87871a I'm ready!                                                 |
643d0e Code cleanup                                               |-- Join these into one
afb581 Fix this and that                                          |
4e9baa Cool implementation                                        |
d94e78 Prepare the workbench for feature Z     -------------------┘
6394dc Feature Y                               --- older commit

Obtaining:
84d1f8 Feature Z                               --- newer commit (result of rebase)
6394dc Feature Y                               --- older commit

Notice how a rebase generates a new commit with a new hash (84d1f8 in the example above). So in this case the command would be:

git rebase --interactive HEAD~7

because I want to combine the last seven commits into one, and d94e78 Prepare the workbench for feature Z is the seventh one.

I have tons of commits to squash, do I have to count them one by one?

A downside of the git rebase --interactive HEAD~N command is that you have to work out the exact number of commits by counting them one by one. Luckily, there is another way:

git rebase --interactive [commit-hash]

Where [commit-hash] is the hash of the commit just before the first one you want to squash. So in my example the command would be:

git rebase --interactive 6394dc
Where 6394dc is Feature Y. You can read the whole thing as:

Merge all my commits on top of commit [commit-hash].

Step 2: picking and squashing

At this point your editor of choice will pop up, showing the list of commits you want to merge. It might be confusing at first, since they are displayed in reverse order compared to git log: the oldest commit is on top. I've added --- older commit and --- newer commit to make it clear; you won't find those notes in the editor.

pick d94e78 Prepare the workbench for feature Z     --- older commit
pick 4e9baa Cool implementation 
pick afb581 Fix this and that  
pick 643d0e Code cleanup
pick 87871a I'm ready! 
pick 0c3317 Whoops, not yet... 
pick 871adf OK, feature Z is fully implemented      --- newer commit

[...]


Below the commit list there is a short comment (omitted in my example) which outlines all the operations available. You can do many smart tricks during an interactive rebase, but let's stick with the basics for now. Our task here is to mark all the commits as squashable, except the first (oldest) one: it will be used as the starting point.

You mark a commit as squashable by changing the word pick into squash next to it (or s for brevity, as stated in the comments). The result would be:

pick d94e78 Prepare the workbench for feature Z     --- older commit
s 4e9baa Cool implementation 
s afb581 Fix this and that  
s 643d0e Code cleanup
s 87871a I'm ready! 
s 0c3317 Whoops, not yet... 
s 871adf OK, feature Z is fully implemented      --- newer commit

[...]

Save the file and close the editor.
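
Once you do, Git squashes the marked commits into the one above them and opens the editor one more time so you can write the message for the new, combined commit (it is pre-filled with all the individual commit messages). You can then verify the result with:

git log --oneline

which should show a single new commit (with a brand new hash, like 84d1f8 in the example above) sitting directly on top of 6394dc Feature Y.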

Wednesday, April 1, 2020

Copy Data Wizard in Azure Data Factory

In this example, we are going to copy the themes.csv file from Rebrickable into a blob container called lego in our Azure Data Lake Storage Gen2 account.
From the Azure Data Factory Home page, click copy data:
Screenshot of the Home page in Azure Data Factory with the Copy Data task highlighted
This opens the Copy Data Wizard. Let’s walk through each step!

1. Properties

On the Properties page, give the pipeline a name and description. Keep the default “run once now” option:
Screenshot of the Copy Data Wizard step 1, the properties page
I chose the name Lego_HTTP_to_ADLS_Themes. This tells me that I’m copying data from a website (HTTP) into Azure Data Lake Storage Gen2 (ADLS). This name makes sense in my head, but you need to find a naming convention that works for you :)
Click next to move on to the Source properties.

2. Source

On the Source page, we will first create a new linked service to Rebrickable, then create a new dataset to represent the themes.csv file.
Click create new connection:
Screenshot of the Copy Data Wizard step 2a, the source connection page
Search and select the HTTP Linked Service:
Screenshot of the New Linked Service pane with the HTTP Linked Service highlighted
Give the linked service a name and description, and use the base URL cdn.rebrickable.com/media/downloads/. (You can find this URL by inspecting the links on rebrickable.com/downloads. Keep the last slash.) Change authentication type to anonymous. Click create:
Screenshot of the New Linked Service pane with the properties filled out
I chose the name HTTP_Lego, because I like to prefix my linked services with the connection type.
The linked service has now been created, yay! Make sure it’s selected and click next to move on to the dataset properties:
Screenshot of the Copy Data Wizard step 2a, the source connection page, with the new HTTP linked service highlighted
Since we specified the base URL in the Linked Service, we only have to specify the file name themes.csv.gz in the relative URL. Keep the other default options. Click next:
Screenshot of the Copy Data Wizard step 2b, the source dataset properties page
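As a side note, the base URL from the linked service and the relative URL from the dataset are combined into the full address of the file, which is why keeping that trailing slash matters. A tiny Python illustration (the https:// scheme is assumed):

base_url = "https://cdn.rebrickable.com/media/downloads/"   # from the linked service, note the trailing slash
relative_url = "themes.csv.gz"                              # from the dataset
print(base_url + relative_url)
# https://cdn.rebrickable.com/media/downloads/themes.csv.gz
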
This next part feels kind of like magic, especially if you have been working with SQL Server Integration Services (SSIS) in the past. The Copy Data Wizard now inspects the file and tries to figure out the file format for us. But… since we are working with a gzipped file, it doesn’t make a whole lot of sense yet…
Screenshot of the Copy Data Wizard step 2b, the source dataset file format settings page, highlighting the data preview showing scrambled text because the source file is gzipped
Let’s fix that! Change the compression type to gzip. Tadaaa! Magic! Without us doing anything else manually, the copy data wizard unzips the CSV file for us and shows us a preview of the content:
Screenshot of the Copy Data Wizard step 2b, the source dataset file format settings page, highlighting the compression type and compression level
If you are working with a raw CSV file, the copy data wizard can detect the file format, the delimiter, and even that we have headers in the first row. But since we are working with a gzipped file, we have to configure these settings manually. Choose first row as header:
Screenshot of the Copy Data Wizard step 2b, the source dataset file format settings page, highlighting the first row as header setting
If the headers are not detected correctly on the first attempt, try clicking detect text format again:
Screenshot of the Copy Data Wizard step 2b, the source dataset file format settings page, highlighting the detect text format button
You can now preview the schema inside the gzipped file. Beautiful! :D
Screenshot of the Copy Data Wizard step 2b, the source dataset file format settings page, highlighting the detected schema preview
Click next to move on to the Destination properties.

3. Destination

On the Destination page, we will first create a new linked service to our Azure Data Lake Storage Gen2 account, then create a new dataset to represent the themes.csv file in the destination.
Click create new connection:
Screenshot of the Copy Data Wizard step 3a, the destination connection page
Select the Azure Data Lake Storage Gen2 linked service:
Screenshot of the New Linked Service pane with the Azure Data Lake Storage Gen2 Linked Service highlighted
Give the linked service a name and description. Select your storage account name from the dropdown list. Test the connection. Click create:
Screenshot of the New Linked Service pane with the properties filled out
I chose the name ADLS_cathrinewadls, because I like to prefix my linked services with the connection type.
The second linked service has now been created, yay! Make sure it’s selected, and click next to move on to the dataset properties:
Screenshot of the Copy Data Wizard step 3a, the destination connection page, with the new Azure Data Lake Storage Gen2 linked service highlighted
Specify lego as the folder path, and themes.csv as the file name. Keep the other default options. Click next:
Screenshot of the Copy Data Wizard step 3b, the destination dataset properties page
Enable add header to file and keep the other default options:
Screenshot of the Copy Data Wizard step 3b, the destination dataset file format settings page with the add header to file setting enabled and highlighted
Click next to move on to the Settings.

4. Settings

On the Settings page, we will configure the fault tolerance settings. This is another part that feels like magic. By changing a setting, we can enable automatic handling and logging of rows with errors. Whaaat! :D In SQL Server Integration Services (SSIS), this had to be handled manually. In Azure Data Factory, you literally just enable it and specify the settings. MAGIC! :D
Change the fault tolerance settings to skip and log incompatible rows:
Screenshot of the Copy Data Wizard step 4, the settings page, with the fault tolerance dropdown showing the option to skip and log incompatible rows
At this time, error logging can only be done to Azure Blob Storage. Aha! So that’s why we created two storage accounts earlier ;) Click new:
Screenshot of the Copy Data Wizard step 4, the settings page, with the New connection button highlighted
The Copy Data Wizard is even smart enough to figure out that it needs to create an Azure Blob Storage connection. Good Copy Data Wizard :D Give the linked service a name and description. Select your storage account name from the dropdown list. Test the connection. Click create:
Screenshot of the New Linked Service pane, with the Azure Blob Storage type highlighted
I chose the name ABLB_cathrinewblob, because I like to prefix my linked services with the connection type.
Specify lego/errors/themes as the folder path:
Screenshot of the Copy Data Wizard step 4, the settings page, with all the properties filled out
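Behind the scenes, these choices become fault-tolerance settings on the copy activity itself. As a hedged sketch with the azure-mgmt-datafactory Python SDK (the dataset names are made up, and the parameter and class names mirror the activity's enableSkipIncompatibleRow and redirectIncompatibleRowSettings JSON properties, so double-check them against your SDK version):

from azure.mgmt.datafactory.models import (
    CopyActivity, DatasetReference, LinkedServiceReference,
    RedirectIncompatibleRowSettings, DelimitedTextSource, DelimitedTextSink
)

copy_themes = CopyActivity(
    name="CopyLegoThemes",
    inputs=[DatasetReference(reference_name="HTTP_Lego_Themes")],
    outputs=[DatasetReference(reference_name="ADLS_Lego_Themes")],
    source=DelimitedTextSource(),
    sink=DelimitedTextSink(),
    # Skip rows that fail instead of failing the whole copy...
    enable_skip_incompatible_row=True,
    # ...and log them to the blob storage linked service we just created
    redirect_incompatible_row_settings=RedirectIncompatibleRowSettings(
        linked_service_name=LinkedServiceReference(reference_name="ABLB_cathrinewblob"),
        path="lego/errors/themes",
    ),
)
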
Click next to move on to the Summary.

5. Summary

On the Summary page, you will see a pretty graphic illustrating that you are copying data from an HTTP source to an Azure Data Lake Storage Gen2 destination:
Screenshot of the Copy Data Wizard step 5, the summary page
Click next to move on to Deployment.

6. Deployment

The final step, Deployment, will create the datasets and pipeline. Since we chose the “run once now” setting in the Properties step, the pipeline will be executed immediately after deployment:
Screenshot of the Copy Data Wizard step 6, the deployment page, with the deployment in progress
Once the deployment is complete, we can open the pipeline on the Author page, or view the execution on the Monitor page. Click monitor:
Screenshot of the Copy Data Wizard step 6, the deployment page, with the deployment completed
Success! ✔🥳 Our pipeline executed successfully.
Screenshot of the Monitor page in Azure Data Factory, with the successful pipeline run highlighted
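If you want to check a run programmatically instead of (or in addition to) using the Monitor page, a minimal sketch with the azure-mgmt-datafactory Python SDK looks roughly like this; the subscription, resource group, factory name, and run ID are placeholders:

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

run = adf_client.pipeline_runs.get("my-rg", "my-data-factory", "<run-id>")
print(run.status)  # for example: Queued, InProgress, Succeeded, Failed
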
We can now open Azure Storage Explorer and verify that the file has been copied from Rebrickable:
Screenshot of Azure Storage Explorer showing a new lego container with the themes.csv file in it

Summary

In this post, the Copy Data Wizard created all the factory resources for us: one pipeline with a copy data activity, two datasets, and two linked services. This guided experience is a great way to get started with Azure Data Factory.
Next, we will go through each of these factory resources in more detail, and look at how to create them from the Author page instead of through the Copy Data Wizard. First, let’s look at pipelines!

