In this example, we are going to copy the themes.csv file from Rebrickable into a blob container called lego in our Azure Data Lake Storage Gen2 account.
From the Azure Data Factory Home page, click copy data:
[Image: Screenshot of the Home page in Azure Data Factory with the Copy Data task highlighted]
This opens the Copy Data Wizard. Let’s walk through each step!
1. Properties
On the Properties page, give the pipeline a name and description. Keep the default “run once now” option:
[Image: Screenshot of the Copy Data Wizard step 1, the properties page]
Click next to move on to the Source properties.
2. Source
On the Source page, we will first create a new linked service to Rebrickable, then create a new dataset to represent the themes.csv file.
Click create new connection:
[Image: Screenshot of the Copy Data Wizard step 2a, the source connection page]
Search for and select the HTTP Linked Service:
[Image: Screenshot of the New Linked Service pane with the HTTP Linked Service highlighted]
Give the linked service a name and description, and use the base URL cdn.rebrickable.com/media/downloads/. (You can find this URL by inspecting the links on rebrickable.com/downloads. Keep the last slash.) Change authentication type to anonymous. Click create:
[Image: Screenshot of the New Linked Service pane with the properties filled out]
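Behind the scenes, a linked service is stored as a small JSON definition. Here is a minimal sketch, built as a Python dictionary, of roughly what an anonymous HTTP linked service looks like. The name `HttpRebrickable` is made up for illustration, and the `https://` scheme is an assumption, since only the host and path are given above. Treat the field names as an approximation of what you would see in the code view, not as an authoritative reference.

```python
import json

# Hypothetical name: the post does not say what the linked service is called.
LINKED_SERVICE_NAME = "HttpRebrickable"

# Assumption: the base URL is served over https.
BASE_URL = "https://cdn.rebrickable.com/media/downloads/"


def http_linked_service(name: str, base_url: str) -> dict:
    """Build a dict shaped like an anonymous HTTP linked service definition."""
    return {
        "name": name,
        "properties": {
            "type": "HttpServer",
            "typeProperties": {
                "url": base_url,
                "authenticationType": "Anonymous",
            },
        },
    }


definition = http_linked_service(LINKED_SERVICE_NAME, BASE_URL)
print(json.dumps(definition, indent=2))
```

The wizard fills in the same kind of information for us when we type the name and base URL and pick anonymous authentication.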
The linked service has now been created, yay! Make sure it’s selected and click next to move on to the dataset properties:
[Image: Screenshot of the Copy Data Wizard step 2a, the source connection page, with the new HTTP linked service highlighted]
Since we specified the base URL in the Linked Service, we only have to specify the file name themes.csv.gz in the relative URL. Keep the other default options. Click next:
[Image: Screenshot of the Copy Data Wizard step 2b, the source dataset properties page]
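Why keep the last slash on the base URL? A quick sketch with Python's standard `urljoin` shows what standard URL resolution does with and without it. This illustrates general URL rules only. Azure Data Factory may combine the base URL and relative URL differently, and the `https://` scheme is an assumption.

```python
from urllib.parse import urljoin

# Assumption: https scheme. Host and path come from the linked service step.
BASE_WITH_SLASH = "https://cdn.rebrickable.com/media/downloads/"
BASE_WITHOUT_SLASH = "https://cdn.rebrickable.com/media/downloads"
RELATIVE_URL = "themes.csv.gz"

# With the trailing slash, the file name is appended to the folder.
with_slash = urljoin(BASE_WITH_SLASH, RELATIVE_URL)

# Without it, the last path segment ("downloads") is replaced by the file name.
without_slash = urljoin(BASE_WITHOUT_SLASH, RELATIVE_URL)

print(with_slash)     # https://cdn.rebrickable.com/media/downloads/themes.csv.gz
print(without_slash)  # https://cdn.rebrickable.com/media/themes.csv.gz
```

Keeping the slash means the relative URL can stay a plain file name.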
This next part feels kind of like magic, especially if you have been working with SQL Server Integration Services (SSIS) in the past. The Copy Data Wizard now inspects the file and tries to figure out the file format for us. But… since we are working with a gzipped file, it doesn’t make a whole lot of sense yet…
[Image: Screenshot of the Copy Data Wizard step 2b, the source dataset file format settings page, highlighting the data preview showing scrambled text because the source file is gzipped]
Let’s fix that! Change the compression type to gzip. Tadaaa! Magic! Without us doing anything else manually, the copy data wizard unzips the CSV file for us and shows us a preview of the content:
[Image: Screenshot of the Copy Data Wizard step 2b, the source dataset file format settings page, highlighting the compression type and compression level]
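To see why the first preview looked scrambled, here is a minimal sketch using Python's `gzip` module. The sample rows and column names are made up for illustration and are not the real content of themes.csv. The point is only that gzipped bytes are not readable text until they are decompressed.

```python
import gzip

# Made-up sample data (hypothetical columns and rows), standing in for the CSV file.
csv_text = "id,name,parent_id\n1,Example Theme,\n2,Example Subtheme,1\n"

# This is what the "file" looks like on the wire: compressed bytes.
compressed = gzip.compress(csv_text.encode("utf-8"))

# Gzip data starts with the magic bytes 0x1f 0x8b, which is not printable text.
is_gzip = compressed[:2] == b"\x1f\x8b"

# Decompressing gives us back exactly the CSV text we started with.
restored = gzip.decompress(compressed).decode("utf-8")

print(is_gzip)
print(restored)
```

Setting the compression type to gzip tells the wizard to do the equivalent of that decompress step before it tries to read the rows.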
If you are working with a raw CSV file, the copy data wizard can detect the file format, the delimiter, and even that we have headers in the first row. But since we are working with a gzipped file, we have to configure these settings manually. Choose first row as header:
[Image: Screenshot of the Copy Data Wizard step 2b, the source dataset file format settings page, highlighting the first row as header setting]
If the headers are not detected correctly on the first attempt, try clicking detect text format again:
[Image: Screenshot of the Copy Data Wizard step 2b, the source dataset file format settings page, highlighting the detect text format button]
You can now preview the schema inside the gzipped file. Beautiful! :D
[Image: Screenshot of the Copy Data Wizard step 2b, the source dataset file format settings page, highlighting the detected schema preview]
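What does "first row as header" actually change? A small sketch with Python's `csv` module shows the difference. The sample rows and column names are made up for illustration. This is a conceptual comparison, not how the wizard is implemented.

```python
import csv
import io

# Made-up sample data (hypothetical columns and rows).
csv_text = "id,name,parent_id\n1,Example Theme,\n2,Example Subtheme,1\n"

# Without the setting: every line is data, including the header line.
rows_without_header = list(csv.reader(io.StringIO(csv_text)))

# With the setting: the first line names the columns, the rest is data.
rows_with_header = list(csv.DictReader(io.StringIO(csv_text)))

print(len(rows_without_header))  # 3
print(len(rows_with_header))     # 2
print(rows_with_header[0]["name"])
```

With the header row recognized, the schema preview can show real column names instead of treating them as one more row of data.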
Click next to move on to the Destination properties.
3. Destination
On the Destination page, we will first create a new linked service to our Azure Data Lake Storage Gen2 account, then create a new dataset to represent the themes.csv file in the destination.
Click create new connection:
[Image: Screenshot of the Copy Data Wizard step 3a, the destination connection page]
Select the Azure Data Lake Storage Gen2 linked service:
[Image: Screenshot of the New Linked Service pane with the Azure Data Lake Storage Gen2 Linked Service highlighted]
Give the linked service a name and description. Select your storage account name from the dropdown list. Test the connection. Click create:
[Image: Screenshot of the New Linked Service pane with the properties filled out]
The second linked service has now been created, yay! Make sure it’s selected, and click next to move on to the dataset properties:
[Image: Screenshot of the Copy Data Wizard step 3a, the destination connection page, with the new Azure Data Lake Storage Gen2 linked service highlighted]
Specify lego as the folder path, and themes.csv as the file name. Keep the other default options. Click next:
[Image: Screenshot of the Copy Data Wizard step 3b, the destination dataset properties page]
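The folder path and file name together decide where the file lands. As a rough sketch, assuming the first segment of the folder path is the container (file system) and the rest is a directory inside it, we can work out the resulting location ourselves. The storage account name `mystorageaccount` and both helper functions are hypothetical.

```python
from typing import Tuple


def split_folder_path(folder_path: str) -> Tuple[str, str]:
    """Split 'container/dir/subdir' into ('container', 'dir/subdir')."""
    container, _, directory = folder_path.strip("/").partition("/")
    return container, directory


def abfss_uri(account: str, folder_path: str, file_name: str) -> str:
    """Build an abfss:// URI for a file in an Azure Data Lake Storage Gen2 account."""
    container, directory = split_folder_path(folder_path)
    # Skip the directory part when the folder path is just the container.
    path = "/".join(part for part in (directory, file_name) if part)
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path}"


# Hypothetical account name, with the folder path and file name from this step.
destination = abfss_uri("mystorageaccount", "lego", "themes.csv")
print(destination)
```

The same split applies to longer folder paths: `split_folder_path("lego/errors/themes")` gives the container `lego` and the directory `errors/themes`.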
Enable add header to file and keep the other default options:
[Image: Screenshot of the Copy Data Wizard step 3b, the destination dataset file format settings page with the add header to file setting enabled and highlighted]
Click next to move on to the Settings.
4. Settings
On the Settings page, we will configure the fault tolerance settings. This is another part that feels like magic. By changing a setting, we can enable automatic handling and logging of rows with errors. Whaaat! :D In SQL Server Integration Services (SSIS), this had to be handled manually. In Azure Data Factory, you literally just enable it and specify the settings. MAGIC! :D
Change the fault tolerance settings to skip and log incompatible rows:
[Image: Screenshot of the Copy Data Wizard step 4, the settings page, with the fault tolerance dropdown showing the option to skip and log incompatible rows]
At this time, error logging can only be done to Azure Blob Storage. Aha! So that’s why we created two storage accounts earlier ;) Click new:
[Image: Screenshot of the Copy Data Wizard step 4, the settings page, with the New connection button highlighted]
The Copy Data Wizard is even smart enough to figure out that it needs to create an Azure Blob Storage connection. Good Copy Data Wizard :D Give the linked service a name and description. Select your storage account name from the dropdown list. Test the connection. Click create:
[Image: Screenshot of the New Linked Service pane, with the Azure Blob Storage type highlighted]
Specify lego/errors/themes as the folder path:
[Image: Screenshot of the Copy Data Wizard step 4, the settings page, with all the properties filled out]
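If you are curious what "skip and log incompatible rows" means in practice, here is a conceptual sketch in plain Python. It is not how Azure Data Factory implements fault tolerance, and its rules for what counts as an incompatible row are its own. The function names, sample rows, and column names are all made up for illustration.

```python
import csv
import io


def convert_theme(row: dict) -> dict:
    """Convert one CSV row to typed values; raises ValueError on bad numbers."""
    return {
        "id": int(row["id"]),
        "name": row["name"],
        "parent_id": int(row["parent_id"]) if row["parent_id"] else None,
    }


def copy_with_fault_tolerance(source_text: str, convert):
    """Copy good rows, and skip and log rows that fail conversion."""
    good_rows, error_log = [], []
    reader = csv.DictReader(io.StringIO(source_text))
    # Line 1 is the header, so the first data row is line 2.
    for line_number, row in enumerate(reader, start=2):
        try:
            good_rows.append(convert(row))
        except (ValueError, KeyError, TypeError) as exc:
            error_log.append({"line": line_number, "row": row, "error": str(exc)})
    return good_rows, error_log


# Made-up sample data with one deliberately broken row.
sample = (
    "id,name,parent_id\n"
    "1,Example Theme,\n"
    "not-a-number,Broken Row,\n"
    "2,Example Subtheme,1\n"
)

good, errors = copy_with_fault_tolerance(sample, convert_theme)
print(len(good), "rows copied,", len(errors), "rows skipped")
for entry in errors:
    print("line", entry["line"], "->", entry["error"])
```

The copy keeps going when it hits a bad row, and the error log tells us afterwards which rows were left out and why. In the wizard, the folder path we just specified is where that log ends up.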
Click next to move on to the Summary.
5. Summary
On the Summary page, you will see a pretty graphic illustrating that you are copying data from an HTTP source to an Azure Data Lake Storage Gen2 destination:
[Image: Screenshot of the Copy Data Wizard step 5, the summary page]
Click next to move on to Deployment.
6. Deployment
The final step, Deployment, will create the datasets and pipeline. Since we chose the “run once now” setting in the Properties step, the pipeline will be executed immediately after deployment:
[Image: Screenshot of the Copy Data Wizard step 6, the deployment page, with the deployment in progress]
Once the deployment is complete, we can open the pipeline on the Author page, or view the execution on the Monitor page. Click monitor:
[Image: Screenshot of the Copy Data Wizard step 6, the deployment page, with the deployment completed]
Success! ✔🥳 Our pipeline executed successfully.
[Image: Screenshot of the Monitor page in Azure Data Factory, with the successful pipeline run highlighted]
We can now open Azure Storage Explorer and verify that the file has been copied from Rebrickable:
[Image: Screenshot of Azure Storage Explorer showing a new lego container with the themes.csv file in it]
Summary
In this post, the Copy Data Wizard created all the factory resources for us: one pipeline with a copy data activity, two datasets, and two linked services. This guided experience is a great way to get started with Azure Data Factory.
Next, we will go through each of these factory resources in more detail, and look at how to create them from the Author page instead of through the Copy Data Wizard. First, let’s look at pipelines!