
Transforming JSON data with the help of Azure Data Factory - Part 5 - Comparing it with Databricks

(2020-July-29) There is a well-known and broadly advertised message from Microsoft that Azure Data Factory (ADF) is a code-free environment to help you create your data integration solutions - https://azure.microsoft.com/en-us/resources/videos/microsoft-azure-data-factory-code-free-cloud-data-integration-at-scale/. I agree with and support this approach of using a drag-and-drop visual UI to build and automate data pipelines without writing code. However, I'm also curious whether I can recreate certain ADF operations by writing code myself.

Previously, I wrote a blog post about using the ADF Data Flow Flatten transformation to transform a JSON file - Part 1: Transforming JSON to CSV with the help of Azure Data Factory - Mapping Data Flows


This time I would like to compare the Databricks code development experience for flattening the very same source JSON file:
[
  {
    "id": "0001",
    "type": "donut",
    "name": "Cake",
    "ppu": 0.55,
    "batters": {
      "batter": [
        { "id": "1001", "type": "Regular" },
        { "id": "1002", "type": "Chocolate" },
        { "id": "1003", "type": "Blueberry" },
        { "id": "1004", "type": "Devil's Food" }
      ]
    },
    "topping": [
      { "id": "5001", "type": "None" },
      { "id": "5002", "type": "Glazed" },
      { "id": "5005", "type": "Sugar" },
      { "id": "5007", "type": "Powdered Sugar" },
      { "id": "5006", "type": "Chocolate with Sprinkles" },
      { "id": "5003", "type": "Chocolate" },
      { "id": "5004", "type": "Maple" }
    ]
  },
  {
    "id": "0002",
    "type": "donut",
    "name": "Raised",
    "ppu": 0.55,
    "batters": {
      "batter": [
        { "id": "1001", "type": "Regular" }
      ]
    },
    "topping": [
      { "id": "5001", "type": "None" },
      { "id": "5002", "type": "Glazed" },
      { "id": "5005", "type": "Sugar" },
      { "id": "5003", "type": "Chocolate" },
      { "id": "5004", "type": "Maple" }
    ]
  },
  {
    "id": "0003",
    "type": "donut",
    "name": "Old Fashioned",
    "ppu": 0.55,
    "batters": {
      "batter": [
        { "id": "1001", "type": "Regular" },
        { "id": "1002", "type": "Chocolate" }
      ]
    },
    "topping": [
      { "id": "5001", "type": "None" },
      { "id": "5002", "type": "Glazed" },
      { "id": "5003", "type": "Chocolate" },
      { "id": "5004", "type": "Maple" }
    ]
  }
]


(1) Data connection
First, I want to compare how Databricks understands this JSON file vs. the ADF Mapping Data Flow data connector.

ADF Mapping Data Flow provides the following file structure projection of my JSON file:

Databricks inferred the same structure for my JSON file, so both tools are in sync on this.
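
Here is a minimal PySpark sketch of this connection step (the file path below is just a placeholder, not an actual storage location):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("flatten-vs-explode").getOrCreate()

# The sample file is a JSON array that spans multiple lines,
# so the multiLine option is required to parse it correctly.
df_json = (
    spark.read
    .option("multiLine", True)
    .json("/mnt/storage/donuts.json")  # placeholder path
)

# Show the schema Spark inferred, to compare with the ADF projection
df_json.printSchema()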


(2) Flattening topping JSON array column
The very first use of the Flatten data transformation in my ADF data flow expands the topping column:

Databricks' use of the Spark explode function provides similar results:
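
A sketch of that step in PySpark, building on the df_json DataFrame above (column names are assumed from the file schema, not taken from my actual notebook output):

from pyspark.sql.functions import col, explode

df_topping = (
    df_json.select(
        col("id"),
        col("type"),
        col("name"),
        col("ppu"),
        col("batters"),                      # kept for the next flatten step
        explode(col("topping")).alias("t"),  # one row per topping element
    )
    .withColumn("topping_id", col("t.id"))
    .withColumn("topping_type", col("t.type"))
    .drop("t")
)

df_topping.show(truncate=False)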


(3) Flattening batter JSON array column
The next use of the Flatten data transformation in my ADF data flow expands the batter column:

This looks no different in my next Spark DataFrame, built with the help of the same explode function:
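
A sketch of that second step, exploding the nested batters.batter array on top of the df_topping DataFrame from the previous step:

df_batter = (
    df_topping.select(
        "id", "type", "name", "ppu",
        "topping_id", "topping_type",
        explode(col("batters.batter")).alias("b"),  # one row per batter element
    )
    .withColumn("batter_id", col("b.id"))
    .withColumn("batter_type", col("b.type"))
    .drop("b")
)

df_batter.show(truncate=False)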


(4) Validating results in ADF Mapping Data Flow
My ADF data flow's final transformation output matched my Databricks notebook DataFrame result: all necessary columns were flattened (or "exploded"), and the JSON file schema was properly interpreted in both cases.

I'm just curious how different my appreciation of the ADF data flow Flatten transformation would be if it were named Explode after its Spark explode sibling :-), because the actual Spark flatten function doesn't expand data structures into multiple rows; it transforms an array of arrays into a single array.
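
A toy example of that naming difference (made-up data, not from the donut file):

from pyspark.sql.functions import explode, flatten

toy = spark.createDataFrame([([[1, 2], [3, 4]],)], ["nested"])

# flatten: array of arrays -> one array, still a single row
toy.select(flatten("nested").alias("flat")).show()   # [1, 2, 3, 4]

# explode: one row per inner array
toy.select(explode("nested").alias("rows")).show()   # [1, 2] then [3, 4]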

So, my data adventures continue :-)


