Previously, I wrote a blog post about using the ADF Data Flow Flatten transformation to transform a JSON file - Part 1: Transforming JSON to CSV with the help of Azure Data Factory - Mapping Data Flows.
This time I would like to compare the Databricks code development experience for flattening the very same source JSON file.
[
    {
        "id": "0001",
        "type": "donut",
        "name": "Cake",
        "ppu": 0.55,
        "batters":
        {
            "batter":
            [
                { "id": "1001", "type": "Regular" },
                { "id": "1002", "type": "Chocolate" },
                { "id": "1003", "type": "Blueberry" },
                { "id": "1004", "type": "Devil's Food" }
            ]
        },
        "topping":
        [
            { "id": "5001", "type": "None" },
            { "id": "5002", "type": "Glazed" },
            { "id": "5005", "type": "Sugar" },
            { "id": "5007", "type": "Powdered Sugar" },
            { "id": "5006", "type": "Chocolate with Sprinkles" },
            { "id": "5003", "type": "Chocolate" },
            { "id": "5004", "type": "Maple" }
        ]
    },
    {
        "id": "0002",
        "type": "donut",
        "name": "Raised",
        "ppu": 0.55,
        "batters":
        {
            "batter":
            [
                { "id": "1001", "type": "Regular" }
            ]
        },
        "topping":
        [
            { "id": "5001", "type": "None" },
            { "id": "5002", "type": "Glazed" },
            { "id": "5005", "type": "Sugar" },
            { "id": "5003", "type": "Chocolate" },
            { "id": "5004", "type": "Maple" }
        ]
    },
    {
        "id": "0003",
        "type": "donut",
        "name": "Old Fashioned",
        "ppu": 0.55,
        "batters":
        {
            "batter":
            [
                { "id": "1001", "type": "Regular" },
                { "id": "1002", "type": "Chocolate" }
            ]
        },
        "topping":
        [
            { "id": "5001", "type": "None" },
            { "id": "5002", "type": "Glazed" },
            { "id": "5003", "type": "Chocolate" },
            { "id": "5004", "type": "Maple" }
        ]
    }
]
(1) Data connection
First, I want to compare how Databricks interprets this JSON file against the ADF Mapping Data Flow data connector.
ADF Mapping Data Flow provides the following file structure projection of my JSON file:
This is how Databricks understood my JSON file, so both tools are in sync on this.
(2) Flattening topping JSON array column
The very first use of the Flatten data transformation in my ADF data flow expands the topping column:
Using the Spark explode function in Databricks produces similar results:
(3) Flattening batter JSON array column
The next use of the Flatten data transformation in my ADF data flow expands the batter column:
The result looks no different in my next Spark DataFrame, again with the help of the same explode function:
(4) Validating results in ADF Mapping Data Flow
My ADF data flow's final transformation output matched my Databricks notebook DataFrame result: all necessary columns were flattened (or "exploded"), and the JSON file schema was properly interpreted in both cases.
I'm just curious how different my appreciation for the ADF data flow Flatten transformation would be if it had been named Explode after its Spark explode sibling :-), because the actual Spark flatten function doesn't expand data structures into multiple rows; instead, it transforms an array of arrays into a single array.
So, my data adventures journey continues :-)