Showing posts with label RICARDO. Show all posts
Showing posts with label RICARDO. Show all posts
Case
I want to create multiple Azure Data Factory (ADF) pipelines that are using the same source and sink location, but with different files and tables. How can we do this in ADF when taking extensibility and maintainability of your pipelines into account?














Solution
Let's say we have a scenario where all the source files are stored into our Azure Storage, for example Blob Storage. We want to load these files into a traditional Data Warehouse (DWH) using an Azure SQL Database, that contains a separate schema for the Staging layer. Earlier we showed you how you can do this using a Foreach Loop in one single pipeline. For this scenario we will not this because:
  • Not loading all Staging tables every day on the same time. 
    • Sometimes it will only load 3 of the total 5 files, because not all the files are coming for a certain day and can be different per day. 
    • Load can be spread during the day, several moments.
  • Be more flexible per Staging load.
    • In most cases you want to have the same Staging load for every table, but sometimes transformations can be different or you need an extra table to prepare the dataset.
  • The use of technical columns in Staging table.
    • In most cases you want to store some metadata, for example "inserted date" and "inserted by". When using a "Copy Data Activity", you have to configure the mapping section when the source and sink fields are not equal. 
One of the solutions is building dynamic pipelines. This will be a combination of parameters, variables and naming convention. The result will be a dynamic pipeline, that we can clone to create multiple pipelines using the same source and sink dataset. In this blog post we will focus on the Staging layer. Of course you can implement this for every layer of your DWH.

Note:
When working in a team, it is important to have a consistent way of development and have a proper naming convention in place. This contributes to the extensibility and maintainability of your application. Click here for an example of naming convention in ADF.

Pipeline

For this solution, we will create a pipeline that contains the following activities:
  • Copy Data activity (load data from source to sink)
  • Stored Procedure activity (update technical columns)
The source will be a CSV file and is stored in a Blob container. The data is based on the customer sales table from WideWorldImporters-Standard. The file will be delivered daily.

Storage account
bitoolssa
Blob container
sourcefiles
Blob folder
WWI
File name*
SalesCustomers20200322.csv
* this is the blog post date, but it should be the date of today

Note:
Blob storage containers only have virtual folders which means that the folder name is stored in the filename. Microsoft Azure Storage Explorer will show it as if it are real folders.

1) Linked Service

Open the ADF portal, go to Connections - Linked services and and click on New. Select Azure Blob Storage and give it a suitable (generic) name. Make connection with your storage account. Create another Linked service for Azure SQL Database, because that will be our destination (sink). Click here how to make connection using Azure Key Vault.
ADF portal - Create Linked Service























2) Dataset (source)

Click New dataset and select Azure Blob Storage. The format is DelimitedText. Give it a suitable (generic) name and select the Linked Service for Blob Storage that we created earlier. 

Once you click 'OK', it will open the dataset automatically. Go to Parameters and add the following:
  • Container -> type "String"
  • Directory -> type "String"
  • File -> type "String"
  • ColumnDelimiter -> type "String"
Go to Connection and now use the applicable parameters to fill File path. You can apply dynamic content for each setting. For now, we also added a parameter for "Column delimiter". At last, we use First row as header.

Click Publishing to save your content locally. For now we did not configure a Git Repository, but of course we recommend that.
ADF portal - Create Source Dataset
























Note:
Use Preview data to verify if the source input is as expected by filling in the correct parameters values.

3) Dataset (sink)

Create another dataset for the sink. Choose Azure SQL Database and give it a suitable (generic) name. Select the Linked Service for Azure SQL Database that we created earlier and click 'OK'.

Add the following Parameters:
  • SchemaName (String)
  • TableName (string)
Go to Connection and click Edit. Fill in both parameters using dynamic content. 
ADF portal - Create Sink Dataset














4) Pipeline - Copy data

Create a new pipeline and include the schema and table name in the name, like "PL_Load_Stg_Customer". Go to Variables and add the following including values:
  • Container -> type "String" and value "sourcefiles"
  • Directory -> type "String" and value "WWI"
  • File -> type "String" and value "SalesCustomers"
  • ColumnDelimiter -> type "String" and value ","
Add the Copy data activity, go to Source and use dynamic content to assign the variables to the input parameters. For the file, we use also expression language to retrieve the correct name of the file dynamically: "@{variables('File')}@{formatDateTime(utcnow(), 'yyyyMMdd')}.csv"

Go to Sink and fill in the schema and table name. We use the SPLIT function to retrieve this from the pipeline name. We use a Pre-copy data script to truncate the table before loading.

At last, go to Mapping and click on Import schemas. This will automatically map the columns with the same names (source and sink). We will remove the columns that do not exists in the source, in this case our technical columns "InsertedDate" and "InsertedBy". We will fill those columns in the next activity.

ADF portal - Create Copy data activity
















5) Pipeline - Stored Procedure

Add the Stored Procedure activity and give it a suitable name. In SQL Account, select the Linked service for Azure SQL Database that we created earlier. We created a SP, that contains dynamic SQL to fill the columns "InsertedDate" and "InsertedBy" for every Staging table. See code below.

/*
==========================================================================================================================
Stored Procedure [dbo].[uspPostLoadStaging]
==========================================================================================================================
Description:

This query will load the following columns:
- InsertedDate
- InsertedBy

In ADF, without using Data Flows (Mapping), you can combine a Copy data activity with a Stored Procedure in order to
fill those (technical) columns during execution of the pipeline. In SSIS this was done by the Derived Column task.

==========================================================================================================================
Parameter Parameter description
--------------------------------------------------------------------------------------------------------------------------
@InsertedDate Date when the data was inserted into the Staging table.
@InsertedBy Name of a service / account that has inserted the data into the Staging table.
==========================================================================================================================

==========================================================================================================================
Change history

Date Who Remark
--------------------------------------------------------------------------------------------------------------------------
2020-03-22 Ricardo Schuurman Intial creation.
==========================================================================================================================
*/

CREATE PROCEDURE [dbo].[uspPostLoadStaging]
@SchemaName AS NVARCHAR(255)
, @TableName AS NVARCHAR(255)
, @InsertedDate AS DATETIME
, @InsertedBy AS NVARCHAR(255)

AS
BEGIN

SET NOCOUNT ON;
BEGIN TRY

DECLARE
@QueryStep1 AS NVARCHAR(MAX)
, @QueryStep2 AS NVARCHAR(MAX)

-- ERROR variables
, @errorMsg AS NVARCHAR(MAX) = ''
, @errorLine AS NVARCHAR(MAX) = ''

/* Example values of SP parameters
, @SchemaName AS NVARCHAR(255)
, @TableName AS NVARCHAR(255)
, @InsertedDate AS DATETIME
, @InsertedBy AS NVARCHAR(255)

SET @SchemaName = N'Stg'
SET @TableName = N'Customer'
SET @InsertedDate = GETDATE()
SET @InsertedBy = N'Test'

*/

/* ================================================================================================================
Step 1: Extract schema and table name (based on the ADF pipeline naming convention)
================================================================================================================ */

-- Add LIKE condition for schema
SET @SchemaName = '%' + @SchemaName + '%'

SELECT
@QueryStep1 = '[' + [TABLE_SCHEMA] + '].[' + [TABLE_NAME] + ']'
FROM [INFORMATION_SCHEMA].[TABLES]
WHERE TABLE_TYPE = 'BASE TABLE'
AND TABLE_SCHEMA LIKE @SchemaName
AND TABLE_NAME = @TableName

/* ================================================================================================================
Step 2: Execute Update statement using the fixed (technical) columns
================================================================================================================ */

SET @QueryStep2 = N'
UPDATE ' + @QueryStep1 + '
SET
[InsertedDate] = ''' + CONVERT(NVARCHAR(30),@InsertedDate, 120) + '''
, [InsertedBy] = ''' + @InsertedBy + ''';
'

EXEC SP_EXECUTESQL
@Query = @QueryStep2

END TRY

-------------------- Error Handling --------------------

BEGIN CATCH

SELECT
@errorMsg = ERROR_MESSAGE()
, @errorLine = ERROR_LINE()

SET @errorMsg = 'Error Occured with the following Message ' + @errorMsg + 'Error Line Number '+ @errorLine;

THROW 50001, @errorMsg, @errorLine;

END CATCH

END


Create the SP in the database, go to Stored Procedure and select the SP. Click Import parameter and fill the parameters. We use the System variables 'Pipeline Name' and 'Pipeline trigger time' for "InsertedDate" and "InsertedBy". Reuse the values of "SchemaName" and "TableName" from the sink (copy data activity).
ADF portal - Create Stored Procedure activity




















Result
Execute the pipeline in ADF (by clicking Debug) and check the result in SQL Server Management Studio. It works!












Note:
In case of a "SqlFailedToConnect" error, make sure the firewall of the Azure SQL Database allows the Integration Runtime of ADF to access. Go to the Azure portal or use sp_set_firewall_rule on the  database to create a firewall rule. 


Summary
In this blog post you saw how you can create a dynamic pipeline using parameters and variables in combination with a proper naming convention in ADF, for example the names of your pipelines. This way of working can contribute to a consistent solution and code. In case of a new Staging load table, clone the existing pipeline and use the schema and table name in the pipeline name. You only have to change the mapping in the copy data activity.

In a next post we will continue with dynamic pipelines, but then using metadata that is stored in a database instead of ADF itself.

Case
Recently Microsoft introduced a new feature for Azure Data Factory (ADF) called Mapping Data Flows. This allows you to do data transformations without writing and maintaining code. Similar to SSIS, but then in the Cloud. How can we use Mapping Data Flows to build an incremental load?



Solution
First, it is good to know that Mapping Data Flows is not the only feature that is part of "Data Flows". Microsoft announced also Wrangling Data Flows. The main difference between these two features is that Mapping Data Flows is more traditional "ETL" with a known source and destination, while Wrangling Data Flows is suited for preparing data and store this dataset in Azure Data Lake for example. From here, it can be used as input for Machine Learning or doing transformations later on. Currently it is in Limited Private Preview. In a future post we will explain more about Wrangling Data Flows and show you how it actually works.

In this post we are going to set up an incremental load for a Data Warehouse (DWH) scenario using Mapping Data Flows. We want to load our Staging layer incremental. We use a so-called Watermark for this. An example of a watermark is a column that contains the last modified time or id. We already configured an Azure Data Factory and we are using an Azure SQL Database with sample data from WideWorldImporters. We use the "Orders" table.

Note:
Here you will find an tutorial of an incremental load using an ADF pipeline with several activities.

1) Create table for watermark(s)
First we create a table that stores the watermark values of all the tables that are suited for an incremental load. The table contains the following columns:

--Create Watermark table
CREATE TABLE [dbo].[WatermarkTable](
[WatermarkTableName] [nvarchar](255) NULL
, [WatermarkColumn] [nvarchar](255) NULL
, [WatermarkValue] [nvarchar](255) NULL
) ON [PRIMARY]
GO

2) Fill watermark table
Add the appropriate table, column and value to the watermark table. The value must be the last loaded date or id. This is a one-time insert, because after this the watermark value will be updated after each load. In our case the table contains the following record:

Azure SQL Database - Watermark value for Orders

3) Create Linked Service
Before we can start building the Data Flow, we first have to make a connection with our database. Go to Connections and click on 'New'. Select Azure SQL Database and give it a suitable name. Select your Azure subscription and then select the correct server, database and fill in the credentials.

Azure Data Factory - Create Linked Service















4) Create Data Flow
Now lets build the Data Flow itself. Click on Add Data Flow and give it a suitable name. Important to know is that you can test your Data Flow components by turning on the Data Flow Debug mode. Be aware of the costs, because it will spin up a Databricks cluster and that is hourly billed. More information about the debug mode here.

Azure Data Factory - Create Data Flow















Note:
After creating the Data Flow, we can only save it by adding at least one source and one sink.

The goal of this Data Flow is to include only data from the source table / query that is greater then the watermark value that is stored from the previous load. Below we will explain each component of this flow individually.

Source 1: source table / query
In this Data Flow the first source is the source table / query. Click on "Add Source", give it a suitable name and click on new "Source dataset". Select "Azure SQL Database" and select the Linked Service we created earlier. Now go back to the source and click on "Source Options". Select "Query" and write the query. Click on "Import Schema" and at last we can preview our data. We have turned on the Debug mode to import the schema and preview the data.

Azure Data Factory - Add Source in Data Flows













We used the following query. As you can see, we added a column that contains the table name from which we want to retrieve the watermark value.

--Create source query
SELECT
[OrderID]
, [CustomerID]
, [SalespersonPersonID]
, [PickedByPersonID]
, [ContactPersonID]
, [BackorderOrderID]
, [OrderDate]
, [ExpectedDeliveryDate]
, [CustomerPurchaseOrderNumber]
, [IsUndersupplyBackordered]
, [Comments]
, [DeliveryInstructions]
, [InternalComments]
, [PickingCompletedWhen]
, [LastEditedBy]
, [LastEditedWhen]
, 'Orders' AS TableName
FROM [Sales].[Orders]

Source 2: watermark table
This source contains a simple query of the watermark table. The setup is the same as source 1, only with a different query. Later on we will make sure we only select the watermark value from the correct table in the watermark table (with a Join).

--Create watermark table query
SELECT
[WatermarkTableName]
, [WatermarkColumn]
, [WatermarkValue]
FROM [dbo].[WatermarkTable]

Azure Data Factory - Source WatermarkTable















Derived Column
It can occur that the watermark values has different datatypes. That is why we use the 'nvarchar' datatype to store the values. In this case the watermark is a 'datetime' and we have to convert this to make a successful join to the watermark table. We use the expression language of Mapping Data Flow for this, more information here.

Azure Data Factory - Use Mapping Data Flow expression language















Join
We use a join to combine data from different sources. We can choose between 'Full outer', 'Inner', 'Left outer', 'Right outer' or a 'Cross' join. Here we want to make sure that the watermark value of the specific table is used for the incremental load. We use a 'Left outer', but you can also use a 'Inner' here because all the records contain the same table. A join on watermark table is more future proof, because a join on the watermark column is not going to work when you have multiple watermark tables with the same watermark column name.

Azure Data Factory - Use Join component
















Filter
Because you can only use the "Join" component with two columns that are equal to each other, we use "Filter" to include only the records where the watermark value from the source table / query is greater then the latest watermark value.

Azure Data Factory - Filter out old records
















Select
In SSIS you can map the columns from source to destination in the destination. If you do not want to load a column, you do not map it. Here we have to use the "Select" component to select the relevant columns. Thank you Ronny Albouts for mention this.

Azure Data Factory - Map or unmap columns
















Derived Column 2
Before we choose the destination, we will convert the column"LastEditedWhen" back to a date using a new "Derived Column" component. Otherwise we cannot map this column with the 'Orders_Incremental' table, because we use the original table definition of 'Orders' and here the datatype is a date (time).

Azure Data Factory - Convert back to date before Sink















Sink
In ADF a destination is called "Sink". Here we will select our destination table called "Orders_Incremental". The table definition is the same as "Orders". Create the Sink dataset and automatically the columns will map. You can turn off 'Auto Mapping' to make manual changes.

Azure Data Factory - Add Sink in Data Flows













Note:
If the destination type is not available, you can store the data into CSV format or Blob Storage and use a Copy Activity to load the data in your preferred destination.

Result
After the Data Flow is ready, go back to "Pipelines" and create a new one. In here we will select the Data Flow activity and select the Data Flow we have created earlier. Now run the pipeline and let's take a look at the result, it is working!

SQL Server Management Studio - Result of Incremental Load

Note:
To make it as generic as possible, it is prefered to use the "Derived Column" component to add a new column (similar to SSIS). In this way you will keep the source query 'clean'. Unfortunately this is not working at the moment and Microsoft is investigating this issue.

Update Watermark
At last we have to update the watermark value to the last known value, in this case the date of "LastEditedWhen". Therefore we will use the following simple Stored Procedure.

/* Create Stored Procedure */
CREATE PROCEDURE [dbo].[usp_UpdateWatermark]
@tableName nvarchar(255)

AS

BEGIN

DECLARE
/* ============= Variables ============= */
@watermarkValue nvarchar(255)

/* Determine latest Watermark value from input table */
SELECT
@watermarkValue = MAX([LastEditedWhen])
FROM [Sales].[Orders_Incremental] AS T

/* Update Watermark table */
UPDATE [dbo].[WatermarkTable]
SET [WatermarkValue] = @watermarkValue
WHERE [WatermarkTableName] = @tableName

END
GO

Add the Stored Procedure activity at the end of the pipeline. The result should look like this.

Azure Data Factory - Update Watermark using SP















As you can see the T-SQL is hard coded. In a next post we will show you how to setup a dynamic pipeline so that you can reuse the Stored Procedure activity for every table in an Incremental Load batch.

Conclusion
In this post we showed you how to create an incremental load scenario for your Data Warehouse using Mapping Data Flows inside Azure Data Factory. With Mapping Data Flows, you can transform and clean up your data like a traditional ETL tool (SSIS).

This is just the beginning of Mapping Data Flows, we will expect more and more functions to make this grow into the new "SSIS in the cloud".

Case
Microsoft recently introduced Power BI dataflows for self-service ETL or data preparation. It uses an Azure Data Lake Storage (ADLS), which is only visible to Power BI, to store the metadata and result of the dataflows. However, you can also bring your own storage account. How does that work and why would you want that?

Bring Your Own Storage in preview

Solution
As we mentioned earlier in our previous post, Power BI dataflows can be seen as a self-service data preparation tool. It is easy accessible, so that other people in your organization (besides IT) can get started with transforming and maintaining the data. Nowadays there are often several people / business analysts in a department who are maintaining Excel files. With Power BI dataflows you can centralize this process and contribute to the ideology of "one version of the truth".

When you bring your own ADLS Gen2 storage account (StorageV2) for Power BI dataflows, other services like Azure Data Factory or Azure Databricks could use that same data. This makes it possible to do light weight data preparations with an user-friendly tool for corporate data warehousing or data science. However Power BI dataflows does not (yet?) replace SSIS, Azure Data Factory or any other ETL tool. They could even write back data to the data lake to create 'external' dataflows that are not maintained in Power BI.

Configure and connect to your own storage account
First you have to create an ADLS Gen2 storage account. Make sure the storage account is on the same region as the Power BI tenant. Click here for the documentation of Microsoft that explains step by step how to create such account.

Next, execute the following steps to make a connection with your own storage account. Note that the last step is a one-time action that cannot be changed afterwards.

It requires the following permissions to make a connection:
  • Role "Owner" in the Azure Subscription (service administrator / classic administrator) to add an ADLS Gen2 storage account. This is also required to give the Power BI Service access to the storage account and Blob Container. 
  • Global Administrator in O365 or Azure Active Directory to connect your ADLS Gen2 storage account with dataflows. This has to be done in the Power BI admin portal. The role Power BI Service Administrator is not sufficient to perform those actions.

Power BI Service - Successfully connected to your own ADLS Gen2 storage account













Result
To actual store the dataflow definition and the related data files in your ADLS Gen2 storage account, you must create a new Power BI app workspace or update an existing one. In case of an update, make sure you do not already have dataflows stored in the workspace. Otherwise you cannot change this.

In this case we created a new workspace. Turn on Dataflow storage (preview) under "Advanced" settings. In this case we built the same (simple) dataflow as our previous post. Save your dataflow and click "Refresh".

Azure Storage Explorer
When the dataflow is refreshed, go to the Blob container you created earlier for your ADLS Gen2 storage account. Here you will find the definition (source code) of the dataflow and the output. Note that the content of the ADLS Gen2 storage account is only visible in Azure Storage Explorer. Click here to download.

Azure Data Explorer - Result in own ADLS Gen2 storage account


















And now you can choose per workspace whether you want to use this new Data Lake. This only works for the new workspaces, that are at the moment also still in preview.
Power BI Workspace settings


Common errors
When you try to connect to your own storage account in the admin portal of the Power BI Service, you can get several errors.

You must have global administrator permissions
Only Global Administrators in Office 365 or Azure Active Directory are administrators in Power BI and therefore able to connect to the storage account. Click here for more information about administrator roles in Power BI.














There was a problem accessing your dataflow storage account
After creating your storage account , it can take up to 30 minutes to make a connection. Also make sure you avoid spelling mistakes.

Your storage account must be in the same Azure Active Directory tenant
This occurs for example when you are trying to make a connection with the ADLS storage account in the Power BI Service with an account outside the organization. In this case, the organization (subscription) is where the storage account is created.
















Conclusion
In this post we showed you how to use your own Azure Data Lake Storage account instead of the default provided by Power BI using dataflows. This new feature has several possible use cases. For example:
  1. Data preparation for 'corporate' data warehousing by a business user with a user-friendly tool
  2. Data preparation for 'corporate' data science by a business user with a user-friendly tool
  3. Creating 'external' dataflows for Power BI with Azure Services like Data Factory or Databricks
  4. Using data from other CDM-compliant applications like Dynamics 365 and Office 365

We hope Microsoft will make it a bit easier to bring your own storage, because at the moment there are a lot of steps to take and you need a lot of rights to do it. This discourages to arrange this powerful option.

We also expect some more admin capabilities in Power BI, because at this moment you cannot change your dataflow storage once Power BI is connected to your own ADLS account (so be very careful). And for larger corporations one ADLS account is probably not enough. It is also expected that the integration and collaboration with other Azure services will be improved so that you are even more flexible in choosing the services in your BI landscape.

Case
Power BI recently introduced dataflows. What is it and who should use it?
Power BI dataflows












Solution
According to Microsoft this new addition is for self-service ETL by business annalists and BI professionals. A minor addition from our side: especially business annalists that are skilled with tools like Excel and BI professionals with already a focus on Power BI will probably be very exiting using this new addition. An other group that could benefit of this new feature is the technical application administrator that has a great amount of knowledge of the data model of their application. With dataflows the could do some data preparation for the business annalists

First of all you must have at least a Power BI Pro license and this new preview feature is a Power BI Service-only feature that is not (yet?) available in Power BI Desktop. However you can use the result of a Data Flow which we will show you in the last step.

1) Power BI Service
To create a dataflow, sign in to Power BI Service and go to one of your workspaces, but not My Workspace. Or create a new workspace. Here you will find a new tab called "Dataflows (preview)" and in the Create-menu there is a dataflow option. Click on it to create a new dataflow.

Power BI dataflows - Create new dataflow
















2) Add new entities
In this step we will add a source for our dataflow. You can map this data to one of the standard Common Data Model (CDM) entities, but you can also create 'custom' entities that are not mapped. The CDM is a standard model for example for contacts or accounts to which you can map your sources like CRM or SalesForce. It should make it easier for development, but also analytics. In a subsequent post we will explain the CDM.

For this example we will not map to CDM, but create a 'custom' entity. We will use a text file that contains sensor data. We use Azure Blob Storage to store this data, but of course you can also use an internal file share. However then you first need to install the On-premises data gateway.


As told, we choose "Azure Blobs" as data source and next fill in the connection settings. If you have ever used this storage account before then it will remember the account key. Choose your Blob folder which contains the data and now you can transform your data similar to the Power Query Editor in Power BI Desktop.

Power BI dataflows - Create new Entity
















3) Edit Query
Just like Power BI Desktop you can do data preparation inside an "Edit Query" mode. This Power Query version does not have the full functionality compared to Power BI Desktop, but the expectation is that it will be extended in the upcoming releases. Some limitations of the current Query Editor in dataflows are: you cannot change the datatypes of the columns or use the 'Group By' function.

Power BI dataflows - Power Query Editor















There are also differences between a Pro license and Premium. In case you are doing some basic transformations like combining two queries (using Merge or Append queries) it will cause the following warning when using a Pro license:
This dataflow contains computed entities, which require Premium to refresh. To enable refresh, upgrade this workspace to Premium capacity.

Power BI dataflows - Pro license vs Premium







More information about the differences between a Pro license and Premium here.

4) Save and use dataflow 
After finishing the data preparation, you must save the new dataflow. Give it a suitable name and after saving, you will be asked to refresh the dataflow or schedule it later. You can set up a "Scheduled refresh" the same way as a dataset.

Power BI dataflow - Save (and schedule refresh)














Open Power BI Desktop, select "Get Data" and now you can use a Dataflow as a source for your report.

Power BI Desktop - Use a dataflow as source












Conclusion
In this post we created our first dataflow, a new (preview) feature of Power BI. Despite we are using the first version which is still lacking of some basic features, it already looks very promising. We expect/hope that the limited Power Query possibilities will soon be aligned with those in Power BI Desktop making this a very powerful tool.

The main benefit of this new addition, is that you don’t have to setup and host a separate ETL tool with possibly complex code. Now everything is integrated in one platform. The disadvantages are, besides the limited Power Query options, the lack of versioning and release management.

Will it replace enterprise ETL with tools like SSIS, Azure Data Factory and Azure Databricks? Probably not in the near future. For now it is still self-service ETL which you could use as a first step to enterprise ETL. However, Microsoft will likely extend this tool in the coming years. Just like they did with Power BI itself. Back in 2014 most people didn’t see that as a serious alternative for reporting.

An alternative route map could be integrating Power BI dataflows within your existing BI platform. You can publish (and refresh) the result of your dataflows to Azure Data Lake and then pick up the data, besides Power BI, with tools as Azure Databricks and Azure Data Factory. We will explain this in a subsequent post next year.
Case
When data is not extracted from a cleaned up or validated environment, such as a Data Warehouse or Data Mart, the first step is to analyse and possible clean up the data. How can Power BI help in this?

Data Profiling in Power Query Editor








Solution
The release of October 2018 Desktop update introduced a new (preview) feature that can profiling the data in the Power Query Editor. With this, you can do some quick analysis over your data and you can easily identify errors and empty values in a column.

For this blog post we use sample data from WideWorldImporters. We have saved the data in a CSV file, because it is more common to do this kind of analyses when you load raw data from Text or CSV files into your Power BI Model instead of using validated data from a star schema in a Data Mart.

Apply data profiling in Power BI
Because it is a preview feature, we have to turn this feature on. Go to "File - Options and settings - Options - Preview features" and select Enable column profiling.

Power BI Desktop - Enable Preview feature




















Open the Power Query Editor and load your data. In our case it is a CSV file that contains raw data about sales orders. Go to the "View" tab and select Column quality. With this option you can see whether your data contains errors or empty values, also known as "null" values.

Power BI Desktop - Column quality in Power Query Editor
















Besides this, you can also view the total of unique and distinct values. Go to the "View" tab and select Column distribution. Unique values indicates how many values occur only once and distinct values means how many different values there are in the specific column. Now the Query Editor will also recommend to do a quick fix if you want, for example remove the duplicate values in a column.

Power BI Desktop - Column distribution in Power Query Editor

















Note:
Be aware of the fact that column profiling is only based on top 1000 rows (preview results of the query).

Conclusion
In this post you saw a new feature called "Data Profiling". It is a good start and useful to use this for doing some quick analyses over your data. Note that this feature is in preview.

The program team of Power BI has announced to add more capabilities for data profiling, so we are very curious about what this will bring even more in the future.
Case
With PowerApps, Microsoft brought a new element to the existing world of reporting and dashboarding inside the Microsoft BI stack. For example, you can connect and customize your data using PowerApps. How does this work and how can you use PowerApps inside Power BI?

Power BI Marketplace - PowerApps custom visual






Solution
Important to know is that Microsoft PowerApps is part of Office 365. In case your organization does not have an Office 365 licence, PowerApps will cost 7 or 40 dollar per user per month. With a license, you can start building your own apps. There are more features available for an additional cost. More details about pricing here.

In this post we want to show you how you can connect to your data and store new data in an Azure SQL Database using PowerApps. For this example, you can give as Sales Employee approval (or not) on the report. This all happens in a Power BI dashboard using PowerApps.

1) Create new PowerApp
We will build the PowerApp using the Power BI Desktop and service. First, you have to create a new report in Power BI Desktop and add the PowerApps custom visual to the report. Choose PowerApps from the marketplace. Notice that this visual is still in Preview.

Power BI Desktop - Add PowerApps custom visual




















For our example, we will use data from WideWorldImportersDW (Azure SQL Database) and import the "Employee" dimension and "Sales" fact table. We have implemented a number of transformations to keep only sales employees and created a 'Employee Full Name' column with values like 'Schuurman, Ricardo'. We choose the columns 'Employee Full Name' and 'Profit' in the PowerApps custom visual.

After you have published your report and opened it in the Power BI service, you will see the option Create new. You will be redirected to the development portal of PowerApps.

Power BI Service - Create new PowerApp using custom visual

Note:
When you create a new PowerApp using the Power BI service, a new dataset for the Power BI data will automatically be created with the name "'PowerBIIntegration'.Data".

2) Build PowerApp
Once you are in the development portal, you can start building your app. In a next post we will fine tune the app and explain and show you several elements of the PowerApp.

PowerApps Studio - Developing the PowerApp














Important to know is that we have created a new table in WideWorldImportersDW, called "SalesApproval".

USE [WideWorldImportersDW]

CREATE TABLE [PowerBI].[SalesApproval](
[SalesApprovalKey] [int] IDENTITY(1,1) NOT NULL,
[EmployeeFullName] [nvarchar](100) NULL,
[Profit] [numeric](20, 8) NULL,
[Approved] [nvarchar](50) NULL,
[Comment] [nvarchar](max) NULL,
CONSTRAINT [PK_Dimension_Employee] PRIMARY KEY CLUSTERED
(
[SalesApprovalKey] ASC
)WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO

In this table we will store the data we have filled with the PowerApp. Later on we will show you how to do this and which data will be stored.

When you are done developing, you have to save and publish the PowerApp. After this, the app is live and available in several applications, such as Power BI and Microsoft Teams.

PowerApps Studio - Publishing the PowerApp














Note:
After saving and publishing your PowerApp it can happen that the PowerApp is missing the Power BI data the next time you open it. To fix this, go to the Power BI service and add a new Power Apps custom visual in the report. Select the right columns (in PowerApps Data) and click on Choose app. Add your existing app and go to the PowerApps Studio. Now your PowerApp is working again, including the Power BI data.

3) Result
Let's see if we can use this PowerApp in Power BI and store the result in the table "SalesApproval" we created earlier. Go to the Power BI service and open the report. Search for the right sales person (this is you as sales person if this dashboard is live) and fill in the form:
Yes or No for approval and associated comment (Text input). Press Submit when you are done.

Now if we look at the table in SQL Server Management, we see a new record containing the data we have filled in. Cool!

Power BI Service - Submit data in PowerApp (custom visual) and result














You can download the Power BI report here and the first version of the corresponding PowerApp we used for this post here. Use the Export (preview) and Import (preview) feature to add the app in your environment.

Note:
The Export and Import feature is still in preview. After preview, you must have PowerApps Plan 2 trial or PowerApps Plan 2 paid license for this feature.

Summary
In this post you saw how to use the PowerApps custom visual in Power BI. For this example we used PowerApps to approve (or not) a sales report and store this data in an Azure SQL Database.

In a next post we will describe and explain how this PowerApp is build. We will also customize the app to make it more user friendly.
Case
In an earlier post, we showed you how to use Azure Logic Apps for extracting email attachments without programming skills. The attachments contain the source files. Because this step is part of an Data Warehouse solution, it would be nice to run this together with the ETL process that needs these source files. How can we archive this?

Azure Data Factory V2 - Execute Azure Logic App















Solution
In the first few years of Azure, it was not possible to run your Data Warehouse process entirely in the Cloud. Of course, you could store the data in Azure SQL Database or Azure SQL Data Warehouse (see here for the differences between these two), but when you are using SQL Server Integration Services (SSIS) you still had to run this on-premise or create a custom Virtual Machine. Until recently. This post explains how you can execute SQL Server Integration Services (SSIS) packages in Azure, using Azure Data Factory (ADF) V2.

Besides running SSIS packages in ADF V2, you can also execute other Azure services in here. For example: Azure Databricks, Azure Data Lake Analytics (U-SQL scripts) and HDInsight (services like Hadoop, Spark, Hive etc.).

This post shows you how to execute an Azure Logic App inside ADF V2.

1) Add and configure activity
Create a new pipeline or edit an existing one. Select "General" and choose the Web activity. Give it a suitable name and go to Settings. Fill in the URL, corresponding to the one inside the HTTP trigger in the Azure Logic App, you created earlier:

Azure Logic App - URL in HTTP Trigger

















Select the "POST" API Method. Now add a Header and enter the following:
  • KEY: Content-Type
  • VALUE: application/json
When you are finished, click Publish All.

Azure Data Factory V2 - Configure Web Activity


















2) Run pipeline
After you have published your pipeline, go to Trigger and select Trigger (Now). You can also run the pipeline without publishing it: using Debug. In this mode you will see the result of the pipeline run in the bottom at Output.

NOTE:
If you do not publish your pipeline, you are getting the following error when you are trying to use Trigger (Now):

Pipeline Error - Use Trigger (now) without publishing









NOTE 2:
If you do not publish your pipeline, you are getting the following warning when you want to access the monitor screen:

Pipeline Warning - Go to Monitor without publishing









3) Result
Once you have triggered the pipeline, go to Monitor on the left in the menu. Default it will open the Pipeline Runs overview, but you can also select the Integration Runtimes or Trigger Runs overview at the top.

You can also watch the Runs history of the Logic App:


View Result - Logic App run history













Summary
This post explains how you can manage other ETL, next to SSIS, in your Data Warehouse using one orchestrator. In this case we execute an Azure Logic App using Azure Data Factory (V2).

Click here to see how you can also execute a SSIS package using Azure Logic Apps.