Case
I'm running my SSIS packages in Azure and my source is a flat file in an Azure Blob Storage container. Therefore I use the Azure Blob Source as the source in my Data Flow Task. However, this source has only a few formatting options compared to the Flat File Source (and its connection manager). I want to specify things like text qualifiers and data types. How do I do that?
Azure Blob Source has too few options

Solution
With the current version of the Azure Blob Source you can only specify the column separator, but there is a workaround available. Your Integration Runtime (IR) that is hosted in Azure Data Factory (ADF) is actually a virtual machine with Integration Services on it. A simple Script Task running on that IR reveals the drives and their available space. It shows that we have several drives available on that virtual machine.
Log with drive details

// C# Code to replace your Main() method
public void Main()
{
    // Create array with information about all drives
    System.IO.DriveInfo[] allDrives = System.IO.DriveInfo.GetDrives();

    // Create string to store a message
    String DriveDetails = "";

    // Loop through all drives to get info about them
    foreach (System.IO.DriveInfo d in allDrives)
    {
        // Get drive name (C:\) and type (Fixed)
        DriveDetails = d.Name + "(" + d.DriveType + ")" + Environment.NewLine;

        // If the drive is ready you can get more details
        if (d.IsReady == true)
        {
            DriveDetails += " - Volume label: " + d.VolumeLabel + Environment.NewLine;
            DriveDetails += " - File system: " + d.DriveFormat + Environment.NewLine;
            DriveDetails += " - Available space to current user: " + d.AvailableFreeSpace + Environment.NewLine;
            DriveDetails += " - Total available space: " + d.TotalFreeSpace + Environment.NewLine;
            DriveDetails += " - Total size of drive: " + d.TotalSize;
        }

        // Fire the message as a warning to make it stand out between other messages
        Dts.Events.FireWarning(0, "Details", DriveDetails, "", 0);
    }

    // End Script Task
    Dts.TaskResult = (int)ScriptResults.Success;
}

So the solution is to first use the Azure Blob Download Task to download the file from the Blob Storage Container to the Virtual Machine. After that you can use a regular Flat File Source in the Data Flow Task.
Azure Blob Download Task

I'm not sure what the purpose of these disks is, or whether one of them is for non-persistent data (disks that are automatically cleaned), but I recommend using the E drive to temporarily store the downloaded files and to clean up afterwards.
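If you want to do that clean-up with a Script Task as well, a minimal sketch could look like the code below. The variable name User::downloadedFile is just an assumption; it should contain the full path of the downloaded file.
// C# sketch to clean up the downloaded file (hypothetical variable User::downloadedFile)
public void Main()
{
    // Get the full path of the downloaded file from an SSIS variable
    string path = Dts.Variables["User::downloadedFile"].Value.ToString();

    // Delete the file if it (still) exists
    if (System.IO.File.Exists(path))
    {
        System.IO.File.Delete(path);
    }

    Dts.TaskResult = (int)ScriptResults.Success;
}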

Windows Temp folder
An alternative way to pick a temporary folder on your IR machine is to use a very simple Script Task with only one line of code that retrieves the path of the Windows temp folder. The path looks something like D:\Users\WATASK_1\AppData\Local\Temp\. If you store this path in an SSIS string variable, then you can use that variable in expressions on your tasks and on the Flat File connection manager. After the next reboot Windows removes all old files in this folder.
// C# code to replace your Main() method
public void Main()
{
    // Store the path of the Windows temp folder in an SSIS string variable
    Dts.Variables["User::tempPath"].Value = System.IO.Path.GetTempPath();

    Dts.TaskResult = (int)ScriptResults.Success;
}
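If you then put an expression like @[User::tempPath] + "myFile.csv" on the ConnectionString property of the Flat File connection manager (the file name is just an illustration), the package reads the file from the temp folder on the IR machine.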

Script Task that fills a variable

Case
I want to use the users from my Active Directory in my Data Warehouse for an Employee Dimension. How can I accomplish that in SSIS?

Solution
With a little .Net scripting in a Script Component you can create an Active Directory source in your Data Flow Task. Before you start you need to figure out which fields are available in your Active Directory. If you don't know the field names, you could set a breakpoint on the foreach loop and add a watch on the result variable. Then you can browse through the properties of that variable to find all available fields.

1) Script Component Source
Create a Data Flow Task and add a Script Component and choose "Source" as the type. After that give your source a suitable name like "SRC_SCR - Active Directory Users".
Script Component - Source

2) Output columns
Next edit the Script Component and go to the Inputs and Outputs pane. This is where we need to specify all the new output columns and data types. For my Active Directory example I have added 5 string (DT_STR) columns: Fullname, Firstname, Surename, Department and Manager. You have to figure out the required length. You could start with the default 50 or change it to a safer 255.
Add new output columns

3) Edit script
Go back to the first page, choose C# as your scripting language and then hit the Edit Script button to open the VSTA environment.
Edit script

4) Reference
To tell our script about Active Directory, we first need to add a new reference to the .NET assembly System.DirectoryServices. In the Solution Explorer right-click References and choose Add Reference... Then scroll down, select System.DirectoryServices and click OK to confirm.
Add reference

5) The Script - Namespaces
Under the namespaces, add a new using for the newly referenced assembly: using System.DirectoryServices; This allows you to use shorter names in the code.
Add using

6) The Script - CreateNewOutputRows
Now replace your CreateNewOutputRows method with the one below and add the GetPropertyValue method within your class tags, below the existing methods. You can remove the PreExecute and PostExecute methods since we don't need them in this example.

The CreateNewOutputRows method adds new rows to the buffer by calling Output0Buffer.AddRow(); within a loop construction and then fills all columns with data from the Active Directory. Before the foreach loop you find the code that connects to the Active Directory with an optional filter.
public override void CreateNewOutputRows()
{
    // Specify the connection string of your domain
    // @mycompany.com => LDAP://DC=mycompany,dc=com
    // Consider using a variable or parameter instead
    // of this hardcoded value. On the other hand:
    // how many times does your domain change?
    string domainConnectionString = "LDAP://DC=ilionx,dc=com";

    using (DirectorySearcher ds = new DirectorySearcher(new DirectoryEntry(domainConnectionString)))
    {
        ds.Filter = "(&" +
                    "(objectClass=user)" + // Only users and not groups
                    "(department=*)" +     // All departments
                    "(givenname=j*)" +     // Firstname starts with j
                    ")";
        // See the other properties of ds for more options like PageSize.
        //ds.PageSize = 1000;

        // Find all persons matching your filter
        using (SearchResultCollection results = ds.FindAll())
        {
            // Loop through all rows of the search results
            foreach (SearchResult result in results)
            {
                // Add a new row to the buffer
                Output0Buffer.AddRow();
                // Fill all columns with the value from the Active Directory
                Output0Buffer.Fullname = GetPropertyValue(result, "cn");
                Output0Buffer.Surename = GetPropertyValue(result, "sn");
                Output0Buffer.Department = GetPropertyValue(result, "department");
                Output0Buffer.Manager = GetPropertyValue(result, "manager");
                Output0Buffer.Firstname = GetPropertyValue(result, "givenname");
            }
        }
    }
}

// Extra method to avoid having an if construction around each column.
// It checks whether it can find a value. If not, it returns an empty string.
private static string GetPropertyValue(SearchResult Results, string Property)
{
    // A null value results in count zero
    if (Results.Properties[Property].Count > 0)
    {
        return Results.Properties[Property][0].ToString();
    }
    else
    {
        return "";
    }
}


7) The result
To test the result add a dummy Derived Column and a Data Viewer to see the values. If it works you can add the rest of the components to the Data Flow Task to fill your dimension table.
The result of my script

Summary
A very simple and short script to get data from your Active Directory. For more filter examples visit MSDN. For very large Active Directories you have to play with the PageSize property.
Note that there are also scriptless options, like an ADO.Net source or third-party components. Each solution has its own pros and cons.

Case
I have a large XML file with Orders and Orderlines which I want to (merge) join into a new destination. To join the two outputs I need to order them, but the Sort transformation takes too much time. Is there a faster alternative?
XML Source with two joined outputs

Solution
The solution is surprisingly simple: the outputs are already sorted and you only have to tell SSIS that (similar to a source with an ORDER BY in the query).

For XML files with multiple levels (first for orders and second for orderlines) like below, SSIS will create two output ports.
XML Sample
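A minimal sketch of what such a two-level file could look like (element and attribute names are just an illustration):
<Orders>
  <Order Id="1" Customer="Joost">
    <OrderLine Product="A" Quantity="2" />
    <OrderLine Product="B" Quantity="1" />
  </Order>
  <Order Id="2" Customer="Jack">
    <OrderLine Product="C" Quantity="5" />
  </Order>
</Orders>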

The outputs will have an extra bigint column which allows you to connect the orderlines to the correct order.
Two outputs with additional ID column

Instead of using these ID columns in the SORT transformations, you can also use the advanced editor of the XML source to tell SSIS that these columns are already sorted. Right click the XML source and choose 'Show Advanced Editor...'.
Show Advanced Editor...

Then go to the last tab, 'Input and Output Properties', and select the OrderLine output. In the properties of this output you can tell SSIS that the output is sorted.
Set IsSorted to true

Next expand OrderLine and then Output Columns and click on the additional ID column 'Order_Id'. In its properties locate the SortKeyPosition and change it from 0 to 1.
Set SortKeyPosition to 1

Repeat this for the second output called 'Order' and then close the advanced editor. If you still have the SORT transformations, you will notice the yellow triangle with the exclamation mark in it. It tells you that the data is already sorted and that you can remove the SORT transformations.
And if you edit the Data Flow Path and view the metadata you will see that the column is now sorted.
Sorted! Remove the SORT transformations!

Conclusion
The solution is very simple and perhaps this should have been the default sort key position anyway? It smells like a bug to me...
No sorts!

SSIS 2016 CTP 2.3 introduces an OData Source Component and an OData Connection Manager with V4 support. It now supports:
  • OData V3 protocol for ATOM and JSON data formats.
  • OData V4 protocol for JSON data format.
The 2014 version only supports V3.
SSIS 2014: Data at the root level is invalid. Line 1, position 1. (System.Xml)
 
First the OData Connection Manager. You can test this with the test URL: http://services.odata.org/V4/Northwind/Northwind.svc/
More info here.
OData Connection Manager with V4 support

And there is the OData Source Component that uses the OData Connection Manager. More info here.
OData Source

And now in action with a Data Viewer:
OData with Data Viewer

Case
The path of the XSD file in the XML Source component is hardcoded, but the path on my production environment is different from the one on my development environment. The XML source doesn't support expressions. How do I make this XSD path configurable?
XML Source Connection Manager Page
without Connection Managers

Solution
Unfortunately the XML source component lacks some very basic functionality, like the use of a Connection Manager (although ironically it's called the Connection Manager page). The source can use a variable as input, but there is no such option for the XSD file.
XML Task can use Connection Managers

1) Find Expressions
The XML Source Component doesn't support expressions, but the Data Flow Task itself does. Go to the properties of the Data Flow Task, locate Expressions and click on the ... button.

Select Data Flow Task and press F4 to get properties


2) Select Property
Locate the XMLSchemaDefinition property of your XML Source in the Property Expression Editor and add an expression on it by clicking on the ... button.
Property Expression Editor

3) Add Expression
Now you can replace its value with a variable, or with a parameter if you use SSIS 2012 or above.
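For example, the expression can simply reference an SSIS string variable such as @[User::xsdPath], or from SSIS 2012 onwards a package parameter such as @[$Package::xsdPath] (both names are just illustrations).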
Expression Builder

That's it. Now you have a workaround for the absence of a real Connection Manager. An alternative could be to use a Script Component as XML source or an XML file with an inline XSD schema.
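For that first alternative, a minimal sketch of a Script Component source that reads the XML itself could look like the code below. The FILE connection name myXmlFile, the XPath //Order and the output columns Id and Name are all just assumptions for this illustration:
// C# sketch (assumed names: FILE connection myXmlFile, output columns Id and Name)
public override void CreateNewOutputRows()
{
    // Load the XML file via a FILE connection manager
    System.Xml.XmlDocument doc = new System.Xml.XmlDocument();
    doc.Load(this.Connections.myXmlFile.ConnectionString);

    // Loop through all Order elements and copy them to the output buffer
    foreach (System.Xml.XmlNode order in doc.SelectNodes("//Order"))
    {
        Output0Buffer.AddRow();
        Output0Buffer.Id = order.SelectSingleNode("Id").InnerText;
        Output0Buffer.Name = order.SelectSingleNode("Name").InnerText;
    }
}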
Case
I have an Excel (xls) file as a source and it has cell protection turned on to prevent editing the wrong cells. It's an official form from a national care authority, so I can't change that. I don't even have the password.

When I try to read that file in SSIS it gives me an error:
Error at Package.dtsx [connection manager "xxxxx"]: SSIS Error Code DTS_E_OLEDBERROR. An OLE DB error has occurred. Error code: 0x80004005.
An OLE DB record is available. Source: "Microsoft JET Database Engine" Hresult: 0x80004005
Description: Could not decrypt file.

Error at xxxxx [Excel Source]: SSIS Error Code DTS_E_CANNOTACQUIRECONNECTIONFROMCONNECTIONMANAGER. The AcquireConnection method call to the connection manager "xxxxx" failed with error code 0xC0202009.

SSIS 2005 - 2012: Could not decrypt file

The error doesn't occur when the file is still opened in Excel, but that's not really an option on a server.

Solution
You either have to remove the cell protection or convert the XLS file to XLSX (The ACE OLE DB provider doesn't have the same problem as the JET OLE DB provider). Also see/vote for this Microsoft Connect Bug report.
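As a side note, an ACE-based connection string for an xlsx file looks something like this (path and options are just an illustration):
Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\MyFolder\MyFile.xlsx;Extended Properties="Excel 12.0 XML;HDR=YES";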

But can you do this automatically if you can't (or don't want to) ask the people who provided the Excel files, or if you have a whole bunch of those files? Here are a couple of options I have examined:

1) Interop Excel in Script Task
There is a Microsoft library (Microsoft.Office.Interop.Excel) available that can do this for you, but the BIG downside is that it requires an Office installation! Even if you download the Redistributable Primary Interop Assemblies (PIA), you still need Office installed. PIA is only a .Net wrapper that lets you communicate from .Net with the COM dll's of Office.
A server-side Office installation is a bad practice and Microsoft doesn't support/recommend an Office installation on a server.

2) Third party / open source dll's  in Script Task
There are a lot of third-party and open source dll's for Excel. A .Net colleague of mine tried a whole bunch of them, but they either don't work for the old (xls) Excel files or they don't have an option to convert to xlsx or to remove the cell protection.

3) Cozyroc Excel Task
The COZYROC SSIS+ Library has an Excel Task that can do the conversion to XLSX for you. It isn't free, but I think it's worth it. Hiring an external BI/ETL consultant for two days is probably more expensive. Moreover, you get a whole bunch of other cool tasks and transformations with it, and you can test them freely within BIDS or SSDT before you buy them. This is my solution with the Cozyroc tasks (with a little help from Cozyroc consultant Ivan Peev).

Note: Although the Cozyroc Excel Task works like a charm, I still think this is a bug that Microsoft should fix. Cell protection is an edit prevention and I'm not editing...

Tip: You need to know the password to remove the cell protection in Excel. If you don't have it, use this VBA macro to get it.
Case
I want to use the Windows Eventlog as a source in SSIS.

Solution
You can either read saved eventlog files (*.evt / *.evtx) with, for example, this open source component, or you can read directly from the eventlog itself. Let's elaborate on the latter option.

1) Script Component
Go to your Data Flow and add a Script Component. The Type should be Source.
Script Type Source

2) Add Columns
Edit the Script Component and go to the Inputs and Outputs tab. Add four columns at the output section:
  1. EntryType (string 255)
  2. Source (string 255)
  3. Message (string 4000)
  4. TimeGenerated (database timestamp)
Add output columns

This example uses only four columns, but there are more columns available in the eventlog; the EventLogEntry class also exposes properties like MachineName, InstanceId, UserName and Category.

3) The Script
Go to the first tab and hit the Edit Script... button. This script is in the default C# language. SSIS will create three methods: PreExecute, PostExecute and CreateNewOutputRows. Remove the first two, because we don't need them. Replace the remaining method with the one below and add the extra using directive at the top.
// C# code
using System;
using System.Data;
using System.Diagnostics; // Added
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    public override void CreateNewOutputRows()
    {
        // Get all events from the Application(/System/Security) log from the local server (.)
        EventLog myEventLog = new EventLog("Application", ".");

        // Create variable to store the entry
        EventLogEntry myEntry;

        // Loop through all entries (oldest first)
        for (int i = 0; i < myEventLog.Entries.Count; i++)
        {
            // Get single entry
            myEntry = myEventLog.Entries[i];

            // Add new record
            this.Output0Buffer.AddRow();

            // Fill columns
            this.Output0Buffer.EntryType = myEntry.EntryType.ToString();
            this.Output0Buffer.Source = myEntry.Source;
            this.Output0Buffer.TimeGenerated = myEntry.TimeGenerated;
            // Take a maximum of 4000 characters
            this.Output0Buffer.Message = myEntry.Message.Substring(0, Math.Min(myEntry.Message.Length, 4000));
        }
    }
}

Note 1: With variables you can avoid hard-coded strings in your Script Component.
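For example, a minimal sketch assuming you added an SSIS string variable LogName to the ReadOnlyVariables of the Script Component (the variable name is just an illustration):
// Hypothetical: read the log name from an SSIS variable instead of hardcoding it
EventLog myEventLog = new EventLog(Variables.LogName, ".");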

Note 2: You should add some error handling to your script (Try Catch) to avoid unexpected errors (authorization, too large texts, etc.).


4) The result
Add a random transformation component after your source and add a data viewer between them to see the result.
The result

Note: It's also possible to write to the eventlog with .Net code, but that is not a best practice. Use the SSIS script event logging instead!
Case
This is a follow-up on another strangely formatted file article: my CSV file has split rows. There is only one attribute on each row:
id,1
name,Joost
address,Street 2
id,2
name,Jack
address,Avenue 4b
id,3
name,William
address,Plaza 5

I want to convert it to:
id,name,address
1,Joost,Street 2
2,Jack,Avenue 4b
3,William,Plaza 5

The Pivot transformation isn't usable because the records aren't grouped with a key. What's the solution?

Solution
You can use a Script Component as a source to solve this.

1) Script Component
Go to your Data Flow and add a Script Component. The Type should be Source.
Script Type Source

2) Add Columns
Edit the Script Component and go to the Inputs and Outputs tab. Add three columns at the output section:
  1. ID (four-byte signed integer)
  2. Name (string)
  3. Address (string)
Add output columns

3) Connection Manager
Go to the Connection Managers tab and add a connection (1) to avoid a hardcoded connection string in your Script. The Connection Manager Type (2) should be FILE and the Usage type (3) should be Existing file. After that change the default name from "Connection" to "myFile".
Add FILE connection to avoid hardcoded connection strings

4) The Script
Go to the first tab and hit the Edit Script... button. This script is in the default C# language. SSIS will create three methods: PreExecute, PostExecute and CreateNewOutputRows. Remove the first two, because we don't need them. Replace the remaining method with the one below and add the extra using directive at the top.
// C# code
using System;
using System.Data;
using System.IO; // Added
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    public override void CreateNewOutputRows()
    {
        // Variables to store column values
        int ID = 0;          // Column 1
        string Name = "";    // Column 2
        string Address = ""; // Column 3
        int NumberOfColumns = 3;

        // Counter to keep track of the current column
        int counter = 0;

        // Read file (use the FILE connection added in the script component, named myFile)
        using (StreamReader sr = new StreamReader(this.Connections.myFile.ConnectionString))
        {
            String line;
            // Read lines from the file until the end of the file is reached.
            while ((line = sr.ReadLine()) != null)
            {
                // Increase the counter
                counter++;

                // Split the line into columns
                string[] columns = line.Split(',');

                // Fill the right variable
                if (counter.Equals(1))
                {
                    // Column 1
                    ID = System.Convert.ToInt32(columns[1]);
                }
                else if (counter.Equals(2))
                {
                    // Column 2
                    Name = columns[1];
                }
                else if (counter.Equals(3))
                {
                    // Column 3
                    Address = columns[1];
                }

                // Add a row if the last column has been reached
                if (counter.Equals(NumberOfColumns))
                {
                    // Add one new row and fill the columns
                    this.Output0Buffer.AddRow();
                    Output0Buffer.ID = ID;
                    Output0Buffer.Name = Name;
                    Output0Buffer.Address = Address;

                    // Last column, reset counter
                    counter = 0;
                }
            }
        }
    }
}

5) The result
Add a random transformation component after your source and add a data viewer between them to see the result.
The result

Note: You should add some error handling to your script (Try Catch) to avoid unexpected errors (no file, other delimiter, empty rows, etc.).

Let me know if you have another strangely formatted file. Maybe I can find a solution to process it.
Case
My CSV file has some rows where not all columns at the end of the row are filled:

"test1";"abc";"xyz";"123"
"test2";"cba";"zyx";"321"
"test3";"abc"
"test4";"efg";"zyx"
"test5";"cba";"zyx";"321"

If you use a flat file connection manager, you will get strange things:
Some rows will 'merge'

And if the last row is incomplete you will get:

Is there a workaround for this poorly formatted file?

Solution
You can use a Script Component as a source to solve this.
* UPDATE: This has been solved in SQL Server 2012 RC0 *

1) Script Component
Go to your Data Flow and add a Script Component. The Type should be Source.
Script Type Source

2) Add Columns
Edit the Script Component and go to the Inputs and Outputs tab. Add four string columns at the output section (this example only uses strings).
Add output columns

3) Connection Manager
Go to the Connection Managers tab and add a connection (1) to avoid a hardcoded connection string in your Script. The Connection Manager Type (2) should be FILE and the Usage type (3) should be Existing file. After that change the default name from "Connection" to "myFile".
Add FILE connection to avoid hardcoded connection strings

4) The Script
Go to the first tab and hit the Edit Script... button. This script is in the default C# language. SSIS will create three methods: PreExecute, PostExecute and CreateNewOutputRows. Remove the first two, because we don't need them. Replace the remaining method with the one below and add the extra using directive at the top.
// C# code
using System;
using System.Data;
using System.IO; // Added
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    public override void CreateNewOutputRows()
    {
        // Read file (use the FILE connection added in the script component, named myFile)
        using (StreamReader sr = new StreamReader(this.Connections.myFile.ConnectionString, System.Text.Encoding.UTF7)) // Encoding is optional
        {
            String line;
            // Read lines from the file until the end of the file is reached.
            while ((line = sr.ReadLine()) != null)
            {
                // Split the line into columns
                string[] columns = line.Split(';');

                // Add one new row
                this.Output0Buffer.AddRow();

                // Fill columns, but check if they exist
                if (columns.Length > 0)
                {
                    // Remove the " at the start and end of the string
                    // with a trim or use a substring.
                    Output0Buffer.MyFirstColumn = columns[0].TrimStart('"').TrimEnd('"');
                }
                if (columns.Length > 1)
                {
                    Output0Buffer.MySecondColumn = columns[1].TrimStart('"').TrimEnd('"');
                }
                if (columns.Length > 2)
                {
                    Output0Buffer.MyThirdColumn = columns[2].TrimStart('"').TrimEnd('"');
                }
                if (columns.Length > 3)
                {
                    Output0Buffer.MyFourthColumn = columns[3].TrimStart('"').TrimEnd('"');
                }
            }
        }
    }
}
or VB.Net
' VB.Net code
Imports System
Imports System.Data
Imports System.IO 'Added
Imports System.Math
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper

<Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute> _
<CLSCompliant(False)> _
Public Class ScriptMain
    Inherits UserComponent

    Public Overrides Sub CreateNewOutputRows()
        ' Read file (use the FILE connection added in the script component, named myFile)
        Using sr As New StreamReader(Me.Connections.myFile.ConnectionString, System.Text.Encoding.UTF7) ' Encoding is optional

            ' Create variable to store a line of text
            Dim line As String = sr.ReadLine

            ' Read lines from the file until the end of the file is reached.
            Do While (Not line Is Nothing)

                ' Split the line into columns
                Dim columns As String() = line.Split(";"c)

                ' Add one new row
                Me.Output0Buffer.AddRow()

                ' Fill columns, but check if they exist
                If columns.Length > 0 Then
                    ' Remove the " at the start and end of the string
                    ' with a trim or use a substring.
                    Output0Buffer.MyFirstColumn = columns(0).TrimStart(""""c).TrimEnd(""""c)
                End If
                If columns.Length > 1 Then
                    Output0Buffer.MySecondColumn = columns(1).TrimStart(""""c).TrimEnd(""""c)
                End If
                If columns.Length > 2 Then
                    Output0Buffer.MyThirdColumn = columns(2).TrimStart(""""c).TrimEnd(""""c)
                End If
                If columns.Length > 3 Then
                    Output0Buffer.MyFourthColumn = columns(3).TrimStart(""""c).TrimEnd(""""c)
                End If

                ' Goto next line
                line = sr.ReadLine
            Loop
        End Using
    End Sub
End Class

5) The result
Add a random transformation component after your source and add a data viewer between them to see the result.
Notice the NULL values

Note: You should add some error handling to your script (Try Catch) to avoid unexpected errors (no file, other delimiter, empty rows, etc.).

Note: If you are using this in a Foreach Loop, you don't need the connection manager. The variable from the Foreach Loop already contains the file path.
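A sketch of that variation, assuming the Foreach Loop maps the current file path to an SSIS string variable filePath that is listed in the ReadOnlyVariables of the Script Component:
// Hypothetical: use the Foreach Loop variable instead of a FILE connection
using (StreamReader sr = new StreamReader(Variables.filePath))
{
    String line;
    while ((line = sr.ReadLine()) != null)
    {
        // Process the line as shown above
    }
}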