Search Suggest

Cleaning with Regular Expressions in SSIS

Case
I have a column with html tags in the value. How can I remove those tags?

Solution
You could try the make an expression in a derived column, but it would probably become an unreadable expression. There are a couple of alternatives like the Konesans RegexClean Transformation. Links to more third party and open source components are availble in a previous article. And if your source is a database then you could use a user defined function, but  in this case we will be using a .Net Regular Expression within a Script Component to remove the html tags.

File example:
"Name";"Message"
"Joost";"Hallo, <strong>this</strong> is bold."
"William";"This is my homepage: <a href='http://server.hoit.asia/'>http://server.hoit.asia/</a>."


1) Source
Add a Flat File Source Component for the textfile above.
Flat File Source
















2) Script Component
Add a Script Component type transformation below the Flat File Source and give it a suitable name.
Script Component - transformation






















3) Input Column
Edit the Script Component and select the message column as ReadWrite on the Input Columns tab.
Message as ReadWrite input column



















4) The Script
SSIS creates some standard methods, but we only use Input0_ProcessInputRow. You can remove the rest. We are also adding a new method that removes the html tags. It uses a .Net Regex method. You can search the web for your own regex pattern and change it in the extra method.
// C# code
using System;
using System.Data;
using System.Text.RegularExpressions; // Added
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
// Method that will be executed for each row.
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
// Replace the value of the message column
Row.Message = RemoveHtml(Row.Message);
}

// A string method that removes html tags
// with a regex pattern
public String RemoveHtml(String message)
{
// The pattern for a html tag
String htmlTagPattern = "<(.|\n)+?>";
// Create a regex object with the pattern
Regex objRegExp = new Regex(htmlTagPattern);
// Replace html tag by an empty string
message = objRegExp.Replace(message, String.Empty);
// Return the message without html tags
return message;
}
}

or VB.Net

' VB.Net code
Imports System
Imports System.Data
Imports System.Math
Imports System.Text.RegularExpressions ' Added
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper

<Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute()> _
<CLSCompliant(False)> _
Public Class ScriptMain
Inherits UserComponent

Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
' Replace the value of the message column
Row.Message = RemoveHtml(Row.Message)
End Sub

' A string method that removes html tags
' with a regex pattern
Public Function RemoveHtml(message As [String]) As [String]
' The pattern for a html tag
Dim htmlTagPattern As [String] = "<(.|" & vbLf & ")+?>"
' Create a regex object with the pattern
Dim objRegExp As New Regex(htmlTagPattern)
' Replace html tag by an empty string
message = objRegExp.Replace(message, [String].Empty)
' Return the message without html tags
Return message
End Function
End Class


Note: this script uses a very simple regular expression. Search the web for other/better regular expressions.

5) The Result
For testing purposes I added a Derived Columns and two data viewes on the Data Flow Paths.


















SSIS 2008 Package example

إرسال تعليق