I have a column with html tags in the value. How can I remove those tags?
Solution
You could try the make an expression in a derived column, but it would probably become an unreadable expression. There are a couple of alternatives like the Konesans RegexClean Transformation. Links to more third party and open source components are availble in a previous article. And if your source is a database then you could use a user defined function, but in this case we will be using a .Net Regular Expression within a Script Component to remove the html tags.
File example:
"Name";"Message" "Joost";"Hallo, <strong>this</strong> is bold." "William";"This is my homepage: <a href='http://server.hoit.asia/'>http://server.hoit.asia/</a>." |
1) Source
Add a Flat File Source Component for the textfile above.
Flat File Source |
2) Script Component
Add a Script Component type transformation below the Flat File Source and give it a suitable name.
Script Component - transformation |
3) Input Column
Edit the Script Component and select the message column as ReadWrite on the Input Columns tab.
Message as ReadWrite input column |
4) The Script
SSIS creates some standard methods, but we only use Input0_ProcessInputRow. You can remove the rest. We are also adding a new method that removes the html tags. It uses a .Net Regex method. You can search the web for your own regex pattern and change it in the extra method.
// C# code
using System;
using System.Data;
using System.Text.RegularExpressions; // Added
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
// Method that will be executed for each row.
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
// Replace the value of the message column
Row.Message = RemoveHtml(Row.Message);
}
// A string method that removes html tags
// with a regex pattern
public String RemoveHtml(String message)
{
// The pattern for a html tag
String htmlTagPattern = "<(.|\n)+?>";
// Create a regex object with the pattern
Regex objRegExp = new Regex(htmlTagPattern);
// Replace html tag by an empty string
message = objRegExp.Replace(message, String.Empty);
// Return the message without html tags
return message;
}
}
or VB.Net
' VB.Net code
Imports System
Imports System.Data
Imports System.Math
Imports System.Text.RegularExpressions ' Added
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper
<Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute()> _
<CLSCompliant(False)> _
Public Class ScriptMain
Inherits UserComponent
Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
' Replace the value of the message column
Row.Message = RemoveHtml(Row.Message)
End Sub
' A string method that removes html tags
' with a regex pattern
Public Function RemoveHtml(message As [String]) As [String]
' The pattern for a html tag
Dim htmlTagPattern As [String] = "<(.|" & vbLf & ")+?>"
' Create a regex object with the pattern
Dim objRegExp As New Regex(htmlTagPattern)
' Replace html tag by an empty string
message = objRegExp.Replace(message, [String].Empty)
' Return the message without html tags
Return message
End Function
End Class
Note: this script uses a very simple regular expression. Search the web for other/better regular expressions.
5) The Result
For testing purposes I added a Derived Columns and two data viewes on the Data Flow Paths.
SSIS 2008 Package example