2/05/2008

Entityize and ASCIIify your XML text strings

2 is not equal to 3, not even for very large values of 2 -- Grabel's Law

I have a custom search facility that I use on a couple of different web sites. The search queries are stored in a database table, both to compute count statistics and to generate a standard XML sitemap for the search engines to nibble on. The problem is that I don't know what users are going to enter as search terms.

From a purely search standpoint, I really don't care; if they enter gobbledegook Unicode glop and get back no search results, fie on them, right?

However, I need to clean this stuff before I store it in the database, because otherwise, when I pull it out to generate my custom sitemap, I'm going to end up with illegal XML characters in the sitemap document. That means Google, Ask.com, Live.com and Yahoo are all going to choke on it, and I might as well not even have a sitemap if that happens.


So I put a couple of static cleanup methods into Global.asax, which conveniently allows them to be called from any page:

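// Requires: using System.Text;
// Usage: query = Global.Entityize(Global.ASCIIify(query));

// Returns a copy of str that contains only its ASCII characters (code points below 128).
public static string ASCIIify(string str)
{
    // Nothing to filter; this also avoids a NullReferenceException on null input.
    if (string.IsNullOrEmpty(str))
    {
        return str;
    }

    StringBuilder sb = new StringBuilder(str.Length);
    foreach (char c in str)
    {
        if (c < 128) // is within ASCII charset
        {
            sb.Append(c);
        }
    }
    return sb.ToString();
}

// Replaces the XML special characters <, >, ", ' and & with their entities.
public static string Entityize(string str)
{
    return System.Security.SecurityElement.Escape(str);
}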




ASCIIify simply strips out anything that's not in the ASCII character set. Of course, you may not want to throw away non-ASCII text, so your solution may be different.
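
If dropping every non-ASCII character is too aggressive, one possible variation is to remove only the characters that XML 1.0 forbids. The following is a minimal sketch, not part of the original code: the class name XmlTextCleaner and the method name StripIllegalXmlChars are hypothetical. It also removes the ASCII control characters below 0x20 (other than tab, line feed and carriage return), which pass the ASCII range check above but are not legal in an XML 1.0 document.

```csharp
using System;
using System.Text;

public static class XmlTextCleaner
{
    // Hypothetical helper: keeps every character allowed by the XML 1.0
    // "Char" production and drops everything else.
    public static string StripIllegalXmlChars(string str)
    {
        if (string.IsNullOrEmpty(str))
        {
            return str;
        }

        StringBuilder sb = new StringBuilder(str.Length);
        for (int i = 0; i < str.Length; i++)
        {
            char c = str[i];
            if (char.IsHighSurrogate(c) && i + 1 < str.Length && char.IsLowSurrogate(str[i + 1]))
            {
                // A valid surrogate pair encodes U+10000..U+10FFFF, which XML 1.0 allows.
                sb.Append(c).Append(str[i + 1]);
                i++;
            }
            else if (c == '\t' || c == '\n' || c == '\r'
                     || (c >= 0x20 && c <= 0xD7FF)
                     || (c >= 0xE000 && c <= 0xFFFD))
            {
                sb.Append(c);
            }
            // Anything else (control characters, lone surrogates, U+FFFE, U+FFFF) is dropped.
        }
        return sb.ToString();
    }
}
```

With this variant the call would become Global.Entityize(XmlTextCleaner.StripIllegalXmlChars(query)). Non-ASCII text then survives into the sitemap, so the sitemap file has to be written as UTF-8, which the sitemap protocol requires anyway.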



Entityize uses the convenient Escape method of the SecurityElement class, so there is no need to write complicated "replace" code. By combining the calls to the two methods in a single line, Global.Entityize(Global.ASCIIify(query)), I get a clean string that I can insert in the database and know that my Sitemap.xml files will be OK.
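
To make the effect concrete, here is a small self-contained sketch of the two steps combined into one method. The class name CleanupDemo, the method name Clean and the sample query are made up for illustration; the logic is the same ASCII filter followed by SecurityElement.Escape.

```csharp
using System;
using System.Security;
using System.Text;

public static class CleanupDemo
{
    // Hypothetical wrapper: drop non-ASCII characters, then escape the
    // XML special characters <, >, ", ' and &.
    public static string Clean(string query)
    {
        if (string.IsNullOrEmpty(query))
        {
            return query;
        }

        StringBuilder sb = new StringBuilder(query.Length);
        foreach (char c in query)
        {
            if (c < 128)
            {
                sb.Append(c);
            }
        }
        return SecurityElement.Escape(sb.ToString());
    }
}
```

Calling CleanupDemo.Clean("caf\u00e9 <b>\"fish & chips\"</b>") returns caf &lt;b&gt;&quot;fish &amp; chips&quot;&lt;/b&gt;. Because the stored value is already escaped, it should be written into the sitemap as-is; running it through an XML writer that escapes text would escape it a second time and turn &amp; into &amp;amp;.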