Entityize and ASCIIfy your XML text strings

2 is not equal to 3, not even for very large values of 2 -- Grabel's Law

I have a custom search facility that I use on a couple of different web sites. The search queries are stored in a database table in order to compute count statistics and also to generate a standard xml sitemap for the search engines to nibble on. Problem is, I don't know what users are going to enter as search terms.

From a purely search standpoint, I really don't care; if they enter gobbledegook Unicode glop and get back no search results, fie on them, right?

However, I need to clean this stuff before I store it in the database; otherwise, when I pull it out to generate my custom sitemap, I'm going to end up with illegal XML characters in the sitemap document. That means Google, Ask.com, Live.com and Yahoo are all going to choke on it, and I might as well not even have a sitemap if that happens.

So I put a couple of static cleanup methods into Global.asax, which conveniently allows them to be called from any page:

// Usage: query = Global.Entityize(Global.ASCIIify(query));
// Requires: using System.Text; (for StringBuilder)

public static string ASCIIify(string str)
{
    // Copy only the characters that fall within the 7-bit ASCII range.
    StringBuilder sb = new StringBuilder();
    char[] chars = str.ToCharArray();
    for (int i = 0; i < chars.Length; i++)
    {
        char c = chars[i];
        if ((int)c < 128) // is within ASCII charset
        {
            sb.Append(c);
        }
    }
    return sb.ToString();
}

public static string Entityize(string str)
{
    // Escape the XML-reserved characters as entities.
    return System.Security.SecurityElement.Escape(str);
}




ASCIIify simply strips out anything that's not in the ASCII character set. Of course, you may not want to do this, so your solution may be different.
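One caveat if the goal is a valid sitemap: a plain "below 128" test still lets ASCII control characters through, and XML 1.0 only permits tab, line feed and carriage return below 0x20. A minimal sketch of a stricter filter follows. The class name QueryCleaner and the method name XmlSafeAscii are made up for illustration and are not part of the code above.

```csharp
using System.Text;

// Hypothetical helper class; the name is an assumption, not from the post.
public static class QueryCleaner
{
    // Keeps only ASCII characters that are also legal in XML 1.0:
    // tab (0x09), line feed (0x0A), carriage return (0x0D)
    // and the printable range 0x20..0x7E.
    public static string XmlSafeAscii(string str)
    {
        if (string.IsNullOrEmpty(str))
        {
            return string.Empty;
        }

        StringBuilder sb = new StringBuilder(str.Length);
        foreach (char c in str)
        {
            bool isAllowedWhitespace = c == '\t' || c == '\n' || c == '\r';
            bool isPrintableAscii = c >= 0x20 && c < 0x7F;
            if (isAllowedWhitespace || isPrintableAscii)
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

For example, QueryCleaner.XmlSafeAscii("caf\u00e9 \u0001menu") returns "caf menu". If you would rather keep non-ASCII text, the framework's System.Xml.XmlConvert.IsXmlChar (available in .NET 4.0 and later) may be a starting point for a per-character test instead.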



Entityize uses the convenient Escape method of the SecurityElement class, which replaces <, >, ", ' and & with their XML entities, so there is no need to write complicated "replace" code. By combining the calls to the two methods in a single line, Global.Entityize(Global.ASCIIify(query)), I get a clean string that I can insert into the database, knowing that my Sitemap.xml files will be OK.
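To make the combined result concrete, here is a small sketch that folds both steps into a single method. The class name SitemapText and the method name Clean are hypothetical; the two steps are the same ones shown above.

```csharp
using System.Security;
using System.Text;

// Hypothetical wrapper; the names are assumptions, not from the post.
public static class SitemapText
{
    // Strips non-ASCII characters, then escapes the XML-reserved ones.
    public static string Clean(string query)
    {
        if (query == null)
        {
            return string.Empty;
        }

        StringBuilder sb = new StringBuilder(query.Length);
        foreach (char c in query)
        {
            if (c < 128) // is within ASCII charset
            {
                sb.Append(c);
            }
        }
        return SecurityElement.Escape(sb.ToString());
    }
}
```

SitemapText.Clean("AT&T \u00fcber <b>") returns "AT&amp;T ber &lt;b&gt;". Because Escape replaces every ampersand, running it on an already-escaped string would double-escape it (&amp; becomes &amp;amp;). So, assuming the sitemap is built by plain string concatenation, the stored value should be written out as-is rather than escaped a second time.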

Comments

  1. Anonymous, 11:02 AM

    I really dig your concoctive verbs :)

  2. Thanks! My latest method is called "FixWindowsUpdateAndHaveABeer". Unfortunately, I can't get it to work (the update part, that is).

  3. Peter,
    In a search query, only letters, digits, spaces, dashes, and quotes are allowed, right?
    Why not use Regex for the replacement?
    Something like this:
    --- C# code ---
    string cleanedSearchQuery = System.Text.RegularExpressions.Regex.Replace(rawQuery, @"[^\w\s\-""]+", " ").Trim();
    ---------------

  4. Peter,

    I don't understand: what's the purpose of keeping users' search queries in the database?

  5. I think you are referring to a generic "real search engine" query; my search "engine" accepts anything. But sure, you could certainly use Regex if it suits your fancy. Regarding why I store the search queries in the database, I think the post already explains that. It's in the first paragraph.

  6. Peter,
    Do you mean this:
    "in order to compute count statistics and also to generate a standard xml sitemap for the search engines to nibble on."?

    1) Why not simply save the "count" results that you got when you executed the query for the first time?

    2) Why do you generate a "standard XML sitemap" based on _users'_ search queries?

  7. By "Count" I meant the number of times that query was requested by visitors, not the count of search results. My bad, should have been more specific.

    I generate standard XML sitemaps so that the search engines will index my search results as if they were pages. These pages then come up in a search on a big 4 search engine, I get traffic, and people click on my ads. Your basic capitalism!

  8. 1) Do you include search results for only the most popular queries in your Google sitemaps?

    2) Isn't it content duplication?

  9. I suggest you look at the info page of a search engine like Mamma:

    http://www.mamma.com/info/help/tips.html#3

    to make your own assessment about "content duplication".

    There are at least a dozen others like this one that have been operating for years.

  10. I'd like to note that Mamma's popularity is on the decline.
    If you want the same fate for your web sites, you may follow their approach.

  11. I only mentioned Mamma as an example, Dennis. We can each measure the popularity, ascent or decline of our own websites quite easily.

  12. Peter,

    Let's compare the performance of our web sites.

    I added ittyurl.net, petesbloggerama.blogspot.com and my web site PostJobFree.com.

    Did I miss any of your important sites?

    You and I created IttyUrl.net and PostJobFree.com at about the same time (around the beginning of 2007).

  13. Dennis,
    The post is about how to clean search queries, not tit-for-tat "my site is better than yours" type of stuff.

    If you plug your website into Alexa's comparison graph, it doesn't even show up on the chart.

    I certainly wish you good luck with it though, looks like a good concept.

  14. Peter,

    Would you agree that any development activity should be considered within the context of business use?

    I think anything that helps to pick a better web site development strategy is useful.
    I pay lots of attention to how Google does business/technology, in part because Google is so popular.

    Regarding Alexa -- as far as I know -- their data is not reliable. I use Alexa only when I want to look at some traffic data that happened more than half a year ago.

  15. Dennis,
    I don't want to get into a blog comments "shooting match", which is apparently what you want to turn this into. I value your comments, as long as they are on topic.

  16. Peter, I'm trying to turn this conversation (and most of my other conversations) into a source of valuable insights.

    But if you prefer to stay strictly on the original topic, I cannot force you to change.

    Sorry that you perceive my comments as a "shooting match". They are not.

  17. Dennis,
    I am not in the least bit offended by your comments; in fact, they made me think, and I did some research and have a new post on the subject. However, I think a better place to conduct a conversation would be the forums at eggheadcafe.com. That's why we have them.

  18. Peter,
    what's the name of the forum that would be appropriate for discussing the business model around eggheadcafe.com, traffic analysis, etc.?

  19. Use the "Ask Dr. Dotnetsky" forum. It's a catch-all that is good for any subject.

  20. public static string ASCIIify2(string str) {
        StringBuilder sb = new StringBuilder();
        foreach (char c in str)
            if (c < (char)128) // is within ASCII charset
                sb.Append(c);

        return sb.ToString();
    }

  21. I really don't understand the point of the "Entityize" method, which is nothing but a wrapper. Couldn't you have merged the two methods into one, so that instead of return sb.ToString() you return SecurityElement.Escape(sb.ToString())?

  22. Andrei,
    Of course I could have merged the two methods. But, because they do slightly different things, I decided that I might want to use only one or the other. This is one of those design decisions where "Whatever you think is right" - is right.

