Spam Filters Gone Wild 2007

One of the primary concerns you will have as a blogger or developer of content sites is filtering porn and spam. The approach I've taken with ittyurl.net is basically to have a database table, "BADWORDS". This gets loaded into a string array at startup. Any time somebody wants to add a link, the application spiders the page anyway to collect tags and metadata, so I run the page through my IsBadWord method. The process is very fast and it has worked extremely well: since about January 2007, when I put up the first beta of the site, I've only had to manually remove three or four links out of the several thousand that users have added.

Sometimes the sneaky little scumbags redirect to their porn/spam sites from a "nice" page, and that of course is something you cannot foresee (unless, of course, you have your WebRequest follow redirects -- it just goes to show you they will stop at nothing in the dirty tricks department!). Other times it was just drug stuff ("Phentermine", "Xanax", "Viagra" - you know the routine) and those weren't in my BADWORDS table -- although they are now! Here's some sample code that loads the table:

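public class Global : HttpApplication
{
    public static string[] BadWords;

    public static void PopulateBadWords()
    {
        DataTable dtBadWords = null;
        try
        {
            // The data-access call was lost from this listing; SqlHelper.ExecuteDataset
            // and connectionString are assumptions, not the original code.
            DataSet dsBadWords = SqlHelper.ExecuteDataset(connectionString,
                "dbo.GetBadWords", null);
            dtBadWords = dsBadWords.Tables[0];
        }
        catch (Exception ex)
        {
            // Handler body not shown in the original; rethrow so we never continue with a null table.
            throw new ApplicationException("Could not load BADWORDS", ex);
        }
        BadWords = new string[dtBadWords.Rows.Count];
        for (int i = 0; i < dtBadWords.Rows.Count; i++)
            BadWords[i] = (String) dtBadWords.Rows[i][0];
        HttpContext.Current.Application["BadWords"] = BadWords;
    }
}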
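
The IsBadWord method itself isn't shown here. As a rough idea of what such a check could look like, here is a minimal sketch. It is not the actual ittyurl.net code: the class name BadWordFilter, the method signature, and the choice of case-insensitive whole-word matching are all assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BadWordFilter
{
    // Returns true if any entry in badWords occurs in pageText as a whole word,
    // ignoring case. Whole-word matching is an assumption of this sketch.
    public static bool IsBadWord(string pageText, string[] badWords)
    {
        if (string.IsNullOrEmpty(pageText) || badWords == null)
            return false;

        foreach (string word in badWords)
        {
            if (string.IsNullOrEmpty(word))
                continue;

            // \b anchors the match at word boundaries; Regex.Escape keeps any
            // punctuation in a list entry from being read as regex syntax.
            string pattern = @"\b" + Regex.Escape(word) + @"\b";
            if (Regex.IsMatch(pageText, pattern, RegexOptions.IgnoreCase))
                return true;
        }
        return false;
    }
}
```

You would call it with the text of the spidered page and the cached array, for example BadWordFilter.IsBadWord(pageText, Global.BadWords). Matching on word boundaries keeps a list entry from firing inside a longer, innocent word, but it does nothing for a legitimate page that really does contain a listed word.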

But other times you get a catch-22. This article, "Bloggers Bring in the Big Bucks", wouldn't go in because one of the characters' names is "Heather Cocks" (good God, what a name to have!). And of course, my badwords filter found it and rejected the entry. That's too bad, because I actually built the site for my own use, as a way to easily store, tag and search links. In the process, I realized others might find it useful, so I expanded the concept and made it public. It even has a web services API: http://ittyurl.net/IttyUrlService.asmx

Another problem: it's one thing to have a CAPTCHA on your publication to deter automated spam bots. But what about maniacs? There are actually mentally disturbed people who deliberately hate-spam comments on blogs. I don't have any issue with people posting comments that disagree with my views on something; that's perfectly fine with me. I put my ego in my back pocket and publish their comment. But there are people who mount ad hominem attacks, deliberately seeking out any number of posts and putting their trash on them.

So, moderated comments come into play. It's an inconvenience, but I'm online most of the time, so I get an email and approve the comment right away. I think the best answer may be a combination of spam filters and moderation. In other words, if a submission contains a bad word, it is held instead of being rejected outright, and you get an email that lets you look at the content and override your spam filter.
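
Here is a minimal sketch of that filter-plus-moderation idea. Everything in it is hypothetical (the ModerationGate class, the SubmissionStatus enum, and the notifyModerator callback are names I made up for the example), and the bad-word check is a plain case-insensitive substring search standing in for whatever filter you actually use.

```csharp
using System;

public enum SubmissionStatus { Published, HeldForReview }

public static class ModerationGate
{
    // Returns the first bad word found in text (case-insensitive substring
    // search), or null if the text is clean.
    public static string FindBadWord(string text, string[] badWords)
    {
        if (string.IsNullOrEmpty(text) || badWords == null)
            return null;

        foreach (string word in badWords)
        {
            if (!string.IsNullOrEmpty(word) &&
                text.IndexOf(word, StringComparison.OrdinalIgnoreCase) >= 0)
                return word;
        }
        return null;
    }

    // Clean submissions are published immediately. Flagged ones are held and
    // the moderator is notified, so a human can override the filter.
    public static SubmissionStatus Submit(string text, string[] badWords,
                                          Action<string> notifyModerator)
    {
        string hit = FindBadWord(text, badWords);
        if (hit == null)
            return SubmissionStatus.Published;

        if (notifyModerator != null)
            notifyModerator("Held for review (matched \"" + hit + "\"): " + text);

        return SubmissionStatus.HeldForReview;
    }
}
```

On a real site the notifyModerator callback would send the email, and the held item would sit in a queue until somebody approves or deletes it. The point of the design is that a filter hit becomes a question for a human rather than a final verdict.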

Bayesian filtering is another possibility - I've seen some pretty interesting C# code with Bayesian filtering, but as we all know these filters need to be "trained" - much like a neural network. In my case, that's probably overkill.
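
To give a feel for what the "training" involves, here is a bare-bones naive Bayes sketch. It is not the C# code I mentioned seeing; the class name NaiveBayesSpamFilter, the crude tokenizer, and the add-one smoothing are my own assumptions, and a real filter would need a lot more training data and care than this.

```csharp
using System;
using System.Collections.Generic;

public class NaiveBayesSpamFilter
{
    private readonly Dictionary<string, int> spamCounts = new Dictionary<string, int>();
    private readonly Dictionary<string, int> hamCounts = new Dictionary<string, int>();
    private readonly HashSet<string> vocabulary = new HashSet<string>();
    private int spamTokens, hamTokens, spamDocs, hamDocs;

    // Crude tokenizer: lower-case, split on whitespace and common punctuation.
    private static string[] Tokenize(string text)
    {
        return text.ToLowerInvariant().Split(
            new char[] { ' ', '\t', '\r', '\n', '.', ',', '!', '?', ';', ':', '"', '(', ')' },
            StringSplitOptions.RemoveEmptyEntries);
    }

    // "Training" is just counting how often each token shows up in each class.
    public void Train(string text, bool isSpam)
    {
        Dictionary<string, int> counts = isSpam ? spamCounts : hamCounts;
        foreach (string token in Tokenize(text))
        {
            int n;
            counts.TryGetValue(token, out n);   // n stays 0 for a new token
            counts[token] = n + 1;
            vocabulary.Add(token);
            if (isSpam) spamTokens++; else hamTokens++;
        }
        if (isSpam) spamDocs++; else hamDocs++;
    }

    // Returns an estimate of P(spam | text) between 0 and 1.
    public double SpamProbability(string text)
    {
        if (spamDocs == 0 || hamDocs == 0)
            throw new InvalidOperationException("Train on both spam and non-spam first.");

        double total = spamDocs + hamDocs;
        double logSpam = Math.Log(spamDocs / total);   // class priors
        double logHam = Math.Log(hamDocs / total);
        int v = vocabulary.Count;

        foreach (string token in Tokenize(text))
        {
            if (!vocabulary.Contains(token))
                continue;                              // ignore never-seen tokens

            int s, h;
            spamCounts.TryGetValue(token, out s);
            hamCounts.TryGetValue(token, out h);
            // Add-one (Laplace) smoothing keeps a zero count from wiping out a class.
            logSpam += Math.Log((s + 1.0) / (spamTokens + v));
            logHam += Math.Log((h + 1.0) / (hamTokens + v));
        }

        // Turn the two log scores back into a probability.
        return 1.0 / (1.0 + Math.Exp(logHam - logSpam));
    }
}
```

You call Train with examples you have already labeled as spam or not, then compare SpamProbability against a threshold of your choosing. The catch is the labeling: somebody has to keep feeding it good examples of both kinds, which is exactly why it feels like overkill for a site this size.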

It never ceases to amaze me how much spam there is in the Blogosphere. I sometimes get 200+ messages in my Yahoo mail spam folder, and only 10 or 20 real messages. Yahoo and Gmail are both doing a pretty good job -- I hardly ever find legitimate mail in the spam folder. On the other hand, I do often find one or two spams in the inbox, especially when there is a new genetic mutation of some spam formula the filters haven't learned yet. Do these people really believe I need a bigger member? That I really want to buy bogus drugs online and give them my credit card? Oh, and here's a killer: "N33d m0ney right now? One-hour payday loans" -- like their l33t-speak is really gonna make it past the spam filter, huh?

It's pitiful. It's childish, and it's harmful. It's indicative of what our societal psyche has become - don't do any real work, don't care about other people, and just try to make money fast any way you can -- and you and I are bearing the cost of it in inefficiency and increased bandwidth consumption. Spam isn't just an annoyance. It has a real cost, and guess who's footing the bill?