Probability rules

I have moved over to spamprobe for all my spam filtering needs. It’s an implementation of Paul Graham’s Bayesian spam detection algorithm, which detects spam based on word frequency analysis. It requires some training before it works well; you have to feed it a collection of a couple of hundred good messages and a couple of hundred spam messages so that it can build a table of how often each word shows up in spam versus good mail. Or, alternatively, you can train it over time and put up with false negatives and positives for a little while. But once you get those few hundred messages classified, you’re golden.
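To make the word frequency idea concrete, here is a minimal sketch in the spirit of Graham's essay. It is not spamprobe's actual code: the function names, the tokenizer, and the constants (doubling good counts, the 5-occurrence cutoff, 0.4 for unknown words, the 15 most telling words) are my simplified reading of the essay and should be treated as assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase runs of letters, digits, $, ' and -.
    return re.findall(r"[a-z0-9$'-]+", text.lower())


def train(good_msgs, spam_msgs):
    """Build a table mapping each word to an estimated spam probability."""
    good = Counter(t for m in good_msgs for t in tokenize(m))
    bad = Counter(t for m in spam_msgs for t in tokenize(m))
    ngood = max(len(good_msgs), 1)
    nbad = max(len(spam_msgs), 1)
    probs = {}
    for tok in set(good) | set(bad):
        g = 2 * good[tok]  # good counts doubled to bias against false positives
        b = bad[tok]
        if g + b < 5:
            continue  # too rare to say anything useful about
        pg = min(1.0, g / ngood)
        pb = min(1.0, b / nbad)
        # Clamp so that no single word is ever treated as absolute proof.
        probs[tok] = min(0.99, max(0.01, pb / (pg + pb)))
    return probs


def spam_score(text, probs, unknown=0.4, top_n=15):
    """Combine the most telling words of a message into one probability."""
    ps = [probs.get(t, unknown) for t in set(tokenize(text))]
    # The words farthest from 0.5 carry the most evidence either way.
    ps.sort(key=lambda p: abs(p - 0.5), reverse=True)
    prod_spam = prod_good = 1.0
    for p in ps[:top_n]:
        prod_spam *= p
        prod_good *= 1.0 - p
    return prod_spam / (prod_spam + prod_good)
```

Train it with two lists of message strings, then call spam_score on new mail and treat anything above a threshold of your choosing (say 0.9) as spam. A message with no words at all scores a neutral 0.5.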

The really cool thing is that it doesn’t depend on a viewable list of spam words, as SpamAssassin does. I used to use SpamAssassin, and over time more and more spam was getting through, because it’s easy for spammers to look at the list of keywords that comes with SpamAssassin and avoid those words. What’s more, Bayesian filters evolve along with the spammers: if a spammer tries a new approach, but spamprobe catches the message because they didn’t go far enough, the words used in the new approach get classified as potential spam for the next go-round.
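The "evolves along with the spammers" part can be sketched as a filter that updates its counts every time a message gets classified. This is a toy illustration, not how spamprobe stores its data; the class name, the whitespace tokenizer, and the 0.4 default for unseen words are all assumptions of mine.

```python
from collections import Counter


class IncrementalFilter:
    """Toy filter whose word statistics grow with every classified message."""

    def __init__(self):
        self.counts = {"good": Counter(), "spam": Counter()}
        self.messages = {"good": 0, "spam": 0}

    def learn(self, text, label):
        # label is "good" or "spam"; count each word once per message.
        self.counts[label].update(set(text.lower().split()))
        self.messages[label] += 1

    def word_prob(self, word):
        # Fraction of good and spam messages that contain the word.
        g = self.counts["good"][word] / max(self.messages["good"], 1)
        b = self.counts["spam"][word] / max(self.messages["spam"], 1)
        if g + b == 0:
            return 0.4  # never seen before: mildly innocent by default
        return min(0.99, max(0.01, b / (g + b)))


f = IncrementalFilter()
f.learn("lunch at noon", "good")
before = f.word_prob("v1agra")  # new spelling, not seen yet
f.learn("cheap v1agra now", "spam")  # the message got caught anyway
after = f.word_prob("v1agra")  # the new spelling now counts against the sender
```

Once a message with the new spelling has been caught on the strength of its other words, the spelling itself becomes evidence for the next round.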

The downside is that you can’t really do a global install; everyone needs to train their own filter. Well. You could, but you’d lose a bit of accuracy. I suppose it’d be interesting to see a fairly decent sized site try. I wonder how many people get email at Flit’s site? (Just kidding.)

There are a bunch more Bayesian filters listed here if spamprobe doesn’t suit you for some reason. I get a few hundred emails a day and spamprobe has been plenty fast enough so far, though.
