How to avoid s p lit up wor ds?

David Relson relson at osagesoftware.com
Fri Jan 17 22:23:18 CET 2003


At 04:19 PM 1/17/03, Chris Wilkes wrote:

>On Fri, Jan 17, 2003 at 04:06:24PM -0500, David Relson wrote:
> >
> > The bad news is that the html tags _do_ break up the words.
>
>I'm almost of the mind to send all HTML mail to a spam bin, and then
>tell BF to rate it non-spam if it gets some low BF value.
>
>However that still doesn't get around the problem of humans being able
>to read text that a program looking for text can't (the "ton er" =
>"ton<BR>er" = toner are the same case).
>
>I'm not sure how you can write a tokenizer to combine word fragments
>that should be combined together.

Why bother with the tags?  I can read "buy to ner car tri dg es", though
it's a bit of a pain.  Combining such fragments calls for an AI-type
algorithm...
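
To make the difficulty concrete, here is a minimal sketch of a naive, dictionary-based fragment joiner. It is not how bogofilter tokenizes; KNOWN_WORDS, join_fragments, and max_parts are hypothetical names made up for illustration. It turns each HTML tag into a space, then greedily glues adjacent fragments together whenever the result is a known word.

```python
import re

# Hypothetical vocabulary (an assumption for this sketch); a real filter
# would need a far larger word list.
KNOWN_WORDS = {"buy", "toner", "cartridges"}

# Matches anything that looks like an HTML tag, e.g. <BR> or <b>.
TAG_RE = re.compile(r"<[^>]*>")


def join_fragments(text, vocab=KNOWN_WORDS, max_parts=4):
    """Rejoin split-up words by greedy dictionary lookup.

    Tags are replaced by spaces, so "ton<BR>er" and "ton er" are
    handled the same way.
    """
    tokens = TAG_RE.sub(" ", text).lower().split()
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest run of fragments first, down to a pair.
        for n in range(min(max_parts, len(tokens) - i), 1, -1):
            candidate = "".join(tokens[i:i + n])
            if candidate in vocab:
                out.append(candidate)
                i += n
                break
        else:
            # No run starting here forms a known word; keep the token.
            out.append(tokens[i])
            i += 1
    return out


print(join_fragments("buy to ner car tri dg es"))
print(join_fragments("ton<BR>er"))
```

With the three-word vocabulary above, both calls recover "toner". With a realistic word list, the same greedy rule would also glue ordinary text together ("to get her" becomes "together"), which is the reason something smarter than a lookup would be needed.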




