How to avoid s p lit up wor ds?

Barry Gould BarryGould at PennySaverUSA.net
Sat Jan 18 03:32:50 CET 2003


At 01:23 PM 1/17/2003, David Relson wrote:

>Why bother with the tags?  I can read "buy to ner car tri dg es", though 
>it's a bit of a pain.  Combining such fragments calls for an AI type 
>algorithm...

In the message I received, it looked like
to<!--fred-->ner ...

The (html-aware) MUA ignores the html comments (they are not rendered to 
the screen, i.e. they are null).
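
As a rough illustration of why the comment case is mechanical rather than an AI problem, here is a minimal Python sketch that deletes HTML comments before tokenizing. The function name is hypothetical and this is not bogofilter code; a regex like this also ignores edge cases such as malformed or unterminated comments.

```python
import re

# Matches an HTML comment, including one that spans several lines.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)

def strip_html_comments(html):
    """Delete HTML comments so that words split by them are rejoined."""
    return COMMENT_RE.sub("", html)

# Prints: buy toner cartridges
print(strip_html_comments("buy to<!--fred-->ner car<!--x-->tri<!--y-->dges"))
```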

Note that this is different from
to ner
or
to<br>ner
both of which DO require some sort of AI to understand.
However, I have NOT seen email that actually looks like this.

IMHO, the best way for bogofilter to deal with this would be to convert the
message from HTML to plain text at some point. For messages that are HTML,
maybe run over the message twice: once as HTML and once as text. This is
what I am currently doing with base-64 messages, etc.
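
The two-pass idea could look something like the following Python sketch, which uses the standard library's html.parser. Everything here is an assumption for illustration (the class and function names, the short list of tags treated as breaks, and the crude word regex); it is not how bogofilter tokenizes.

```python
import re
from html.parser import HTMLParser

# Tags that a mail reader renders as a visible break (hypothetical, incomplete).
BREAKING_TAGS = {"br", "p", "div", "tr", "td", "li"}

# Crude word pattern, for illustration only.
WORD_RE = re.compile(r"[A-Za-z0-9]+")

class TextExtractor(HTMLParser):
    """Collect the text a reader would see; comments are dropped by default."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # A break tag separates words on screen, so keep them apart here too.
        if tag in BREAKING_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)

def tokens_for_both_passes(html):
    """Return (tokens from the raw HTML, tokens from the rendered text)."""
    return WORD_RE.findall(html), WORD_RE.findall(html_to_text(html))

raw, text = tokens_for_both_passes("buy to<!--fred-->ner")
print(raw)   # tokens seen in the HTML source
print(text)  # tokens seen after rendering to plain text
```

The raw pass still sees the fragments and the comment text, while the text pass sees the word the reader actually sees, so both can be scored.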

Another alternative that comes to mind would be to penalize the HTML comment
tag <!-- --> heavily, since it shows up only infrequently except in spam.
However, this would require modifying the way the statistics are computed, 
as 1.0 would not be high enough for such tags.

Unless each instance were to get counted! (multiplying the probabilities 
somehow I suppose.)
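
To make the "count each instance" idea concrete, here is a small Python sketch using a generic naive-Bayes style combination. This is not necessarily the formula bogofilter uses, and the function name and all the probabilities are invented for the example.

```python
from math import prod

def combine(probs):
    """Generic naive-Bayes combination of per-token spam probabilities."""
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)

# Hypothetical probabilities for the other tokens in a message.
other_tokens = [0.2, 0.3, 0.4]

# Hypothetical probability assigned to the HTML comment tag.
comment_p = 0.9

# Counting the comment tag once versus once per occurrence (four here).
counted_once = combine(other_tokens + [comment_p])
counted_each = combine(other_tokens + [comment_p] * 4)

print(counted_once, counted_each)
```

With these made-up numbers, a single count leaves the message below 0.5, while counting every occurrence pushes it well above, without needing a per-token value beyond 1.0.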

BTW, SpamAssassin already has some basic rules for what it calls "Gappy
Text", e.g.
B U Y  M E  N O W !
but this is of course different from
bu y me n ow
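
To show the difference, here is a hypothetical Python check for the single-letter kind of gappy text. The regex is my own guess at the idea, not SpamAssassin's actual rule.

```python
import re

# Three or more single letters, each separated by exactly one space.
GAPPY_RE = re.compile(r"\b(?:[A-Za-z] ){2,}[A-Za-z]\b")

def is_gappy(text):
    """True if the text contains a run of space-separated single letters."""
    return GAPPY_RE.search(text) is not None

print(is_gappy("B U Y  M E  N O W !"))  # True
print(is_gappy("bu y me n ow"))         # False
```

The second form slips through because the fragments are irregular, which is why it needs something smarter than a pattern match.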

Barry




