html processing [was: spaced out spam words]

Tom Anderson tanderso at oac-design.com
Mon Jun 12 18:47:31 CEST 2006


Mikhail Zabaluev wrote:
> On Sat, 10/06/2006 at 10:32 -0400, David Relson wrote:
> 
>>Bogofilter is _already_ giving special handling to "font" tags as well
>>as "a" and "img".
> 
> 
> Ah yes, I forgot.
> I'd like to expand this to all tags and attributes, and maybe special
> character entities as well.

Statistical filtering negates the need to provide special handling for 
any content.  If some particular string, be it font tags, special 
characters, etc., appears in spams more often than in hams, then it 
will be classified as spammy with training.  No special logic is 
required.  If some 
people are genuinely concerned about single characters contributing to 
their false negatives, even after thoroughly training on them, then I 
could see some value in providing a configuration variable to select 
string length.  I think multiple word tokens might be valuable as well, 
a la CRM114, though I probably wouldn't want to use more than two-word 
patterns.
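
To make the two-word idea concrete, here is a minimal Python sketch of a 
tokenizer that emits single words plus two-word pairs.  It is not how 
Bogofilter or CRM114 actually tokenize; the function name 
tokens_with_pairs, the min_len parameter (my reading of the 
string-length setting suggested above), and the "*" joiner are all 
made up for illustration.

```python
import re


def tokens_with_pairs(text, min_len=3):
    """Return single-word tokens followed by two-word tokens.

    Hypothetical helper for illustration only; not Bogofilter's lexer.
    """
    # Lowercase, split on anything that is not a letter or digit,
    # and drop words shorter than min_len.
    words = [w for w in re.findall(r"[a-z0-9]+", text.lower())
             if len(w) >= min_len]
    # Pair each surviving word with the next surviving word.
    pairs = [a + "*" + b for a, b in zip(words, words[1:])]
    return words + pairs


if __name__ == "__main__":
    for token in tokens_with_pairs("Buy cheap pills now"):
        print(token)
```

Each pair would be trained and scored exactly like any other token, so 
no extra classification logic is needed, only a larger wordlist.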

But procedural filtering logic really shouldn't be necessary, and to the 
extent that it might be marginally helpful, it should really be included 
in a separate prefilter that inserts tokens for Bogofilter to pick up 
on, not within the statistical filter itself.  This is actually 
something that I currently do to good effect with link tags.  If the 
href string and the content are both URLs but not the same URL, then I 
insert a SCAM-ADDRESS token.  I also do URIBL lookups on links and 
insert a SPAM-ADDRESS token for any that are contained in the 
blocklists.  Besides adding extra spammy tokens for a huge percentage of 
spams, it also provides a nice visual cue when reviewing false negatives 
and when I scan my filtered emails for false positives.  But, while this 
is useful functionality, I don't believe it need be a core feature of 
Bogofilter itself.  Similarly, you might want to build an HTML-rendering 
prefilter to identify hidden strings and add tokens to the message or 
delete the hiding feature.  For example, you could change a font color 
that is the same as the background to the inverse instead.  Only after 
such processing would you pipe the email through Bogofilter.
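
As a rough illustration of the link check described above, here is a 
minimal Python sketch.  It is not the code from stripsearch.pl: the 
names LinkChecker and add_scam_tokens are hypothetical, it compares 
only the host part of the two URLs (an assumption on my part), it 
works on an HTML body string rather than a full MIME message, and it 
leaves out the URIBL lookup entirely.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


def _host(text):
    """Return the lowercased host if text looks like a URL, else None."""
    text = text.strip()
    if "://" not in text:
        if not text.lower().startswith("www."):
            return None          # ordinary link text such as "click here"
        text = "http://" + text
    try:
        return (urlsplit(text).hostname or "").lower() or None
    except ValueError:           # malformed URL
        return None


class LinkChecker(HTMLParser):
    """Count <a> tags whose visible URL and href point at different hosts."""

    def __init__(self):
        super().__init__()
        self.href = None
        self.text = []
        self.mismatches = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.href = dict(attrs).get("href")
            self.text = []

    def handle_data(self, data):
        if self.href is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            shown = _host("".join(self.text))
            target = _host(self.href)
            if shown and target and shown != target:
                self.mismatches += 1
            self.href = None


def add_scam_tokens(html_body):
    """Append one SCAM-ADDRESS token per mismatched link."""
    checker = LinkChecker()
    checker.feed(html_body)
    checker.close()
    return html_body + "\nSCAM-ADDRESS" * checker.mismatches
```

The output of something like add_scam_tokens is what would then be 
piped to Bogofilter, which learns the weight of SCAM-ADDRESS from 
training like any other token.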

BTW, the prefilter I mentioned above is "stripsearch.pl" and should be 
included in your bogofilter contrib directory.

Tom



