html processing [was: spaced out spam words]
Tom Anderson
tanderso at oac-design.com
Mon Jun 12 18:47:31 CEST 2006
Mikhail Zabaluev wrote:
> On Sat, 10/06/2006 at 10:32 -0400, David Relson wrote:
>
>>Bogofilter is _already_ giving special handling to "font" tags as well
>>as "a" and "img".
>
>
> Ah yes, I forgot.
> I'd like to expand this to all tags and attributes, and maybe special
> character entities as well.
Statistical filtering negates the need to provide special handling for
any content. If some particular string, be it font tags, special
characters, etc., appears in spams more than hams, then it will be
classified as spammy with training. No special logic required. If some
people are genuinely concerned about single characters contributing to
their false negatives, even after thoroughly training on them, then I
could see some value in providing a configuration variable to select
string length. I think multiple word tokens might be valuable as well,
a la CRM114, though I probably wouldn't want to use more than two-word
patterns.
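
To make the two-word idea concrete, here is a rough Python sketch. It is not how Bogofilter or CRM114 actually tokenize; the function name and the word pattern are assumptions chosen only for illustration. It emits every single word plus every adjacent pair, so a phrase can earn its own spam or ham count during training.

```python
import re

def tokens_with_pairs(text):
    """Return single-word tokens followed by adjacent two-word tokens.

    Hypothetical helper: the word pattern below is a simplification,
    not Bogofilter's real lexer.
    """
    words = re.findall(r"[A-Za-z0-9$'-]+", text.lower())
    # Pair each word with its immediate successor; longer patterns are
    # deliberately left out, since two-word patterns are the stated limit.
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + pairs
```

The pair tokens would simply be counted alongside the single words, so the statistical side needs no extra logic; the cost is a larger word list.
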
But procedural filtering logic really shouldn't be necessary, and to the
extent that it might be marginally helpful, it should really be included
in a separate prefilter that inserts tokens for Bogofilter to pick up
on, not within the statistical filter itself. This is actually
something that I currently do to good effect with link tags. If the
href string and the content are both URLs but not the same URL, then I
insert a SCAM-ADDRESS token. I also do URIBL lookups on links and
insert a SPAM-ADDRESS token for any that are contained in the
blocklists. Besides adding extra spammy tokens for a huge percentage of
spams, it also provides a nice visual cue when reviewing false negatives
and when I scan my filtered emails for false positives. But, while this
is useful functionality, I don't believe it need be a core feature of
Bogofilter itself. Similarly, you might want to build an HTML-rendering
prefilter to identify hidden strings and add tokens to the message or
delete the hiding feature. For example, you could change a font color
that is the same as the background to the inverse instead. Only after
such processing would you pipe the email through Bogofilter.
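
The link check described above could be sketched roughly as follows in Python. This is not the code in stripsearch.pl; the class and function names are made up for illustration, and comparing only host names is a simplifying assumption standing in for "not the same URL". The URIBL lookup is left out because it needs network access.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

class LinkCollector(HTMLParser):
    """Collect (href, visible text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

def looks_like_url(text):
    # Assumption: a crude test, good enough for a sketch.
    return text.lower().startswith(("http://", "https://", "www."))

def host(url):
    if "://" not in url:
        url = "http://" + url
    return (urlsplit(url).hostname or "").lower()

def add_scam_tokens(html_body):
    """Append one SCAM-ADDRESS token per link whose text names another host."""
    parser = LinkCollector()
    parser.feed(html_body)
    mismatched = [
        (href, text) for href, text in parser.links
        if looks_like_url(href) and looks_like_url(text)
        and host(href) != host(text)
    ]
    return html_body + "\nSCAM-ADDRESS" * len(mismatched)
```

The output of add_scam_tokens would then be what gets piped through Bogofilter, so the inserted token is trained on like any other word.
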
BTW, the prefilter I mentioned above is "stripsearch.pl" and should be
included in your bogofilter contrib directory.
Tom