mass processing with mutt and Fcc
David Relson
relson at osagesoftware.com
Tue Apr 1 22:08:04 CEST 2003
At 02:54 PM 4/1/03, Michael Kenneth Ter Louw wrote:
>On Tue, 1 Apr 2003, Boris 'pi' Piwinger wrote:
>
> > > At the present time, when processing html, bogofilter discards html
> > > comments, valid html tags (and their innards), and invalid html tags
> > > (and their innards). Basically everything between angle brackets is
> > > being ignored at this time.
> > >
> > > The rationale is that many tokens within html tags are not worth
> > > scoring as spam indicators.
> >
> > I see. I thought that the use of html would be useful (I
> > remember the early versions of bogofilter said so). Also web
> > addresses as in links or img elements might be useful.
>
>Graham mentions the use of HTML tags in his article:
>
>"In fact, "ff0000" (html for bright red) turns out to be as good an
>indicator of spam as any pornographic term."
>
>I don't know if analyzing *all* the HTML tags would be worth the benefit
>offered by this single case. Just thought I'd throw it out there.
>
>Mike
Actually it takes extra work to recognize html tags (and comments) and
throw them away. When processing normal text, pretty much all that's kept
is letters and digits and a few special characters like period, hyphen,
underscore, apostrophe, etc. It's trivially easy to apply the normal text
mode to html.
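To make the difference concrete, here is a rough sketch in Python. It is not bogofilter's actual lexer; the character set and the "everything between angle brackets" rule are simplifying assumptions taken from the description above, and the function names are made up for illustration.

```python
import re

# Assumption: a token is a run of letters and digits, plus a few special
# characters (period, hyphen, underscore, apostrophe) after the first one.
TOKEN_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9._'-]*")

# Assumption: anything between angle brackets is a tag or a comment.
TAG_RE = re.compile(r"<[^>]*>")


def plain_tokens(text):
    """Normal text mode: keep every token, including those inside tags."""
    return TOKEN_RE.findall(text)


def html_tokens(text):
    """Special html mode: drop tags and comments first, then tokenize."""
    return plain_tokens(TAG_RE.sub(" ", text))


sample = '<font color="#ff0000">Buy now</font>'
print(plain_tokens(sample))  # ['font', 'color', 'ff0000', 'Buy', 'now', 'font']
print(html_tokens(sample))   # ['Buy', 'now']
```

With normal text mode applied to html, tokens such as "ff0000" survive and can be scored; with the special html processing they are discarded along with the rest of the tag.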
If someone will suggest a good name for a config file option, I can
implement it and people can experiment with turning off special html
processing.
David