mass processing with mutt and Fcc

David Relson relson at
Tue Apr 1 22:08:04 CEST 2003

At 02:54 PM 4/1/03, Michael Kenneth Ter Louw wrote:

>On Tue, 1 Apr 2003, Boris 'pi' Piwinger wrote:
> > > At the present time, when processing html, bogofilter discards html
> > > comments, valid html tags (and their innards), and invalid html tags
> > > (and their innards).  Basically everything between angle brackets is
> > > being ignored at this time.
> > >
> > > The rationale is that many tokens within html tags are not worth
> > > scoring as spam indicators.
> >
> > I see. I thought that the use of html would be useful (I
> > remember the early versions of bogofilter said so). Also web
> > addresses as in links or img elements might be useful.
>Graham mentions the use of HTML tags in his article:
>"In fact, "ff0000" (html for bright red) turns out to be as good an
>indicator of spam as any pornographic term."
>I don't know if analyzing *all* the HTML tags would be worth the benefit
>offered by this single case.  Just thought I'd throw it out there.

Actually it takes extra work to recognize html tags (and comments) and 
throw them away.  When processing normal text, pretty much all that's kept 
is letters and digits and a few special characters like period, hyphen, 
underscore, apostrophe, etc.  It's trivially easy to apply the normal text 
mode to html.
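
To make the difference concrete, here is a rough sketch in Python.  It is 
not bogofilter's actual lexer (which is written in C); the function names, 
the strip_tags flag, and the exact character set are assumptions made up 
for illustration, and the tag pattern is a crude approximation of 
"everything between angle brackets".

```python
import re

# Assumed token pattern: runs of letters, digits, and a few special
# characters (period, hyphen, underscore, apostrophe). Illustration only.
TOKEN_RE = re.compile(r"[A-Za-z0-9._'-]+")

# Crude stand-in for "everything between angle brackets" (tags and comments).
TAG_RE = re.compile(r"<[^>]*>")


def tokenize_plain(text):
    """Normal text mode: keep only runs of the allowed characters."""
    return TOKEN_RE.findall(text)


def tokenize_html(text, strip_tags=True):
    """Hypothetical html mode: optionally discard tags before tokenizing."""
    if strip_tags:
        # The extra work: recognize the tags and throw them away.
        text = TAG_RE.sub(" ", text)
    return tokenize_plain(text)


sample = '<font color="ff0000">Buy now</font>'
print(tokenize_html(sample))                    # special html handling on
print(tokenize_html(sample, strip_tags=False))  # normal text mode on html
```

With the special handling on, only the words of the message body survive.  
With it off, the tag contents, including "ff0000", come through as 
ordinary tokens.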

If someone will suggest a good name for a config file option, I can 
implement it and people can experiment with turning off special html 

