mass processing with mutt and Fcc

David Relson relson at osagesoftware.com
Tue Apr 1 22:08:04 CEST 2003


At 02:54 PM 4/1/03, Michael Kenneth Ter Louw wrote:



>On Tue, 1 Apr 2003, Boris 'pi' Piwinger wrote:
>
> > > At the present time, when processing html, bogofilter discards html
> > > comments, valid html tags (and their innards), and invalid html tags
> > > (and their innards).  Basically everything between angle brackets is
> > > being ignored at this time.
> > >
> > > The rationale is that many tokens within html tags are not worth
> > > scoring as spam indicators.
> >
> > I see. I thought that the use of html would be useful (I
> > remember the early versions of bogofilter said so). Also web
> > addresses as in links or img elements might be useful.
>
>Graham mentions the use of HTML tags in his article:
>
>"In fact, "ff0000" (html for bright red) turns out to be as good an
>indicator of spam as any pornographic term."
>
>I don't know if analyzing *all* the HTML tags would be worth the benefit
>offered by this single case.  Just thought I'd throw it out there.
>
>Mike

Actually it takes extra work to recognize html tags (and comments) and 
throw them away.  When processing normal text, pretty much all that's kept 
is letters and digits and a few special characters like period, hyphen, 
underscore, apostrophe, etc.  It's trivially easy to apply the normal text 
mode to html.
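
To make the difference concrete, here is a rough Python sketch of the two
modes. It is not bogofilter's lexer (which is not written in Python); the
function names and both regular expressions are assumptions made up for
illustration, and the character set is only the one listed above (letters,
digits, period, hyphen, underscore, apostrophe). The tag pattern is also
simplistic: it treats anything between "<" and the next ">" as markup, so a
comment containing ">" would not be handled correctly.

```python
import re

# Hypothetical approximation of the "normal text" rule: a token starts with a
# letter or digit and may continue with period, underscore, apostrophe, hyphen.
TOKEN_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9._'-]*")

# Hypothetical stand-in for the html handling: anything between angle brackets.
ANGLE_RE = re.compile(r"<[^>]*>")


def tokenize_plain(text):
    """Apply the normal-text rule to everything, markup included."""
    return TOKEN_RE.findall(text)


def tokenize_html(text):
    """Discard angle-bracketed spans first, then apply the normal-text rule."""
    return tokenize_plain(ANGLE_RE.sub(" ", text))


sample = '<font color="#ff0000">Buy now</font>'
print(tokenize_html(sample))   # ['Buy', 'now']
print(tokenize_plain(sample))  # ['font', 'color', 'ff0000', 'Buy', 'now', 'font']
```

With the tags discarded, "ff0000" never reaches the scorer; with the normal
text mode applied to the raw html, it shows up as an ordinary token, along
with tag and attribute names such as "font" and "color".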

If someone will suggest a good name for a config file option, I can 
implement it and people can experiment with turning off special html 
processing.

David