mass processing with mutt and Fcc

David Relson relson at osagesoftware.com
Wed Apr 2 18:58:34 CEST 2003


At 11:49 AM 4/2/03, Jesse Meyer wrote:

>On Tue, Apr 01, 2003 at 10:23:32PM +0200, Boris 'pi' Piwinger wrote:
> > David Relson <relson at osagesoftware.com> wrote:
> >
> > >Actually it takes extra work to recognize html tags (and comments) and
> > >throw them away.  When processing normal text, pretty much all that's
> > >kept is letters and digits and a few special characters like period,
> > >hyphen, underscore, apostrophe, etc.  It's trivially easy to apply the
> > >normal text mode to html.
> >
> > The problem is that we need HTML processing to avoid the
> > spammers' tricks with tags in the middle of words. So it
> > would be nice to do that and also evaluate the content of
> > the tags.
>
>Wouldn't it be rather easy (although probably not very elegant) to
>make a short script that runs any html message through lynx -dump
>first, then gives it to bogofilter to analyse, and, if that succeeds,
>then passes the original message through?

Jesse,

As an experiment, you could try that with several messages and compare 
results.  Something like the following might do it:

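for M in msg* ; do
    # score the raw message
    bogofilter -v < "$M"
    # score lynx's text rendering (lynx needs -stdin to read redirected input)
    lynx -dump -force_html -stdin < "$M" | bogofilter -v
done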

It'd be interesting to hear how much the spam scores differ...
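
To make the comparison easier to read, the two scores could be printed side by side for each message.  What follows is only a rough sketch: it assumes that "bogofilter -v" prints a line containing "spamicity=<number>", and that lynx accepts -force_html and -stdin.  The helper names (spamicity_of, score_diff, compare_msg) are made up for illustration and are not part of bogofilter.

```shell
#!/bin/sh
# Sketch only: compare bogofilter scores for raw vs. lynx-rendered messages.

# Hypothetical helper: extract the first spamicity=... value from stdin.
# Assumes bogofilter -v output contains "spamicity=<number>".
spamicity_of() {
    sed -n 's/.*spamicity=\([0-9.][0-9.]*\).*/\1/p' | head -n 1
}

# Hypothetical helper: print (first score - second score) with six decimals.
score_diff() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.6f\n", a - b }'
}

# Hypothetical helper: score one message both ways and print one summary line.
compare_msg() {
    raw=$(bogofilter -v < "$1" | spamicity_of)
    txt=$(lynx -dump -force_html -stdin < "$1" | bogofilter -v | spamicity_of)
    printf '%s raw=%s lynx=%s diff=%s\n' "$1" "$raw" "$txt" "$(score_diff "$raw" "$txt")"
}

# Process every message file named on the command line.
for M in "$@"; do
    compare_msg "$M"
done
```

Saved as, say, cmpscore.sh, it could be run as "sh cmpscore.sh msg*".  A positive diff would mean the raw message scored as more spam-like than the lynx rendering.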

David




