html comment processing

David Relson relson at osagesoftware.com
Sun Mar 30 16:12:55 CEST 2003


At 07:58 AM 3/30/03, Greg Louis wrote:

>On 20030329 (Sat) at 2057:50 -0500, David Relson wrote:
> >
> > For the html purists, I propose to add a config file option named
> > "strict_comment".  A value of "true" will cause bogofilter to follow the
> > standard and a value of "false" will work as described above.  The default
> > value will be "false".
> >
>It would be comforting to know how well the loose interpretation works
>before releasing it, IMHO.  That is, to run some actual experimentation
>and make sure there's an improvement.  I've got two more s/mindev scans
>going at present, but after that I can find clock cycles for this
>purpose.  We'd want test corpora that can tell us two things:
>
>1.  Does loose comment processing catch significantly more spam?
>2.  Does loose comment processing introduce more risk of missing valid
>     tokens?  (I don't see why it should, but data are better than
>     theories without data.)

Greg,

We know that loose processing has some benefits.

1. It fixes a problem in registering mailboxes when an improperly 
terminated comment causes bogofilter to miss a "^From " delimiter.  This 
problem _could_ be corrected by an additional check (or checks) for "^From ".

2. It makes a big difference in whether the following text produces words or 
fragments thereof:
Please vis<! FF3FFi?FS$s0,sz>it our web<! FF3FFi?FS$s0,sz>si<! FF3FFi?FS$s0,sz>te

3. Using ">" to terminate an html comment, rather than "-->", is what 
bogofilter 0.10.x did, and that version was deemed to work well.
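
To make the difference between the two modes concrete, here is a rough Python sketch. It is not bogofilter's lexer; the function name strip_comments and its strict flag are invented for illustration and only mirror the proposed "strict_comment" option. It assumes that loose mode treats anything from "<!" up to the first ">" as a comment, while strict mode only recognizes "<!--" ... "-->".

```python
import re

# Hypothetical helper for illustration only; not bogofilter's actual code.
def strip_comments(text: str, strict: bool = False) -> str:
    """Remove html comments from text.

    strict=True : a comment opens with "<!--" and ends only at "-->".
    strict=False: a comment opens with "<!" and ends at the first ">".
    An unterminated comment is left in place in both modes.
    """
    if strict:
        pattern = r"<!--.*?-->"
    else:
        pattern = r"<![^>]*>"
    # DOTALL lets a strict comment span line breaks.
    return re.sub(pattern, "", text, flags=re.DOTALL)

# The spam sample from item 2, joined onto one line.
sample = ("Please vis<! FF3FFi?FS$s0,sz>it our "
          "web<! FF3FFi?FS$s0,sz>si<! FF3FFi?FS$s0,sz>te")

loose_out = strip_comments(sample, strict=False)
strict_out = strip_comments(sample, strict=True)

print(loose_out)             # Please visit our website
print(strict_out == sample)  # True: strict mode finds no "<!--" to remove
```

With loose processing the words "visit" and "website" come out whole. With strict processing the "<! ...>" runs are not comments at all, so the tokenizer is left with the fragments around them.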

Of course, an experiment would be of value.  Can you crank up your fast 
machine?

David




