html comment processing
David Relson
relson at osagesoftware.com
Sun Mar 30 16:12:55 CEST 2003
At 07:58 AM 3/30/03, Greg Louis wrote:
>On 20030329 (Sat) at 2057:50 -0500, David Relson wrote:
> >
> > For the html purists, I propose to add a config file option named
> > "strict_comment". A value of "true" will cause bogofilter to follow the
> > standard and a value of "false" will work as described above. The default
> > value will be "false".
> >
>It would be comforting to know how well the loose interpretation works
>before releasing it, IMHO. That is, to run some actual experimentation
>and make sure there's an improvement. I've got two more s/mindev scans
>going at present, but after that I can find clock cycles for this
>purpose. We'd want test corpora that can tell us two things:
>
>1. Does loose comment processing catch significantly more spam?
>2. Does loose comment processing introduce more risk of missing valid
> tokens? (I don't see why it should, but data are better than
> theories without data.)
Greg,
We know that loose processing has some benefits.
1. It fixes a problem in registering mailboxes when an improperly
terminated comment causes bogofilter to miss a "^From " delimiter. This
problem _could_ be corrected by an additional check (or checks) for "^From ".
2. It makes a big difference in whether the following text produces words or
fragments thereof:
Please vis<! FF3FFi?FS$s0,sz>it our web<! FF3FFi?FS$s0,sz>si<!
FF3FFi?FS$s0,sz>te
3. Using ">" to terminate an html comment, rather than "-->", is what
bogofilter 0.10.x did, and that version was deemed to work well.
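
To make point 2 concrete, here is a minimal sketch in Python. It is not bogofilter's actual lexer code; the names strip_comments, LOOSE_COMMENT and STRICT_COMMENT are made up for illustration, and the "strict" pattern is a simplification that only looks for "<!--" ... "-->" rather than modelling the full comment rules of the standard.

```python
import re

# Loose rule (assumed for illustration): "<!" opens a comment and the
# first ">" closes it, even across a line break.
LOOSE_COMMENT = re.compile(r"<!.*?>", re.DOTALL)

# Strict rule (simplified): only "<!--" opens and only "-->" closes.
STRICT_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_comments(text, strict=False):
    """Remove comments from text using the strict or the loose rule."""
    pattern = STRICT_COMMENT if strict else LOOSE_COMMENT
    return pattern.sub("", text)


# The sample text from point 2, including its line break.
sample = ("Please vis<! FF3FFi?FS$s0,sz>it our web<! FF3FFi?FS$s0,sz>si<!\n"
          "FF3FFi?FS$s0,sz>te")

loose = strip_comments(sample)
strict = strip_comments(sample, strict=True)

print(loose)   # the word fragments are rejoined into whole words
print(strict)  # nothing matches "<!--", so the text is left unchanged
```

Under the loose rule the three bogus comments disappear and the tokenizer would see "visit" and "website" as whole words. Under the simplified strict rule nothing is removed, so the text is still broken up by the "<! ...>" runs.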
Of course, an experiment would be of value. Can you crank up your fast
machine?
David