MAXTOKENLEN [was: StudlyCaps]
relson at osagesoftware.com
Fri Jul 9 14:29:40 CEST 2004
On 09 Jul 2004 08:10:15 -0400
Tom Anderson wrote:
> On Fri, 2004-07-09 at 04:12, Andreas Pardeike wrote:
> > On 2004-07-09, at 08.58, Peter Bishop wrote:
> > > Sure there are real words like that too
> > > but if these are split consistently by bogofilter then
> > > Mc and Donald
> > > would be stored instead, so might be recognised OK
> > >
> > > - even better when token pairs/sequences
> > > are looked for in later versions of bogofilter
> > Then what happens tO tExT LiKe tHiS?
> I'd imagine it'd be ignored completely since it doesn't meet the
> minimum token length. This isn't actually a terrible idea since it's
> not very readable text anyway, and there should be sufficient other
> tokens to make the message spammy. However, perhaps bogofilter could
> score both ways... with and without breaking on the case changes. But
> now we're getting more complicated.
Here are my thoughts on MAXTOKENLEN.
Breaking MixedCaseStuFF into separate tokens goes beyond
bogofilter's charter and is a bad idea.
Two reasonable approaches are:
  - truncate long tokens to MAXTOKENLEN
  - convert long tokens to MAXTOKENLEN+delta
Alternatively, bogofilter can ignore long tokens, as it does now.
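To make the difference between ignoring and truncating concrete, here is a
rough sketch in Python. It is not bogofilter's code (bogofilter is written
in C). The function names and the MAXTOKENLEN value of 30 are assumptions
made for this example only. The MAXTOKENLEN+delta variant is left out,
since its details are not spelled out above.

```python
# Illustrative sketch only; not taken from bogofilter's lexer.
MAXTOKENLEN = 30  # assumed value, chosen for the example


def ignore_long(tokens, max_len=MAXTOKENLEN):
    """Drop every token longer than max_len (the 'ignore' behaviour)."""
    return [t for t in tokens if len(t) <= max_len]


def truncate_long(tokens, max_len=MAXTOKENLEN):
    """Keep every token, cutting long ones to their first max_len characters."""
    return [t[:max_len] for t in tokens]


if __name__ == "__main__":
    sample = ["mortgage", "x" * 40]
    print(ignore_long(sample))    # the 40-character token is dropped
    print(truncate_long(sample))  # the 40-character token is cut to 30
```

With truncation, a long token still contributes something to the score,
but distinct long tokens that share their first MAXTOKENLEN characters
become indistinguishable.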