MAXTOKENLEN [was: StudlyCaps]

Fri Jul 9 14:29:40 CEST 2004

On 09 Jul 2004 08:10:15 -0400
Tom Anderson wrote:

> On Fri, 2004-07-09 at 04:12, Andreas Pardeike wrote:
> > On 2004-07-09, at 08.58, Peter Bishop wrote:
> > 
> > > Sure there are real words like that too
> > > but if these are split consistently by bogofilter then
> > > Mc and Donald
> > > would be stored instead, so might be recognised OK
> > >
> > > - even better when token pairs/sequences
> > > are looked for in later versions of bogofilter
> > 
> > Then what happens tO tExT LiKe tHiS?
> 
> I'd imagine it'd be ignored completely since it doesn't meet the
> minimum token length.  This isn't actually a terrible idea since it's
> not very readable text anyway, and there should be sufficient other
> tokens to make the message spammy.  However, perhaps bogofilter could
> score both ways... with and without breaking on the case changes.  But
> now we're getting more complicated.
> 
> Tom

Here are my thoughts on MAXTOKENLEN.

Breaking MixedCaseStuFF into separate tokens is beyond bogofilter's
charter and is a bad idea.

Two reasonable approaches are:

  truncate long tokens to MAXTOKENLEN
  convert long tokens to MAXTOKENLEN+delta

Alternatively, bogofilter can ignore long tokens, as it does now.
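
To make the difference concrete, here is a minimal sketch of truncation
next to the current ignore behavior. It is not bogofilter's actual lexer
code: the function handle_token(), the policy enum, and the value 30 for
MAXTOKENLEN are all assumptions made for illustration. The
MAXTOKENLEN+delta conversion is left out because its exact form is not
spelled out above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical limit for illustration; not necessarily bogofilter's value. */
#define MAXTOKENLEN 30

/* Hypothetical policies for tokens longer than MAXTOKENLEN. */
enum long_token_policy { POLICY_IGNORE, POLICY_TRUNCATE };

/*
 * Copy tok into out (which must hold MAXTOKENLEN + 1 bytes) according to
 * policy. Returns the stored length, or 0 if the token was dropped.
 */
static size_t handle_token(const char *tok, enum long_token_policy policy,
                           char out[MAXTOKENLEN + 1])
{
    size_t len;

    assert(tok != NULL && out != NULL);
    len = strlen(tok);

    if (len > MAXTOKENLEN) {
        if (policy == POLICY_IGNORE) {
            out[0] = '\0';      /* drop the token entirely */
            return 0;
        }
        len = MAXTOKENLEN;      /* POLICY_TRUNCATE: keep the leading bytes */
    }

    memcpy(out, tok, len);
    out[len] = '\0';
    return len;
}
```

With these assumptions, a token such as MixedCaseStuFF (14 characters)
passes through unchanged under either policy; only tokens longer than
MAXTOKENLEN are affected. Under truncation, all long tokens sharing the
same first MAXTOKENLEN characters collapse into a single token.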

Regards,

David