case folding [was: tuning]

Wed May 7 17:09:56 CEST 2003

At 08:52 07.05.2003 +1000, michael at optusnet.com.au wrote:
->David Relson <relson at osagesoftware.com> writes:
->> Case folding saves on database size.  Without it, you might have
->> "money", "Money", "MONEY", "monEy", ...
->
->Indeed. The other point that's frequently missed is that there's
->a tradeoff between accuracy and training duration.
->
->The higher the accuracy you want, the more clues you look at
->(i.e. using the case as well as the spelling), the more data you need
->for bogofilter to sensibly understand it. This in turn means that you
->need a larger email/spam corpus to train on which will drive a larger
->database.

Thanks to both of you.
The database size issue is obvious, given the same number of
messages to train the databases.
As for the spelling, there's no sensible way to keep it out of
the databases, willingly or not :)
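
To make the size argument concrete, here is a minimal sketch that counts
distinct tokens with and without case folding. It is not bogofilter's
tokenizer: the regular expression, the function name token_counts and the
four sample messages are all made up for illustration.

```python
import re
from collections import Counter

# Assumption: a token is simply a run of ASCII letters.
# bogofilter's real lexer is more elaborate than this.
TOKEN_RE = re.compile(r"[A-Za-z]+")

def token_counts(messages, fold_case):
    """Count tokens over all messages, optionally lower-casing them first."""
    counts = Counter()
    for text in messages:
        for token in TOKEN_RE.findall(text):
            counts[token.lower() if fold_case else token] += 1
    return counts

# Made-up sample messages, only to show the effect.
messages = [
    "Make money fast",
    "Money back guarantee",
    "MONEY MONEY MONEY",
    "Send monEy now",
]

sensitive = token_counts(messages, fold_case=False)
folded = token_counts(messages, fold_case=True)

# Number of distinct database entries in each mode.
print(len(sensitive), len(folded))
# With folding, every spelling of "money" feeds one shared counter.
print(folded["money"])
```

On this toy input the case-sensitive table holds four separate entries for
"money", each with a small count, while the folded table holds a single
entry with all the evidence pooled.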

As for the tradeoff between database size and accuracy, I'd like to
test that... I'd expect there to be a turnover point beyond which
accuracy improves with case sensitivity. The question is whether
this turnover point comes early enough, relative to database size,
for there to be a gain. Sorry for my English, these are probably
totally inaccurate terms, but I believe you know what I mean.
Since my databases are too small to yield anything representative,
does anyone know of resources on the web for training Bayesian
filters in general? Of course, spam/nonspam corpora would be best.
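
One way such a test could be set up, as a rough sketch only: train a small
naive Bayes classifier on growing prefixes of a labelled corpus, once with
and once without case folding, and compare accuracy on a held-out set. This
uses a textbook add-one-smoothed naive Bayes, not bogofilter's actual
scoring, and all names (tokenize, train, classify_spam, learning_curve) are
hypothetical.

```python
import math
import re
from collections import Counter

# Assumption: tokens are runs of ASCII letters, not bogofilter's lexer.
TOKEN_RE = re.compile(r"[A-Za-z]+")

def tokenize(text, fold_case):
    tokens = TOKEN_RE.findall(text)
    return [t.lower() for t in tokens] if fold_case else tokens

def train(labelled, fold_case):
    """labelled is a list of (text, spam) pairs, spam being True or False."""
    counts = {True: Counter(), False: Counter()}
    n_msgs = {True: 0, False: 0}
    for text, spam in labelled:
        counts[spam].update(tokenize(text, fold_case))
        n_msgs[spam] += 1
    return counts, n_msgs

def classify_spam(text, model, fold_case):
    """Return True if the spam score beats the non-spam score."""
    counts, n_msgs = model
    # Vocabulary size plus one slot for unseen tokens.
    vocab = len(set(counts[True]) | set(counts[False])) + 1
    scores = {}
    for label in (True, False):
        total = sum(counts[label].values())
        # Add-one smoothing on the class prior and on token likelihoods.
        score = math.log((n_msgs[label] + 1) / (n_msgs[True] + n_msgs[False] + 2))
        for token in tokenize(text, fold_case):
            score += math.log((counts[label][token] + 1) / (total + vocab))
        scores[label] = score
    return scores[True] > scores[False]

def learning_curve(train_set, test_set, sizes, fold_case):
    """Accuracy on test_set after training on the first n messages, per n."""
    results = []
    for n in sizes:
        model = train(train_set[:n], fold_case)
        hits = sum(classify_spam(text, model, fold_case) == spam
                   for text, spam in test_set)
        results.append((n, hits / len(test_set)))
    return results
```

Calling learning_curve twice on the same shuffled corpus, with fold_case
set to True and then False, gives two curves over the same sizes. If a
turnover point exists for that corpus, it would show up as the size at
which the case-sensitive curve overtakes the folded one.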

Greetings, jo