case folding [was: tuning ]

Thu May 8 18:13:28 CEST 2003

At 19:12 07.05.2003 +0200, T'aZ wrote the following to me:
->On Wed, 07 May 2003 17:09:56 +0200
->Joerg Over <over at dexia.de> wrote:
->> Since my databases are too small to bear anything representative,
->> does anyone know about resources on the web to generally train
->> bayesfilters? Of course, spam/nonspam resources would be best.
->
->you can find tons of spam on http://www.spamarchive.org

Thx! I fetched some of them and tested.
These are the results with respect to database size and token count.
Each percentage is the value without case mangling divided by the
value with case mangling.

./1.gz	228965 words, 1642 messages
	database file size: 117,31%	token count: 119,33%
./11.gz	115625 words, 1210 messages
	database file size: 117,17%	token count: 118,53%
./21.gz	468924 words, 3930 messages
	database file size: 116,24%	token count: 118,17%
./31.gz	115370 words, 1467 messages
	database file size: 115,45%	token count: 116,76%
./41.gz	217069 words, 2503 messages
	database file size: 117,78%	token count: 118,00%
./51.gz	 99662 words, 1086 messages
	database file size: 116,33%	token count: 117,76%

All of those together:
1245615 words, 11838 messages, 
filesize = 2859008 / 2539520 =~ 112,58%
tokens   = 94372 / 82755     =~ 114,04%
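
For anyone who wants to recheck the arithmetic, here is a minimal
sketch that recomputes the two combined ratios from the figures
above. The function name is made up for illustration, and it assumes
the first number of each pair is the run without case mangling.

```python
# Figures copied from the combined run above (units as reported there).
FILESIZE_WITHOUT, FILESIZE_WITH = 2859008, 2539520
TOKENS_WITHOUT, TOKENS_WITH = 94372, 82755


def ratio_percent(without_mangling, with_mangling):
    """Return without/with as a percentage (hypothetical helper)."""
    return 100.0 * without_mangling / with_mangling


print(f"filesize =~ {ratio_percent(FILESIZE_WITHOUT, FILESIZE_WITH):.2f}%")
print(f"tokens   =~ {ratio_percent(TOKENS_WITHOUT, TOKENS_WITH):.2f}%")
```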

So, database size and token count don't increase a lot; indeed a
lot less than I expected.
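
To illustrate why the token count grows at all when case mangling is
off, here is a small sketch. It is not the filter's real lexer: the
regular expression, the function name and the sample text are all
assumptions made up for this example.

```python
import re


def tokenize(text, fold_case):
    """Split text into alphabetic tokens (toy lexer, not the real one).

    With fold_case=True every token is lower-cased, so "FREE", "Free"
    and "free" all end up as the same database entry.
    """
    words = re.findall(r"[A-Za-z]+", text)
    return [w.lower() for w in words] if fold_case else words


sample = "FREE offer! Get your free Offer now. Free FREE free"

folded = set(tokenize(sample, fold_case=True))
unfolded = set(tokenize(sample, fold_case=False))

print("distinct tokens with case mangling:   ", len(folded))
print("distinct tokens without case mangling:", len(unfolded))
```

On real mail most words show up in only one or two case variants,
which would be consistent with the modest growth measured above.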

Now for the accuracy. This is the hard part. I tested the big
databases with and without case mangling against my little spam
collection which was -not- part of the generated spam databases.
Results with -g were identical.
Results with -r show an increase in spamicity of between 5% and
20%.
There's 1 false negative (fn) with mangling, 0 fn without.
Results with -f show - well, I'm at war with fisher-graham.
With case mangling I get 30 fn among my 33 spam mails.
Without case mangling I get 2 fn among my 33 spam mails.
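
Expressed as rates, the -f counts look like this. A minimal sketch;
the helper name is made up, and with only 33 mails the percentages
are very coarse.

```python
def false_negative_rate(missed, total_spam):
    """Fraction of spam mails that were not flagged (hypothetical helper)."""
    return missed / total_spam


TOTAL_SPAM = 33  # size of the small test collection mentioned above

for label, missed in [("-f with case mangling", 30),
                      ("-f without case mangling", 2)]:
    rate = false_negative_rate(missed, TOTAL_SPAM)
    print(f"{label}: {missed}/{TOTAL_SPAM} fn = {rate:.1%}")
```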

Caveats:
- I'm not sure I changed the sources in the right places, but I
checked that the databases contain mixed-case tokens, and I checked
the output of -vv for tokens in different cases.

- I have a heavily biased database with only my small goodlist.

- I have not done enough testing with NotSpam yet. While I believe
the online spam database is sufficient to check database size and
token count, I'm not convinced my testing with spam is accurate
enough. I also think that for the goodlist it's even more important
to collect one's own. I'll take up these tests again when I've got
big enough spam and good collections, and try to find a correlation
between database size and accuracy. I'll also need more time for
that.

Conclusion: I believe that if case mangling were switched off, the
default values might have to be changed. I also believe that the
rather small increase in database size justifies case mangling
being an option one day. After all, bayes filtering should generally
get better with more data to look at and more ways to discriminate
between the tokens.

I also believe I'm not a scientist and nobody should just believe
me, and someone, if interested, should try and reproduce
something like that, preferably with a better data collection. I
might have made a hundred mistakes.

Regards, jo