MIME content-type tokenization

Mon Feb 23 07:39:13 CET 2004

Oh, and another idea, while thinking about multiple tokenizations: how 
about using N-grams as an additional token type?  Generally speaking, 
you can estimate how likely a message is to be in a particular language 
from its 2-gram frequencies, so if something is in a language other 
than what the recipient wants to receive (such as, say, Russian, or 
Spammer, or whatever) then the 2-gram tokens might provide a further 
emergent high-level classification.

At worst, there'll just be 65536 (256 x 256) byte-level 2-grams which 
have nearly equal spam and non-spam weights. :)
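
To make that concrete, here's a rough sketch of what 2-gram tokens 
could look like.  It assumes byte-level 2-grams (which is where the 
65536 figure comes from) and a made-up "2gram:" prefix to keep them 
apart from ordinary word tokens; the function names are hypothetical 
and not taken from any existing filter.

```python
from collections import Counter


def byte_bigrams(text, encoding="utf-8"):
    """Return every overlapping 2-byte window of the encoded text."""
    data = text.encode(encoding, errors="replace")
    # A message of n bytes yields n - 1 bigrams (none if n < 2).
    return [data[i:i + 2] for i in range(len(data) - 1)]


def bigram_tokens(text):
    """Turn the bigrams into string tokens in their own namespace.

    The "2gram:" prefix is an assumption for illustration; it only
    stops these tokens from colliding with normal word tokens.
    """
    return ["2gram:" + bg.hex() for bg in byte_bigrams(text)]


# Per-message counts, which could be handed to the classifier
# alongside whatever the other tokenizers produce.
counts = Counter(bigram_tokens("hello hello"))
```

Each distinct token then gets its own spam and non-spam weight like 
any other token, so 2-grams that carry no signal should just settle 
near neutral.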

--
http://trikuare.cx/