MIME content-type tokenization
fluffy
magenta at trikuare.cx
Mon Feb 23 07:39:13 CET 2004
Oh, and another idea that came up while thinking about multiple
tokenizations: how about using N-grams as an additional token type?
Generally speaking, you can estimate how likely a message is to be in a
particular language from the frequencies of its 2-grams. So if a message
is in a language other than what the recipient wants to receive (say,
Russian, or Spammer, or whatever), the 2-gram tokens might provide a
further emergent high-level classification.
At worst, there'll just be 65536 2-grams (256 x 256, counting byte pairs)
which have nearly equal spam and non-spam weights. :)
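
To make the idea concrete, here is a rough sketch of what byte-level 2-gram tokens might look like. It is written in Python for brevity and is not bogofilter's actual tokenizer; the function names and the "2gram:" prefix are made up for illustration, and it assumes 2-grams are taken over raw bytes (which is where the 65536 figure comes from).

```python
from collections import Counter


def byte_bigrams(body):
    """Count every overlapping 2-byte sequence in a message body (bytes)."""
    return Counter(body[i:i + 2] for i in range(len(body) - 1))


def bigram_tokens(body, prefix="2gram:"):
    """Render each 2-gram as a token string that a Bayesian filter could
    score next to its ordinary word tokens. The prefix is hypothetical;
    it only keeps 2-gram tokens from colliding with word tokens."""
    return [prefix + body[i:i + 2].hex() for i in range(len(body) - 1)]


if __name__ == "__main__":
    sample = b"abab"
    print(byte_bigrams(sample))
    print(bigram_tokens(sample))
```

Each token would then pick up its own spam and non-spam counts like any other token, so 2-grams typical of an unwanted language would drift toward spam while common ones stay near neutral.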
--
http://trikuare.cx/
More information about the bogofilter-dev mailing list