medium and long term trends?
Bopolissimus Platypus
bopolissimus at sni.ph
Fri Jan 2 19:22:59 CET 2004
hello all,
i don't understand the math (of graham's original suggestion, or robinson
and fisher's enhancements) well enough to have a truly informed opinion on
how the bayesian/graham/robinson/fisher approaches will work out in
the medium and long term.
from the depths of my ignorance of the math though, bubble up some
questions.
1. over time, i see much more spam than ham. and i report the spam
but not the ham (it's probably a natural impulse, scratch the itch
that irritates, ignore the good). generally, what will this trend lead
to? will spam message reporting eventually pollute the wordcount
databases so much that we need to retrain (dropping old spam
emails) regularly? or do we need to run bogoutil regularly
to drop old words? do we need to "normalize" counts so
that certain word counts don't get so large that they
completely overshadow other words just because their
counts are so large compared to non-spam counts? or is
the math such that it doesn't matter if we stop identifying
ham after the initial training run?
2. how dangerous are those spams that consist of 70-80%
random dictionary words and the rest, spammy words?
i can see that after reporting, the messages are classified
correctly as spam. but what happens over the medium
and long term when, after two hundred thousand such spams
and maybe 1000 hams, wordlists are so corrupted that almost
any word is considered spammy?
is bogofilter immune to this? or do we just rebuild our
wordlists from only the most recent ham and spam (or all
the ham, and only the most recent spam?)
3. does bogofilter currently decode base64, MIME, uuencode,
zip files? on the mailing list, long ago, i noted resistance to
decoding anything (understandable, due to the performance
hit) but clearly decoding that stuff has got to work better
than just classifying on the MIME headers. although of
course there are still ways to spoof.
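
to make questions 1 and 2 a bit more concrete, here is a rough sketch of
how i think a per-word score in the graham/robinson style is computed.
this is only my reading of the idea, not bogofilter's actual code: the
function name, the parameter names and the default values of s and x are
all my own assumptions. the point is that each count is divided by the
number of messages of its own kind before the two are compared.

```python
# rough sketch, NOT bogofilter's actual code. word_score, its parameters
# and the defaults for s and x are assumptions made up for illustration.

def word_score(spam_count, ham_count, n_spam_msgs, n_ham_msgs, s=1.0, x=0.5):
    """robinson-style smoothed spamminess of one word, between 0 and 1.

    spam_count / ham_count: how often the word was seen in spam / ham.
    n_spam_msgs / n_ham_msgs: how many spam / ham messages were trained.
    s: weight given to the prior; x: score assumed for an unseen word.
    """
    # normalize each count by the size of its own corpus, so that having
    # far more spam than ham does not by itself push a word toward spam
    spam_freq = spam_count / n_spam_msgs if n_spam_msgs else 0.0
    ham_freq = ham_count / n_ham_msgs if n_ham_msgs else 0.0

    # a word never seen anywhere just gets the neutral prior
    if spam_freq + ham_freq == 0.0:
        return x

    # graham-style raw estimate from the two relative frequencies
    p = spam_freq / (spam_freq + ham_freq)

    # robinson's smoothing: pull words with little evidence toward x
    n = spam_count + ham_count
    return (s * x + n * p) / (s + n)


# the scenario from question 2: 200000 spams and 1000 hams
N_SPAM, N_HAM = 200_000, 1_000

# dictionary word seen in 1% of spam and 1% of ham
neutral = word_score(2_000, 10, N_SPAM, N_HAM)

# word seen in 1% of spam but 10% of ham
hammy = word_score(2_000, 100, N_SPAM, N_HAM)

# word seen in 1% of spam and never in ham
spammy = word_score(2_000, 0, N_SPAM, N_HAM)

print(neutral, hammy, spammy)
```

if this sketch is right, a word that is equally common in both kinds of
mail stays neutral no matter how lopsided the totals are, and a word with
a raw spam count 20 times its ham count can still come out hammy. what
the normalization cannot fix is a small ham corpus: a random dictionary
word that happens never to appear in my 1000 hams looks strongly spammy
after only a few such spams, which is really the heart of question 2.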
tiger
--
Gerald Timothy Quimpo gquimpo*hotmail.com tiger*sni*ph
http://bopolissimus.sni.ph
Public Key: "gpg --keyserver pgp.mit.edu --recv-keys 672F4C78"
my name is Inigo Montoya. You killed my father. Prepare to die.