medium and long term trends?

Bopolissimus Platypus bopolissimus at sni.ph
Fri Jan 2 19:22:59 CET 2004


hello all,

i don't understand the math (of graham's original suggestion, or robinson
and fisher's enhancements) well enough to have a truly informed opinion on
how the bayesian/graham/robinson/fisher approaches will work out in
the medium and long term.

from the depths of my ignorance of the math, though, some
questions bubble up.

1. over time, i see much more spam than ham, and i report the spam
    but not the ham (it's probably a natural impulse: scratch the itch
    that irritates, ignore the good). generally, what will this trend lead
    to?  will spam message reporting eventually pollute the wordcount
    databases so much that we need to retrain (dropping old spam
    emails) regularly?  or do we need to run bogoutil regularly
    to drop old words?  do we need to "normalize" counts so
    that certain word counts don't get so large that they
    completely overshadow other words just because their
    counts are so large compared to non-spam counts?  or is
    the math such that it doesn't matter if we stop identifying
    ham after the initial training run?

2. how dangerous are those spams that consist of 70-80%
    random dictionary words and, for the rest, spammy words?
    i can see that after reporting, the messages are classified
    correctly as spam.  but what happens over the medium
    and long term when, after two hundred thousand such spams
    and maybe 1000 hams, wordlists are so corrupted that almost
    any word is considered spammy?

    is bogofilter immune to this?  or do we just rebuild our
    wordlists from only the most recent ham and spam (or all
    the ham, and only the most recent spam)?

3. does bogofilter currently decode base64, MIME, uuencode,
    zip files?  on the mailing list, long ago, i noted resistance to
    decoding anything (understandable, due to the performance
    hit), but clearly decoding that stuff has got to work better
    than just classifying on the MIME headers, although of
    course there are still ways to spoof.
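
to make questions 1 and 2 concrete, here is a rough python sketch of
what i mean by "normalizing" counts.  it assumes a robinson-style
per-word score in which each count is divided by the number of messages
of that class before the two are compared.  this is my assumption about
the general technique, not code taken from bogofilter; the function
name, the parameters s and x, and all the numbers are made up for
illustration.

```python
def word_spamicity(spam_count, ham_count, n_spam_msgs, n_ham_msgs,
                   s=1.0, x=0.5):
    """Robinson-style score for one word (illustrative sketch only).

    spam_count / ham_count: number of spam / ham messages containing the word
    n_spam_msgs / n_ham_msgs: total messages registered in each class
    s: strength given to the prior (assumed value, not bogofilter's default)
    x: score assumed for a word never seen before (assumed value)
    """
    if n_spam_msgs <= 0 or n_ham_msgs <= 0:
        raise ValueError("need at least one message of each class")

    n = spam_count + ham_count
    if n == 0:
        # Unknown word: fall back to the prior.
        return x

    # Normalize raw counts by corpus size, so 200000 spams vs 1000 hams
    # does not by itself push every word toward "spammy".
    spam_rate = spam_count / n_spam_msgs
    ham_rate = ham_count / n_ham_msgs
    p = spam_rate / (spam_rate + ham_rate)

    # Blend with the prior; words with few observations stay near x.
    return (s * x + n * p) / (s + n)


# Hypothetical corpus from question 2: 200000 spams, 1000 hams.
N_SPAM, N_HAM = 200_000, 1_000

# A dictionary word seen in 1% of spams and 1% of hams.
print(word_spamicity(2_000, 10, N_SPAM, N_HAM))

# A word seen in more spams than hams in absolute terms,
# but in a much larger share of the hams.
print(word_spamicity(100, 50, N_SPAM, N_HAM))

# A word that was only ever registered as spam.
print(word_spamicity(2_000, 0, N_SPAM, N_HAM))
```

if the score really is computed along these lines, the first word
comes out at 0.5 (neutral) and the second well under 0.05 (hammy),
even though both have far larger raw spam counts.  the third comes out
close to 1, which is the case i am worried about: random dictionary
words that show up in spam but that i never registered as ham.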

tiger

-- 
Gerald Timothy Quimpo  gquimpo*hotmail.com tiger*sni*ph
http://bopolissimus.sni.ph
Public Key: "gpg --keyserver pgp.mit.edu --recv-keys 672F4C78"

  my name is Inigo Montoya.  You killed my father.  Prepare to die.





