medium and long term trends?

Greg Louis glouis at dynamicro.on.ca
Fri Jan 2 20:18:44 CET 2004


On 20040103 (Sat) at 0222:59 +0800, Bopolissimus Platypus wrote:

> questions.
Some opinions and practices (after about 15 months' use of bogofilter):

> 1. over time, i see much more spam than ham, and i report the spam
>     but not the ham (it's probably a natural impulse: scratch the itch
>     that irritates, ignore the good). generally, what will this trend
>     lead to?  will spam message reporting eventually pollute the
>     wordcount databases so much that we need to retrain (dropping old
>     spam emails) regularly?  or do we need to run bogoutil regularly
>     to drop old words?  do we need to "normalize" counts so that
>     certain word counts don't get so large that they completely
>     overshadow other words just because their counts are so large
>     compared to non-spam counts?  or is the math such that it doesn't
>     matter if we stop identifying ham after the initial training run?

The wordlist message counts should be kept as nearly equal as
practical.  For example, I'm currently at 22,434 spam and 21,240
nonspam.
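As a sketch of how to keep an eye on that balance: the stored totals live in the wordlist's .MSG_COUNT pseudo-token (readable with bogoutil -w; the wordlist path below is an assumption), and a few lines of shell can tell you how far apart the lists have drifted:

```shell
# The stored message counts can be read from the wordlist's
# .MSG_COUNT pseudo-token, e.g.:
#   bogoutil -w ~/.bogofilter/wordlist.db .MSG_COUNT

# Given the two counts, report the imbalance so you know
# roughly how much nonspam (or spam) to top up with.
balance() {
    spam=$1
    ham=$2
    diff=$((spam - ham))
    [ "$diff" -lt 0 ] && diff=$((0 - diff))
    echo "spam=$spam nonspam=$ham difference=$diff"
}

balance 22434 21240
```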

> 2.  how dangerous are those spams that consist of 70-80%
>      random dictionary words and the rest, spammy words?
>      i can see that after reporting, the messages are classified
>      correctly as spam.  but what happens over the medium
>      and long term when, after two hundred thousand such spams
>      and maybe 1000 hams, wordlists are so corrupted that almost
>      any word is considered spammy?

See the preceding response.  If the wordlist message counts are equal
or thereabouts, spam that uses lots of nonspam words will neutralize
some otherwise "unspammy" words (they drop out of scoring), but enough
tokens will still occur frequently in your nonspam message population
that bogofilter's accuracy shouldn't suffer, especially if you train on
error once the wordlist is a reasonable size, and just top up from time
to time with extra nonspam to keep it balanced.

>     is bogofilter immune to this?  or do we just rebuild our
>     wordlists from only the most recent ham and spam (or all
>     the ham, and only the most recent spam)?

When I have to rebuild, I train with about 20,000 each of "recent" spam
and nonspam, and thereafter I train on error, adding a few nonspam
whenever the db starts to get out of balance.  That seems to work
adequately for me.
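As a sketch, that routine translates to command lines like these (the mailbox and message filenames are hypothetical; -s and -n register spam and nonspam, -M treats the input as an mbox, and the combined -Ns/-Sn forms are meant to move a message that was registered on the wrong list):

```shell
# Rebuild: train with roughly equal amounts of recent mail.
bogofilter -s -M -I recent-spam.mbox       # register an mbox of spam
bogofilter -n -M -I recent-nonspam.mbox    # register an mbox of nonspam

# Thereafter, train on error only:
bogofilter -s < missed-spam.msg            # false negative -> spam list
bogofilter -n < false-positive.msg         # false positive -> nonspam list

# Correct a message that was registered on the wrong list:
bogofilter -Ns < registered-nonspam-actually-spam.msg
bogofilter -Sn < registered-spam-actually-nonspam.msg
```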

> 3.  does bogofilter currently decode base64, MIME, uuencode,
>      zip files?  on the mailing list, long ago, i noted resistance to
>      decoding anything (understandable, due to the performance
>      hit), but clearly decoding that stuff has got to work better
>      than just classifying on the MIME headers.  although of
>      course there are still ways to spoof.

It doesn't decode such stuff, other than MIME headers, though it's been
considered.  I suspect the improvement is small enough that the only
people who might need it are the high-volume users.  For them, one
false positive in ten thousand and one false negative in two hundred
are unacceptably many, and it's worthwhile to invest in the memory and
processors to overcome the performance hit.  Me, I'd rather keep the
speed decent even on the old P400 gateway where my personal mail
arrives.  Though with the recent explosion of spam volume, one false
negative in two hundred spam is beginning to be a lot for me too! :(

-- 
| G r e g  L o u i s         | gpg public key: 0x400B1AA86D9E3E64 |
|  http://www.bgl.nu/~glouis |   (on my website or any keyserver) |
|  http://wecanstopspam.org in signatures helps fight junk email. |




More information about the Bogofilter mailing list