spam levels [was: Including html-tag contents ...]

Mon May 12 17:14:49 CEST 2003

Tony,

I've changed the subject line as this thread is no longer about html tags ...

At 10:33 AM 5/12/03, Tony L. Svanstrom wrote:

>On Mon, 12 May 2003 the voices made David Relson write:
>
>DR> I confess that I still don't grok how values "2" and "4" would affect
>DR> bogofilter's behavior.  Can you write some simple perl that would
>DR> illustrate your idea?
>
>  In pseudo-bayesian talk:
>
>  If we look at the 10 best tokens to tell if the e-mail is spam then my
>idea is to set the value of some of them as either "pure spam" or "pure ham"
>even before we start looking at the e-mail.

The Graham algorithm used the 15 best ham/spam indicators.  With Robinson 
and Robinson-Fisher, bogofilter uses all tokens which are more than MIN_DEV 
distance from EVEN_ODDS.  As shipped, these values are 0.1 and 0.5 
respectively, so all tokens scoring below 0.4 or above 0.6 are used.  Tests 
by Greg and me indicate that MIN_DEV should be between 0.35 and 0.45, though 
the exact value depends on the site and its email.
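
To make the MIN_DEV rule concrete, here is a minimal Python sketch of the 
selection step.  It is an illustration only, not bogofilter's actual code: 
the function name and the token scores are made up, and only the constants 
0.1 and 0.5 come from the paragraph above.

```python
EVEN_ODDS = 0.5  # neutral score: token is neither hammish nor spammish
MIN_DEV = 0.1    # shipped default distance from EVEN_ODDS


def significant_tokens(scores, min_dev=MIN_DEV, even_odds=EVEN_ODDS):
    """Keep only tokens scoring more than min_dev away from even_odds.

    Hypothetical helper for illustration; `scores` maps token -> score.
    """
    return {tok: p for tok, p in scores.items()
            if abs(p - even_odds) > min_dev}


# Made-up token scores, purely for illustration.
scores = {"viagra": 0.99, "meeting": 0.12, "the": 0.52,
          "free": 0.61, "hello": 0.45}

kept_default = significant_tokens(scores)             # MIN_DEV = 0.1
kept_strict = significant_tokens(scores, min_dev=0.4)  # a stricter MIN_DEV
print(sorted(kept_default))
print(sorted(kept_strict))
```

Raising MIN_DEV shrinks the set of tokens that take part in the score, 
which is why the best value depends on the site's mail.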

>  In procmail it could be used like this:
>
># E-mails from this person are almost never spam, but the sender could be
># forged; so we "hardcode" 3/10 tokens as pure ham before looking at the
># e-mail.
>:0:
>* ^From.*mygf at dom.tld
>         | bogofilter -u 1
>
># Almost all ham to this address is in a different language than the spam
># I get to this address; so we set 2/10 tokens as pure spam.
>:0:
>* ^To:.*myoldaddress at dom.tld
>         | bogofilter -u 4
>
>  Those are, of course, very simple examples; one might want to use an
>existing scoring-based (man procmailsc) spam filter and just add bogofilter
>at the end of it.  That would be a lot easier to do if you could give its
>results to bogofilter, instead of bogofilter "ignoring" the previous work
>and starting from scratch.

I'm still not clear on your idea.  It sounds like you want to change a 
percentage of the token scores to 0.0 or to 1.0.  So "-u 1" would take the 
lowest 30% of the hammish tokens and change their scores to 0.0, i.e. "pure 
ham", and "-u 4" would take the highest 20% of the spammish tokens and set 
them to 1.0, i.e. "pure spam".  Is that right?

Given a message of 100 words, suppose 50 were eliminated by MIN_DEV and of 
the remaining 50 words, 20 had hammish scores (less than 0.50) and 30 had 
spammish scores (above 0.50).  What effect would values "-U1", "-U2", etc. 
have?  (Note: since "-u" is in use and "-U" is not used, I'm using "-U" 
to designate the "Unsure" switch.)
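
To pin the question down, here is one possible reading of the proposal, 
sketched in Python against the 100-word example.  Everything in it is an 
assumption on my part: the function name, the fractions tied to each switch 
value, and the choice to take the fraction over all 50 retained tokens 
rather than over the hammish or spammish subset.  Nothing like this exists 
in bogofilter.

```python
def pin_scores(scores, fraction, pure_ham):
    """Hypothetical: override a fraction of the retained token scores.

    pure_ham=True  pins the lowest-scoring tokens to 0.0 ("pure ham").
    pure_ham=False pins the highest-scoring tokens to 1.0 ("pure spam").
    """
    ordered = sorted(scores)
    n = int(round(fraction * len(ordered)))  # number of tokens to override
    if pure_ham:
        return [0.0] * n + ordered[n:]
    return ordered[:len(ordered) - n] + [1.0] * n


# 50 tokens survive MIN_DEV: 20 hammish and 30 spammish (made-up scores).
retained = [0.2] * 20 + [0.8] * 30

ham_pinned = pin_scores(retained, 0.3, pure_ham=True)    # 30% as pure ham
spam_pinned = pin_scores(retained, 0.2, pure_ham=False)  # 20% as pure spam

print(ham_pinned.count(0.0), spam_pinned.count(1.0))
```

Under this reading the first call overrides 15 of the 50 tokens and the 
second overrides 10.  If the fraction were meant to apply only to the 20 
hammish or 30 spammish tokens, the counts would differ, which is exactly 
the ambiguity I'm asking about.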

David