Do we need an exclusion list or something?

David Relson relson at osagesoftware.com
Sat Sep 14 00:03:38 CEST 2002


<x-flowed>
At 05:34 PM 9/13/02, Paul Tomblin wrote:
>Quoting Jonathan Buzzard (jonathan at buzzard.org.uk):
> > eds at reric.net said:
> > > In my opinion this will always be a problem.  I spotted this when I
> > > fed it a bunch of spam messages from the month of May and then found
> > > that the word "may" was being treated as a very strong indicator of
> > > spamicity.
> >
> > I hinted at this at the beginning of the week. There are two problems:
> > the inclusion of common words, which don't mean anything, and stuff
> > getting included from the headers.
>
>The problem I pointed out was with words that are in 100% of the spam
>messages *and* 100% of the ham messages.  Surely those should have been
>filtered out already?

I agree.  The code gets hamness and spamness counts for each word, then
scales each count by its respective message count (good or spam).  The
word's probability is computed from those two ratios and is used to
populate the array for spamicity determination.
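
A minimal sketch of that calculation, not the actual bogofilter source.
It assumes a Graham-style formula: the good count is doubled, each ratio
is capped at 1.0, and p = b / (g + b).  The function name
word_probability and the good_bias parameter are hypothetical, chosen
for illustration.  The counts are the ones quoted for "linda" further
down in this message.

```python
def word_probability(good_count, good_msgs, bad_count, bad_msgs,
                     good_bias=2.0):
    """Return (g, b, p) for one word.

    Assumptions (not confirmed by the original message): the good count
    is weighted by good_bias, and both ratios are capped at 1.0.
    """
    # g: hamness, the weighted good count over the number of good messages
    g = min(1.0, good_bias * good_count / good_msgs)
    # b: spamness, the spam count over the number of spam messages
    b = min(1.0, bad_count / bad_msgs)
    # p: probability that the word indicates spam.  A word with zero
    # counts in both lists would divide by zero and is not handled here.
    p = b / (g + b)
    return g, b, p


# Counts quoted for "linda": 394 of 26461 good, 2585 in 1814 spam messages
g, b, p = word_probability(394, 26461, 2585, 1814)
print(f"g: {g:.6f}   b: {b:.6f}  p: {p:.6f}  linda")
```

Under those assumptions the result should agree with the g, b and p
values listed for linda in the sample.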

    g:  0.029780   b:  1.000000  p:  0.971082  linda
    g:  0.000831   b:  0.006615  p:  0.888350  babe
    g:  0.000529   b:  0.006064  p:  0.919752  porno
    g:  0.455841   b:  0.891400  p:  0.661649  free
    g:  1.000000   b:  1.000000  p:  0.500000  http
    g:  1.000000   b:  1.000000  p:  0.500000  with

Here's a small sample of the words in my "hello babe" message (spam).  Each
line shows the word's hamness (g, the fraction of non-spam messages that
contained the word), its spamness (b, the fraction of spam messages ...),
the probability of the word being spam (p), and the word.  As can be seen,
linda appears in 100% of the spam messages (with the actual count from
spamlist.db being 2585 occurrences in 1814 messages), but its hamness is
only 2.9780% (394 of 26461; the listed value is twice that raw ratio).
With these numbers, the probability of 97.1082% seems reasonable.  Looking
at "http" and "with", both are in 100% of the spam messages and 100% of the
good messages, hence they get a 50% probability (and won't contribute to
the spamicity).
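
To illustrate the last point, here is a small sketch that recomputes p
from the g and b values listed above and then combines the word
probabilities.  The combining rule, prod(p) / (prod(p) + prod(1 - p)),
is the one from Paul Graham's "A Plan for Spam" and is assumed here; the
message does not say which rule the code uses.  The names word_prob and
combine are hypothetical.

```python
from math import prod

# (g, b, listed p, word) copied from the sample in this message
SAMPLE = [
    (0.029780, 1.000000, 0.971082, "linda"),
    (0.000831, 0.006615, 0.888350, "babe"),
    (0.000529, 0.006064, 0.919752, "porno"),
    (0.455841, 0.891400, 0.661649, "free"),
    (1.000000, 1.000000, 0.500000, "http"),
    (1.000000, 1.000000, 0.500000, "with"),
]


def word_prob(g, b):
    """Assumed per-word formula: spamness over hamness plus spamness."""
    return b / (g + b)


def combine(probs):
    """Assumed Graham-style combination of per-word probabilities."""
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)


# Compare the listed p with the value recomputed from the rounded g and b
for g, b, listed, word in SAMPLE:
    print(f"{word:6s} listed {listed:.6f}  recomputed {word_prob(g, b):.6f}")

all_probs = [listed for _, _, listed, _ in SAMPLE]
informative = [p for p in all_probs if p != 0.5]

# A word with p = 0.5 scales both products by the same factor,
# so dropping it leaves the combined score unchanged.
print(combine(all_probs), combine(informative))
```

The recomputed values differ slightly from the listed ones for some
words, presumably because g and b are shown rounded to six places.  The
two combined scores come out the same, which is the sense in which
"http" and "with" don't contribute.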


For summary digest subscription: bogofilter-digest-subscribe at aotto.com

</x-flowed>


