Do we need an exclusion list or something?
David Relson
relson at osagesoftware.com
Sat Sep 14 00:03:38 CEST 2002
<x-flowed>
At 05:34 PM 9/13/02, Paul Tomblin wrote:
>Quoting Jonathan Buzzard (jonathan at buzzard.org.uk):
> > eds at reric.net said:
> > > In my opinion this will always be a problem. I spotted this when I
> > > fed it a bunch of spam messages from the month of May and then found
> > > that the word "may" was being treated as a very strong indicator of
> > > spamicity.
> >
> > I hinted on this at the beginning of the week. There are two problems
> > the inclusion of common words, which don't mean anything, and stuff
> > getting included from the headers.
>
>The problem I pointed out was with words that are in 100% of the spam
>messages *and* 100% of the ham messages. Surely those should have been
>filtered out already?
I agree. The code gets hamness and spamness numbers for each word, then
computes probabilities based on the ratio of these values to their
respective message counts. The computed probability is then used to populate
the array for spamicity determination.
g: 0.029780 b: 1.000000 p: 0.971082 linda
g: 0.000831 b: 0.006615 p: 0.888350 babe
g: 0.000529 b: 0.006064 p: 0.919752 porno
g: 0.455841 b: 0.891400 p: 0.661649 free
g: 1.000000 b: 1.000000 p: 0.500000 http
g: 1.000000 b: 1.000000 p: 0.500000 with
Here's a small sample of the words in my "hello babe" message (spam). Each
line shows the word's hamness (g, the fraction of non-spam messages that
contained the word), its spamness (b, the fraction of spam messages that
contained it), the probability (p) of the word being spam, and the word
itself. As can be seen, linda appears in 100% of the spam messages (the
actual count from spamlist.db is 2585 occurrences in 1814 messages), but
only 2.9780% of good messages (394 of 26461). With these numbers, the
probability of 97.1082% seems reasonable. Looking at "http" and "with",
both are in 100% of the messages, hence they get a 50% probability (and
won't contribute to the spamicity).
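For anyone who wants to check the arithmetic, here is a rough sketch in
Python (not bogofilter's actual code). It assumes p = b / (g + b), which is
inferred from the g, b and p columns in the sample, and that each ratio is
capped at 1.0, which would explain how 2585 occurrences in 1814 messages
shows up as 1.000000. The function names and the weight parameter are
hypothetical. Note that 394 / 26461 is about 0.01489, and the listed g for
linda is about twice that; doubling the good count, as Paul Graham's
"A Plan for Spam" does, reproduces 0.029780, but that is an assumption on
my part.

```python
# Hypothetical helpers for checking the sample numbers; not bogofilter code.

def clipped_ratio(count, messages, weight=1.0):
    """Ratio of (weighted) word count to message count, capped at 1.0."""
    return min(1.0, weight * count / messages)

def word_probability(g, b):
    """Assumed formula, inferred from the sample: p = b / (g + b)."""
    return b / (g + b)

# (word, g, b, listed p) copied from the sample in this message.
sample = [
    ("linda", 0.029780, 1.000000, 0.971082),
    ("babe",  0.000831, 0.006615, 0.888350),
    ("porno", 0.000529, 0.006064, 0.919752),
    ("free",  0.455841, 0.891400, 0.661649),
    ("http",  1.000000, 1.000000, 0.500000),
    ("with",  1.000000, 1.000000, 0.500000),
]

for word, g, b, listed_p in sample:
    p = word_probability(g, b)
    print(f"{word:6s} computed {p:.6f}  listed {listed_p:.6f}")

# linda: the spam ratio is capped; the good ratio uses the assumed weight of 2.
linda_b = clipped_ratio(2585, 1814)
linda_g = clipped_ratio(394, 26461, weight=2.0)
print(f"linda  g {linda_g:.6f}  b {linda_b:.6f}")
```

The computed values agree with the listed p to within rounding of the
six-digit g and b inputs. A word with g equal to b always comes out at
exactly 0.5, which is the case for "http" and "with".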
For summary digest subscription: bogofilter-digest-subscribe at aotto.com
</x-flowed>