Do we need an exclusion list or something?

Jonathan Buzzard jonathan at buzzard.org.uk
Fri Sep 13 23:27:27 CEST 2002


eds at reric.net said:
> In my opinion this will always be a problem.  I spotted this when I
> fed it a bunch of spam messages from the month of May and then found
> that the word "may" was being treated as a very strong indicator of
> spamicity.

I hinted at this at the beginning of the week. There are two problems:
the inclusion of common words, which don't mean anything, and stuff
from the headers getting included.

The stuff in the headers should be better filtered: a From line should
ignore the "From" but include who it is actually from, the same goes for
Subject, and Date lines should just be dropped, etc.
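
A minimal sketch of the header filtering described above, in Python and
using only the standard library. The function name header_tokens and the
tokenising rules are assumptions made for illustration; this is not how
bogofilter itself parses headers.

```python
import re
from email import message_from_string
from email.utils import parseaddr

def header_tokens(raw_message):
    """Return tokens from the headers, without the field names themselves.

    Hypothetical helper: keeps who the mail is from and the subject words,
    and drops everything else (Date, Received, and so on).
    """
    msg = message_from_string(raw_message)
    tokens = []

    # From: ignore the literal "From", keep the display name and address.
    name, addr = parseaddr(msg.get("From", ""))
    tokens.extend(name.lower().split())
    if addr:
        tokens.append(addr.lower())

    # Subject: keep the words, not the field name.
    tokens.extend(re.findall(r"[a-z0-9']+", msg.get("Subject", "").lower()))

    # Date and all other header lines are simply not looked at.
    return tokens
```

Fed a message whose Date line reads "Fri, 13 Sep 2002", this returns
nothing from that line, so month and day names in the headers never
reach the word lists.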

Then we can catch the rest by calculating the top thirty (or so) words,
removing from the list those with equal probability (within some
tolerance) of being spam or not spam, and picking the best ten (or so)
that remain.
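
One possible reading of that selection step, sketched in Python. The
assumptions here are mine: "top thirty" is taken to mean the thirty most
frequent known words in the message, "equal probability" to mean a spam
probability within tol of 0.5, and "best" to mean furthest from 0.5. The
names pick_words, spam_prob and tol are hypothetical.

```python
from collections import Counter

def pick_words(message_words, spam_prob, top_n=30, keep=10, tol=0.1):
    """Pick the most telling words of a message (illustrative only).

    spam_prob maps a word to its estimated probability of indicating spam.
    """
    # Count only the words we have a probability for.
    counts = Counter(w for w in message_words if w in spam_prob)

    # Top thirty (or so) words by frequency in this message.
    candidates = [w for w, _ in counts.most_common(top_n)]

    # Drop words that are about as likely in spam as in non-spam.
    informative = [w for w in candidates if abs(spam_prob[w] - 0.5) > tol]

    # Keep the ten (or so) that lean hardest one way or the other.
    informative.sort(key=lambda w: abs(spam_prob[w] - 0.5), reverse=True)
    return informative[:keep]
```

With this rule a word like "may" sitting near 0.5 drops out on its own,
without needing to appear in any exclusion list.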

I did some tests with 200 common English words by removing them from the
word lists. With that and some header removal, I saw a dramatic improvement
in accuracy. For the record, until I did this bogofilter was really rather
poor on my test set.
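
A rough sketch of that removal in Python, for illustration. It models a
word list as a plain dict of word to count, which is an assumption and
not bogofilter's actual storage format; COMMON_WORDS holds only a handful
of entries from the 200-word list in this message, and strip_common is a
hypothetical name.

```python
# Excerpt only: a few entries from the 200 common words listed in this mail.
COMMON_WORDS = {"a", "about", "may", "the", "you", "your"}

def strip_common(word_counts, common=COMMON_WORDS):
    """Return a copy of word_counts without the common words.

    Matching is case-insensitive, so "The" and "the" are both removed.
    """
    return {w: n for w, n in word_counts.items() if w.lower() not in common}
```

Applied to both the spam and the non-spam list, this keeps a word such
as "may" from ever being scored, however lopsided the training mail was.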

I have included the 200 words I used, which I got from vmspell (a spelling
program for VMS) below.


JAB.

-=-=-=-=-=-=
a
about
above
after
all
almost
although
always
am
among
an
and
any
anybody
anyone
anything
are
as
back
be
before
being
best
better
between
big
but
by
call
can
can't
cannot
could
did
differ
different
differs
do
does
doesn't
don't
down
during
each
ever
every
everybody
everyone
exactly
example
except
few
fewer
fewest
final
find
first
for
fortunately
found
from
full
gave
get
give
given
good
great
greater
greatest
has
hasn't
have
having
he
her
his
how
however
if
in
into
is
it
it's
its
kind
know
knowing
knows
least
less
like
little
make
many
may
me
might
more
most
much
must
my
never
new
next
no
not
now
number
of
old
on
one
only
or
other
otherwise
our
out
over
right
said
same
saw
say
saying
second
see
seem
seen
sent
several
shall
she
should
show
showed
showing
shown
since
small
so
some
somebody
someone
something
sometimes
take
taken
taking
than
that
the
their
then
there
these
they
third
this
those
three
to
too
two
type
under
until
up
us
use
used
using
usual
usually
very
want
was
we
we're
well
were
what
when
whenever
where
which
while
who
whom
whose
why
will
with
would
wrong
you
you're
your
-=-=-=-=-=-=-=

-- 
Jonathan A. Buzzard                 Email: jonathan at buzzard.org.uk
Northumberland, United Kingdom.       Tel: +44(0)1661-832195




For summary digest subscription: bogofilter-digest-subscribe at aotto.com


