My wordlist doesn't detect spam very well anymore

Teemu Likonen tlikonen at iki.fi
Sun Feb 9 08:52:34 CET 2020


Hello

In the last six months or so I have noticed that my Bogofilter
wordlist.db doesn't detect spam very well and I'm asking for ideas to
help me understand the possible reasons.

Background:

  - I am the only user of this wordlist, and I'm confident that the
    number of false classifications in the database is close to zero.

  - I have always used "bogofilter -u" because I have thought (perhaps
    wrongly) that more training is always better. The current
    .MSG_COUNT is 45419 (spam) and 165548 (good).

  - My incoming good/spam message ratio is currently about 26/1.

  - I have used default min_dev, ham_cutoff, spam_cutoff and other
    values.

  - Most of my mail comes from different computer related mailing lists.
    They are pretty clean but some spam messages slip in occasionally.
    Usually those messages are classified as good messages or unsure
    messages and I need to fix this with "bogofilter -Ns" or "-s". Even
    after (re)training, the very same messages and similar future spam
    messages are still "unsure".

Recently I read some Bogofilter documents and realized that they suggest
that we should keep the .MSG_COUNT of spam and good messages quite close
to each other. For example, document bogofilter-tuning.HOWTO.html says:

    If one list grows faster than the other, extra (correctly
    classified) messages may be added from time to time to equalize them
    again; try to keep the smaller list's message count at least two
    thirds of the larger's.

So my database, with its 45419 (spam) and 165548 (good) messages, has
not followed the advice above. Perhaps I should have dropped the "-u"
option long ago.
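
To put numbers on the two-thirds guideline, here is a minimal Python
sketch. It only does the arithmetic on the two .MSG_COUNT values quoted
above; the function name balance_report is made up for illustration and
is not part of Bogofilter.

```python
def balance_report(spam_count, good_count, num=2, den=3):
    """Compare two message counts against a num/den balance guideline.

    Returns (ratio, needed): the smaller/larger ratio, and how many
    messages the smaller list would need to gain to reach num/den of
    the larger list. Hypothetical helper, not a Bogofilter API.
    """
    smaller, larger = sorted((spam_count, good_count))
    ratio = smaller / larger
    # Integer ceiling of larger * num / den, avoiding float rounding.
    min_smaller = -(-larger * num // den)
    needed = max(0, min_smaller - smaller)
    return ratio, needed


ratio, needed = balance_report(45419, 165548)
print(f"smaller/larger ratio: {ratio:.3f}")          # 0.274
print(f"messages missing from smaller list: {needed}")  # 64947
```

With these counts the spam list is about 27% of the good list, well
below two thirds, and it would take roughly 65000 more spam messages to
close the gap by adding messages alone.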

Is there a way to fix my current database or classification?

I have never changed Bogofilter's settings before and only just recently
read the documentation about things like min_dev, spam_cutoff and
ham_cutoff. My quick test seems to suggest that raising min_dev above 0.4
helps to detect spam better. I have about 100 spam messages for testing
and some of them are "unsure" with Bogofilter's default settings.
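
As a rough illustration of why min_dev matters: tokens whose spam
probability lies within min_dev of 0.5 are left out of the score, so a
higher value keeps only the strongly spammy or strongly hammy tokens.
The sketch below is not Bogofilter code. The token names, their
probabilities and the two thresholds are invented for the example, and
treating the boundary as inclusive is an assumption.

```python
def tokens_used(token_probs, min_dev):
    """Keep tokens whose probability deviates from 0.5 by at least min_dev.

    Simplified model of the min_dev filter; the inclusive comparison
    is an assumption, not taken from Bogofilter's source.
    """
    return {tok: p for tok, p in token_probs.items()
            if abs(p - 0.5) >= min_dev}


# Hypothetical per-token spam probabilities (made up for illustration).
sample = {
    "viagra": 0.99,
    "unsubscribe": 0.92,
    "patch": 0.03,
    "kernel": 0.12,
    "hello": 0.55,
}

for min_dev in (0.375, 0.45):
    print(min_dev, sorted(tokens_used(sample, min_dev)))
```

In this toy data the lower threshold keeps four tokens and the higher
one keeps only the two most extreme ones, which shows how raising
min_dev can change the classification of a borderline message.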

I also tried to start from scratch: I copied my years-old database
elsewhere and trained a new database with the latest 100 spam messages
and 5500 good messages. I used default Bogofilter settings. This seems
to detect my current spam (and similar new spam) better but it will
require some training in the future.

Any suggestions? Which way is better? (1) Start from scratch and use
some better ways to maintain the quality of my wordlist. (2) Continue
using the old wordlist and apply some other settings and maintenance
practices?

-- 
///  OpenPGP key: 4E1055DC84E9DFF613D78557719D69D324539450
//  https://keys.openpgp.org/search?q=tlikonen@iki.fi
/  https://keybase.io/tlikonen  https://github.com/tlikonen