wordlists [was: ACME Labs spam wordlist available for use.]

Mon Nov 8 03:18:49 CET 2004

On Sun, 7 Nov 2004 20:33:30 -0500
Eric Wood wrote:

> From: "David Relson" <relson at osagesoftware.com>
> >> Jef Poskanzer <jef at acme.com> wrote:
> >>
> >> >See http://www.acme.com/spamwords/
> 
> I take it that I can't really use my 0.17.2 wordlist, so I'm dreading
> the process of retraining (ahem, because I haven't kept a spam/ham
> reservoir). So I wish I could grab someone else's 'generic' spam list
> without worry.
> 
> One thing that prevents me from using someone else's list is that I
> don't know how the database was built and against what particular
> version of bogofilter using what particular set of database options.
> 
> For example, I'd love to get bogofilter-0.93.0 but I don't know if I
> can just pull acme's list and go.  I don't know if acme's list is
> 0.93.0 "ready", i.e. every word was stored as tri-state versus
> dual-state, or some other peculiarity.
> 
> Maybe bogofilter can give me a short diagnostic saying that this
> wordlist is optimally using all of 0.93.0 features.
> 
> -Eric Wood 

Hi Eric,

Your 0.17.2 wordlist is fine.  No need to change it for 0.93.0.

There's only been one big change in the wordlist's internal format.
Remember when bogofilter switched from two wordlists (named goodlist.db
and spamlist.db) to one (wordlist.db)?  With two wordlists, each entry
was the token ("word"), its count (ham or spam, depending on list), and
a timestamp (YYYYMMDD, i.e. when the entry was last changed).  With one
wordlist, an entry is a 4-tuple, i.e. (token, spam count, ham count,
timestamp).
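
To make the difference between the two layouts concrete, here is a
rough Python sketch.  It is only an illustration of the entry shapes
described above, not bogofilter's actual on-disk format or conversion
code.  The names Entry and merge_wordlists are made up for this
example, and taking the later of the two timestamps when a token is in
both lists is my assumption.

```python
from typing import Dict, NamedTuple, Tuple


class Entry(NamedTuple):
    """Value stored per token in the single-wordlist layout (illustrative)."""
    spam_count: int
    ham_count: int
    timestamp: int  # YYYYMMDD, date the entry was last changed


def merge_wordlists(
    goodlist: Dict[str, Tuple[int, int]],
    spamlist: Dict[str, Tuple[int, int]],
) -> Dict[str, Entry]:
    """Combine two token -> (count, timestamp) maps into one map.

    Assumption: when a token is present in both lists, keep the more
    recent of the two timestamps.
    """
    merged: Dict[str, Entry] = {}
    for token in set(goodlist) | set(spamlist):
        # A token missing from one list simply has a count of 0 there.
        ham_count, ham_date = goodlist.get(token, (0, 0))
        spam_count, spam_date = spamlist.get(token, (0, 0))
        merged[token] = Entry(spam_count, ham_count, max(ham_date, spam_date))
    return merged


# Made-up sample data, just to show the shapes.
goodlist = {"hello": (3, 20041101)}
spamlist = {"hello": (1, 20041105), "viagra": (7, 20041020)}
merged = merge_wordlists(goodlist, spamlist)
```

The point is that the counts themselves carry over unchanged; only the
way they are grouped per token differs.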

The tri-state change doesn't affect the wordlist in any way. It doesn't
even change the numeric spam score, i.e. the number between 0.000000 and
1.000000, for a message.  What tri-state _does_ change is the
"classification" of the message.  With two-state, a message is
classified as "Yes" (meaning spam) or "No" (meaning ham).  With
tri-state, the message is labeled as "Spam", "Ham", or "Unsure".

The natural next question is: How?  The answer is "according to
parameters spam_cutoff and ham_cutoff".  Bogofilter has known about both
spam_cutoff and ham_cutoff for a long time -- well over a year.

Here's how they're used:

If the message's score is greater than or equal to spam_cutoff, the
X-Bogosity line now says "Spam".  If ham_cutoff is non-zero and the
message score is less than or equal to ham_cutoff, the X-Bogosity line
says "Ham".  If the above inequalities are both false, the message score
is between ham_cutoff and spam_cutoff and the message is labeled
"Unsure."  Also, if ham_cutoff has a value of 0.0, bogofilter operates
in two-state mode and all messages that aren't "spam" (as defined above)
are "ham".
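
The rules above can be sketched in a few lines of Python.  This is
just a restatement of the logic in this message, not code taken from
bogofilter.  The function name classify is made up, the cutoff values
in the example calls are arbitrary and are not bogofilter's defaults,
and returning the "Yes"/"No" labels in two-state mode is my reading of
the description above.

```python
def classify(score: float, spam_cutoff: float, ham_cutoff: float) -> str:
    """Return the label for a message score (illustrative sketch)."""
    # A ham_cutoff of 0.0 selects two-state mode.
    two_state = (ham_cutoff == 0.0)

    # Spam test comes first: score at or above spam_cutoff.
    if score >= spam_cutoff:
        return "Yes" if two_state else "Spam"

    # Two-state mode: everything that is not spam is ham.
    if two_state:
        return "No"

    # Tri-state mode: score at or below ham_cutoff is ham.
    if score <= ham_cutoff:
        return "Ham"

    # Otherwise the score lies between ham_cutoff and spam_cutoff.
    return "Unsure"


# Example calls with arbitrary cutoffs (not bogofilter's defaults).
tri_high = classify(0.95, spam_cutoff=0.9, ham_cutoff=0.1)
tri_mid = classify(0.50, spam_cutoff=0.9, ham_cutoff=0.1)
tri_low = classify(0.05, spam_cutoff=0.9, ham_cutoff=0.1)
two_mid = classify(0.50, spam_cutoff=0.9, ham_cutoff=0.0)
```

Note that the score itself is never altered; the same score only gets
a different label depending on the two cutoffs.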

What's different now is that bogofilter's _default_ mode is tri-state
mode with Spam/Ham/Unsure labels.  Previously, the default was two-state
mode with Yes/No labels.

So, to answer the initial question, you _can_ use your old wordlist and
Acme's list will be compatible.

There might be some variations in how the lists were built, as
bogofilter has options to _not_ apply the special header tags ("head:",
"subj:", "from:", etc.) and options to mask characters with values above
0x80.

HTH,

David