bogotrain

David Relson relson at osagesoftware.com
Fri Dec 19 15:15:43 CET 2003


On Fri, 19 Dec 2003 08:05:16 -0500 (EST)
Dave Lovelace <dave at firstcomp.biz> wrote:

> I ran bogotrain, and it ran for many hours.  Then, when it finally was
> done, here's what it produced:
> > 
> > The wordlist contains 17589 non-spam and 1500 spam messages.
> > Bogotune must be run with at least 2000 of each.
> > The wordlist has a ratio of spam to non-spam of 0.1 to 1.0.
> > Bogotune requires the ratio be in the range of 0.2 to 5.
> > 
> I don't know how to check how many messages of each kind there are,
> so as to know in advance whether bogotune will be happy with any
> wordlist I build.  This is the wordlist I built from mail I had on
> hand when I uprev'd bogofilter.
> 
> But the big question is: why should it take over 12 hours for bogotune
> to find out that the wordlist is unacceptable?  Surely this is
> something it can check quickly at the outset?

Hi Dave,

Do you know what bogotune was doing?  At startup it has to read the
wordlist and the messages into memory.  When there isn't enough RAM
available and your operating system starts swapping, performance
suffers drastically.  That's likely what happened.

Bogotune's man page and FAQ have a lot of information about it.  They're
recommended reading.

To see more about what's happening, run bogotune with "-v" or "-vv".

Message counts can be determined with:

   cat ham*.mbx | grep -c "^From "
   cat spam*.mbx | grep -c "^From "
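
Once you have the two counts, they can be compared against the limits
bogotune printed in the message quoted above (at least 2000 of each
kind, spam to non-spam ratio between 0.2 and 5).  The following is only
a rough sketch: check_counts is a made-up helper name, not part of
bogofilter, and treating the ratio limits as inclusive is an assumption.

```shell
#!/bin/sh
# Hypothetical helper, not part of bogofilter: compare message counts
# with the limits bogotune reported (2000 of each, ratio 0.2 to 5).
# Usage: check_counts NON_SPAM_COUNT SPAM_COUNT
check_counts() {
    ham=$1
    spam=$2
    status=0

    if [ "$ham" -lt 2000 ] || [ "$spam" -lt 2000 ]; then
        echo "too few messages: $ham non-spam, $spam spam (need 2000 of each)"
        status=1
    fi

    # Integer arithmetic only: 0.2 <= spam/ham <= 5 is the same as
    # ham <= 5*spam and spam <= 5*ham (limits assumed inclusive).
    if [ "$ham" -gt $((spam * 5)) ] || [ "$spam" -gt $((ham * 5)) ]; then
        echo "ratio of spam to non-spam is outside 0.2 to 5"
        status=1
    fi

    if [ "$status" -eq 0 ]; then
        echo "counts look acceptable"
    fi
    return "$status"
}
```

It could be fed from the grep commands above, assuming the same mbox
file names:

   ham=$(cat ham*.mbx | grep -c "^From ")
   spam=$(cat spam*.mbx | grep -c "^From ")
   check_counts "$ham" "$spam"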

Converting the input messages to the message count format is a good
thing to do.  It takes a hunk of time, but will save time when bogotune
is run.

When you learn more about what bogotune was doing during those hours,
let me know.  I _may_ be able to suggest something that will speed
things for you.

Hope this helps.

David



