problem email, bogofilter and bogolexer hang

Greg Louis glouis at dynamicro.on.ca
Thu Jan 23 20:04:26 CET 2003


On 20030123 (Thu) at 1307:55 -0500, Greg Louis wrote:
> On 20030123 (Thu) at 1235:35 -0500, Greg Louis wrote:
> > I've warned David about a problem I encountered trying to rebuild my
> > training db with 0.10.[1]: the 1047th email in my spam corpus, if
> > processed alone, causes bogofilter to hang without output; if processed
> > as part of the 14,262-email mbox file, it causes bogofilter to exit
> > with an "Invalid buffer size" message.
> 
> The problem is not encountered if "-k n" is included on the command
> line.

Unfortunately, that's not the end of the story.  If I try to build the
training db with -k n, I get odd output:
# ./bogofilter -v -s -k n -d /root/scratch </store/spam_corpus 
# 10909495 words, 235 messages

The word count may be right, but the message count is 14,027 short
(235 reported against the 14,262 in the mbox file).
Then the program starts building spamlist.db, and goes on and on and
on... With 0.8.0, 6.9 million tokens are stored in 11 MB, and I set
datestamp_token to false, so I ought to need about 17 MB for 0.10.0's
spamlist.db file.  It's still growing as I write this, and it's already
over 27 MB.  I think I'll kill the job; this doesn't look promising.
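
The size estimate above is a straight linear extrapolation from the
0.8.0 figures. A minimal sketch of that arithmetic follows; it assumes
storage grows in proportion to token count and that the reported "words"
figure is the number of tokens stored, neither of which has been
verified. The function name estimate_db_mb is made up for illustration
and is not part of bogofilter.

```shell
# Hypothetical helper: linear extrapolation of db size from a reference run.
# Assumes size scales in proportion to token count (not verified).
estimate_db_mb() {
    # $1 = new token count
    # $2 = reference token count
    # $3 = reference db size in MB
    awk -v n="$1" -v ref="$2" -v mb="$3" \
        'BEGIN { printf "%.1f\n", n / ref * mb }'
}

# 10,909,495 words reported now; 6.9 million tokens took 11 MB under 0.8.0.
estimate_db_mb 10909495 6900000 11    # prints 17.4
```

By that yardstick a file already past 27 MB is more than half again as
large as expected.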

My next attempt was
# cat /store/mail/backup/csseen* | ./bogofilter -v -n -k n -d /root/scratch 
# 8151 words, 38 messages

That's not right; there are lots more than 8151 tokens, and there are
4106 messages:
# cat /store/mail/backup/csseen* | wc -w
2518020
# cat /store/mail/backup/csseen* | grep -c '^From '                    
4106
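
The same cross-check can be wrapped up for reuse. This is only a sketch:
count_mbox_messages and message_shortfall are hypothetical helper names,
and counting '^From ' lines assumes the mbox files escape any body lines
that begin with "From ", otherwise the count comes out too high.

```shell
# Hypothetical helper: count messages in one or more mbox files by their
# "From " separator lines. Overcounts if body lines are not escaped.
count_mbox_messages() {
    # grep -c exits non-zero when the count is 0, so mask that status.
    cat -- "$@" | grep -c '^From ' || true
}

# Hypothetical helper: how many messages the reported count is missing.
message_shortfall() {
    # $1 = expected message count, $2 = count reported by bogofilter -v
    echo $(( $1 - $2 ))
}

# Example with the numbers from the two runs in this message:
message_shortfall 14262 235    # prints 14027
message_shortfall 4106 38      # prints 4068
```

For example, count_mbox_messages /store/mail/backup/csseen* should give
the 4106 shown above.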

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |
| Help free our mailboxes. Include                   |
|        http://wecanstopspam.org in your signature. |




More information about the bogofilter-dev mailing list