problem email, bogofilter and bogolexer hang

David Relson relson at osagesoftware.com
Fri Jan 24 01:57:06 CET 2003


At 02:04 PM 1/23/03, Greg Louis wrote:

>On 20030123 (Thu) at 1307:55 -0500, Greg Louis wrote:
> > On 20030123 (Thu) at 1235:35 -0500, Greg Louis wrote:
> > > I've warned David about a problem I encountered trying to rebuild my
> > > training db with 0.10.[1]: the 1047th email in my spam corpus, if
> > > processed alone, causes bogofilter to hang without output; if processed
> > > as part of the 14,262-email mbox file, it causes bogofilter to exit
> > > with an "Invalid buffer size" message.
> >
> > The problem is not encountered if "-k n" is included on the command
> > line.
>
>Unfortunately, that's not the end of the story.  If I try to build the
>training db with -k n, I get odd output:
># ./bogofilter -v -s -k n -d /root/scratch </store/spam_corpus
># 10909495 words, 235 messages
>
>The word count may be right, but the message count is 14,027 short.
>Then the program starts building spamlist.db, and goes on and on and
>on... with 0.8.0, 6.9 million tokens are stored in 11 Mb, and I set
>datestamp_token to false, so I ought to need about 17 Mb for 0.10.0's
>spamtest.db file.  It's still growing as I write this, and it's already
>over 27 Mb.  I think I'll kill the job; this doesn't look promising.
>
>My next attempt was
># cat /store/mail/backup/csseen* | ./bogofilter -v -n -k n -d /root/scratch
># 8151 words, 38 messages
>
>That's not right; there are lots more than 8151 tokens, and there are
>4106 messages:
># cat /store/mail/backup/csseen* | wc -w
>2518020
># cat /store/mail/backup/csseen* | grep -c '^From '
>4106

A trick I used earlier this week was to run split to divide my 14MB file 
into 1MB portions, then find the problem hunk and break it into 100K 
hunks.  The technique worked like a charm.  It took only a few minutes to 
isolate a message that demonstrated an error processing MIME boundaries.
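
For anyone who wants to script the split-and-test approach, here is a rough
sketch, not the exact commands I ran.  The function name find_bad_chunk, the
CHUNK_TIMEOUT variable and the 60-second default are made up for illustration.
It assumes the split and timeout utilities from GNU coreutils, and that the
command you pass reads the mail on stdin and either hangs or exits nonzero
on the problem input.

```shell
# find_bad_chunk CORPUS CHUNK_BYTES CMD [ARGS...]
# Split CORPUS into CHUNK_BYTES-sized pieces, feed each piece to CMD on
# stdin, and print the path of the first piece on which CMD fails or
# runs longer than the time limit.  Returns 1 if every piece passes.
# (Hypothetical helper; the name and the timeout default are assumptions.)
find_bad_chunk() {
    corpus=$1
    size=$2
    shift 2
    limit=${CHUNK_TIMEOUT:-60}          # seconds before a run counts as hung
    tmp=$(mktemp -d) || return 2
    split -b "$size" "$corpus" "$tmp/chunk."
    for f in "$tmp"/chunk.*; do
        # timeout exits nonzero both when CMD fails and when CMD is killed
        if ! timeout "$limit" "$@" < "$f" > /dev/null 2>&1; then
            echo "$f"                   # pieces are left in place for the next pass
            return 0
        fi
    done
    rm -rf "$tmp"                       # nothing failed, so clean up
    return 1
}
```

Run it once on the full file with a chunk size of 1048576, then again on the
chunk it prints with a chunk size of 102400, passing whichever bogofilter or
bogolexer command line shows the problem.  Since split cuts on byte counts
rather than message boundaries, the bad message can straddle two chunks; if
no chunk fails, retrying with a different chunk size may help.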




