Naive Bayes classifier derived from bogofilter-0.7

Greg Louis glouis at dynamicro.on.ca
Tue Nov 26 12:45:34 CET 2002


On 20021125 (Mon) at 1931:20 -0500, Scott Lenser wrote:
> > With 0.6.0 I can't actually get your bogofilter to complete its runs;
> > I get gadzillions of gmime-WARNING messages, some of which are utterly
> > bogus, eg
> > gmime-WARNING **: No domain in email address: "Attila
> > =?iso-8859-1?q?Szov=E1thy=22?= <aszovathy at gw.cdk.bme.hu>
> > 
> > and there is one email in the second of three test runs I scripted that
> > causes bogofilter_srl to hang, eating cpu but effecting nothing. 
> > Unless I can get around that, I won't be able to do any worthwhile
> > comparisons.

Quite a while after I sent this, the program did indeed terminate on
that message and went on to complete the run.

> I get a lot of the stupid gmime-WARNING messages as well.  I usually just
> redirect stderr to /dev/null to ignore them.

Unfortunately it seems the gmime-CRITICAL ones cause the program to
quit without reporting a result; this happens about 0.3% of the time.
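
A minimal sketch of the stderr redirect mentioned above, for anyone scripting
test runs. It discards the gmime warnings but still returns the exit status
and stdout, so a run that quit without reporting anything can be noticed.
The helper name run_quiet is hypothetical, and how bogofilter_srl reports its
result (exit status, stdout, or both) is not stated in this thread, so the
caller has to decide what counts as "no result".

```python
import subprocess


def run_quiet(cmd, message: bytes):
    """Run cmd with message on stdin, discarding everything on stderr.

    Returns (returncode, stdout) so the caller can still detect runs
    that ended without producing a usable result.
    """
    proc = subprocess.run(
        cmd,
        input=message,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # same effect as `2>/dev/null` in a shell
        check=False,                # do not raise on a non-zero exit status
    )
    return proc.returncode, proc.stdout
```

A call would look something like run_quiet(["bogofilter_srl"], raw_message),
where the bare command line is an assumption; add whatever arguments your
build actually needs.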

> I've noticed a problem while doing some further testing.  I am relying on gmime
> to get rid of MIME, and I bumped up MAXWORDLEN so that I could store tokens like
> 'HF:User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.8) Gecko/20020214'.
> As a result, if the email includes quoted parts that used to be MIME encodings, I'll
> end up encoding a whole bunch of "words" out of the base64 encoded cruft.  Basically
> messages like
> 
> > <base64 stuff>
> > <base64 stuff>
> > <base64 stuff>
> 
> will cause it to take a long time on that message.  I've never seen it not terminate
> but sometimes it takes a long time.  You should be able to fix that particular
> problem by putting in a base64 encoding filter in lexer_text_plain.l and lexer_text_html.l.
> The current one in bogofilter-0.9 is suitable if you remove the ^ from the beginning
> (and maybe the $ from the end, though that is probably not needed).
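
To illustrate the idea of an unanchored base64 filter, here is a rough sketch
in Python rather than flex. It is not the pattern from bogofilter-0.9; the
regular expression, the 40-character minimum run length, and the function
name strip_base64_runs are all assumptions chosen for illustration. Leaving
off the leading ^ is what lets it match base64 lines that sit behind quote
prefixes such as "> > ".

```python
import re

# Heuristic (an assumption, not bogofilter's own rule): a long run of
# base64-alphabet characters, optionally followed by "=" padding.
# There is no leading ^, so the run may start after a "> > " quote prefix.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{40,}={0,2}")


def strip_base64_runs(text: str) -> str:
    """Return text with long base64-looking runs removed before tokenizing."""
    return BASE64_RUN.sub("", text)
```

Ordinary long tokens such as the User-Agent header quoted above are left
alone, because spaces and punctuation break them into runs much shorter than
the threshold.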

I was able, with patience, to complete the run and will be doing the
data reduction this morning.  Further report to follow.

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |


