Naive Bayes classifier derived from bogofilter-0.7

Scott Lenser slenser at cs.cmu.edu
Tue Nov 26 01:31:20 CET 2002


> With 0.6.0 I can't actually get your bogofilter to complete its runs;
> I get gadzillions of gmime-WARNING messages, some of which are utterly
> bogus, eg
> gmime-WARNING **: No domain in email address: "Attila
> =?iso-8859-1?q?Szov=E1thy=22?= <aszovathy at gw.cdk.bme.hu>
> 
> and there is one email in the second of three test runs I scripted that
> causes bogofilter_srl to hang, eating cpu but effecting nothing. 
> Unless I can get around that, I won't be able to do any worthwhile
> comparisons.
> 

I get a lot of the stupid gmime-WARNING messages as well.  I usually just
redirect stderr to /dev/null to ignore them.

I've noticed a problem while doing some further testing.  I rely on gmime to
get rid of mime, and I bumped up MAXWORDLEN so that I could store tokens like
'HF:User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.8) Gecko/20020214'.
As a result, if the email includes quoted parts that used to be mime encodings
(which gmime no longer sees as mime), I end up creating a whole bunch of
"words" out of the base64 encoded cruft.  Basically, messages like

> <base64 stuff>
> <base64 stuff>
> <base64 stuff>

will make it take a long time on that message.  I've never seen it fail to
terminate, but it can be very slow.  You should be able to fix that particular
problem by putting a base64 encoding filter in lexer_text_plain.l and lexer_text_html.l.
The current one in bogofilter-0.9 is suitable if you remove the ^ from the beginning
(and maybe the $ from the end, but that is probably not needed).
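
To illustrate why the ^ matters, here is a rough sketch in Python rather than lex.
It is not the actual bogofilter-0.9 rule.  The names strip_base64_lines, BASE64_RUN
and MIN_RUN, and the 40-character threshold, are made up for the example.  The point
is only that an unanchored pattern still catches base64 runs that sit behind a "> "
quote prefix, while an anchored one would miss them.

```python
import re

# Assumption: treat any run of at least MIN_RUN base64 characters (optionally
# followed by '=' padding) as encoded cruft. The threshold is illustrative only.
MIN_RUN = 40
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{%d,}={0,2}" % MIN_RUN)


def strip_base64_lines(text):
    """Return text with every line containing a long base64 run removed."""
    kept = []
    for line in text.splitlines():
        # search() rather than match(): the pattern is not anchored to the
        # start of the line, so a leading "> " does not hide the base64 run.
        if BASE64_RUN.search(line):
            continue
        kept.append(line)
    return "\n".join(kept)
```

A threshold this crude could also drop lines containing other long unbroken
alphanumeric strings, so it would need tuning against real mail.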

- Scott


