scripts

Tom Allison tallison at tacocat.net
Sat Mar 13 23:04:00 CET 2004


Boris 'pi' Piwinger wrote:
> Tom Allison <tallison at tacocat.net> wrote:
> 
> 
>>I came up with this little script to use for training/testing to 
>>exhaustion.  Works fairly well on my email archives of ham/spam.
>>
>>Designed for maildir format mail (courier-imap).
> 
> 
> Looks pretty inefficient. IF you start building the
> database, you will first only have ham, and stop learning
> after the first message. Then you only have spam and
> probably only learn very few messages. Then again only ham.
> I'd assume that you need many runs to do it.
> 
> It might be more elegant to use -T and grep -v ^N or
> something like that. Also you can use security margins to
> improve results.
> 
> pi
> 

HISTORY:
I started with bogofilter -u and, after building up ~500 ham and ~500 
spam messages in my training archives, I turned off the '-u' option 
and trained only on error.  However, I found that over time, some of my 
archived messages drifted to Unsure (from ham or spam) and required 
some additional training.

As a result of these findings, I decided that the best approach is to 
take my current history (starting on 2/18/2004) and re-test the 
entire ham/spam archives to exhaustion (2070 ham, 1601 spam).  This 
results in <25 corrections the first time through.  One took 5 
iterations to complete.

However, running a modified script that only tests shows that I am 
getting at most about 1 message a day of uncertainty of any kind.  I 
have not adjusted any of my parameters from default (except for 
ham_cutoff to support ternary reporting).

Typically what I am doing at the moment is taking all ham and storing it 
in the archives.  I would do the same with spam, but I'm short on 
tokens for bogotune, so I'm training on all the spam and then saving it 
into the archives.  Eventually I should be able to just archive the spam.

I also found that my spam token count went up considerably (100+) after 
my initial training to exhaustion exercise.

It would be more efficient to change the script to:
bogofilter -T < "$F" | grep -v N && bogofilter -n < "$F" && bogofilter -T < "$F"
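
For reference, here is a minimal sketch of that train-on-error step wrapped 
in a loop that repeats until a pass needs no corrections.  The function 
names, the directory argument and the pass limit are my own placeholders, 
and it assumes that the first character of the `bogofilter -T` output is 
the class letter (H, S or U, as in the sample output in this message).

```shell
#!/bin/sh
# Sketch only: train-on-error over one folder, repeated to exhaustion.
# Assumptions: function names and arguments are hypothetical, and
# `bogofilter -T` prints a line starting with H (ham), S (spam) or U (unsure).

# train_pass DIR WANT FLAG
#   DIR  - directory holding one message per file
#   WANT - class letter the messages should get (H or S)
#   FLAG - registration flag to use on a miss (-n for ham, -s for spam)
# Prints the number of messages that had to be registered.
train_pass() {
    dir=$1
    want=$2
    flag=$3
    fixed=0
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        # Parse the terse output rather than relying on the exit status.
        got=$(bogofilter -T < "$f" | cut -c1)
        if [ "$got" != "$want" ]; then
            bogofilter "$flag" < "$f"
            fixed=$((fixed + 1))
        fi
    done
    echo "$fixed"
}

# train_to_exhaustion DIR WANT FLAG [MAX_PASSES]
# Repeats train_pass until a pass makes no corrections or MAX_PASSES is hit.
train_to_exhaustion() {
    max=${4:-10}
    i=0
    while [ "$i" -lt "$max" ]; do
        i=$((i + 1))
        n=$(train_pass "$1" "$2" "$3")
        echo "pass $i: $n corrections"
        if [ "$n" -eq 0 ]; then
            break
        fi
    done
}
```

It would be called once per archive, for example 
`train_to_exhaustion ~/Maildir/.ham/cur H -n` and then the same with 
`S -s` for the spam folder (both paths are made up for illustration). 
Running the ham and spam passes alternately, rather than one folder to 
completion first, should help with the ordering concern raised above.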

Is there any way I could combine the last two steps to get the resultant 
score after training?  I don't have anything to test right now, my 
cronjobs just ran, but something like:
bogofilter -T < "$F" | grep -v H && bogofilter -nT < "$F"
to give an output like:
U 0.415000
H 0.015000
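
Until it is clear whether -n and -T can be combined in a single 
invocation, a small wrapper that runs the two steps and prints only the 
final terse line would give the same kind of output.  This is a sketch 
with a hypothetical function name, not something I have run against my 
archives:

```shell
# Sketch only: register a message, then print its terse score afterwards.
# rescore_after_training is a hypothetical helper name.
#   $1 - message file
#   $2 - registration flag (-n for ham, -s for spam)
rescore_after_training() {
    bogofilter "$2" < "$1"
    # The exit status of a classifying run reflects the class, not
    # success, so mask it to keep callers using `&&` or `set -e` happy.
    bogofilter -T < "$1" || true
}
```

Called as `rescore_after_training "$F" -n`, it prints a single line 
in the same format as the sample output above.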

Please keep in mind, this is a very human readable script right now. 
I'm sending all the output to my mailbox to review after it runs.  Had 
to make sure it worked first.




