scripts
Tom Allison
tallison at tacocat.net
Sat Mar 13 23:04:00 CET 2004
Boris 'pi' Piwinger wrote:
> Tom Allison <tallison at tacocat.net> wrote:
>
>
>>I came up with this little script to use for training/testing to
>>exhaustion. Works fairly well on my email archives of ham/spam.
>>
>>Designed for maildir format mail (courier-imap).
>
>
> Looks pretty inefficient. If you start building the
> database, you will first only have ham, and stop learning
> after the first message. Then you only have spam and
> probably only learn very few messages. Then again only ham.
> I'd assume that you need many runs to do it.
>
> It might be more elegant to use -T and grep -v ^N or
> something like that. Also you can use security margins to
> improve results.
>
> pi
>
HISTORY:
I started with bogofilter -u and, after building up ~500 ham and ~500
spam messages in my training archives, turned off the '-u' option
and now train only on error. However, I found that over time some of my
archived messages drifted to Unsure (from ham or spam) and required
some additional training.
As a result of these findings, I decided that the best approach is to
take my current history (starting on 2/18/2004) and retest the
entire ham/spam archives to exhaustion (2070 ham, 1601 spam). This
results in <25 corrections the first time through. One took 5
iterations to complete.
However, running a modified script that only tests (no training) shows
that I am getting at most about 1 message a day of uncertainty of any
kind. I have not adjusted any of my parameters from default (except for
ham_cutoff to support ternary reporting).
Typically what I am doing at the moment is taking all ham and storing it
in the archives. I would do the same with spam, but I'm short on
tokens for bogotune, so I'm training on all the spam and then saving it
into the archives. Eventually I should be able to just archive the spam.
I also found that my spam token count went up considerably (100+) after
my initial training to exhaustion exercise.
It would be more efficient to change the script to:
bogofilter -T < "$F" | grep -v N && bogofilter -n < "$F" && bogofilter -T < "$F"
Is there any way I could combine the last two steps to get the resultant
score after training? I don't have anything to test with right now (my
cron jobs just ran), but something like:
bogofilter -T < "$F" | grep -v H && bogofilter -nT < "$F"
to give an output like:
U 0.415000
H 0.015000
Please keep in mind, this is a very human readable script right now.
I'm sending all the output to my mailbox to review after it runs. Had
to make sure it worked first.
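
For anyone who wants to try the train-on-error idea on a ham archive,
here is a rough sketch of how the loop might look. It is not the script
discussed in this thread. It assumes that `bogofilter -T` prints a line
starting with H, S, or U (as in the sample output above) and that
`bogofilter -n` registers a message as ham. The function names and the
cap on the number of passes are made up for illustration.

```shell
#!/bin/sh
# Sketch only: train ham on error, repeating passes until nothing changes.
# Assumption: `bogofilter -T` prints "H <score>", "S <score>" or "U <score>".

# Register one message as ham only if it is not already scored as ham.
# Returns 0 if a correction was made, 1 if the message was already ham.
train_ham_on_error() {
    msg=$1
    verdict=$(bogofilter -T < "$msg")
    case $verdict in
        H*) return 1 ;;               # already ham, nothing to learn
    esac
    bogofilter -n < "$msg"            # register as ham
    # Show the score before and after training for later review.
    printf '%s: %s -> %s\n' "$msg" "$verdict" "$(bogofilter -T < "$msg")"
    return 0
}

# Run passes over every file in a directory until a pass makes no
# corrections, or until the (hypothetical) pass limit is reached.
train_ham_to_exhaustion() {
    dir=$1
    max_passes=${2:-10}               # illustrative safety cap
    pass=0
    while [ "$pass" -lt "$max_passes" ]; do
        pass=$((pass + 1))
        corrections=0
        for f in "$dir"/*; do
            [ -f "$f" ] || continue
            if train_ham_on_error "$f"; then
                corrections=$((corrections + 1))
            fi
        done
        echo "pass $pass: $corrections corrections"
        if [ "$corrections" -eq 0 ]; then
            return 0                  # exhausted: every message scores as ham
        fi
    done
    return 1                          # gave up after max_passes
}
```

Usage would be something like `train_ham_to_exhaustion ~/Maildir/.ham/cur`
(the path is only an example). A spam version would presumably swap
`-n` for `-s` and `H*` for `S*`. Running the classifier a second time
after training is the straightforward way to see the resulting score;
whether a single invocation can do both is the open question above.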