training set growth

Jeremy Blosser jblosser-bogofilter at firinn.org
Wed Nov 20 21:04:46 CET 2002


On Nov 20, David Relson [relson at osagesoftware.com] wrote:
> At 02:23 PM 11/20/02, Greg Louis wrote:
> >Dunno if you saw my mail to the list about preventing endless growth of
> >training sets.  Would it be straightforward to do something (perhaps in
> >bogoutil) like "remove all tokens with counts of one?"
> 
> The simplest thing, from a unix point of view, would be to use bogoutil to
> dump the database, use grep (or other tool) to select the lines with 
> counts > 1, then create a new database and use bogoutil to load it.
> 
> The following will delete words with counts of 0 and 1:
> 
> for file in goodlist.db spamlist.db ; do
>         bogoutil -d $BOGOFITLER_DIR/$file.db > goodlist.all
>         cat $file.all | egrep -v " [01]$" bogofilter -l $file.db
>         cp $file.db $BOGOFILTER_DIR
> done

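Sorry, but I don't think that's quite what you meant, is it?  The loop
above appends ".db" twice, always dumps to goodlist.all, and is missing
the pipe into bogoutil for the load.  More like: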

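for file in goodlist spamlist ; do
        # dump the current word list to a text file
        bogoutil -d $BOGOFILTER_DIR/$file.db > $file.all
        # drop lines whose count is 0 or 1, load the rest into a new db
        cat $file.all | egrep -v " [01]$" | bogoutil -l $file.db
        # put the pruned db back in place
        cp $file.db $BOGOFILTER_DIR
done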

You were probably in a hurry, but I'm trying to pick up the bogofilter
utils since those are new to me, so it's good practice. ;-)
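
If you want to check the filtering step by itself before touching a real
word list, something like the sketch below might help.  It is not part of
bogofilter: prune_low_counts is a made-up helper name, the sample lines
are invented, and it assumes each dump line ends in a numeric count (the
same assumption the egrep pattern makes).  It swaps egrep for an awk
numeric comparison on the last field, so it does not depend on the
separator before the count being a single space.

```shell
# Hypothetical helper (not a bogofilter tool): read dump lines on stdin and
# keep only those whose last field, assumed to be the count, is greater than 1.
prune_low_counts() {
    awk '$NF + 0 > 1'
}

# Invented sample lines standing in for "bogoutil -d" output.
printf '%s\n' 'viagra 12' 'hello 1' 'zzz 0' 'meeting 3' | prune_low_counts
```

In the loop above, the helper would simply take the place of the egrep
stage, between the dump and the "bogoutil -l" load.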



