training set growth

David Relson relson at osagesoftware.com
Wed Nov 20 20:42:17 CET 2002


At 02:23 PM 11/20/02, Greg Louis wrote:

>Dunno if you saw my mail to the list about preventing endless growth of
>training sets.  Would it be straightforward to do something (perhaps in
>bogoutil) like "remove all tokens with counts of one?"

The simplest thing, from a unix point of view, would be to use bogoutil to 
dump the database, use grep (or another tool) to select the lines with 
counts > 1, then create a new database and use bogoutil to load it.

The following will delete words with counts of 0 and 1:

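for file in goodlist spamlist ; do
         # dump the current word list as text
         bogoutil -d "$BOGOFILTER_DIR/$file.db" > $file.all
         # drop tokens with counts of 0 or 1, load the rest into a new database
         egrep -v " [01]$" $file.all | bogoutil -l $file.db
         # replace the old database with the pruned one
         cp $file.db "$BOGOFILTER_DIR"
done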
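
If a cutoff other than 1 is wanted, the grep step could be swapped for a 
small awk filter. This is only a sketch: prune_counts is a made-up helper 
name, not part of bogoutil, and it assumes each dumped line is a token 
followed by its count as the second whitespace-separated field.

```shell
# Hypothetical helper (not part of bogoutil): read "token count" lines on
# stdin and print only those whose count is greater than the given minimum.
# Assumption: the count is the second whitespace-separated field.
prune_counts() {
    min="${1:-1}"                       # default cutoff: drop counts <= 1
    awk -v min="$min" '$2 + 0 > min + 0'
}
```

Inside the loop it would take the place of egrep, along the lines of 
"prune_counts 1 < $file.all | bogoutil -l $file.db". Comparing the count 
numerically avoids having to write a new pattern for each cutoff.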





