training set growth
David Relson
relson at osagesoftware.com
Wed Nov 20 20:42:17 CET 2002
At 02:23 PM 11/20/02, Greg Louis wrote:
>Dunno if you saw my mail to the list about preventing endless growth of
>training sets. Would it be straightforward to do something (perhaps in
>bogoutil) like "remove all tokens with counts of one?"
The simplest thing, from a unix point of view, would be to use bogoutil to
dump the database, use grep (or another tool) to select the lines with
counts > 1, then create a new database and use bogoutil to load it.
The following will delete words with counts of 0 and 1:
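for file in goodlist spamlist ; do
    # dump the database as text, one token and its count per line
    bogoutil -d "$BOGOFILTER_DIR/$file.db" > "$file.all"
    # drop lines whose count is 0 or 1; load the rest into a new $file.db here
    egrep -v " [01]$" "$file.all" | bogoutil -l "$file.db"
    # replace the original database with the trimmed one
    cp "$file.db" "$BOGOFILTER_DIR"
done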
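
To see what the filter keeps before touching a real database, the grep step can be tried on a few made-up lines. This is only a sketch: the tokens and counts below are invented, and it assumes the dump has one token and one count per line, separated by a space, which is what the pattern above expects. It uses grep -E, which should behave the same as egrep here.

```shell
# Hypothetical sample of dump output: "token count", one pair per line.
sample='free 12
viagra 1
hello 0
meeting 10
lunch 21'

# Keep only lines whose count is not exactly 0 or 1.
# The leading space and trailing $ anchor the match to the whole count,
# so counts such as 10 or 21 are not removed by mistake.
kept=$(printf '%s\n' "$sample" | grep -Ev ' [01]$')
printf '%s\n' "$kept"
```

If the dump carries extra fields after the count, the pattern would need adjusting, so it is worth checking a few lines of the real dump first.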