training set growth
Jeremy Blosser
jblosser-bogofilter at firinn.org
Wed Nov 20 21:04:46 CET 2002
On Nov 20, David Relson [relson at osagesoftware.com] wrote:
> At 02:23 PM 11/20/02, Greg Louis wrote:
> >Dunno if you saw my mail to the list about preventing endless growth of
> >training sets. Would it be straightforward to do something (perhaps in
> >bogoutil) like "remove all tokens with counts of one?"
>
> The simplest thing, from a unix point of view, would be to use bogoutil to
> dump the database, use grep (or other tool) to select the lines with
> counts > 1, then create a new database and use bogoutil to load it.
>
> The following will delete words with counts of 0 and 1:
>
> for file in goodlist.db spamlist.db ; do
> bogoutil -d $BOGOFITLER_DIR/$file.db > goodlist.all
> cat $file.all | egrep -v " [01]$" bogofilter -l $file.db
> cp $file.db $BOGOFILTER_DIR
> done
sorry, but I don't think that's quite what you meant, is it? more like:
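for file in goodlist spamlist ; do
    # dump the existing wordlist to a text file
    bogoutil -d "$BOGOFILTER_DIR/$file.db" > "$file.all"
    # drop lines ending in a count of 0 or 1, load the rest into a new db
    cat "$file.all" | egrep -v ' [01]$' | bogoutil -l "$file.db"
    # put the pruned db back in place
    cp "$file.db" "$BOGOFILTER_DIR"
done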
you were probably in a hurry, but I'm trying to pick up the bogofilter
utils since those are new to me, so it's good practice. ;-)
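
For anyone who wants to check the filter step before pointing it at a real
wordlist, here is a minimal sketch that runs the same egrep pattern over a
few made-up lines. It assumes the dump is one "token count" pair per line
with the count as the last field, which is what the pattern above relies on.
The function name prune_low_counts and the sample tokens are mine, not part
of bogofilter.

```shell
# Hypothetical helper (not part of bogofilter): reads "token count" lines on
# stdin and drops those whose trailing count is exactly 0 or 1.
prune_low_counts() {
    egrep -v ' [01]$'
}

# Made-up sample data in the assumed "token count" layout.
printf '%s\n' 'viagra 12' 'hello 1' 'meeting 0' 'invoice 10' 'free 2' \
    | prune_low_counts
```

Only the lines for viagra, invoice and free should come through. The leading
space in the pattern is what stops counts like 10 or 11 from being removed
along with 0 and 1.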