wordlist maintenance
David Relson
relson at osagesoftware.com
Sun Dec 15 18:39:23 CET 2002
Greetings,
Today's subject is wordlist maintenance. At present bogofilter has only
minimal capabilities in this area. Bogoutil is the tool used to dump and
load wordlists, and it always operates on the _whole_ wordlist. It lacks
any kind of filtering or selection mechanism.
I have corrected this deficiency by adding an age value to the
wordlists. The age value is a day-of-year that ranges from 1 to
366. bogofilter sets it when updating a token count. bogoutil can filter
on it when dumping and loading wordlists. In addition to aging, token
counts and lengths can be used for filtering. Note: it is not necessary
to update your wordlist as the dates will be added to it over time.
Because the date-last-modified value is a day-of-year, it wraps around
every year, so maintenance based on it needs to happen more often than
once a year.
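
To illustrate why the wrap matters, here is a small Python sketch. It is
not bogofilter's code; the helper names day_of_year and days_since are
made up for this example, and it assumes the stamp is simply the calendar
day-of-year on which the token was last updated.

```python
from datetime import date


def day_of_year(d: date) -> int:
    """Return the calendar day-of-year (1..366) for a date."""
    return d.timetuple().tm_yday


def days_since(today: int, stamp: int, year_length: int = 365) -> int:
    """Days elapsed between two day-of-year values.

    Only meaningful if the stamp is less than one year old; year_length
    is the length of the year in which the stamp was recorded.
    """
    return (today - stamp) % year_length


# Two updates exactly one year apart get the same stamp, so a
# day-of-year value alone cannot tell them apart.
old_stamp = day_of_year(date(2001, 12, 15))
new_stamp = day_of_year(date(2002, 12, 15))
print(old_stamp, new_stamp)

# A stamp from 15 Dec 2002 seen on 10 Jan 2003 (day-of-year 10).
print(days_since(day_of_year(date(2003, 1, 10)), new_stamp))
```

Both stamps come out identical, which is why pruning has to run more
often than once a year for the age to stay unambiguous.
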
Here are some sample commands:
"bogoutil -d wordlist.db -c 2" - dump all tokens with counts greater than
or equal to 2.
"bogoutil -d wordlist.db -a 100" - dump all tokens with ages less than or
equal to 100.
"bogoutil -d wordlist.db -s5,30" - dump all tokens with lengths between 5
and 30 (inclusive).
"bogoutil -d wordlist.db -y 123" - when dumping, assign date-last-modified
value of 123 to any tokens that lack the attribute.
"-c count" - When used with "-d" (dump), bogoutil will dump tokens whose
counts exceed the given count. The same restriction will apply when using
"-l" (load).
"-a age" - bogoutil will dump/load tokens whose age is newer than the given
age.
"-s min,max" - bogoutil will dump/load tokens whose size, i.e. character
count, is between the min and max values.
"-y day-of-year" - provides the age attribute for any token that needs it.
Note that these options can also be used with the "-l" (load) option for
bogoutil. Also, bogofilter recognizes the "-y" option.
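
For readers who want to picture the selection rules, the following Python
sketch mimics the count, size and default-date options on an in-memory
list. It is an illustration only, not bogoutil's implementation: the Entry
type and the select function are hypothetical, the count test uses the
"greater than or equal" reading from the sample commands, and the age
filter is left out.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class Entry(NamedTuple):
    token: str
    count: int
    day: Optional[int]  # day-of-year stamp, None if the token has none


def select(entries: Iterable[Entry],
           min_count: Optional[int] = None,
           min_len: Optional[int] = None,
           max_len: Optional[int] = None,
           default_day: Optional[int] = None) -> Iterator[Entry]:
    """Yield entries passing the count and length filters.

    Entries without a stamp receive default_day, if one is given.
    """
    for e in entries:
        if min_count is not None and e.count < min_count:
            continue  # assumed reading of "-c": keep counts >= min_count
        if min_len is not None and len(e.token) < min_len:
            continue  # "-s min,max": token too short
        if max_len is not None and len(e.token) > max_len:
            continue  # "-s min,max": token too long
        if e.day is None and default_day is not None:
            e = e._replace(day=default_day)  # like "-y day-of-year"
        yield e


wordlist = [
    Entry("free", 9, 40),
    Entry("viagra", 1, 12),
    Entry("mortgage", 4, None),
]
kept = list(select(wordlist, min_count=2, min_len=5, max_len=30,
                   default_day=123))
print(kept)
```

Here only "mortgage" survives: "free" is too short, "viagra" has too low
a count, and the missing stamp on "mortgage" is filled in with 123.
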
Also, "-n" (replace_nonascii_characters) has been added to help in dealing
with Asian spam. It converts non-ASCII characters to '?'. When it is used
with bogoutil's '-l' (load) option, Asian tokens of equal length will be
given the same representation and will be merged in the wordlist. This
option is likely to be useful only to those processing US-ASCII mail. It is
not recommended for those processing mail in languages with many accented
vowels and consonants, as those characters will be modified in the wordlists.
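
The merging effect can be sketched in a few lines of Python. This is not
the bogofilter source: replace_nonascii and merge_counts are hypothetical
names, tokens are treated as raw bytes, "non-ASCII" is assumed to mean any
byte with the high bit set, and merged counts are assumed to be summed.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def replace_nonascii(token: bytes) -> bytes:
    """Replace every byte outside the 7-bit ASCII range with '?'."""
    return bytes(b if b < 0x80 else ord("?") for b in token)


def merge_counts(entries: Iterable[Tuple[bytes, int]]) -> Dict[bytes, int]:
    """Sum the counts of tokens that become identical after replacement."""
    merged: Counter = Counter()
    for token, count in entries:
        merged[replace_nonascii(token)] += count
    return dict(merged)


entries = [
    (b"\xc4\xe3\xba\xc3", 3),  # arbitrary 4-byte token, all high-bit bytes
    (b"\xca\xc0\xbd\xe7", 2),  # a different 4-byte high-bit token
    (b"caf\xe9", 4),           # Latin-1 word with one accented letter
    (b"hello", 1),             # plain ASCII, left untouched
]
merged = merge_counts(entries)
print(merged)
```

The two four-byte tokens collapse into a single "????" entry, while the
accented word loses its last letter, which is the side effect the warning
above is about.
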
These capabilities are in the current CVS version of bogofilter. The
documentation still needs to be updated.
David