wordlist maintenance

David Relson relson at osagesoftware.com
Sun Dec 15 18:39:23 CET 2002


Greetings,

Today's subject is wordlist maintenance.  Presently bogofilter has only 
minimal capabilities in this area.  Bogoutil is the tool used to dump and 
load wordlists, and it always operates on the _whole_ wordlist.  It lacks 
any kind of filtering or selection mechanism.

I have corrected this deficiency by adding an age value to the 
wordlists.  The age value is a day-of-year that ranges from 1 to 
366.  bogofilter sets it when updating a token count.  bogoutil can filter 
on it when dumping and loading wordlists.  In addition to aging, token 
counts and lengths can be used for filtering.  Note:  it is not necessary 
to update your wordlist as the dates will be added to it over time.

Because the date-last-modified value is a day-of-year, it wraps around 
every year.  Maintenance based on it therefore needs to happen more often 
than once a year; otherwise old and new tokens cannot be told apart.
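
To illustrate why the wrap matters, here is a minimal Python sketch.  It 
is not bogofilter's code; the helper names day_of_year and days_since are 
hypothetical, and the modulo arithmetic is only one way such a stamp could 
be interpreted.  The result is correct only if less than a year has 
passed since the stamp was written.

```python
from datetime import date

def day_of_year(d: date) -> int:
    """Return the day-of-year (1..366) for a calendar date."""
    return d.timetuple().tm_yday

def days_since(stamp: int, today: int, year_len: int = 365) -> int:
    """Days elapsed from a stored stamp to today.

    Assumption: less than one year has passed; a stamp that is
    400 days old is indistinguishable from one that is 35 days old.
    """
    return (today - stamp) % year_len

stamp = day_of_year(date(2002, 12, 15))   # token last updated in December
today = day_of_year(date(2003, 1, 10))    # maintenance run in January
print(stamp, today, days_since(stamp, today))
```

A token stamped in mid-December and examined in early January comes out 
as a few weeks old, even though its raw day-of-year value is much larger 
than today's.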

Here are some sample commands:

"bogoutil -d wordlist.db -c 2" - dump all tokens with counts greater than 
or equal to 2.
"bogoutil -d wordlist.db -a 100" - dump all tokens with ages less than or 
equal to 100.
"bogoutil -d wordlist.db -s5,30" - dump all tokens with lengths between 5 
and 30 (inclusive).
"bogoutil -d wordlist.db -y 123" - when dumping, assign date-last-modified 
value of 123 to any tokens that lack the attribute.

"-c count" - When used with "-d" (dump), bogoutil will dump tokens whose 
counts exceed the given count.  The same restriction will apply when using 
"-l" (load).

"-a age" - bogoutil will dump/load tokens whose age is newer than the given 
age.

"-s min,max" - bogoutil will dump/load tokens whose size, i.e. character 
count, is between the min and max values.

"-y day-of-year" - provides the age attribute for any token that needs it.

Note that these options can also be used with the "-l" (load) option for 
bogoutil.  Also, bogofilter recognizes the "-y" option.
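
How the filters combine can be sketched in Python.  This is an 
illustration only, not bogoutil's implementation: the Token type, the 
keep function and its parameter names are hypothetical.  The comparisons 
follow the wording of the sample commands above (count greater than or 
equal, age less than or equal, length inclusive), and the treatment of 
tokens that have no age and no default is an assumption.

```python
from typing import NamedTuple, Optional, Tuple

class Token(NamedTuple):
    text: str
    count: int
    age: Optional[int]  # day-of-year, or None if no date was stored yet

def keep(tok: Token,
         min_count: Optional[int] = None,         # like "-c count"
         max_age: Optional[int] = None,           # like "-a age"
         size: Optional[Tuple[int, int]] = None,  # like "-s min,max"
         default_age: Optional[int] = None) -> bool:  # like "-y day-of-year"
    # Supply the default age to any token that lacks one.
    age = tok.age if tok.age is not None else default_age
    if min_count is not None and tok.count < min_count:
        return False
    # Assumption: a token with no age at all fails an age filter.
    if max_age is not None and (age is None or age > max_age):
        return False
    if size is not None and not (size[0] <= len(tok.text) <= size[1]):
        return False
    return True

tokens = [
    Token("viagra", 12, 90),
    Token("hi", 1, 200),
    Token("mortgage", 3, None),
]
kept = [t.text for t in tokens
        if keep(t, min_count=2, max_age=100, size=(5, 30), default_age=50)]
print(kept)
```

Each filter is optional and a token must pass every filter that is given, 
which matches the idea of stacking several options on one command line.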

Also, "-n" (replace_nonascii_characters) has been added to help in dealing 
with Asian spam.  It converts non-ASCII characters to '?'.  Used with 
bogoutil's '-l' (load) option, Asian tokens of equal length will be given 
the same representation and will be merged in the wordlist.  This option 
is likely to be useful only for those processing US-ASCII mail.  It is not 
recommended for those processing mail in languages with many accented 
vowels and consonants, as those characters will be modified in the wordlists.
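
The merging effect can be sketched in Python.  This is a rough 
illustration under stated assumptions, not bogofilter's code: tokens are 
treated as raw bytes, every byte of 0x80 or above becomes '?', and the 
function names replace_nonascii and merge are hypothetical.

```python
from collections import Counter
from typing import Dict

def replace_nonascii(token: bytes) -> bytes:
    """Replace every byte outside the 7-bit ASCII range with '?'."""
    return bytes(b if b < 0x80 else ord("?") for b in token)

def merge(counts: Dict[bytes, int]) -> Counter:
    """Re-key a wordlist after replacement, summing counts that collide."""
    merged = Counter()
    for token, n in counts.items():
        merged[replace_nonascii(token)] += n
    return merged

counts = {
    b"\xb0\xa1\xb0\xa2": 3,   # two different 4-byte non-ASCII tokens ...
    b"\xc4\xe3\xba\xc3": 2,   # ... that collapse to the same "????"
    b"caf\xe9": 1,            # an accented Latin-1 word is altered too
    b"free": 7,               # plain ASCII is left untouched
}
merged = merge(counts)
print(dict(merged))
```

The accented word in the sample shows the drawback mentioned above: its 
last character is replaced as well, so it no longer matches the original 
spelling.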

These capabilities are in the current CVS version of bogofilter.  The 
documentation still needs to be updated.

David