Month Abbreviations as Stopwords

Thu Jan 9 03:35:16 CET 2003

At 09:17 PM 1/8/03, Graham Wilson wrote:

>On Wed, Jan 08, 2003 at 02:54:08PM -0500, Suzanne Skinner wrote:
> > I think it would be a good idea to add the month abbreviations Jan-Dec
> > (as found in mail headers) to the default stopwords in lexer.l. I
> > recently noticed that the scoring for these words was somewhat
> > lopsided here because of the way my spam intake has increased over the
> > past year.
>
>Would it really have a great effect? Can you run a non-spam message through
>bogofilter with all of the dates in the header changed to Feb instead of
>Jan? What kind of results do you get?

Suzanne,

Evidently, you've not been looking at the mime branch of development 
(where the new mime parsing code presently exists).  One of the changes 
has been to discard the old lexer.l and implement three new, smaller 
lexers.  The first is for message headers, the second for plain text 
messages (or plain text mime parts), and the third is for html messages (or 
html mime parts).

The purpose of the original stop list was to discard the keywords in html 
tags.  Consequently, the plain text lexer doesn't have a stop list any 
longer.  The html lexer now discards all text within tags, so it has no 
need for a stop list either.

To summarize, bogofilter is presently running with simple lexers and 
without stop lists.

A project that has come up from time to time is to implement an "ignore" 
list, i.e. a list of words that should be ignored when scoring 
messages.  The idea was to have the list be easily maintainable by a 
user.  Using a plain text list would allow maintenance with any old text 
editor.  If you're looking for a project, I can send you a partially 
completed version of an ignore list implementation :-)
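
To make the ignore list idea concrete, here is a minimal sketch in Python. It is not the partially completed implementation mentioned above and not bogofilter code. The function names, the one-word-per-line file format, the '#' comment convention, and the case-insensitive matching are all assumptions made for illustration.

```python
# Sketch only: names and file format are assumptions, not bogofilter's code.
import io


def load_ignore_list(lines):
    """Read one word per line; skip blank lines and '#' comment lines."""
    words = set()
    for line in lines:
        word = line.strip().lower()
        if word and not word.startswith("#"):
            words.add(word)
    return words


def filter_tokens(tokens, ignore):
    """Drop tokens found in the ignore list before they reach scoring."""
    return [t for t in tokens if t.lower() not in ignore]


# Hypothetical ignore file holding the month abbreviations from this thread.
IGNORE_TEXT = (
    "# month abbreviations\n"
    "jan\nfeb\nmar\napr\nmay\njun\njul\naug\nsep\noct\nnov\ndec\n"
)

ignore = load_ignore_list(io.StringIO(IGNORE_TEXT))
tokens = ["Thu", "Jan", "9", "2003", "bogofilter"]
kept = filter_tokens(tokens, ignore)
```

Because load_ignore_list accepts any iterable of lines, an open file handle for a user-maintained text file could be passed in place of the in-memory sample.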

David