Month Abbreviations as Stopwords
David Relson
relson at osagesoftware.com
Thu Jan 9 03:35:16 CET 2003
At 09:17 PM 1/8/03, Graham Wilson wrote:
>On Wed, Jan 08, 2003 at 02:54:08PM -0500, Suzanne Skinner wrote:
> > I think it would be a good idea to add the month abbreviations Jan-Dec
> > (as found in mail headers) to the default stopwords in lexer.l. I
> > recently noticed that the scoring for these words was somewhat
> > lopsided here because of the way my spam intake has increased over the
> > past year.
>
>would it really have a great effect? can you run a non-spam message through
>bogofilter with all of the dates in the header changed to Feb instead of
>Jan? what kind of results do you get?
Suzanne,
Evidently, you've not been looking at the mime branch of development
(where the new mime parsing code presently exists). Part of the change
has been to discard the old lexer.l and implement three new, smaller
lexers. The first is for message headers, the second for plain text
messages (or plain text mime parts), and the third is for html messages (or
html mime parts).
The purpose of the original stop list was to discard the keywords in html
tags. Consequently, the plain text lexer doesn't have a stop list any
longer. The current processing of html discards all text within tags, so
the html lexer has no stop list either.
To summarize, bogofilter is presently running with simple lexers and
without stop lists.
A project that has come up from time to time is to implement an "ignore"
list, i.e. a list of words that should be ignored when scoring
messages. The idea was to have the list be easily maintainable by a
user. Using a plain text list would allow maintenance with any old text
editor. If you're looking for a project, I can send you a partially
completed version of an ignore list implementation :-)
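
For illustration only, here is a minimal sketch of the ignore-list idea. It
is not the partially completed implementation mentioned above, and it is
written in Python rather than bogofilter's C to keep it short. The file
format (one word per line, '#' comments, case-insensitive matching) and the
names load_ignore_list and filter_tokens are assumptions made for this
example.

```python
import io


def load_ignore_list(lines):
    """Build a set of ignored words from an iterable of text lines.

    Assumed (hypothetical) format: one word per line; blank lines and
    lines starting with '#' are skipped; matching is case-insensitive.
    """
    words = set()
    for line in lines:
        word = line.strip()
        if word and not word.startswith("#"):
            words.add(word.lower())
    return words


def filter_tokens(tokens, ignore):
    """Return only the tokens that should still be scored."""
    return [t for t in tokens if t.lower() not in ignore]


# Example: month abbreviations as they appear in mail headers.
# A real list would live in a plain text file edited by the user;
# StringIO stands in for that file here.
sample = io.StringIO("# month abbreviations\nJan\nFeb\nMar\n")
ignore = load_ignore_list(sample)
kept = filter_tokens(["Date", "Thu", "Jan", "2003", "offer"], ignore)
```

With a list like this, the month abbreviations from the original suggestion
would simply never reach the scoring step, and the user could add or remove
words with any text editor.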
David