what happens if I discard tokens that occur only once?

David Relson relson at osagesoftware.com
Fri Jun 3 13:30:10 CEST 2005


On Thu, 2 Jun 2005 20:47:58 -0700
Chris Fortune wrote:

> bogoutil lets you discard tokens with <= a given number of occurrences from the database.
> 
> What's the effect of discarding tokens that occur only once?  My assumption is that this would make the wordlist much lighter
> without impacting the classifications much.  What are your findings?  Reasons?  FYI, my wordlist is 20,000 ham + 23,000 spam

Hi Chris,

Yes, hapaxes (tokens that appear only once) can be discarded.  How much
they matter depends on how you've built your wordlist.
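
To make the idea concrete, here is a minimal Python sketch of what
"discarding hapaxes" means.  It is only an illustration: the dict of
token -> (spam_count, ham_count) and the function name are assumptions
made for the example, not bogofilter's on-disk format or bogoutil's
code.

```python
def discard_rare_tokens(wordlist, max_count=1):
    """Return a copy of wordlist without tokens seen <= max_count times.

    wordlist maps token -> (spam_count, ham_count); this layout is a
    hypothetical stand-in for a real wordlist database.
    """
    return {
        token: (spam, ham)
        for token, (spam, ham) in wordlist.items()
        if spam + ham > max_count
    }


wordlist = {
    "viagra": (42, 0),
    "meeting": (1, 57),
    "xqzt17": (1, 0),    # hapax: seen once, in spam
    "regards": (0, 1),   # hapax: seen once, in ham
}

pruned = discard_rare_tokens(wordlist)
print(sorted(pruned))
```

With max_count=1 only tokens whose combined spam and ham count is at
least 2 survive, which is the "<= given number of occurrences" rule
from the question.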

From Oct 2002 into Jan 2004, _every_ incoming message went into my
wordlist.  The result was a large wordlist with lots of tokens.  The
list also grew rapidly with hundreds of new messages being registered
each day.

An observation I made was that most messages scored very high or very
low, i.e. were obviously ham or spam.  I added a thresh_update
parameter to bogofilter so that high and low scoring messages wouldn't
get registered.

In Jan 2004, I began using a thresh_update of 0.01, so that messages
scoring below 0.01 or above 0.99 are _not_ put into the wordlist.  The
growth rate of the list dropped dramatically, from hundreds of messages
a day to dozens (or fewer).  Accuracy didn't suffer in any noticeable
way.
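
The rule described above can be sketched in a few lines of Python.
This is a hedged illustration of the decision only; should_register is
a hypothetical helper, not a function in bogofilter.

```python
def should_register(score, thresh_update=0.01):
    """Decide whether a scored message should be added to the wordlist.

    Messages scoring below thresh_update (obvious ham) or above
    1 - thresh_update (obvious spam) are skipped; only the less
    certain ones in between are registered.
    """
    return thresh_update <= score <= 1.0 - thresh_update


for score in (0.001, 0.5, 0.999):
    action = "register" if should_register(score) else "skip"
    print(f"{score}: {action}")
```

The design choice is that a message the filter is already sure about
adds little new information, so skipping it slows wordlist growth.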

In Feb 2005, I discarded hapaxes and tokens over a year old.  My
wordlist dropped in size by roughly 90% (from 1,491,844 tokens to
111,145).  Again, I didn't notice a change in accuracy.  However, with
fewer tokens in the wordlist, fewer messages get very low/high scores,
so more of them are registered and the wordlist is growing more
rapidly.  In the 4 months since the discard, it has grown to 758,277
tokens -- a growth of 582%.
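
For anyone who wants to check the percentages, the arithmetic from the
token counts quoted above is shown below (variable names are just
labels for this example):

```python
before = 1_491_844            # tokens before the Feb 2005 discard
after_discard = 111_145       # tokens right after the discard
four_months_later = 758_277   # tokens about 4 months later

# Shrinkage relative to the original size.
shrink_pct = (before - after_discard) / before * 100

# Growth relative to the size right after the discard.
growth_pct = (four_months_later - after_discard) / after_discard * 100

print(f"shrink: {shrink_pct:.1f}%  growth: {growth_pct:.0f}%")
# prints: shrink: 92.5%  growth: 582%
```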

In summary, discarding hapaxes does have effects, but I've found them
to be tolerable.

HTH,

David


