what happens if I discard tokens that occur only once?
David Relson
relson at osagesoftware.com
Fri Jun 3 13:30:10 CEST 2005
On Thu, 2 Jun 2005 20:47:58 -0700
Chris Fortune wrote:
> bogoutil lets you discard tokens having <= a given number of occurrences from the database.
>
> What's the effect of discarding tokens that occur only once? My assumption is that this would make the wordlist much lighter
> without impacting the classifications much. What are your findings? Reasons? FYI, my wordlist is 20,000 ham + 23,000 spam
Hi Chris,
Discarding hapaxes (tokens that appear only once) can be done. Their
significance depends on how you've built your wordlist.
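
As a rough illustration of what "discarding hapaxes" means, here is a
minimal Python sketch. It is not bogoutil's code: the in-memory dict,
the (spam_count, ham_count) layout, the sample tokens and the function
name discard_rare_tokens are all assumptions made for the example.

```python
def discard_rare_tokens(wordlist, max_count=1):
    """Return a copy of wordlist without tokens seen max_count times or fewer.

    wordlist maps token -> (spam_count, ham_count); this layout is a
    hypothetical stand-in for the real database format.
    """
    return {
        token: (spam, ham)
        for token, (spam, ham) in wordlist.items()
        if spam + ham > max_count  # keep only tokens seen more than max_count times
    }


# Made-up sample data: two common tokens and two hapaxes.
wordlist = {
    "viagra": (120, 0),
    "meeting": (2, 310),
    "xqzt17": (1, 0),   # hapax, seen once in spam
    "osage": (0, 1),    # hapax, seen once in ham
}
pruned = discard_rare_tokens(wordlist)
print(sorted(pruned))
```

With max_count=1 only the hapaxes are dropped; a larger value would
discard more of the rarely seen tokens.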
From Oct 2002 into Jan 2004, _every_ incoming message went into my
wordlist. The result was a large wordlist with lots of tokens. The
list also grew rapidly with hundreds of new messages being registered
each day.
An observation I made was that most messages scored very high or very
low, i.e. were obviously ham or spam. I added a thresh_update
parameter to bogofilter so that high and low scoring messages wouldn't
get registered.
In Jan 2004, I began using a thresh_update value of 0.01, so that
messages scoring below 0.01 or above 0.99 are _not_ put into the
wordlist. The growth rate of the list dropped dramatically, from
hundreds a day to dozens (or fewer). Accuracy didn't suffer in any
noticeable way.
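
The rule described above can be sketched in a few lines of Python. This
is only a model of the decision as stated in this message, not
bogofilter's implementation; the function name should_register is
hypothetical, and the handling of scores exactly on the boundary is an
assumption.

```python
def should_register(score, thresh_update=0.01):
    """Decide whether a scored message gets added to the wordlist.

    Messages scoring below thresh_update or above 1 - thresh_update are
    treated as "obvious" and skipped. Boundary handling is an assumption.
    """
    return thresh_update <= score <= 1.0 - thresh_update


# Made-up scores: obvious ham, an unsure message, obvious spam.
for score in (0.0004, 0.37, 0.9998):
    action = "register" if should_register(score) else "skip"
    print(f"{score:.4f} -> {action}")
```

Only the middle message would be registered, which is why the growth
rate of the list falls when most mail scores near 0 or 1.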
In Feb 2005, I discarded hapaxes and tokens over a year old. My
wordlist dropped in size by roughly 90% (from 1,491,844 tokens to
111,145). Again, I didn't notice a change in accuracy. However, with
fewer tokens in the wordlist, fewer messages get very low or very high
scores, so more of them fall inside the update threshold and get
registered, and the wordlist is growing more rapidly. In the 4 months
since the discard, it has grown to 758,277 tokens -- a growth of 582%.
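
For anyone wanting to check the percentages, the arithmetic can be
reproduced from the three token counts quoted above (the variable names
are just labels chosen for this example):

```python
before_discard = 1_491_844   # tokens before the Feb 2005 discard
after_discard = 111_145      # tokens right after the discard
four_months_later = 758_277  # tokens about 4 months later

# Share of the wordlist removed by the discard.
drop_pct = (before_discard - after_discard) / before_discard * 100

# Growth relative to the post-discard size.
growth_pct = (four_months_later - after_discard) / after_discard * 100

print(f"drop:   {drop_pct:.1f}%")
print(f"growth: {growth_pct:.0f}%")
```

The drop works out to a little over 92%, consistent with "roughly 90%",
and the growth rounds to the 582% quoted.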
In summary, discarding hapaxes does have effects, but I've found them
to be tolerable.
HTH,
David