Database Size versus Shannon's Word Entropy

Matthias Andree matthias.andree at gmx.de
Wed Oct 25 20:54:18 CEST 2017


On 25.10.2017 at 15:58, Rick van Rein wrote:
> Hi Matthias,
>
> You mentioned that the write volume would increase when last-read was
> recorded; but in update mode you would write anyway.
That's true, but ISTR this only happens on a clear rating, not an unsure.

> Also, are you aware that statistics has an elegant method of gradually
> and continuously forgetting things?
> http://mathworld.wolfram.com/ExponentialDistribution.html
> It's like specifying the count and date all in one float :)

Do you mean regular exponential decay, of the
f(t) = a0 * exp(-\lambda * t) kind?
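For reference, a minimal sketch of that decay formula in Python. The
function names and the half-life parametrisation are illustrative
assumptions only; nothing like this exists in bogofilter.

```python
import math

def half_life_to_lambda(half_life):
    """Decay constant lambda for a given half-life (same time unit as t)."""
    return math.log(2) / half_life

def decayed_value(a0, lam, t):
    """f(t) = a0 * exp(-lambda * t): what is left of a0 after time t."""
    return a0 * math.exp(-lam * t)

# Hypothetical numbers: a count of 8, forgotten with a 30-day half-life.
lam = half_life_to_lambda(30.0)
after_one_half_life = decayed_value(8.0, lam, 30.0)   # about 4
after_two_half_lives = decayed_value(8.0, lam, 60.0)  # about 2
```

With a scheme like this, a stored count shrinks smoothly with age instead
of being dropped at a fixed cutoff date.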

I am not too well acquainted with the design of the algorithms, but I
wonder how it helps to decay individual token probabilities: the token
counts are not absolute, they have to be seen in relation to the
.MSG_COUNT counts, see src/prob.c.
So my naive guess is that removing a token is less harmful than mucking
around with its probabilities.
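The concern above can be illustrated with a small sketch. It assumes,
purely for illustration, that what matters is the ratio of a token's
count to the message count; it is not the computation in src/prob.c, and
all names and numbers are hypothetical.

```python
import math

def decay(count, lam, dt):
    """Exponentially decay a count over an elapsed time dt."""
    return count * math.exp(-lam * dt)

def relative_frequency(token_count, msg_count):
    """Token count seen in relation to the number of messages."""
    return token_count / msg_count if msg_count else 0.0

# Hypothetical numbers: token seen in 40 of 1000 messages, 30-day half-life.
token_count, msg_count = 40.0, 1000.0
lam, dt = math.log(2) / 30.0, 30.0

before = relative_frequency(token_count, msg_count)

# Decaying only the token count halves its relative frequency.
token_only = relative_frequency(decay(token_count, lam, dt), msg_count)

# Decaying token and message counts alike leaves the ratio unchanged.
both = relative_frequency(decay(token_count, lam, dt),
                          decay(msg_count, lam, dt))
```

Under that assumption, decaying a single token shifts its weight relative
to every token that was not decayed, whereas decaying everything
uniformly changes nothing in the ratios.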

>> What goal are you trying to achieve by
>> receiver-extension specific filtering?
> In terms of user facilitation:
> Grouping related activities together, protecting privacy by keeping them
> separate and having independent ACLs on each.
>
> In terms of Bogofilter delivering to multiple recipients:
> Use of the term scoring to figure out what alias would be the most
> likely recipient for a message.  So, not spamfiltering, but
> subclassification of the content on the non-spam side.

Bogofilter isn't designed to do that. It does that three-state thing,
spam/ham/dunno. You'd need to do a thorough code audit to figure out
whether it can be made to classify into more than the two extreme
(spam/ham) bins and the third "unsure" bin, and how much effort that
would be. I don't think the current database design is amenable to a
more-than-two classification. The spam and ham dichotomy is hardwired in
several places, so it's not trivial to change.
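To make the three-state idea concrete, here is a rough sketch of binning
a single score with two cutoffs. The function name and the threshold
values are placeholders for illustration, not bogofilter's code or its
defaults.

```python
def classify(spamicity, ham_cutoff=0.45, spam_cutoff=0.99):
    """Map one score in [0, 1] to spam, ham or unsure.

    The cutoff values are illustrative placeholders.
    """
    if spamicity >= spam_cutoff:
        return "spam"
    if spamicity <= ham_cutoff:
        return "ham"
    return "unsure"

results = [classify(s) for s in (0.02, 0.70, 0.999)]
```

A single scalar score split by two cutoffs yields exactly these three
bins; picking the most likely of several aliases would need one score per
alias, which is a different shape of problem.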



