The idea of Bogosort and its mathematics

Rick van Rein rick at openfortress.nl
Mon Jul 26 14:15:43 CEST 2021


Hi,

I have a long-standing wish to extend the logic of Bogofilter to make
act like, what I'll call, Bogosort -- sort mail into a topical mailbox.
 There would have to be a multi-way comparison for this to work.  Spam
filtering is then a special use-case of this idea.

So I jumped into the mathematics behind Bogofilter in the hope to
generalise it.  I think I derived calculations that are both simpler and
more general, starting from Paul Graham's nameless pw definition and
Gary Robinson's P and Q values.

AFAIK the desire is that S > 0 and so that P > Q.  There is a lot of
work done on both sides that can be symmetrically removed to speed up
the computations.

In terms of probabilities, I got at

log(  p1) + log(  p2) +...+ log(  pn) >
log(1-p1) + log(1-p2) +...+ log(1-pn)

In terms of pw = B / (B + G), this further reduces to

log(B1) + log(B2) +...+ log(Bn) >
log(G1) + log(G2) +...+ log(Gn)

Alternatively, in terms of Paul Graham's variables this reduces to

log(b1) +...+ log(bn) - log(nbad1 ) -...- log(nbadn ) >
log(g1) +...+ log(gn) - log(ngood1) -...- log(ngoodn)

Details and derivations: https://gitlab.com/arpa2/bogosort/-/issues/2


I'm using logarithms here, thus reducing multiplication and division to
addition and subtraction.  Which form works best is an implementation
choice.  Storing logarithms of word counts is a trade-off between faster
filter operations and slower datastore updates (though that could be
reduced during batch updates).


These forms all say "bad exceeds good" in one way or another.  All the
bad is on one side and all good is on the other.  This means that my
wish to generalise to more than two "topics" is possible.  A message
could be compared to multiple criteria, and the maximum would be the
topic/mailbox for which the message is delivered.


I'm interested to hear if this generalisation to Bogosort is considered
useful by others as well.  The logarithms are not necessarily part of
that logic, but the reduction of filtering speed is more present there.


Cheers,
 -Rick


More information about the bogofilter-dev mailing list