Database Size versus Shannon's Word Entropy

Rick van Rein rick at openfortress.nl
Tue Oct 24 22:59:20 CEST 2017


Hi,

> Bogoutil also allows the user to filter out seen-only-once tokens
> (lower-case -c option -- by age, see the -a option).

Thanks for pointing that out!

> Not sure if it
> really matters much: 44 MB seems small enough these days (it sure wasn't
> when I built my first Linux PC on DX4 basis in the late 1990s).

I started on a ZX Spectrum and have always thought 48 kB was a whole lot :)

But my reason for wondering about database size is that I am also
thinking about splitting the databases up further, for instance a
separate spam filter per alias, such as rick+bboy at example.com, that
covers one area of interest for the mail user.  Or per IMAP subfolder.

Bogofilter is likely to be useful to sort email into the right alias
and/or IMAP sub-folder (including ones for Spam and Unsure).  But that
would not allow for the light-weight alias [0] support that we're after
for our IdentityHub project [1].  That got me thinking about, and
playing with, the database size.

I did notice that the number of cases is a #define now set to 2, so a
more practical approach to this alias sorting idea could be to simply
have larger records with counters for each alias.  [But that would make
it a -dev discussion I suppose.]
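
To illustrate the "larger records" idea, here is a minimal sketch in
Python.  It is not bogofilter code and does not use bogofilter's scoring
method; the class name TokenStore, the alias names and the simple
add-one-smoothed log-likelihood score are all assumptions made up for
this example.  It only shows one record per token that holds one counter
per alias, instead of a separate database per alias.

```python
import math
from collections import defaultdict


class TokenStore:
    """Hypothetical token database: one record per token, one counter per class."""

    def __init__(self, classes):
        self.classes = list(classes)
        # token -> list of counters, one slot per class (the "larger record")
        self.counts = defaultdict(lambda: [0] * len(self.classes))
        # total number of tokens registered per class
        self.totals = [0] * len(self.classes)

    def register(self, tokens, cls):
        """Count the tokens of one message as belonging to class `cls`."""
        i = self.classes.index(cls)
        for token in tokens:
            self.counts[token][i] += 1
            self.totals[i] += 1

    def best_class(self, tokens):
        """Return the class with the highest add-one-smoothed log-likelihood."""
        vocab = len(self.counts) or 1
        scores = []
        for i in range(len(self.classes)):
            score = 0.0
            for token in tokens:
                # Avoid creating records for unseen tokens while scoring.
                seen = self.counts[token][i] if token in self.counts else 0
                score += math.log((seen + 1) / (self.totals[i] + vocab))
            scores.append(score)
        return self.classes[scores.index(max(scores))]


# Hypothetical aliases and training messages, for illustration only.
store = TokenStore(["spam", "ham", "rick+bboy"])
store.register(["cheap", "pills", "offer"], "spam")
store.register(["meeting", "agenda", "tomorrow"], "ham")
store.register(["breakdance", "battle", "crew"], "rick+bboy")
print(store.best_class(["breakdance", "crew"]))
```

The point of the layout is that a token's text is stored once, however
many aliases there are, so only the counters grow with the number of
aliases.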


Thanks,
 -Rick


[0] http://internetwide.org/blog/2015/04/23/id-3-idforms.html
[1] http://internetwide.org/blog/2016/06/24/iwo-phases.html

