Database Size versus Shannon's Word Entropy

Matthias Andree matthias.andree at gmx.de
Tue Oct 24 23:14:18 CEST 2017


On 24.10.2017 at 22:59, Rick van Rein wrote:

> But my reason for wondering about database size is that I am also
> thinking about splitting them over users, such as a separate spam filter
> for aliases like rick+bboy at example.com that cover an area of interest
> for the mail user.  Or IMAP subfolders.

The thing is, spam doesn't care much about the recipient. Somewhat
targeted spam sent through lists might, of course, but on the whole I
think it hinges on the scale effect of sending millions of messages dirt
cheap more than on anything else. What goal are you trying to achieve
with receiver-extension-specific filtering? I think the false negative
rate (spam getting through) will be more or less the same, or perhaps
lower if you share the database; about the false positive rate I can't
say. It might help if one of the folders is for newsletters that are
close to what spammers imitate, but on the whole I have never bothered.
Per user, yes, because that's easier to set up; per mailbox extension, no.

> Bogofilter is likely to be useful to sort email into the right alias
> and/or IMAP sub-folder (including ones for Spam and Unsure).  But that
> would not allow for the light-weight alias [0] support that we're after
> for our IdentityHub project [1].  That got me thinking / playing about
> the database size.
>
> I did notice that the number of cases is a #define now set to 2, so a
> more practical approach to this alias sorting idea could be to simply
> have larger records with counters for each alias.  [But that would make
> it a -dev discussion I suppose.]

I have not looked at your IdentityHub materials, so I am only commenting
on what I believe I am reading between the lines:
I wonder whether classifiers other than bogofilter support more than the
spam/ham + unsure targets.

CRM114 used to be a candidate; I am not sure whether it does that or
whether it's still maintained.
I don't want to turn you away, I just have a vague feeling that there
might be something more suitable than bogofilter, with its
black-white-and-unsure approach, as a starting point for multi-valued
sorting. I am trying to spare you from working within bogofilter's
limited, Turing-machine-like expressiveness when you need something more
complex, i.e. something that has more bins where your sifted and sorted
inputs are to end up.
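
To make the "more bins" idea concrete, here is a minimal sketch in Python
of a token database whose records hold one counter per bin, along the
lines of the "larger records with counters for each alias" suggestion
quoted above. It is not bogofilter code and does not reflect bogofilter's
scoring; the class name MultiBinClassifier, the bin names and the plain
naive Bayes scoring with add-one smoothing are all assumptions chosen for
illustration.

```python
import math
from collections import defaultdict


class MultiBinClassifier:
    """Toy multinomial naive Bayes with one counter per bin in each token record.

    Hypothetical illustration only; unrelated to bogofilter's internals.
    """

    def __init__(self, bins):
        self.bins = list(bins)
        # token -> {bin: count}; this is the "larger record" per token
        self.token_counts = defaultdict(lambda: dict.fromkeys(self.bins, 0))
        self.total_tokens = dict.fromkeys(self.bins, 0)
        self.messages = dict.fromkeys(self.bins, 0)

    def train(self, tokens, bin_name):
        """Register one message (a list of tokens) as belonging to bin_name."""
        self.messages[bin_name] += 1
        for tok in tokens:
            self.token_counts[tok][bin_name] += 1
            self.total_tokens[bin_name] += 1

    def classify(self, tokens):
        """Return the bin with the highest smoothed log-probability."""
        vocab = len(self.token_counts) or 1
        n_msgs = sum(self.messages.values())
        scores = {}
        for b in self.bins:
            # Prior for the bin, with add-one smoothing
            score = math.log((self.messages[b] + 1) / (n_msgs + len(self.bins)))
            for tok in tokens:
                record = self.token_counts.get(tok)  # no insert on lookup
                count = record[b] if record else 0
                score += math.log((count + 1) / (self.total_tokens[b] + vocab))
            scores[b] = score
        return max(scores, key=scores.get)


clf = MultiBinClassifier(["spam", "lists", "personal"])
clf.train("cheap pills buy now".split(), "spam")
clf.train("patch review bogofilter release".split(), "lists")
clf.train("dinner saturday family".split(), "personal")
print(clf.classify("buy cheap pills".split()))
```

Each record grows linearly with the number of bins, so the database size
scales with the number of aliases as well as with the vocabulary.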

> [0] http://internetwide.org/blog/2015/04/23/id-3-idforms.html
> [1] http://internetwide.org/blog/2016/06/24/iwo-phases.html




