Reducing wordlist size by ignoring DKIM headers

Tomaž Šolc tomaz.solc at tablix.org
Sat Apr 10 11:41:21 CEST 2021


Hi,

I recently did some experiments in an attempt to reduce the bogofilter
wordlist size by ignoring DKIM signatures. I found that patching the
lexer to discard DKIM-related message headers reduces the wordlist size
by 10% after training without affecting the false classification rate.
I'm sharing my findings in case anyone else here finds them useful.


A typical DKIM signature is a folded (multi-line) header that looks
something like this:

DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
    d=gmail.com; s=20161025;
    h=mime-version:reply-to:from:date:message-id:subject:to;
    bh=AVIHm+mpxD8KWByKiGebyHMjh6gMB43Tym35y4dl0fE=;
    b=I2zbIJxf7VFzrClDMY3suygw2BBFhzCzYPBiab2nbIgsvhMr9xoBoCvu7PXz5dIOOx
    dE3Cr+yaivIdqgxsHXVSo1KsIts0Zj5Knpte7wmjn44ooRU7jm/vqb31AH4fo7/vkdQG
    omNcwZ1cSXyjVkmRRRzMageel7q6Ow93VzBkH4yu0D9YsZDZKXE9A4ctVygKDmDixE3q
    rSjDOGg97kWRvyOjIhZ4TkOy1Wc2RXy95jDgGcZV1HYoloNlNDtGPXoq5GvtcPJu0QM5
    i7e+fHhalyNGr5GJoFo8IQSTJwFlKYbSi1U9HpenG4fAwdxBqjkb10nwhb/mZJShhnSR
    TZLw==

bogofilter splits this header up into tokens. Because the base64-encoded
signature mixes punctuation with alphanumeric characters, many of the
resulting tokens end up with a plausible length, so the maximum token
length limit isn't very effective at filtering them out.

"head:i7e"
"head:dE3Cr"
"head:mZJShhnSR"
"head:vkdQG"
...

Since the signature contains a hash of the message content, these tokens
are effectively unique per message and aren't useful for mail
classification. They just bloat the wordlist, so it seems it would be
beneficial to ignore them.
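As a rough illustration (this is not bogofilter's actual lexer rule, just
an approximation that splits alphanumeric runs at punctuation, with a
hypothetical 30-character length cutoff), one continuation line of the
b= value above breaks up like this:

```python
import re

# One continuation line of the b= value from the example signature above.
line = "dE3Cr+yaivIdqgxsHXVSo1KsIts0Zj5Knpte7wmjn44ooRU7jm/vqb31AH4fo7/vkdQG"

# Rough approximation: tokens are alphanumeric runs split at punctuation.
tokens = re.findall(r"[A-Za-z0-9]+", line)

# A length cutoff (30 here is an assumption for illustration) only
# removes the one long run; the shorter fragments all survive.
survivors = [t for t in tokens if len(t) <= 30]
print(survivors)   # ['dE3Cr', 'vqb31AH4fo7', 'vkdQG']
```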


I've made a short patch for lexer_v3.l that ignores the tokens in the
DKIM-Signature header (as well as some related headers that contain
similar signatures).

The attached patch applies to bogofilter release 1.2.5.

Note on the patch: When exiting the IDISCARD condition I'm yy_unputting
an extra '\n'. This seems to be necessary for subsequent rules starting
with '^' to match on the unputted text. I'm not sure if this is the
right way to do it, but I couldn't make it work otherwise. It also means
I'm unputting one more character than I received in yytext in that
action. I'm not completely sure this is safe to do.
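Outside the lexer, the same idea can be sketched as a preprocessing step
that drops a signature header together with its folded continuation
lines before tokenization. The header list below is a hypothetical
stand-in; the actual patch works inside lexer_v3.l and covers
DKIM-Signature plus some related headers:

```python
# Hypothetical header prefixes to drop; the real patch's list may differ.
IGNORED = ("dkim-signature:",)

def strip_signature_headers(raw_headers: str) -> str:
    out, skipping = [], False
    for line in raw_headers.splitlines():
        if line[:1] in (" ", "\t"):
            # Folded continuation line: belongs to the previous header.
            if skipping:
                continue
        else:
            skipping = line.lower().startswith(IGNORED)
            if skipping:
                continue
        out.append(line)
    return "\n".join(out)

headers = (
    "From: alice@example.com\n"
    "DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;\n"
    "    bh=AVIHm+mpxD8KWByKiGebyHMjh6gMB43Tym35y4dl0fE=;\n"
    "Subject: hello\n"
)
print(strip_signature_headers(headers))
# From: alice@example.com
# Subject: hello
```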


I've tested the patch on a dataset of 23300 messages collected between
Jan 2020 and Jan 2021. 50% of the messages were used as a training set
and 50% as an evaluation set.

wordlist.db with unpatched bogofilter 1.2.5:

17752064 bytes
318992 tokens (according to bogoutil -d)

after patch:

15794176 bytes (approx. 11% reduction)
287658 tokens (approx. 10% reduction)
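For reference, the percentages follow directly from the measured sizes:

```python
# Reductions computed from the wordlist measurements above.
before_bytes, after_bytes = 17752064, 15794176
before_tokens, after_tokens = 318992, 287658

byte_reduction = 1 - after_bytes / before_bytes      # ~0.110
token_reduction = 1 - after_tokens / before_tokens   # ~0.098
print(f"{byte_reduction:.1%} bytes, {token_reduction:.1%} tokens")
```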

Classification performance seems unaffected. I used spam_cutoff=0.90.

unpatched bogofilter 1.2.5:
    spam fcr: 1.55 %
    ham  fcr: 0.16 %

after patch:
    spam fcr: 1.14 %
    ham  fcr: 0.00 %

fcr = false classification rate. After patching, 1.14% of the spam
messages in the dataset were misclassified as ham. This is slightly
better than the 1.55% with unpatched bogofilter, but the difference
seems to be within experimental error. The "ground truth" classification
in my dataset isn't perfect either.
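To spell out the definition: the rate is just misclassified messages
divided by the class total. The counts below are hypothetical round
numbers for illustration only (chosen to reproduce the unpatched rates;
the post reports rates, not the underlying counts):

```python
# fcr = misclassified messages / total messages of that class.
# Hypothetical counts, for illustration only.
spam_total, spam_as_ham = 2000, 31    # spam misclassified as ham
ham_total, ham_as_spam = 9650, 15     # ham misclassified as spam

print(f"spam fcr: {spam_as_ham / spam_total:.2%}")  # 1.55%
print(f"ham  fcr: {ham_as_spam / ham_total:.2%}")   # 0.16%
```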


The wordlist on my small mail server is growing at approx. 50 MB/year.
At least in my case, I think a 5 MB reduction isn't worth further
complicating the bogofilter lexer. I already found it hard to understand
and modify, and I know it has been a source of memory bugs in the past.
Hence I don't think it would make much sense to include my patch in
bogofilter. Still, I thought it was interesting to see how much of an
effect the DKIM headers have on wordlist size.

Best regards
Tomaž
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ignore_dkim2.patch
Type: text/x-patch
Size: 1497 bytes
Desc: not available
URL: <https://www.bogofilter.org/pipermail/bogofilter/attachments/20210410/d701899b/attachment.bin>
