Reducing wordlist size by ignoring DKIM headers

Sat Apr 10 14:59:28 CEST 2021

On 10. 04. 21 13:48, RW via bogofilter wrote:
> Most statistical spam filters do support ignoring user specified
> headers. I'd like to see BF support this too. Many of us simply pipe
> emails through BF and then use the x-bogosity header. And personally I
> don't want any header to be *permanently* stripped by a separate filter.

The current lexer already filters out some headers and does some
special processing for others. It does quite a bit more magic than I
expected (like recognizing dollar amounts!). There's a big comment at
the top about how the goal is to recognize various unique strings so
that they don't bloat the wordlist. What I did wasn't terribly out of
line considering the existing bogofilter code.

I agree that having a user-configurable list of headers to ignore would
be a useful feature in bogofilter. However, header handling is currently
hard-coded and can only be changed at compile time. I was surprised by
this when I first saw it. Bogofilter doesn't parse out the headers,
body, etc. into data structures and then process those. It does
everything in one pass (headers, MIME decoding, HTML cleanup, text
tokenization, etc.), all using a single flex lexer.

I'm guessing this is faster, but I also found it hard to understand and
modify. Most of the time I spent making this patch was figuring out how
to correctly account for the fact that DKIM headers typically span
multiple lines.
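
To illustrate the multi-line issue outside of bogofilter, here is a
minimal Python sketch of skipping a folded header. It is not the patch
and not how the flex lexer does it. The function name strip_headers and
its "ignored" parameter are made up for this example. It assumes the
usual folding convention, where a header line that starts with a space
or tab continues the previous header.

```python
def strip_headers(raw_message, ignored=("dkim-signature",)):
    """Return raw_message without the named headers (hypothetical helper).

    Continuation lines (starting with space or tab) are dropped together
    with the header they belong to. The body is left untouched.
    """
    ignored = {name.lower() for name in ignored}
    out = []
    in_headers = True   # True until the first blank line
    skipping = False    # True while inside an ignored header

    for line in raw_message.splitlines(keepends=True):
        if in_headers:
            if line in ("\n", "\r\n"):
                # A blank line ends the header block.
                in_headers = False
                skipping = False
            elif line[0] in " \t":
                # Folded continuation: shares the fate of its header.
                if skipping:
                    continue
            else:
                # A new header starts here; decide whether to skip it.
                name = line.split(":", 1)[0].strip().lower()
                skipping = name in ignored
                if skipping:
                    continue
        out.append(line)
    return "".join(out)
```

The point is the "skipping" flag: the decision made on the first line
of a header has to be carried over to every continuation line, and a
lookalike line in the body must not trigger it.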

With my current understanding of the lexer code, making it possible to
configure header handling at run-time looks like it would require a
significant rewrite. As I mentioned in my original mail, I don't think
it's worth it. Maybe wordlist size is problematic in large bogofilter
installations, but for a small server on reasonably modern hardware,
the bloat from these headers really doesn't look significant.

Best regards
Tomaž