bogofilter's handling of attached PDF files

Tue Jan 7 17:36:37 CET 2025

Rob,

to keep the quote nesting from becoming unwieldy, given your
unhelpful text-above-quote style and frequent followups to your own
messages, I'll just pick a late message in the thread and respond.

- In the past, we've usually seen that messages contained "enough"
material from headers (including those of the attachment parts) to give
you *some* tokens. You can use bogolexer on the entire message (save its
full source, then run bogolexer on it) to see what tokens would be part
of bogofilter's estimation and/or registration.

- Interpreting binary attachments (PDF, office formats, or images) and
extracting text from them requires *robust* software that does not
itself pose a security risk or a massive performance penalty. It must
also take readability into account, i.e. NOT extract text that is
(nearly) invisible because it uses the same (or nearly the same) colour
as its rendering background, lies outside the page display regions, and
so on. Spammers have used tricks such as white-text-on-white-paper in
HTML and other formats to add a lot of innocent content to their
malicious mailings so as to evade statistical filtering.

- Careful selection of what to train on (good or bad messages alike) can
help make training more effective; I myself haven't really used
bogofilter's "-u" self-reinforcing option because it really bloats the
"wordlist" database and requires *exhaustive* correction of all
misclassifications.

- So, bogofilter makes little effort to extract text from attachments,
and only interprets a subset of text/* and message/* MIME attachment forms.
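
To make the points about headers and MIME types a bit more concrete,
here is a minimal Python sketch. It is NOT bogofilter's parser (that is
written in C and behaves differently in detail); it only uses Python's
standard email module to show, for a made-up message with a PDF
attachment, which parts are text/* or message/* and which would
contribute little beyond their part headers (content type, file name).
The addresses, the file name, the function names and the "body" /
"headers only" labels are all assumptions made for illustration.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_sample() -> bytes:
    """Build a made-up message: a plain-text body plus a fake PDF attachment."""
    msg = EmailMessage()
    msg["From"] = "sender@example.org"      # placeholder address
    msg["To"] = "rob@example.org"           # placeholder address
    msg["Subject"] = "Invoice attached"
    msg.set_content("Please see the attached invoice.\n")
    msg.add_attachment(
        b"%PDF-1.4 placeholder bytes",      # not a real PDF, just sample data
        maintype="application",
        subtype="pdf",
        filename="invoice-2025.pdf",        # hypothetical file name
    )
    return msg.as_bytes()


def classify_parts(raw: bytes):
    """Return (content type, file name, label) for each non-container part.

    Label "body" marks text/* and message/* parts, whose content a text
    tokenizer could look at; "headers only" marks everything else, where
    only the part headers offer readable material.
    """
    msg = message_from_bytes(raw, policy=policy.default)
    result = []
    for part in msg.walk():
        maintype = part.get_content_maintype()
        if maintype == "multipart":
            continue  # pure container, its children are visited by walk()
        label = "body" if maintype in ("text", "message") else "headers only"
        result.append((part.get_content_type(), part.get_filename(), label))
    return result


if __name__ == "__main__":
    for content_type, filename, label in classify_parts(build_sample()):
        print(content_type, filename, label)
```

To see what bogofilter itself makes of a given message, running
bogolexer on the saved full source remains the authoritative check; the
sketch above only illustrates the general idea.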

Regards,
Matthias