bogofilter's handling of attached PDF files

Tue Jan 7 20:24:14 CET 2025

------ Original Message ------
From "Matthias Andree via bogofilter" <bogofilter at bogofilter.org>
To bogofilter at bogofilter.org
Date 1/7/2025 11:36:37 AM
Subject Re: bogofilter's handling of attached PDF files

>Rob,
>
>to avoid quoting nesting from becoming really inconcise with your
>unhelpful text-above-quote style and frequent followups to your own
>message, I'll just pick a late message in the thread and respond.
>
>- In the past, we've usually seen that messages contained "enough"
>material from headers (including those of the attachment parts) to give
>you *some* tokens. You can use bogolexer on the entire message (save its
>full source, then run bogolexer on it) to see what tokens would be part
>of bogofilter's estimation and/or registration.
>
>- Interpreting binary attachments (PDF or office formats or images) and
>extracting text from them requires *robust* software that does not
>itself pose a security risk or a massive performance penalty; and it
>also must make sure to take readability into account - i. e. NOT extract
>text that is (near) invisible by using the same (or very near) colours
>versus its rendering background, are outside page display regions and
>what not. Spammers used such tricks as white-text-on-white-paper in HTML
>and other formats to add a lot of innocent content to their malicious
>mailings so as to evade statistical filtering.
>
>- Careful selection of what to train on (either good or bad message) can
>help making training more effective; I myself haven't really used
>bogofilter's "-u" self-reinforcing option because that really bloats the
>"wordlist" database and requires *exhaustive* correction of all false
>classification.
>
>- So, bogofilter makes little effort to extract text from attachments,
>and only interprets a subset of text/* and message/* MIME attachment forms.
>
>Regards,
>Matthias

Matthias,

I've already been following your advice about making sure the training 
is high quality. I've built up a collection of ham/spam that's into the
6 figures large - AND - which was hand-trained. Nothing gets in there
unless (a) it's hand-copied in there after further human analysis, and
(b) the message wasn't already scoring above/below a certain
threshold (to prevent overtraining on things not needed); and (c) we also
do extensive analysis/rechecking to ensure quality - constantly hunting
for misclassified items. Literally thousands of hours of work went into
this over the past few years. I think I might have the highest-quality,
lowest-misclassified large-ish collection of ham/spam emails for Bayesian
training - in the world.

But when these emails have an attached PDF that contains an image and no
text - and such an email has a subject line like "invoice" and very
little text in the body of the email - and they're being sent from
large email systems that send much other legit email (gmail, outlook
etc) - then there's just not much for Bogofilter to "grab onto" - and
that has pushed my Bogofilter into somewhat of a "trained to
exhaustion" situation for these types of emails. I think that "training
to exhaustion" situation (for these types of emails) is having the
unfortunate side effect of causing some weird occasional bad Bogofilter
scores on various OTHER emails. (But note that I'm NOT putting
multiple copies of the SAME emails in that training - these are
DIFFERENT emails that together are causing this "training to
exhaustion")

Unfortunately, as I've been exploring this more - the ideas I presented
on this thread are probably not going to work - because the signal to
noise ratio in that PDF metadata is not very good, and often the
"smoking gun" for these series of PDFs is a sequence of items, not the
items themselves. That works better for hand-created anti-spam rules,
and doesn't work as well for Bayesian filtering. So I could probably
develop some custom rules that would help me to surgically target MUCH
of this spam with other rules - then remove it from Bogofilter
training/targeting altogether - and that might then help Bogofilter to
have superior results when MORE of these TYPES of spams are removed
from my Bogofilter training collection.

HOWEVER - I DO HAVE SUGGESTIONS/QUESTIONS FOR BOGO:

(1) Occasionally, these binary files will be EXACT copies of other ones
that the spammer is using in many other emails - including the base64
encoding being line-for-line identical. But yet they'll be named
differently in the attachment name. So here is a suggestion. Without
even having to render the binary file - would it be possible for
Bogofilter to do something like generating a SHA-1 hash (or MD5 or
similar hash) based on the combined base64 lines of an attachment, and
then add that hash string to the Bogodata, as if it were a word in the
email - then factor that into Bogo scoring/training? Then if/when such
a hashed string appeared in both ham/spam, the two would cancel each other
out. But if it keeps re-occurring in spam, without being seen in legit
emails - it could then help achieve a higher score for that spam - which
would then help to prevent AS MUCH overtraining of those types, to the
point where they're causing other problems (for the reasons I mentioned
earlier)
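
To make suggestion (1) concrete, here is a rough sketch in Python of the
hashing step, done as pre-processing OUTSIDE of Bogofilter. This is not an
existing Bogofilter feature. The function name attachment_hash_tokens and
the "atthash-" token prefix are made up for illustration, and skipping
text/* parts is just an assumption about what counts as a "binary"
attachment. It joins the base64 lines of each such part (so line wrapping
doesn't matter) and hashes the result, ignoring the attachment name.

```python
import hashlib
from email import message_from_bytes
from typing import List


def attachment_hash_tokens(raw_message: bytes) -> List[str]:
    """Return one pseudo-token per base64-encoded, non-text MIME part.

    Hypothetical helper: the "atthash-" prefix is an arbitrary choice.
    """
    msg = message_from_bytes(raw_message)
    tokens = []
    for part in msg.walk():
        # Skip containers and text parts; only binary-style parts are hashed.
        if part.is_multipart() or part.get_content_maintype() == "text":
            continue
        encoding = (part.get("Content-Transfer-Encoding") or "").strip().lower()
        if encoding != "base64":
            continue
        payload = part.get_payload()  # the base64 text as sent, NOT decoded
        if not isinstance(payload, str):
            continue
        # Combine the base64 lines so differences in wrapping are ignored.
        joined = "".join(payload.split())
        digest = hashlib.sha1(joined.encode("ascii", "ignore")).hexdigest()
        tokens.append("atthash-" + digest)
    return tokens
```

Two emails carrying the byte-identical attachment under different file
names would then yield the same token. How such a token would best be fed
into Bogofilter's wordlist (for example via an added header line before
the message is scored) is an open question that I have not tested.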

(2) What is this "-u, --update-as-scored" feature you mentioned? I
searched the Internet for that, but I'm not finding any detailed
explanations, and I couldn't find much here:
https://www.google.com/search?q=%22bogofilter%22+%22update-as-scored%22
I'd love to know more about this feature. Any suggestions? Link?

Thanks again for all that you do - Bogofilter is excellent software - I
just think this is a loophole that spammers have exploited more often in
recent years - and I think some improvements in Bogofilter would help
with this.

Rob McEwen, invaluement