bogofilter's handling of attached PDF files
Rob McEwen
rob at invaluement.com
Tue Jan 7 20:24:14 CET 2025
------ Original Message ------
From "Matthias Andree via bogofilter" <bogofilter at bogofilter.org>
To bogofilter at bogofilter.org
Date 1/7/2025 11:36:37 AM
Subject Re: bogofilter's handling of attached PDF files
>Rob,
>
>to avoid quoting nesting from becoming really inconcise with your
>unhelpful text-above-quote style and frequent followups to your own
>message, I'll just pick a late message in the thread and respond.
>
>- In the past, we've usually seen that messages contained "enough"
>material from headers (including those of the attachment parts) to give
>you *some* tokens. You can use bogolexer on the entire message (save its
>full source, then run bogolexer on it) to see what tokens would be part
>of bogofilter's estimation and/or registration.
>
>- Interpreting binary attachments (PDF or office formats or images) and
>extracting text from them requires *robust* software that does not
>itself pose a security risk or a massive performance penalty; and it
>also must take readability into account - i.e. NOT extract text that
>is (nearly) invisible because it uses the same (or nearly the same)
>colour as its rendering background, lies outside page display regions,
>and so on. Spammers used such tricks as white-text-on-white-paper in HTML
>and other formats to add a lot of innocent content to their malicious
>mailings so as to evade statistical filtering.
>
>- Careful selection of what to train on (either good or bad message) can
>help making training more effective; I myself haven't really used
>bogofilter's "-u" self-reinforcing option because that really bloats the
>"wordlist" database and requires *exhaustive* correction of all false
>classification.
>
>- So, bogofilter makes little effort to extract text from attachments,
>and only interprets a subset of text/* and message/* MIME attachment forms.
>
>Regards,
>Matthias
Matthias,
I've already been following your advice about making sure the training
is high quality. I've built up a collection of ham/spam that runs into
the six figures - AND - which was hand-trained. Nothing gets in there
unless (a) it's hand-copied in there after further human analysis, and
(b) the message wasn't already scoring above/below a certain
threshold, to prevent overtraining on things not needed, and (c) we also
do extensive analysis/rechecking to ensure quality - constantly hunting
for misclassified items. Literally thousands of hours of work went into
this over the past few years. I think I might have the highest-quality,
lowest-misclassification large-ish collection of ham/spam emails for
Bayesian training - in the world.
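
As a rough illustration of criterion (b) above, a threshold gate on
training could look like the sketch below. The function name and the
cutoff values are hypothetical, chosen only for illustration; the one
thing assumed about Bogofilter is that its spamicity score lies between
0 and 1, with higher meaning more spam-like.

```python
def should_train(score: float, is_spam: bool,
                 spam_cutoff: float = 0.9, ham_cutoff: float = 0.1) -> bool:
    """Decide whether a hand-classified message is worth adding to training.

    score       -- spamicity in [0, 1] as currently reported for the message
    is_spam     -- the human verdict for the message
    spam_cutoff -- hypothetical level above which spam is already "spammy enough"
    ham_cutoff  -- hypothetical level below which ham is already "hammy enough"
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if is_spam:
        # Spam that already scores high adds little; skip it.
        return score < spam_cutoff
    # Ham that already scores low adds little; skip it.
    return score > ham_cutoff
```

With these example cutoffs, a spam currently scoring 0.5 would be added
to training, while a spam already scoring 0.95 would be left out.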
But when these emails have an attached PDF that contains an image and no
text - and such an email has a subject line like "invoice" and very
little text in the body - and they're being sent from large email
systems that send much other legit email (Gmail, Outlook, etc.) - then
there's just not much for Bogofilter to "grab onto" - and that has
forced my Bogofilter into somewhat of a "trained to exhaustion"
situation for these types of emails. I think that "training to
exhaustion" situation (for these types of emails) is having the
unfortunate side effect of causing some weird occasional bad Bogofilter
scores on various OTHER emails. (But note that I'm NOT putting
multiple copies of the SAME emails in that training - these are
DIFFERENT emails that together are causing this "training to
exhaustion".)
Unfortunately, as I've been exploring this more - the ideas I presented
on this thread are probably not going to work - because the signal to
noise ratio in that PDF metadata is not very good, and often the
"smoking gun" for these series of PDFs is a sequence of items, not the
items themselves. That works better for hand-created anti-spam rules,
and doesn't work as well for Bayesian filtering. So I could probably
develop some custom rules that would help me to surgically target MANY
of these spams with other rules - then remove them from Bogofilter
training/targeting altogether - and that might then help Bogofilter to
have superior results when MORE of these TYPES of spams are removed
from my Bogofilter training collection.
HOWEVER - I DO HAVE SUGGESTIONS/QUESTIONS FOR BOGO:
(1) Occasionally, these binary files will be EXACT copies of other ones
that the spammer is using in many other emails - including the base64
encoding being line-for-line identical. But yet they'll be named
differently in the attachment name. So here is a suggestion. Without
even having to render the binary file - would it be possible for
Bogofilter to do something like generating a SHA-1 hash (or MD5 or
similar hash) based on the combined base64 lines of an attachment, and
then add that hash string to the Bogodata, as if it were a word in the
email - then factor that into Bogo scoring/training? Then if/when such
a hashed string appeared in both ham and spam, the two would cancel each
other out. But if it keeps recurring in spam, without being seen in
legit emails - it could then help achieve a higher score for that -
which would then help to avoid having to overtrain AS MUCH on those
types, to a point where they're causing other problems (for the reasons
I mentioned earlier).
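
To make suggestion (1) concrete, here is a minimal sketch of the idea as
an external preprocessing step, using only Python's standard library. It
is not something Bogofilter does today; the function name and the
"atthash:" token prefix are hypothetical. It hashes the base64 text of
each non-text attachment exactly as it appears in the message source
(with whitespace removed, so line wrapping doesn't matter) and ignores
the attachment's filename.

```python
import hashlib
from email import message_from_bytes
from typing import List


def attachment_hash_tokens(raw_message: bytes,
                           prefix: str = "atthash:") -> List[str]:
    """Return one pseudo-word per binary attachment in a raw message."""
    msg = message_from_bytes(raw_message)
    tokens = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no payload of their own
        if part.get_content_maintype() in ("text", "message"):
            continue  # these parts are already tokenized as text
        # decode=False keeps the still-encoded (e.g. base64) body text.
        payload = part.get_payload(decode=False)
        if not isinstance(payload, str):
            continue
        # Join all encoded lines; drop whitespace and line breaks.
        joined = "".join(payload.split())
        if not joined:
            continue
        digest = hashlib.sha1(joined.encode("ascii", "ignore")).hexdigest()
        tokens.append(prefix + digest)
    return tokens
```

The resulting tokens could, for example, be appended to the text handed
to the filter, so that an attachment reused under different filenames
always contributes the same "word".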
(2) What is this "-u, --update-as-scored" feature you mentioned? I
searched the Internet for that, but I'm not finding any detailed
explanations, and I couldn't find much here:
https://www.google.com/search?q=%22bogofilter%22+%22update-as-scored%22
I'd love to know more about this feature. Any suggestions? Link?
Thanks again for all that you do - Bogofilter is excellent software - I
just think this is a loophole that spammers have exploited more often in
recent years - and I think some improvements to Bogofilter would help
with this.
Rob McEwen, invaluement