bogofilter's handling of attached PDF files

Wed Jan 8 09:51:22 CET 2025

On 07.01.25 at 20:24, Rob McEwen via bogofilter wrote:
> ------ Original Message ------
> From "Matthias Andree via bogofilter" <bogofilter at bogofilter.org>
> To bogofilter at bogofilter.org
> Date 1/7/2025 11:36:37 AM
> Subject Re: bogofilter's handling of attached PDF files
>
>> Rob,
>>
>> to avoid quoting nesting from becoming really inconcise with your
>> unhelpful text-above-quote style and frequent followups to your own
>> message, I'll just pick a late message in the thread and respond.
>>
>> - In the past, we've usually seen that messages contained "enough"
>> material from headers (including those of the attachment parts) to give
>> you *some* tokens. You can use bogolexer on the entire message (save its
>> full source, then run bogolexer on it) to see what tokens would be part
>> of bogofilter's estimation and/or registration.
>>
>> - Interpreting binary attachments (PDF or office formats or images) and
>> extracting text from them requires *robust* software that does not
>> itself pose a security risk or a massive performance penalty; and it
>> also must make sure to take readability into account, i.e. NOT extract
>> text that is (nearly) invisible because it uses the same (or nearly the
>> same) colour as its rendering background, or that lies outside the page
>> display region, and so on. Spammers used such tricks as
>> white-text-on-white-paper in HTML
>> and other formats to add a lot of innocent content to their malicious
>> mailings so as to evade statistical filtering.
>>
>> - Careful selection of what to train on (either good or bad messages) can
>> help make training more effective; I myself haven't really used
>> bogofilter's "-u" self-reinforcing option because that really bloats the
>> "wordlist" database and requires *exhaustive* correction of all false
>> classification.
>>
>> - So, bogofilter makes little effort to extract text from attachments,
>> and only interprets a subset of text/* and message/* MIME attachment
>> forms.
>>
>> Regards,
>> Matthias
>
> Matthias,
>
> I've already been following your advice about making sure the training
> is high quality. I've built up a collection of ham/spam that's into
> the 6 figures large - AND - which was hand-trained. Nothing gets in
> there unless (a) it's hand-copied in there after further human
> analysis, and (b) unless the message wasn't already scoring
> above/below a certain threshold, to prevent overtraining of things not
> needed, and (c) we also do extensive analysis/rechecking to ensure
> quality - constantly hunting for misclassified items. Literally
> thousands of hours of work went into this over the past few years. I
> think I might have the highest-quality, lowest-misclassified large-ish
> collection of ham/spam emails for Bayesian training - in the world.
>
> But when these emails have an attached PDF that contains an image and
> no text - and such an email has a subject line like "invoice" and very
> little text in the body of the emails - and they're being sent from
> large email systems that send much other legit email (Gmail, Outlook
> etc) - then there's just not much for Bogofilter to "grab onto" - and
> that has caused my Bogofilter to have to be somewhat of a "trained to
> exhaustion" situation for these types of emails. I think that "training
> to exhaustion" (for these types of emails) situation - is having the
> unfortunate side effect of causing some weird occasional bad
> Bogofilter scores on various OTHER emails. (But note that I'm NOT
> putting multiple copies of the SAME emails in that training - these
> are DIFFERENT emails that together are causing this "training to
> exhaustion")

Rob,

I didn't do most of the statistical analysis in bogofilter at the time
the groundwork was laid; that was more Greg Louis's domain. I am not
sold on the idea of training to exhaustion, but I do understand that
there may be a wish to weight certain features, which is what training
"one message" until the balance tilts would do.

> Unfortunately, as I've been exploring this more - the ideas I
> presented on this thread are probably not going to work - because the
> signal to noise ratio in that PDF metadata is not very good, and often
> the "smoking gun" for these series of PDFs is a sequence of items, not
> the items themselves. That works better for hand-created anti-spam
> rules, and doesn't work as well for Bayesian filtering. So I could
> probably develop some custom rules that would help me to surgically
> target MUCH of these spams with other rules - then remove them from
> Bogofilter training/targeting altogether - and that might then help
> Bogofilter to have superior results when MORE of these TYPES of
> spams - are removed from my Bogofilter training collection.
>
> HOWEVER - I DO HAVE SUGGESTIONS/QUESTIONS FOR BOGO:
>
> (1) Occasionally, these binary files will be EXACT copies of other
> ones that the spammer is using in many other emails - including the
> base64 encoding being line-for-line identical. But yet they'll be
> named differently in the attachment name. So here is a suggestion.
> Without even having to render the binary file - would it be possible
> if Bogofilter did something like generating an SHA-1 hash (or MD5 or
> similar hash) based on the combined base64 lines of an attachment, and
> then added that hash string to the Bogodata, as if it were a word in
> the email - then factored that into Bogo scoring/training. Then
> if/when such a hashed string appeared in both ham/spam, it would
> cancel each other out. But if it keeps re-occurring in spam, without
> being seen in legit emails - it could then help achieve a higher
> scoring for that - which would then help to prevent AS MUCH
> overtraining of those types to a point where they're causing other
> problems (for the reasons I mentioned earlier)

Are you suggesting to just look at a hash of the entire attachment? Or
of segments of it, such as lines? How long are these lines? Normally I
would lean towards SHA-256 or other cryptographically secure hashes,
but if we use "short" lines (say, 80 characters or less) and limit the
number of tokens, we can experiment with that.
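
To make the whole-attachment variant concrete, here is a minimal sketch
in Python (not bogofilter's C lexer, and not something bogofilter does
today). It hashes the decoded payload rather than the base64 lines,
which is my substitution for the proposal above, so that a different
line wrap does not change the token. The function names and the
"attach-sha256:" prefix are made up for illustration.

```python
import hashlib
from email import message_from_bytes
from email.message import EmailMessage


def attachment_hash_tokens(raw_message: bytes,
                           prefix: str = "attach-sha256:") -> list[str]:
    """Return one hypothetical hash token per non-text MIME leaf part."""
    msg = message_from_bytes(raw_message)
    tokens = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no payload of their own
        if part.get_content_maintype() in ("text", "message"):
            continue  # these are the parts that get tokenized as text anyway
        payload = part.get_payload(decode=True)  # undoes base64/quoted-printable
        if not payload:
            continue
        tokens.append(prefix + hashlib.sha256(payload).hexdigest())
    return tokens


def demo_message(filename: str, pdf_bytes: bytes) -> bytes:
    """Build a tiny test message with one PDF-typed attachment."""
    m = EmailMessage()
    m["Subject"] = "invoice"
    m.set_content("see attached")
    m.add_attachment(pdf_bytes, maintype="application", subtype="pdf",
                     filename=filename)
    return m.as_bytes()


# Same bytes under two different attachment names yield the same token.
same_a = attachment_hash_tokens(demo_message("invoice-001.pdf", b"%PDF-1.4 demo"))
same_b = attachment_hash_tokens(demo_message("scan_7.pdf", b"%PDF-1.4 demo"))
print(same_a == same_b)
```

Such a token only matches byte-identical attachments; any change to the
file yields an unrelated hash.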

The thing is: will this open another cat and mouse game and train the
mice to add fill bytes or tweak compression parameters to evade that?
Inserting one or a few bytes somewhere, such that the inserted length
is not a multiple of three bytes (the input unit of the
three-to-four-byte encoding that base64 implements), shifts the base64
of all subsequent data and "invalidates" the tokens.
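
The shift is easy to reproduce. The following Python sketch assumes
76-character base64 lines (57 input bytes each, the usual MIME wrap) and
uses made-up test data; it compares the per-line "tokens" of a payload
before and after a single fill byte is inserted.

```python
import base64

LINE_WIDTH = 76  # base64 characters per line, i.e. 57 input bytes


def b64_lines(data: bytes, width: int = LINE_WIDTH) -> list[str]:
    """Encode data as base64 and split it into fixed-width lines."""
    text = base64.b64encode(data).decode("ascii")
    return [text[i:i + width] for i in range(0, len(text), width)]


def unchanged_lines(a: list[str], b: list[str]) -> int:
    """Count positions at which both encodings have the identical line."""
    return sum(1 for x, y in zip(a, b) if x == y)


# Made-up 1024-byte payload standing in for an attachment.
original = bytes(range(256)) * 4
# One fill byte inserted right after the first full line's worth of input.
padded = original[:57] + b"\x00" + original[57:]

orig_lines = b64_lines(original)
pad_lines = b64_lines(padded)
survivors = unchanged_lines(orig_lines, pad_lines)
print(f"{survivors} of {len(orig_lines)} line tokens survive the one-byte insert")
```

Only the first line, which precedes the insert, keeps its token; every
later line encodes to different text.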

> (2) What is this "-u, --update-as-scored" feature you mentioned? I
> searched the Internet for that, but I'm not finding any detailed
> explanations, and I couldn't find much here:
> https://www.google.com/search?q=%22bogofilter%22+%22update-as-scored%22
> I'd love to know more about this feature. Any suggestions? Link?

It is in the manual pages - see man 1 bogofilter:

>        The -u option tells bogofilter to register the message's text after
>        classifying it as spam or non-spam. A spam message will be registered
>        on the spamlist and a non-spam message on the goodlist. If the
>        classification is "unsure", the message will not be registered.
>        Effectively this option runs bogofilter with the -s or -n flag, as
>        appropriate. Caution is urged in the use of this capability, as any
>        classification errors bogofilter may make will be preserved and will
>        accumulate until manually corrected with the -Sn and -Ns option
>        combinations. Note this option causes the database to be opened for
>        write access, which can entail massive slowdowns through lock
>        contention and synchronous I/O operations.

Synchronous I/O operations have become much cheaper with the ubiquity
of solid-state storage (which does not need to seek tracks with the
head and wait for the sector of the platter to spin by): we're up from
dozens of synchronous writes per second to tens of thousands. So the
latter warning is less relevant, but the other concerns still apply. I
am also not sold on the idea of self-amplifying from the database.

> Thanks again for all that you do - Bogofilter is excellent software -
> I just think this is a loophole that spammers have exploited more
> often in recent years - and I think some improvements with Bogofilter
> would help this.