bogofilter's handling of attached PDF files

Tue Jan 7 16:23:01 CET 2025

------ Original Message ------
>From "Tomaž Šolc via bogofilter" <bogofilter at bogofilter.org>
>I think Bogofilter completely ignores attachments.
>I didn't look in the source, but I did a simple experiment:
>...
>and I couldn't see any of those strings from the PDF.

Tomaž,

Thanks for verifying that the PDF attachments themselves are completely ignored by Bogofilter, as I suspected.

Here is a sample file of a typical criminal PDF, and I also included the plain ascii text that I had extracted from the raw PDF code:

https:// www . invaluement . com/typical-pdf-criminal-scam-spam-attachment-seeking-phone-call . zip

(remove the spaces I added, to get that link to work - I did this since this contains spammy content as an example - and I didn't want that to get spam filtered - or cause me to get mistakenly blocked.)

This link is a zip file containing two things:

(1) a typical PDF file that a criminal spammer sent - where they're trying to scare the end user into calling a phone number - and from there they continue an attempted criminal scheme that tries to steal the email-recipient's money.

(2) the plain ascii text that I was able to extract from the PDF file. (recognizing that this probably should instead be a more sophisticated unicode text format, to capture a larger variety of types/languages of plain text)

So, in this case, there is extremely little plain text that can be extracted in the traditional sense - because this PDF is basically one large graphic. I tried to use Ghostscript to see what it extracted as text, and all I got was a bunch of whitespace with ONLY "51CW8FBDXRL3YZ6QOYPX08W25U2" in the middle of it. That's it. So there's not much text that can be used by Bogofilter using a typical PDF-to-TEXT converter, for these common types of PDFs found in MANY scam spams. (and likewise when you open it in a PDF viewer, there seems to be zero text to select/copy)

A more sophisticated method would be to OCR the text in all images in the PDF - but that is very resource-expensive and doesn't scale well.

So it's my theory that the text I extracted from the raw PDF (NOT the rendering of the PDF, which has essentially zero text) is going to have wonderfully distinctive words that Bogofilter could use to distinguish legit PDFs found "in the wild" in email systems from the spammers' PDF files. This makes sense because spammers tend to use templates and automation - which theoretically should reveal distinctive characteristics in this extracted text.
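For readers who want to experiment, here is a minimal sketch of one way to pull plain ASCII text out of the raw PDF bytes. It is an assumption, not the extraction method used for the sample file: it simply collects runs of printable ASCII characters, much like the Unix `strings` utility. The function name and the `min_len` cutoff are hypothetical.

```python
import re


def extract_ascii_strings(raw: bytes, min_len: int = 4) -> list[str]:
    """Return runs of printable ASCII found in raw bytes.

    Assumption: "printable" means bytes 0x20 (space) through 0x7e (tilde),
    and runs shorter than min_len are dropped as likely binary debris.
    Compressed PDF streams are NOT inflated here, so only text stored
    uncompressed in the file (metadata, object dictionaries) is found.
    """
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.decode("ascii") for match in pattern.findall(raw)]
```

Usage would be along the lines of `extract_ascii_strings(open("sample.pdf", "rb").read())`, where `sample.pdf` is a placeholder file name.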

So in that zip file, take a look at extracted-raw-ascii-text.txt - where you'll see the text that my system extracted. But there is probably still much useless "noise" in there. So my next step is to try to figure out WHICH words/lines are quality signals and which ones are more "noise". To do that, I'm going to do this extraction on the PDFs of emails found in thousands of spams, and compare that to the same data for thousands of legit emails - then compare the differences - and see which words/lines are repetitive and most often found in both - so that I can then get my text extraction method to ignore those more useless words/lines of text - to get to the words/lines that are more distinctive/personalized for each PDF.
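The comparison described above could be sketched roughly as follows. This is only an illustration under assumptions: each PDF is represented as a list of extracted lines, "noise" is taken to mean a line that appears in at least a given fraction of BOTH the spam and the ham PDFs, and the function names and the 0.5 threshold are hypothetical.

```python
from collections import Counter


def document_frequency(docs):
    """Map each line to the fraction of documents containing it at least once."""
    counts = Counter()
    for lines in docs:
        counts.update(set(lines))  # count each line once per document
    total = len(docs)
    return {line: n / total for line, n in counts.items()} if total else {}


def noise_lines(spam_docs, ham_docs, threshold=0.5):
    """Lines common in both corpora, so unlikely to separate spam from ham."""
    spam_df = document_frequency(spam_docs)
    ham_df = document_frequency(ham_docs)
    return {
        line
        for line in spam_df.keys() & ham_df.keys()
        if spam_df[line] >= threshold and ham_df[line] >= threshold
    }
```

The resulting set could then be used as a skip list, so that only the remaining, more distinctive lines are passed on for training and checking.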

But, meanwhile, I'm extremely busy with other urgent projects, so I only have time to work at this in small chunks of time over the coming days/weeks.

I'll report back my findings once that's completed.

If this produces excellent results - then - if I have to - in my own spam filtering system - I'll create temporary "shadow" copies of actual/real emails - where I re-write the email - replacing the PDF's base64 encoding - with this extracted plain text - then I'll get Bogofilter to train on those - and get my system to create such an extra copy for incoming "live" emails - for Bogofilter to check against - so that the training/checking is comparing apples-to-apples. But wow - that's a massive amount of work/overhead on my end - I'll then have to have two separate copies of the ham/spam training directories - and constantly create extra temporary/manipulated copies of incoming emails to be checked by Bogofilter - when it would be so much easier and less complicated if a future version of Bogofilter would have the option to do this automatically on everything checked/trained!
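A rough sketch of what such a "shadow" copy step might look like, using Python's standard `email` package. This is an assumption about one possible implementation, not an existing tool: `make_shadow_copy` is a hypothetical name, and `extract_text` stands for whatever extraction function ends up being used. Only parts labeled `application/pdf` are touched.

```python
from email import message_from_bytes, policy


def make_shadow_copy(raw_email: bytes, extract_text) -> bytes:
    """Return a copy of the email with each PDF part replaced by plain text.

    extract_text is a caller-supplied function (hypothetical) that takes
    the decoded PDF bytes and returns a str.
    """
    msg = message_from_bytes(raw_email, policy=policy.default)
    for part in msg.walk():
        if part.get_content_type() != "application/pdf":
            continue
        pdf_bytes = part.get_payload(decode=True)  # undo base64
        # set_content() clears the part's Content-* headers and payload,
        # then stores the text as a text/plain part.
        part.set_content(extract_text(pdf_bytes))
        # Drop a MIME-Version header on the sub-part if one was added;
        # it only belongs on the top-level message.
        del part["MIME-Version"]
    return msg.as_bytes()
```

The same function would have to be applied both to the training corpus and to each incoming message before it is handed to Bogofilter, so that training and checking see the same representation.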

But I have some more homework to do on my end before officially requesting that feature. And I'll need to first prove that this even works well, and figure out which text elements should be extracted. (I feel confident that it will be very successful - but there's no substitute for testing/evidence. I could be wrong. But this seems very promising!)

Rob McEwen, invaluement