bogofilter's handling of attached PDF files

Rob McEwen rob at invaluement.com
Mon Jan 6 17:51:10 CET 2025


I'm sorry if this has already been discussed in the past, and I missed 
that. I haven't been on this list for that many years. So it's my 
understanding that Bogofilter completely ignores PDF files (and other 
binary attachments?). Is that correct?

So what I'm wondering is this: a PDF file often includes at least SOME 
plain-text metadata, and often some other plain text, that is visible 
in plain sight once you base64-decode the PDF attachment and open the 
result in a plain-text editor. (Yes, most of the characters are useless 
binary junk in a text viewer, but not 100% of them!)
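
A minimal Python sketch of that idea, for illustration only: it decodes 
each application/pdf part of a message and keeps the runs of printable 
ASCII, much like the Unix "strings" tool. The function names and the 
6-character minimum are assumptions made for this example, not anything 
Bogofilter provides.

```python
import re
from email import message_from_bytes, policy

# Runs of 6 or more printable ASCII bytes; the threshold is an arbitrary choice.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{6,}")


def printable_strings(data: bytes) -> list[str]:
    """Return the printable ASCII runs found in raw bytes."""
    return [run.decode("ascii") for run in PRINTABLE_RUN.findall(data)]


def pdf_attachment_strings(raw_message: bytes) -> list[str]:
    """Collect printable runs from every application/pdf part of a message."""
    msg = message_from_bytes(raw_message, policy=policy.default)
    found = []
    for part in msg.walk():
        if part.get_content_type() == "application/pdf":
            # decode=True undoes the base64 transfer encoding.
            payload = part.get_payload(decode=True)
            if payload:
                found.extend(printable_strings(payload))
    return found
```

Since PDF content streams are commonly compressed, this would mostly 
surface metadata and structural keywords rather than the page text.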

Because so many spammers send PDF attachments in spams that contain 
very little text of their own, feeding that readable text to the filter 
might go a long way toward helping it distinguish between legit emails 
and spams that have PDF files with very little other text.

Or maybe I'm wrong and Bogofilter is already doing this? Does anyone 
know?

ALSO - it would be helpful to go further and combine Bogofilter with 
some kind of pdf-to-text tool, and then have Bogofilter treat that 
plain text as if it were in the email itself. Of course, that wouldn't 
solve the issue of images embedded in PDF files, but it would take this 
a step further. (I recognize that this might be too resource-expensive 
for some systems.)
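
One way this could be wired up outside Bogofilter, as a rough sketch 
under stated assumptions: Poppler's pdftotext is installed and on the 
PATH, and Bogofilter tokenizes an extra text/plain part like any other 
text (not verified here). The helper names add_pdf_text and classify, 
and the filename pdf-text.txt, are hypothetical.

```python
import subprocess
import tempfile
from email import message_from_bytes, policy


def pdftotext_extract(pdf_bytes: bytes) -> str:
    """Convert PDF bytes to text with Poppler's pdftotext ('-' means stdout)."""
    with tempfile.NamedTemporaryFile(suffix=".pdf") as tmp:
        tmp.write(pdf_bytes)
        tmp.flush()
        result = subprocess.run(
            ["pdftotext", tmp.name, "-"], capture_output=True, timeout=30
        )
    return result.stdout.decode("utf-8", errors="replace")


def add_pdf_text(raw_message: bytes, extract=pdftotext_extract) -> bytes:
    """Return the message with each PDF's extracted text added as a text part."""
    msg = message_from_bytes(raw_message, policy=policy.default)
    texts = []
    for part in msg.walk():
        if part.get_content_type() == "application/pdf":
            payload = part.get_payload(decode=True)  # undo base64
            if payload:
                texts.append(extract(payload))
    # Attach after walking so the message is not modified mid-iteration.
    for text in texts:
        if text.strip():
            msg.add_attachment(text, filename="pdf-text.txt")
    return msg.as_bytes()


def classify(raw_message: bytes) -> int:
    """Pipe a message to bogofilter on stdin and return its exit status
    (0 = spam, 1 = non-spam, 2 = unsure, 3 = error)."""
    return subprocess.run(["bogofilter"], input=raw_message).returncode
```

A delivery script could then call classify(add_pdf_text(raw)) instead of 
handing the raw message to Bogofilter directly. The extractor is passed 
in as a parameter so a cheaper or sandboxed converter can be swapped in 
on systems where running pdftotext per message is too expensive.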

Any thoughts on this? Has any of this been tried? Or is any of this 
already implemented?

Thanks!

Rob McEwen, CEO of invaluement

