bogofilter's handling of attached PDF files
Rob McEwen
rob at invaluement.com
Mon Jan 6 17:51:10 CET 2025
I'm sorry if this has already been discussed and I missed it; I haven't
been on this list for that many years. It's my understanding that
Bogofilter completely ignores PDF files (and other binary attachments?).
Is that correct?
What I'm wondering is this - a PDF file often includes at least SOME
plain-text metadata, and often some other plain text as well. You can
see it in plain sight after base64-decoding a PDF attachment and opening
the result in a plain-text editor. (Yes, most of the characters are
useless binary junk in a text viewer - but not 100% of them!)
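
To illustrate the idea above, here is a minimal Python sketch (not anything Bogofilter does today, as far as I know) that base64-decodes an attachment and keeps only the runs of printable ASCII, much like the Unix `strings` tool. The function name `printable_strings`, the 4-character minimum, and the sample bytes are all made up for illustration. Real PDFs often compress their content streams, so in practice this would mostly surface metadata and uncompressed fragments.

```python
import base64
import re


def printable_strings(b64_payload, min_len=4):
    """Return runs of printable ASCII found in a base64-encoded attachment.

    Hypothetical helper: min_len=4 is an arbitrary cutoff, not a tuned value.
    """
    raw = base64.b64decode(b64_payload)
    # Match min_len or more consecutive bytes in the printable ASCII range.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.decode("ascii") for match in pattern.findall(raw)]


# Made-up bytes that only loosely resemble a PDF: text mixed with binary junk.
sample_pdf = (
    b"%PDF-1.4\n\x00\x01\xff/Title (Invoice overdue)\n"
    b"\x8f\x90\xfe/Author (Billing Dept)\n\x00%%EOF"
)
sample_b64 = base64.b64encode(sample_pdf)

for token_source in printable_strings(sample_b64):
    print(token_source)
```

The strings that come out (titles, author fields, and so on) are the kind of text a filter could tokenize alongside the message body.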
Because so many spammers send spams that consist of a PDF attachment
and very little text, this strategy might be very beneficial for helping
spam filters distinguish between legit emails and spams that have PDF
files with very little other text.
Or maybe I'm wrong and Bogofilter is already doing this? Does anyone
know?
ALSO - it would be helpful to go further and combine Bogofilter with
some kind of pdf-to-text tool, and then have Bogofilter treat that plain
text as if it were in the email itself. Of course, that wouldn't solve
the issue of images embedded in PDF files, but it would take this a step
further. (I recognize that this might be too resource-expensive for some
systems.)
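
A rough sketch of what such a pre-processing step could look like, in Python. It assumes poppler's `pdftotext` command is installed and on the PATH; the function names and the `pdf-text.txt` filename are invented for this example. The extractor is passed in as a parameter so a different tool could be swapped in.

```python
import subprocess
import tempfile
from email import message_from_bytes, policy


def pdftotext_extract(pdf_bytes):
    """Convert PDF bytes to text. Assumes poppler's `pdftotext` is on PATH."""
    with tempfile.NamedTemporaryFile(suffix=".pdf") as tmp:
        tmp.write(pdf_bytes)
        tmp.flush()
        # "-" as the output file tells pdftotext to write to stdout.
        result = subprocess.run(
            ["pdftotext", tmp.name, "-"], capture_output=True, check=False
        )
    return result.stdout.decode("utf-8", errors="replace")


def augment_with_pdf_text(raw_message, extract=pdftotext_extract):
    """Return the message with text from any PDF parts added as a text part.

    Hypothetical helper: the extra part is named pdf-text.txt for illustration.
    """
    msg = message_from_bytes(raw_message, policy=policy.default)
    chunks = []
    for part in msg.walk():
        if part.get_content_type() == "application/pdf":
            pdf_bytes = part.get_payload(decode=True)
            if pdf_bytes:
                chunks.append(extract(pdf_bytes))
    if chunks:
        # Attach the extracted text as an extra text/plain part.
        msg.add_attachment("\n".join(chunks), filename="pdf-text.txt")
    return msg.as_bytes()
```

The augmented message could then be piped to bogofilter on stdin in place of the original. Whether bogofilter would tokenize the added part the same way as body text is something I have not verified, and running a converter on every PDF is exactly the resource cost mentioned above.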
Any thoughts on this? Has any of this been tried? Or is any of this
already implemented?
Thanks!
Rob McEwen, CEO of invaluement