bogofilter's handling of attached PDF files
Rob McEwen
rob at invaluement.com
Mon Jan 6 18:29:32 CET 2025
Oops - I probably wasn't clear in my previous message (see the previous
email below). What I'm asking is this: does Bogofilter include such
plain-text metadata (and/or other plain-text data) in its analysis of
emails? I'm referring to the text that can be easily viewed in a
base64-decoded PDF file, without actually trying to extract the text from
the PDF in the traditional sense. So I'm NOT talking about the text you
could get if you opened a PDF file in a PDF viewer, selected all, and
copied and pasted that into a text editor.
Does Bogofilter already do this? If not, is there a way to get
Bogofilter to do this?
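
To make the idea concrete, here is a minimal sketch of pulling the plainly visible text out of a decoded PDF attachment, roughly what the Unix `strings` tool would show. It is only an illustration, not something Bogofilter is known to do: the function names and the 6-character minimum run length are assumptions, and it uses only Python's standard `email` and `re` modules.

```python
import re
from email import message_from_bytes
from typing import List

# Runs of 6 or more printable ASCII bytes. The threshold of 6 is an
# arbitrary assumption chosen to skip most short binary noise.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{6,}")


def visible_pdf_text(pdf_bytes: bytes) -> List[str]:
    """Return the printable ASCII runs found in raw (already decoded) PDF bytes."""
    return [run.decode("ascii") for run in PRINTABLE_RUN.findall(pdf_bytes)]


def pdf_strings_from_email(raw_email: bytes) -> List[str]:
    """Collect printable runs from every application/pdf part of a message."""
    msg = message_from_bytes(raw_email)
    found: List[str] = []
    for part in msg.walk():
        if part.get_content_type() == "application/pdf":
            # decode=True undoes the Content-Transfer-Encoding (e.g. base64).
            payload = part.get_payload(decode=True)
            if payload:
                found.extend(visible_pdf_text(payload))
    return found
```

The resulting strings (things like /Title, /Author or /Producer entries, and any uncompressed text) could then be appended to the message text before it is handed to the filter. Compressed PDF streams will not show up this way, which is exactly the limitation described above.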
(Going further and actually extracting the regular text from a PDF,
simulating what someone would do if they opened a PDF editor, did a
select-all, then copied/pasted into a text editor, would be ANOTHER
possible strategy to take this further, although, as mentioned in my
previous post, that might be too resource-expensive.)
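
As a rough sketch of that second strategy, the decoded attachment could be passed to an external converter. The example below assumes the `pdftotext` command from Poppler is installed and on the PATH (that tool choice, the function name and the 10-second timeout are all assumptions for illustration). It returns an empty string when the tool is missing, fails, or takes too long, so the cost per message stays bounded.

```python
import shutil
import subprocess
import tempfile


def pdf_to_text(pdf_bytes: bytes, timeout_s: float = 10.0) -> str:
    """Extract text with pdftotext; return "" if unavailable, failing, or too slow."""
    exe = shutil.which("pdftotext")
    if exe is None:
        return ""  # converter not installed
    # Written for Unix-like systems, where the temp file can be opened by
    # another process while it is still open here.
    with tempfile.NamedTemporaryFile(suffix=".pdf") as tmp:
        tmp.write(pdf_bytes)
        tmp.flush()
        try:
            # "pdftotext input.pdf -" writes the extracted text to stdout.
            result = subprocess.run(
                [exe, tmp.name, "-"],
                capture_output=True,
                timeout=timeout_s,
                check=False,
            )
        except subprocess.TimeoutExpired:
            return ""  # give up on expensive documents
    if result.returncode != 0:
        return ""
    return result.stdout.decode("utf-8", errors="replace")
```

The extracted text could then be added to the message body before filtering. As noted, this does nothing for text that exists only inside images embedded in the PDF.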
Rob McEwen, invaluement
------ Original Message ------
From "Rob McEwen via bogofilter" <bogofilter at bogofilter.org>
To bogofilter at bogofilter.org
Date 1/6/2025 11:51:10 AM
Subject bogofilter's handling of attached PDF files
>I'm sorry if this has already been discussed in the past, and I missed that. I haven't been on this list for that many years. So it's my understanding that Bogofilter completely ignores PDF files (and other binary attachments?). Is that correct?
>
>So what I'm wondering is this - often there is at least SOME plain text metadata that is included in a PDF file - and often some other plain text - that you can see in plain sight after base64-decoding a PDF attachment - and then viewing that in a plain-text editor. (Yes, most of the characters are useless binary junk in a text viewer - but not 100% of them!)
>
>Because so many spammers send PDF attachments in spams that have very little text in the message body, this strategy might be very beneficial for helping spam filters distinguish between legit emails and spams whose only real content is the PDF.
>
>Or maybe I'm wrong and Bogofilter is already doing this? Does anyone know?
>
>ALSO - it would be helpful to go further and combine Bogofilter with some kind of pdf-to-text tool, and then have Bogofilter treat that plain text as if it were in the email. Of course, that wouldn't solve the issue of images embedded in PDF files, but it would take this a step further (recognizing that this might be too resource-expensive for some systems).
>
>Any thoughts on this? Has any of this been tried? Or is any of this already implemented?
>
>Thanks!
>
>Rob McEwen, CEO of invaluement
>_______________________________________________
>bogofilter mailing list
>bogofilter at bogofilter.org
>https://www.bogofilter.org/mailman/listinfo/bogofilter