bogofilter's handling of attached PDF files

Rob McEwen rob at invaluement.com
Mon Jan 6 18:29:32 CET 2025


Oops, I probably wasn't clear in my previous message (see previous 
email below). What I'm asking is this: does Bogofilter include such 
plain-text metadata (and/or other plain-text data) in its analysis of 
emails? I'm referring to the text that can be easily viewed in a 
decoded PDF file, without actually trying to extract the text from the 
PDF in the traditional sense. So I'm NOT talking about the text you 
would get if you opened the PDF in a PDF viewer, selected all, and 
copied and pasted that into a text editor.

Does Bogofilter already do this? If not, is there a way to get 
Bogofilter to do this?
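
To make the idea concrete, here is a minimal sketch of the "plain text 
in plain sight" extraction, done as a preprocessing step outside of 
Bogofilter. It says nothing about what Bogofilter itself does. The 
function name pdf_visible_strings and the 6-character minimum run 
length are assumptions chosen for illustration; the approach is similar 
in spirit to the Unix strings utility.

```python
import re
from email import message_from_bytes, policy


def pdf_visible_strings(raw_message: bytes, min_len: int = 6) -> list[str]:
    """Return runs of printable ASCII found in decoded PDF attachments.

    Hypothetical helper: no PDF parsing is done, so compressed streams
    inside the PDF stay unreadable and are simply skipped over.
    """
    # Runs of at least min_len characters in the printable ASCII range.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    msg = message_from_bytes(raw_message, policy=policy.default)
    found: list[str] = []
    for part in msg.walk():
        if part.get_content_type() != "application/pdf":
            continue
        # decode=True undoes the Content-Transfer-Encoding (e.g. base64).
        payload = part.get_payload(decode=True)
        if not payload:
            continue
        found.extend(run.decode("ascii") for run in pattern.findall(payload))
    return found
```

The returned strings (things like /Title or /Author entries, URLs, and 
producer names, when they are stored uncompressed) could then be 
supplied to a filter as additional text. Whether that helps 
classification would need to be measured.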

(And then going further and actually extracting the regular text from a 
PDF, simulating what someone would do if they opened a PDF editor, 
did a select-all, then copied/pasted into a text editor, would be 
ANOTHER possible strategy to take this further. Although, as mentioned 
in my previous post, that might be too resource-expensive?)
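
As a rough sketch of that second strategy, the following runs an 
external pdf-to-text tool on each PDF attachment and adds the result to 
the message as an extra inline text/plain part before the message is 
handed to the filter. It assumes Poppler's pdftotext is installed and 
on the PATH; the function names, the 5-second timeout (a crude cap on 
the resource cost), and the choice of an inline text part are all 
assumptions, and whether Bogofilter would tokenize such a part the same 
way as body text is not confirmed here.

```python
import subprocess
import tempfile
from email import message_from_bytes, policy
from typing import Callable


def pdftotext_extract(pdf_bytes: bytes, timeout: float = 5.0) -> str:
    """Run Poppler's pdftotext on pdf_bytes; return '' on failure or timeout."""
    with tempfile.NamedTemporaryFile(suffix=".pdf") as tmp:
        tmp.write(pdf_bytes)
        tmp.flush()
        try:
            # "-" as the output file makes pdftotext write to stdout.
            result = subprocess.run(
                ["pdftotext", tmp.name, "-"],
                capture_output=True, timeout=timeout, check=True,
            )
        except (OSError, subprocess.SubprocessError):
            return ""
    return result.stdout.decode("utf-8", errors="replace")


def with_pdf_text(
    raw_message: bytes,
    extract: Callable[[bytes], str] = pdftotext_extract,
) -> bytes:
    """Return the message with text from its PDF attachments added as a part."""
    msg = message_from_bytes(raw_message, policy=policy.default)
    texts = []
    for part in msg.walk():
        if part.get_content_type() == "application/pdf":
            payload = part.get_payload(decode=True)
            if payload:
                texts.append(extract(payload))
    extra = "\n".join(t for t in texts if t.strip())
    if not extra:
        return raw_message  # nothing extracted: pass the message through as-is
    msg.add_attachment(extra, disposition="inline")
    return msg.as_bytes()
```

The output of with_pdf_text could then be piped to the filter in place 
of the original message. Passing a different extract callable makes it 
possible to swap in another tool, or to skip extraction for very large 
attachments.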

Rob McEwen, invaluement



------ Original Message ------
From "Rob McEwen via bogofilter" <bogofilter at bogofilter.org>
To bogofilter at bogofilter.org
Date 1/6/2025 11:51:10 AM
Subject bogofilter's handling of attached PDF files

>I'm sorry if this has already been discussed in the past, and I missed that. I haven't been on this list for that many years. So it's my understanding that Bogofilter completely ignores PDF files (and other binary attachments?). Is that correct?
>
>So what I'm wondering is this - often there is at least SOME plain text metadata that is included in a PDF file - and often some other plain text - that you can see in plain sight after base64-decoding a PDF attachment - and then viewing that in a plain-text editor. (Yes, most of the characters are useless binary junk in a text viewer - but not 100% of them!)
>
>Because so many spammers use PDF attachments in spams that have very very little text in them, this strategy might be very beneficial for helping spam filters to distinguish between legit emails and spams that have PDF files with very little other text.
>
>Or maybe I'm wrong and Bogofilter is already doing this? Does anyone know?
>
>ALSO - it would be helpful to go further and combine Bogofilter with some kind of pdf-to-text tool - and then have Bogofilter treat that plain text as if it were in the email itself. Of course, that wouldn't solve the issue of images embedded in PDF files - but that would take this a step further. (And recognizing that this might be too resource-expensive for some systems.)
>
>Any thoughts on this? Has any of this been tried? Or is any of this already implemented?
>
>Thanks!
>
>Rob McEwen, CEO of invaluement
>_______________________________________________
>bogofilter mailing list
>bogofilter at bogofilter.org
>https://www.bogofilter.org/mailman/listinfo/bogofilter

