bogofilter's handling of attached PDF files

Tomaž Šolc tomaz.solc at tablix.org
Mon Jan 6 20:05:28 CET 2025


On 6. 01. 25 18:29, Rob McEwen via bogofilter wrote:
> oops - I probably wasn't clear in my previous message - (see previous 
> email below). So what I'm asking is this - does Bogofilter include such 
> plain-text metadata (and/or other plain-text data) in its analysis of 
> emails - and I'm referring to the text that can be easily viewed in a 
> decoded PDF file, without actually trying to extract the text from a PDF 
> file in a traditional sense - so I'm NOT talking about the text you 
> could get if you opened a PDF file in a PDF viewer, selected all, and 
> copied and pasted that into a text editor.
> 
> Does Bogofilter already do this?

I think Bogofilter completely ignores attachments.

I didn't look at the source, but I did a simple experiment:

I started with an empty database and a PDF file.

I couldn't see any actual content in the PDF as plain text, but there 
are plenty of ASCII strings that bogofilter could register. For example 
"endstream":

$ strings test.pdf|grep endstream

I registered a mail with that PDF attached in the bogofilter database as non-spam (-n):

$ mkdir db
$ bogofilter --version
bogofilter version 1.2.5
$ bogofilter -d ./db -n -I test.eml

then dumped the content of the database:

$ bogoutil -d ./db|grep endstream

and I couldn't see any of those strings from the PDF.
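
One thing to keep in mind when reading this result: inside the mail the PDF is normally base64-encoded, so "endstream" never appears literally in test.eml. A filter would have to decode the attachment before it could see such strings. Below is a minimal Python sketch of that point. The fake PDF bytes, the addresses and the function names are made up for illustration, and it says nothing about what bogofilter does internally.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

# Hypothetical stand-in for test.pdf: a few bytes using real PDF keywords.
FAKE_PDF = (
    b"%PDF-1.4\n1 0 obj\n<< /Length 5 >>\n"
    b"stream\nhello\nendstream\nendobj\n%%EOF\n"
)


def build_test_mail(pdf_bytes):
    """Return the raw bytes of a mail with pdf_bytes attached as test.pdf."""
    msg = EmailMessage()
    msg["From"] = "sender@example.org"        # illustrative address
    msg["To"] = "recipient@example.org"       # illustrative address
    msg["Subject"] = "PDF attachment test"
    msg.set_content("See the attached file.\n")
    # Bytes attachments are base64-encoded by default.
    msg.add_attachment(pdf_bytes, maintype="application",
                       subtype="pdf", filename="test.pdf")
    return msg.as_bytes()


def attachment_payloads(raw):
    """Parse a raw mail and return the decoded bytes of each attachment."""
    msg = message_from_bytes(raw, policy=policy.default)
    return [part.get_content() for part in msg.iter_attachments()]


raw_mail = build_test_mail(FAKE_PDF)

# The string is absent from the raw mail but present after decoding.
in_raw = b"endstream" in raw_mail
in_decoded = any(b"endstream" in p for p in attachment_payloads(raw_mail))
print("in raw mail:", in_raw, "| in decoded attachment:", in_decoded)
```

So a grep for "endstream" over the raw mail finds nothing, and the strings only show up after the attachment has been decoded. The empty bogoutil output above is consistent with bogofilter not doing that decoding for PDF attachments, although only the source could confirm it.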

Best regards
Tomaž

