bogofilter's handling of attached PDF files
Tomaž Šolc
tomaz.solc at tablix.org
Mon Jan 6 20:05:28 CET 2025
On 6. 01. 25 18:29, Rob McEwen via bogofilter wrote:
> oops - I probably wasn't clear in my previous message - (see previous
> email below). So what I'm asking is this - does Bogofilter include such
> plain-text metadata (and/or other plain-text data) in its analysis of
> emails - and I'm referring to the text that can be easily viewed in a
> decoded PDF file, without actually trying to extract the text from a PDF
> file in a traditional sense - so I'm NOT talking about the text you
> could get if you opened a PDF file in a PDF viewer, selected all, and
> copied and pasted that into a text editor.
>
> Does Bogofilter already do this?
I think Bogofilter completely ignores attachments.
I didn't look in the source, but I did a simple experiment:
I started with an empty database and a PDF file.
I couldn't see any actual content in the PDF as plain text, but there
are plenty of ASCII strings that bogofilter could register. For example
"endstream":
$ strings test.pdf|grep endstream
I registered a mail that had that PDF attached into the bogofilter database:
$ mkdir db
$ bogofilter --version
bogofilter version 1.2.5
$ bogofilter -d ./db -n -I test.eml
then dumped the content of the database:
$ bogoutil -d ./db|grep endstream
and I couldn't see any of those strings from the PDF.
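For anyone who wants to repeat this: the steps above don't show how test.eml was put together, so here is a minimal sketch of one way to do it, assuming Python's standard email package. The addresses, the subject and the tiny stand-in "PDF" are made up for illustration and are not the files used in the experiment. It also shows that the attachment is base64-encoded in the message, so "endstream" does not appear literally in the raw mail; a filter would have to decode the attachment before it could see such strings.

```python
from email.message import EmailMessage


def build_test_message(pdf_bytes, filename="test.pdf"):
    """Build a simple mail with a text body and one PDF attachment."""
    msg = EmailMessage()
    # Hypothetical headers, for illustration only.
    msg["From"] = "sender@example.org"
    msg["To"] = "recipient@example.org"
    msg["Subject"] = "PDF attachment test"
    msg.set_content("See the attached PDF.")
    # Binary attachments get Content-Transfer-Encoding: base64 by default.
    msg.add_attachment(pdf_bytes, maintype="application",
                       subtype="pdf", filename=filename)
    return msg


# Tiny stand-in for a real PDF (assumption: any real PDF would do as well).
FAKE_PDF = (b"%PDF-1.4\n1 0 obj\n<< /Length 5 >>\n"
            b"stream\nhello\nendstream\nendobj\n%%EOF\n")

msg = build_test_message(FAKE_PDF)
raw = msg.as_bytes()                      # the mail as it would be stored on disk
attachment = next(msg.iter_attachments())
decoded = attachment.get_content()        # attachment bytes after base64 decoding

if __name__ == "__main__":
    with open("test.eml", "wb") as f:
        f.write(raw)
    print("endstream in raw mail:          ", b"endstream" in raw)
    print("endstream in decoded attachment:", b"endstream" in decoded)
```

The resulting test.eml can then be registered and dumped with the same bogofilter and bogoutil commands as above.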
Best regards
Tomaž