Spam in images
matthias.andree at gmx.de
Mon Aug 14 05:08:39 EDT 2006
Pavel Kankovsky <peak at argo.troja.mff.cuni.cz> writes:
> On Wed, 9 Aug 2006, .rp wrote:
>> hmm, I wonder if it would be worth hooking in an OCR program to read the
>> image and what the min hardware requirements would be to scan them
>> without bringing a system to a crawl.
> I think OCR is overkill. After all, the classifier does not really need
> to be able to read the text--it needs to be able to find out whether the
> picture is similar to known examples of spam (and ham) pictures.
> As far as I can tell, nowadays it is possible to recognize a majority of
> image files attached to spams just by looking at their first few
> BASE64-encoded lines. It might be worth trying to modify Bogofilter's
> parser to extract some kind of quasi-tokens from image files (e.g. break
> the file into N-byte pieces for some small N and translate every piece
> (or first M pieces) into a token, perhaps with BASE64 in order to keep
> tokens (mostly) printable).
> Things might become more tricky if spammers start randomizing their
> "picto-spams" in the future.
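
The quasi-token idea quoted above could look roughly like the sketch below. It is not Bogofilter code: the function name, the "img:" prefix, and the default values for N and M are assumptions made for illustration only. N is kept at a multiple of 3 so that each Base64 token comes out without "=" padding.

```python
import base64


def image_quasi_tokens(data: bytes, n: int = 6, m: int = 16,
                       prefix: str = "img:") -> list[str]:
    """Turn the first m pieces (n bytes each) of an image into printable tokens.

    Hypothetical helper: name, prefix and defaults are illustrative assumptions.
    """
    tokens = []
    # Only the first n*m bytes are looked at; a shorter file yields fewer tokens.
    for offset in range(0, min(len(data), n * m), n):
        piece = data[offset:offset + n]
        # Base64 keeps the token printable, as suggested in the quoted text.
        tokens.append(prefix + base64.b64encode(piece).decode("ascii"))
    return tokens


if __name__ == "__main__":
    # A GIF header followed by six zero bytes, split into two 6-byte pieces.
    print(image_quasi_tokens(b"GIF89a" + bytes(6), n=6, m=2))
```

Tokens produced this way only match when the leading bytes of two images are identical, so this approach would stop working once the images are randomized, which is the concern raised in the quoted text.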
OTOH, I hardly ever receive legit messages with "just an image", so
blocking messages with fewer than 5 lines of text and an image would be
quite discriminative (although bogofilter cannot learn this combination
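
A minimal sketch of the "fewer than 5 lines of text and an image" check follows, written with Python's standard email package. It assumes the check runs as a separate step outside bogofilter; the function names are hypothetical, and counting only non-blank text/plain lines is a choice made here, not something the message specifies.

```python
from email import message_from_string
from email.message import Message


def looks_like_image_only(msg: Message, max_text_lines: int = 5) -> bool:
    """Return True if msg carries an image and fewer than max_text_lines of text.

    Hypothetical helper: only non-blank lines in text/plain parts are counted.
    """
    has_image = False
    text_lines = 0
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers hold no content of their own
        if part.get_content_maintype() == "image":
            has_image = True
        elif part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True) or b""
            text_lines += sum(1 for line in payload.splitlines() if line.strip())
    return has_image and text_lines < max_text_lines


def raw_looks_like_image_only(raw: str) -> bool:
    """Parse a raw RFC 2822 message string and apply the check above."""
    return looks_like_image_only(message_from_string(raw))
```

The threshold of 5 comes from the text above; whether HTML parts or signature lines should count toward it is left open here.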