Spam in images

Mon Aug 14 11:08:39 CEST 2006

Pavel Kankovsky <peak at argo.troja.mff.cuni.cz> writes:

> On Wed, 9 Aug 2006, .rp wrote:
>
>> hmm, I wonder if it would be worth hooking in an OCR program to read the 
>> image and what the min hardware requirements would be to scan them 
>> without bringing a system to a crawl.
>
> I think OCR is overkill. After all, the classifier does not really need
> to be able to read the text--it needs to be able to find out whether the
> picture is similar to known examples of spam (and ham) pictures.
>
> As far as I can tell, nowadays it is possible to recognize a majority of
> image files attached to spams just by looking at their first few
> BASE64-encoded lines. It might be worth trying to modify Bogofilter's
> parser to extract some kind of quasi-tokens from image files (e.g. break
> the file into N-byte pieces for some small N and translate every piece,
> or the first M pieces, into a token, perhaps with BASE64 in order to keep
> tokens mostly printable).
>
> Things might become more tricky if spammers start randomizing their
> "picto-spams" in the future.
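
To make the quasi-token idea above concrete, here is a rough Python
sketch. It is not Bogofilter code (Bogofilter's parser is written in C);
the function name image_quasi_tokens, the "img:" prefix, and the defaults
for N and M are all made up for illustration. It decodes only the start
of the BASE64 body, cuts it into N-byte pieces, and re-encodes each piece
so the tokens stay printable.

```python
import base64


def image_quasi_tokens(b64_lines, n=6, m=8, prefix="img:"):
    """Return at most m printable tokens from the start of a BASE64 body.

    b64_lines: the BASE64-encoded lines of one image attachment.
    n: piece size in raw bytes (hypothetical default).
    m: maximum number of pieces to tokenize (hypothetical default).
    """
    payload = "".join(line.strip() for line in b64_lines)
    # 4 BASE64 characters carry 3 raw bytes, so this many characters
    # are enough to recover the first n * m bytes.
    need_chars = -(-(n * m) // 3) * 4
    chunk = payload[:need_chars]
    # Keep only whole 4-character groups so decoding cannot fail on padding.
    chunk = chunk[: len(chunk) - len(chunk) % 4]
    raw = base64.b64decode(chunk)[: n * m]
    tokens = []
    for i in range(0, len(raw), n):
        piece = raw[i:i + n]
        # Re-encode each piece so the token is printable ASCII.
        tokens.append(prefix + base64.b64encode(piece).decode("ascii"))
    return tokens
```

With n=6 the first token of any GIF89a file is "img:R0lGODlh", i.e. the
file signature itself. A multiple of 3 for N avoids "=" padding inside
the tokens.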

OTOH, I hardly ever receive legit messages with "just an image", so
blocking messages with fewer than 5 lines of text and an image would be
quite discriminative (although bogofilter cannot learn this combination
currently).
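
A check like that would have to live outside bogofilter for now, e.g. as
a pre-filter. A minimal sketch in Python, using the standard email
module; the function name looks_like_image_only and the max_text_lines
parameter are hypothetical, and counting only non-empty text/plain lines
(ignoring HTML parts) is an assumption of this sketch:

```python
from email import message_from_string


def looks_like_image_only(raw_message, max_text_lines=5):
    """True if the message has an image part and fewer than
    max_text_lines non-empty lines of text/plain content."""
    msg = message_from_string(raw_message)
    has_image = False
    text_lines = 0
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no content of their own
        if part.get_content_maintype() == "image":
            has_image = True
        elif part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True) or b""
            # The charset does not matter for counting lines.
            text = payload.decode("utf-8", errors="replace")
            text_lines += sum(1 for line in text.splitlines() if line.strip())
    return has_image and text_lines < max_text_lines
```

The result could be turned into an extra header or token before the
message is handed to the classifier, rather than used to block outright.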

-- 
Matthias Andree