try to use bogolexer 0.15.7 on messages with attaches

Wed Nov 5 15:02:55 CET 2003

On Wed, 5 Nov 2003 17:47:00 +0400
Mike Lykov <combr at vesna.ru> wrote:

> You wrote:
> 
> I write directly to you, if you want, you may answer in maillist ;)
> 
> > > Can you comment on the results in first letter (where the base64
> > > part is marked as bo, bt, bl and others simultaneously)?
> >
> > Don't understand your question.  The first letter is 'h' for head
> > and 'b' for body.  base64 text is always in the body.
> 
> *** 235 bh 77 PCFET0NUWVBFIEhUTUwgUFVCTElDICItLy9XM0MvL0RURCBIVE1MIDQuMV
> *** 236 bl 77 L0VOIj4NCjwhLS0gc2F2ZWQgZnJvbSB1cmw9KDAxMjQpZmlsZTovL0M6XERvY
> *** 237 bs 77 ZCUyMFNldHRpbmdzXG9zb2tpbmFcTG9jYWwlMjBTZXR0aW5nc1xUZW1wb3J
> *** 238 bs 77 dCUyMEZpbGVzXE9MSzg0XNHl7Ojt4PDx6uD/JTIw8ODx8fvr6uAzLmh0bSAtLR
> 
> Yes, it's the body. But why is the base64 part recognized as bh (html)
> on line 1, bl (lcomment) on line 2, and bs (scomment) on lines 3 & 4?

The different states "html, lcomment, scomment" indicate what the parser
is doing when the line is read in.  Look at lexer_v3.l to see how the
different states are used.  Since flex may read ahead in the message
while trying to match a pattern, the states may not be what you
expected.  This is a debug display to help developers understand what's
going on.  If you use an additional "-v" in your command line,
bogofilter will display the line before _and_ after base64 decoding.
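
To see what text sits behind those lines without running bogofilter, the
quoted fragments can be decoded by hand. The following is only a sketch in
Python, not anything taken from bogofilter: the helper name decode_fragment
is made up for illustration, and because the fragments quoted above appear
cut short of their reported length of 77, each one is trimmed to a multiple
of four characters before decoding.

```python
import base64


def decode_fragment(fragment: str) -> bytes:
    """Decode a possibly truncated base64 fragment (hypothetical helper).

    base64 is decoded in groups of four characters, so any incomplete
    trailing group is dropped before calling the decoder.
    """
    usable = len(fragment) - len(fragment) % 4
    return base64.b64decode(fragment[:usable])


# Fragments copied from debug lines 235 and 236 quoted in this thread.
LINE_235 = "PCFET0NUWVBFIEhUTUwgUFVCTElDICItLy9XM0MvL0RURCBIVE1MIDQuMV"
LINE_236 = "L0VOIj4NCjwhLS0gc2F2ZWQgZnJvbSB1cmw9KDAxMjQpZmlsZTovL0M6XERvY"

for name, fragment in (("235", LINE_235), ("236", LINE_236)):
    print(name, decode_fragment(fragment))
```

Line 235 decodes to the start of an HTML DOCTYPE declaration, and line 236
contains the opening of an HTML comment ("<!-- saved from ..."). That is at
least consistent with the html and comment states shown in the debug
display, though the exact state on each line still depends on how far flex
has read ahead.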

> I have met such a case:
> (copied from an earlier letter)
> 
> *** 226 bt 77 IFNvZnRMaW5lDQp1bnN1YnNjcmliZSBTZW1pbmFycw0KPG1haWx0bzpzdWJ
> *** 227 bt 77 bmUtZGlyZWN0LnJ1P3N1YmplY3Q9dW5zdWJzY3JpYmUgU2VtaW5hcnM+IC
> *** 228 bt 53 IO7yIO3u4u7x8uXpIPHl7Ojt4PDu4iBTb2Z0TGluZQ0KDQoNCg0K
> *** 229 bt  1
> 
> 
> (all of the above is OK: the end of a base64-encoded file (the first
> attachment in the letter) and the beginning of another (the second))
>
> *** 230 hi 44 ------=_NextPart_000_EF17_01C3507A.A8F2C320
> *** 231 hi 25 Content-Type: text/html;
> *** 232 hi 18   charset="koi8-r"
> *** 233 hi 34 Content-Transfer-Encoding: base64
> *** 234 hi  1
> 
> All OK (the second attachment in the same letter begins)
> 
> *** 235 bh 77 PCFET0NUWVBFIEhUTUwgUFVCTElDICItLy9XM0MvL0RURCBIVE1MIDQuMV
> *** 236 bl 77 L0VOIj4NCjwhLS0gc2F2ZWQgZnJvbSB1cmw9KDAxMjQpZmlsZTovL0M6XERvY
> *** 237 bs 77 ZCUyMFNldHRpbmdzXG9zb2tpbmFcTG9jYWwlMjBTZXR0aW5nc1xUZW1wb3J
> *** 238 bs 77 dCUyMEZpbGVzXE9MSzg0XNHl7Ojt4PDx6uD/JTIw8ODx8fvr6uAzLmh0bSAtLR
> 
> Not ok  - see above

Looks OK to me.

> > > Can you explain this part of newly generated wordlist goodlist.db
> > > ?
> > No.  It works fine for my test case.  Can you gzip the original
> > message and send it to me?
> 
> I will try, but I cannot produce a file of suitable size right now (my
> test corpus of letters is 45 Mbytes ;)
> 
> > > but such parts must be ignored, by your words!
> > images and applications are ignored.  text/plain, text/html, etc are
> > processed.  Look at files src/lexer.c and src/token.c for details.
> 
> And what about this :
> -----------
> Content-Type: application/msword; name="Contract.doc"
> Content-Transfer-Encoding: base64
> Content-Disposition: attachment; filename="Contract.doc"
> -----------
> Content-Type: application/x-msexcel; name="=?koi8-r?B?5dbFzS4gxMn
> Content-Transfer-Encoding: base64
> Content-Disposition: attachment; filename="=?koi8-r?B?5dbFzS4gxMn
> 
> Is it ignored? Or is this base64-encoded content tokenized?
> 
> I have many, many letters with attachments of these types passing
> through my server. Is their encoded content important, or not, as with
> the image or application types?

You have sufficient code in the bogofilter installation to answer these
questions for yourself.  Use "bogolexer -p < message" to see what
happens.
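
As a rough cross-check outside bogofilter, a message's MIME parts can be
listed with Python's standard email package and labelled according to the
rule stated earlier in this thread (image and application parts ignored,
text parts processed). This is only a sketch of that stated rule, not
bogofilter's actual logic; the function name classify_parts, the set
IGNORED_MAINTYPES, and the sample message (including its boundary string
and base64 payloads) are all invented for illustration.

```python
from email import message_from_string

# Assumption: mirrors the rule as worded in this thread, nothing more.
IGNORED_MAINTYPES = {"image", "application"}


def classify_parts(raw_message: str):
    """Return (content_type, action) for each non-multipart MIME part."""
    msg = message_from_string(raw_message)
    result = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no payload of their own
        maintype = part.get_content_maintype()
        action = "ignored" if maintype in IGNORED_MAINTYPES else "processed"
        result.append((part.get_content_type(), action))
    return result


# Made-up two-part message resembling the headers quoted above.
SAMPLE = """\
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XYZ"

--XYZ
Content-Type: text/html; charset="koi8-r"
Content-Transfer-Encoding: base64

PCFET0NUWVBF

--XYZ
Content-Type: application/msword; name="Contract.doc"
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="Contract.doc"

AAAA

--XYZ--
"""

for content_type, action in classify_parts(SAMPLE):
    print(content_type, action)
```

What bogofilter really does with each part is still best confirmed with
"bogolexer -p < message" on the actual mail.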

David