Much simplified lexer (was: lexer change)

Boris 'pi' Piwinger 3.14 at logic.univie.ac.at
Wed Nov 12 13:36:09 CET 2003


Boris 'pi' Piwinger wrote:

>> The main benefit of the Bayesian
>> method is that it's not hindered by aging of rules like SpamAssassin
>> is.  We shouldn't be deciding based on a few more incorrect
>> classifications here or there to institute a new rule. 
> 
> Basically I agree. But somehow you have to determine what a
> word is (and hence if a word can start with a $-sign). But
> you are right, I cannot give any reason besides testing for
> not allowing tokens of length one or numbers. You would
> actually expect that those are useful.

Inspired by our discussion, Tom, I changed the lexer to be
more along the lines you describe. If you want to see whether
it works for you, it is attached.

>> I might agree
>> with a rule if there were a fundamental underlying philosophical reason,
>> but just tweaking the output is not a good enough reason.
> 
> I can follow you there. I'd be happy to add numbers and
> short tokens as well as tokens starting with $ of any form.

I now allow $ at any place in a word ($cientology,
Micro$oft, etc.). I allow digits at any place in a word
(which also admits purely numeric tokens). And I allow
tokens of length one and two.
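
To illustrate with a made-up sample line (tokens as produced
by the definitions in the attached lexer):

    Buy Micro$oft for $20, 100 ok

now yields the tokens Buy, Micro$oft, for, $20, 100, and ok;
the last three would have been dropped or cut short by the
old rules.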

> Here is once more pretty much what a token is:
>> TOKENFRONT	[^[:blank:][:cntrl:][:digit:][:punct:]]
>> TOKENMID	[^[:blank:]<>;=():&%$#@+|/\\{}^\"?*,[:cntrl:][\]]*
>> TOKENBACK	[^[:blank:]<>;=():&%$#@+|/\\{}^\"?*,[:cntrl:][\]._+-]

> If I can trust my eyes (I usually cannot ;-) those characters
> are allowed to show up in the middle of a word, but not at
> the beginning: !'-._`~ (which looks OK).
> 
> At the end of a word we only allow !'` in addition to those
> allowed at the front. I cannot say why ' or ` should be
> there. I'd disallow those.

Done. I had also missed the ~, which I'll disallow as well.

> And by your argument also remove ! -- even though it "works".

Done as well. This makes TOKENFRONT and TOKENBACK identical,
so they collapse into a single class.
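
For reference, here are the merged definitions as they appear
in the attached lexer:

TOKENBORDER	[^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]?=():#+._!'`~-]
TOKENMID	[^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]?=():#+]
TOKEN		{TOKENBORDER}({TOKENMID}*{TOKENBORDER})?

So a token is either a single border character or border
characters around any number of middle characters -- which is
also what admits tokens of length one and two.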

> I don't know anything about ´.

It is not in ASCII, and [:punct:] only covers ASCII. We have
problems with punctuation, blanks, etc. in other charsets
anyway (we cannot always recognize them).
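
For what it's worth, a quick sketch of mine (standard C, run
in the default "C" locale) shows the limitation:

    #include <ctype.h>
    #include <stdio.h>

    int main(void)
    {
        /* 0xB4 is the acute accent in ISO-8859-1; the "C" locale's
           ctype classes only cover ASCII, so it is not punctuation. */
        printf("ispunct('~')  = %d\n", ispunct('~'));   /* nonzero */
        printf("ispunct(0xB4) = %d\n", ispunct(0xB4));  /* 0 */
        return 0;
    }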


A first test to see how it performs. For comparison, recall:

Test with the last release: 2.7M
                       spam   good
.MSG_COUNT              592    307
wo (fn):  0.500000    26     23     19     68
wo (fp):  0.500000     5      4      4     13
wi (fn):  0.581092    50     41     41    132
wi (fp):  0.581092     3      2      1      6
wi (fn):  0.499993    26     23     19     68
wi (fp):  0.499993     6      4      5     15
wi (fn):  0.457261    15     15     14     44
wi (fp):  0.457261    14     10      8     32

Allowing two-byte tokens: 2.8M
                       spam   good
.MSG_COUNT              630    284
wo (fn):  0.500000    24     22     22     68
wo (fp):  0.500000     4      4      3     11
wi (fn):  0.544564    40     30     31    101
wi (fp):  0.544564     3      1      2      6
wi (fn):  0.499999    24     22     21     67
wi (fp):  0.499999     5      4      4     13
wi (fn):  0.419627     8     12     15     35
wi (fp):  0.419627    12      8     11     31

With the attached new lexer: 2.9M
                       spam   good
.MSG_COUNT              554    308
wo (fn):  0.500000    18     20     21     59
wo (fp):  0.500000     6      4      6     16
wi (fn):  0.584458    43     30     30    103
wi (fp):  0.584458     3      1      2      6
wi (fn):  0.503945    21     21     24     66
wi (fp):  0.503945     6      4      4     14
wi (fn):  0.471097    13     15     14     42
wi (fp):  0.471097    13      7     11     31

Again, the results can be read in different ways. On the one
hand the number of false positives increases, which is bad.
On the other hand, if you look at different false positive
targets (roughly .05%, .1%, and .25%), it performs better than
the last release, especially at very low false positive rates.
Compared to the last release with only two-byte tokens added,
it performs a bit worse, but not by much (except for the
initial false positive rate).

So let's wrap things up. This (experimental!) lexer removes
several special rules that were introduced over time (with
good reason, but at some point a review is worthwhile), and
it is simpler to read, with fewer definitions and rules.
There may well be more to be done.

As Tom argued, we had been excluding several kinds of tokens
for no good reason; they now get into the word list without
any external judgement about whether that is helpful.

This version performs reasonably well, so one might question
several of the exceptions we have had so far.

pi
-------------- next part --------------
/* $Id: lexer_v3.l,v 1.111 2003/11/10 23:43:39 relson Exp $ */

%{
/*
 * NAME
 *   lexer_v3.l -- bogofilter's lexical analyzer
 *
 *   01/01/2003 - split out of lexer.l
 *
*/

/*
 * Our lexical analysis is different from Paul Graham's rules: 
 *
 * We throw away headers that are readily identifiable as dates.
 * We throw away lines beginning with <tab>id<space> -- mailer UIDs.
 * (Earlier versions also threw away all digit strings that didn't
 * look like IP address parts, and *all* tokens of length 1 or 2;
 * this experimental lexer keeps those.)
 *
 * These are optimizations to keep the token lists from bloating.
 * The big win is recognizing machine-generated unique IDs that
 * we'll never see again and shouldn't keep.
 *
 * We don't treat dot between two alphanumerics as a separator,
 * because we want to keep domain names and IP addresses together as 
 * recognizable units. 
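 * (So "www.example.com" and "192.168.0.1" each survive as a
 * single token.)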
 *
 * Having done the above, there isn't much need to recognize URLs.  
 * If a URL is a spam indicator, very likely any other URL from the
 * same site is as well, so the hostname part should be an adequate
 * statistical trigger.  
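 * (e.g. every link to http://host.example contributes the same
 * "host.example" token, however much the rest of the URL varies).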
 *
 * LEXED_TOKENS, which are found in "msg-count" files, need a special pattern
 * because they can be:
 *	1 - normal bogofilter tokens
 *	2 - url:xxx and subj: tokens
 *	3 - mime boundaries
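 *	e.g. the first line of such a file: ".MSG_COUNT" 592 307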
 */

/* 12 May 2003
 * Added Paul Graham's latest ideas on parsing.
 * (From http://www.paulgraham.com/better.html)
 *
 * 1. Case is preserved.
 *
 * 2. Exclamation points are constituent characters.
 *
 * 3. Periods and commas are constituents if they occur between two
 *    digits. This lets me get ip addresses and prices intact.
 *
 * 4. A price range like $20-25 yields two tokens, $20 and $25.
 *
 * 5. Tokens that occur within the To, From, Subject, and Return-Path
 *    lines, or within urls, get marked accordingly.
 *    For example, "foo" in the Subject line becomes "subj:foo".
*/

/* DR 08/29/2003:
**
** With flex-2.5.31 and '%option never-interactive noreject', file
** msg.dr.0118.base64 (in tests/bogofilter/inputs/split.d) parses
** incorrectly because line 24 isn't base64 decoded.
*/

#include "common.h"

#include <ctype.h>
#include <stdlib.h>

#include "buff.h"
#include "charset.h"
#include "lexer.h"
#include "mime.h"		/* for mime_*() */
#include "msgcounts.h"
#include "textblock.h"
#include "token.h"
#include "xmalloc.h"

#define YY_DECL token_t yylex(void)
    YY_DECL;			/* declare function */

#define YY_INPUT(buf,result,max_size) result = yyinput((byte *)buf, max_size)
#define YY_EXIT_FAILURE EX_ERROR

#undef	stderr
#define	stderr	dbgout		/* for debug & -D options */

static word_t yyt;
static int lineno;

/* Function Prototypes */

static word_t *yy_text(void);
static void html_char(void);
static void html_reorder(void);

static void url_char(void);

static void skip_to(char chr);

char yy_get_state(void);
void yy_set_state_initial(void);

/* Function Definitions */

static word_t *yy_text(void)
{
    yyt.text = (byte *)yytext;
    yyt.leng = yyleng;
    return &yyt;
}

%}

%option warn
%option nodebug debug
%option align caseless 8bit
%option never-interactive
%option noreject noyywrap
%option prefix="lexer_v3_"

UINT8		([01]?[0-9]?[0-9]|2([0-4][0-9]|5[0-5]))
IPADDR		{UINT8}\.{UINT8}\.{UINT8}\.{UINT8}
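/* {UINT8} matches a decimal octet 0-255, so {IPADDR} matches
   dotted quads like 127.0.0.1 */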
BCHARSNOSPC	[0-9a-zA-Z'()+_,-./:=?#]
BCHARS		[0-9a-zA-Z'()+_,-./:=?# ]
MIME_BOUNDARY	{BCHARS}*{BCHARSNOSPC}

ID		<?[0-9a-zA-Z-]*>?
CHARSET		[0-9a-zA-Z-]+
MTYPE		[ \t]*[0-9a-zA-Z/-]*

NUM		[0-9]+
NUM_NUM		\ {NUM}\ {NUM}
MSG_COUNT	^\".MSG_COUNT\"

TOKENBORDER	[^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]?=():#+._!'`~-]
TOKENMID	[^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]?=():#+]
BOGOLEX_TOKEN	[^[:blank:][:cntrl:]<>;&%@|/\\{}^"*,[\]]+
TOKEN		{TOKENBORDER}({TOKENMID}*{TOKENBORDER})?

/*  RFC2047.2
    encoded-word = "=?" charset "?" encoding "?" encoded-text "?="
    charset = token    ; see section 3
    encoding = token   ; see section 4
    token = 1*<Any CHAR except SPACE, CTLs, and especials>
    especials = "(" / ")" / "<" / ">" / "@" / "," / ";" / ":" / "
		<"> / "/" / "[" / "]" / "?" / "." / "="
    encoded-text = 1*<Any printable ASCII character other than "?"
		      or SPACE>
		   ; (but see "Use of encoded-words in message
		   ; headers", section 5)
*/

/* 09/01/03
  Using "[^?]" in the pattern and validating the charset in 'C'
  reduces executable size by approx 290k.
  new: ENCODED_WORD =\?{CHARSET}\?[bq]\?[^?]*\?\=
  old: ENCODED_WORD =\?{CHARSET}\?(b\?{BASE64}|q\?{QP})\?\=
*/

WHITESPACE	[ \t\n]
NOTWHITESPACE	[^ \t\n]

ENCODED_WORD	=\?{CHARSET}\?[bq]\?[^?]*\?=
ENCODED_TOKEN	({TOKENBORDER}{TOKENMID}*)?({ENCODED_WORD}{WHITESPACE}+)*{ENCODED_WORD}
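/* e.g. =?iso-8859-1?q?hello?= matches {ENCODED_WORD} */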

HTML_ENCODING		"&#"x?[[:xdigit:]]+";"
URL_ENCODING		"%"[[:xdigit:]][[:xdigit:]]

HTML_WI_COMMENTS	"<"[^>]*">"

/*
 * Generally, there are some html tags that cause an "eyebreak" and some
 * that do not. For example, the "P" tag or the "BR" tag cause a break,
 * and can be interpreted in place, while the B (bold) tag does not.
 * No close tags seem to cause a break.
 * Comments do not.  This is an attempt to make an exhaustive list of
 * tags that cause an "eyebreak". When the exit tag also causes a break,
 * we include the /?. I believe this to be a complete list of tags that
 * can cause a formatting break.
 */

HBREAK		p|br|li|h[1-6]|hr|title|table|center|dd|dt|iframe|img|input|select|td|textarea|th|\/?(div|blockquote|pre|dir|dl|fieldset|legend|form|menu|ol|ul)

BREAKHTML	"<"({HBREAK}({WHITESPACE}[^>]*|""))">"

VERP		{TOKEN}-{NUM}-{TOKEN}={TOKEN}@{TOKEN}
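/* variable envelope return path,
   e.g. list-1234-user=example.com@lists.example.net */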

%s TEXT HTML BOGO_LEX
%s HTOKEN HDISCARD SCOMMENT LCOMMENT HSCRIPT

%%

<INITIAL,BOGO_LEX>{MSG_COUNT}{NUM_NUM}		{ if (lineno == 0) { 
						      BEGIN BOGO_LEX; 
						      set_msg_counts(strchr(yytext, ' ') + 1); 
						  }
						  return MSG_COUNT_LINE;
						}
<BOGO_LEX>^\"{BOGOLEX_TOKEN}\"{NUM_NUM}		{ return BOGO_LEX_LINE; }
<BOGO_LEX>\n					{ lineno += 1; }

<INITIAL>{ENCODED_TOKEN}			{ word_t *w = yy_text();
						  size_t size = text_decode(w);
						  while (size-- > 0)
						    unput(w->text[size]);
						}

<INITIAL>^(To|From|Return-Path|Subject):	{ set_tag(yytext); }
<INITIAL>^Received:				{ set_tag(yytext); return TOKEN; }
<INITIAL>^Content-(Transfer-Encoding|Type|Disposition):{MTYPE}	{ mime_content(yy_text()); skip_to(':'); return TOKEN; }

<INITIAL>^MIME-Version:.*			{ mime_version(yy_text()); return HEADKEY; }
<INITIAL>^(Delivery-)?Date:.*			{ return HEADKEY; }
<INITIAL>^(Resent-)?Message-ID:.*		{ return HEADKEY; }
<INITIAL>^(In-Reply-To|References):.* 		{ return HEADKEY; }

<INITIAL>boundary=[ ]*\"?{MIME_BOUNDARY}\"?	{ mime_boundary_set(yy_text()); }
<INITIAL>charset=\"?{CHARSET}\"?		{ got_charset(yytext); skip_to('='); return TOKEN; }

<INITIAL>(file)?name=\"?			/* ignore */
<INITIAL>(ESMTP|SMTP)+/{WHITESPACE}+id\ {ID}	{ if (header_line_markup) { return TOKEN; } }
<INITIAL>[[:blank:]]*id\ {ID}			/* ignore */

<INITIAL>\n[ \t]				{ lineno += 1; }
<INITIAL>\n\n					{ enum mimetype type = get_content_type();
						  have_body = true;
						  msg_header = false;
						  clr_tag();
						  switch (type) { 
						  case MIME_TEXT_HTML:	BEGIN HTML; break;
						  case MIME_MESSAGE:	yy_set_state_initial(); break;
						  default:		BEGIN TEXT; 
						  }
						  return EOH;
						}

<INITIAL>\n					{ set_tag("Header"); lineno += 1; }
<INITIAL><<EOF>>				{ return NONE; }

<INITIAL>{VERP} 				{ skip_to('='); return VERP; }

^--{MIME_BOUNDARY}(--)?$			{ if (got_mime_boundary(yy_text())) {
						      yy_set_state_initial();
						      return BOUNDARY;
						  } else {
						      yyless(2);
						  }
						}

  /* This has to match at least as much as the rules below, so that it
     remains the controlling rule. */
<HTML>{TOKEN}({HTML_WI_COMMENTS}*{BREAKHTML}+{HTML_WI_COMMENTS}*.?|({HTML_WI_COMMENTS})+{WHITESPACE})		{ 
    			char *chr = memchr(yytext, '<', yyleng);	/* find start of html tag */
			size_t len = chr - yytext;
			yyless(len);
			return TOKEN;
			}

<HTML>{TOKEN}({HTML_WI_COMMENTS})+/{NOTWHITESPACE}	{ html_reorder(); }

<HTML>"<!--"					{ BEGIN SCOMMENT; }
<HTML>"<!"					{ BEGIN (strict_check ? HTOKEN : LCOMMENT ); }
<HTML>"<"(a|img|font){WHITESPACE}		{ BEGIN HTOKEN; }
<HTML>"<"					{ BEGIN HDISCARD; }	/* unknown tag */

<HTOKEN>{TOKEN}					{ if (tokenize_html_tags)     return TOKEN; }
<HSCRIPT>{TOKEN}				{ if (tokenize_html_script)   return TOKEN; }
<HDISCARD,LCOMMENT,SCOMMENT>{TOKEN}		{ /* discard innards of html tokens and comments */ }

<HTOKEN,HDISCARD,LCOMMENT>">"			{ BEGIN HTML; }	/* end of tag, loose comment; return to normal html processing */
<SCOMMENT>"-->"					{ BEGIN HTML; }	/* end of strict comment; return to normal html processing */

{IPADDR}					{ return IPADDR;}
{TOKEN}						{ return TOKEN;}

<HTML>{TOKEN}?{HTML_ENCODING}			{ html_char(); }	/* process escaped chars, e.g. "&#101;" is 'e' */
<HTOKEN>{TOKEN}?{URL_ENCODING}+			{ url_char(); }		/* process escaped chars, e.g. "%61" is 'a' */

.						/* ignore character */
\n						{ lineno += 1; clr_tag(); }
<<EOF>>						{ return NONE; }
%%

void lexer_v3_init(FILE *fp)
{
    lineno = 0;
    have_body = false;
    yy_set_state_initial();
    yyrestart(fp);
}

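/* Keep only the part of the current match that precedes the first
 * occurrence of chr; yyless() pushes the rest back to be rescanned. */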
static void skip_to(char chr)
{
    size_t len = strchr(yytext, chr) - yytext;
    yyless(len);
}

static void html_reorder(void)
{
    char *chr = memchr(yytext, '<', yyleng);	/* find start of html tag */
    size_t len = chr - yytext;
    char *tmp;
    char *yycopy = xmalloc(yyleng + 1); 	/* +1 for NUL byte below */

    memcpy(yycopy, yytext+len, yyleng-len);	/* copy tag to start of buffer */
    memcpy(yycopy+yyleng-len, yytext, len);	/* copy leading text to end of buffer */
    yycopy[yyleng] = '\0';			/* for debugging */

    for(tmp = yycopy+yyleng-1 ; tmp >= yycopy; tmp--) 
	yyunput(*tmp, yytext);
    xfree(yycopy);
}

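/* Convert up to len leading hex digits of in[] to their integer value. */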
static int xtoi(char *in, size_t len)
{
    int val = 0;
    while (isxdigit((byte) *in) && (len-- > 0)) {
	char c = *in++;
	val <<= 4;
	val += isdigit((unsigned char)c)
	    ? (c - '0')
	    : (tolower((unsigned char)c) - 'a' + 10);
    }
    return val;
}

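/* Decode a trailing HTML numeric character reference ("&#nnn;" or
 * "&#xhhh;"); the decoded character and any text before it are pushed
 * back onto the input to be rescanned. */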
static void html_char(void)
{
    char *txt = strstr(yytext, "&#");	/* find decodable char */
    size_t len = txt - yytext;
    int  val;
    char *yycopy = NULL;

    if (len != 0) {
	yycopy = xmalloc(yyleng + 1);	/* +1 for NUL byte below */
	memcpy(yycopy, yytext, yyleng);	/* copy tag to start of buffer */
	yycopy[yyleng] = '\0';		/* for debugging */
    }

    txt += 2;
    val = isdigit((byte) *txt) ? atoi(txt) : xtoi(txt+1, 4);
    if ((val < 256) && isprint(val)) {	/* use it if printable */
	yyunput(val, yytext);
	yyleng = len;			/* adjust len to pre-char count */
    }
    else {
	if (yycopy)
	    yycopy[yyleng-1] = ' ';	/* prevents parsing loop */
    }

    if (yycopy != NULL) {
	while (yyleng-- > 0)
	    yyunput(yycopy[yyleng], yytext);
	xfree(yycopy);
    }
}

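/* Decode %xx escapes in the current match in place, then push the
 * decoded text back onto the input to be rescanned. */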
static void url_char(void)
{
    char *src, *dst;
    src = dst = yytext;

    while (src < yytext + yyleng) {
	char c = *src++;
	if (c == '%') {
	    c = xtoi(src, 2);
	    src += 2;
	}
	*dst++ = c;
    }
    while (dst > yytext) {
	yyunput(*--dst, yytext);
    }
}

char yy_get_state(void)
{
    switch (YYSTATE) {
    case INITIAL:  return 'i';
    case TEXT:     return 't';
    case HTML:
    case HTOKEN:   return 'h';
    case SCOMMENT: return 's';
    case LCOMMENT: return 'l';
    default:       return 'o';
    }
}

void yy_set_state_initial(void)
{
    BEGIN INITIAL;
    msg_header = true;
    set_tag("Header");

    if (DEBUG_LEXER(1))
	fprintf(dbgout, "BEGIN INITIAL\n");

#ifdef	FLEX_DEBUG
    yy_flex_debug = BOGOTEST('L');
#endif
}
 
/*
 * The following sets edit modes for GNU EMACS
 * Local Variables:
 * mode:c
 * indent-tabs-mode:t
 * End:
 */


