buffer swapping....Lots of progress.
Nick Simicich
njs at scifi.squawk.com
Tue Feb 25 16:42:20 CET 2003
I *think* this change needs to be made:
while (msg_header
&& count != -1
&& memcmp(buf,spam_header_name,hdrlen) == 0)
{
The count test should change to count >= hdrlen; otherwise memcmp will
compare leftover buffer contents.
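A minimal sketch of the bounds check suggested above. The helper name line_has_prefix and its parameters are hypothetical and are not part of bogofilter; the sketch assumes count is the number of valid bytes in buf, or -1 at end of input.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from bogofilter): returns 1 only when buf
 * holds at least prefix_len valid bytes and those bytes match prefix.
 * count is assumed to be the number of valid bytes in buf, or -1 at
 * end of input. */
static int line_has_prefix(const char *buf, int count,
                           const char *prefix, size_t prefix_len)
{
    /* Too short (or EOF): do not look at bytes left over from an
     * earlier, longer line. */
    if (count < 0 || (size_t)count < prefix_len)
        return 0;
    return memcmp(buf, prefix, prefix_len) == 0;
}
```

With a check of this shape, a 3-byte line sitting in front of stale "X-Bogosity" bytes no longer matches, because the length test fails before memcmp is reached.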
In lexer.c, I think that count should be >= 5 here:
if (count > 0
&& memcmp("From ", buf, 5) != 0
This instead?
&& ((count >= 5 && memcmp("From ", buf, 5) != 0) || count < 5)
&& !msg_header && !msg_state->mime_header
&& msg_state->mime_type != MIME_TYPE_UNKNOWN) {
int decoded_count = mime_decode(buf, count);
/*change buffer size only if the decoding worked */
if (decoded_count != 0)
count = decoded_count;
}
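The proposed condition, (count >= 5 && memcmp(...) != 0) || count < 5, is logically the same as "this is not a complete From line". A small sketch of that equivalence follows; is_from_line and may_decode are hypothetical names used only for illustration, and the other tests from the original condition (msg_header, mime_header, mime_type) are left out.

```c
#include <string.h>

/* Hypothetical helper: true when the first count bytes of buf begin
 * with the mbox separator "From ". The length test runs first, so
 * memcmp never reads past the valid data. */
static int is_from_line(const char *buf, int count)
{
    return count >= 5 && memcmp("From ", buf, 5) == 0;
}

/* Hypothetical predicate covering only the count/"From " part of the
 * condition: decode non-empty lines that are not From lines. */
static int may_decode(const char *buf, int count)
{
    return count > 0 && !is_from_line(buf, count);
}
```

Written this way, a line shorter than 5 bytes is still eligible for decoding, which is the behavior the rewritten condition asks for.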
I think I got it working... at least the buffer swaps are working and I am
not stepping all over the buffers. Once I knew what was going on, it was as
simple as delaying the buffer swap until *after* the call to yylex. I
moved my swap induction code into get_token, right after the opening of the
while (true) loop. All the basic tests pass, but all the advanced tests fail:
make check-TESTS
make[1]: Entering directory `/home/njs/prod_bogofilter/bogofilter-0.10.3.1/tests/bogofilter'
FAIL: t.lexer.mbx
FAIL: t.robx
SKIP: t.valgrind
./outputs/split.out ./checks.3729.20030225T074011/split.out differ: char 27, line 1
FAIL: t.split
./outputs/msg.1.f.v ./checks.3804.20030225T074012/tests/msg.1.f.v differ: char 53, line 1
FAIL: t.systest
./outputs/grftest.out ./checks.4118.20030225T074014/tests/grftest.out differ: char 81, line 2
FAIL: t.grftest
======================
5 of 5 tests failed
(1 tests were not run)
======================
What I am seeing is oddities like this:
*** 88 b,b 1
*** 89 b,b 51 From glamb at dynagen.on.ca Fri Nov 1 06:20:26 2002
lexer_state: TEXT -> HEAD
*** mime_reset
*** mime_pop. stackp: 0
*** mime_push. stackp: 0
lexer_state: HEAD -> TEXT
*** 90 b,b 35 Return-Path: <glamb at dynagen.on.ca>
*** 91 b,b 33 Delivered-To: smythe at example.com
So something is triggering an extra state change. Hmmmm. OK, I figured
that out: it was noticing a newline while it was still processing the line
in the text processor. The fix is to note that a state change is pending
and refuse to do another one. That is in the patch that follows. But...
[njs at glock bogofilter]$ ./t.lexer.mbx -v 2>&1 | less
[njs at glock bogofilter]$ pwd
/home/njs/prod_bogofilter/bogofilter-0.10.3.1/tests/bogofilter
[njs at glock bogofilter]$
>*** 116 b,b 28 Content-Disposition: inline
>yyredo: 27 "Content-Disposition: inline"
>*** 117 b,b 66 In-Reply-To:
><4.3.2.7.2.20021031175102.00be09a0 at mail.example.com>
>*** 118 b,b 41 Organization: Dynagen Consulting Limited
>*** 119 b,b 1
>*** 120 b,b 56 On 20021031 (Thu) at 1805:18 -0500, David Smythe wrote:
>*** 121 b,b 1
>*** 122 b,b 71 > Also, I'm thinking the beginning of the Intro needs a bit
>more of a
Note that even though we are parsing the header here, it is still flagged as
b,b, and so the discovery of the newline (it is hit) does not switch
from head to body. It looks like this is supposed to be done in
mime_reset, and I think that is being called, but the state is wrong later.
I would appreciate it if someone who has worked on this code could look at it.
The debugging lines still read b,b, and then the *next* state change is not
made because the got_newline() code is ignored. I have to go see the
doctor now, so I have to get going.
[njs at glock bogofilter-0.10.3.1]$ for a in *~1~; do b=`basename $a .~1~`;
diff -u $a $b; done > /home/docroot/docs/njs/demime/bogofilter-0.10.3.1.patch
[njs at glock bogofilter-0.10.3.1]$
http://majordomo.squawk.com/njs/demime/bogofilter-0.10.3.1.patch
The rest is the explanation of the patch:
In "change_lexer_state()", I stash the old and new state on a state change:
static token_t simicich_old_lexer_state = LEXER_HEAD;
static token_t simicich_new_lexer_state = LEXER_HEAD;

static
void change_lexer_state(lexer_state_t new)
{
    /* if change of state, show new state */
    if (DEBUG_LEXER(1) && lexer_state != new)
        fprintf(dbgout, "lexer_state: %s -> %s\n",
                state_name(lexer_state), state_name(new));
    /* start simicich */
    if(lexer_state != new) { /* No need to swap buffers back to where it */
                             /* came from - we can shortcut. */
        simicich_old_lexer_state = lexer_state;
        simicich_new_lexer_state = new;
    }
    /* end simicich */
    lexer_state = new;
    return;
}
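The "pending state change" idea mentioned earlier can be sketched in isolation as follows. This is an assumption about how such a guard could look, not the code in the patch: the type state_t, the variables, and both function names are hypothetical.

```c
/* Hypothetical stand-ins for the lexer states; not bogofilter's types. */
typedef enum { ST_HEAD, ST_TEXT, ST_HTML } state_t;

static state_t current_state = ST_HEAD;  /* state requested so far     */
static state_t applied_state = ST_HEAD;  /* state the buffers reflect  */

/* Request a state change. Returns 1 if accepted, 0 if refused because
 * an earlier change has been requested but not yet applied. */
static int request_state_change(state_t next)
{
    if (current_state != applied_state)
        return 0;               /* a change is already pending */
    current_state = next;
    return 1;
}

/* Called once the buffer swap for the pending change has been done. */
static void commit_state_change(void)
{
    applied_state = current_state;
}
```

In this sketch a second request that arrives before the commit (for example, a newline seen while the text lexer is still finishing its line) is simply dropped.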
In get_token:
token_t get_token(void)
{
token_t class = NONE;
unsigned char *cp;
/* simicich */
void * hold_buffer;
/* end */
[......]
while (true) {
    /* start simicich */
    if(simicich_old_lexer_state != simicich_new_lexer_state) {
        /* No need to swap buffers back to where it */
        /* came from - we can shortcut. */
        switch(simicich_old_lexer_state) {
        case LEXER_HTML:
            hold_buffer = lexer_text_html_extract_current_buffer();
            break;
        case LEXER_TEXT:
            hold_buffer = lexer_text_plain_extract_current_buffer();
            break;
        case LEXER_HEAD:
            hold_buffer = lexer_head_extract_current_buffer();
            break;
        default:
            hold_buffer = NULL;
        }
        switch(simicich_new_lexer_state) {
        case LEXER_HTML:
            lexer_text_html_install_buffer(hold_buffer);
            break;
        case LEXER_HEAD:
            lexer_head_install_buffer(hold_buffer);
            break;
        case LEXER_TEXT:
            lexer_text_plain_install_buffer(hold_buffer);
            break;
        }
        simicich_old_lexer_state = simicich_new_lexer_state;
    }
    /* end simicich */
This was added near the end of token.h to describe the calling sequences
for the routines. YY_BUFFER_STATE is only defined in the lexers;
everywhere else we can treat it as an opaque pointer:
/* Added Simicich - support buffer switching */
void * lexer_head_extract_current_buffer( void );
void * lexer_text_html_extract_current_buffer( void );
void * lexer_text_plain_extract_current_buffer( void );
#ifdef FLEX_SCANNER
void lexer_head_install_buffer(YY_BUFFER_STATE);
void lexer_text_html_install_buffer(YY_BUFFER_STATE);
void lexer_text_plain_install_buffer(YY_BUFFER_STATE);
#else
void lexer_head_install_buffer(void *);
void lexer_text_html_install_buffer(void *);
void lexer_text_plain_install_buffer(void *);
#endif
Finally, at the end of each lexer, I have code like this - this is the code
out of lexer_head.l:
%%
void * lexer_head_extract_current_buffer()
{
    YY_BUFFER_STATE hold = yy_current_buffer;
    yy_switch_to_buffer(yy_create_buffer( yyin, YY_BUF_SIZE ));
    /* yy_delete_buffer(yy_current_buffer); */
    return hold;
}
void lexer_head_install_buffer(YY_BUFFER_STATE b)
{
    YY_BUFFER_STATE hold = yy_current_buffer;
    if(b != hold) {
        yy_delete_buffer(hold);
        yy_switch_to_buffer(b);
    }
    yy_init = 1;
}
/*
* The following sets edit modes for GNU EMACS
* Local Variables:
* mode:c
* End:
*/
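The extract/install handoff above can be tried outside flex with mock buffers. The sketch below only illustrates the ownership pattern (extract leaves a fresh buffer behind, install frees the one it replaces); mock_buffer, the MOCK_* constants, and all mock_* functions are hypothetical stand-ins for YY_BUFFER_STATE and the lexer routines, not flex or bogofilter code.

```c
#include <stdlib.h>

/* Stand-in for YY_BUFFER_STATE; callers only ever see it as void *. */
typedef struct { int id; } mock_buffer;

enum { MOCK_HEAD, MOCK_TEXT, MOCK_HTML, MOCK_COUNT };

static mock_buffer *current[MOCK_COUNT];  /* one current buffer per lexer */
static int next_id = 0;

static mock_buffer *mock_create(void)
{
    mock_buffer *b = malloc(sizeof *b);
    if (b)
        b->id = next_id++;
    return b;
}

/* Hand the lexer's current buffer to the caller and leave a fresh
 * buffer in its place, as the extract routine above does. */
static void *mock_extract(int lexer)
{
    mock_buffer *hold = current[lexer];
    current[lexer] = mock_create();
    return hold;
}

/* Make b the lexer's current buffer, freeing the one it replaces. */
static void mock_install(int lexer, void *b)
{
    if (b != current[lexer]) {
        free(current[lexer]);   /* free(NULL) is a no-op */
        current[lexer] = b;
    }
}

/* Move pending input from one lexer to another; nothing to do when
 * the state did not actually change. */
static void mock_swap(int from, int to)
{
    if (from != to)
        mock_install(to, mock_extract(from));
}
```

After mock_swap(MOCK_HEAD, MOCK_TEXT), the text lexer owns the buffer the head lexer was reading, and the head lexer holds a new, empty one.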
--
SPAM: Trademark for spiced, chopped ham manufactured by Hormel.
spam: Unsolicited, Bulk E-mail, where e-mail can be interpreted generally
to mean electronic messages designed to be read by an individual, and it
can include Usenet, SMS, AIM, etc. But if it is not all three of
Unsolicited, Bulk, and E-mail, it simply is not spam. Misusing the term
plays into the hands of the spammers, since it causes confusion, and
spammers thrive on confusion. Spam is not speech, it is an action, like
theft, or vandalism. If you were not confused, would you patronize a spammer?
Nick Simicich - njs at scifi.squawk.com - http://scifi.squawk.com/njs.html
Stop by and light up the world!