words and buffers

Thu Feb 6 17:44:15 CET 2003

Greetings,

Over the last few days, I've been working on bogofilter's internal handling 
of text input and tokens.  At present text is read into a buffer (which 
involves a data pointer, buffer size, and byte count, i.e. a "char *" 
pointer and two "size_t" variables).  Tokens are represented as "char *" 
pointers.  Using these data types generally involves passing around the trio 
of variables for the buffer or the pointer for the token.  Besides the 
inconvenience and untidiness of passing around groups of variables, some 
string operations are needed repeatedly (for example strlen() for a 
token).  Storing a length with each token would increase speed.  Since input 
text can contain NUL bytes, which by convention are terminators for "char 
*" variables, it is wrong to use printf("%s") to display it.  This is 
another need/use for the length variable.
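
A minimal sketch of the NUL problem, assuming nothing about bogofilter's 
actual code: write_bytes() is a hypothetical helper invented for this 
illustration.  It writes a byte range by explicit length, so embedded NUL 
bytes do not cut the output short.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of bogofilter): write exactly len bytes
 * of data to fp, including any embedded NUL bytes.
 * Returns the number of bytes actually written. */
static size_t write_bytes(FILE *fp, const char *data, size_t len)
{
    assert(fp != NULL && data != NULL);
    return fwrite(data, 1, len, fp);
}

/* For comparison: the length a "%s"-style consumer would see,
 * because strlen() stops at the first NUL byte. */
static size_t visible_length(const char *data)
{
    return strlen(data);
}
```

For the 8-byte input "spam\0ham", visible_length() reports 4 while 
write_bytes() with len = 8 emits all 8 bytes.  Note that 
printf("%.*s", ...) is not a substitute, since it also stops at the first 
NUL byte.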

To address these issues, I have added types to bogofilter named word_t and 
buff_t.  word_t is a struct containing a "byte *" pointer for the data, a 
"size_t" variable for the length, and storage for the actual token.  buff_t 
is a struct containing a "byte *" data pointer, a "size_t" variable (for 
bytes used), a "size_t" variable (for buffer size), etc.  Since buff_t and 
word_t share some attributes and in some places a word_t is wanted when a 
buff_t is available, the buff_t definition includes a word_t.
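
The layout described above might look roughly like the following.  This is 
a sketch under assumptions, not the definitions from the bogofilter source: 
the field names (leng, text, t, size) and the helpers word_new(), 
word_free(), word_cmp(), and buff_add() are chosen for illustration, and 
placing the token storage directly after the struct in one allocation is 
just one way to give a word_t its own storage.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned char byte;

/* A token: data pointer plus explicit length (assumed field names). */
typedef struct {
    size_t leng;   /* number of bytes in the token */
    byte  *text;   /* token data, may contain NUL bytes */
} word_t;

/* A buffer: embeds a word_t (data pointer + bytes used) plus capacity. */
typedef struct {
    word_t t;      /* t.text = buffer data, t.leng = bytes used */
    size_t size;   /* total buffer size */
} buff_t;

/* Allocate a word_t with its token storage in the same block. */
static word_t *word_new(const byte *text, size_t leng)
{
    word_t *w = malloc(sizeof(*w) + leng + 1);
    if (w == NULL)
        return NULL;
    w->leng = leng;
    w->text = (byte *)(w + 1);        /* storage follows the struct */
    if (text != NULL)
        memcpy(w->text, text, leng);
    w->text[leng] = '\0';             /* convenience terminator only */
    return w;
}

static void word_free(word_t *w)
{
    free(w);
}

/* Compare two words by content and length; NUL bytes are ordinary data. */
static int word_cmp(const word_t *a, const word_t *b)
{
    size_t n = a->leng < b->leng ? a->leng : b->leng;
    int d = memcmp(a->text, b->text, n);
    if (d != 0)
        return d;
    return (a->leng > b->leng) - (a->leng < b->leng);
}

/* Append a word to the buffer; returns 0 on success, -1 if it won't fit. */
static int buff_add(buff_t *b, const word_t *in)
{
    assert(b->t.leng <= b->size);
    if (in->leng > b->size - b->t.leng)
        return -1;
    memcpy(b->t.text + b->t.leng, in->text, in->leng);
    b->t.leng += in->leng;
    return 0;
}
```

Because the buffer embeds a word_t, a function that expects a word_t* can 
be handed &buff.t directly, for example word_cmp(&buff.t, token), with no 
copying or conversion.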

The process of implementing the new design was difficult and time 
consuming.  The calling sequences for many functions were altered.  They 
now have word_t* and buff_t* instead of char* and a variety of size_t 
parameters.  Significant effort went into making sure that the tough areas 
are processed correctly.  A particularly difficult combination was 
base64-encoded sections of HTML with multiline comments to remove.

The good news is that the word/buffer code works fine.  It passes the 
regression tests and also correctly processes all the different messages 
that appeared as problems during the beta period (versions 0.10, 0.10.1, 
..., 0.10.1.5).

The bad news is that many files got changed.  My count is 6 files added 
(code and header files for word_t, buff_t, and xmemchr()) and approx 
40 files modified.  Also, there's room for improvement in some parts of 
the code.  They are more involved, i.e. less understandable, than I would 
like.  Work on improving these areas will continue.

David