words and buffers
David Relson
relson at osagesoftware.com
Thu Feb 6 17:44:15 CET 2003
Greetings,
Over the last few days, I've been working on bogofilter's internal handling
of text input and tokens. At present text is read into a buffer (which
involves a data pointer, buffer size, and byte count, i.e. a "char *"
pointer and two "size_t" variables). Tokens are represented as "char *"
pointers. Using these data types generally involves passing around the trio
of variables for the buffer or the pointer for the token. Besides the
inconvenience and untidiness of passing around groups of variables, some
string operations are needed repeatedly (for example strlen() for a
token). Storing a length with each token would increase speed. Since input
text can contain NUL bytes, which by convention are terminators for "char
*" variables, it is wrong to use printf("%s") to display it. This is
another need/use for the length variable.
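To illustrate the NUL byte problem, here is a minimal sketch (not code from bogofilter; the helper names cstr_visible_len() and write_bytes() are made up for this example). It contrasts what a "%s"-style consumer sees with what a length-aware write outputs.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: the number of bytes printf("%s") would show,
 * i.e. everything up to (not including) the first NUL byte. */
static size_t cstr_visible_len(const unsigned char *data)
{
    return strlen((const char *)data);
}

/* Hypothetical helper: write exactly len bytes, embedded NULs included.
 * Returns the number of bytes actually written. */
static size_t write_bytes(FILE *fp, const unsigned char *data, size_t len)
{
    return fwrite(data, 1, len, fp);
}
```

For the five bytes 'a', 'b', NUL, 'c', 'd', cstr_visible_len() reports 2, while write_bytes() with a length of 5 emits all five bytes. Carrying the length alongside the pointer is what makes the second form possible.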
To address these issues, I have added types to bogofilter named word_t and
buff_t. word_t is a struct containing a "byte *" pointer for the data, a
"size_t" variable for the length, and storage for the actual token. buff_t
is a struct containing a "byte *" data pointer, a "size_t" variable (for
bytes used), a "size_t" variable (for buffer size), etc. Since buff_t and
word_t share some attributes and in some places a word_t is wanted when a
buff_t is available, the buff_t definition includes a word_t.
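As a rough sketch of the layout described above (the field names, word_new(), and buff_add() are assumptions made for illustration, and the "etc." members of buff_t are omitted; the real definitions may differ):

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned char byte;

/* Counted token: pointer plus length, so NUL bytes are just data. */
typedef struct {
    byte  *text;   /* token bytes */
    size_t leng;   /* number of bytes in text */
} word_t;

/* Buffer: embeds a word_t so it can be passed where a word_t is wanted. */
typedef struct {
    word_t t;      /* t.text = start of buffer, t.leng = bytes used */
    size_t size;   /* total capacity of the buffer */
} buff_t;

/* Allocate a word whose storage lives in the same block as the struct.
 * src must point at len readable bytes. Returns NULL on failure. */
static word_t *word_new(const byte *src, size_t len)
{
    word_t *w = malloc(sizeof(*w) + len + 1);
    if (w == NULL)
        return NULL;
    w->text = (byte *)(w + 1);   /* storage follows the struct */
    w->leng = len;
    memcpy(w->text, src, len);
    w->text[len] = '\0';         /* convenience terminator only */
    return w;
}

/* Append a word to a buffer; returns 0 on success, -1 if it won't fit. */
static int buff_add(buff_t *b, const word_t *w)
{
    if (w->leng > b->size - b->t.leng)
        return -1;
    memcpy(b->t.text + b->t.leng, w->text, w->leng);
    b->t.leng += w->leng;
    return 0;
}
```

With this shape, a function taking a word_t* can be handed &buf->t directly, and no caller needs strlen() to learn a token's length.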
The process of implementing the new design was difficult and time
consuming. The calling sequences for many functions were altered. They
now have word_t* and buff_t* instead of char* and a variety of size_t
parameters. Significant effort went into making sure that the tough areas
are processed correctly. A particularly complex/difficult combination was
base64 encoded sections of html with multiline comments to remove.
The good news is that the word/buffer code works fine. It passes the
regression tests and also correctly processes all the different messages
that appeared as problems during the beta period (versions 0.10, 0.10.1, ...,
0.10.1.5).
The bad news is that many files got changed. My count is 6 files added,
i.e. code and header files for word_t and buff_t and xmemchr(), and approx
40 files were modified. Also there's room for improvement in some parts of
the code. Those parts are more involved, i.e. less understandable, than I
would like. Work on improving these areas will continue.
David