[cvs] bogofilter/src charset.c, 1.13, 1.14 charset.h, 1.1, 1.2 collect.c, 1.36, 1.37 lexer.c, 1.103, 1.104 mime.c, 1.32, 1.33 mime.h, 1.19, 1.20

Mon Jan 3 23:36:41 CET 2005

Hi,
Matthias Andree wrote:

>relson at users.sourceforge.net writes:
>
>  
>
>> static void map_windows_1251(void)
>> {
>>-#ifdef	WINDOWS_1251_to_CYRILLIC
>>+#ifdef	CP866
>>     /* Map:  windows-1251 -> KOI8-R (Cyrillic) */
>>     /* Contributed by: Yar Tikhiy (yarq at users.sourceforge.net) */
>>     static char xlate_1251[] = {
>>-	0xA8, 0xB3,	
>>-	0xB8, 0xA3,	
>>+	0xA8, 0xB3,
>>+	0xB8, 0xA3,
>> 	0xE0, 0xC1,  0xE1, 0xC2,  0xE2, 0xD7,  0xE3, 0xC7,  0xE4, 0xC4,  0xE5, 0xC5,  0xE6, 0xD6,  0xE7, 0xDA,
>> 	0xE8, 0xC9,  0xE9, 0xCA,  0xEA, 0xCB,  0xEB, 0xCC,  0xEC, 0xCD,  0xED, 0xCE,  0xEE, 0xCF,  0xEF, 0xD0,
>> 	0xF0, 0xD2,  0xF1, 0xD3,  0xF2, 0xD4,  0xF3, 0xD5,  0xF4, 0xC6,  0xF5, 0xC8,  0xF6, 0xC3,  0xF7, 0xDE,
>>@@ -285,6 +290,98 @@
>> #endif
>> }
>>    
>>
>
>What is this function doing?
>  
>
This function should be renamed to
static void map_windows_1251_to_koi8r(void)
and, oops, the guard is wrong. It should be

+#ifndef CP866

i.e. it is the old function for converting from an xxxx codepage to the base KOI8-R codepage. I added

static void map_windows_1251_to_cp866(void);
static void map_koi8_r_to_cp866(void);
static void map_iso_8859_5_to_cp866(void);

because my working codepage is CP866. KOI8-R is the native Russian codepage
on UNIX systems, so maybe some time later somebody will add
map_iso_8859_5_to_koi8r and Unicode-to-KOI8-R support.

>Why are we converting directly from one codepage to another?
>  
>
Because one word may have different binary representations. For example,
the Russian word for "spammer" may be
E1 AF A0 AC ? AC A5 E0   CP866
D3 D0 C1 CD ? CD C5 D2   KOI8-R
F1 EF E0 EC ? EC E5 F0   CP1251
E1 DF D0 DC ? DC D5 E0   ISO-8859-5

A human user who looks at the words (with debug output, bogoutil, etc.) can
read them only in his current codepage. That is the first reason. The second
is database size: a word is stored once instead of once per encoding.
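
The direct codepage-to-codepage mapping can be sketched as a byte lookup table. This is only an illustration, not the code in charset.c: the names xlate_1251_excerpt, demo_table, demo_build_table and demo_translate are made up for the example, and only the windows-1251 -> KOI8-R pairs visible in the quoted diff are used (the real table is longer).

```c
#include <stddef.h>

/* (from, to) byte pairs copied from the xlate_1251[] excerpt in the
 * quoted diff: windows-1251 -> KOI8-R.  Not the complete table. */
static const unsigned char xlate_1251_excerpt[] = {
    0xA8, 0xB3,
    0xB8, 0xA3,
    0xE0, 0xC1,  0xE1, 0xC2,  0xE2, 0xD7,  0xE3, 0xC7,
    0xE4, 0xC4,  0xE5, 0xC5,  0xE6, 0xD6,  0xE7, 0xDA,
    0xE8, 0xC9,  0xE9, 0xCA,  0xEA, 0xCB,  0xEB, 0xCC,
    0xEC, 0xCD,  0xED, 0xCE,  0xEE, 0xCF,  0xEF, 0xD0,
    0xF0, 0xD2,  0xF1, 0xD3,  0xF2, 0xD4,  0xF3, 0xD5,
    0xF4, 0xC6,  0xF5, 0xC8,  0xF6, 0xC3,  0xF7, 0xDE
};

/* Hypothetical 256-entry lookup table, one output byte per input byte. */
static unsigned char demo_table[256];

/* Start from the identity mapping, then overwrite the listed bytes. */
static void demo_build_table(void)
{
    size_t i;

    for (i = 0; i < 256; i++)
        demo_table[i] = (unsigned char)i;
    for (i = 0; i + 1 < sizeof(xlate_1251_excerpt); i += 2)
        demo_table[xlate_1251_excerpt[i]] = xlate_1251_excerpt[i + 1];
}

/* Translate a buffer in place, one table lookup per byte. */
static void demo_translate(unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++)
        buf[i] = demo_table[buf[i]];
}
```

After demo_build_table(), passing the CP1251 bytes F1 EF E0 EC EC E5 F0 through demo_translate() gives D3 D0 C1 CD CD C5 D2, which is the KOI8-R row of the listing without the position shown as "?". With every input mapped to one codepage, the same word always ends up as the same byte string.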

>  
>
>>+int  htmlUNICODE_decode(byte *buf, int len)
>>    
>>
>
>And what does this function do?
>  
>
This function decodes Unicode HTML character references and should be renamed to
int decode_and_htmlUNICODE_to_cp866(byte *buf, int len)

It decodes things like &#1084; (Cyrillic м) to CP866, and all other normal
characters with charset_table[].
In any case, all of these changes are active only when the CP866 macro is defined.
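
A rough sketch of this kind of decoding, for decimal references of the form &#NNNN; only. This is not the committed htmlUNICODE_decode(): the names demo_ucs_to_cp866 and demo_decode_ncr_cp866 are invented for the example, only the basic Russian letters are handled, and the charset_table[] step for ordinary characters is left out.

```c
#include <stddef.h>

/* Hypothetical helper: map a Unicode code point to its CP866 byte.
 * Returns 0 when the code point is not covered by this sketch. */
static unsigned char demo_ucs_to_cp866(unsigned long cp)
{
    if (cp > 0 && cp < 0x80)
        return (unsigned char)cp;                    /* ASCII unchanged  */
    if (cp >= 0x0410 && cp <= 0x043F)
        return (unsigned char)(cp - 0x0410 + 0x80);  /* U+0410..U+043F   */
    if (cp >= 0x0440 && cp <= 0x044F)
        return (unsigned char)(cp - 0x0440 + 0xE0);  /* U+0440..U+044F   */
    if (cp == 0x0401)
        return 0xF0;                                 /* capital IO       */
    if (cp == 0x0451)
        return 0xF1;                                 /* small io         */
    return 0;
}

/* Decode "&#NNNN;" references in place and return the new length.
 * References that are malformed or not covered are copied unchanged. */
static int demo_decode_ncr_cp866(unsigned char *buf, int len)
{
    int in = 0, out = 0;

    while (in < len) {
        if (buf[in] == '&' && in + 2 < len && buf[in + 1] == '#') {
            int p = in + 2;
            int digits = 0;
            unsigned long cp = 0;

            /* Collect at most 7 decimal digits to avoid overflow. */
            while (p < len && buf[p] >= '0' && buf[p] <= '9' && digits < 7) {
                cp = cp * 10 + (unsigned long)(buf[p] - '0');
                p++;
                digits++;
            }
            if (digits > 0 && p < len && buf[p] == ';') {
                unsigned char c = demo_ucs_to_cp866(cp);
                if (c != 0) {
                    buf[out++] = c;   /* write the single CP866 byte */
                    in = p + 1;       /* skip past the ';'           */
                    continue;
                }
            }
        }
        buf[out++] = buf[in++];       /* ordinary byte: copy as is   */
    }
    return out;
}
```

For example, the buffer "a&#1084;b" (9 bytes) becomes the 3 bytes 'a', 0xAC, 'b', which agrees with AC for that letter in the CP866 row of the "spammer" listing. The output never gets ahead of the input, so decoding in place is safe.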

>>+void mime_type2(word_t * text)
>>    
>>
>
>What does this do? Why a mile-long #if 0? The whole mime.* change is
>undocumented and I don't see why we might need it, what it changes or does.
>  
>
This function is used in the proposed "EK binary problem hack".
As for the #if 0: there is an empty switch there, isn't there? And the old
mime_type() is still in place.

SY,
EK