UTF-8 constants
Posted: Wed Dec 17, 2014 7:15 pm
For some time, I had been looking for a way to write a Unicode character as a UTF-8 constant in terms of its code point. What I mean is something like the way C and some other languages let you write, for example, U+0905 (अ, Devanagari letter A) as "\u0905". In C, a wide string literal such as L"\u0905" is stored as UTF-16 LE on Windows (other platforms commonly use a 32-bit wchar_t). In Harbour, the UTF-16 LE form of this character can be written as e"\x05\x09", where the link to 0x0905 is still visible. In UTF-8, however, it becomes e"\xE0\xA4\x85", and the connection to 0x0905 is almost impossible to see.
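To make the byte sequences concrete, here is a small runtime check. It is only a sketch: the variable names are made up for illustration, and it assumes hb_StrToHex() is available to dump the bytes.

```harbour
// U+0905 = 0000 1001 0000 0101 in binary.
// UTF-8 (3 bytes): 1110 0000 | 10 100100 | 10 000101  =>  E0 A4 85
PROCEDURE Main()

   LOCAL cUtf16le := e"\x05\x09"       // UTF-16 LE: low byte first
   LOCAL cUtf8    := e"\xE0\xA4\x85"   // UTF-8: three bytes

   ? hb_StrToHex( cUtf16le )              // 0509
   ? hb_StrToHex( cUtf8 )                 // E0A485
   ? hb_StrToHex( hb_utf8Chr( 0x0905 ) )  // E0A485
   ? cUtf8 == hb_utf8Chr( 0x0905 )        // .T.

   RETURN
```

The UTF-16 LE bytes are just the two halves of 0x0905 swapped, while the UTF-8 bytes spread the same 16 bits across three bytes with marker bits mixed in.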
Harbour has at least two ways to represent an ANSI character constant in terms of its code point, e.g. CHR(0xA0) and E"\xA0" for no-break space. So I thought it should also have a way to do this for UTF-8. But after looking through the Harbour changelog and other documentation, I did not find any solution.
Of course, it is possible to simply put the character in quotes. This works well for some characters. But for others, like no-break space, this does not work well at all, since this character looks just like a regular space (U+0020). For Asian characters, another problem is that they can be difficult to read at point sizes usually used for Western characters.
Harbour does have a function HB_UTF8CHR() that converts a numeric code point to its UTF-8 representation. But this is executed only at runtime. So HB_UTF8CHR() of a constant integer is not considered a constant string from Harbour's point of view. It cannot be used in places where a constant is required, such as in an initialization expression for a STATIC variable, or in a CASE statement of a SWITCH block. It is also inefficient to do a conversion at runtime that could instead be done at compile time.
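One way to see the limitation is the workaround it forces. The sketch below uses hypothetical names (s_cNbsp, Nbsp) and simply leaves the STATIC uninitialized, filling it on first use, because an initializer that calls hb_utf8Chr() is not a constant.

```harbour
// Hypothetical names, for illustration only.
STATIC s_cNbsp   // starts as NIL; hb_utf8Chr( 0xA0 ) is not accepted as an initializer here

PROCEDURE Main()

   ? hb_StrToHex( Nbsp() )   // C2A0

   RETURN

FUNCTION Nbsp()

   IF s_cNbsp == NIL
      s_cNbsp := hb_utf8Chr( 0xA0 )   // converted at runtime, on the first call
   ENDIF

   RETURN s_cNbsp
```

This works, but it needs an accessor function and a runtime check for something that is known at compile time.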
Fortunately, I discovered that a few functions are evaluated at compile time if they have constant arguments, and the result is therefore also considered a constant. For instance, CHR(99) is considered a constant, because it is evaluated at compile time, not runtime. I did some testing and found that the following operators and functions fall into this category:
The following are a few functions that are not evaluated at compile time:
So ultimately, the solution for me was to develop a way of converting a code point to a UTF-8 string in terms of the first group of functions, and use #translate to map this to a pseudofunction:
I now use this in my programs to express U+hhhh in Harbour as U(0xhhhh), e.g. U+0905 as U(0x0905).
Kevin
Evaluated at compile time when given constant arguments:
+ // numeric and string
- * / %
^ // including negative and fractional exponents
0x $ == != < <= > >= .T. .Y. .F. .N. ! .NOT. .AND. .OR. {} {=>} {||} E""
ASC() AT() CHR() EMPTY() HB_BITAND() HB_BITNOT() HB_BITOR() HB_BITRESET() HB_BITSET() HB_BITSHIFT() HB_BITTEST() HB_BITXOR() IF() INT() LEN() LOWER() MAX() MIN() UPPER()
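As an illustration of what this list allows, the following sketch uses CHR() with constant arguments in the two places mentioned earlier where a constant is required. The names s_cNbsp and cKey are hypothetical, and the behaviour is as described in this post rather than taken from the Harbour documentation.

```harbour
// Sketch only: s_cNbsp and cKey are hypothetical names.
STATIC s_cNbsp := Chr( 0xA0 )        // Chr() of a constant, used as a STATIC initializer

PROCEDURE Main()

   LOCAL cKey := "c"

   SWITCH cKey
   CASE Chr( 99 )                    // folded to the constant "c"
      ? "matched c"
      EXIT
   CASE Chr( 0x60 + 4 )              // constant arithmetic inside Chr(): "d"
      ? "matched d"
      EXIT
   OTHERWISE
      ? "no match"
   ENDSWITCH

   ? Len( s_cNbsp )

   RETURN
```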
Not evaluated at compile time:
ABS() ALLTRIM() EVAL() EXP() HB_UTF8ASC() HB_UTF8AT() HB_UTF8CHR() HB_UTF8LEFT() HB_UTF8LEN() HB_UTF8RAT() HB_UTF8RIGHT() HB_UTF8SUBSTR() ISALPHA() ISDIGIT() ISLOWER() ISUPPER() LEFT() LOG() LTRIM() MOD() PADC() PADL() PADR() RAT() REPLICATE() RIGHT() ROUND() RTRIM() SPACE() SQRT() STR() STRTRAN() STRZERO() STUFF() SUBSTR() TRANSFORM() TYPE() VAL() VALTYPE()
The U() pseudofunction:
// U(<c>): UTF-8 encoding of code point <c>, 1 to 4 bytes depending on its range
#translate U(<c>) => ;
   IF(<c> \< 0x80   , CHR(<c>), ;
   IF(<c> \< 0x0800 , CHR(INT(<c> / 0x40) + 0xC0) + CHR(<c> % 0x40 + 0x80), ;
   IF(<c> \< 0x10000, CHR(INT(<c> / 0x1000) + 0xE0) + CHR(INT(<c> / 0x40) % 0x40 + 0x80) + CHR(<c> % 0x40 + 0x80), ;
                      CHR(INT(<c> / 0x40000) + 0xF0) + CHR(INT(<c> / 0x1000) % 0x40 + 0x80) + CHR(INT(<c> / 0x40) % 0x40 + 0x80) + CHR(<c> % 0x40 + 0x80))))
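A short usage sketch follows. It repeats the definition so that it stands alone, and the STATIC name is hypothetical. The 4-byte case is checked against a hand-computed hex value rather than against HB_UTF8CHR(), since I am not certain that function accepts code points above 0xFFFF.

```harbour
#translate U(<c>) => ;
   IF(<c> \< 0x80   , CHR(<c>), ;
   IF(<c> \< 0x0800 , CHR(INT(<c> / 0x40) + 0xC0) + CHR(<c> % 0x40 + 0x80), ;
   IF(<c> \< 0x10000, CHR(INT(<c> / 0x1000) + 0xE0) + CHR(INT(<c> / 0x40) % 0x40 + 0x80) + CHR(<c> % 0x40 + 0x80), ;
                      CHR(INT(<c> / 0x40000) + 0xF0) + CHR(INT(<c> / 0x1000) % 0x40 + 0x80) + CHR(INT(<c> / 0x40) % 0x40 + 0x80) + CHR(<c> % 0x40 + 0x80))))

STATIC s_cDevaA := U( 0x0905 )   // hypothetical name; usable here because U() expands to a constant

PROCEDURE Main()

   ? hb_StrToHex( s_cDevaA )              // E0A485
   ? s_cDevaA == hb_utf8Chr( 0x0905 )     // .T.
   ? hb_StrToHex( U( 0xA0 ) )             // C2A0
   ? hb_StrToHex( U( 0x1F600 ) )          // F09F9880

   RETURN
```

Because the argument is substituted textually, U() is best given a plain numeric constant. An expression such as U(0x0900 + 5) would need its own parentheses to keep the operator precedence right.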