Re: How to socket and utf-8?

From: Pascal Bourguignon (spam_at_mouse-potato.com)
Date: 11/18/05


Date: Fri, 18 Nov 2005 19:24:18 +0100

yarco.w@gmail.com writes:
> 2. When using utf-8 for communication, should i translate it into ascii
> for normal using?

http://en.wikipedia.org/wiki/Utf-8

In C, you don't have a notion of character.

The type char is merely a small integer, perhaps signed perhaps
unsigned, and the type unsigned char is merely a small unsigned
integer, and the type signed char is merely a small signed integer.

ASCII is an encoding, which is a direct mapping between some
_characters_ and some _integers_. Since the integers of the ASCII
encoding are between 0 and 127, they're small enough to be held in C
variables of any char type, signed or unsigned. It's just a coincidence.

If you want to use the UNICODE encoding, which (to a first
approximation) is a direct mapping between some more _characters_ and
some _integers_, but bigger integers, up to 0x10FFFF, you'll need to use
unsigned long int C variables.

Now, both ASCII and UNICODE have a 1-1 mapping between a set of
characters and a set of integers.

But there are other encodings, such as UTF-8, or UTF-16, or
ISO-2022-JP, etc, that map a character to a sequence of numbers of
variable length. However, despite this variable length of characters
encoded in UTF-8, this encoding has some nice properties:

- a character X encoded in ASCII has the same code as the same
  character X encoded in UTF-8.

- no multi-byte sequence of UTF-8 contains a byte in the ASCII range:
  all multi-byte sequences in UTF-8 use only numbers between 128 and
  255 (every such byte has its high bit set).
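
To make these two properties concrete, here is a minimal sketch of an
encoder from a code point to UTF-8 bytes. The name utf8_encode is
hypothetical (it is not a standard library function), and the sketch
assumes the caller provides a 4-byte buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encode one code point (0..0x10FFFF, surrogates
   excluded) as UTF-8. Writes up to 4 bytes into out and returns the
   number of bytes written, or 0 if cp cannot be encoded. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {
        /* ASCII range: one byte, same value as in ASCII. */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0; /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0; /* beyond the UNICODE range */
}
```

Every byte written by the multi-byte branches is OR-ed with at least
0x80, so none of them can be mistaken for an ASCII byte.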

So when you use C variables of type unsigned char, you can safely
handle utf-8 byte sequences, as long as you're not interested in the
actual characters represented by the byte sequence, or as long as the
only characters in this utf-8 byte sequence are ASCII characters.
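
As an illustration of where bytes and characters part ways, here is a
small sketch that counts characters rather than bytes. The name
utf8_length is hypothetical, and the sketch assumes its input is valid,
NUL-terminated UTF-8; it does no validation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count the characters (code points) in a valid,
   NUL-terminated UTF-8 byte sequence. Continuation bytes have the bit
   pattern 10xxxxxx, so every other byte starts a new character. */
size_t utf8_length(const unsigned char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != 0; s++) {
        if ((*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For a string containing only ASCII characters this gives the same
result as strlen; as soon as a multi-byte character appears, strlen
counts more bytes than there are characters.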

By the way you cannot "translate utf-8 to ASCII", because most
characters encodable in utf-8 cannot be encoded in ASCII:

$ echo é|iconv -f utf-8 -t ascii
iconv: illegal input sequence at position 0

So you can easily process utf-8 data as a whole, without having to
translate it. What would be "normal use" for your strings?

> Can i deal with it directory like ascii?

Globally, yes.
If you want to process the characters, in general, no.
In some cases, yes.

For example: "C'est ça la vie" is encoded in UTF-8 as these bytes:

43 27 65 73 74 20 c3 a7 61 20 6c 61 20 76 69 65

If you want to split this string on spaces (bytes 20), you can do it
as if it were encoded in ASCII, because the space in UNICODE has the
same code as in ASCII, and because UTF-8 doesn't use this code for
anything other than a space. So you can get these four subsequences of
bytes:

43 27 65 73 74
c3 a7 61
6c 61
76 69 65

which, when decoded from UTF-8 give you back these four strings:

"C'est"
"ça"
"la"
"vie"

If you keep in mind that in C you are not processing characters, but
bytes, and if you keep in mind the properties of the UTF-8 encoding,
then you can do a great deal without having to decode UTF-8 bytes to
characters. You may want to use:

typedef unsigned char byte;
const byte *bytes = (const byte *)"Hello World";

instead of char and string...
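
For instance, the split on spaces described above could be sketched
like this, working purely on bytes. The function name next_field and
its calling convention are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char byte;

/* Hypothetical helper: find the next space-delimited field in
   buf[0..size-1], starting the search at *pos. Returns a pointer to
   the first byte of the field and stores its length in *len, or
   returns NULL when no field is left. *pos is advanced past the field
   so the function can be called repeatedly. */
const byte *next_field(const byte *buf, size_t size, size_t *pos, size_t *len)
{
    size_t start;
    assert(buf != NULL && pos != NULL && len != NULL);
    /* Skip separators. Byte 0x20 is always a space in UTF-8. */
    while (*pos < size && buf[*pos] == 0x20)
        (*pos)++;
    if (*pos >= size)
        return NULL;
    start = *pos;
    /* Bytes >= 0x80 pass through untouched; no decoding is needed. */
    while (*pos < size && buf[*pos] != 0x20)
        (*pos)++;
    *len = *pos - start;
    return buf + start;
}
```

Starting with a position of 0 and calling it repeatedly on the 16
bytes of "C'est ça la vie" yields fields of 5, 3, 2 and 3 bytes, and
the two bytes c3 a7 of the ç stay together in the second one.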

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/
I need a new toy.
Tail of black dog keeps good time.
Pounce! Good dog! Good dog!

