Re: How to socket and utf-8?
From: Pascal Bourguignon (spam_at_mouse-potato.com)
Date: 11/18/05
- Next message: Daniel C. Bastos: "Re: Why accept error when there have a fork?"
- Previous message: troophous: "Re: Trying to get iostat statistics AIX 5l v 5.3 help!"
- In reply to: yarco.w_at_gmail.com: "How to socket and utf-8?"
- Next in thread: SM Ryan: "Re: How to socket and utf-8?"
Date: Fri, 18 Nov 2005 19:24:18 +0100
yarco.w@gmail.com writes:
> 2. When using utf-8 for communication, should i translate it into ascii
> for normal using?
http://en.wikipedia.org/wiki/Utf-8
In C, you don't have a notion of character.
The type char is merely a small integer, perhaps signed perhaps
unsigned, and the type unsigned char is merely a small unsigned
integer, and the type signed char is merely a small signed integer.
ASCII is an encoding, which is a direct mapping between some
_characters_ and some _integers_. Since the integers of the ASCII
encoding are between 0 and 127, they're small enough to be held in C
variables of type unsigned char. It's just a coincidence.
If you wanted to use the UNICODE encoding, which (to a first
approximation) is a direct mapping between some more _characters_ and
some _integers_, but bigger integers up to 0x10FFFF, you'd need to use
unsigned long int C variables (or any integer type of at least 21 bits).
Now, both ASCII and UNICODE have a 1-1 mapping between a set of
characters and a set of integers.
But there are other encodings, such as UTF-8, or UTF-16, or
ISO-2022-JP, etc, that map a character to a sequence of numbers of
variable length. However, despite this variable length of characters
encoded in UTF-8, this encoding has some nice properties:
- a character X encoded in ASCII has the same code as the same
character X encoded in UTF-8.
- no multi-byte sequence of UTF-8 contains a byte from the ASCII
range (0 to 127): all multi-byte sequences in UTF-8 use only numbers
between 128 and 255.
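To make these two properties concrete, here is a minimal sketch of a UTF-8 encoder for a single code point. The name utf8_encode and its signature are made up for this illustration, they are not from any standard library. Note how ASCII values come out unchanged as one byte, and how every byte of a multi-byte sequence has its high bit set.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative helper (hypothetical name): encode one Unicode code point
   as UTF-8. Writes 1 to 4 bytes into out and returns how many were
   written, or 0 if cp is a surrogate or is above 0x10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* ASCII: one byte, same value */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* outside the Unicode range */
}
```

For instance, utf8_encode(0x41, buf) stores the single byte 41 ('A'), while utf8_encode(0xE7, buf) stores the two bytes c3 a7 ('ç').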
So when you use C variables of type unsigned char, you can safely
handle utf-8 byte sequences, as long as you're not interested in the actual
characters represented by the byte sequence, or as long as the only
characters in this utf-8 byte sequence are ASCII characters.
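As an illustration of the difference between bytes and characters, here is a rough sketch that counts the characters (code points) in a UTF-8 byte string without decoding it. The function name utf8_count is an assumption of mine, and the sketch assumes the input is valid, NUL-terminated UTF-8. It relies on the fact that continuation bytes always have the bit pattern 10xxxxxx.

```c
#include <stddef.h>

/* Illustrative helper (hypothetical name): count code points in a
   NUL-terminated UTF-8 byte string. Every code point contributes exactly
   one byte that is NOT a continuation byte (10xxxxxx), so we count those.
   Assumes the input is valid UTF-8. */
size_t utf8_count(const unsigned char *s)
{
    size_t n = 0;
    for (; *s != 0; s++) {
        if ((*s & 0xC0) != 0x80)   /* not a continuation byte */
            n++;
    }
    return n;
}
```

On a pure ASCII string this gives the same result as strlen(); as soon as a multi-byte character appears, the byte length is larger than the count.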
By the way you cannot "translate utf-8 to ASCII", because most
characters encodable in utf-8 cannot be encoded in ASCII:
$ echo é|iconv -f utf-8 -t ascii
iconv: illegal input sequence at position 0
So you can easily process utf-8 data as a whole, without having to
translate it. What would be "normal use" for your strings?
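If you really need ASCII somewhere downstream, the only thing you can do on the byte level is check whether the data happens to be pure ASCII already. A minimal sketch, with is_all_ascii being a name I made up for the example:

```c
#include <stddef.h>

/* Illustrative helper (hypothetical name): return 1 if all len bytes are
   in the ASCII range 0..127, else 0. A UTF-8 string that passes this test
   is byte-for-byte identical to its ASCII encoding; one that fails
   contains characters that ASCII cannot represent. */
int is_all_ascii(const unsigned char *s, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++) {
        if (s[i] >= 0x80)
            return 0;
    }
    return 1;
}
```

The bytes c3 a9 (the UTF-8 encoding of é) fail this test, which is the same condition iconv complains about above.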
> Can i deal with it directory like ascii?
Globally, yes.
If you want to process the characters, in general, no.
In some cases, yes.
For example: "C'est ça la vie" is encoded in UTF-8 as these bytes:
43 27 65 73 74 20 c3 a7 61 20 6c 61 20 76 69 65
If you want to split this string on spaces (bytes 20), you can do it
as if it were encoded in ASCII, because the space in UNICODE has the
same code as in ASCII, and because UTF-8 doesn't use this code for
anything other than a space. So you can get these four subsequences of
bytes:
43 27 65 73 74
c3 a7 61
6c 61
76 69 65
which, when decoded from UTF-8 give you back these four strings:
"C'est"
"ça"
"la"
"vie"
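The split above can be sketched in C working purely on bytes. The function next_field and its parameters are hypothetical names chosen for this example; it treats runs of spaces as one separator, which is an assumption on my part and may not be what you want.

```c
#include <stddef.h>

/* Illustrative helper (hypothetical name): find the next space-delimited
   field in bytes[0..len), starting the search at *pos. On success, store
   the field's offset in *start and its length in *flen, advance *pos past
   the field, and return 1. Return 0 when no field is left.
   Safe on UTF-8 input because byte 0x20 never occurs inside a multi-byte
   sequence. */
int next_field(const unsigned char *bytes, size_t len, size_t *pos,
               size_t *start, size_t *flen)
{
    size_t i = *pos;

    while (i < len && bytes[i] == 0x20)   /* skip leading spaces */
        i++;
    if (i >= len) {
        *pos = i;
        return 0;
    }
    *start = i;
    while (i < len && bytes[i] != 0x20)   /* scan to end of field */
        i++;
    *flen = i - *start;
    *pos = i;
    return 1;
}
```

Called repeatedly on the 16 bytes of "C'est ça la vie", it should yield the fields at offsets 0, 6, 10 and 13 with lengths 5, 3, 2 and 3, matching the four byte subsequences listed above.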
If you keep in mind that in C you are not processing characters, but
bytes, and if you keep in mind the properties of the UTF-8 encoding,
then you can do a great deal without having to decode UTF-8 bytes to
characters. You may want to use:
typedef unsigned char byte;
const byte* bytes=(const byte*)"Hello World";
instead of char and string...
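One reason to prefer unsigned char: with plain char, a byte like c3 may compare as a negative number on platforms where char is signed. A small sketch of how to sidestep that (byte_value and is_non_ascii are names invented for this example):

```c
/* Illustrative helper (hypothetical name): return the value of a byte as
   0..255, whether plain char is signed or unsigned on this platform. */
int byte_value(char c)
{
    return (int)(unsigned char)c;
}

/* Illustrative helper (hypothetical name): return 1 if the byte is part of
   a multi-byte UTF-8 sequence (high bit set), 0 if it is an ASCII byte.
   Comparing a plain char against 0x80 directly would always be false
   where char is signed, hence the conversion. */
int is_non_ascii(char c)
{
    return byte_value(c) >= 0x80;
}
```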
-- __Pascal Bourguignon__ http://www.informatimago.com/ I need a new toy. Tail of black dog keeps good time. Pounce! Good dog! Good dog!