Re: How to get the encoding table ?

From: Enrique Perez-Terron (enrio_at_online.no)
Date: 09/27/05


Date: Tue, 27 Sep 2005 22:00:24 +0200

On Tue, 27 Sep 2005 06:39:08 +0200, nix <dengcfei@gmail.com> wrote:

>
> Enrique Perez-Terron wrote:
>
>> On Mon, 26 Sep 2005 04:58:57 +0200, nix <dengcfei@gmail.com> wrote:
>>
>> > I want to input multibyte data into a file for testing.
>> > for example:
>> > bash:> printf "\200\254" > file1
>> > so I want to get the hexadecimal value of the corresponding multibyte
>> > character.

[...]

> JIS X 0208 and JIS X 0212 are the most popular Japanese Industrial
> Standard character sets. They use the form <j-row-column> to represent
> a specific Japanese character such as 平仮名 and 漢字. But they are
> not encodings.
> Common encodings are Shift-JIS and EUC-JP.
>
> My question is: when I have a multibyte character such as 間 (a kanji
> character in Japanese), how can I get the corresponding encoding value?
> It depends on the charmap used by the current locale.

    $ echo 間 | od -t x1
    0000000 e9 96 93 0a
    0000004

Since I have a locale based on UTF-8, the response I get is the UTF-8
encoding (the final 0a is the newline added by echo). In this way you
get the character encoding for your current locale.
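
The same idea can be wrapped in a small function. This is only a sketch: the name hexbytes is made up for illustration, and it uses printf '%s' so that only the character's own bytes are shown. The bytes you see are whatever your terminal or script file feeds in, so the result assumes the text is stored as UTF-8.

```shell
# hexbytes STRING: print the bytes of STRING as space-separated hex.
# (hexbytes is a hypothetical helper name, not a standard utility.)
hexbytes() {
    # od -An drops the offset column, -v disables "*" line folding;
    # tr and sed squeeze the output onto one trimmed line.
    printf '%s' "$1" | od -An -v -t x1 | tr -s ' \n' ' ' | sed 's/^ *//; s/ *$//'
}

hexbytes '間'; echo    # e9 96 93 when this text is stored as UTF-8
```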

In addition there is the "iconv" program, if you need to see the bytes
in some other encoding. For instance, I can do

    $ echo 平仮名 and 漢字 | iconv --from utf-8 --to sjis | od -t x1
    0000000 95 bd 89 bc 96 bc 20 61 6e 64 20 8a bf 8e 9a 0a
    0000020
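
To compare several encodings at once, something along these lines may help. It is a sketch under assumptions: bytes_in is a made-up name, the input is taken to be UTF-8, and the encoding names SHIFT_JIS and EUC-JP are the spellings accepted by GNU iconv (other iconv implementations may want different aliases, see "iconv -l").

```shell
# bytes_in ENCODING STRING: show the bytes of a UTF-8 STRING after
# conversion to ENCODING. (bytes_in is a hypothetical helper name.)
bytes_in() {
    printf '%s' "$2" | iconv -f UTF-8 -t "$1" | od -An -v -t x1 |
        tr -s ' \n' ' ' | sed 's/^ *//; s/ *$//'
}

# One line per encoding; encoding names are assumptions about your iconv.
for enc in UTF-8 SHIFT_JIS EUC-JP; do
    printf '%-10s %s\n' "$enc" "$(bytes_in "$enc" '漢字')"
done
```

The SHIFT_JIS line should show the same 8a bf 8e 9a that appears for 漢字 at the end of the od output above.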

If you want octal, rather than hex output, use "-t o1" instead of "-t x1".
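
The octal form is what printf expects in escapes such as "\200\254", so the od output can be turned back into a printf argument. A sketch, assuming UTF-8 input and a non-empty string; octal_escapes is a made-up name, and file1 is just the file name from the original question.

```shell
# octal_escapes STRING: print STRING as a run of \ooo escapes for printf.
# (octal_escapes is a hypothetical helper; STRING must not be empty.)
octal_escapes() {
    printf '%s' "$1" | od -An -v -t o1 | tr -s ' \n' ' ' |
        sed 's/^ *//; s/ *$//; s/ /\\/g; s/^/\\/'
}

esc=$(octal_escapes '間')
printf '%s\n' "$esc"     # shows the escape string itself
printf "$esc" > file1    # writes the raw bytes to file1
od -An -t x1 file1       # check what ended up in the file
```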

I should perhaps mention that I am using gnome-terminal and bash.
Other environments may not be able to handle the direct input of CJK
characters, or even the pasting technique I used.

> Is there any utility like "locale -xxx" or "charmap -l" to display the
> characters and encoding values of the charmap?

I am not aware of any such utility.

This reminds me that at some half-forgotten point in the past I poked
around in the files underlying various parts of the X Window System,
the keyboard extension, the locale system, etc., and was surprised to
find ASCII files listing the long name of each character in various
character sets, together with some other information, perhaps X keysym
numbers and encoding values.

However, when I tried to find them again just now, I had no success.

> References:
> http://www.rikai.com/library/kanjitables/kanji_codes.sjis.shtml
> http://www.rikai.com/library/kanjitables/kanji_codes.euc.shtml

These seem to pretty much provide what you failed to find according
to your original post:

>> > Now, my locale is Ja_JP, maybe the character set is the JIS X 0208 and
>> > JIS X 0212. I googled it and get the character table but it just lists
>> > the character and <row-column> form, for example:
>> > <j0761><j1604>
>> > I can't get the hexadecimal value either.

Am I understanding you correctly?
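
If only the row and column numbers are at hand, the byte values for JIS X 0208 characters can also be computed. The sketch below rests on assumptions: the helper names are made up, the arguments are the decimal row and column (1 to 94, without leading zeros), and whether the four digits in a name like <j0761> really are row and column is not something the original post confirms. EUC-JP adds 0xA0 to each number; Shift-JIS packs two rows into one lead byte.

```shell
# Hypothetical helpers: JIS X 0208 row and column (decimal, 1..94,
# no leading zeros) to the two bytes used by EUC-JP and Shift-JIS.
kuten_to_eucjp() {
    # EUC-JP: add 0xA0 to both the row and the column.
    printf '%02x %02x\n' $(( $1 + 0xA0 )) $(( $2 + 0xA0 ))
}

kuten_to_sjis() (
    row=$1 col=$2
    # Lead byte: two rows share one lead byte, starting at 0x81;
    # rows 63..94 continue from 0xE0 instead of 0xA0.
    lead=$(( (row - 1) / 2 + 0x81 ))
    if [ "$row" -gt 62 ]; then lead=$(( lead + 0x40 )); fi
    if [ $(( row % 2 )) -eq 1 ]; then
        # Odd row: trail bytes 0x40..0x9E, skipping 0x7F.
        trail=$(( col + 0x3F ))
        if [ "$col" -ge 64 ]; then trail=$(( trail + 1 )); fi
    else
        # Even row: trail bytes 0x9F..0xFC.
        trail=$(( col + 0x9E ))
    fi
    printf '%02x %02x\n' "$lead" "$trail"
)

# 漢 sits at row 20, column 33 of JIS X 0208.
kuten_to_eucjp 20 33
kuten_to_sjis 20 33    # 8a bf, as in the iconv output for 漢 above
```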

Cheers,
Enrique


