Re: UTF-16 file input, C programming.



On Mon, 5 Mar 2007, Ulrich Eckhardt wrote:

Rene Hardi Hansen wrote:
I have a pet project I would like to build in C. I am lacking some
knowledge on how to do it though.

The project plan is this.

Program input:

UTF-16 encoded text files containing certain data and
file paths that need to be extracted.

Program output:

The specific data and file paths mentioned above, neatly displayed without
all the "extra" stuff from the files. UTF-16 or UTF-8, I haven't really
decided yet.

Transform the UTF-16 to UTF-8 first. Then, you can simply use char-based
strings, ignoring that they are UTF-8 encoded. BTW, you wouldn't even need
to use C for that, as it is too sharp a tool for the hobby programmer.

I was actually planning to do that, since UTF-8 is friendlier to work with in C. However, you are only partly correct: it is true that all standard ASCII characters map to a single byte, as you mention. That does not hold for the special characters I have to take into account, meaning any code point above 127, outside the ASCII range. UTF-8 encodes only the standard ASCII characters in one byte; everything above that takes two or more bytes (up to four).
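
To make the byte counts concrete, here is a minimal sketch of a UTF-8 encoder for a single code point. The function name utf8_encode is made up for illustration and is not taken from any library; it assumes the caller already has the code point as a 32-bit integer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
 * Writes up to 4 bytes to out and returns the number of bytes written,
 * or 0 if cp is a surrogate (U+D800..U+DFFF) or above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* ASCII: one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* U+0080..U+07FF: two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)     /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                   /* rest of the BMP: three bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* supplementary planes: four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* out of range */
}
```

With this, 'A' (U+0041) comes out as one byte, while 'é' (U+00E9) comes out as the two bytes C3 A9.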

My reason for choosing C is that I need minimal resource usage. This program will run on a 266 MHz ARM processor, so I want complete control over how the program is optimized.


I think I'll try to read in the file byte by byte and do all the conversions myself. I believe unicode.org has some source code providing functions that can convert UTF-16 surrogate pairs into UTF-8 multibyte characters, but I will have to look into that.
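
A rough sketch of the do-it-yourself decoding step, not the unicode.org code: it pairs up bytes into 16-bit code units and combines surrogate pairs into one code point. The names read_unit and utf16_next are invented for this example, and it assumes the file contents are already in a buffer and that the caller has worked out the byte order (for instance from a leading FF FE or FE FF byte order mark).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: build one 16-bit code unit from two bytes. */
static uint16_t read_unit(const unsigned char *p, int big_endian)
{
    return big_endian ? (uint16_t)((p[0] << 8) | p[1])
                      : (uint16_t)((p[1] << 8) | p[0]);
}

/* Hypothetical helper: decode the next code point from buf at *pos.
 * On success stores it in *cp, advances *pos and returns 1.
 * Returns 0 at end of input or on malformed data (a truncated
 * code unit or an unpaired surrogate); *pos is then left unchanged. */
int utf16_next(const unsigned char *buf, size_t len, size_t *pos,
               int big_endian, uint32_t *cp)
{
    uint16_t hi, lo;

    if (*pos + 2 > len)
        return 0;
    hi = read_unit(buf + *pos, big_endian);

    if (hi < 0xD800 || hi > 0xDFFF) {     /* ordinary BMP character */
        *cp = hi;
        *pos += 2;
        return 1;
    }
    if (hi >= 0xDC00)                     /* low surrogate with no high one */
        return 0;

    if (*pos + 4 > len)                   /* high surrogate needs a partner */
        return 0;
    lo = read_unit(buf + *pos + 2, big_endian);
    if (lo < 0xDC00 || lo > 0xDFFF)
        return 0;

    /* Combine the two 10-bit halves and add the 0x10000 offset. */
    *cp = 0x10000 + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
    *pos += 4;
    return 1;
}
```

Each decoded code point can then be written out as UTF-8. Copying the raw UTF-16 bytes straight to a UTF-8 terminal is one likely way to end up with replacement characters for anything outside ASCII.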


If I go about reading in the input using standard getc(), it pretty much
works, but... if it reads in some of those special characters, meaning
anything above 127 (outside standard ASCII), it outputs question marks.
� Dunno if they show correctly, but it's just a replacement character,
used when the character is unknown.

You should make sure that your environment is working first; this looks
rather like something isn't prepared to handle UTF-8 strings in the output
stream. My suggestion is to install 'yudit', an editor that is capable of
reading and writing several encodings, including Unicode encodings.


I know my environment can display UTF-8 output, that's not the problem. I'm using Ubuntu as my development platform and UTF-8 is the system-wide standard locale.

The editor included with Ubuntu by default is Gedit, and it displays my UTF-16 files fine. It tries to open files as UTF-8 by default and, if that fails, asks you to choose another encoding, where you can just set it to UTF-16.


Thank you for your reply all the same.

/René


Uli

--
http://www.erlenstar.demon.co.uk/unix/

