Re: UTF-16 file input, C programming.
- From: Rene Hardi Hansen <rene@xxxxxxx>
- Date: Mon, 5 Mar 2007 14:52:43 +0100
On Mon, 5 Mar 2007, Ulrich Eckhardt wrote:
Rene Hardi Hansen wrote:
I have a pet project I would like to build in C. I am lacking some
knowledge on how to do it though.
The project plan is this.
Program input:
UTF-16 encoded text files, containing certain data and
filepaths, needed for extraction.
Program output:
The said specific data and filepaths, neatly displayed without all the
"extra" stuff from the files. UTF-16 or UTF-8, I haven't really decided
yet.
Transform the UTF-16 to UTF-8 first. Then, you can simply use char-based
strings, ignoring that they are UTF-8 encoded. BTW, you wouldn't even need
to use C for that, as it is too sharp a tool for the hobby programmer.
I was actually planning to do that, since UTF-8 is friendlier to work with in C. However, you are only partly correct. All standard ASCII characters are mapped to a single byte, as you mention, so those can be treated as plain chars. That is not true for the special characters I have to take into consideration, meaning anything above 127, which is outside the ASCII range. UTF-8 only maps the standard ASCII characters to one byte; anything above is represented by two or more bytes.
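To make the byte counts concrete, here is a minimal sketch of a UTF-8 encoder in C. The function name utf8_encode is hypothetical and not taken from any library; it only illustrates that code points up to 127 take one byte and everything above takes two to four.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
 * Writes 1 to 4 bytes into out and returns the byte count.
 * Returns 0 for surrogates (U+D800..U+DFFF) and values above U+10FFFF,
 * which are not valid code points to encode. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                        /* ASCII: one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)       /* lone surrogate: reject */
        return 0;
    if (cp < 0x10000) {                     /* three bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* out of range */
}
```

For example, 'A' (U+0041) comes out as the single byte 41, while 'é' (U+00E9) becomes the two bytes C3 A9, so byte-oriented string code still works but one char is no longer one character.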
My reason for choosing C is that I need minimal resource usage. The program will run on a 266 MHz ARM processor, and therefore I want complete control over, and access to, the optimization of the program.
I think I'll try to read in the file byte by byte and do all the conversions myself. I believe unicode.org has some source code providing functions that convert UTF-16 surrogate pairs into UTF-8 multibyte characters, but I will have to look into that.
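As a rough illustration of the surrogate-pair step, the sketch below decodes one code point from a sequence of UTF-16 code units. It is not the unicode.org code; the name utf16_decode and the choice to substitute U+FFFD for an unpaired surrogate are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHAR 0xFFFDu  /* assumed policy for malformed input */

/* Hypothetical helper: decode one code point from the n UTF-16 code
 * units starting at in. Stores the result in *cp and returns the
 * number of units consumed (1 or 2), or 0 if n is 0.
 * An unpaired surrogate yields U+FFFD and consumes one unit. */
size_t utf16_decode(const uint16_t *in, size_t n, uint32_t *cp)
{
    if (n == 0)
        return 0;

    uint16_t first = in[0];

    if (first >= 0xD800 && first <= 0xDBFF) {          /* high surrogate */
        if (n >= 2 && in[1] >= 0xDC00 && in[1] <= 0xDFFF) {
            uint32_t hi = (uint32_t)(first - 0xD800);  /* top 10 bits */
            uint32_t lo = (uint32_t)(in[1] - 0xDC00);  /* low 10 bits */
            *cp = 0x10000u + ((hi << 10) | lo);
            return 2;
        }
        *cp = REPLACEMENT_CHAR;                        /* missing low half */
        return 1;
    }
    if (first >= 0xDC00 && first <= 0xDFFF) {          /* stray low surrogate */
        *cp = REPLACEMENT_CHAR;
        return 1;
    }
    *cp = first;                                       /* BMP code point */
    return 1;
}
```

The resulting code point can then be passed to whatever UTF-8 encoder is used. Everything below U+10000 needs no pairing at all, so on typical text the surrogate branch is rarely taken.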
If I go about reading in the input using the standard getc(), it pretty much
works, but... if it reads in some of those special characters, meaning
those above 127 in the standard ASCII, it outputs question marks. � Dunno if they
show correctly, but it's just a replacement character, for when the
character is unknown.
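A plausible explanation for the question marks: getc() returns single bytes, and a UTF-16 byte such as E9 written on its own is not a valid UTF-8 sequence, so a UTF-8 terminal shows a replacement character. The sketch below reads whole 16-bit code units with getc() instead. The function names are hypothetical, and it assumes a seekable file read from its start.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: look for a UTF-16 byte order mark at the start
 * of fp. Returns 1 for big-endian (FE FF), 0 for little-endian (FF FE).
 * If there is no BOM, rewinds to the start and returns -1 so the
 * caller can choose a default byte order. */
int detect_utf16_byte_order(FILE *fp)
{
    int b0 = getc(fp);
    int b1 = (b0 == EOF) ? EOF : getc(fp);

    if (b0 == 0xFE && b1 == 0xFF)
        return 1;
    if (b0 == 0xFF && b1 == 0xFE)
        return 0;

    rewind(fp);   /* no BOM: those bytes were data, read them again */
    return -1;
}

/* Hypothetical helper: read one 16-bit code unit using getc().
 * big_endian selects the byte order. Returns 1 on success,
 * 0 on end of file or when only half a unit is left. */
int read_utf16_unit(FILE *fp, int big_endian, uint16_t *unit)
{
    int b0 = getc(fp);
    if (b0 == EOF)
        return 0;
    int b1 = getc(fp);
    if (b1 == EOF)
        return 0;

    if (big_endian)
        *unit = (uint16_t)((b0 << 8) | b1);
    else
        *unit = (uint16_t)((b1 << 8) | b0);
    return 1;
}
```

Each code unit read this way still has to be checked for surrogates and re-encoded as UTF-8 before being written to a UTF-8 terminal.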
You should make sure that your environment is working first, this rather
looks like something isn't prepared to handle UTF-8 strings in the output
stream. My suggestion is to install 'yudit', an editor that is capable of
reading and writing several encodings, including Unicode encodings.
I know my environment can display UTF-8 output; that's not the problem. I'm using Ubuntu as my development platform, and UTF-8 is the system-wide default locale.
The default editor in Ubuntu is Gedit, and it can display my UTF-16 files fine. It tries to open files as UTF-8 by default and, if that is not possible, asks you to choose another encoding, where you can just set it to UTF-16.
Thank you still, for your reply.
/René
Uli
--
http://www.erlenstar.demon.co.uk/unix/