Removing BOM from UTF-8



I have a large number of text files created in MS Word and saved in
UTF-8 format. Unfortunately, MS Word adds the BOM to each file. I need
to remove the BOM.

Information regarding BOM and UTF-8 can be found here:

http://www.cl.cam.ac.uk/~mgk25/unicode.html
http://www.w3.org/International/questions/qa-utf8-bom

A brief excerpt:

It has also been suggested to use the UTF-8 encoded BOM (0xEF 0xBB 0xBF)
as a signature to mark the beginning of a UTF-8 file. This practice
should definitely not be used on POSIX systems for several reasons:

* On POSIX systems, the locale and not magic file type codes define
the encoding of plain text files. Mixing the two concepts would add a
lot of complexity and break existing functionality.

* Adding a UTF-8 signature at the start of a file would interfere
with many established conventions such as the kernel looking for “#!” at
the beginning of a plaintext executable to locate the appropriate
interpreter.

* Handling BOMs properly would add undesirable complexity even to
simple programs like cat or grep that mix contents of several files into
one.
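
To make the byte sequence and the "#!" problem from the excerpt concrete, here is a minimal Python sketch. The function name starts_with_utf8_bom and the sample script contents are made up for illustration; codecs.BOM_UTF8 is the standard library constant for the bytes 0xEF 0xBB 0xBF.

```python
import codecs


def starts_with_utf8_bom(data: bytes) -> bool:
    """Return True if data begins with the UTF-8 encoded BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


# Hypothetical script contents, used only for illustration.
script = b"#!/bin/sh\necho hello\n"
with_bom = codecs.BOM_UTF8 + script

print(starts_with_utf8_bom(script))    # False: no signature present
print(starts_with_utf8_bom(with_bom))  # True: file starts with EF BB BF
print(with_bom.startswith(b"#!"))      # False: the BOM now precedes "#!"
```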

It has been suggested that a script could be written to strip the
BOM from one or more files. My script writing skills suck, and I have
been unable to find such a script using Google, so I was hoping that
someone might know where I could find such a program, or perhaps give
me an idea of how to script one.
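
For reference, the kind of script being asked about might look roughly like the sketch below. It is only one possible approach, not a tested tool: it assumes Python 3 is installed, that each file fits in memory, and that rewriting the files in place is acceptable (keep backups). The name strip_bom is hypothetical.

```python
import codecs
import sys
from pathlib import Path


def strip_bom(path: Path) -> bool:
    """Remove a leading UTF-8 BOM from the file at path, in place.

    Returns True if a BOM was found and removed, False otherwise.
    """
    data = path.read_bytes()  # reads the whole file into memory
    if not data.startswith(codecs.BOM_UTF8):
        return False  # no BOM: leave the file untouched
    path.write_bytes(data[len(codecs.BOM_UTF8):])
    return True


if __name__ == "__main__":
    # Each command-line argument is treated as a file to clean.
    for name in sys.argv[1:]:
        if strip_bom(Path(name)):
            print(f"removed BOM: {name}")
```

Saved under a name of your choosing, say strip_bom.py, it could be run as "python3 strip_bom.py *.txt". Files without a BOM are not modified, so running it twice is harmless.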

Thanks!

--
Gerard Seibert
gerard@xxxxxxxxxxxxx


I'm interested in the fact that the less secure a man is, the more
likely he is to have extreme prejudice.

Clint Eastwood
_______________________________________________
freebsd-questions@xxxxxxxxxxx mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to "freebsd-questions-unsubscribe@xxxxxxxxxxx"


