Re: replacing character range to remove diacritics



On Mon, 04 Jun 2007 04:34:18 -0700, Peter
<mailbox@xxxxxxxxxxxxxx> wrote:


Hi,

I have made several attempts to remove non-ASCII characters (diacritics)
from a file using sed, so far without luck.

First of all, I am working on an AIX 5.2 system and have to use the
standard sed; GNU sed is not available. The test file I used contains
two lines:

abcdefghijklmnopqrstuvwxyzABCEDEGHIJKLMNOPQRSTUVWXYZ
diàcriet

A hex display of the file:
0 61626364 65666768 696a6b6c 6d6e6f70 |abcdefghijklmnop|
10 71727374 75767778 797a4142 43454445 |qrstuvwxyzABCEDE|
20 4748494a 4b4c4d4e 4f505152 53545556 |GHIJKLMNOPQRSTUV|
30 5758595a 0a6469e0 63726965 740a |WXYZ di criet |


When I try to replace all characters in the range 127-255 (x7f-xff)
using the following sed statement:

sed -e 's/[\O177-\O377]/!/g' test.txt

the result is:

abcdefghijklmnopqrstuvwxyz!!!!!!!!!!!!!!!!!!!!!!!!!!
diàcriet

First, I cannot explain why A-Z is replaced; can you?
Secondly, is there a nice solution for this problem? Currently I have
solved it in Perl, using unpack to detect the value of the most
significant bit of each character.

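The \ character loses its special meaning inside brackets, so
[\0177-\0377] matches the literal characters \ 0 1 3 7 or the range 7-\.
In ASCII that range runs from 0x37 to 0x5c and so includes all of A-Z,
which is why the uppercase letters were replaced, while the à (0xe0)
is not in the set at all.
You could use your text editor to insert the literal bytes with octal
values 177 and 377 into the brackets, or use an expression like
LANG=C sed 's/[^[:alpha:][:digit:][:space:][:punct:]]/!/g'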
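
If typing the raw bytes in an editor is awkward, one possible alternative is to let printf build them, since printf does interpret octal escapes. This is only a sketch, not something tested on AIX 5.2: the helper names mask_high_bytes and mask_high_bytes_tr are made up for illustration, and the range starts at \200 (high bit set) on the assumption that DEL does not matter; use \177 instead to cover 127-255 exactly.

```shell
# Sketch only: mask_high_bytes and mask_high_bytes_tr are hypothetical names.

# printf understands octal escapes, so it can produce the boundary bytes
# that sed will not build from \200-style text inside brackets.
lo=$(printf '\200')   # lowest byte with the high bit set
hi=$(printf '\377')   # highest possible byte value

# LC_ALL=C asks sed to treat the input as single bytes, so the range
# is assumed to be evaluated by byte value.
mask_high_bytes() {
    LC_ALL=C sed "s/[${lo}-${hi}]/!/g"
}

# tr interprets octal escapes itself; [!*] repeats ! to the length
# of the first set.
mask_high_bytes_tr() {
    LC_ALL=C tr '\200-\377' '[!*]'
}

# Same shape as the test file: plain ASCII line, then a line with 0xe0.
printf 'abcXYZ\ndi\340criet\n' | mask_high_bytes
# prints:
# abcXYZ
# di!criet
```

The tr variant avoids bracket expressions altogether, which sidesteps the backslash issue described above.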

--
You shall be rewarded for a dastardly deed.