Re: replacing character range to remove diacritics
- From: Bill Marcum <marcumbill@xxxxxxxxxxxxx>
- Date: Mon, 04 Jun 2007 13:40:03 GMT
On Mon, 04 Jun 2007 04:34:18 -0700, Peter
<mailbox@xxxxxxxxxxxxxx> wrote:
Hi,
I have made several attempts to remove non-ASCII/diacritic characters
from a file using sed, still without luck.
First of all, I am working on an AIX 5.2 system and have to use
standard sed; GNU sed is not available. The test file I used contained
two lines:
abcdefghijklmnopqrstuvwxyzABCEDEGHIJKLMNOPQRSTUVWXYZ
diàcriet
A hex display of the file:
0 61626364 65666768 696a6b6c 6d6e6f70 |abcdefghijklmnop|
10 71727374 75767778 797a4142 43454445 |qrstuvwxyzABCEDE|
20 4748494a 4b4c4d4e 4f505152 53545556 |GHIJKLMNOPQRSTUV|
30 5758595a 0a6469e0 63726965 740a |WXYZ di criet |
When I try to replace all characters in the range 127-255 (x7f-xff)
using the following sed statement:
sed -e 's/[\O177-\O377]/!/g' test.txt
The result is:
abcdefghijklmnopqrstuvwxyz!!!!!!!!!!!!!!!!!!!!!!!!!!
diàcriet
First, I cannot explain why A-Z is replaced, can you?
Second, is there a nice solution for this problem? Currently I have
solved it in Perl, using unpack to detect the value of the
most significant bit of each character.

The \ character loses its special meaning inside brackets, so
[\0177-\0377] matches the literal characters \ 0 1 3 7 or the range 7-\.
That range (0x37-0x5c in ASCII) includes A-Z, which is why the
uppercase letters were replaced.
You could use your text editor to insert the literal \0177 and \0377
characters into the brackets, or use an expression like
LANG=C sed 's/[^[:alpha:][:digit:][:space:][:punct:]]/!/g'
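
As a rough sketch of the "literal characters" idea without going through a
text editor: the shell's printf does interpret octal escapes, so it can
produce the two bytes, which are then interpolated into the bracket
expression. The helper names mark_high_sed and mark_high_tr are made up for
illustration, and the second one swaps in tr (not sed) as an alternative,
since tr understands octal escapes itself. This assumes a POSIX shell and a
C locale in which every byte is one character; it has not been checked on
AIX 5.2.

```shell
#!/bin/sh
# printf interprets octal escapes, so it can produce the literal bytes
# that sed will not build from \177 or \377 inside brackets.
lo=$(printf '\177')
hi=$(printf '\377')

# mark_high_sed: hypothetical helper name; reads stdin, writes stdout.
# LC_ALL=C asks sed to treat every byte as a single character.
mark_high_sed() {
    LC_ALL=C sed "s/[${lo}-${hi}]/!/g"
}

# mark_high_tr: the same replacement done with tr instead of sed.
# tr understands octal escapes, and [!*] repeats ! to the length
# of the first set.
mark_high_tr() {
    LC_ALL=C tr '\177-\377' '[!*]'
}

printf 'di\340criet\n' | mark_high_sed   # prints: di!criet
printf 'di\340criet\n' | mark_high_tr    # prints: di!criet
```

Either function can be used as a filter, for example
mark_high_sed < test.txt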
--
You shall be rewarded for a dastardly deed.