Re: tr(1) buggy with de_DE.ISO8859-1(5) locale?
- From: Oliver Fromme <olli@xxxxxxxxxxxxxxxxx>
- Date: Tue, 7 Feb 2006 16:47:59 +0100 (CET)
Martin Krzysiak <cinek@xxxxxx> wrote:
> Oliver Fromme wrote:
> > It's not a bug. It's perfectly POSIX-compatible.
> I think this behavior is "undefined" in POSIX,
That's correct. Which means that FreeBSD's tr(1) is
POSIX-compatible. And any script which assumes that
"tr a-z A-Z" works in any locale is _not_ POSIX-
compatible.
Specifically, SUSv3 (a.k.a. POSIX-2001) says:
    LC_COLLATE
        Determine the locale for the behavior of range
        expressions and equivalence classes.
And it also specifically mentions the following as an
example that must be used for case conversions:
    tr -s '[:upper:]' '[:lower:]'
> It's not only upper-lowercase conversion that is weird.
> Try "echo wxyz | tr w-z a-d". Ranges are broken generally
> in ISO-locales, in my opinion.
Ranges are not broken, they just work as defined by the
locale. It's an error to assume that "a-d" always means
the four letters a, b, c, d. That's only true in the
US-ASCII locale (a.k.a. "C" or POSIX locale).
When you're browsing in an index of German words, you
_do_ want them to be ordered correctly, don't you?
That is, you expect words starting with a-umlaut ("ä")
to be ordered along with "a", not after "z" or anywhere
else. Therefore, the collation definitions are correct,
not broken.
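
A minimal sketch of the two portable alternatives, assuming a POSIX sh. The function names shift_range and to_lower are made up for illustration. The first pins the locale to "C" for a single tr call, so the range means exactly the ASCII letters w, x, y, z; the second avoids ranges altogether by using character classes.

```shell
#!/bin/sh
# Pin the locale to "C" for this one tr invocation only.
# LC_ALL is used because it overrides LC_COLLATE and LANG if they are set.
shift_range() {
    LC_ALL=C tr w-z a-d
}

# Case conversion with character classes instead of ranges.
to_lower() {
    tr '[:upper:]' '[:lower:]'
}

echo wxyz | shift_range
echo 'Hello World' | to_lower
```

With plain ASCII input like this, the first command should print "abcd" and the second "hello world".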
By the way: Do not set LANG or LC_ALL, especially for
the root user, and especially when compiling things.
> One thing I like about FreeBSD is that I have my German
> environment.
What do you mean by "German environment"? I also have a
German environment, but I only set LC_CTYPE, not LC_ALL,
LANG or LC_COLLATE.
> But you are right. The only locale that is
> expected to work correctly is "C".
I think that all locales work correctly, as far as I can
tell. At least the German ones that I use work correctly.
The only problem is that script authors who use tr(1)
make invalid assumptions about the behaviour of ranges.
> How many times did you use tr(1) to convert your texts
> to upper/lower case? Do you expect that it works correctly?
I don't have LC_COLLATE set (or LANG or LC_ALL), so I
expect that "tr a-z A-Z" works in the usual way when
used for English texts.
I never need to convert German texts from lower case to
upper case. But if I had to do that, the following way
that you mentioned would work fine for me, too (except
that I have to convert sharp-s ("ß") to "SS" manually):
> I would prefer to use it like: "tr a-zäöü A-ZÄÖÜ",
When writing scripts, I either use the correct tr syntax
with [:lower:] [:upper:], or -- if I know that locale
support is not required -- put "unset LC_ALL LC_COLLATE
LANG" at the beginning.
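
A minimal sketch of such a script preamble, assuming a POSIX sh and input that is plain ASCII. The helper name ascii_upper is made up for illustration.

```shell
#!/bin/sh
# Drop the locale variables that influence collation.
# Only do this when the script never needs locale-aware behaviour.
unset LC_ALL LC_COLLATE LANG

# Hypothetical helper: upper-case plain ASCII text with a range expression.
# With collation back at the default "C" rules, a-z is the 26 ASCII letters.
ascii_upper() {
    tr a-z A-Z
}

echo 'freebsd-stable' | ascii_upper
```

The unset affects the whole script and every command it starts, so it belongs at the very top, before anything locale-sensitive runs.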
Note that tr(1) is not appropriate to perform non-English
case conversions in general. For example, it never
handles the German sharp-s ("ß") correctly, no matter how
you set your locale, and no matter what syntax you use
with tr. This is a limitation which cannot be easily
solved, unfortunately. And German is easy ... There are
languages with more complicated rules. For example, in
Turkish, the letter "I" is not the upper-case of "i".
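
A rough sketch of the manual sharp-s step, assuming a POSIX sh with sed available, and assuming the script file and its input use the same encoding. The helper name german_upper is made up for illustration. Because tr maps one character to one character, "ß" has to be expanded to "SS" before tr sees it. Whether the umlauts are converted afterwards still depends on the tr implementation and on LC_CTYPE.

```shell
#!/bin/sh
# Hypothetical helper: upper-case German text.
# Step 1: expand sharp-s to "SS" with sed, since tr cannot map
#         one character to two.
# Step 2: let tr convert the remaining letters via character classes.
german_upper() {
    sed 's/ß/SS/g' | tr '[:lower:]' '[:upper:]'
}

echo 'Maß' | german_upper
```

This only covers the German case; it does nothing for languages such as Turkish, where the mapping itself differs.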
> For people who are interested in a simple workaround.
> Don't use de_DE.ISO8859-1(5). Instead use de_DE.UTF-8.
> tr(1)'s ranges work like expected there.
tr's ranges _always_ work as expected, given how locales
work (especially LC_COLLATE). Using UTF-8 encoding
doesn't guarantee that 'a-z' works for case conversions
either. The _only_ reliable way is to use character
classes, as mentioned several times.
Best regards
Oliver
--
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing
Dienstleistungen mit Schwerpunkt FreeBSD: http://www.secnetix.de/bsd
Any opinions expressed in this message may be personal to the author
and may not necessarily reflect the opinions of secnetix in any way.
'Instead of asking why a piece of software is using "1970s technology,"
start asking why software is ignoring 30 years of accumulated wisdom.'