Re: tr(1) buggy with de_DE.ISO8859-1(5) locale?



Martin Krzysiak <cinek@xxxxxx> wrote:
Oliver Fromme wrote:
It's not a bug. It's perfectly POSIX-compatible.

I think this behavior is "undefined" in POSIX,

That's correct. Which means that FreeBSD's tr(1) is
POSIX-compatible. And any script which assumes that
"tr a-z A-Z" works in any locale is _not_ POSIX-
compatible.

Specifically, SUSv3 (a.k.a. POSIX-2001) says:

LC_COLLATE
Determine the locale for the behavior of range
expressions and equivalence classes.

And it also specifically mentions the following as an
example that must be used for case conversions:

tr -s '[:upper:]' '[:lower:]'

It's not only upper-lowercase conversion that is weird.
Try "echo wxyz | tr w-z a-d". Ranges are broken generally
in ISO-locales, in my opinion.

Ranges are not broken, they just work as defined by the
locale. It's an error to assume that "a-d" always means
the four letters a, b, c, d. That's only true in the
US-ASCII locale (a.k.a. "C" or POSIX locale).
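
As a minimal sketch of that point: forcing the "C" locale for a
single command makes the range mean exactly those four letters. The
function name c_locale_tr is made up for illustration; it is not part
of tr(1) or of any standard.

```shell
# Hypothetical wrapper: run tr with the POSIX ("C") locale for this
# one command only, without touching the caller's environment.
c_locale_tr() {
    LC_ALL=C tr "$@"
}

# In the "C" locale the range w-z is exactly w, x, y, z,
# so this prints: abcd
printf 'wxyz\n' | c_locale_tr w-z a-d
```

Setting LC_ALL on the command line only affects that one tr
invocation, so the rest of the session keeps its locale.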

When you're browsing in an index of German words, you
_do_ want them to be ordered correctly, don't you?
That is, you expect words starting with a-umlaut ("ä")
to be ordered along with "a", not after "z" or anywhere
else. Therefore, the collation definitions are correct,
not broken.

By the way: Do not set LANG or LC_ALL, especially for
the root user, and especially when compiling things.

One thing I like about FreeBSD is that I have my German
environment.

What do you mean by "German environment"? I also have a
German environment, but I only set LC_CTYPE, not LC_ALL,
LANG or LC_COLLATE.

But you are right. The only locale that is
expected to work correctly is "C".

I think that all locales work correctly, as far as I can
tell. At least the German ones that I use work correctly.

The only problem is that script authors who use tr(1)
make invalid assumptions about the behaviour of ranges.

How many times did you use tr(1) to convert your texts
to upper/lower case? Do you expect that it works correctly?

I don't have LC_COLLATE set (or LANG or LC_ALL), so I
expect that "tr a-z A-Z" works in the usual way when
used for English texts.

I never need to convert German texts from lower case to
upper case. But if I had to do that, the following way
that you mentioned would work fine for me, too (except
that I have to convert sharp-s ("ß") to "SS" manually):

I would prefer to use it like: "tr a-zäöü A-ZÄÖÜ",

When writing scripts, I either use the correct tr syntax
with [:lower:] [:upper:], or -- if I know that locale
support is not required -- put "unset LC_ALL LC_COLLATE
LANG" at the beginning.
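
The two approaches could look roughly like this in a script. This is
only a sketch: the function names to_upper and to_upper_ascii are
made up, and the second variant assumes that with these three
variables unset the collation order falls back to the "C" locale,
which is the usual behaviour.

```shell
# Variant 1: character classes. Which characters count as lower
# case is decided by LC_CTYPE; no range is involved.
to_upper() {
    tr '[:lower:]' '[:upper:]'
}

# Variant 2: drop the locale settings that influence ranges, then
# use plain ranges. The subshell keeps the unset local to this
# function, so the caller's environment is not changed.
to_upper_ascii() {
    (
        unset LC_ALL LC_COLLATE LANG
        tr a-z A-Z
    )
}

printf 'hello world\n' | to_upper
printf 'hello world\n' | to_upper_ascii
```

Variant 2 only makes sense for text where locale support is not
required, such as plain English input.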

Note that tr(1) is not appropriate to perform non-English
case conversions in general. For example, it never
handles the German sharp-s ("ß") correctly, no matter how
you set your locale, and no matter what syntax you use
with tr. This is a limitation which cannot be easily
solved, unfortunately. And German is easy ... There are
languages with more complicated rules. For example, in
Turkish, the letter "I" is not the upper-case of "i".
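
For the sharp-s case, the manual step could be sketched as a sed(1)
substitution in front of tr. The helper name de_upper is made up, and
the sketch assumes that the script and its input use the same
encoding, so that the literal "ß" in the sed expression matches the
bytes in the text. Whether the umlauts are converted as well still
depends on the tr implementation and on LC_CTYPE.

```shell
# Hypothetical helper: expand sharp-s to "SS" first, because tr can
# only map one character to one character, then upper-case the rest.
de_upper() {
    sed 's/ß/SS/g' | tr '[:lower:]' '[:upper:]'
}

printf 'straße\n' | de_upper
```

This does nothing for the Turkish case; it only covers the one
German character that needs a two-letter replacement.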

For people who are interested in a simple workaround:
don't use de_DE.ISO8859-1(5). Instead use de_DE.UTF-8.
tr(1)'s ranges work as expected there.

tr's ranges _always_ work as expected, given how locales
work (especially LC_COLLATE). Using UTF-8 encoding
doesn't guarantee that 'a-z' works for case conversions
either. The _only_ reliable way is to use character
classes, as mentioned several times.

Best regards
Oliver

--
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing
Dienstleistungen mit Schwerpunkt FreeBSD: http://www.secnetix.de/bsd
Any opinions expressed in this message may be personal to the author
and may not necessarily reflect the opinions of secnetix in any way.

'Instead of asking why a piece of software is using "1970s technology,"
start asking why software is ignoring 30 years of accumulated wisdom.'


