Re: sed applying the same regexp twice
From: laura fairhead (laura_fairhead_at_INVALID.com)
Date: 11/04/03
- Next message: Stonestream: "Re: echo foo |grep foo && exit || wrong"
- Previous message: bsh_at_iname.com: "Re: Need Regex Calculator"
- In reply to: Hartmut Schäfer: "sed applying the same regexp twice"
- Next in thread: Stephane CHAZELAS: "Re: sed applying the same regexp twice"
- Reply: Stephane CHAZELAS: "Re: sed applying the same regexp twice"
- Reply: Hartmut Schäfer: "Re: sed applying the same regexp twice"
Date: Tue, 4 Nov 2003 00:19:44 +0000
In article <c8fcab4d.0311030417.618ea33b@posting.google.com>,
Hartmut Schäfer wrote:
>Hi,
>
>I wrote the following sed script
Hello,
>
> sed 's/[^\.]*/-/g'
The backslash shouldn't be there. In a bracket expression the period
isn't a special character, and the backslash is taken literally, so
[^\.] excludes backslashes as well as periods.
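A quick way to see the effect of the stray backslash, as a minimal sketch: the sample string a\b.c is made up for illustration, and the result assumes a sed that follows the POSIX rule that a backslash is an ordinary character inside a bracket expression.

```shell
# Inside [...] the backslash is an ordinary character, so [^\.] means
# "anything except a backslash or a period".
sample='a\b.c'
with_backslash=$(printf '%s\n' "$sample" | sed 's/[^\.][^\.]*/-/g')
without_backslash=$(printf '%s\n' "$sample" | sed 's/[^.][^.]*/-/g')
printf '%s\n%s\n' "$with_backslash" "$without_backslash"
```

With such a sed the first form leaves the backslash in place (-\-.-), while the second treats it as part of the run of non-periods (-.-).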
>
>to filter a list of MVS datasets, having the names on all hierarchy
>levels substituted by a dash (for doing further processing on the
>result). Thus, the line
>
> ZLS0.DSNDBC.DZL2ZLS.SEXCEP01.I0001.A001
>
>should be transformed into
>
> -.-.-.-.-.-
>
>I read the script above as "substitute every sequence of non-dots by
>one dash". (OK, *empty* sequences of non-dots will match too, but
>since in the input there are no sequences of dots with nothing between
>them, with the expected input this should work as intended.)
>
>Indeed, under AIX and Solaris
>
> echo ZLS0.DSNDBC.DZL2ZLS.SEXCEP01.I0001.A001 | sed 's/[^\.]*/-/g'
>
>outputs the expected
>
> -.-.-.-.-.-
>
>Alas, on IBM Mainframe z/OS Unix System Services and under Cygwin, I
>get
>
> --.--.--.--.--.-
>
>Investigation shows that obviously, after substituting a sequence of
>non-dots by a dash, another sequence of *zero* non-dots gets detected
>before the same dot (that terminated the just substituted sequence)
>and gets substituted by another dash before the dot gets finally
>eaten.
>
>When I change my regexp to request at least one non-dot as follows
>
> sed 's/[^\.][^\.]*/-/g'
>
>so that it doesn't match zero non-dots, the result comes out as
>originally intended. Anyway, I would argue that the original script is
>correct since the regexp after having matched a non-empty sequence of
>non-dots and reaching the dot can't match again *anything* before this
>same dot, so matching another empty sequence of non-dots is an error.
After the RE matches a sequence of non-period characters (and they
get replaced), the search continues at the next character after that
sequence. For example:
If the RE is [^.]* (0 or more non-period characters):

    abcd.efgh      cursor is at position 1 in the string
    ^
    [abcd].efgh    4 non-period characters match
    -.efgh         characters are replaced and cursor advances
     ^
    -[].efgh       0 non-period characters match
    --.efgh        characters are replaced and cursor advances
       ^
    --.[efgh]      4 non-period characters match
    --.-           characters are replaced and cursor advances
    --.-           (cursor at EOL)
        ^
At this point it seems that your z/OS sed terminates the search/replace,
although conceivably it could regard the null (empty) string at the very
end of the string as matching too, in which case you would have got
another hyphen.
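The cursor walk described here can be mimicked in a few lines of POSIX shell. This is only a sketch of the strategy, not the code of any real sed: the function name naive_gsub is made up, it hard-codes the RE [^.]* and the replacement "-", and it stops at end of line without trying one more empty match.

```shell
# naive_gsub STRING
# Replace every match of [^.]* with "-", always resuming the search
# exactly where the previous match ended (so empty matches are allowed).
naive_gsub() {
    rest=$1
    out=
    while [ -n "$rest" ]; do
        # Longest run of non-period characters at the cursor (may be empty).
        run=${rest%%.*}
        out=$out-
        if [ -n "$run" ]; then
            # Non-empty match: the cursor moves past the matched run.
            rest=${rest#"$run"}
        else
            # Empty match: copy one character so the cursor makes progress.
            first=${rest%"${rest#?}"}
            out=$out$first
            rest=${rest#?}
        fi
    done
    printf '%s\n' "$out"
}

naive_gsub 'abcd.efgh'
```

Called on abcd.efgh this prints --.- and on the dataset name from the question it prints --.--.--.--.--.-, which is the z/OS and Cygwin result.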
Another way to look at it is to treat the second match as if you were
starting over with a new string, e.g.:

    STRING= abc.efgh

The "abc" matches and gets replaced, and the cursor (search pointer)
advances to the period character. So the point you continue from is
the same as if you were just feeding in the string

    STRING= .efgh

Obviously 0 non-period characters match at the start of this string,
so the first replace would leave

    STRING= -.efgh
Anyway, this behaviour is much more logical and consistent than what you
witnessed on the other systems, but it is sometimes counter-intuitive,
so some implementations have a special rule that they can't match a null
string directly after matching a non-null string. GNU awk does this,
and it can be seen clearly in the following examples:
$ echo '1 2 3' |awk 'gsub(" *","-")'
-1-2-3-
$ echo '1 22 3' |awk 'gsub(" *","-")'
-1-2-2-3-
(BTW, notice that this utility does actually match the null string at
the end of the line.)
This is basically a trade-off. Most programmers can't be bothered to
learn the mathematical definitions required to define REs rigorously,
and in fact the large majority don't read the standards that would have
those definitions. So these rules make the tool produce the most
intuitive result in the common cases, and the side effect is weird
behaviour in others:
$ echo laura |awk 'gsub("$|y","-")'
laura-
$ echo laura |awk 'gsub("$|l","-")'
-aura
(This is GNU awk 3.0.5, and the behaviour may be changed in the current
version because I've reported this one as a bug. However, there will still
be odd results in many cases, simply because the behaviour has been tailored
for _intuitive_ correct results rather than _logical_ ones.)
I don't think that the standard that specifies this behaviour (POSIX)
dictates that a utility should behave one way or the other (RE behaviour
as a whole is not formally defined, but described rather loosely).
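Since implementations differ, it can be worth asking the local sed directly. The following is a rough probe, not a definitive test: the variable names are made up, the first two outcomes correspond to the two behaviours reported in this thread, and the third is the hypothetical case where the empty string at end of line is matched as well.

```shell
# Feed a short sample through the local sed and report what it does with
# the empty match that follows a non-empty one.
out=$(printf '%s\n' 'abcd.efgh' | sed 's/[^.]*/-/g')
case $out in
    '-.-')   verdict='empty match after a non-empty match is skipped' ;;
    '--.-')  verdict='empty match before the period is replaced too' ;;
    '--.--') verdict='empty matches before the period and at end of line are replaced' ;;
    *)       verdict="some other behaviour: $out" ;;
esac
printf '%s\n' "$verdict"
```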
In general the CORRECT solution is simply to compose more robust REs, so
in your case just use:
s/[^.]\{1,\}/-/g
or
s/[^.][^.]*/-/g
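As a small check, both forms can be run on the dataset name from the question. Neither RE can match the empty string, so the result should not depend on how a particular sed treats empty matches. The variable names are only for illustration.

```shell
# Both REs require at least one non-period character per match.
name='ZLS0.DSNDBC.DZL2ZLS.SEXCEP01.I0001.A001'
interval=$(printf '%s\n' "$name" | sed 's/[^.]\{1,\}/-/g')
doubled=$(printf '%s\n' "$name" | sed 's/[^.][^.]*/-/g')
printf '%s\n%s\n' "$interval" "$doubled"
```

Both lines come out as -.-.-.-.-.- here.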
best wishes
laura
>
>Am I right with this, and are z/OS Unix System Services as well as
>Cygwin indeed in error applying the regexp twice, or am I as well as
>AIX and Solaris in error by not applying the regexp twice?
>
>Thank you for your feedback,
>Hartmut Schaefer
-- echo alru_aafriehdab@ittnreen.tocm |sed 's/\(.\)\(.\)/\2\1/g'