Re: sed applying the same regexp twice

From: laura fairhead (laura_fairhead_at_INVALID.com)
Date: 11/04/03


Date: Tue, 4 Nov 2003 00:19:44 +0000

In article <c8fcab4d.0311030417.618ea33b@posting.google.com>,
Hartmut Schäfer wrote:
>Hi,
>
>I wrote the following sed script

Hello,

>
> sed 's/[^\.]*/-/g'

The backslash shouldn't be there. In a bracket expression the period
isn't a special character, and neither is the backslash, so [^\.] also
excludes a literal backslash.
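
A small check of what the backslash does inside the brackets, assuming a
sed that follows the POSIX bracket-expression rules (the input string is
made up for illustration). It uses the one-or-more form of the RE so that
empty matches play no part:

```shell
# With the backslash in the bracket expression, a literal backslash in the
# input is excluded from the match just like the period is.
with_backslash=$(printf '%s\n' 'a\b.c' | sed 's/[^\.][^\.]*/-/g')

# Without it, only the period is excluded.
without_backslash=$(printf '%s\n' 'a\b.c' | sed 's/[^.][^.]*/-/g')

echo "$with_backslash"      # prints -\-.-
echo "$without_backslash"   # prints -.-
```

For dataset names that never contain a backslash the two forms give the
same result, so the extra backslash is harmless here but misleading.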

>
>to filter a list of MVS datasets, having the names on all hierarchy
>levels substituted by a dash (for doing further processing on the
>result). Thus, the line
>
> ZLS0.DSNDBC.DZL2ZLS.SEXCEP01.I0001.A001
>
>should be transformed into
>
> -.-.-.-.-.-
>
>I read the script above as "substitute every sequence of non-dots by
>one dash". (OK, *empty* sequences of non-dots will match too, but
>since in the input there are no sequences of dots with nothing between
>them, with the expected input this should work as intended.)
>
>Indeed, under AIX and Solaris
>
> echo ZLS0.DSNDBC.DZL2ZLS.SEXCEP01.I0001.A001 | sed 's/[^\.]*/-/g'
>
>outputs the expected
>
> -.-.-.-.-.-
>
>Alas, on IBM Mainframe z/OS Unix System Services and under Cygwin, I
>get
>
> --.--.--.--.--.-
>
>Investigation shows that obviously, after substituting a sequence of
>non-dots by a dash, another sequence of *zero* non-dots gets detected
>before the same dot (that terminated the just substituted sequence)
>and gets substituted by another dash before the dot gets finally
>eaten.
>
>When I change my regexp to request at least one non-dot as follows
>
> sed 's/[^\.][^\.]*/-/g'
>
>so that it doesn't match zero non-dots, the result comes out as
>originally intended. Anyway, I would argue that the original script is
>correct since the regexp after having matched a non-empty sequence of
>non-dots and reaching the dot can't match again *anything* before this
>same dot, so matching another empty sequence of non-dots is an error.

After the RE matches a sequence of non-period characters (and they
get replaced), the search continues at the next character after that
sequence. For example:

If the RE is [^.]* (0 or more non-period characters)

abcd.efgh       cursor is at position 1 in the string
^

[abcd].efgh     4 non-period characters match

-.efgh          characters are replaced and cursor advances
 ^

-[].efgh        0 non-period characters match

--.efgh         characters are replaced and cursor advances
   ^

--.[efgh]       4 non-period characters match

--.-            characters are replaced and cursor advances

--.-            (cursor at EOL)
    ^

At this point it seems that your z/OS sed terminates the search/replace,
although conceivably it could also regard the null (empty) string at the
very end of the line as a match, and then you would have got another hyphen.
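
The walk above can be sketched as a small shell function. This is a
hypothetical helper (the name emulate and its flag are made up here), not
how any real sed is implemented. It hard-codes the RE [^.]* and makes the
empty-match rule an explicit switch. The cut-off at end of line is an
assumption chosen to reproduce the outputs quoted in this thread:

```shell
# Hypothetical helper, not a real sed: emulates s/[^.]*/-/g by hand.
#   $1 = input line
#   $2 = 1 to allow an empty match right after a non-empty one, 0 to forbid it
emulate() {
    printf '%s\n' "$1" | awk -v allow_adjacent="$2" '{
        s = $0; n = length(s); pos = 1; prev_end = 0; out = ""
        while (pos <= n + 1) {
            # length of the run of non-period characters starting at pos
            len = 0
            while (pos + len <= n && substr(s, pos + len, 1) != ".")
                len++
            if (len > 0) {
                # non-empty match: replace it and move the cursor past it
                out = out "-"
                pos += len
                prev_end = pos
            } else {
                # empty match: taken unless it touches the previous match;
                # the end-of-line cut-off is an assumption
                if (prev_end != pos || (allow_adjacent == 1 && pos <= n))
                    out = out "-"
                # copy the unmatched character and step over it
                if (pos <= n)
                    out = out substr(s, pos, 1)
                pos++
            }
        }
        print out
    }'
}

emulate 'abcd.efgh' 1    # prints --.-
emulate 'abcd.efgh' 0    # prints -.-
```

With the flag set to 1 the sample dataset name comes out as
--.--.--.--.--.- and with 0 as -.-.-.-.-.-, the two results reported in the
question.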

Another way to look at it is to treat the second match as if you were
starting over with a new string, e.g.:

STRING= abc.efgh

The "abc" matches and gets replaced and the cursor (search pointer)
advances to the period character. So, the point you continue from would
be as if you were just feeding in the string;

STRING= .efgh

Obviously there are 0 non-period characters matching at the start of
this string, so the first replace would leave

STRING= -.efgh
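
This step can be checked on its own with a single, non-global
substitution. Only the first match is replaced, so the result should not
depend on how an implementation treats later empty matches:

```shell
# The leftmost match of [^.]* in ".efgh" is the empty string before the period.
first=$(printf '%s\n' '.efgh' | sed 's/[^.]*/-/')
echo "$first"    # prints -.efgh
```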

Anyway, this behaviour is more logical and consistent than what you saw
on the other systems, but it can be counter-intuitive, so some
implementations add a special rule: a null string may not match directly
after a non-null match. GNU awk does this, as the following examples
show:

$ echo '1 2 3' |awk 'gsub(" *","-")'
-1-2-3-
$ echo '1 22 3' |awk 'gsub(" *","-")'
-1-2-2-3-

(btw: notice that this utility does actually match the null string at the
 end of the line)

This is basically a trade-off. Most programmers can't be bothered to learn
the mathematical definitions needed to define REs rigorously, and the large
majority don't read the standards that would contain those definitions.
These rules therefore make the tool produce the most intuitive result in
the common cases, and the side effect is weird behaviour in others:

$ echo laura |awk 'gsub("$|y","-")'
laura-
$ echo laura |awk 'gsub("$|l","-")'
-aura

( this is gawk 3.0.5 and the behaviour may have changed in the current
  version because I've reported this one as a bug; however, there will still
  be odd results in many cases, simply because the behaviour has been
  tailored for _intuitive_ results rather than _logical_ ones )

I don't think that the relevant standard (POSIX) dictates that a utility
should behave one way or the other; RE behaviour as a whole is not formally
defined there, only described, and rather loosely.

In general the CORRECT solution is simply to compose more robust REs that
cannot match the null string, so in your case just use:

s/[^.]\{1,\}/-/g

or

s/[^.][^.]*/-/g
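
As a quick check on the sample line from the question, both forms should
give the intended result on any POSIX sed, since neither can match an empty
string (the variable names are just for illustration):

```shell
name='ZLS0.DSNDBC.DZL2ZLS.SEXCEP01.I0001.A001'

# Interval form: one or more non-period characters.
interval=$(printf '%s\n' "$name" | sed 's/[^.]\{1,\}/-/g')

# Doubled form: one non-period character followed by zero or more.
doubled=$(printf '%s\n' "$name" | sed 's/[^.][^.]*/-/g')

echo "$interval"   # prints -.-.-.-.-.-
echo "$doubled"    # prints -.-.-.-.-.-
```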

best wishes
laura

>
>Am I right with this, and are z/OS Unix System Services as well as
>Cygwin indeed in error applying the regexp twice, or am I as well as
>AIX and Solaris in error by not applying the regexp twice?
>
>Thank you for your feedback,
>Hartmut Schaefer

-- 
echo alru_aafriehdab@ittnreen.tocm |sed 's/\(.\)\(.\)/\2\1/g'

