Re: How to rewrite with awk?

From: Ed Morton (morton_at_lsupcaemnt.com)
Date: 06/07/05


Date: Tue, 07 Jun 2005 06:52:13 -0500


Ralph A. Moeritz wrote:

> Hi everyone!
>
> I've just gotten my feet wet with shell scripting and
> I'm unfamiliar with tools such as sed & awk. I have the
> following code:
>
> -------------------------------------------------------
> result=$(eval "sqlite $database 'select name, date, \
> sum(sub) from active_quotes where uname = \"$user\" \
> group by quoteid'")
>
> [ $? -eq 0 ] && {
> IFS="|$NL" ## $NL is a newline
> set -- $result
> i=1
> for str
> do
> var[i++]=$str
> done
>
> for (( k=1; k<i; k += 3 ))
> do
> printf "<tr><td>${var[k]}</td><td> \
> ${var[k+1]}</td><td>${var[k+2]}</td></tr>"
> done
> exit 0
> }
> -------------------------------------------------------
>
> As should be apparent, I'm trying to retrieve some rows from
> a database, and output them into an HTML <table>. The above
> code works -BUT- it's bash specific and it uses two loops.
>
> Q1: How can I rewrite the above to use awk?

Let's assume the result of your query would be something like this:

Joe Smith|Jan-05-2005|27
Billy Bob|Feb-05-2005|9

Then you'd do this (assuming you do need the eval, etc. exactly as you'd
written it:

eval sqlite $database 'select name, date, \
sum(sub) from active_quotes where uname = \"$user\" \
group by quoteid' | awk -F\| '{
printf "<tr><td>%s</td><td>%s</td><td>%s</td></tr>\n",$1,$2,$3}'

If I'm wrong about the format of your query output, post that format for
the correct awk script to parse it.

> Q2: Where can I find some good awk tutorials?

Get the book "Effective AWK Programming", 3rd edition from O'Reilly (
http://tinyurl.com/7defj ) and lurk in comp.lang.awk. I've also included
a few sample scripts below, mostly based on answers to previous NG
questions that I thought were worth remebering.

        Ed.

TOPICS:
1. Delete all fields up to field N, preserving input formatting.
2. Extract user id and file name from ls -l.
3. Extract the string that matches a RE.
4. Use an RS that must be at the start of a line.
5. Substituting matched REs in *sub().
6. Writing changes back to the original file.
7. Transposing rows to selected columns and sorting by key.
8. Performing arithmetic on and replacing part of an RE.
9. Recursively in-lining "include" files.
10. Convert the first letter of every word to upper case.
11. Convert a string to an array.
12. Remove text between nested delimiters.
13. Join lines that end in backslashes.

SCRIPTS:
1. Delete all fields up to field N, preserving input formatting.
gawk --re-interval 'sub(/^[[:space:]]*([^[:space:]]*[[:space:]]*){1}/,"")'
gawk --posix '...'
The number within the "{...}" is the number of initial fields to delete.
Note that "gensub()" is not available with "--posix" but it is available
with "--re-interval" so if you need to use an interval expression (e.g.
{1,} or {8} or {2,4}) with gensub() then you must use --re-interval
rather than --posix so --re-interval is generally the preferred method.

2. Extract user id and file name from ls -l.
gawk --re-interval 'sub(/([^[:blank:]]{1,}[[:blank:]]{1,}){8}/,"")'
gawk --posix '...'

3. Extract the string that matches a RE.
gawk '
function extract(extractSrc,extractPattern) {
     if (match(extractSrc, extractPattern)) {
         RMATCH = substr(extractSrc, RSTART, RLENGTH)
         extractReturn = 1
     } else {
         RMATCH = null
         extractReturn = 0
     }
     return extractReturn
}
extract($0,"f.*t") { print RMATCH }
'

4. Use an RS that must be at the start of a line.
gawk 'BEGIN{rs="SeparatorText";RS="(^|\n)"rs}NR==1{next}
     {printf"----\n%s%s\n",rs,$0}'

5. Substituting matched REs in *sub().
PS1> echo "abcbd" | gawk 'sub(/b/,"|&|")'
a|b|cbd
PS1> echo "abcbd" | gawk 'gsub(/b/,"|&|")'
a|b|c|b|d
PS1> echo "abcbd" | gawk '$0=gensub(/b/,"|&|","")'
a|b|cbd
PS1> echo "abcbd" | gawk '$0=gensub(/b/,"|&|","g")'
a|b|c|b|d
PS1> echo "abcbd" | gawk '$0=gensub(/(b)/,"|\\1|","")'
a|b|cbd
PS1> echo "abcbd" | gawk '$0=gensub(/(b)/,"|\\1|","g")'
a|b|c|b|d
PS1> echo "abcbd" | gawk '$0=gensub(/(b)(c)/,"|\\2\\1|","g")'
a|cb|bd

6. Writing changes back to the original file.
gawk '
function printout(_str) { _out[++_nr] = _str }
function flushout( _i) { close(FILENAME);
                           for (_i=1; _i<=_nr;_i++)
                                  print _out[_i] > FILENAME
                         }
{ printout( NR " " $0 ) }
END { flushout() }'

7. Transposing rows to selected columns and sorting by key.

Given the following input file:
Number of executions = 437
Number of compilations = 1
Worst preparation time (ms) = 1
Best preparation time (ms) = 1
Rows deleted = 0

Number of executions = 1
Number of compilations = 1
Worst preparation time (ms) = 4
Best preparation time (ms) = 4
Rows deleted = 0

Number of executions = 29
Number of compilations = 1
Worst preparation time (ms) = 1
Best preparation time (ms) = 1
Rows deleted = 0

To tranpose certain rows into columns and sort by one of the
column, like the following which is sorted by "Number of executions":

Number of executions Number of compilations Rows deleted
437 1 0
29 1 0
29 1 0

This will do it all in gawk:

gawk -vRS="" -F"\n" 'BEGIN{ fields = "1 2 5"; key = "1"
     numflds = split(fields,flds," ")
}
{
     for (i=1; i<=NF;i++) {
         split($i,f,"=")
         # Get rid of all spaces from the end of the title text
         sub(/[[:blank:]]*$/,"",f[1])
         title[i]=f[1]
         # Get rid of all spaces from the value field
         value[i]=f[2]+0
         # Determine the width for this column based on the width
         # of the title text plus 3 for spacing. Left-justify (%-).
         fmt[i]="%-"(length(title[i])+3)"s"
     }
     # We will want to sort on the key column so we need to create a
     # string at the start of each line to sort on later. Take the key
     # columns value and pad it with zeros up to 20 chars followed by
     # a space to separate it fromthe first real column. Conversion of
     # "7" to "0007" and "17" to "0017" is necessary because asort()
     # is alphabetical not numerical so all numeric fields must be the
     # same width to compare alphabetically.
     lines[NR] = sprintf("%020s ",value[key])

     # Now add the real columns, formatted as determined earlier.
     for (i=1; i<=numflds; i++) {
         lines[NR] = lines[NR] sprintf(fmt[flds[i]], value[flds[i]])
     }
}
END {
     # Print the title line
     for (i=1; i<=numflds; i++) {
         printf fmt[flds[i]], title[flds[i]]
     }
     print ""
     # Sort the lines alphabetically, i.e. by the value of the key column
     # added above to the front of each line.
     asort(lines)
     # Print each line
     for (i=1; i<=NR; i++) {
         # strip out the first numeric value, the key value added above
         sub("[[:digit:]]* ","",lines[i])
         print lines[i]
     }
}'

Setting fields and key at the beginning obvious dictates which fields to
be printed and which key to sort on. The only thing it assumes about field
sizes is that the key fields values won't be more than 20 characters.

8. Performing arithmetic on and replacing part of an RE.
Given a tag in a file that is myval="45", to multiply the "45" by 2 and
replace it
with the result (90) in the same tag, i.e. myval="90":

gawk -vRS="myval=\"[0-9]+\""
'{ORS=RT;gsub("[^0-9]","",RT);sub("[0-9]+",2*RT,ORS)}1'

9. Recursively in-lining "include" files.
This script will not only expand all the lines that say "include
subfile", but
by writing the result to a tmp file, resetting ARGV[1] (the highest
level input
file) and not resetting ARGV[2] (the tmp file), it then lets awk do any
normal
record parsing on the result of the expansion since that's now stored in the
tmp file. If you don't need that, just do the "print" to stdout and
remove any
other references to a tmp file or ARGV[2].

awk 'function read(file) {
        while ( (getline < file) > 0) {
            if ($1 == "include") {
                 read($2)
            } else {
                 print > ARGV[2]
            }
        }
        close(file)
    }
    BEGIN{
       read(ARGV[1])
       ARGV[1]=""
       close(ARGV[2])
    }1' a.txt tmp

The result of running the above given these 3 files in the current
directory:

       a.txt b.txt c.txt
       ----- ----- -----
       1 3 5
       2 4 6
       include b.txt include c.txt
       9 7
       10 8

would be to print the numbers 1 through 10 and save them in a file named
"tmp".

10. Convert the first letter of every word to upper case.
awk 'BEGIN{RS="[[:space:]]";FS=OFS=""}{$1=toupper($1);ORS=RT}1'

11. Convert a string to an array.
To convert a string to an array indexed by each word's position in the
string:

awk 'BEGIN{str="abc def";c=split(str,arr);for (i=1;i<=c;i++) print arr[i]}'

To convert a string to an array indexed by each word:

awk 'BEGIN{str="abc def";c=split(str,tmp);for
(i=1;i<=c;i++)arr[tmp[i]]++;delete tmp;
        for (w in arr) print w}'

or:

awk 'function str2arr(s,a,i,_c,_j){_c=split(s,i);for
(_j=1;_j<=_c;_j++)a[i[_j]]++;return _c}
      BEGIN{str="abc def";c=str2arr(str,arr,idx);for (i=1;i<=c;i++)print
idx[i],arr[idx[i]]}'

where arr is an array of numbers indexed by strings and idx is an array
of strings indexed by
numbers for ease of accessing arr[]s indices in the original order.

12. Remove text between nested delimiters.
Given an input file like:

2005-04-26 DEBUG [junk][junkjunkjunk] The message says...
2005-04-26 DEBUG [morejunk]meaningful [junkagain]more meaning to life...
2005-04-26 DEBUG a[junk[junk[junk]junk]]b The message says...

this script:

awk -vFS="" 'function rmvJunk() {
     for (i+=1; i <= NF; i++)
         if ($i == "[") rmvJunk(); else if ($i == "]") return
}
{ for (i=1; i <= NF; i++)
         if ($i == "[") rmvJunk(); else printf "%s",$i
     print ""
}'

will remove everything between the nested pairs of "[...]"s to produce:

2005-04-26 DEBUG The message says...
2005-04-26 DEBUG meaningful more meaning to life...
2005-04-26 DEBUG ab The message says...

so will this much simpler one:

awk '{while (sub(/\[[^][]*]/," "));}1'

though it only works in gawk and /usr/xpg4/bin/awk, not nawk or old awk.
The RE it's using looks for:

\[ = a single "["
[^][] = a character that's neither "]" nor "["
* = repeated
] = until a single "]"

so it finds the innermost "[...]" every iteration through the loop.

13. Join lines that end in backslashes.
gawk 'BEGIN{RS="\\\\\n|\n"}{ORS=RT~/\\/?"":"\n"}1'



Relevant Pages

  • Re: How to sort array
    ... How to using sort in awk? ... printf is a builtin not a function, ... Also, 'printf "%s\n",string' is the same as 'print string' so there's ... having to specify the same thing everywhere, and remove the tmp file ...
    (comp.lang.awk)
  • Re: insert a field seperator
    ... If I have the string as follows in my script: ... How do I use awk or sed to put a: every 2nd charcater so I have ...
    (comp.unix.shell)
  • Re: insert a field seperator
    ... If I have the string as follows in my script: ... How do I use awk or sed to put a: every 2nd charcater so I have ...
    (comp.unix.shell)
  • Re: substitute a text to another
    ... I think that I can use sed or awk ... Good luck! ... I need a script that looking for a "string" in hundreds of files and ...
    (comp.unix.shell)
  • Re: insert a field seperator
    ... Robert Katz wrote: ... If I have the string as follows in my script: ... How do I use awk or sed to put a: every 2nd charcater so I have ...
    (comp.unix.shell)