Re: Read strings from one file and search for them in a directory containing htm files

From: Ed Morton (morton_at_lsupcaemnt.com)
Date: 11/22/05


Date: Tue, 22 Nov 2005 07:31:03 -0600

Meghavvarnam wrote:

<snip>
> Sample data does help a great deal. Here it is:
>
> allStrings.txt contains lines like these -
> =================== Begin allStrings.txt ====================
> WPA1
> WPA2
> Automatic (WPA2 or WPA1)
> XyZ technology helps make home networking simple.
> XyZ architecture offers network connectivity between personal
> computers, printers, intelligent appliances and wireless devices.
> XyZ architecture leverages ABC/DE and the Web to enable seamless
> proximity networking in addition to control and data transfer among
> networked devices in the home and office.
> If you enable XYZ , then XYZ-enabled devices can print to this device.
> Privacy
> SampleText:<br> Simpler, smarter online supplies ordering
> Learn more about <br>XYZ SampleText
> Transfer printer information to XYZ SampleText?
> ==================== End allStrings.txt =====================
>
> Which means the script will search for these lines in .htm files. Each
> of these lines needs to appear as is (case sensitive) to count as a
> match. Now consider that we read the 3rd line in the file above -
> Automatic (WPA2 or WPA1).
>
> From the .htm snippet pasted below, the third option tag contains the
> search string -
>
> Automatic (WPA2 or WPA1)
>
> So when a match like this occurs, I simply need to write Automatic
> (WPA2 or WPA1) in the files - usedStrings.
>
> ==================== Begin .htm Snippet =====================
> <tr>
> <td>&nbsp;</td>
> <td
> class="clf">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; WPA
> Version
> </td>
> <td>
> <select name="wpa_version" size="1"
> title="Select a WPA version setting">
> <option value="WPA1">WPA1</option>
> <option value="WPA2">WPA2</option>
> <option selected="SELECTED"
> value="Automatic">Automatic (WPA2 or WPA1)</option>
> </select>
> </td>
> </tr>
> <tr>
> <td>&nbsp;</td>
> <td
> class="clf">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
> Encryption:
> </td>
> <td>
> <select name="encr_type" size="1"
> title="Select an encryption setting">
> <option value="AES_TKIP">Automatic
> (AES or TKIP)</option>
> <option value="AES">AES</option>
> <option value="TKIP">TKIP</option>
> </select>
> </td>
> </tr>
> ===================== End .htm Snippet ======================
>

Now we're back to roughly my original suggestion, provided there are no
newlines in the searched text:

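> usedStrings.txt    # create (or empty) the output file
while IFS= read -r string
do
     # -F: treat the string literally, not as a regular expression
     grep -Fq -- ">${string}<" directory/*.htm &&
     printf '%s\n' "$string" >> usedStrings.txt
done < allStrings.txt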

Alternatively, doing it all in awk, it's:

gawk 'NR==FNR{strings[$0]++;next}
     { for (string in strings)
            if (index($0,">"string"<")) {
                usedStrings[string]++
                delete strings[string] # for efficiency
            }
     }
     END { for (string in usedStrings)
            print string
     }' allStrings.txt directory/*.htm > usedStrings.txt

Note that, since you said something in a previous posting about only
wanting to look for text when it's part of an HTML tag (or something
like that...) the search for ">"string"<" surrounds the line from
"allStrings.txt" with ">" and "<" so it only matches when the text
appears between those 2 characters. If you don't want that restriction,
just get rid of the ">" and "<". Similarly for the grep solution.
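
To see what that restriction changes in practice, here is a small sketch.
The file name sample.htm and the temporary directory are made up for the
demo, and grep's -F option is used so the string is matched literally
rather than as a regular expression:

```shell
# Hypothetical one-line input; sample.htm is an assumed name for this demo.
tmpdir=$(mktemp -d)
printf '%s\n' '<option value="WPA1">WPA one</option>' > "$tmpdir/sample.htm"

string='WPA1'

# With the delimiters: only text sitting directly between ">" and "<" counts.
if grep -Fq -- ">${string}<" "$tmpdir"/*.htm; then
    delimited=yes
else
    delimited=no
fi

# Without them: attribute values such as value="WPA1" count as well.
if grep -Fq -- "$string" "$tmpdir"/*.htm; then
    bare=yes
else
    bare=no
fi

echo "delimited=$delimited bare=$bare"
rm -rf "$tmpdir"
```

Here the string only occurs inside the value attribute, so the delimited
search misses it and the bare search finds it.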

If you'd like the awk script to tell you which strings are/aren't used,
that's trivial, e.g.:

gawk 'NR==FNR{strings[$0]++;next}
     { for (string in strings)
            if (index($0,">"string"<")) {
                usedStrings[string]++
                delete strings[string] # for efficiency
            }
     }
     END {
        print "Used Strings:"
        for (string in usedStrings)
            printf "\t%s\n",string
        print "Unused Strings:"
        for (string in strings)
            printf "\t%s\n",string
     }' allStrings.txt directory/*.htm

If there can be newlines in the strings you're trying to match in the
HTML files, then we need to figure out what "match" means, since there
aren't newlines in the strings in "allStrings.txt", and we need to pick
a record separator other than a newline char.
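
One possible reading, offered only as a sketch: treat any run of
whitespace in the HTML (newlines included) as a single space, and ignore
a space directly after ">" or before "<". Under that assumption,
"Automatic" and "(AES or TKIP)" split across two lines would match the
one-line string "Automatic (AES or TKIP)". The function name find_used
and this definition of "match" are assumptions, not something settled in
this thread. Instead of changing the record separator, the sketch gathers
each .htm file into one buffer, which should work in any POSIX awk:

```shell
# find_used STRINGS_FILE HTM_FILE...
# Prints each line of STRINGS_FILE that appears between ">" and "<" in some
# HTM_FILE after whitespace runs are collapsed (assumed meaning of "match").
find_used() {
    awk '
    # Normalize the text of one .htm file and test every string against it.
    function check(text,    s) {
        gsub(/[ \t\r\n]+/, " ", text)   # newlines and runs of blanks -> one space
        gsub(/> /, ">", text)           # drop a space right after ">"
        gsub(/ </, "<", text)           # drop a space right before "<"
        for (s in strings)
            if (index(text, ">" s "<"))
                used[s] = 1
    }
    FILENAME == ARGV[1] { strings[$0] = 1; next }   # first file: the strings
    FNR == 1 && buf != "" { check(buf); buf = "" }  # a new .htm file begins
    { buf = buf $0 "\n" }                           # gather the current file
    END {
        if (buf != "") check(buf)
        for (s in used) print s                     # order is unspecified
    }' "$@"
}
```

It would be called as find_used allStrings.txt directory/*.htm >
usedStrings.txt. Entities such as &nbsp; are not decoded, so text padded
with them still would not match.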

        Ed.


