Re: Read strings from one file and search for them in a directory containing htm files

From: Meghavvarnam (meghsatish_at_yahoo.com)
Date: 11/22/05

  • Next message: swagat.dasgupta_at_gmail.com: "pipin' problems - stdin & stdout messed up after executing piped command"
    Date: 22 Nov 2005 01:28:15 -0800
    
    

    Ed Morton wrote:
    > Meghavvarnam wrote:
    > >>Megh,
    > >>
    > >>In the absence of some sample input data and sample files there's only
    > >>so much we can do. The script as written works just fine for the simple
    > >>tests I threw at it, but it certainly isn't particularly robust.
    > >
    > >
    > > Lars,
    > >
    > > I understand the difficulty when we don't have the data. It's difficult
    > > to get a sense of what the behaviour of our code should be. Here is
    > > part of a huge .htm file that I have :
    > >
    > > ================== Begin Snippet ====================
    > > <h3 class="subTitle">Reset Bluetooth</h3>
    > > <div class="pad10">
    > > <table width="450" class="tabbedcontent" summary
    > > ="This table is used to display the Bluetooth interface configuration
    > > parameters.">
    > > <tr>
    > > <td class="clf" colspan="3">
    > > Use this option to reset Bluetooth to
    > > factory default settings.
    > > </td>
    > > </tr>
    > > <tr>
    > > <td width="5%">&nbsp;</td>
    > > <td class="clf" colspan="2">
    > > <INPUT type="radio"
    > > name="bt_reset_bluetooth" title="Select a reset Bluetooth setting"
    > > value="choice_bt_reset_bluetooth_yes" accesskey="y" >
    > > Yes, reset Bluetooth
    > > </td>
    > > </tr>
    > > <tr>
    > > <td width="5%">&nbsp;</td>
    > > <td class="clf" colspan="2">
    > > <INPUT type="radio"
    > > name="bt_reset_bluetooth" title="Select a reset Bluetooth setting"
    > > value="choice_bt_reset_bluetooth_no" accesskey="n" CHECKED>
    > > No
    > > </td>
    > > </tr>
    > > </table>
    > > </div> <!-- end div pad10 -->
    > > ================== End Snippet ====================
    > >
    > > Now the file that contains all the strings that I search for has, for
    > > example, a string - "Reset".
    > >
    > > When I look for "Reset", I need output that meets the following
    > > criteria:
    > > 1. Only lines that contain "Reset" and not a line from the above snippet
    > > like - <h3 class="subTitle">Reset Bluetooth</h3>
    > > 2. The output is also case sensitive. For example, it will not print
    > > lines that may have "reset".
    > > 3. And since we search in a html file, strings found in a comment do
    > > not form part of the output.
    > > 4. Basically attributes like title, summary, value need to contain, for
    > > example, ONLY "Reset" to match the criteria and be part of the output.
    >
    > So, it sounds like none of the lines in your sample input actually match
    > your search criteria. Is that right? It now sounds like what you really
    > want is something like this:
    >
    > > usedStrings.txt
    > while IFS= read -r string
    > do
    > gawk -vstring="=\"$string\"" '{
    > for (i=1;i<=NF;i++) {
    > if ($i ~ string) {
    > print FILENAME > usedStrings.txt
    > nextfile
    > }
    > }
    > }' directory/*.htm
    > done < allStrings.txt
    >
    > but posting sample input with some matches to your selection criteria
    > plus the expected output would help. Note that I used "gawk" to take
    > advantage of its "nextfile" statement. If you use some other awk, you
    > need to work around that.
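
For an awk without nextfile, one possible workaround (a sketch, not Ed's code) is to remember which files have already been reported and skip their remaining lines. The sample files, the `reported` array name and the literal index() test are assumptions made for illustration. Unlike nextfile, the rest of each file is still read, only ignored.

```shell
#!/bin/sh
# Sketch only: the sample files below are made up for illustration.
workdir=$(mktemp -d)
printf '%s\n' '<option value="WPA1">WPA1</option>' \
              '<option value="WPA1">again</option>' > "$workdir/a.htm"
printf '%s\n' '<option value="AES">AES</option>' > "$workdir/b.htm"

# "reported" (a hypothetical name) records files that already matched.
matches=$(awk -v string='="WPA1"' '
(FILENAME in reported) { next }       # stand-in for nextfile: ignore the rest
index($0, string) {                   # literal substring test, not a regex
    print FILENAME
    reported[FILENAME] = 1
}' "$workdir"/*.htm)

printf '%s\n' "$matches"
```

Each matching file name is printed once, even though a.htm contains the string on two lines.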

    Sample data does help a great deal. Here it is:

    allStrings.txt contains lines like these -
    =================== Begin allStrings.txt ====================
    WPA1
    WPA2
    Automatic (WPA2 or WPA1)
    XyZ technology helps make home networking simple.
    XyZ architecture offers network connectivity between personal
    computers, printers, intelligent appliances and wireless devices.
    XyZ architecture leverages ABC/DE and the Web to enable seamless
    proximity networking in addition to control and data transfer among
    networked devices in the home and office.
    If you enable XYZ , then XYZ-enabled devices can print to this device.
    Privacy
    SampleText:<br> Simpler, smarter online supplies ordering
    Learn more about <br>XYZ SampleText
    Transfer printer information to XYZ SampleText?
    ==================== End allStrings.txt =====================

    Which means the script will search for these lines in the .htm files. Each
    of these lines needs to appear as is (case sensitive) for there to be a
    match. Now consider the 3rd line in the file above -
    Automatic (WPA2 or WPA1).

    From the .htm snippet pasted below, the third option tag contains the
    search string -

    Automatic (WPA2 or WPA1)

    So when a match like this occurs, I simply need to write Automatic
    (WPA2 or WPA1) in the file - usedStrings.txt.

    ==================== Begin .htm Snippet =====================
                            <tr>
                               <td>&nbsp;</td>
                               <td
    class="clf">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; WPA
    Version
                               </td>
                               <td>
                                  <select name="wpa_version" size="1"
    title="Select a WPA version setting">
                                     <option value="WPA1">WPA1</option>
                                     <option value="WPA2">WPA2</option>
                                     <option selected="SELECTED"
    value="Automatic">Automatic (WPA2 or WPA1)</option>
                                  </select>
                               </td>
                            </tr>
                            <tr>
                               <td>&nbsp;</td>
                               <td
    class="clf">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
    Encryption:
                               </td>
                               <td>
                                  <select name="encr_type" size="1"
    title="Select an encryption setting">
                                     <option value="AES_TKIP">Automatic
    (AES or TKIP)</option>
                                     <option value="AES">AES</option>
                                     <option value="TKIP">TKIP</option>
                                  </select>
                               </td>
                            </tr>
    ===================== End .htm Snippet ======================
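
The requirement described above (each line of allStrings.txt must appear as is, case sensitive) can be sketched with a plain grep loop. This is a minimal sketch under assumptions: the temporary directory, the sample contents and the extra lowercase test line are made up, and only the names allStrings.txt, usedStrings.txt and directory/*.htm come from the thread. It does not skip HTML comments, and it will miss a string that is wrapped across two lines in the .htm file, as "Automatic / (AES or TKIP)" is in the snippet.

```shell
#!/bin/sh
# Sketch only: file names and sample contents are stand-ins for the real data.
workdir=$(mktemp -d)
mkdir "$workdir/directory"

cat > "$workdir/allStrings.txt" <<'EOF'
WPA1
Automatic (WPA2 or WPA1)
Privacy
automatic (wpa2 or wpa1)
EOF

cat > "$workdir/directory/sample.htm" <<'EOF'
<select name="wpa_version" size="1" title="Select a WPA version setting">
<option value="WPA1">WPA1</option>
<option selected="SELECTED" value="Automatic">Automatic (WPA2 or WPA1)</option>
</select>
EOF

: > "$workdir/usedStrings.txt"           # start with an empty result file
while IFS= read -r string
do
    [ -n "$string" ] || continue         # skip blank lines
    # -F: take the string literally, so "(" and ")" are not regex operators
    # -q: only the exit status matters; grep is case sensitive by default
    # --: protects strings that begin with "-"
    if grep -F -q -- "$string" "$workdir"/directory/*.htm
    then
        printf '%s\n' "$string" >> "$workdir/usedStrings.txt"
    fi
done < "$workdir/allStrings.txt"

cat "$workdir/usedStrings.txt"
```

With this sample data, usedStrings.txt ends up holding WPA1 and Automatic (WPA2 or WPA1); Privacy and the lowercase variant are left out.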

    I also tried the script that you sent in your previous post. It created
    the file - usedStrings.txt. However it does not populate it. The file
    remains empty.
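
Two things in the quoted gawk script look like possible causes, as far as can be told without running it on the real data. The redirection target usedStrings.txt is not quoted inside the awk program, so awk does not see it as a file name string. The test also compares single whitespace-separated fields against the string as a regular expression, so a multi-word entry such as Automatic (WPA2 or WPA1) cannot match one field. The sketch below is one way around both, assuming a literal whole-line search is what is wanted; the temporary paths, sample data and the array names `wanted` and `found` are hypothetical.

```shell
#!/bin/sh
# Sketch only: paths and sample contents are stand-ins for the real data.
workdir=$(mktemp -d)
mkdir "$workdir/directory"
printf '%s\n' 'WPA2' 'Automatic (WPA2 or WPA1)' 'Privacy' \
    > "$workdir/allStrings.txt"
cat > "$workdir/directory/wpa.htm" <<'EOF'
<option value="WPA2">WPA2</option>
<option selected="SELECTED" value="Automatic">Automatic (WPA2 or WPA1)</option>
EOF

: > "$workdir/usedStrings.txt"     # create the result file even if nothing matches

awk -v out="$workdir/usedStrings.txt" '
NR == FNR {                        # first file: load the search strings
    if ($0 != "") wanted[$0] = 1
    next
}
{                                  # remaining files: the .htm pages
    for (s in wanted)
        if (!(s in found) && index($0, s)) {   # index() compares literally
            found[s] = 1
            print s > out          # out holds the file name as a string
        }
}' "$workdir/allStrings.txt" "$workdir"/directory/*.htm

cat "$workdir/usedStrings.txt"
```

Reading the string list once and scanning the .htm files in a single awk run also avoids starting one gawk process per search string. The order of the output lines is not guaranteed in general, since for (s in wanted) visits the strings in an unspecified order.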

    In the process you are helping me learn awk as well.

    Thank you so much once again for all the help! I am growing to
    understand some awk scripts and their behaviour.

    Warm Regards,
    Megh

    > Glad to see you've overcome your google groups non-quoting hurdle!
    >
    > Ed.

