Re: Read strings from one file and search for them in a directory containing htm files
From: Ed Morton (morton_at_lsupcaemnt.com)
Date: 11/22/05
- Next message: Ed Morton: "Re: Comparing value with records in a shell array."
- Previous message: anthony: "Re: Comparing value with records in a shell array."
- In reply to: Meghavvarnam: "Re: Read strings from one file and search for them in a directory containing htm files"
- Next in thread: Meghavvarnam: "Re: Read strings from one file and search for them in a directory containing htm files"
- Reply: Meghavvarnam: "Re: Read strings from one file and search for them in a directory containing htm files"
- Reply: Meghavvarnam: "Re: Read strings from one file and search for them in a directory containing htm files"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Tue, 22 Nov 2005 07:31:03 -0600
Meghavvarnam wrote:
<snip>
> Sample data does help a great deal. Here it is:
>
> allStrings.txt contains lines like these -
> =================== Begin allStrings.txt ====================
> WPA1
> WPA2
> Automatic (WPA2 or WPA1)
> XyZ technology helps make home networking simple.
> XyZ architecture offers network connectivity between personal
> computers, printers, intelligent appliances and wireless devices.
> XyZ architecture leverages ABC/DE and the Web to enable seamless
> proximity networking in addition to control and data transfer among
> networked devices in the home and office.
> If you enable XYZ , then XYZ-enabled devices can print to this device.
> Privacy
> SampleText:<br> Simpler, smarter online supplies ordering
> Learn more about <br>XYZ SampleText
> Transfer printer information to XYZ SampleText?
> ==================== End allStrings.txt =====================
>
> Which means, the script will search for these lines in .htm files. Each
> of these lines needs to appear as is (case sensitive) to say that there
> is a match. Now consider we read the 3rd line in the file above -
> Automatic (WPA2 or WPA1).
>
> From the .htm snippet pasted below, the third option tag contains the
> search string -
>
> Automatic (WPA2 or WPA1)
>
> So when a match like this occurs, I simply need to write Automatic
> (WPA2 or WPA1) in the files - usedStrings.
>
> ==================== Begin .htm Snippet =====================
> <tr>
> <td> </td>
> <td
> class="clf"> WPA
> Version
> </td>
> <td>
> <select name="wpa_version" size="1"
> title="Select a WPA version setting">
> <option value="WPA1">WPA1</option>
> <option value="WPA2">WPA2</option>
> <option selected="SELECTED"
> value="Automatic">Automatic (WPA2 or WPA1)</option>
> </select>
> </td>
> </tr>
> <tr>
> <td> </td>
> <td
> class="clf">
> Encryption:
> </td>
> <td>
> <select name="encr_type" size="1"
> title="Select an encryption setting">
> <option value="AES_TKIP">Automatic
> (AES or TKIP)</option>
> <option value="AES">AES</option>
> <option value="TKIP">TKIP</option>
> </select>
> </td>
> </tr>
> ===================== End .htm Snippet ======================
>
Now we're back to roughly my original suggestion, provided there are no
newlines in the searched text:
    > usedStrings.txt          # create/truncate the output file
    while IFS= read -r string
    do
        # -F: treat the string as literal text, not as a regexp
        grep -qF -- ">${string}<" directory/*.htm &&
            echo "$string" >> usedStrings.txt
    done < allStrings.txt
Alternatively, doing it all in awk, it's:
    gawk 'NR==FNR { strings[$0]++; next }   # 1st file: load the strings
    {
        for (string in strings)
            if (index($0, ">" string "<")) {
                usedStrings[string]++
                delete strings[string]   # for efficiency
            }
    }
    END {
        for (string in usedStrings)
            print string
    }' allStrings.txt directory/*.htm > usedStrings.txt
Note that, since you said in a previous posting that you only want to
look for text when it's part of an HTML tag (or something like that...),
the search for ">"string"<" surrounds each line from "allStrings.txt"
with ">" and "<" so it only matches when the text appears directly
between those 2 characters. If you don't want that restriction, just get
rid of the ">" and "<". Similarly for the grep solution.
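To make the effect of the ">" and "<" delimiters concrete, here is a small sketch. The helper name "contains", the NEEDLE environment variable and the sample page are made up for illustration; they are not part of the scripts above.

```shell
#!/bin/sh
# Hypothetical demo: the file contents and the helper name are assumptions.
tmp=$(mktemp -d) || exit 1
trap 'rm -rf "$tmp"' EXIT

cat > "$tmp/page.htm" <<'EOF'
<option value="WPA1">WPA1</option>
<td>Privacy settings</td>
EOF

# contains TEXT FILE: succeed if TEXT occurs literally on some line of FILE.
# TEXT is passed through the environment so awk does not interpret backslashes.
contains() {
    NEEDLE=$1 awk '
        BEGIN { s = ENVIRON["NEEDLE"] }
        index($0, s) { found = 1 }
        END { exit !found }' "$2"
}

for s in "WPA1" "Privacy"; do
    # With delimiters: the string must sit directly between ">" and "<".
    if contains ">${s}<" "$tmp/page.htm"; then echo "between tags: $s"; fi
    # Without delimiters: the string may appear anywhere on a line.
    if contains "$s" "$tmp/page.htm"; then echo "anywhere: $s"; fi
done
```

With this sample page, "WPA1" is reported both ways, while "Privacy" is only reported as "anywhere" because it is followed by " settings" rather than "<".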
If you'd like the awk script to tell you which strings are/aren't used,
that's trivial, e.g.:
    gawk 'NR==FNR { strings[$0]++; next }   # 1st file: load the strings
    {
        for (string in strings)
            if (index($0, ">" string "<")) {
                usedStrings[string]++
                delete strings[string]   # for efficiency
            }
    }
    END {
        print "Used Strings:"
        for (string in usedStrings)
            printf "\t%s\n", string
        print "Unused Strings:"
        for (string in strings)      # whatever was never deleted above
            printf "\t%s\n", string
    }' allStrings.txt directory/*.htm
If there can be newlines in the strings you're trying to match in the
HTML files, then we need to figure out what "match" means, since there
aren't newlines in the strings in "allStrings.txt", and we need to pick
a record separator other than a newline char.
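Purely as a sketch of one possible answer: ASSUME a "match" means the string appears between ">" and "<" after every run of whitespace (newlines included) is collapsed to a single space and spaces right next to ">" or "<" are dropped. That definition is a guess, not something stated in the thread, and the function name "used_strings_multiline" is hypothetical. Rather than changing the record separator, this version just collects each .htm file into one buffer:

```shell
#!/bin/sh
# Sketch only. ASSUMPTION (not from the thread): whitespace runs, newlines
# included, count as one space, and spaces touching ">" or "<" are ignored.

# used_strings_multiline STRINGS_FILE HTM_FILE...
# Prints each line of STRINGS_FILE found in at least one HTM_FILE.
used_strings_multiline() {
    awk '
    # Normalise one whole file and test every string against it.
    function check(text,    s) {
        gsub(/[ \t\r\n]+/, " ", text)   # collapse whitespace runs
        gsub(/> /, ">", text)           # drop space after ">"
        gsub(/ </, "<", text)           # drop space before "<"
        for (s in strings)
            if (index(text, ">" s "<"))
                used[s] = 1
    }
    NR == FNR { strings[$0]++; next }                # 1st file: the strings
    FNR == 1 && buf != "" { check(buf); buf = "" }   # a new .htm file starts
    { buf = buf $0 "\n" }                            # accumulate current file
    END {
        check(buf)                                   # the last .htm file
        for (s in used) print s
    }' "$@"
}

# Hypothetical sample data, modelled on the snippet quoted above.
tmp=$(mktemp -d) || exit 1
trap 'rm -rf "$tmp"' EXIT
printf '%s\n' 'Automatic (AES or TKIP)' 'Privacy' > "$tmp/allStrings.txt"
cat > "$tmp/page.htm" <<'EOF'
<option value="AES_TKIP">Automatic
(AES or TKIP)</option>
<option value="AES">AES</option>
EOF

used_strings_multiline "$tmp/allStrings.txt" "$tmp/page.htm"
```

On the sample data this prints "Automatic (AES or TKIP)" even though the text is split across two lines in the .htm file. Each file is held in memory as one string, which should be fine for ordinary HTML pages; the order of the output lines is whatever awk's "for (s in used)" happens to produce.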
Ed.