Re: Parse irregular data, dump into delimited text file

From: William James (w_a_x_man_at_yahoo.com)
Date: 11/27/05


Date: 27 Nov 2005 13:52:39 -0800

d@rren.cymraeg.org wrote:
> I've been given an MS Word document containing information to input into
> a database. I've knocked it into shape using various unix tools eg.
> sed, cut etc. so now I have data in a plain text file like this :
>
> name|address|postcode|telephone
>
> The address field contains data in an irregular form, eg.
> 12, the high street, town, place, biggerplace
> The vicarage, town place
>
> I need to be able to format address field above ready for importing
> into another database. In this new database, I have 3 fields for
> address (address1, address2, address3).
>
> So my problem is how to cut this address data and then put it back in a
> text file with delimiter. address3 should contain only one word,
> however, address1 and address2 may contain more than one word. When
> filling in the fields, data should be added from left to right, or in
> the order address1 then address2 then address3. If there is an address
> field left with a blank, that is not a problem as it will be handled by
> the mailmerge software.

This input

name1|12, high street, town, place, biggerplace|postcode1|telephone1
name2|The vicarage, town, place|postcode2|telephone2
name3|The vicarage, place|postcode2|telephone2

produces this output

name1|12 high street|town, place|biggerplace|postcode1|telephone1
name2|The vicarage|town|place|postcode2|telephone2
name3|The vicarage|place||postcode2|telephone2

The language is Ruby:

# Read each line of the file given on the command line.
ARGF.each { |line|
  # Remove the newline at the end of the string.
  line.chomp!

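  # Split the string into an array. The -1 limit keeps trailing
  # empty fields (e.g. a blank telephone), which split would
  # otherwise silently drop.
  array = line.split( "|", -1 )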

  # Split the address field on commas, removing surrounding
  # whitespace.
  address = array[1].split( /\s*,\s*/ )

  # I'm assuming that if the first part of the address is
  # a number, it should be combined with the next part.
  if address[0] =~ /^\d+$/
    address[0..1] = address[0..1].join( " " )
  end

  if address.size > 3
    # Combine all but the first and last part into one part.
    address[1..-2] = address[1..-2].join( ", " )
  else
    # Pad with empty strings until there are exactly 3 parts.
    address.push( "" ) while address.size < 3
  end

  array[1] = address

  # Assuming that the output field-separator is "|".
  puts array.flatten.join( "|" )
}
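
The same steps can also be wrapped in methods, so they can be checked against the sample lines without going through a file. This is only a sketch: the names split_address and format_line are made up for illustration, and passing -1 to split assumes that trailing empty fields (say, a missing telephone) should be kept in the output.

```ruby
# Hypothetical helper: turn one address field into exactly three parts.
def split_address(field)
  # Split on commas, dropping the whitespace around them.
  parts = field.to_s.split(/\s*,\s*/)

  # A bare house number is joined with the part that follows it.
  if parts[0].to_s.match?(/\A\d+\z/)
    parts[0..1] = parts[0..1].join(" ")
  end

  # More than 3 parts: fold everything between first and last into one.
  parts[1..-2] = parts[1..-2].join(", ") if parts.size > 3

  # Fewer than 3 parts: pad on the right with empty strings.
  parts.push("") while parts.size < 3
  parts
end

# Hypothetical helper: reformat one "|"-delimited record.
def format_line(line)
  # The -1 limit keeps trailing empty fields (an assumption, see above).
  fields = line.chomp.split("|", -1)
  fields[1] = split_address(fields[1])
  fields.flatten.join("|")
end

sample = [
  "name1|12, high street, town, place, biggerplace|postcode1|telephone1",
  "name2|The vicarage, town, place|postcode2|telephone2",
  "name3|The vicarage, place|postcode2|telephone2"
]
sample.each { |line| puts format_line(line) }
```

To run the original script on a real file, pass the file name on the command line, since ARGF reads whatever files are named there (or standard input if none are), e.g. ruby script.rb input.txt > output.txt, where both file names are placeholders.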


