Re: extracting text from docx files



On Tue, Aug 09, 2011 at 02:36:32PM +0100, Anton Shterenlikht wrote:
> I often receive information in *.docx format
> from my MS using colleagues. Sometimes I can
> ask for a pdf (or similar) instead, but not always.

You have a lot of nice options:
- Force them to use BSD/Linux ;)
- Explain to them why docx is shit!
- Don't read it


> Usually I unzip a docx and then search
> through all *xml files to find the
> useful data. However, I can't find any
> xml styles to use, so I have to convert
> the relevant xml file(s) to plain text
> by hand. I wonder if anybody can suggest
> a better way. Perhaps there's something
> in ports that can help.
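
The unzip-and-read-the-XML approach described above can be scripted
rather than done by hand. The following is only a minimal sketch, not a
tested converter: it assumes the main text lives in word/document.xml
inside the archive and that paragraphs, text runs, tabs and line breaks
use the standard WordprocessingML elements (w:p, w:t, w:tab, w:br). The
function name docx_to_text is made up for this illustration. Headers,
footers, footnotes and comments live in other XML parts and are ignored,
and text inside text boxes may show up twice.

```python
import sys
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(path):
    """Return the body text of a .docx file, one line per paragraph.

    Hypothetical helper: 'path' may be a file name or a file-like object.
    """
    # A .docx file is a zip archive; the body is assumed to be in this member.
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter("{%s}p" % W_NS):
        parts = []
        for node in para.iter():
            if node.tag == "{%s}t" % W_NS:
                parts.append(node.text or "")   # literal text run
            elif node.tag == "{%s}tab" % W_NS:
                parts.append("\t")              # tab character
            elif node.tag == "{%s}br" % W_NS:
                parts.append("\n")              # manual line break
        paragraphs.append("".join(parts))
    return "\n".join(paragraphs)


if __name__ == "__main__":
    # Print the text of the file named on the command line.
    print(docx_to_text(sys.argv[1]))
```

Saved under some name such as docx2txt.py (also just an example), it
could be run as "python docx2txt.py report.docx > report.txt".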

But if you really, really need to read docx, you can try the web
application from Microsoft. A few months ago, I also got a lot of docx
files and opened them with the Microsoft web app; this worked for me to
extract the information.

More information:
http://office.microsoft.com/en-us/web-apps/

The downside: you have to sign up for a Microsoft service :(

cheers

--
Christian Barthel
Public-Key: http://bc.user-mode.org/bc.asc
Mail: bc@xxxxxxxxxxxxxxxxx
Web: http://bc.user-mode.org
_______________________________________________
freebsd-questions@xxxxxxxxxxx mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to "freebsd-questions-unsubscribe@xxxxxxxxxxx"


