diogoferreira
Inspiring
April 1, 2019
Question

Read word document (.docx) lines


Hi,

I am trying to read the lines of a Word document (.docx).

First, I tried with a .txt file (UTF-8) and it worked perfectly with this code:

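    // Open the text file; script_file_path is defined elsewhere in the script.
    var myTextFile = new File(script_file_path + "/test.txt");
    myTextFile.encoding = "UTF-8"; // the test file is saved as UTF-8
    var lines = [];

    if (myTextFile.open("r")) {
        // Read one line at a time until the end of the file.
        while (!myTextFile.eof) {
            lines.push(myTextFile.readln());
        }
        myTextFile.close();
    }

    // Print the second line (array indexes start at 0).
    $.writeln(String(lines[1]));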

But when I import the word document, the output from lines[] is something like this:

l"%3˜fi3V∆É—⁄öl µw%Î=ñÅì^i7+Ÿ◊‰-d&·î0fiA…6Äl4ºΩL60#µ√íÕS

Would it be a file encoding problem? How can I solve that?

Can anyone help please?

Thank you very much!


1 reply

Jongware
Community Expert
April 2, 2019

It is not a simple encoding problem.

As you have now found out for yourself (and could have found right away with a quick Google search for "docx file format"), a "Word document" is something quite different from a plain text document. You might have expected that: a Word document can contain tables, footnotes, fonts, images, colors, italics, and much more than can be expressed in "plain text" -- and there must be some way to store all that information in a file, other than "as plain text".

Word solves this by writing out all of its data as XML, and not in a single file either -- just like IDML, there are various major and minor parts that are better saved as separate files, so a typical Word file consists of a dozen or more XML files, each with a function of its own (the body text itself lives in "word/document.xml"). On top of that, non-native items such as embedded images, worksheets, fonts, and equations are also stored in their entirety.

Most of these files are stored in subfolders with sensible names such as "_rels" and "docProps", and then the entire package is compressed into a single zip file. There is your docx, then -- no chance at all of reading out "the" lines of text with readln(). There never were any plain lines of text to begin with.
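You can confirm the zip part from a script before trying anything else. The sketch below is only an illustration: the function name looksLikeZip is made up, and the commented-out usage assumes ExtendScript's File object with its "BINARY" encoding setting. A regular zip archive normally starts with the four bytes "PK", 0x03, 0x04, and that is all the function checks.

```javascript
// Returns true when the first four characters match the zip
// local-file-header signature "PK\x03\x04".
function looksLikeZip(header) {
    return typeof header === "string" &&
        header.length >= 4 &&
        header.charCodeAt(0) === 0x50 && // "P"
        header.charCodeAt(1) === 0x4B && // "K"
        header.charCodeAt(2) === 0x03 &&
        header.charCodeAt(3) === 0x04;
}

// Hypothetical ExtendScript usage (File only exists inside an Adobe host,
// and "test.docx" is an assumed file name):
// var f = new File(script_file_path + "/test.docx");
// f.encoding = "BINARY";
// if (f.open("r")) {
//     $.writeln(looksLikeZip(f.read(4)) ? "zip package" : "not a zip");
//     f.close();
// }
```

If this reports a zip package, readln() is reading compressed bytes, which is exactly the garbage shown in the question.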

If you need a script to use the plain text anyway, open the document in Word and save as "Plain Text". You will lose all of the formatting and special items, of course; it's called "plain" text for a reason.
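If re-saving from Word is not an option, the text can still be dug out by hand. As far as I know ExtendScript has no built-in unzip, so this assumes you have already extracted the package yourself (rename the .docx to .zip and unpack it) and read "word/document.xml" into a string as UTF-8. The function names below are made up for illustration. The sketch relies on two facts about the format: each paragraph is a w:p element, and the visible text sits in w:t elements inside it. It is a regex-based shortcut rather than a real XML parser, so it ignores tabs and line breaks (w:tab, w:br), does not decode numeric character references, and will get confused by paragraphs nested inside text boxes.

```javascript
// Decode the five predefined XML entities. "&amp;" goes last so that
// "&amp;lt;" ends up as the literal text "&lt;".
function decodeXmlEntities(s) {
    return s.replace(/&lt;/g, "<")
            .replace(/&gt;/g, ">")
            .replace(/&quot;/g, '"')
            .replace(/&apos;/g, "'")
            .replace(/&amp;/g, "&");
}

// Return one string per <w:p> paragraph, joining the text of its <w:t> runs.
// Empty paragraphs (including self-closed <w:p/>) become empty strings.
function extractParagraphs(documentXml) {
    var paragraphs = [];
    // Matches <w:p/>, <w:p>...</w:p> and <w:p attr="...">...</w:p>,
    // but not <w:pPr> or other tags that merely start with "w:p".
    var paraRe = /<w:p(?:\s[^>]*?)?(?:\/>|>[\s\S]*?<\/w:p>)/g;
    // Matches <w:t> and <w:t xml:space="preserve">, but not <w:tab/> or <w:tbl>.
    var textRe = /<w:t(?:\s[^>]*)?>([\s\S]*?)<\/w:t>/g;
    var para, run, text;
    while ((para = paraRe.exec(documentXml)) !== null) {
        text = "";
        textRe.lastIndex = 0; // the regex is reused on a new string each time
        while ((run = textRe.exec(para[0])) !== null) {
            text += run[1];
        }
        paragraphs.push(decodeXmlEntities(text));
    }
    return paragraphs;
}
```

The returned array can then stand in for the lines[] array from the question, with one entry per Word paragraph rather than per physical line.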