Read Word document (.docx) lines

Question

Hi,

I am trying to read the lines of a Word document (.docx).

First, I tried with a .txt file (UTF-8) and it worked perfectly with this code:

    var myTextFile = File(script_file_path + "/test.txt");
    var lines = [];
    myTextFile.open('r');
    while (!myTextFile.eof) {
        lines[lines.length] = myTextFile.readln();
    }
    myTextFile.close();
    $.writeln(String(lines[1]));

But when I read the Word document the same way, the contents of lines[] look something like this:

l"%3˜ﬁ3V∆É—⁄öl µw%Î=ñÅì^i7+Ÿ◊‰-d&·î0ﬁA…6Äl4ºΩL60#µ√íÕS

Would it be a file encoding problem? How can I solve that?

Can anyone help please?

Thank you very much!

Jongware · Answer

It is not a simple encoding problem.

As you have now found out for yourself (and could have found right away with a quick Google search for "docx file format"), a "Word document" is something quite different from a plain text document. You might have expected that: a Word document can contain tables, footnotes, fonts, images, colors, italics, and much more than can be expressed in "plain text", and there must be some way to store all that information in a file other than "as plain text".

Word solves this by writing out all of its data as XML, and not in a single file either. Just like IDML, there are various major and minor parts that are better saved as separate files, so a typical Word file consists of a dozen or more XML files, each with a function of its own. On top of that, non-native Word content such as embedded images, worksheets, fonts, and equations is also stored in its entirety.

All of these files are stored in separate subfolders with sensible names such as "_rels" and "docProps", and then the entire package is compressed and stored in a single zip file. There is your docx, then: no chance at all of reading out "the" lines of text. There never were any plain lines of text to begin with.

If you need a script to use the plain text anyway, open the document in Word and save it as "Plain Text". You will lose all of the formatting and special items, of course; it's called "plain" text for a reason.
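If saving as plain text is not an option, the package structure described above suggests another route: unzip the .docx by hand (a renamed copy with a .zip extension opens in any archive tool) and read the main body part, word/document.xml, which is ordinary UTF-8 text. In that XML, paragraphs are <w:p> elements and the visible text sits in <w:t> elements. The following is only a rough sketch under those assumptions, not a real docx parser. The function names extractDocxParagraphs and decodeXmlEntities are made up for this example, and the code ignores tabs, line breaks, numeric character references, and paragraphs nested inside text boxes.

```javascript
// Hypothetical helper: replaces the five predefined XML entities.
// Numeric character references (&#...;) are NOT handled in this sketch.
function decodeXmlEntities(text) {
    return text.replace(/&lt;/g, "<")
               .replace(/&gt;/g, ">")
               .replace(/&quot;/g, '"')
               .replace(/&apos;/g, "'")
               .replace(/&amp;/g, "&"); // last, so "&amp;lt;" becomes "&lt;"
}

// Hypothetical helper: takes the contents of word/document.xml as a string
// and returns an array with one plain-text entry per paragraph.
function extractDocxParagraphs(documentXml) {
    var paragraphs = [];
    // Opening (or self-closing) <w:p> tag; does not match <w:pPr> and friends.
    var openTag = /<w:p(?:\s[^>]*)?\/?>/g;
    var match;
    while ((match = openTag.exec(documentXml)) !== null) {
        // A self-closing <w:p/> is an empty paragraph.
        if (match[0].charAt(match[0].length - 2) === "/") {
            paragraphs.push("");
            continue;
        }
        var end = documentXml.indexOf("</w:p>", openTag.lastIndex);
        if (end === -1) {
            break; // malformed input: stop rather than guess
        }
        var body = documentXml.substring(openTag.lastIndex, end);
        // Concatenate the text of every <w:t> element in this paragraph.
        var textPattern = /<w:t(?:\s[^>]*)?>([^<]*)<\/w:t>/g;
        var textMatch;
        var text = "";
        while ((textMatch = textPattern.exec(body)) !== null) {
            text += textMatch[1];
        }
        paragraphs.push(decodeXmlEntities(text));
        openTag.lastIndex = end; // continue scanning after this paragraph
    }
    return paragraphs;
}
```

In a script like the one in the question, the input string could come from opening the extracted word/document.xml with File, setting its encoding to "UTF-8", and calling read() instead of readln(). Each entry of the returned array then plays the role of one "line".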
