How to extract text data

Report · Jan 19, 2021

Hello ID masters, first ID post here.

I don't use ID at all, but I was tasked with extracting all data from the Design dept Linesheets. The documents are not organized in layers, everything is in one layer and text frames have no names.

In Illustrator I would probably get all text frames and get their position on the page to tell what text frames belong to each style.

I did some preliminary testing and I was able to get data for each page in ID.

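var idoc = app.activeDocument;
var p = idoc.pages[1]; // pages are zero-indexed, so this is the second page

// Top-level text frames only; frames nested inside groups are not included
var tframes = p.textFrames;

for (var a = 0; a < tframes.length; a++) {
    var tframe = tframes[a]; // declared with var to avoid leaking a global

    $.writeln(tframe.contents);
    // geometricBounds is [top, left, bottom, right]
    $.writeln(tframe.geometricBounds);
}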

Is there a better way to extract the data in InDesign, either manually or with scripting?

thanks in advance

Carlos

Report · Jan 19, 2021

This third-party product extracts text from InDesign documents:

https://www.rorohiko.com/wordpress/indesign-downloads/textexporter-4/

Report · Jan 20, 2021

Hi Carlos,

No solution is inherently better than the others. The one that works for you is the best by definition.

That said, the best way to help would be for you to provide a sample of the data as you would like it to be output.

On a more general note, I tend to avoid parsing objects at their first hierarchy level (i.e. page.textFrames…) because you can have groups or nested objects that you would miss. I generally prefer to use allPageItems (which returns elements no matter how deeply they are nested) and then filter by type. Or GREP, if I know exactly what to search for.
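A minimal sketch of the allPageItems approach described above, not a definitive implementation. The helper name collectTextFrames is hypothetical, and it assumes the type check is done through constructor.name, a common idiom in InDesign scripts.

```javascript
// Hypothetical helper: returns every text frame on a page, including
// frames nested inside groups, by filtering allPageItems by type.
function collectTextFrames(page) {
    var items = page.allPageItems; // flat array, nested items included
    var frames = [];
    for (var i = 0; i < items.length; i++) {
        // Keep only items whose scripting type is TextFrame
        if (items[i].constructor.name === "TextFrame") {
            frames.push(items[i]);
        }
    }
    return frames;
}

// Usage inside InDesign (not executed here):
// var frames = collectTextFrames(app.activeDocument.pages[1]);
```

Unlike page.textFrames, this also picks up frames that sit inside groups.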

In both cases, it seems like you will need some computation here.

Report · Jan 20, 2021

Hi Loic, see? That shows I know nothing about ID. I didn't know there was an allPageItems; I just wrote the script as if it were Illustrator. Thanks for the tip.

Ideally, I want to extract the fields in order from top to bottom (left to right where it applies) for each style, to more or less know what they are and put them in the right column. I can ignore the big title on top of each page. I've noticed each style is pretty consistent with the number of fields, except the "PROTO", not all styles have it. Also some styles might not have the note in column H.

The only fixed text is "Rise=", "I=", "LO=" and the "$" sign in the bottom field; everything else changes per style.
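Since those four markers are the only fixed text, one way to recognize a field is to check which markers its contents include. This is only a sketch: the function name findFixedMarkers is hypothetical, and it assumes a plain case-sensitive substring match is good enough for these documents.

```javascript
// The fixed strings mentioned in the thread; everything else varies per style.
var FIXED_MARKERS = ["Rise=", "I=", "LO=", "$"];

// Hypothetical helper: returns the fixed markers found in a frame's contents,
// in the order they are listed above. Returns an empty array if none match.
function findFixedMarkers(text) {
    var found = [];
    for (var i = 0; i < FIXED_MARKERS.length; i++) {
        if (text.indexOf(FIXED_MARKERS[i]) !== -1) {
            found.push(FIXED_MARKERS[i]);
        }
    }
    return found;
}
```

A frame with no markers would then have to be identified by its position alone.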

thanks!

Report · Jan 22, 2021

Hi Carlos,

Thanks for that, it makes things clearer indeed. The toughest point in your project is making sure the code is consistent enough to handle every possible scenario. But let's say that will be the case. Then the second pain point is that there are two references per page, separated at the middle of the page.

There are really multiple solutions, and it's hard to say one is preferable, except maybe in terms of performance and code complexity. If I had to do this, I would personally use the GREP engine to retrieve text (see doc.findGrep and the associated properties and options). I find GREP convenient for catching text, but one may disagree.

GREP searches return an array of text objects (literally the text instances, not their contents as strings). For every text, I would get the associated page and check whether the object is on the right or left side of the page. Finally, I would look at its content and properties to determine what kind of field it is.
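A rough sketch of such a GREP search, to be run inside InDesign. The helper name findTextByGrep and the example pattern are hypothetical; app.findGrepPreferences, NothingEnum.NOTHING and findGrep() belong to the InDesign scripting model.

```javascript
// Hypothetical helper: runs a GREP search over a document and returns
// the matching Text objects (not plain strings).
function findTextByGrep(doc, pattern) {
    // Clear settings left over from any earlier search
    app.findGrepPreferences = NothingEnum.NOTHING;
    app.findGrepPreferences.findWhat = pattern;
    var found = doc.findGrep();
    // Clear again so later searches start from a clean state
    app.findGrepPreferences = NothingEnum.NOTHING;
    return found;
}

// Usage inside InDesign (not executed here); the pattern is only an example:
// var hits = findTextByGrep(app.activeDocument, "Rise=.*");
// var frame = hits[0].parentTextFrames[0]; // frame holding the match
// var page = frame.parentPage;             // page holding that frame
```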
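The left/right check could look something like the sketch below. The function name pageSide is hypothetical; it assumes both arguments use InDesign's [top, left, bottom, right] order, as geometricBounds and page.bounds do, and that the two references split at the horizontal midpoint of the page.

```javascript
// Hypothetical helper: tells which half of the page an item sits on,
// based on the horizontal center of its bounds.
// Both arguments are [top, left, bottom, right] arrays.
function pageSide(itemBounds, pageBounds) {
    var itemCenterX = (itemBounds[1] + itemBounds[3]) / 2;
    var pageCenterX = (pageBounds[1] + pageBounds[3]) / 2;
    return itemCenterX < pageCenterX ? "left" : "right";
}

// Usage inside InDesign (not executed here):
// var side = pageSide(frame.geometricBounds, frame.parentPage.bounds);
```

Comparing against the page's own bounds rather than a fixed number keeps the check valid for right-hand pages in a facing-pages spread, where coordinates may not start at zero.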

Once I would have all necessary data, I would reconstruct the output data.

I can imagine that it seems tough to achieve, but it's not that hard. It's just a bit too long for me to code and deliver. But if you start writing something, feel free to ask for some assistance.

Best

Loic

Report · Jan 23, 2021

Hi Loic, thanks for the additional pointers.

I'm going by finding each text per page, sorting their position first by left then by top. I'm making progress, it looks promising.
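For reference, the sort described above might be sketched like this. It is not the script actually used in the thread: the function name sortByLeftThenTop and the record shape { contents, bounds } are hypothetical, and it assumes frames in the same column share exactly the same left coordinate (real layouts may need a tolerance).

```javascript
// Hypothetical helper: orders records by left edge first, then by top edge.
// Each record is { contents: String, bounds: [top, left, bottom, right] }.
function sortByLeftThenTop(records) {
    var sorted = records.slice(0); // copy so the input array is not mutated
    sorted.sort(function (a, b) {
        var dLeft = a.bounds[1] - b.bounds[1];
        if (dLeft !== 0) {
            return dLeft;
        }
        // Same left edge: the higher frame (smaller top) comes first
        return a.bounds[0] - b.bounds[0];
    });
    return sorted;
}

// Usage inside InDesign (not executed here), with frames collected beforehand:
// records.push({ contents: frame.contents, bounds: frame.geometricBounds });
```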

thanks!
