Anonymous

Answered

Text and Data Mining and PDF

December 2, 2018
3 replies
1419 views

Hi,

from the perspective of "Text and Data Mining" (https://www.rightsdirect.com/text-and-data-mining/), do I need to take any extra steps when setting up an InDesign document to be exported to PDF? I ask in relation to preparing scientific publications.

I would think of XML tags, but to be honest I'm not sure how they translate to the final PDF. Is PDF a good format for text and data mining?

Would be grateful for any hints.

Peter

Anonymous

Thank you guys for your hints. I am now taking a closer look at accessibility options and document structure in InDesign.

Colin Flashman

Correct answer

Community Expert

I would imagine that this would work in a similar fashion to creating PDFs intended for accessibility purposes. This page may have more information: https://indesignsecrets.com/creating-accessible-pdfs.php

If the answer wasn't in my post, perhaps it might be on my blog at colecandoo!

Bevi Chagnon - PubCom.com

Legend

The free Lynda.com video about making accessible PDFs (noted above) is very elementary.

It takes a helluva lot more to make an accessible PDF than what's covered in the video!

For those who want to explore this niche of publishing, you'll need a full course of instruction on accessible PDFs, the accessibility standards, and the procedures to do in InDesign.

Bevi Chagnon | Designer, Trainer, & Technologist for Accessible Documents | PubCom | Classes & Books for Accessible InDesign, PDFs & MS Office

David W. Goodrich

Participating Frequently

As no one else has answered I'll throw in my inexpert two cents. PDF is great for reproducing pages of text but that is not the same thing as preserving text or other data. As Test Screen Name wrote a month ago in another context:

> By the way, on PDF, there are no rules for rendering text at all. Each character has a position on the page; that's where you see them, and all there is to it. The rules were used by another app, to decide where each character goes. The editing in Acrobat does a near miraculous job of running over the page, guessing where lines, words and paragraphs are, and giving a kind of primitive editor.

The key concept is "Each character has a position on the page; that's ... all there is to it." That is really good for allowing accurate reproduction of a composed page. One modest way to help Acrobat decipher the text in a PDF is to enable "Tagged Text" when exporting from InDesign -- but note that "Tagged Text" has a special meaning in PDF.

Good luck!

David

Anonymous

Hi David W. Goodrich, thank you for the contribution. I'm trying to understand this subject better.

By "tagged text" do you mean wrapping the text in a tag such as "Introduction" or "Methods", so that the content is recognized as something particular?

Whether or not it comes down to the position of each character, the text in a PDF can still be extracted as meaningful text (and it is searchable).

David W. Goodrich

Participating Frequently

I have no firm idea of the definition of "Tagged Text" in Acrobat-speak, so I'm not the one to ask. All I know is that it isn't anything like "Tagged" as in HTML/XML tags. Try exporting the same page from ID to PDF with and without tagging, then copy and paste the same passage from each PDF into a text editor (such as Windows Notepad) and compare the results.
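If eyeballing the two pasted passages gets tedious, a few lines of Python can show the differences. This is only a rough sketch using the standard-library `difflib` module; the function name `compare_extractions` and the sample strings are made up for illustration, so paste in your own text from the two PDFs.

```python
import difflib


def compare_extractions(untagged: str, tagged: str) -> list[str]:
    """Return a unified diff of text pasted from the untagged and tagged PDFs."""
    return list(
        difflib.unified_diff(
            untagged.splitlines(),
            tagged.splitlines(),
            fromfile="untagged.txt",  # labels only, no files are read
            tofile="tagged.txt",
            lineterm="",
        )
    )


if __name__ == "__main__":
    # Hypothetical sample data: replace with text copied from your own PDFs.
    pasted_untagged = "Text and Data\nMining"
    pasted_tagged = "Text and Data Mining"
    for line in compare_extractions(pasted_untagged, pasted_tagged):
        print(line)
```

An empty result means the two extractions are identical; lines starting with `-` appear only in the untagged paste and lines starting with `+` only in the tagged one.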

Another way to look at text in a PDF is that there are no word-spaces per se, just distances between characters. This means that any software trying to extract text from a PDF must guess where the word-boundaries fall, based on the distance between where one character ends and the next begins. Letter-spacing can be a serious confusion, not to mention tables or formats like bulleted lists.
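To make the guessing concrete, here is a toy sketch in Python. It is not how Acrobat or any real extraction tool works; the `Glyph` class, the `layout` helper, the fixed character widths and the `gap_ratio` threshold are all assumptions invented for the illustration. It only shows the idea: the extractor sees positioned characters and has to decide which gaps count as word-spaces.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph on the line, in points
    width: float  # advance width of the glyph, in points


def guess_words(glyphs, gap_ratio=0.3):
    """Rebuild a line of text, inserting a space wherever the gap looks wide."""
    glyphs = sorted(glyphs, key=lambda g: g.x)
    if not glyphs:
        return ""
    out = [glyphs[0].char]
    for prev, cur in zip(glyphs, glyphs[1:]):
        gap = cur.x - (prev.x + prev.width)
        # Assumed heuristic: a gap wider than 30% of a glyph is a word-space.
        if gap > gap_ratio * max(prev.width, cur.width):
            out.append(" ")
        out.append(cur.char)
    return "".join(out)


def layout(text, char_width=5.0, tracking=0.0, space_width=2.5):
    """Hypothetical typesetter: place characters and drop the spaces themselves."""
    glyphs, x = [], 0.0
    for ch in text:
        if ch == " ":
            x += space_width  # a space only moves the pen; no glyph is stored
            continue
        glyphs.append(Glyph(ch, x, char_width))
        x += char_width + tracking
    return glyphs


if __name__ == "__main__":
    print(guess_words(layout("data mining")))                # normal spacing
    print(guess_words(layout("data mining", tracking=2.0)))  # letter-spaced
```

With normal spacing the sketch recovers "data mining", but with 2 pt of letter-spacing every gap exceeds the threshold and the same text comes back as "d a t a m i n i n g". Raising the threshold would fix this line and break some other one, which is the confusion described above.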

Don't get me wrong, PDF is superlative for reproducing composed pages for humans to read: it preserves the spacing, and uses the actual fonts (unless you fail to embed them). It lets me typeset Chinese, Arabic and alphabetic text all in the same line using the fonts I specify. But text and data for machine processing or reading may be better off in another format, such as HTML or XML.
