December 2, 2018
Answered

Text and Data Mining and PDF


Hi,

from the perspective of "Text and Data Mining" (https://www.rightsdirect.com/text-and-data-mining/), do I need to take any extra steps when setting up an InDesign document for export to PDF? I ask in relation to preparing scientific publications.

I was thinking of XML tags, but to be honest I'm not sure how they translate to the final PDF. Is PDF a good format for text and data mining?

Would be grateful for any hints.

Peter


    3 replies

    December 7, 2018

    Thank you guys for your hints. I am now taking a closer look at accessibility options and document structure in InDesign.

    Colin Flashman
    Community Expert
    Correct answer
    December 5, 2018

    I would imagine that this would work in a similar fashion to creating PDFs intended for accessibility purposes. This page may have more information: https://indesignsecrets.com/creating-accessible-pdfs.php

    If the answer wasn't in my post, perhaps it might be on my blog at colecandoo!
    Bevi Chagnon - PubCom.com
    Legend
    December 5, 2018

    The free Lynda.com video about making accessible PDFs (noted above) is very elementary.

    It takes a helluva lot more to make an accessible PDF than what's covered in the video!

    For those who want to explore this niche of publishing, you'll need a full course of instruction on accessible PDFs, the accessibility standards, and the procedures to follow in InDesign.

    Bevi Chagnon | Designer, Trainer, & Technologist for Accessible Documents | PubCom | Classes & Books for Accessible InDesign, PDFs & MS Office
    David W. Goodrich
    Participating Frequently
    December 3, 2018

    As no one else has answered I'll throw in my inexpert two cents.  PDF is great for reproducing pages of text but that is not the same thing as preserving text or other data.  As Test Screen Name wrote a month ago in another context:

    > By the way, on PDF, there are no rules for rendering text at all. Each character has a position
    > on the page; that's where you see them, and all there is to it. The rules were used by another
    > app, to decide where each character goes. The editing in Acrobat does a near miraculous job
    > of running over the page, guessing where lines, words and paragraphs are, and giving a kind
    > of primitive editor.

    The key concept is "Each character has a position on the page; that's ... all there is to it."  That is really good for allowing accurate reproduction of a composed page.  One modest way to help Acrobat decipher the text in a PDF is to enable "Tagged Text" (the "Create Tagged PDF" checkbox in the export dialog) when exporting from InDesign -- but note that "Tagged Text" has a special meaning in PDF.

    Good luck!

    David

    December 4, 2018

    Hi David W. Goodrich, thank you for the contribution. I'm trying to understand this subject better.

    By "tagged text" do you mean wrapping the text in a tag such as "Introduction" or "Methods", so that the content is recognized as something particular?

    Whether or not it all comes down to character positions, the text in a PDF can still be extracted in a meaningful form (and is searchable).

    David W. Goodrich
    Participating Frequently
    December 5, 2018

    I have no firm idea of the definition of "Tagged Text" in Acrobat-speak, so I'm not the one to ask.  I just know that it isn't anything like "Tagged" as in HTML/XML tags.  Try exporting the same page from ID to PDF with and without Tagging, then copy-and-paste the same passage from each PDF into a text editor (such as Windows Notepad) and compare the results.
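
The copy-and-paste comparison suggested above can also be done with a diff instead of by eye. The following is only a minimal sketch using Python's standard `difflib`; the two sample strings are made up to stand in for text pasted from an untagged and a tagged export, and the file labels are hypothetical.

```python
import difflib
from typing import List


def compare_extractions(untagged: str, tagged: str) -> List[str]:
    """Return unified-diff lines between two pasted text extractions."""
    return list(
        difflib.unified_diff(
            untagged.splitlines(),
            tagged.splitlines(),
            fromfile="untagged.txt",  # hypothetical label, not a real file
            tofile="tagged.txt",      # hypothetical label, not a real file
            lineterm="",
        )
    )


# Made-up samples standing in for the same passage pasted from each PDF.
untagged_paste = "Intro duction\nText mining needs clean text."
tagged_paste = "Introduction\nText mining needs clean text."

for line in compare_extractions(untagged_paste, tagged_paste):
    print(line)
```

Lines starting with `-` come from the first paste and lines starting with `+` from the second, so any difference in word breaks or reading order shows up directly.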

    Another way to look at text in a PDF is that there are no word-spaces per se, just distances between characters.  This means that any software trying to extract text from a PDF must guess where the word-boundaries fall, based on the distance between where one character ends and the next begins.  Letter-spacing can cause serious confusion, not to mention tables or formats like bulleted lists.
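
To make the word-boundary guessing concrete, here is a rough sketch of a gap-threshold heuristic. It is an illustration only, not how Acrobat or any particular extractor works: the `Glyph` type, the `layout` helper, and every measurement are assumptions invented for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph on the line, in points
    width: float  # advance width of the glyph, in points


def guess_words(glyphs: List[Glyph], gap_threshold: float) -> str:
    """Rebuild one line of text, inserting a space wherever the gap is wide."""
    pieces = []
    prev_end = None
    for g in sorted(glyphs, key=lambda g: g.x):
        if prev_end is not None and g.x - prev_end > gap_threshold:
            pieces.append(" ")
        pieces.append(g.char)
        prev_end = g.x + g.width
    return "".join(pieces)


def layout(text: str, advance: float, tracking: float = 0.0,
           word_space: float = 3.0) -> List[Glyph]:
    """Hypothetical typesetter: position characters and keep no space glyphs."""
    glyphs = []
    x = 0.0
    for ch in text:
        if ch == " ":
            x += word_space  # a word space survives only as extra distance
            continue
        glyphs.append(Glyph(ch, x, advance))
        x += advance + tracking
    return glyphs


normal = layout("data mining", advance=5.0)
tracked = layout("data mining", advance=5.0, tracking=2.0)

print(guess_words(normal, gap_threshold=1.5))   # data mining
print(guess_words(tracked, gap_threshold=1.5))  # d a t a m i n i n g
```

With normal spacing the threshold finds the single word break, but once the letters are spaced out by 2 points every gap exceeds the threshold and the words fall apart, which is the letter-spacing confusion described above.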

    Don't get me wrong, PDF is superlative for reproducing composed pages for humans to read: it preserves the spacing, and uses the actual fonts (unless you fail to embed them).  It lets me typeset Chinese, Arabic and alphabetic text all in the same line using the fonts I specify.  But text and data for machine processing or reading may be better off in another format, such as HTML or XML.
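
By way of contrast, here is a small sketch of why structured markup is easier to mine. It assumes an invented XML layout (the `article`, `section` and `p` element names and the `type` attribute are hypothetical, not a real publishing schema) and uses only Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical article markup; element and attribute names are illustrative.
ARTICLE_XML = """
<article>
  <title>Sample Study</title>
  <section type="introduction"><p>Text mining needs structure.</p></section>
  <section type="methods"><p>We exported the layout as XML.</p><p>Then we parsed it.</p></section>
</article>
"""


def section_text(xml_source: str, section_type: str) -> str:
    """Return the paragraph text of every section with the given type."""
    root = ET.fromstring(xml_source.strip())
    paragraphs = []
    for section in root.iter("section"):
        if section.get("type") == section_type:
            for p in section.iter("p"):
                paragraphs.append("".join(p.itertext()).strip())
    return " ".join(paragraphs)


print(section_text(ARTICLE_XML, "methods"))
# We exported the layout as XML. Then we parsed it.
```

Here the words, paragraphs and section roles are stated in the file itself, so nothing has to be guessed from positions on a page.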