Tagged PDF. From ISO 32000-1 (the ISO Standard for PDF).

14.8 Tagged PDF

14.8.1 General

Tagged PDF (PDF 1.4) is a stylized use of PDF that builds on the logical structure framework described in 14.7, “Logical Structure.” It defines a set of standard structure types and attributes that allow page content (text, graphics, and images) to be extracted and reused for other purposes. A tagged PDF document is one that conforms to the rules described in this sub-clause. A conforming writer is not required to produce tagged PDF documents; however, if it does, it shall conform to these rules.

NOTE 1 It is intended for use by tools that perform the following types of operations:
• Simple extraction of text and graphics for pasting into other applications
• Automatic reflow of text and associated graphics to fit a page of a different size than was assumed for the original layout
• Processing text for such purposes as searching, indexing, and spell-checking
• Conversion to other common file formats (such as HTML, XML, and RTF) with document structure and basic styling information preserved
• Making content accessible to users with visual impairments (see 14.9, “Accessibility Support”)

A tagged PDF document shall conform to the following rules:
• Page content (14.8.2, “Tagged PDF and Page Content”). Tagged PDF defines a set of rules for representing text in the page content so that characters, words, and text order can be determined reliably. All text shall be represented in a form that can be converted to Unicode. Word breaks shall be represented explicitly. Actual content shall be distinguished from artifacts of layout and pagination. Content shall be given in an order related to its appearance on the page, as determined by the conforming writer.
• A basic layout model (14.8.3, “Basic Layout Model”). A set of rules for describing the arrangement of structure elements on the page.
• Structure types (14.8.4, “Standard Structure Types”).
A set of standard structure types define the meaning of structure elements, such as paragraphs, headings, articles, and tables.
• Structure attributes (14.8.5, “Standard Structure Attributes”). Standard structure attributes preserve styling information used by the conforming writer in laying out content on the page.

A Tagged PDF document shall also contain a mark information dictionary (see Table 321) with a value of true for the Marked entry.

NOTE 2 The types and attributes defined for Tagged PDF are intended to provide a set of standard fallback roles and minimum guaranteed attributes to enable conforming readers to perform operations such as those mentioned previously. Conforming writers are free to define additional structure types as long as they also provide a role mapping to the nearest equivalent standard types, as described in 14.7.3, “Structure Types.” Likewise, conforming writers can define additional structure attributes using any of the available extension mechanisms.

Section 14 of ISO 32000-1 expands on each of the rules to provide a detailed discussion.

Something of an overview.

PDF page content is painted to the page. An Adobe document ("AcrobatWorkshop_final.pdf") provides useful background. The content is often not placed in the PDF in a natural read order. Body text may be painted/drawn first, then the Header, followed by the Footer. Even the body text is often not painted/drawn in the order a human would expect.

A nicely detailed discussion of this is available here:
http://www.appligent.com/talkingpdf-eachpdfpageisapainting (Each PDF Page is a Painting - Why PDF "reading order" is irrelevant to accessibility)

So, we have content painted to the PDF page. As-is, that's not any help for repurposing content out to another file format or for Accessibility. This is where Logical Structure (Section 14.7, ISO 32000-1) comes into play.
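To picture the "painted order" versus "logical order" problem, here is a rough Python sketch. It is not how any PDF library works; the tuples, the `painted` list, and the `reading_order` function are all made up for illustration. The only idea borrowed from PDF is that each painted chunk can carry a marked-content ID (MCID) and that a separate tree of structure elements points at those IDs.

```python
# Hypothetical, simplified model (not a real PDF API).
# Page content in the order it was painted: (MCID, text).
painted = [
    (0, "Second paragraph of body text."),
    (1, "Figure caption."),
    (2, "Chapter heading"),
    (3, "First paragraph of body text."),
]

# Hypothetical structure tree: (structure type, kids).
# A kid is either another (type, kids) node or an integer MCID.
structure_tree = ("Document", [
    ("H1", [2]),
    ("P", [3]),
    ("P", [0]),
    ("Caption", [1]),
])


def reading_order(node, content_by_mcid):
    """Depth-first walk of the tree, collecting text in logical order."""
    _structure_type, kids = node
    texts = []
    for kid in kids:
        if isinstance(kid, int):
            # Leaf: look up the painted chunk this MCID refers to.
            texts.append(content_by_mcid[kid])
        else:
            texts.extend(reading_order(kid, content_by_mcid))
    return texts


for line in reading_order(structure_tree, dict(painted)):
    print(line)
```

The painted list starts with the second paragraph, but the walk over the tree starts with the heading. The logical order lives in the tree, not in the paint order.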
14.7.1 General

PDF’s logical structure facilities (PDF 1.3) shall provide a mechanism for incorporating structural information about a document’s content into a PDF file. Such information may include the organization of the document into chapters and sections or the identification of special elements such as figures, tables, and footnotes. The logical structure facilities shall be extensible, allowing conforming writers to choose what structural information to include and how to represent it, while enabling conforming readers to navigate a file without knowing the producer’s structural conventions.

PDF logical structure shares basic features with standard document markup languages such as HTML, SGML, and XML. A document’s logical structure shall be expressed as a hierarchy of structure elements, each represented by a dictionary object. Like their counterparts in other markup languages, PDF structure elements may have content and attributes. In PDF, rendered document content takes over the role occupied by text in HTML, SGML, and XML.

A PDF document’s logical structure shall be stored separately from its visible content, with pointers from each to the other. This separation allows the ordering and nesting of logical elements to be entirely independent of the order and location of graphics objects on the document’s pages.

14.7.2 Structure Hierarchy

The logical structure of a document shall be described by a hierarchy of objects called the structure hierarchy or structure tree.

But how do we make use of Logical Structure? That would be "Tagged PDF".

Now, in regard to content export to Excel. Provided the PDF page content is, in fact, sourced from an authoring application that supports inserting a table with designation of table header row(s), along with good Tag management of the PDF output, then we can get some good stuff exported.
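To picture why good table tags matter for export, here is a minimal Python sketch. It assumes a made-up in-memory form of a tagged table: the `S` (structure type) and `K` (kids) keys echo the structure element dictionary entries in ISO 32000-1, but the `text` key, the sample data, and the `iter_rows` and `to_csv` helpers are hypothetical stand-ins. This is not what Acrobat does internally.

```python
import csv
import io

# Hypothetical in-memory stand-in for a tagged table.
# "S" = structure type, "K" = kids; "text" stands in for the cell's content.
table = {"S": "Table", "K": [
    {"S": "THead", "K": [
        {"S": "TR", "K": [{"S": "TH", "text": "Item"},
                          {"S": "TH", "text": "Qty"}]},
    ]},
    {"S": "TBody", "K": [
        {"S": "TR", "K": [{"S": "TD", "text": "Widget"},
                          {"S": "TD", "text": "4"}]},
        {"S": "TR", "K": [{"S": "TD", "text": "Gadget"},
                          {"S": "TD", "text": "7"}]},
    ]},
]}


def iter_rows(elem):
    """Yield one list of cell texts per TR element, in tree order."""
    if elem["S"] == "TR":
        yield [cell["text"] for cell in elem["K"] if cell["S"] in ("TH", "TD")]
    else:
        # Table, THead, TBody, TFoot: keep descending to find the rows.
        for kid in elem.get("K", []):
            yield from iter_rows(kid)


def to_csv(elem):
    """Write the rows as CSV text that a spreadsheet could open."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(iter_rows(elem))
    return buffer.getvalue()


print(to_csv(table))
```

Because every row and cell is named by its tag, nothing has to be guessed from where the text sits on the page.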
A Tagged PDF having a properly tagged Table element and the requisite child elements provides the needed "guidance" as to where the content is to go as it lands in Excel.

If the PDF is not Tagged then Acrobat makes an on-the-fly best guess as to what might be adequate tagging. This is fine for simple PDF document content. However, the "simple" threshold is passed rather quickly. Consequently, the "best guess" can often fail to be even kinda-sort-of-ok.

With that said, what can often provide an acceptable export of a table's content in a PDF into Excel is to select the content and right-click for the context menu. Select Copy As Table, Save As Table, or Open Table in Spreadsheet. You may have to try each in turn to see which provides something adequate for your needs.

Useful Resources:

Listing of Duff Johnson's Articles
http://www.duff-johnson.com/Articles.html

Duff's current Blog at CommonLook
http://www.commonlook.com/blog
Along with this and that, Duff is the Chair of the U.S. Committee for ISO 32000 (PDF) and ISO 14289 (PDF/UA). [ http://www.commonlook.com/duffjohnson#standards ]

AcrobatUsers Accessibility sub-forum (read-only)
http://acrobatusers.com/forum/accessibility/

An ISO approved copy of ISO 32000-1:2008 provided by Adobe
http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf

Once published (perhaps August, 2012) three additional documents will be "must read" for anyone having an interest in or working with Tagged PDF.
--| ISO 14289-1 (PDF/UA)
--| The Implementer's Guide to PDF/UA
--| PDF/UA - WCAG 2.0 Mapping

"WCAG 2.0 for PDF is PDF/UA." 8^)

Well, that's a bushel basket full, eh? Maybe enough to make some dandelion wine.

Be well...