Converting PDF to Excel and Retaining Formatting

Report · May 23, 2012

I am trying to convert a PDF to an Excel file.

I just launched the free test version of Adobe Acrobat X Pro (version 10.1.3) and opened a newly created PDF.

I clicked on the following: File > Save As > Spreadsheet > XML Spreadsheet 2003 (have Office 2003).

When I opened the new XML file, the main headings appear horizontally across the spreadsheet. All of the dollar amounts appear vertically in Column A. We want the dollar amounts to appear under the appropriate headings. How do I retain the formatting from the PDF?

Report · May 24, 2012

Hi,

Start with a Tagged PDF. Export/Save As to spreadsheet works better with that.

Be well...

Report · May 24, 2012

Thanks for responding Dave. What is a Tagged PDF?

Ginger

GINGER GREENBERG

Database Manager

T 503.699.6258

TF 800.634.9982 ext. 6258

MARYLHURST UNIVERSITY

You. Unlimited.

17600 PACIFIC HIGHWAY

MARYLHURST, OR 97036-0261

marylhurst.edu

Report · May 24, 2012

Tagged PDF.

From ISO 32000-1 (the ISO Standard for PDF).
14.8 Tagged PDF

14.8.1 General
Tagged PDF (PDF 1.4) is a stylized use of PDF that builds on the logical structure framework described in 14.7, “Logical Structure.”

It defines a set of standard structure types and attributes that allow page content (text, graphics, and images) to be extracted and reused for other purposes.
A tagged PDF document is one that conforms to the rules described in this sub-clause. A conforming writer is not required to produce tagged PDF documents; however, if it does, it shall conform to these rules.

NOTE 1
It is intended for use by tools that perform the following types of operations:
• Simple extraction of text and graphics for pasting into other applications

• Automatic reflow of text and associated graphics to fit a page of a different size than was assumed for the original layout

• Processing text for such purposes as searching, indexing, and spell-checking

• Conversion to other common file formats (such as HTML, XML, and RTF) with document structure and basic styling information preserved

• Making content accessible to users with visual impairments (see 14.9, “Accessibility Support”)

A tagged PDF document shall conform to the following rules:

• Page content (14.8.2, “Tagged PDF and Page Content”). Tagged PDF defines a set of rules for representing text in the page content so that characters, words, and text order can be determined reliably.
All text shall be represented in a form that can be converted to Unicode. Word breaks shall be represented explicitly. Actual content shall be distinguished from artifacts of layout and pagination.
Content shall be given in an order related to its appearance on the page, as determined by the conforming writer.

• A basic layout model (14.8.3, “Basic Layout Model”). A set of rules for describing the arrangement of structure elements on the page.

• Structure types (14.8.4, “Standard Structure Types”). A set of standard structure types define the meaning of structure elements, such as paragraphs, headings, articles, and tables.

• Structure attributes (14.8.5, “Standard Structure Attributes”). Standard structure attributes preserve styling information used by the conforming writer in laying out content on the page.

A Tagged PDF document shall also contain a mark information dictionary (see Table 321) with a value of true for the Marked entry.

NOTE 2
The types and attributes defined for Tagged PDF are intended to provide a set of standard fallback roles and minimum guaranteed attributes to enable conforming readers to perform operations such as those mentioned previously. Conforming writers are free to define additional structure types as long as they also provide a role mapping to the nearest equivalent standard types, as described in 14.7.3, “Structure Types.” Likewise, conforming writers can define additional structure attributes using any of the available extension mechanisms.

Section 14 of ISO 32000-1 expands on each of the rules to provide a detailed discussion.
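
As a rough illustration of the Marked rule quoted above, the check below treats the document catalog as a plain Python dict. The dict layout and the function name looks_tagged are assumptions made for this sketch; a real PDF library would hand you its own object types. The /MarkInfo, /Marked, and /StructTreeRoot key names are the ones ISO 32000-1 uses.

```python
def looks_tagged(catalog: dict) -> bool:
    """Return True if a catalog-like dict carries the Tagged PDF markers."""
    # The mark information dictionary must have Marked set to true.
    mark_info = catalog.get("/MarkInfo") or {}
    has_marked_flag = mark_info.get("/Marked") is True
    # Logical structure lives under the structure tree root.
    has_structure_tree = "/StructTreeRoot" in catalog
    return has_marked_flag and has_structure_tree


# Hypothetical catalogs, written out by hand for the example.
tagged = {
    "/MarkInfo": {"/Marked": True},
    "/StructTreeRoot": {"/Type": "/StructTreeRoot"},
}
untagged = {"/Pages": {}}

print(looks_tagged(tagged))
print(looks_tagged(untagged))
```

Passing a check like this only says the markers are present. It says nothing about whether the tags themselves are any good, which is the part that matters for export.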

Something of an overview.

PDF page content is painted to the page.
An Adobe document ("AcrobatWorkshop_final.pdf") provides useful background.
The content is often not placed in the PDF in a natural read order.
Body text may be painted/drawn first. Then the Header followed by the Footer.
Body text is often not painted/drawn in the human expected order.
A nicely detailed discussion of this is available here:
http://www.appligent.com/talkingpdf-eachpdfpageisapainting
(Each PDF Page is a Painting - Why PDF "reading order" is irrelevant to accessibility)

So, we have content painted to the PDF page. As-is, that is no help for repurposing content to another file format or for accessibility.
This is where Logical Structure (Section 14.7, ISO 32000-1) comes into play.

14.7.1 General
PDF’s logical structure facilities (PDF 1.3) shall provide a mechanism for incorporating structural information about a document’s content into a PDF file.
Such information may include the organization of the document into chapters and sections or the identification of special elements such as figures, tables, and footnotes.
The logical structure facilities shall be extensible, allowing conforming writers to choose what structural information to include and how to represent it, while enabling conforming readers to navigate a file without knowing the producer’s structural conventions.

PDF logical structure shares basic features with standard document markup languages such as HTML, SGML, and XML.
A document’s logical structure shall be expressed as a hierarchy of structure elements, each represented by a dictionary object.
Like their counterparts in other markup languages, PDF structure elements may have content and attributes.
In PDF, rendered document content takes over the role occupied by text in HTML, SGML, and XML.

A PDF document’s logical structure shall be stored separately from its visible content, with pointers from each to the other.
This separation allows the ordering and nesting of logical elements to be entirely independent of the order and location of graphics objects on the document’s pages.

14.7.2 Structure Hierarchy
The logical structure of a document shall be described by a hierarchy of objects called the structure hierarchy or structure tree.

But how do you make use of Logical Structure?
That would be "Tagged PDF".
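
A toy model may help show why a structure tree is useful for export. The sketch below is not how any PDF library represents structure elements; the StructElem class, the table_rows function, and the sample values are all made up for illustration. Only the tag names Table, TR, TH, and TD come from the standard structure types. It also assumes TR elements sit directly under Table, ignoring optional groupings such as THead and TBody.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class StructElem:
    """A toy structure element: a tag name, optional text, child elements."""
    tag: str
    text: str = ""
    kids: list[StructElem] = field(default_factory=list)


def table_rows(table: StructElem) -> list[list[str]]:
    """Collect cell text row by row from a Table element's TR children."""
    rows = []
    for row in table.kids:
        if row.tag != "TR":
            continue  # skip anything that is not a table row
        rows.append([cell.text for cell in row.kids if cell.tag in ("TH", "TD")])
    return rows


# Invented sample data: one header row and one data row.
table = StructElem("Table", kids=[
    StructElem("TR", kids=[StructElem("TH", "Fund"), StructElem("TH", "Amount")]),
    StructElem("TR", kids=[StructElem("TD", "General"), StructElem("TD", "$1,200.00")]),
])

print(table_rows(table))  # [['Fund', 'Amount'], ['General', '$1,200.00']]
```

The rows come out in logical order no matter what order the content was painted on the page, because the tree, not the page, decides which cell belongs to which row.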

Now, in regard to exporting content to Excel.
Provided the PDF page content comes from an authoring application that supports inserting a table with designated header row(s), along with good tag management of the PDF output, we can get good results from the export.
A Tagged PDF with a properly tagged Table element and the requisite child elements provides the needed "guidance" about where the content should go as it lands in Excel.
If the PDF is not tagged, Acrobat makes an on-the-fly best guess at what adequate tagging might be.
This is fine for simple PDF document content. However, the "simple" threshold is passed rather quickly, and the best guess can then fail to be even roughly acceptable.
With that said, what often gives an acceptable export of a table's content from a PDF into Excel is to select the content and right-click for the context menu, then select Copy As Table, Save As Table, or Open Table in Spreadsheet.
You may have to try each in turn to see which provides something adequate for your needs.
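
To give a feel for what a position-based best guess involves when there are no tags, here is a minimal sketch. It is not Acrobat's algorithm, which is not described in this thread; the guess_rows function, the tolerance value, and the sample coordinates are assumptions for illustration. It groups text fragments into rows by similar vertical position and then orders each row left to right.

```python
def guess_rows(fragments, row_tolerance=2.0):
    """Group (x, y, text) fragments into rows by similar y, then sort by x."""
    rows = []
    # PDF y coordinates grow upward, so sort descending to read top to bottom.
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1]["y"] - y) <= row_tolerance:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    return [[text for _, text in sorted(r["cells"])] for r in rows]


# Invented fragments, listed in paint order rather than reading order.
fragments = [
    (200.0, 700.0, "$1,200.00"),
    (72.0, 720.0, "Fund"),
    (72.0, 700.5, "General"),
    (200.0, 720.0, "Amount"),
]

print(guess_rows(fragments))  # [['Fund', 'Amount'], ['General', '$1,200.00']]
```

A heuristic like this has no idea which column a value belongs to. If a cell is empty, nothing is painted there, and the remaining values in that row simply shift left. That is the sort of failure a properly tagged Table element avoids.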

Useful Resources:
Listing of Duff Johnson's Articles
http://www.duff-johnson.com/Articles.html
Duff's current Blog at CommonLook — http://www.commonlook.com/blog
Along with this and that Duff is the Chair of the U.S. Committee for ISO 32000 (PDF) and ISO 14289 (PDF/UA).
[ http://www.commonlook.com/duffjohnson#standards ]

AcrobatUsers Accessibility sub-forum (read-only).
http://acrobatusers.com/forum/accessibility/

An ISO approved copy of ISO 32000-1:2008 provided by Adobe.
http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf

Once published (perhaps August, 2012) three additional documents will be "must read" for anyone having an interest in or working with Tagged PDF.
--| ISO 14289-1 (PDF/UA)
--| The Implementer's Guide to PDF/UA
--| PDF/UA — WCAG 2.0 Mapping
"WCAG 2.0 for PDF is PDF/UA." 8^)

Well, that's a bushel basket full, eh?
Maybe enough to make some dandelion wine.

Be well...

Report · May 24, 2012

Thank you Dave for the detailed information. This process is overwhelming.

I was not successful at doing the following:

select the content and right click for the context menu. Select Copy As Table, Save As Table, or Open Table in Spreadsheet.

You may have to try each in turn to see which provides something adequate for your needs.

It sounds as if I will not be able to easily convert a PDF to an Excel file using Adobe Acrobat X Pro as advertised. Is this true?

Thanks,

Ginger

Report · May 24, 2012

It sounds as if I will not be able to easily convert a PDF to an Excel file using Adobe Acrobat X Pro as advertised. Is this true?

Not true.

The issue is not Acrobat.

Rather it is the PDF (how source content was mastered and the PDF created).

Use an authoring application having good tag management.

Examples: FrameMaker, InDesign, MS Word (with PDFMaker from Acrobat), MS Word 2010 (using MS Save As PDF-XPS, accessible PDF & making use of the UI that promotes authoring for accessible PDF).

None are 100% (yet) for all aspects of an output of a well-formed Tagged PDF. However, they are (currently) "best-of-breed".

In the authoring file you'd master a proper table. Table header row(s) must be properly identified.

The Tagged output PDF's Table element must be properly post-processed with Acrobat (header row cells' Span attribute set, Scope attribute checked, and, perhaps, Headers attribute with associated ID set).

With proper content mastering (of the table in particular), tag management, and post-processing the Table element you'd have a properly "tagged" PDF.

Properly tagged, the table content in the Tagged PDF can be exported to Excel with rather nice results.

Properly mastered in the authoring file, a table that is part of an untagged PDF can still be exported to Excel with fairly good results most of the time.

It is all in how content is mastered:

Example (an extreme but it conveys the point):

Use of space bar and Tab can yield the appearance of a "table" in Word or Notepad.

As with so much, the "perception" is not the reality.

Such content is merely "body text" tricked out to look like a table.

In the PDF such content is "body text" and has no correlation to tabular data.
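
To make the point concrete, the sketch below takes a space-aligned "table" and splits each line on runs of two or more spaces, which is roughly the best a program can do with text that only looks tabular. The sample text and the split_on_gaps helper are invented for this illustration.

```python
import re

# A "table" typed with the space bar: the Budget cell for Reserve is empty.
fake_table = (
    "Fund            Budget      Spent\n"
    "General Fund    1,200.00    300.00\n"
    "Reserve                     50.00\n"
)


def split_on_gaps(line: str) -> list:
    """Split a line wherever two or more whitespace characters occur."""
    return re.split(r"\s{2,}", line.strip())


for line in fake_table.splitlines():
    print(split_on_gaps(line))
```

The last row comes back with only two values, so 50.00 ends up in the second column, under Budget, even though on screen it sits under Spent. Nothing in the text says which column it belongs to.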

There's a spectrum from OMG, yuck to Spot On.

What lands in Excel will reflect where the mastered content falls within this spectrum.

The variables are what was used to create the PDF, what was used for tag management (if any), and how content is placed in the authoring file.

Often the PDFs that are least "supportive" are those created programmatically by a server application.

There, it matters which PDF library is in use (they are not all equal) and how effectively it is being used.

Be well...

Message was edited by: CtDave

Report · May 24, 2012

Hi Dave,

The authoring application is a Sybase product called InfoMaker 11.5.

Would you happen to know if it has good tag management?

Thanks for your help.

Ginger

Report · May 24, 2012

If you are willing to take some time (at least as an experiment), you should be able to copy and paste from Acrobat to a new spreadsheet. The key is to do column copies, holding the Alt key (Windows) while using the text select tool, and then copy and paste. In my experience, the selection picks up every word it touches and copies them into a column, but ignores blank lines. This may be a lot of work if you have to do it regularly, but for an occasional copy it might do the job.
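
If the columns end up pasted one at a time into separate blocks of text, stitching them back together can be scripted. The sketch below assumes each column has been pasted as its own multi-line string; the columns_to_csv function and the sample values are hypothetical, and only Python's standard library is used.

```python
from __future__ import annotations

import csv
import io
from itertools import zip_longest


def columns_to_csv(columns: list[str]) -> str:
    """Turn separately pasted columns (one string each) into CSV text."""
    split_columns = [col.splitlines() for col in columns]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # Pair up the nth line of every column; pad short columns at the end.
    for row in zip_longest(*split_columns, fillvalue=""):
        writer.writerow(row)
    return out.getvalue()


# Invented sample: two columns copied separately.
funds = "Fund\nGeneral\nReserve"
amounts = "Amount\n1,200.00\n50.00"

print(columns_to_csv([funds, amounts]))
```

Because blank lines are dropped during the copy, a column with an empty cell in the middle will come out one line short and its later values will move up a row. The padding here only fills the end of a short column, so results are worth checking against the PDF.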

Report · May 24, 2012

Thank you Bill for the interesting workaround. It works great but will probably be too time consuming for our users. I was hoping to find a simpler process. I will pass your tip on.

Thanks,

Ginger

Report · May 24, 2012

If the formatting information (basically tags) is not included in the PDF, then you can create it. It would be best if your application could create the tags itself. There are some third-party products dedicated to this conversion that supposedly do the job pretty well, but the cost may not fit into your model. You should be able to check PDF Planet or the PDFZone for possible alternatives.

Report · May 25, 2012

Thank you for the information Bill.

Ginger
