How to read data from pdf document and insert into database?

Community Beginner, Dec 09, 2018

I received a PDF document from a customer. The document is 60 pages long, and I need to read the data from the middle of page 49 through page 58. ColdFusion has the cfpdf tag, which allows reading PDF documents. Here is what I have so far:

<cftry>
    <cfset mypdf = expandPath("./data.pdf")>
    <cfpdf action="read" source="#mypdf#" name="PDFInfo">
    <cfdump var="#PDFInfo#">
    <cfcatch type="any">
        <cfdump var="#cfcatch#">
    </cfcatch>
</cftry>

After the document is dumped to the screen, it shows information like this:

Author                  [empty string]
CenterWindowOnScreen    no
ChangingDocument        Allowed
Commenting              Allowed
ContentExtraction       Allowed
CopyContent             Allowed
PageSizes               PDFDocument array
    1   PDFDocument - struct   height 792   width 612
    2   PDFDocument - struct   height 792   width 612
    3   PDFDocument - struct   height 792   width 612
    4   PDFDocument - struct   height 792   width 612

I have never used cfpdf before, so this is new to me. I searched the web but couldn't find an example of how to get the data out of a PDF document. Is there a good way to get the data from specific pages in the file? I also assume there has to be a loop that allows access to the individual rows of data. I have done something similar with .csv and .xls files. If anyone has a good example or resource for this problem, please let me know. Thanks

LEGEND, Dec 10, 2018

According to the Adobe help page for cfpdf, you can extract text from the PDF:

Extract text
<cfpdf
required
    action = "extracttext" <!---extract all the words in the PDF--->
    source = "absolute or relative path of the PDF file|PDF document variable|cfdocument variable"
    pages = "*" <!---page numbers from where the text needs to be extracted from the PDF document--->
optional
    addquads = "add the position or quadrants for the text in the PDF"
    honourspaces = "true|false"
    overwrite = "true" <!---overwrite the specified object in the PDF document--->
    password = "" <!---PDF document password--->
    type = "string|xml" <!---format in which the text needs to be extracted--->
one of the following:
    destination = "PDF output file pathname"
    name = "PDF document variable"
    usestructure = "true|false" />

Extract image
<cfpdf
required
    action = "extractimage" <!---extract images and save them to a directory--->
    source = "absolute or relative path of the PDF file|PDF document variable|cfdocument variable"
    pages = "*" <!---page numbers from where the images need to be extracted--->
optional
    overwrite = "true|false" <!---overwrite any existing image when set to true--->
    format = "png|tiff|jpg" <!---format in which the images should be extracted--->
    imageprefix = "*" <!---the string that you want to prefix with the image name--->
    password = "" <!---PDF document password--->
    destination = "PDF output file pathname" />

Note the pages attribute, which lets you specify which pages to extract from.
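Applied to the file from the question, a minimal sketch might look like the one below. It assumes the file is still data.pdf next to the template and that the range 49-58 covers the data. The variable name pdfText is only illustrative, and honourspaces is included on the assumption that it helps preserve column spacing, which is worth checking against your own document.

```cfm
<!--- Sketch: pull the text of pages 49-58 into a variable instead of a file --->
<cfset mypdf = expandPath("./data.pdf")>

<cfpdf
    action = "extracttext"
    source = "#mypdf#"
    pages = "49-58"
    type = "string"
    honourspaces = "true"
    name = "pdfText">

<!--- Inspect the result before deciding how to parse it --->
<cfdump var="#pdfText#">
```

Because page 49 is only needed from the middle onward, whatever comes before the first wanted row still has to be skipped when the text is parsed.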

HTH,

^ _ ^

Community Expert, Dec 10, 2018

You could just delete pages 1 to 48 and 59 to 60. The following code does just that.

<cfpdf
    action = "deletepages"
    pages = "1-48,59-60"
    source = "#expandPath('./data.pdf')#"
    overwrite = "yes"
    destination = "#expandPath('./data_pages49to58.pdf')#">

Done.

Community Beginner, Dec 11, 2018

I'm wondering what the next step is after deleting the pages. In his answer, WolfShade mentioned action="extracttext". In that case, the only thing I get on the screen is text without any delimiters; all the data is in one row. How do I read the data from the PDF row by row?

LEGEND, Dec 11, 2018

You didn't indicate in your original post that the data was in any particular format.

What extracttext does is give you all the text content of the PDF in a single string. It's up to you to determine the best way to parse that value.
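As a rough sketch of one way to do that, the code below splits the string into rows and columns and inserts each row into a table. Everything specific here is an assumption, not something taken from your document: that pdfText holds the extracted string, that it still contains line breaks, that columns are separated by two or more whitespace characters, that each data row has exactly three fields, and that the datasource myDsn and the table pdf_import with columns col_a, col_b and col_c exist.

```cfm
<!--- Assumption: pdfText holds the string returned by action="extracttext" --->
<!--- Split on CR/LF; each character in the delimiter string acts as a delimiter, and empty items are skipped --->
<cfset rows = listToArray(pdfText, chr(13) & chr(10))>

<cfloop array="#rows#" index="row">
    <!--- Treat a run of two or more whitespace characters as a column gap --->
    <cfset fields = listToArray(reReplace(trim(row), "\s{2,}", "|", "all"), "|")>

    <!--- Hypothetical layout: three columns per data row; headings and other lines are skipped --->
    <cfif arrayLen(fields) EQ 3>
        <cfquery datasource="myDsn">
            INSERT INTO pdf_import (col_a, col_b, col_c)
            VALUES (
                <cfqueryparam value="#fields[1]#" cfsqltype="cf_sql_varchar">,
                <cfqueryparam value="#fields[2]#" cfsqltype="cf_sql_varchar">,
                <cfqueryparam value="#fields[3]#" cfsqltype="cf_sql_varchar">
            )
        </cfquery>
    </cfif>
</cfloop>
```

The pipe is used as a temporary delimiter, so this breaks if a field itself contains a pipe character. If the extracted string really has no line breaks at all, the rows would have to be recognised some other way, for example by a pattern that marks the start of each record.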

V/r,

^ _ ^

Community Expert, Dec 12, 2018

Like WolfShade, I also thought you wouldn't mind saving the data as a PDF. In any case, his suggestion, action="extracttext", is as good as cfpdf gets.

Alternatively, you might want to try a tool of yesteryear, Raymond Camden's PDFUtils. Extract the ZIP, and move the pdfutils directory to your workspace.

Copy the file you pruned earlier, data_pages49to58.pdf, to that directory. Then, within the pdfutils directory, create and run a CFM file containing the following code:

<cfset pdf = createObject("component", "pdfutils")>
<cfset mypdf = expandPath("./data_pages49to58.pdf")>
<cfset pdfText = pdf.getText(mypdf)>
<cfdump var="#pdfText#">

You should then be able to figure out how to fetch the text.
