How to read data from pdf document and insert into database?

Community Beginner, Dec 09, 2018

I received a PDF document from a customer. The document is 60 pages long, and I need to read the data from the middle of page 49 through page 58. ColdFusion has the cfpdf tag, which can read PDF documents. Here is what I have so far:

<cftry>
    <cfset mypdf = expandPath("./data.pdf")>
    <cfpdf action="read" source="#mypdf#" name="PDFInfo">
    <cfdump var="#PDFInfo#">

    <cfcatch type="any">
        <cfdump var="#cfcatch#">
    </cfcatch>
</cftry>

After the document is dumped to the screen, I see information like this:

Author                  [empty string]
CenterWindowOnScreen    no
ChangingDocument        Allowed
Commenting              Allowed
ContentExtraction       Allowed
CopyContent             Allowed
PageSizes               PDFDocument - array
    1   PDFDocument - struct   height 792   width 612
    2   PDFDocument - struct   height 792   width 612
    3   PDFDocument - struct   height 792   width 612
    4   PDFDocument - struct   height 792   width 612

I have never used cfpdf before, so this is new to me. I searched the web but couldn't find an example of how to get the data out of a PDF document. Is there a good way to get the data from specific pages in the file? I also assume there has to be a loop that gives access to the data row by row. I have done something similar with .csv and .xls files. If anyone has a good example or resource for this problem, please let me know. Thanks.

Community guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more
LEGEND, Dec 10, 2018

According to the Adobe help page for cfpdf, you can extract text from the PDF:

Extract text

<cfpdf
    required
    action = "extracttext" <!--- extract all the words in the PDF --->
    source = "absolute or relative path of the PDF file|PDF document variable|cfdocument variable"
    pages = "*" <!--- page numbers from where the text needs to be extracted from the PDF document --->
    optional
    addquads = "add the position or quadrants for the text in the PDF"
    honourspaces = "true|false"
    overwrite = "true" <!--- overwrite the specified object in the PDF document --->
    password = "" <!--- PDF document password --->
    type = "string|xml" <!--- format in which the text needs to be extracted --->
    one of the following:
    destination = "PDF output file pathname"
    name = "PDF document variable"
    usestructure = "true|false" />

Extract image

<cfpdf
    required
    action = "extractimage" <!--- extract images and save them to a directory --->
    source = "absolute or relative path of the PDF file|PDF document variable|cfdocument variable"
    pages = "*" <!--- page numbers from where the images need to be extracted --->
    optional
    overwrite = "true|false" <!--- overwrite any existing image when set to true --->
    format = "png|tiff|jpg" <!--- format in which the images should be extracted --->
    imageprefix = "*" <!--- the string that you want to prefix with the image name --->
    password = "" <!--- PDF document password --->
    destination = "PDF output file pathname" />

Note that the pages attribute lets you specify which pages to extract from.
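
For the page range in the question, a minimal sketch might look like the following. It assumes that type="string" places the text in the variable given by name as a plain string, and that honourspaces="true" keeps the spacing between words. The variable name pdfText is made up for illustration, and the snippet is untested against the actual file.

```cfml
<!--- Sketch only: extract the text of pages 49-58 into a variable. --->
<!--- data.pdf is the file name from the question; pdfText is a hypothetical variable name. --->
<cfset mypdf = expandPath("./data.pdf")>

<cfpdf action="extracttext"
       source="#mypdf#"
       pages="49-58"
       type="string"
       honourspaces="true"
       name="pdfText">

<!--- Inspect what came back before trying to parse it. --->
<cfdump var="#pdfText#">
```

Since the data starts in the middle of page 49, the extracted text will presumably include the first half of that page as well, so whatever precedes the first wanted row still has to be discarded afterwards.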

HTH,

^ _ ^

Community Expert, Dec 10, 2018

You could just delete pages 1 to 48 and 59 to 60. The following code does just that.

<cfpdf
    action = "deletepages"
    pages = "1-48,59-60"
    source = "#expandPath('./data.pdf')#"
    overwrite = "yes"
    destination = "#expandPath('./data_pages49to58.pdf')#">

Done.

Community Beginner, Dec 11, 2018

I'm wondering what the next step is after deleting the pages. In his answer, WolfShade mentioned action="extracttext". With that, the only thing I get on the screen is text without any delimiters; all the data is in one row. How do I read the data from the PDF row by row?

LEGEND, Dec 11, 2018

You didn't indicate in your original post that the data was in any particular format.

What extracttext does is give you all the text content of the PDF in a single string. It's up to you to determine the best way to break that string up into the values you need.
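
One possible way to split such a string into rows and fields is sketched below, under some loud assumptions: that each table row ends with a line break in the extracted text, and that columns are separated by runs of two or more spaces. Neither is guaranteed for a given PDF. The sample value assigned to pdfText is invented stand-in data, not anything from the real document.

```cfml
<cfscript>
// Stand-in for the extracted text (hypothetical sample data).
// In practice pdfText would come from cfpdf action="extracttext".
pdfText = "1001  Smith  42.50" & chr(10) & "1002  Jones  17.00";

rows = [];

// Split on CR and LF; listToArray treats each character as a delimiter
// and skips empty items by default, so blank lines drop out.
for (line in listToArray(pdfText, chr(13) & chr(10))) {
    line = trim(line);
    if (!len(line)) continue;

    // Assumption: two or more spaces mark a column boundary.
    // Collapse each run to a single tab, then split on tabs.
    fields = reReplace(line, " {2,}", chr(9), "all");
    arrayAppend(rows, listToArray(fields, chr(9)));
}

// rows is now an array of arrays, one inner array per line of text.
writeDump(rows);
</cfscript>
```

With the stand-in data, rows should come out as two arrays of three fields each. If the extracted text really contains no line breaks at all, this split will not help, and a regular expression keyed to whatever starts each row (a leading ID, for example) would be needed instead.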

V/r,

^ _ ^

Community Expert, Dec 12, 2018

Like WolfShade, I also thought you wouldn't mind saving the data as a PDF. In any case, his suggestion, action="extracttext", is as good as cfpdf gets.

Alternatively, you might want to try a tool of yesteryear, Raymond Camden's PDFUtils. Extract the ZIP, and move the pdfutils directory to your workspace.

Copy the file you pruned earlier, data_pages49to58.pdf, to that directory. Then, within the pdfutils directory, create and run a CFM file containing the following code:

<cfset pdf = createObject("component", "pdfutils")>
<cfset mypdf = expandPath("./data_pages49to58.pdf")>
<cfset pdfText = pdf.getText(mypdf)>
<cfdump var="#pdfText#">

You should then be able to figure out how to fetch the text.
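
Once the text is in hand, the database step could look roughly like the sketch below. Everything database-related here is hypothetical: the datasource myDSN, the table pdf_rows, and the columns item_id, item_name and amount are placeholders. It also assumes pdfText is a plain string with one row per line and columns separated by two or more spaces. If the dump above shows a structure or array instead, the string has to be pulled out of it first.

```cfml
<!--- Sketch only. Hypothetical names: datasource myDSN, table pdf_rows, --->
<!--- columns item_id, item_name, amount. Assumes pdfText is a plain string. --->
<cfloop list="#pdfText#" index="line" delimiters="#chr(13)##chr(10)#">

    <!--- Assumption: runs of two or more spaces separate the columns. --->
    <cfset fields = listToArray(reReplace(trim(line), " {2,}", chr(9), "all"), chr(9))>

    <!--- Skip headings and any other line that does not look like a data row. --->
    <cfif arrayLen(fields) EQ 3 AND isNumeric(fields[1]) AND isNumeric(fields[3])>
        <cfquery datasource="myDSN">
            INSERT INTO pdf_rows (item_id, item_name, amount)
            VALUES (
                <cfqueryparam value="#fields[1]#" cfsqltype="cf_sql_integer">,
                <cfqueryparam value="#fields[2]#" cfsqltype="cf_sql_varchar">,
                <cfqueryparam value="#fields[3]#" cfsqltype="cf_sql_decimal" scale="2">
            )
        </cfquery>
    </cfif>
</cfloop>
```

Using cfqueryparam keeps text from the PDF from being interpreted as SQL, and the isNumeric checks avoid type errors on lines that are not data rows. The column count and types would need to be adjusted to the real layout of the table in the document.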
