How to read data from pdf document and insert into database?

Report · Dec 09, 2018

I got a PDF document from a customer. The document is 60 pages long. I need to read the data from the middle of page 49 through page 58. ColdFusion has the cfpdf tag, which allows reading PDF documents. Here is an example of what I have so far:

<cftry> 
     <cfset mypdf = expandPath("./data.pdf")> 
     <cfpdf action="read" source="#mypdf#" name="PDFInfo">  
     <cfdump var="#PDFInfo#">  
     
     <cfcatch type="any"> 
          <cfdump var="#cfcatch#"> 
     </cfcatch> 
</cftry>

After the document is dumped on the screen, I see information like:

Author  [empty string] 
CenterWindowOnScreen    no 
ChangingDocument    Allowed 
Commenting  Allowed 
ContentExtraction   Allowed 
CopyContent     Allowed 
PageSizes 
PDFDocumentarray 
1 PDFDocument - struct height  792 width   612 
2 PDFDocument - struct height  792 width   612 
3 PDFDocument - struct height  792 width   612 
4 PDFDocument - struct height  792 width   612

I have never used cfpdf before, so this is new to me. I searched the web but couldn't find an example of how to get the data out of a PDF document. Is there a good way to get the data from specific pages in the file? I also assume there has to be a loop that allows access to the individual rows of data. I have done something similar with .csv and .xls files. If anyone has a good example or resource for this problem, please let me know. Thanks

Report · Dec 10, 2018

According to the Adobe help page for cfpdf, you can extract text from the PDF:

Extract text

<cfpdf
     required
     action = "extracttext"
     source = "absolute or relative path of the PDF file|PDF document variable|cfdocument variable"
     pages = "*"
     optional
     addquads = "add the position or quadrants for the text in the PDF"
     honourspaces = "true|false"
     overwrite = "true"
     password = ""
     type = "string|xml"
     one of the following:
     destination = "PDF output file pathname"
     name = "PDF document variable"
     usestructure = "true|false"

Extract image

<cfpdf
     required
     action = "extractimage"
     source = "absolute or relative path of the PDF file|PDF document variable|cfdocument variable"
     pages = "*"
     optional
     overwrite = "true|false"
     format = "png|tiff|jpg"
     imageprefix = "*" <!---the string that you want to prefix with the image name--->
     password = ""
     destination = "PDF output file pathname" />

Note the pages attribute, which lets you specify which pages to extract from.
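As a rough sketch of how that might look for the page range in the question (assuming the file is ./data.pdf as in the original post, and that type="string" returns plain text in the variable named by name; pdfText is just an illustrative variable name):

```cfml
<cftry>
     <!--- Same file as in the original post --->
     <cfset mypdf = expandPath("./data.pdf")>

     <!--- Extract only pages 49 through 58 into the variable pdfText (hypothetical name) --->
     <cfpdf action="extracttext" source="#mypdf#" pages="49-58" type="string" name="pdfText">

     <!--- pre keeps the browser from collapsing any line breaks in the extracted text --->
     <cfoutput><pre>#encodeForHTML(pdfText)#</pre></cfoutput>

     <cfcatch type="any">
          <cfdump var="#cfcatch#">
     </cfcatch>
</cftry>
```

Since the data starts in the middle of page 49, the text from the top half of that page would still need to be skipped by hand after extraction.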

HTH,

^ _ ^

Report · Dec 10, 2018

You could just delete pages 1 to 48 and 59 to 60. The following code does just that.

<cfpdf
     action = "deletepages"
     pages = "1-48,59-60"
     source = "#expandPath('./data.pdf')#"
     overwrite = "yes"
     destination = "#expandPath('./data_pages49to58.pdf')#">

Done.

Report · Dec 11, 2018

I'm wondering what the next step is after deleting the pages. In his answer, WolfShade mentioned action="extracttext". With that, the only thing I get on the screen is text without any delimiters; all the data is in one row. I'm not sure how to read the data from the PDF row by row.

Report · Dec 11, 2018

You didn't indicate in your original post that the data was in any particular format.

What extracttext does is give you all the text content of the PDF in a single string. It's up to you to determine the best way to parse that value.
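One possible approach, sketched under several assumptions: that pdfText already holds the extracted string, that the string contains line breaks (dumping it straight into HTML will hide them, since browsers collapse whitespace), and that the fields on each line are separated by spaces or tabs. The datasource myDSN, the table pdf_rows and the columns col_a and col_b are made-up names for illustration only.

```cfml
<!--- Assumes pdfText was filled earlier by cfpdf action="extracttext" --->
<cfset rows = []>

<!--- Split the text into lines; CR and LF are both treated as delimiters,
      and list loops skip empty elements --->
<cfloop list="#pdfText#" delimiters="#chr(13)##chr(10)#" index="line">
     <cfset line = trim(line)>
     <cfif len(line)>
          <!--- Split each line on spaces and tabs (assumed field separators) --->
          <cfset arrayAppend(rows, listToArray(line, " " & chr(9)))>
     </cfif>
</cfloop>

<!--- Insert the first two fields of each row; all names here are hypothetical --->
<cfloop array="#rows#" index="fields">
     <cfif arrayLen(fields) GTE 2>
          <cfquery datasource="myDSN">
               INSERT INTO pdf_rows (col_a, col_b)
               VALUES (
                    <cfqueryparam value="#fields[1]#" cfsqltype="cf_sql_varchar">,
                    <cfqueryparam value="#fields[2]#" cfsqltype="cf_sql_varchar">
               )
          </cfquery>
     </cfif>
</cfloop>
```

If a field can itself contain spaces, a plain whitespace split will not be enough, and the parsing rule would have to be worked out from the actual layout of the document.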

V/r,

^ _ ^

Report · Dec 12, 2018

Like WolfShade, I also thought you wouldn't mind saving the data as PDF. In any case, his suggestion, action="extracttext", is as good as cfpdf gets.

Alternatively, you might want to try a tool of yesteryear, Raymond Camden's PDFUtils. Extract the ZIP, and move the pdfutils directory to your workspace.

Copy the file you pruned earlier, data_pages49to58.pdf, to the directory. Then create and run, within the pdfutils directory, a CFM containing the code

You should then be able to figure out how to fetch the text.