I received a PDF document from a customer. The document is 60 pages long, and I need to read the data from the middle of page 49 through page 58. ColdFusion has the cfpdf
tag, which can read PDF documents. Here is what I have so far:
<cftry>
<cfset mypdf = expandPath("./data.pdf")>
<cfpdf action="read" source="#mypdf#" name="PDFInfo">
<cfdump var="#PDFInfo#">
<cfcatch type="any">
<cfdump var="#cfcatch#">
</cfcatch>
</cftry>
After the document is dumped to the screen, I see information like this:
Author [empty string]
CenterWindowOnScreen no
ChangingDocument Allowed
Commenting Allowed
ContentExtraction Allowed
CopyContent Allowed
PageSizes
PDFDocumentarray
1 PDFDocument - struct height 792 width 612
2 PDFDocument - struct height 792 width 612
3 PDFDocument - struct height 792 width 612
4 PDFDocument - struct height 792 width 612
I have never used cfpdf before, so this is new to me. I searched the web but couldn't find an example of how to get the data out of a PDF document. Is there a good way to get the data from specific pages in the file? I also assume there has to be a loop that gives access to the data in each individual row. I have done something similar with .csv and .xls files. If anyone has a good example or resource for this problem, please let me know. Thanks
According to the Adobe help page for cfpdf, you can extract text from the PDF:
Extract text
<cfpdf
    ----required
    action = "extracttext" <!---extract all the words in the PDF.--->
    source = "absolute or relative path of the PDF file|PDF document variable|cfdocument variable"
    pages = "*" <!---page numbers from where the text needs to be extracted from the PDF document--->
    ----optional
    addquads = "add the position or quadrants for the text in the PDF"
    honourspaces = "true|false"
    overwrite = "true" <!---Overwrite the specified object in the PDF document--->
    password = "" <!--- PDF document password--->
    type = "string|xml" <!---format in which the text needs to be extracted--->
    one of the following:
    destination = "PDF output file pathname"
    name = "PDF document variable"
    usestructure = "true|false"
Extract image
<cfpdf
    required
    action = "extractimage" <!---extract images and save it to a directory--->
    source = "absolute or relative path of the PDF file|PDF document variable|cfdocument variable"
    pages = "*" <!---page numbers from where the images need to be extracted--->
    optional
    overwrite = "true|false" <!---overwrite any existing image when set to true--->
    format = "png|tiff|jpg" <!---format in which the images should be extracted--->
    imageprefix = "*" <!---the string that you want to prefix with the image name--->
    password = "" <!--- PDF document password--->
    destination = "PDF output file pathname" />
Note that the pages attribute lets you specify which pages to extract from.
HTH,
^ _ ^
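For illustration, here is a minimal sketch of limiting the extraction with the pages attribute. It assumes the file is ./data.pdf as in the original question; the variable name pageText is only illustrative. Note that the page range cannot start in the middle of page 49, so the first part of that page's text would still have to be trimmed afterwards.

```cfml
<!--- Sketch only: extract the text of pages 49-58 into a variable. --->
<!--- Assumptions: ./data.pdf is the file from the question; pageText is an illustrative name. --->
<cfpdf
    action = "extracttext"
    source = "#expandPath('./data.pdf')#"
    pages = "49-58"
    type = "string"
    name = "pageText">

<!--- Inspect what came back before deciding how to parse it. --->
<cfdump var="#pageText#">
```

With type="string" the result is plain text; type="xml" returns the text wrapped in XML instead.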
You could just delete pages 1 to 48 and 59 to 60. The following code does just that.
<cfpdf
    action = "deletepages"
    pages = "1-48,59-60"
    source = "#expandPath('./data.pdf')#"
    overwrite = "yes"
    destination = "#expandPath('./data_pages49to58.pdf')#">
Done.
I'm wondering what the next step is after deleting the pages. In his answer, WolfShade mentioned action="extracttext". When I use that, all I get on the screen is text without any delimiters; all the data is in one row. How do I read the data from the PDF row by row?
You didn't indicate in your original post that the data was in any particular format.
What extracttext does is give you all the text content of the PDF in a single string. It's up to you to determine the best way to aggregate the value.
V/r,
^ _ ^
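As a rough sketch of one way to break that single string into rows: the code below assumes the extracted text contains line breaks between rows, which depends on how the PDF was produced and is not guaranteed. A browser collapses line breaks when a string is rendered as HTML, so the text can look like a single row on screen even when it is not. The names rawText, rows and row are illustrative, and encodeForHTML requires ColdFusion 10 or later.

```cfml
<!--- Sketch only. Assumptions: ./data.pdf is the file from the question; --->
<!--- rawText, rows and row are illustrative names. --->
<cfpdf
    action = "extracttext"
    source = "#expandPath('./data.pdf')#"
    pages = "49-58"
    type = "string"
    honourspaces = "true"
    name = "rawText">

<!--- Split on carriage return and line feed; listToArray skips empty elements by default. --->
<cfset rows = listToArray(rawText, chr(13) & chr(10))>

<cfoutput>
<cfloop array="#rows#" index="row">
    <!--- Each element is one line of extracted text; split it further as the layout requires. --->
    #encodeForHTML(trim(row))#<br>
</cfloop>
</cfoutput>
```

If rows ends up with a single element, the text really has no line breaks, and the columns would have to be recognized by some other pattern in the data.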
Like WolfShade, I also thought you wouldn't mind saving the data as a PDF. In any case, his suggestion, action="extracttext", is as good as cfpdf gets.
Alternatively, you might want to try a tool of yesteryear, Raymond Camden's PDFUtils. Extract the ZIP, and move the pdfutils directory to your workspace.
Copy the file you pruned earlier, data_pages49to58.pdf, to that directory. Then, within the pdfutils directory, create and run a CFM page containing the following code:
<cfset pdf = createObject("component", "pdfutils")>
<cfset mypdf = expandPath("./data_pages49to58.pdf")>
<cfset pdfText = pdf.getText(mypdf)>
<cfdump var="#pdfText#" >
You should then be able to figure out how to fetch the text.