Content extraction while file uploading.

New Here ,
Jun 12, 2018

I am having trouble extracting the contents of uploaded PDF and Word files. I am trying to upload a CV, extract its content, and save that content to the database.

LEGEND ,
Jun 12, 2018

Just out of curiosity, why not just save the binary (.doc/.pdf) directly into the database?

V/r,

^ _ ^

LEGEND ,
Jun 12, 2018

I did some looking around.  The only way that I am aware of to extract text from PDF or Word documents is using a Solr collection.  And at that point, it's in a custom format for Solr to use, not something that will be human readable (at least, not very well.)

I created a test Word .docx file and opened it in Notepad.  It looks like binary code, so there are no strings to search/extract.

Now, if you wanted to extract data from an Excel sheet, CF can do that.  Quite well.

I think your best bet is to just store the binary in the database.  But I know nothing of the requirements for your project, so my suggestion could be incorrect.
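
For what it's worth, a .docx only looks like binary in Notepad because it is a ZIP archive; the text sits in an XML entry inside it, normally word/document.xml. Below is a rough sketch of reading that entry with CFZIP, assuming a hypothetical testDoc.docx next to the template. The regex tag stripping is a simplification: it drops tabs, line breaks inside paragraphs, headers, and footnotes.

```cfml
<!--- Hypothetical path; replace with the location of the uploaded file. --->
<cfset docxPath = expandPath("./testDoc.docx")>

<!--- A .docx is a ZIP archive; read the main document part as UTF-8 text. --->
<cfzip action="read" file="#docxPath#" entrypath="word/document.xml" variable="documentXml" charset="utf-8">

<!--- Turn each paragraph end into a newline, then strip the remaining XML tags. --->
<cfset plainText = reReplace(documentXml, "</w:p>", chr(10), "all")>
<cfset plainText = reReplace(plainText, "<[^>]+>", "", "all")>

<!--- The text is still XML-escaped (&amp;, &lt;), so it can be output as HTML as-is. --->
<cfoutput><pre>#plainText#</pre></cfoutput>
```

This does not apply to legacy .doc files, which really are a binary format.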

V/r,

^ _ ^

Adobe Community Professional ,
Jun 12, 2018

There are all kinds of tools to extract text from PDF, like Apache PDFBox. For MS Office formats, there's Apache POI. I assume there are lots of other tools as well. But the big thing about all of this is that PDFs and Office documents are largely unstructured content. In addition, PDFs may not even have text at all, and if they do it might even be less structured than Office documents. Both of the tools I mentioned are Java. I don't know if there are any CF wrappers for them.

Dave Watts, Fig Leaf Software

LEGEND ,
Jun 12, 2018

It's my understanding that CF uses POI for Excel, and there is the CFSPREADSHEET tag for that.  I have not seen any native CF support (yet) for Word.  There is the CFPDF tag, but I've never used it and am not familiar with what it can and cannot do.  However, I'm sure some enterprising individual could create a custom tag that delves into POI for Word.

If I had more time, I'd do it. 

V/r,

^ _ ^

New Here ,
Jun 13, 2018

<cfpdf action="extracttext" source="abc.pdf" name="mypdf"> works once the file has already been uploaded.

Is there a similar method for Word?
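
For the PDF half, a minimal sketch of doing the extraction as part of the upload itself might look like the following. The form field name cvFile, the datasource myDsn, and the cv_uploads table and its columns are all hypothetical, and it is worth dumping the extracted value once to see exactly what your ColdFusion version returns.

```cfml
<!--- Runs on the form's action page; "cvFile" is a hypothetical form field name. --->
<cffile action="upload"
        fileField="cvFile"
        destination="#getTempDirectory()#"
        nameConflict="makeunique"
        accept="application/pdf"
        result="uploadResult">

<!--- Full path of the file as saved on the server. --->
<cfset uploadedPath = uploadResult.serverDirectory & "/" & uploadResult.serverFile>

<!--- type="string" requests text without position data; the result may still
      be wrapped in per-page XML, so inspect it with cfdump before relying on it. --->
<cfpdf action="extracttext" source="#uploadedPath#" type="string" name="cvText">

<!--- "myDsn" and the "cv_uploads" table/columns are hypothetical. --->
<cfquery datasource="myDsn">
    INSERT INTO cv_uploads (file_name, cv_text)
    VALUES (
        <cfqueryparam value="#uploadResult.clientFile#" cfsqltype="cf_sql_varchar">,
        <cfqueryparam value="#toString(cvText)#" cfsqltype="cf_sql_longvarchar">
    )
</cfquery>
```

Scanned PDFs that contain only images will yield little or no text this way.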

LEGEND ,
Jun 13, 2018

Not as far as I know.  But I haven't really had a need to work with Word documents.  I'll look around some more, though.

V/r,

^ _ ^

Adobe Community Professional ,
Jun 14, 2018

Hi althafc39854916,

Assuming you are on a recent version of ColdFusion, then you can use its in-built POI library to process Word .doc and .docx files. Examples:

<!--- DOCX file: open the file as a stream and load the OOXML (XWPF) classes --->
<cfset myFile = createObject("java", "java.io.FileInputStream").init("C:/Users/BKBK/Desktop/testDoc.docx")>
<cfset document = createObject("java", "org.apache.poi.xwpf.usermodel.XWPFDocument")>
<cfset wordExtractor = createObject("java", "org.apache.poi.xwpf.extractor.XWPFWordExtractor")>

<!--- DOC file: use the legacy binary (HWPF) classes instead --->
<!---
<cfset myFile = createObject("java", "java.io.FileInputStream").init("C:/Users/BKBK/Desktop/myFile.doc")>
<cfset document = createObject("java", "org.apache.poi.hwpf.HWPFDocument")>
<cfset wordExtractor = createObject("java", "org.apache.poi.hwpf.extractor.WordExtractor")>
--->

<!--- For docx as well as doc: construct the document, then the extractor --->
<cfset doc = document.init(myFile)>
<cfset wordExtractor = wordExtractor.init(doc)>
<cfset docText = wordExtractor.getText()>

<!--- Release the file handle once the text has been read --->
<cfset myFile.close()>

<cfoutput>
<pre>#encodeForHTML(docText)#</pre>
</cfoutput>
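
If both formats need to be handled in one place, the two variants above could be folded into a single helper that picks the POI classes by file extension. This is only a sketch: extractWordText and the example path are hypothetical names, and it assumes the POI classes are available to ColdFusion as described above.

```cfml
<cfscript>
// Hypothetical helper: returns the plain text of a .doc or .docx file.
function extractWordText(required string filePath) {
    var doc = "";
    var extractor = "";
    var stream = createObject("java", "java.io.FileInputStream").init(arguments.filePath);
    try {
        if (lCase(listLast(arguments.filePath, ".")) == "docx") {
            // OOXML format (XWPF classes)
            doc = createObject("java", "org.apache.poi.xwpf.usermodel.XWPFDocument").init(stream);
            extractor = createObject("java", "org.apache.poi.xwpf.extractor.XWPFWordExtractor").init(doc);
        } else {
            // Anything else is treated as a legacy binary .doc (HWPF classes)
            doc = createObject("java", "org.apache.poi.hwpf.HWPFDocument").init(stream);
            extractor = createObject("java", "org.apache.poi.hwpf.extractor.WordExtractor").init(doc);
        }
        return extractor.getText();
    } finally {
        // Always release the file handle, even if POI throws
        stream.close();
    }
}

// Hypothetical path, e.g. built from the CFFILE upload result
cvText = extractWordText("C:/uploads/cv.docx");
</cfscript>
```

The returned string can then be saved with CFQUERY and CFQUERYPARAM like any other text value.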

LEGEND ,
Jun 14, 2018

I had a feeling that there would be some kind of POI method for this, but I couldn't find anything via Google on that.

I'm not the OP, but I have a feeling that if OP sees this he/she will be quite happy.

Thanks, BKBK!

V/r,

^ _ ^
