Content extraction while file uploading.

New Here, Jun 12, 2018

I am having trouble extracting the contents of uploaded PDF and Word files. I am trying to upload a CV, extract its content, and save that content to the database.

LEGEND, Jun 12, 2018

Just out of curiosity, why not just save the binary (.doc/.pdf) directly into the database?

V/r,

^ _ ^

LEGEND, Jun 12, 2018

I did some looking around.  The only way that I am aware of to extract text from PDF or Word documents is using a Solr collection.  And at that point, it's in a custom format for Solr to use, not something that will be human readable (at least, not very well.)

I created a test Word .docx file and opened it in Notepad.  It looks like binary code, so there are no strings to search/extract.

Now, if you wanted to extract data from an Excel sheet, CF can do that.  Quite well.

I think your best bet is to just store the binary in the database.  But I know nothing of the requirements for your project, so my suggestion could be incorrect.

V/r,

^ _ ^
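
For what it's worth, a .docx file looks like binary in Notepad because it is a ZIP archive of XML parts, with the body text held in the `word/document.xml` entry. Below is a minimal Java sketch, using only the JDK, that unzips that entry and collects the text runs paragraph by paragraph. The class name `DocxText` and the `sampleDocx` helper are hypothetical and exist only for this illustration. The sample archive is deliberately stripped down and is not a complete Word file. Text boxes, headers, footers and the older .doc format are not handled here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class DocxText {
    // Namespace of WordprocessingML elements such as <w:p> and <w:t>.
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Reads word/document.xml from a .docx stream; one output line per paragraph. */
    static String extractText(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals("word/document.xml")) {
                    return paragraphText(zip.readAllBytes());
                }
            }
        }
        throw new IOException("word/document.xml not found");
    }

    /** Concatenates the <w:t> runs of every <w:p> paragraph in the XML part. */
    static String paragraphText(byte[] xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Uploaded files are untrusted, so refuse DOCTYPE declarations (XXE).
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document dom = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xml));
        StringBuilder out = new StringBuilder();
        NodeList paras = dom.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < paras.getLength(); i++) {
            NodeList runs = ((Element) paras.item(i)).getElementsByTagNameNS(W_NS, "t");
            for (int j = 0; j < runs.getLength(); j++) {
                out.append(runs.item(j).getTextContent());
            }
            out.append('\n');
        }
        return out.toString();
    }

    /** Hypothetical demo helper: a stripped-down archive, not a complete .docx.
        The lines are inserted as-is, so they must not contain XML markup. */
    static byte[] sampleDocx(String... lines) throws IOException {
        StringBuilder body = new StringBuilder();
        for (String line : lines) {
            body.append("<w:p><w:r><w:t>").append(line).append("</w:t></w:r></w:p>");
        }
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body>"
            + body + "</w:body></w:document>";
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(xml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] docx = sampleDocx("Jane Doe", "Software Engineer");
        System.out.print(extractText(new ByteArrayInputStream(docx)));
    }
}
```

Since ColdFusion runs on the JVM, a class like this could in principle be called through `createObject("java", ...)` once it is on the classpath, though a maintained library is likely the more robust choice for real CVs.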

Community Expert, Jun 12, 2018

There are all kinds of tools to extract text from PDF, like Apache PDFBox. For MS Office formats, there's Apache POI. I assume there are lots of other tools as well. But the big thing about all of this is that PDFs and Office documents are largely unstructured content. In addition, PDFs may not even have text at all, and if they do it might even be less structured than Office documents. Both of the tools I mentioned are Java. I don't know if there are any CF wrappers for them.

Dave Watts, Fig Leaf Software

LEGEND, Jun 12, 2018

It's my understanding that CF uses POI for Excel, and there is the CFSPREADSHEET tag for that.  I have not seen any native CF support (yet) for Word.  There is the CFPDF tag, but I've never used it and am not familiar with what it can and cannot do.  However, I'm sure some enterprising individual could create a custom tag that delves into POI for Word.

If I had more time, I'd do it. 

V/r,

^ _ ^

New Here, Jun 13, 2018

<cfpdf action="extracttext" source="abc.pdf" name="mypdf"> works once the file has already been uploaded.

Is there a similar method for Word?

LEGEND, Jun 13, 2018

Not as far as I know.  But I haven't really had a need to work with Word documents.  I'll look around some more, though.

V/r,

^ _ ^

Community Expert, Jun 14, 2018

Hi althafc39854916,

Assuming you are on a recent version of ColdFusion, you can use its built-in POI library to process Word .doc and .docx files. Examples:

<!--- DOCX file --->
<cfset myFile = createObject("java", "java.io.FileInputStream").init("C:/Users/BKBK/Desktop/testDoc.docx")>
<cfset document = createObject("java", "org.apache.poi.xwpf.usermodel.XWPFDocument")>
<cfset wordExtractor = createObject("java", "org.apache.poi.xwpf.extractor.XWPFWordExtractor")>

<!--- DOC file: swap in this block instead of the DOCX block above --->
<!---
<cfset myFile = createObject("java", "java.io.FileInputStream").init("C:/Users/BKBK/Desktop/myFile.doc")>
<cfset document = createObject("java", "org.apache.poi.hwpf.HWPFDocument")>
<cfset wordExtractor = createObject("java", "org.apache.poi.hwpf.extractor.WordExtractor")>
--->

<!--- For docx as well as doc --->
<cfset doc = document.init(myFile)>
<!--- Keep the instance returned by init() and read the text from it --->
<cfset extractor = wordExtractor.init(doc)>
<cfoutput>
<pre> #extractor.getText()# </pre>
</cfoutput>
<!--- Release the file handle once the text has been read --->
<cfset myFile.close()>
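
The example above switches between the DOCX and DOC classes by commenting one pair out. For an upload that may be a PDF, a .docx or a .doc, one possible way to pick the branch at runtime is to look at the first few bytes of the file rather than trusting the extension. The Java sketch below does that with the JDK only. The class name `UploadSniffer` and its `Kind` enum are hypothetical names for this illustration. The signatures only identify the container: a ZIP signature also matches .xlsx files and ordinary ZIP archives, and the OLE2 signature also matches legacy .xls and .ppt files, so treat the result as a hint.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class UploadSniffer {
    /** Container families that can be told apart by their leading bytes. */
    enum Kind { PDF, ZIP_BASED, OLE2, UNKNOWN }

    /** Reads up to 8 bytes from the stream and compares them to known signatures. */
    static Kind sniff(InputStream in) throws IOException {
        byte[] head = in.readNBytes(8);
        // "%PDF" opens a PDF file.
        if (startsWith(head, 0x25, 0x50, 0x44, 0x46)) return Kind.PDF;
        // "PK\3\4" opens a ZIP archive, which is what .docx files are.
        if (startsWith(head, 0x50, 0x4B, 0x03, 0x04)) return Kind.ZIP_BASED;
        // OLE2 compound file header, used by legacy .doc files.
        if (startsWith(head, 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1)) return Kind.OLE2;
        return Kind.UNKNOWN;
    }

    /** True when data begins with the given unsigned byte values. */
    static boolean startsWith(byte[] data, int... magic) {
        if (data.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] pdfHead = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(new ByteArrayInputStream(pdfHead)));
    }
}
```

Under these assumptions, `PDF` would go to `<cfpdf action="extracttext">`, `ZIP_BASED` to the XWPF classes, and `OLE2` to the HWPF classes, with `UNKNOWN` uploads rejected.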

LEGEND, Jun 14, 2018

I had a feeling that there would be some kind of POI method for this, but I couldn't find anything via Google on that.

I'm not the OP, but I have a feeling that if OP sees this he/she will be quite happy.

Thanks, BKBK!

V/r,

^ _ ^
