Content extraction while file uploading.

Report · Jun 12, 2018

I am having trouble extracting the contents of uploaded PDF and Word files. I am trying to upload a CV, extract its content, and save that content to the database.

Report · Jun 12, 2018

Just out of curiosity, why not just save the binary (.doc/.pdf) directly into the database?

V/r,

^ _ ^

Report · Jun 12, 2018

I did some looking around. The only way that I am aware of to extract text from PDF or Word documents is using a Solr collection. And at that point, it's in a custom format for Solr to use, not something that will be human readable (at least, not very well.)

I created a test Word .docx file and opened it in Notepad. It looks like binary code, so there are no strings to search/extract.

Now, if you wanted to extract data from an Excel sheet, CF can do that. Quite well.

I think your best bet is to just store the binary in the database. But I know nothing of the requirements for your project, so my suggestion could be incorrect.

V/r,

^ _ ^

Report · Jun 12, 2018

There are all kinds of tools to extract text from PDF, like Apache PDFBox. For MS Office formats, there's Apache POI. I assume there are lots of other tools as well. But the big thing about all of this is that PDFs and Office documents are largely unstructured content. In addition, PDFs may not even have text at all, and if they do it might even be less structured than Office documents. Both of the tools I mentioned are Java. I don't know if there are any CF wrappers for them.

Dave Watts, Fig Leaf Software

Dave Watts, Eidolon LLC

Report · Jun 12, 2018

It's my understanding that CF uses POI for Excel, and there is the CFSPREADSHEET tag for that. I have not seen any native CF (yet) for Word. There is the CFPDF tag, but I've never used it and am not familiar with what it can and cannot do. However, I'm sure some enterprising individual could create a custom tag that delves into POI for Word.

If I had more time, I'd do it.

V/r,

^ _ ^

Report · Jun 13, 2018

<cfpdf action="extracttext" source="abc.pdf" name="mypdf"> This works once the file has already been uploaded.
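For the upload case, one possible arrangement is to save the file first and run the extraction on the saved copy. This is only a rough sketch: the form field cvFile, the datasource myDsn, and the table cv_text are made-up names, and it assumes type="string" gives back plain text on your ColdFusion version.

```cfml
<!--- Sketch only: "cvFile", "myDsn", and "cv_text" are hypothetical names --->
<cfif structKeyExists(form, "cvFile") and len(form.cvFile)>

    <!--- 1. Save the upload to the server's temp directory --->
    <cffile action="upload"
            fileField="cvFile"
            destination="#getTempDirectory()#"
            nameConflict="makeunique"
            result="uploadResult">

    <cfset uploadedPath = uploadResult.serverDirectory & "/" & uploadResult.serverFile>

    <cfif lCase(uploadResult.serverFileExt) eq "pdf">

        <!--- 2. Extract the text now that the file is on disk --->
        <cfpdf action="extracttext"
               source="#uploadedPath#"
               type="string"
               name="cvText">

        <!--- 3. Store the extracted text alongside the original file name --->
        <cfquery datasource="myDsn">
            INSERT INTO cv_text (file_name, content)
            VALUES (
                <cfqueryparam value="#uploadResult.clientFile#" cfsqltype="cf_sql_varchar">,
                <cfqueryparam value="#cvText#" cfsqltype="cf_sql_longvarchar">
            )
        </cfquery>

    </cfif>
</cfif>
```

A scanned PDF with no text layer would presumably come back empty, so it may be worth checking len(cvText) before the insert.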

Is there any method for Word?

Report · Jun 13, 2018

Not as far as I know. But I haven't really had a need to work with Word documents. I'll look around some more, though.

V/r,

^ _ ^

Report · Jun 14, 2018

Hi althafc39854916,

Assuming you are on a recent version of ColdFusion, then you can use its in-built POI library to process Word .doc and .docx files. Examples:

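<!--- Sketch for a .doc file: "test.doc" is a placeholder, and this assumes POI's HWPF classes are on ColdFusion's class path --->
<cfset docPath = expandPath("./test.doc")>
<cfset docStream = createObject("java", "java.io.FileInputStream").init(docPath)>
<cfset wordExtractor = createObject("java", "org.apache.poi.hwpf.extractor.WordExtractor").init(docStream)>

<cfoutput>

<pre> #wordExtractor.getText()# </pre>

</cfoutput>

<cfset docStream.close()>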
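
For a .docx file, the same idea might look roughly like this. It is a sketch, not tested against any particular ColdFusion release: resume.docx is a made-up file name, it assumes the POI OOXML classes (XWPFDocument and XWPFWordExtractor) are available on the class path, and encodeForHTML needs ColdFusion 10 or later.

```cfml
<!--- Sketch only: "resume.docx" is a hypothetical file name --->
<cfset docxPath = expandPath("./resume.docx")>
<cfset docxStream = createObject("java", "java.io.FileInputStream").init(docxPath)>

<!--- XWPFDocument parses the .docx package; XWPFWordExtractor pulls out its plain text --->
<cfset docxDocument = createObject("java", "org.apache.poi.xwpf.usermodel.XWPFDocument").init(docxStream)>
<cfset docxExtractor = createObject("java", "org.apache.poi.xwpf.extractor.XWPFWordExtractor").init(docxDocument)>
<cfset docxText = docxExtractor.getText()>

<!--- Release the file handle once the text has been read --->
<cfset docxStream.close()>

<cfoutput>
<pre>#encodeForHTML(docxText)#</pre>
</cfoutput>
```

From there, docxText is an ordinary string, so it could go into the database with a cfqueryparam like any other value.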

Report · Jun 14, 2018

I had a feeling that there would be some kind of POI method for this, but I couldn't find anything via Google on that.

I'm not the OP, but I have a feeling that if OP sees this he/she will be quite happy.

Thanks, BKBK!

V/r,

^ _ ^
