I am having trouble extracting the contents of uploaded PDF and Word files. I am building a CV upload and want to extract the text from the uploaded file and save it to the database.
Just out of curiosity, why not just save the binary (.doc/.pdf) directly into the database?
^ _ ^
I did some looking around. The only way that I am aware of to extract text from PDF or Word documents is using a Solr collection. And at that point, it's in a custom format for Solr to use, not something that will be human readable (at least, not very well.)
I created a test Word .docx file and opened it in Notepad. It looks like binary code, so there are no strings to search/extract.
Now, if you wanted to extract data from an Excel sheet, CF can do that. Quite well.
I think your best bet is to just store the binary in the database. But I know nothing of the requirements for your project, so my suggestion could be incorrect.
^ _ ^
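A note on the Notepad observation above: a .docx file is a ZIP archive of XML parts, which is why it looks like binary in a text editor. The body text normally lives in the entry word/document.xml. As a rough sketch only (the path in docxPath is a hypothetical example, and stripping tags with a regex is a crude stand-in for a real parser such as POI), the entry can be read with cfzip:

```cfml
<!--- Hypothetical path; point this at a real .docx file --->
<cfset docxPath = expandPath("testDoc.docx")>

<!--- Read the main document part out of the ZIP container --->
<cfzip action="read"
       file="#docxPath#"
       entrypath="word/document.xml"
       variable="documentXml"
       charset="utf-8">

<!--- Turn paragraph ends into line breaks, then drop all remaining tags --->
<cfset docxText = reReplace(documentXml, "</w:p>", chr(10), "all")>
<cfset docxText = reReplace(docxText, "<[^>]+>", "", "all")>

<cfoutput><pre>#encodeForHTML(docxText)#</pre></cfoutput>
```

XML entities such as &amp; are left undecoded here, and text in headers, footers, and footnotes sits in other entries, so this is only useful for a quick look. It does not apply to the older .doc format, which is a true binary format.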
There are all kinds of tools to extract text from PDF, like Apache PDFBox. For MS Office formats, there's Apache POI. I assume there are lots of other tools as well. But the big thing about all of this is that PDFs and Office documents are largely unstructured content. In addition, PDFs may not even have text at all, and if they do it might even be less structured than Office documents. Both of the tools I mentioned are Java. I don't know if there are any CF wrappers for them.
Dave Watts, Fig Leaf Software
It's my understanding that CF uses POI for Excel, and there is the CFSPREADSHEET tag for that. I have not seen anything native in CF (yet) for Word. There is the CFPDF tag, but I've never used it and am not familiar with what it can and cannot do. However, I'm sure some enterprising individual could create a custom tag that delves into POI for Word.
If I had more time, I'd do it.
^ _ ^
<cfpdf action="extracttext" source="abc.pdf" name="mypdf"> works once the file has already been uploaded.
Is there a similar method for Word?
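For the PDF half, the cfpdf call can be chained onto the upload itself. The following is a minimal sketch, not a tested implementation: the form field name cvFile and the use of the temp directory as the upload destination are assumptions made for illustration.

```cfml
<!--- Assumed: the upload form has a file field named "cvFile" --->
<cffile action="upload"
        fileField="cvFile"
        destination="#getTempDirectory()#"
        nameConflict="makeunique"
        accept="application/pdf"
        result="upload">

<!--- Full path of the file as it was saved on the server --->
<cfset pdfPath = upload.serverDirectory & "/" & upload.serverFile>

<!--- type="string" asks for plain text rather than the more detailed XML form --->
<cfpdf action="extracttext" source="#pdfPath#" type="string" name="pdfText">

<!--- Inspect what came back before deciding how to store it --->
<cfdump var="#pdfText#">
```

The exact shape of the value in pdfText depends on the type attribute and the ColdFusion version, so it is worth dumping it once before writing it to the database. A scanned PDF with no text layer will come back empty.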
Not as far as I know. But I haven't really had a need to work with Word documents. I'll look around some more, though.
^ _ ^
Assuming you are on a recent version of ColdFusion, you can use its built-in POI library to process Word .doc and .docx files. Examples:
<!--- DOCX file --->
<cfset myFile = createObject("java","java.io.FileInputStream").init("C:/Users/BKBK/Desktop/testDoc.docx")>
<cfset document = createobject("java", "org.apache.poi.xwpf.usermodel.XWPFDocument")>
<cfset wordExtractor = createobject("java", "org.apache.poi.xwpf.extractor.XWPFWordExtractor")>
<!--- DOC file --->
<cfset myFile = createObject("java","java.io.FileInputStream").init("C:/Users/BKBK/Desktop/myFile.doc")>
<cfset document = createobject("java", "org.apache.poi.hwpf.HWPFDocument")>
<cfset wordExtractor = createobject("java", "org.apache.poi.hwpf.extractor.WordExtractor")>
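<!--- For docx as well as doc: load the document, then hand it to the extractor --->
<cfset doc = document.init(myFile)>
<cfset extractor = wordExtractor.init(doc)>
<cfoutput><pre>#extractor.getText()#</pre></cfoutput>
<cfset myFile.close()>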
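To get from there to the original goal of saving the text, the POI calls can be wrapped in a function and the result written with cfquery. This is a sketch under assumptions: the function name extractWordText, the variable wordPath, the datasource myDsn, and the table cv_documents with its columns are all hypothetical names, and it relies on the POI classes above being available on the server.

```cfml
<cffunction name="extractWordText" returntype="string" output="false">
    <cfargument name="path" type="string" required="true">
    <cfset var stream = createObject("java", "java.io.FileInputStream").init(arguments.path)>
    <cfset var doc = "">
    <cfset var extractor = "">
    <cftry>
        <cfif lCase(listLast(arguments.path, ".")) eq "docx">
            <!--- .docx: XML-based format, handled by XWPF --->
            <cfset doc = createObject("java", "org.apache.poi.xwpf.usermodel.XWPFDocument").init(stream)>
            <cfset extractor = createObject("java", "org.apache.poi.xwpf.extractor.XWPFWordExtractor").init(doc)>
        <cfelse>
            <!--- .doc: older binary format, handled by HWPF --->
            <cfset doc = createObject("java", "org.apache.poi.hwpf.HWPFDocument").init(stream)>
            <cfset extractor = createObject("java", "org.apache.poi.hwpf.extractor.WordExtractor").init(doc)>
        </cfif>
        <cfreturn extractor.getText()>
        <cffinally>
            <!--- Always release the file handle, even if POI throws --->
            <cfset stream.close()>
        </cffinally>
    </cftry>
</cffunction>

<!--- wordPath is assumed to hold the full path of the uploaded file --->
<cfset cvText = extractWordText(wordPath)>

<!--- Hypothetical datasource, table, and column names --->
<cfquery datasource="myDsn">
    INSERT INTO cv_documents (file_name, body_text)
    VALUES (
        <cfqueryparam value="#getFileFromPath(wordPath)#" cfsqltype="cf_sql_varchar">,
        <cfqueryparam value="#cvText#" cfsqltype="cf_sql_longvarchar">
    )
</cfquery>
```

Choosing the reader by file extension is the simplest option but trusts the uploaded name; a renamed file will make POI throw, so a cfcatch around the call may be worth adding.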
I had a feeling that there would be some kind of POI method for this, but I couldn't find anything via Google on that.
I'm not the OP, but I have a feeling that if OP sees this he/she will be quite happy.
^ _ ^