Hi,
We are solution developers using Acrobat. We have a requirement to extract text from PDFs using C#, so we downloaded and installed the Adobe SDK. We found only four C# examples, and they only cover viewing a PDF in a Windows application. Can you please guide us on how to extract text from a PDF using the SDK in C#?
Thank you for your help.
Regards,
kiranmai
Have you read the documentation, both for the native COM/.NET calls and for those available via the JSBridge?
Thank you for your quick reply. Can you please suggest the right document to help me? I feel that most of the documentation is meant for C/C++ developers.
Regards,
Kiranmai
Try page 135 of this document
More than likely, you'll need to write a plugin to handle this, because IAC doesn't seem to support it through COM.
If you are creative with the JS interface, though, you can extract all the words from a document. You would need to loop over the words and put them into an array or a List.
Take a look at page 311 of
http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/js_api_reference.pdf
Okay, so I went ahead and added text extraction to my own C# application, since the client had requested the feature anyhow. We were originally told to bypass it if it wasn't "cut and dry", but it wasn't bad, so I gave the client the text extraction they wanted. I decided I'd post the source code here for you. It returns the text from the entire document as a string.
using System;
using System.Reflection;
using System.Text;
using Acrobat; // COM interop reference to the Acrobat type library

// Returns the text of every page in the document as a single string.
// Place this method inside your own class.
private static string GetText(AcroPDDoc pdDoc)
{
    StringBuilder text = new StringBuilder();
    int pages = pdDoc.GetNumPages();

    // The JavaScript bridge object exposes the Acrobat JS Doc methods.
    object jso = pdDoc.GetJSObject();
    if (jso == null)
    {
        return "";
    }
    Type jsoType = jso.GetType();

    for (int i = 0; i < pages; i++)
    {
        // getPageNumWords(nPage) returns the number of words on page i.
        object jsNumWords = jsoType.InvokeMember("getPageNumWords", BindingFlags.InvokeMethod, null, jso, new object[] { i }, null);
        int numWords = Int32.Parse(jsNumWords.ToString());

        // Word indices are zero-based, so the last valid index is numWords - 1.
        for (int j = 0; j < numWords; j++)
        {
            // bStrip = false keeps each word's trailing punctuation and white space.
            object jsWord = jsoType.InvokeMember("getPageNthWord", BindingFlags.InvokeMethod, null, jso, new object[] { i, j, false }, null);
            text.Append((string)jsWord);
        }
    }
    return text.ToString();
}
The code sample is very helpful.
It might be even better to qualify BindingFlags as System.Reflection.BindingFlags, so that a beginner or a less alert .NET user does not have to search for the cause of the compile error when the class lacks
using System.Reflection;
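To illustrate the point about qualification, here is a minimal sketch of the same late-bound call with BindingFlags fully qualified, so it compiles without a using System.Reflection directive. FakeJsObject and LateBindingDemo are hypothetical names used only so the snippet runs without Acrobat; in the posted code the target would be the object returned by GetJSObject(). The extra Public and Instance flags are spelled out here for the plain .NET stand-in and are an addition, not part of the posted code.

```csharp
using System;

// Hypothetical stand-in for the JS bridge object, so the call pattern can be
// tried without Acrobat installed. It pretends page 0 holds three words.
public class FakeJsObject
{
    public int getPageNumWords(int nPage)
    {
        return nPage == 0 ? 3 : 0;
    }
}

public static class LateBindingDemo
{
    // Invokes getPageNumWords by name, with BindingFlags fully qualified.
    public static int CountWords(object jso, int pageIndex)
    {
        object result = jso.GetType().InvokeMember(
            "getPageNumWords",
            System.Reflection.BindingFlags.InvokeMethod
                | System.Reflection.BindingFlags.Public
                | System.Reflection.BindingFlags.Instance,
            null,
            jso,
            new object[] { pageIndex });
        return Convert.ToInt32(result);
    }

    public static void Main()
    {
        Console.WriteLine(CountWords(new FakeJsObject(), 0));
    }
}
```

Either style works; qualifying the name just makes a pasted snippet independent of the using directives in the file it lands in.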
Thank you for your support; it helped me a lot in extracting text from PDFs.
Can you please suggest how to extract images from a PDF, and also how to extract text from an image-based PDF, in C#?
There are no APIs exposed from Acrobat to C# for extracting images or for OCR.
Thank you for your reply. Can you please suggest how to extract images from a PDF using the Adobe SDK from any other .NET-supported language (other than C#)?
You cannot use .NET by itself to extract images from a PDF using the Acrobat SDK. You would have to write a plugin in C/C++ and then call the plugin from .NET.
Thank you for your reply. The samples provided with the SDK do not include one for extracting images from a PDF. Can you please provide a C/C++ plugin that extracts images from a PDF?
The samples are only there to illustrate some points of the SDK. There are thousands of possible tasks with plug-ins, perhaps millions. Writing the plug-in is _your_ job.
Hi, I can direct you to a programming guide that explains how to extract text and images from a PDF using C#.NET. Give it a try.
Hi,
Thank you for your guidance on extracting text from PDFs.
You guided me to extract text using JavaScript objects. I have checked the documents you pointed to, and they contain code only for extracting text. I also have a requirement to extract images, but those documents do not contain code for that. Can you please guide me on how to extract images from a PDF?
There are no methods for extracting images using C# with the Acrobat SDK.
Hi,
Thank you for your reply. Yes, I know that there are no methods for extracting images from a PDF using C#, and I have also learned that it can be done with a C/C++ plugin. However, the samples provided with the SDK include only a text extraction plugin, not an image extraction one. As I develop our products in C#, I am not good enough at C/C++ to create a plugin. Can you please guide me on how to create a plugin that extracts images from a PDF using the Adobe SDK?
The easiest thing to do would be to simply run the “Extract All Images” command using the AVCommand APIs. That will handle all the complexities for you.
Hi,
I am using getPageNthWord and getPageNthWordQuads to extract words and their positions from a PDF. Now I have a requirement to get each word's font properties as well, such as size, font name, and italic or bold. Is there a function like getPageNthWordQuads that returns the font properties of an extracted word?
Thanks,
Kiranmai
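For the position part of this question, a small sketch of turning one quad into a bounding box may help. It assumes each quad arrives as eight numbers (four x, y corner pairs), as the JavaScript API reference describes for quads; how the COM bridge marshals the value returned by getPageNthWordQuads into C# is not covered here and would need checking. QuadDemo and BoundingBox are hypothetical names.

```csharp
using System;

public static class QuadDemo
{
    // Returns { left, bottom, right, top } for one quad given as eight
    // numbers: four (x, y) corner pairs. Taking the min and max of each
    // coordinate makes the result independent of the corner order.
    public static double[] BoundingBox(double[] quad)
    {
        if (quad == null || quad.Length != 8)
        {
            throw new ArgumentException("A quad is expected to hold 8 numbers.", "quad");
        }

        double left = double.MaxValue, bottom = double.MaxValue;
        double right = double.MinValue, top = double.MinValue;
        for (int k = 0; k < 8; k += 2)
        {
            left = Math.Min(left, quad[k]);
            right = Math.Max(right, quad[k]);
            bottom = Math.Min(bottom, quad[k + 1]);
            top = Math.Max(top, quad[k + 1]);
        }
        return new[] { left, bottom, right, top };
    }

    public static void Main()
    {
        // Made-up coordinates for illustration only.
        double[] box = BoundingBox(new double[] { 10, 20, 50, 20, 10, 8, 50, 8 });
        Console.WriteLine(string.Join(", ", box));
    }
}
```

A word that wraps across lines may come back with more than one quad, so a caller would apply this to each quad in turn.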
Not with JavaScript. With a plug-in, yes, but you need to understand PDF internals better e.g. to realise why italic, bold, and size are not simple concepts. See http://forums.adobe.com/thread/1166866?start=0&tstart=0
Hi,
Thank you for your reply. I am now trying to create a sample plugin in C++, but I am getting an error; even when I try to build the Starter sample plugin I get the following error. As I am new to C++, I am not able to solve this.
I have defined our environment as WIN_ENV in environ.h, and I am getting the error in ACROASSERT.h.
Please suggest how to solve this so that I can create a plugin that meets my requirement.
Thanks,
Kiranmai
What do you mean "defined our environment as WIN_ENV"? How did you do this, and why did you have to?
Are you using the pre-made project file for the sample plug-in - not trying to create a new project?
By the way, one possible cause of problems compiling is trying to use plug-in code in your own EXE. You cannot, it is only made to be plugged in to the Acrobat EXE (hence the name).
Sorry, I am confused. I get the error when I open the Starter sample plugin from Visual Studio and debug it. Do you mean that we cannot debug a plugin in Visual Studio? If so, can't we create a plugin using Visual Studio?
No, I started with a new project, and in that new project I added the environ.h header file, defining the environment as Windows.
I am only using the pre-made project file as a reference, as I am new to C++.
It is possible to create a new project, but the requirements for setting it up are very complex. It is not worth wasting your time trying to solve the many problems you will get. For this reason, I recommend starting with one of the existing project files, or using the Wizard to create a new project. The project will build a file of type *.API like all of the other plug-ins.
I hope you have considered how you will communicate from your application to the plug-in. This is a challenging project in itself.