
Extract Text from pdf using C#

Guest, Mar 29, 2012

Hi,

We are solution developers using Acrobat. We need to extract text from PDFs using C#, so we downloaded and installed the Adobe SDK. We found only four examples in C#, and they only cover viewing a PDF in a Windows application. Can you please guide us on how to extract text from a PDF using the SDK in C#?

Thank you for your help.

Regards

kiranmai

TOPICS
Acrobat SDK and JavaScript


Correct answer

Explorer, Apr 04, 2012

Try page 135 of this document

http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/plugin_apps_developer_guide.pdf

More than likely, you'll need to write a plug-in to handle this, because IAC (Interapplication Communication) doesn't seem to support it through COM.

If you are creative in using the JS interface, you can extract all the words from a document. You would need to loop over the pages and their words and put everything into an array or a List.
Take a look at page 311 of

http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/js_api_reference.pdf
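
A minimal sketch of that loop in Acrobat JavaScript, assuming `doc` is an Acrobat Doc object (for instance `this` in the JavaScript console). The helper name `collectWords` is made up for illustration; `numPages`, `getPageNumWords`, and `getPageNthWord` are the Doc members described in the JS API reference.

```javascript
// Sketch only: gather every word of a document into one array.
// `doc` is assumed to be an Acrobat Doc object; collectWords is a hypothetical name.
function collectWords(doc) {
  var words = [];
  for (var p = 0; p < doc.numPages; p++) {
    // Page and word indices are both zero-based.
    var count = doc.getPageNumWords(p);
    for (var w = 0; w < count; w++) {
      // Passing true as the third argument (bStrip) removes punctuation and whitespace.
      words.push(doc.getPageNthWord(p, w, true));
    }
  }
  return words;
}
```

Something like `collectWords(this).join(" ")` would then give the document text as a single string.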

Adobe Employee, Mar 30, 2012

Have you read the documentation, both for the native COM/.NET calls and for those available via the JSBridge?

Guest, Mar 30, 2012

Thank you for your quick reply. Can you please suggest the right document to help me? I feel that most of the documentation is meant for C/C++ developers.

Regards

Kiranmai

Explorer, Apr 04, 2012

Okay, so I went ahead and added the text extraction functionality to my own C# application, since the client had requested this feature anyhow. We were originally told to bypass it if it wasn't "cut and dry", but it wasn't bad, so I gave the client the text extraction they wanted. I decided to post the source code here for you. It returns the text of the entire document as a string.

        // Needs: using System; using System.Reflection; using System.Text; using Acrobat;
        private static string GetText(AcroPDDoc pdDoc)
        {
            StringBuilder text = new StringBuilder();
            int pages = pdDoc.GetNumPages();
            // The JSObject exposes the Acrobat JavaScript Doc methods to .NET.
            object jso = pdDoc.GetJSObject();
            if (jso == null)
            {
                return "";
            }
            Type jsoType = jso.GetType();
            for (int i = 0; i < pages; i++)
            {
                try
                {
                    object jsNumWords = jsoType.InvokeMember("getPageNumWords", BindingFlags.InvokeMethod, null, jso, new object[] { i }, null);
                    int numWords = Int32.Parse(jsNumWords.ToString());
                    // Word indices are zero-based, so the last valid index is numWords - 1.
                    for (int j = 0; j < numWords; j++)
                    {
                        // bStrip = false keeps each word's trailing punctuation and whitespace.
                        object jsWord = jsoType.InvokeMember("getPageNthWord", BindingFlags.InvokeMethod, null, jso, new object[] { i, j, false }, null);
                        text.Append(jsWord as string);
                    }
                }
                catch (Exception)
                {
                    // Best effort: skip the rest of a page that fails and continue with the next one.
                }
            }
            return text.ToString();
        }

Guest, May 30, 2012

The code sample is very helpful.

It might be even better to prefix BindingFlags with its qualifier, System.Reflection.BindingFlags, so that a beginner or a less alert .NET user does not have to search for it when the class does not have

using System.Reflection;

Guest, Jun 26, 2012

Thank you for your support; it helped me a lot in extracting text from PDFs.

Can you please suggest how to extract images from a PDF, and also how to extract text from an image-based PDF, in C#?

Adobe Employee, Jun 26, 2012

There are no APIs exposed from Acrobat to C# for extracting images or for OCR.

Guest, Jul 01, 2012

Thank you for your reply. Can you please suggest how to extract images from a PDF with the Adobe SDK using any other .NET-supported language (other than C#)?

Adobe Employee, Jul 02, 2012

You cannot use .NET by itself to extract images from a PDF using the Acrobat SDK. You would have to write a plug-in in C/C++ and then call the plug-in from .NET.

Guest, Jul 02, 2012

Thank you for your reply. The samples provided with the SDK do not include one for extracting images from a PDF. Can you please provide a plug-in in C/C++ to extract images from a PDF?

LEGEND, Jul 03, 2012

The samples are only there to illustrate some points of the SDK. There are thousands of possible tasks with plug-ins, perhaps millions. Writing the plug-in is _your_ job.

New Here, Jul 04, 2012

Hi, I can direct you to a programming guide that explains how to extract text and images from a PDF using C#.NET. Give it a try.

Guest, Jul 05, 2012

Hi,

Thank you for your guidance on extracting text from a PDF.

You guided me to extract text from a PDF using JavaScript objects. I checked the documents you pointed to, and they contain code only for extracting text. I also need to extract images, but those documents do not contain code for that. Can you please guide me on extracting images from a PDF?

Adobe Employee, Jul 05, 2012

There are no methods for extracting images using C# with the Acrobat SDK.

Guest, Jul 05, 2012

Hi,

Thank you for your reply. Yes, I know there are no methods for extracting images from a PDF using C#, and I also learned that it can be done with a C/C++ plug-in. However, the samples provided with the SDK include only a text extraction plug-in, not image extraction. As I develop our products in C#, I am not good enough at C/C++ to create a plug-in. Can you please guide me on how to create a plug-in that extracts images from a PDF using the Adobe SDK?

Adobe Employee, Jul 05, 2012

The easiest thing to do would be to simply run the “Extract All Images” command using the AVCommand APIs. That will handle all the complexities for you.

Guest, Mar 12, 2013

Hi,

I am using getPageNthWord and getPageNthWordQuads to extract words and their positions from a PDF. Now I also need each word's font properties, such as size, font name, italic or bold, etc. Is there a function like getPageNthWordQuads that returns the font properties of an extracted word?

Thanks

Kiranmai

LEGEND, Mar 13, 2013

Not with JavaScript. With a plug-in, yes, but you need to understand PDF internals better e.g. to realise why italic, bold, and size are not simple concepts. See http://forums.adobe.com/thread/1166866?start=0&tstart=0

Guest, Mar 14, 2013

Hi,

Thank you for your reply. I am now trying to create a sample plug-in in C++, but I am getting an error. Even when I try to build the Starter sample plug-in, I get the following error, and as I am new to C++ I am not able to solve it.

I have defined our environment as win_env in environ.h, and I am getting the error in ACROASSERT.h.

Please suggest how to solve this so that I can create a plug-in that meets my requirement.

Error.png

Thanks

Kiranmai

LEGEND, Mar 14, 2013

What do you mean "defined our environment as WIN_ENV"? How did you do this, and why did you have to?

Are you using the pre-made project file for the sample plug-in - not trying to create a new project?

LEGEND, Mar 14, 2013

By the way, one possible cause of problems compiling is trying to use plug-in code in your own EXE. You cannot, it is only made to be plugged in to the Acrobat EXE (hence the name).

Guest, Mar 14, 2013

Sorry, I am confused. I get the error when I open the Starter sample plug-in in Visual Studio and debug it. Do you mean that we cannot debug a plug-in in Visual Studio? If so, can't we create a plug-in using Visual Studio?

Guest, Mar 14, 2013

No, I started with a new project, and in that new project I added the environ.h header file, defining the environment as Windows.

I am only using the pre-made project file as a reference, as I am new to C++.

LEGEND, Mar 14, 2013

It is possible to create a new project, but the requirements for setting it up are very complex. It is not worth wasting your time trying to solve the many problems you will get. For this reason, I recommend starting with one of the existing project files, or using the Wizard to create a new project. The project will build a file of type *.API like all of the other plug-ins.

I hope you have considered how you will communicate from your application to the plug-in. This is a challenging project in itself.
