
Extracting selected text from a PDF

New Here, Feb 07, 2018

Is it possible to extract text from a PDF based on a property of that text, such as its font?

I have a set of PDFs and an Excel file.

The PDFs contain different fields or questions (the general content is the same, but the questions vary slightly from one document to the next), and the Excel file holds a more general list covering all of these questions and many more.

I want to know which of the questions in the Excel file are present in each document, and build a matrix from that.

I could use the advanced search, but because of the nature of the data I have to search the Excel file using the questions from the PDFs.
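To make the matrix idea concrete, here is a minimal sketch. It assumes the text of each PDF has already been extracted into a plain string and that the Excel list has been exported as an array of strings. The names `normalize`, `presenceMatrix`, `masterQuestions` and `docs` are made up for illustration, and the matching rule (case-insensitive, punctuation ignored, whole words only, ASCII letters and digits) is an assumption that may be too strict or too loose for real data.

```javascript
// Normalise text so differences in case, spacing and punctuation
// do not prevent a match. Assumes English (ASCII) questions.
function normalize(text) {
  return text.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

// Build one row per master question and one column per document.
// A cell is 1 when the question appears in that document, else 0.
// `masterQuestions` stands in for the Excel list (hypothetical input);
// `docs` maps a document name to its already extracted text.
function presenceMatrix(masterQuestions, docs) {
  var names = Object.keys(docs);
  // Pad with spaces so a question only matches on whole-word boundaries.
  var padded = names.map(function (name) {
    return " " + normalize(docs[name]) + " ";
  });
  var rows = masterQuestions.map(function (question) {
    var key = normalize(question);
    return padded.map(function (text) {
      return key !== "" && text.indexOf(" " + key + " ") !== -1 ? 1 : 0;
    });
  });
  return { columns: names, rows: rows };
}
```

Each row of `rows` lines up with the question at the same index in `masterQuestions`, and each column with the name at the same index in `columns`, so the result can be written out as a spreadsheet grid.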

If anything is possible, please help.

Thanks in advance.

Community guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more
Community Expert, Feb 08, 2018

I don't believe that's possible. A script might be able to extract text of a specific font size, but not based on the font itself.

It probably can be done using a stand-alone tool or maybe even a plugin, though.

New Here, Feb 08, 2018

Can you direct me to the script that can do it by font sizes? I'll see if it can lead me somewhere.

LEGEND, Feb 08, 2018

This is complex, and you aren't likely to find a script already written to do it. It would use the JavaScript methods getPageNthWord and getPageNthWordQuads to get each word and its bounding quadrilateral, then try to deduce from that quadrilateral what size the text actually was.
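A rough sketch of that approach might look like the following. It assumes `doc` is an Acrobat document object exposing `getPageNumWords`, `getPageNthWord` and `getPageNthWordQuads`, and that each quad is an array of eight numbers (four x, y corner points). The helper names `quadHeight` and `wordsNearHeight` and the `tolerance` parameter are invented for illustration. The box height is only a proxy for the font size, so the target value would need calibrating against known text.

```javascript
// Height of one quad: eight numbers holding four (x, y) corner points.
// Taking max and min of the y values avoids relying on corner order.
// Assumes roughly horizontal text.
function quadHeight(quad) {
  var ys = [quad[1], quad[3], quad[5], quad[7]];
  return Math.max.apply(null, ys) - Math.min.apply(null, ys);
}

// Collect the words on one page whose box height is within
// `tolerance` of `targetHeight` (both in the quad's units).
function wordsNearHeight(doc, pageIndex, targetHeight, tolerance) {
  var matches = [];
  var count = doc.getPageNumWords(pageIndex);
  for (var i = 0; i < count; i++) {
    var quads = doc.getPageNthWordQuads(pageIndex, i);
    if (!quads || quads.length === 0) {
      continue; // no geometry for this word, skip it
    }
    // Only the first quad is inspected; a hyphenated word may have more.
    var height = quadHeight(quads[0]);
    if (Math.abs(height - targetHeight) <= tolerance) {
      matches.push({
        word: doc.getPageNthWord(pageIndex, i, true),
        height: height
      });
    }
  }
  return matches;
}
```

In Acrobat this could be called as `wordsNearHeight(this, 0, 14, 1)` from the console to list words on the first page with boxes about 14 units tall. The values 14 and 1 are placeholders, not recommended settings.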

Community Expert, Feb 08, 2018

It's not a pre-existing thing. I've developed scripts that do something similar to it in the past, so if you're interested feel free to contact me privately (try6767 at gmail.com) and we could discuss it further.
