February 15, 2010
Question

Copy/Pasting from Hebrew font in PDF

Hi -

I'm trying to copy and paste text from a PDF so I can edit and analyze the contents. The file was created in Hebrew. It is a set of Israel's election results, available on the government website: http://www.moin.gov.il/Apps/PubWebSite/mainmenu.nsf/4DF815EA4AC4E503C2256BA6002EE732/8E408A044EE1D3EDC2257520002817B8/$FILE/News.pdf.

Under document properties, the fonts listed are Helvetica (standard) and two unknown, embedded subsets (TTE1C42600t00 and TTE1DA2290t00).

I have tried:

- Copying and pasting text from Reader 9 --> opening in Word and Excel, changing the fonts around

- Copying and pasting text from Acrobat 8 Professional --> opening in Word and Excel, changing the fonts around

- Right-click, open table as spreadsheet

- Exporting as .doc, .TIFF, PostScript, .txt, .html

- Exporting as an image and running OCR (the trialware Hebrew OCR program I used did not pick up all characters correctly)

- The Adobe website mentions an Adobe Reader Middle Eastern Edition 7, but when I go to download it, it takes me to the regular Reader v9 page

Can anyone think of a way to extract the data from this document so that it is editable?

Any help would be appreciated!

    3 replies

    Participant
    August 18, 2010

    MP12345, I recommend trying a different OCR program.

    Harbs.
    Legend
    April 11, 2010

    When a PDF has custom-encoded fonts (such as this one), there's not much you can do to get the text out using standard methods. One thing you can do (assuming the custom encoding is the same throughout!) is a search/replace for each letter to fix the gobbledygook once you get it into a word processing program. Unfortunately, a lot of documents mix and match custom encodings, in which case it's pretty much hopeless.
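
    If the pasted text can be saved to a file, the per-letter search/replace can also be scripted. The following is only a rough sketch in Python: the mapping pairs are invented for illustration (the real ones have to be worked out by comparing the garbled paste with the page as displayed), and the function names are hypothetical. It also assumes the same encoding is used throughout the document.

```python
# Hypothetical mapping from the garbled characters that come out of the
# paste to the Hebrew letters they should be. These pairs are invented
# for illustration; the real ones must be read off the actual document.
GARBLED_TO_HEBREW = {
    "a": "א",
    "b": "ב",
    "c": "ג",
}


def fix_text(text, mapping):
    """Replace every mapped character in a single pass.

    A single pass matters: running one search/replace per letter in
    sequence can clobber earlier results when a replacement character
    is itself one of the garbled characters.
    """
    return text.translate(str.maketrans(mapping))


def unmapped_characters(text, mapping):
    """List characters with no replacement yet (ignoring spaces and digits),
    which helps when extending the mapping letter by letter."""
    return sorted(
        {ch for ch in text
         if ch not in mapping and not ch.isspace() and not ch.isdigit()}
    )


if __name__ == "__main__":
    sample = "abc 123"
    print(fix_text(sample, GARBLED_TO_HEBREW))
    print(unmapped_characters("abcd 12", GARBLED_TO_HEBREW))
```

    Digits and anything not in the mapping pass through unchanged, so the vote counts in a results table should survive. If the document mixes several custom encodings, a single mapping like this will not be enough.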

    In my experience, the most reliable OCR software for Hebrew is FineReader.

    HTH,

    Harbs

    Adobe Employee
    April 6, 2010

    So none of the approaches you listed has actually worked for you?