Hi -
I'm trying to copy and paste text from a PDF so I can edit and analyze the contents. The file was created in Hebrew. It is a set of Israel's election results, available on the government website: http://www.moin.gov.il/Apps/PubWebSite/mainmenu.nsf/4DF815EA4AC4E503C2256BA6002EE732/8E408A044EE1D3E...
Under document properties, the fonts listed are Helvetica (standard) and two unknown, embedded subsets (TTE1C42600t00 and TTE1DA2290t00).
I have tried:
- Copying and pasting text from Reader 9 --> opening in Word and Excel, changing the fonts around
- Copying and pasting text from Acrobat 8 Professional --> opening in Word and Excel, changing the fonts around
- Right-click, open table as spreadsheet
- Exporting as .doc, .TIFF, PostScript, .txt, .html
- Exporting as an image and running OCR (the trialware Hebrew OCR program I used did not pick up all characters correctly)
- The Adobe website mentions an Adobe Reader Middle Eastern Edition 7, but the download link takes me to the regular Reader v9 page
Can anyone think of a way to extract the data from this document so that it is editable?
Any help would be appreciated!
So none of the approaches you listed has actually worked for you?
When a PDF uses custom-encoded fonts (as this one does), standard methods won't get the text out intact. One thing you can do (assuming the custom encoding is the same throughout!) is a search/replace for each letter to fix the gobbledegook once you get it into a word processing program. Unfortunately, a lot of documents mix and match custom encodings, in which case it's pretty much hopeless.
In my experience, the most reliable OCR software for Hebrew is FineReader.
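If doing the search/replace by hand in a word processor gets tedious, the same per-letter substitution can be scripted. What follows is only a rough sketch in Python, and it assumes the whole document uses one consistent custom encoding. The entries in `GLYPH_MAP` are made-up placeholders, not the real mapping for this PDF; you would have to work out the actual pairs by comparing the pasted text against what the page shows. The function names are mine as well.

```python
# Hypothetical mapping: garbled character as pasted -> intended Hebrew letter.
# These three pairs are placeholders for illustration only; the real table
# has to be built by comparing the pasted text with the rendered PDF page.
GLYPH_MAP = {
    "!": "א",
    "#": "ב",
    "$": "ג",
}


def fix_custom_encoding(text, glyph_map):
    """Replace every mapped character in one pass; leave the rest untouched."""
    table = str.maketrans(glyph_map)
    return text.translate(table)


def unmapped_characters(text, glyph_map):
    """List characters that are not yet in the map (ignoring spaces and digits),
    which helps when building the table up letter by letter."""
    return sorted(
        {
            ch
            for ch in text
            if ch not in glyph_map and not ch.isspace() and not ch.isdigit()
        }
    )


if __name__ == "__main__":
    pasted = "!#$ 123"
    print(fix_custom_encoding(pasted, GLYPH_MAP))
    print(unmapped_characters("!#x 1", GLYPH_MAP))
```

Doing all the replacements in a single pass with `str.translate` avoids the problem where one search/replace turns a letter into something a later search/replace then changes again. If the document mixes encodings between fonts, a single table like this will not be enough.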
HTH,
Harbs
I recommend trying a different OCR program.