Issue with Arabic Text Extraction Using PDF-Extract API

Question

I am using the PDF-Extract API to extract text from a PDF document that contains both English and Arabic text. The PDF is text-based (not an image, no OCR involved). However, the Arabic script is not extracted correctly.

In the original PDF, the Arabic names appear correctly within a paragraph as:

محمد مصطفى سالم منصور
منصور ال محمد مصطفى سالم
محمد مصطفى سالم

However, the output is:

ﺍ
ﻝ
ﺭ
ﻣﺤﻤﺪ ﻣﺼﻄﻔﻰ ﺳﺎﻟﻢ ﻣﻨﺼﻮ
ﺭ
ﺍ
ﻝ
ﻣﻨﺼﻮﻣﺤﻤﺪ ﻣﺼﻄﻔﻰ ﺳﺎﻟﻢ
ﻣﺤﻤﺪ ﻣﺼﻄﻔﻰ ﺳﺎﻟﻢ

Additionally, the extracted Arabic text moves to the beginning of the paragraph instead of retaining its original position in the document. I would like to understand:

What is the cause of this issue?
Is there a fix or workaround available to correctly extract non-Latin text while maintaining its formatting and position?

Joel Geraci · Answer

I'm not sure how you are getting anything at all from the Arabic text. Those areas should be identified as "Figures". The API is currently optimized for English-language content.
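
One part of the output above can be cleaned up after extraction, independent of the API. The extracted letters are code points from the Unicode Arabic Presentation Forms blocks (positional glyph variants), not the standard Arabic letters used in the original names. The following is a minimal sketch, assuming the extracted text is already available as a Python string; the function names are hypothetical and this is not an Adobe-provided fix. It maps presentation forms back to base letters with Unicode compatibility normalization (NFKC) from the standard library. It does not restore the letter order, the split lines, or the position of the text in the paragraph.

```python
import unicodedata


def is_presentation_form(ch: str) -> bool:
    """True if ch lies in Arabic Presentation Forms-A or -B."""
    return "\uFB50" <= ch <= "\uFDFF" or "\uFE70" <= ch <= "\uFEFC"


def has_presentation_forms(text: str) -> bool:
    """Quick check for whether extracted text needs cleaning."""
    return any(is_presentation_form(ch) for ch in text)


def fix_presentation_forms(text: str) -> str:
    """Map Arabic presentation-form code points to their base letters.

    NFKC is applied per character and only inside the two presentation
    form ranges, so other text (Latin, digits, punctuation) is untouched.
    """
    return "".join(
        unicodedata.normalize("NFKC", ch) if is_presentation_form(ch) else ch
        for ch in text
    )


if __name__ == "__main__":
    extracted = "\uFEE3\uFEA4\uFEE4\uFEAA"  # presentation forms, as in the output above
    print(has_presentation_forms(extracted))
    print(fix_presentation_forms(extracted))
```

Restricting the normalization to the presentation form ranges is a deliberate choice: running NFKC over the whole string would also rewrite unrelated compatibility characters such as Latin ligatures and full-width forms.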
