Issue with Arabic Text Extraction Using PDF-Extract API
New Here
Jan 31, 2025
I am using the PDF Extract API to extract text from a PDF document that contains both English and Arabic text. The PDF is searchable (text-based, not a scanned image, and no OCR is involved). However, I have noticed an issue with how the Arabic script is extracted.
In the original PDF, the Arabic names appear correctly within a paragraph as:
محمد مصطفى سالم منصور
منصور ال محمد مصطفى سالم
محمد مصطفى سالم
However, the extracted output is:
ﺍ
ﻝ
ﺭ
ﻣﺤﻤﺪ ﻣﺼﻄﻔﻰ ﺳﺎﻟﻢ ﻣﻨﺼﻮ
ﺭ
ﺍ
ﻝ
ﻣﻨﺼﻮﻣﺤﻤﺪ ﻣﺼﻄﻔﻰ ﺳﺎﻟﻢ
ﻣﺤﻤﺪ ﻣﺼﻄﻔﻰ ﺳﺎﻟﻢ
Additionally, the extracted Arabic text moves to the beginning of the paragraph instead of retaining its original position in the document.
I would like to understand:
- What is the cause of this issue?
- Is there a fix or workaround available to correctly extract non-Latin text while maintaining its formatting and position?
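On the first question, part of the difference can be made visible by comparing code points. The sketch below is only a diagnostic, not anything from the Extract API: the helper names `describe` and `uses_presentation_forms` are made up for illustration, and the two sample strings are the first name from the PDF and from the output, written as escapes. It shows that the extracted letters come from the Unicode "Arabic Presentation Forms" blocks (positional glyph variants) rather than the base Arabic block used in the original text.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) for each non-space character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if not ch.isspace()
    ]


def uses_presentation_forms(text):
    """True if any character is in Arabic Presentation Forms-A or Forms-B."""
    return any(
        0xFB50 <= ord(ch) <= 0xFDFF or 0xFE70 <= ord(ch) <= 0xFEFC
        for ch in text
    )


# First name as it appears in the PDF (base Arabic letters).
expected = "\u0645\u062D\u0645\u062F"
# Same name as it appears in the extracted output (positional glyph forms).
extracted = "\uFEE3\uFEA4\uFEE4\uFEAA"

for label, sample in (("expected", expected), ("extracted", extracted)):
    print(label, uses_presentation_forms(sample))
    for code_point, name in describe(sample):
        print("  ", code_point, name)
```

This only explains why the strings do not compare equal even when they look alike. It says nothing about the stray single letters or the changed position in the paragraph.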
Topics: Bug, How to, PDF Extract API, PDF Services API, Python SDK
Community Expert
Feb 03, 2025
I'm not sure how you are getting anything at all from the Arabic text. Those areas should be identified as "Figures". The API is currently optimized for English language content.
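If some Arabic text does come back anyway, one possible partial workaround is to post-process the extracted strings yourself. The sketch below is not part of the Extract API or its SDK: `fold_presentation_forms` is a hypothetical helper that uses Python's standard `unicodedata` module to map characters from the Arabic Presentation Forms blocks back to their base letters via NFKC compatibility normalization, leaving everything else untouched. It assumes the output contains presentation-form glyphs, as the sample in the question appears to.

```python
import unicodedata


def fold_presentation_forms(text: str) -> str:
    """Map Arabic presentation-form glyphs back to base Arabic letters.

    Only characters in Arabic Presentation Forms-A (U+FB50..U+FDFF) and
    Forms-B (U+FE70..U+FEFC) are normalized, so Latin ligatures and other
    compatibility characters in the English text are left as they are.
    """
    folded = []
    for ch in text:
        cp = ord(ch)
        if 0xFB50 <= cp <= 0xFDFF or 0xFE70 <= cp <= 0xFEFC:
            # NFKC replaces a positional form with its base letter(s).
            folded.append(unicodedata.normalize("NFKC", ch))
        else:
            folded.append(ch)
    return "".join(folded)


# Hypothetical sample: the first name from the question, as extracted.
extracted = "\uFEE3\uFEA4\uFEE4\uFEAA"
print(fold_presentation_forms(extracted) == "\u0645\u062D\u0645\u062F")
```

This only repairs the character encoding. It does not restore letters that were split onto separate lines, and it does not fix the text being moved to the start of the paragraph, so it is at best a cleanup step and not a fix for the extraction itself.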

