Issue with Arabic Text Extraction Using PDF-Extract API
I am using the PDF-Extract API to extract text from a PDF document that contains both English and Arabic text. The PDF is text-based (not a scanned image, so no OCR is involved). However, the Arabic script is not extracted correctly.
In the original PDF, the Arabic names appear correctly within a paragraph as:
محمد مصطفى سالم منصور
منصور ال محمد مصطفى سالم
محمد مصطفى سالم
However, the extracted output is:
ﺍ
ﻝ
ﺭ
ﻣﺤﻤﺪ ﻣﺼﻄﻔﻰ ﺳﺎﻟﻢ ﻣﻨﺼﻮ
ﺭ
ﺍ
ﻝ
ﻣﻨﺼﻮﻣﺤﻤﺪ ﻣﺼﻄﻔﻰ ﺳﺎﻟﻢ
ﻣﺤﻤﺪ ﻣﺼﻄﻔﻰ ﺳﺎﻟﻢ
Additionally, the extracted Arabic text moves to the beginning of the paragraph instead of retaining its original position in the document.
I would like to understand:
- What is the cause of this issue?
- Is there a fix or workaround available to correctly extract non-Latin text while maintaining its formatting and position?
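
For reference, here is a minimal post-processing sketch I am considering as a partial workaround. It assumes the extracted characters are Unicode Arabic presentation forms (the U+FB50–U+FDFF and U+FE70–U+FEFF blocks) rather than the base Arabic letters, which is what the output above looks like to me. The function names are my own, not part of the PDF-Extract API. It only maps the glyph forms back to base letters; it does not repair the split lines or the moved paragraph position.

```python
import unicodedata


def has_presentation_forms(text: str) -> bool:
    """Return True if text contains Arabic presentation-form code points."""
    return any(
        0xFB50 <= ord(ch) <= 0xFDFF or 0xFE70 <= ord(ch) <= 0xFEFF
        for ch in text
    )


def normalize_arabic(text: str) -> str:
    """Map presentation forms back to base Arabic letters using NFKC.

    NFKC applies compatibility decompositions, which include the
    positional (isolated/initial/medial/final) Arabic forms.
    """
    return unicodedata.normalize("NFKC", text)


# Hypothetical sample: the word extracted as presentation forms
# (meem initial, hah medial, meem medial, dal final).
extracted = "\uFEE3\uFEA4\uFEE4\uFEAA"
if has_presentation_forms(extracted):
    cleaned = normalize_arabic(extracted)
    # cleaned now holds the base letters U+0645 U+062D U+0645 U+062F
```

Note that NFKC also rewrites other compatibility characters (for example full-width Latin letters), so it may be safer to apply it only to runs where `has_presentation_forms` returns True.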
