Issue with Arabic Text Extraction Using PDF-Extract API

Jan 31, 2025

I am using the PDF-Extract API to extract text from a PDF document that contains both English and Arabic text. The PDF is text-based (not a scanned image, and no OCR is involved). However, the Arabic script is not extracted correctly.

In the original PDF, the Arabic names appear correctly within a paragraph as:

محمد مصطفى سالم منصور
منصور ال محمد مصطفى سالم
محمد مصطفى سالم

However, the extracted output is:

ﺍ
ﻝ
ﺭ
ﻣﺤﻤﺪ ﻣﺼﻄﻔﻰ ﺳﺎﻟﻢ ﻣﻨﺼﻮ
ﺭ
ﺍ
ﻝ
ﻣﻨﺼﻮﻣﺤﻤﺪ ﻣﺼﻄﻔﻰ ﺳﺎﻟﻢ
ﻣﺤﻤﺪ ﻣﺼﻄﻔﻰ ﺳﺎﻟﻢ

Additionally, the extracted Arabic text moves to the beginning of the paragraph instead of retaining its original position in the document.

I would like to understand:

What is the cause of this issue?
Is there a fix or workaround available to correctly extract non-Latin text while maintaining its formatting and position?

Feb 03, 2025

I'm not sure how you are getting anything at all from the Arabic text. Those areas should be identified as "Figures". The API is currently optimized for English language content.
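
That said, the characters in the posted output look like Arabic presentation forms (the positional glyph variants in the U+FB50 to U+FDFF and U+FE70 to U+FEFC ranges) rather than the standard Arabic letters. If that is the case, a possible post-processing workaround is Unicode NFKC normalization, which maps presentation forms back to their base letters. The sketch below is only an illustration using Python's standard library, not part of the PDF-Extract API; the helper names are hypothetical. It would not restore the letters that were split onto separate lines, and it would not fix the position of the text within the paragraph.

```python
import unicodedata


def has_presentation_forms(text: str) -> bool:
    """Return True if text contains Arabic presentation-form code points."""
    return any(
        0xFB50 <= ord(ch) <= 0xFDFF or 0xFE70 <= ord(ch) <= 0xFEFC
        for ch in text
    )


def normalize_arabic(text: str) -> str:
    """Map presentation forms to base letters via NFKC normalization."""
    return unicodedata.normalize("NFKC", text)


# Assumed sample: the first word of the posted output, written as
# presentation forms (initial meem, medial hah, medial meem, final dal).
extracted = "\ufee3\ufea4\ufee4\ufeaa"
normalized = normalize_arabic(extracted)

print(has_presentation_forms(extracted))   # presentation forms detected
print(has_presentation_forms(normalized))  # base letters only after NFKC
```

Note that NFKC also rewrites other compatibility characters (for example Latin ligatures and full-width forms), so it may be safer to apply it only to runs where `has_presentation_forms` returns true.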
