Participant
January 31, 2025
Question

Issue with Arabic Text Extraction Using PDF-Extract API

I am using the PDF Extract API to extract text from a PDF document that contains both English and Arabic text. The PDF is text-based (not a scanned image and not produced by OCR). However, the Arabic script is not extracted correctly.

In the original PDF, the Arabic names appear correctly within a paragraph as:

 

محمد مصطفى سالم منصور
منصور ال محمد مصطفى سالم
محمد مصطفى سالم

 


However, the extracted output is:

ﺍ
ﻝ
ﺭ
ﻣﺤﻤﺪ ﻣﺼﻄﻔﻰ ﺳﺎﻟﻢ ﻣﻨﺼﻮ
ﺭ
ﺍ
ﻝ
ﻣﻨﺼﻮﻣﺤﻤﺪ ﻣﺼﻄﻔﻰ ﺳﺎﻟﻢ
ﻣﺤﻤﺪ ﻣﺼﻄﻔﻰ ﺳﺎﻟﻢ


Additionally, the extracted Arabic text moves to the beginning of the paragraph instead of retaining its original position in the document.

 

I would like to understand:

  1. What is the cause of this issue?
  2. Is there a fix or workaround available to correctly extract non-Latin text while maintaining its formatting and position?

 

1 reply

Joel Geraci
Community Expert
February 3, 2025

I'm not sure how you are getting anything at all from the Arabic text. Those areas should be identified as "Figures". The API is currently optimized for English-language content.
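
A possible post-processing step, offered as a sketch rather than anything the API documents: the extracted names in the question appear to use Unicode Arabic Presentation Forms (code points in U+FB50–U+FDFF and U+FE70–U+FEFF) instead of the base Arabic letters, which is why they do not match the original text when compared or searched. Compatibility normalization (NFKC) maps those forms back to base letters. The function name below is hypothetical, and this does not repair the split-off letters or the changed position in the paragraph.

```python
import unicodedata


def fold_arabic_presentation_forms(text: str) -> str:
    """Map Arabic presentation-form characters back to base Arabic letters.

    Only characters in the two Arabic Presentation Forms blocks are
    normalized, so Latin ligatures and other text are left untouched.
    """
    def fold(ch: str) -> str:
        cp = ord(ch)
        # Arabic Presentation Forms-A and Forms-B blocks
        if 0xFB50 <= cp <= 0xFDFF or 0xFE70 <= cp <= 0xFEFF:
            return unicodedata.normalize("NFKC", ch)
        return ch

    return "".join(fold(ch) for ch in text)


# Hypothetical usage: the first extracted word, written with escapes
extracted = "\uFEE3\uFEA4\uFEE4\uFEAA"   # presentation forms
cleaned = fold_arabic_presentation_forms(extracted)
print(cleaned == "\u0645\u062D\u0645\u062F")  # base letters
```

Restricting the fold to the Arabic blocks is a deliberate choice here: running NFKC over the whole string would also rewrite unrelated characters such as Latin ligatures or superscripts in the English portion.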