
Issue with Arabic Text Extraction Using PDF-Extract API

New Here ,
Jan 31, 2025

I am using the PDF Extract API to extract text from a PDF document that contains both English and Arabic text. The PDF is text-based (not a scanned image and not OCR output). However, the Arabic script is not extracted correctly.

In the original PDF, the Arabic names appear correctly within a paragraph as:

محمد مصطفى سالم منصور
منصور ال محمد مصطفى سالم
محمد مصطفى سالم

However, the extracted output differs from this.

Additionally, the extracted Arabic text moves to the beginning of the paragraph instead of retaining its original position in the document.

I would like to understand:

  1. What is the cause of this issue?
  2. Is there a fix or workaround available to correctly extract non-Latin text while maintaining its formatting and position?

 

TOPICS
Bug, How to, PDF Extract API, PDF Services API, Python SDK
Community Expert ,
Feb 03, 2025

I'm not sure how you are getting anything at all from the Arabic text. Those areas should be identified as "Figures". The API is currently optimized for English-language content.
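
To see where the Arabic text actually ends up in the extraction result, something like the following might help. It is only a rough sketch using the Python standard library, not part of the SDK: it assumes the extracted JSON has already been loaded into a list of element dictionaries, and that each element has "Text" and "Path" keys. Those key names, the helper names, and the sample data are assumptions for illustration; check them against your own output.

```python
import unicodedata


def has_rtl(text):
    """Return True if the text contains any right-to-left character.

    Arabic letters have the Unicode bidirectional class "AL";
    other right-to-left letters (e.g. Hebrew) have class "R".
    """
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)


def flag_rtl_elements(elements):
    """List (index, path, text) for every element containing RTL text.

    `elements` is assumed to be a list of dicts with "Text" and "Path"
    keys (an assumption about the extraction output, not a confirmed schema).
    """
    flagged = []
    for index, element in enumerate(elements):
        text = element.get("Text", "")
        if has_rtl(text):
            flagged.append((index, element.get("Path", ""), text))
    return flagged


# Hypothetical sample data standing in for the loaded extraction output.
sample = [
    {"Path": "//Document/P", "Text": "Name: محمد مصطفى سالم"},
    {"Path": "//Document/P[2]", "Text": "English only"},
]

for index, path, text in flag_rtl_elements(sample):
    print(index, path, text)
```

Comparing the flagged indices and paths with the page layout should show whether the Arabic runs were moved within a paragraph or emitted as separate elements.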
