
PDF Extract API produces issues within text strings when encountering apostrophes

Jan 31, 2023

Hi there,

 

I've noticed that the PDF Extract API sometimes returns garbled strings. Comparing against the original PDF, this only seems to happen where the text contains an apostrophe.

It looks like the words around the apostrophe are being replaced with non-printable characters.

I can't just filter out these characters because I'll still be missing the words that were replaced.
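Since filtering loses the words, one alternative is to flag the affected strings so they can be re-checked against the source PDF. Below is a minimal sketch of that idea. It assumes the garbled runs show up as Private Use Area code points (such as \ue202) or control characters (such as \x8e), as in the file 1 sample in this post. The function names are hypothetical and are not part of the PDF Extract API.

```python
import unicodedata


def is_suspect(ch: str) -> bool:
    """True for private-use (Co) or control (Cc) characters, except common whitespace."""
    if ch in "\t\n\r":
        return False
    return unicodedata.category(ch) in ("Co", "Cc")


def find_suspect_spans(text: str) -> list[tuple[int, int]]:
    """Return (start, end) index pairs covering each run of suspect characters."""
    spans = []
    start = None
    for i, ch in enumerate(text):
        if is_suspect(ch):
            if start is None:
                start = i
        elif start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(text)))
    return spans


# Sample taken from the file 1 excerpt in this post.
sample = "Principal\ue202\x9c1\ue014\x8e\x99\x9b\x8e\x9c\x8e\x97\x9d\x8a\x9d\x92\x9f\x8e\u00ef (b)"
print(find_suspect_spans(sample))
```

Note that this only catches the file 1 pattern. In the file 2 sample the garbled text consists of ordinary printable ASCII, so a check like this would not flag it.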

Is there a way to solve this issue?

 

Note: My PDF is not an image or scan; it was saved from Microsoft Word. I can copy and paste the proper text directly from the PDF using Preview for Mac.

 

e.g. from file 1 - "...prior written approval of the Principal\ue202\x9c1\ue014\x8e\x99\x9b\x8e\x9c\x8e\x97\x9d\x8a\x9d\x92\x9f\x8eï (b) Upon request..."

[original text from file 1] - "...prior written approval of the Principal's Representative (b) Upon request..."

 

e.g. from file 2 - "...in respect of *ROG)LHOGVULJKWV under clause..."

[original text from file 2] - "...in respect of The Company's rights under clause..."
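One observation on the file 1 sample, offered as a diagnostic sketch rather than a fix: the run \x8e\x99\x9b\x8e\x9c\x8e\x97\x9d\x8a\x9d\x92\x9f\x8e lines up with "epresentative" when 41 is subtracted from each code point. That suggests, without proving it, that the original characters are still present in the output under a constant offset rather than lost. The helper `shift` below is hypothetical, and the offset of 41 is only what fits this one run; other fonts or files may well differ.

```python
def shift(text: str, delta: int) -> str:
    """Shift every code point in text by delta."""
    return "".join(chr(ord(ch) + delta) for ch in text)


# Garbled run copied from the file 1 excerpt (the part after "\ue014").
garbled_run = "\x8e\x99\x9b\x8e\x9c\x8e\x97\x9d\x8a\x9d\x92\x9f\x8e"

# Offset of -41 is an assumption derived from comparing against the original text.
print(shift(garbled_run, -41))
```

The characters immediately around the apostrophe (\ue202, \x9c, 1, \ue014) do not all fit this offset, so this does not recover the full string on its own.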

 

Cheers,

Brad
