
PDF Extract API: Missing carriage returns, paragraph identification, and font color

New Here, Dec 27, 2023


Hi Team, we are facing the following issues:

      1. The extracted text doesn't contain carriage returns.

      2. Is paragraph identification supported by the API? If so, can anyone point us to the documentation for it?

      3. We also need the Font color and Strike off properties, which are missing from the extracted JSON. Does the API support these for text?

We are looking to buy a long-term license for the PDF Extract API, but the above information is missing. We'd appreciate any help with these.

Thanks!

TOPICS
Feature request , Java SDK , PDF Extract API , Sales and Licensing

Views

287

Community guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more
Community Expert, Jan 02, 2024


Please see my responses below. They follow your numbering, and each starts with a quote of the question.

  1. "Extracted text doesn't contain carriage returns." - The text in the JSON does not have line breaks. However, if you make the request with "getCharBounds" set to true, you can detect where line breaks occur from the bounding box of each character.
  2. "Does Paragraph identification supported by the API? If so can anyone provide the documentation for that." - I don't understand what you mean by "Paragraph identification". Can you elaborate?
  3. "Also we need the Font color and Strike off property which are missing in extracted JSON, Is this supported by the API for text?" - Font color has been added as a feature request. I expect it to be available in an update soon. What do you mean by "Strike off"? This perhaps?
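
As a rough illustration of the bounding-box approach in item 1, here is a minimal Python sketch. It assumes each text element carries a "Text" string and a "CharBounds" list with one [left, bottom, right, top] box per character, in PDF coordinates where y increases upward. That one-to-one alignment, the overlap heuristic, and the helper name insert_line_breaks are assumptions for illustration, so check them against your own output before relying on them.

```python
def insert_line_breaks(text, char_bounds):
    """Insert newlines into text using per-character boxes.

    Assumption: char_bounds[i] is [left, bottom, right, top] for text[i].
    """
    if len(text) != len(char_bounds):
        # The one-to-one assumption does not hold; return the text unchanged.
        return text

    out = []
    prev_box = None
    for ch, box in zip(text, char_bounds):
        _, bottom, _, top = box
        if prev_box is not None:
            prev_bottom, prev_top = prev_box[1], prev_box[3]
            mid = (bottom + top) / 2
            # Heuristic: if this character's vertical midpoint falls outside
            # the previous character's box, treat it as the start of a new line.
            if not (prev_bottom <= mid <= prev_top):
                if out and out[-1] == " ":
                    out.pop()  # drop the space that preceded the wrap
                out.append("\n")
        out.append(ch)
        prev_box = box
    return "".join(out)


if __name__ == "__main__":
    sample_text = "ab cd"
    sample_bounds = [
        [0, 100, 5, 110],   # a
        [5, 100, 10, 110],  # b
        [10, 100, 12, 110], # space
        [0, 88, 5, 98],     # c (next line, lower on the page)
        [5, 88, 10, 98],    # d
    ]
    print(insert_line_breaks(sample_text, sample_bounds))
```

Tightly set text, superscripts, or rotated text may need a different threshold than the midpoint test used here.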

New Here, Jan 02, 2024


Thanks for the reply.

2. "Does Paragraph identification supported by the API? If so can anyone provide the documentation for that" --> Say we have a paragraph of two or three lines that uses several different font properties. In the JSON we get the individual pieces of text (sub-texts of the paragraph), each with its own font properties. What we need is paragraph-level information: that these two or three lines form one paragraph, with the individual font properties identified within that paragraph.

Community Expert, Jan 02, 2024


OK, I think you are talking about "spans". Spans occur when a single paragraph contains multiple fonts or font styles. When "includeStyling" is set to true in the request, you'll get spans in the output. This can be a little tricky to interpret, though. You won't get an "element" that covers the whole paragraph. If, for example, you have a normal paragraph with a section of italic inside it, you'll get three spans, but the first element will be the first span, followed by the second and third. You then have to "reconstruct" the paragraph text from the spans based on the Path property. I'm actually in the process of writing a sample to do exactly that.
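
Until that sample is available, the reconstruction step could look something like the Python sketch below. It is only an illustration under assumptions: it treats any element whose Path ends in a segment containing "Span" (for example a hypothetical ".../P/StyleSpan[2]") as a child of the path before that segment, and it simply concatenates the "Text" values in output order. The exact span segment names in real output may differ, and merge_spans is a made-up helper name.

```python
import re

# Assumption: span elements end with a path segment such as "/StyleSpan" or
# "/StyleSpan[2]". Adjust the pattern to whatever your output actually uses.
SPAN_SEGMENT = re.compile(r"/[A-Za-z]*Span(\[\d+\])?$")


def merge_spans(elements):
    """Group span elements under their parent path and rebuild the text."""
    paragraphs = {}  # dicts keep insertion order, so reading order is preserved
    for el in elements:
        text = el.get("Text")
        if text is None:
            continue  # skip non-text elements such as figures or tables
        parent = SPAN_SEGMENT.sub("", el.get("Path", ""))
        para = paragraphs.setdefault(
            parent, {"Path": parent, "Text": "", "Spans": []}
        )
        para["Text"] += text
        # Keep the per-span styling so font changes inside the paragraph survive.
        para["Spans"].append({"Text": text, "Font": el.get("Font")})
    return list(paragraphs.values())


if __name__ == "__main__":
    sample = [
        {"Path": "//Document/P", "Text": "Plain "},
        {"Path": "//Document/P/StyleSpan", "Text": "italic "},
        {"Path": "//Document/P/StyleSpan[2]", "Text": "plain again."},
        {"Path": "//Document/P[2]", "Text": "Next."},
    ]
    for para in merge_spans(sample):
        print(para["Path"], "->", para["Text"])
```

Each merged entry keeps both the full paragraph text and the list of spans with their fonts, which is the paragraph-level view asked about above.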
