Hi Team, we are facing the following issues:
1. Extracted text doesn't contain carriage returns.
2. Is paragraph identification supported by the API? If so, can anyone provide the documentation for it?
3. We also need the font color and strikethrough properties, which are missing from the extracted JSON. Does the API support these for text?
We are looking to buy a long-term license for the PDF Extract API, but the above information is missing. We'd appreciate any help with these.
Thanks!
Please see my responses based on your numbering and starting with a quote of the question.
Thanks for the reply.
2. Is paragraph identification supported by the API? If so, can anyone provide the documentation for it? --> Let's say we have a paragraph of two or three lines that uses different font properties. In the JSON we get the individual pieces of text (sub-texts of the paragraph), each followed by its own font properties. What we need is paragraph-level information: that these two or three lines form one paragraph, and which font properties apply to each piece inside that paragraph.
Ok - I think you are talking about "spans". Spans occur when a single paragraph contains multiple fonts or font styles. When "includeStyling" is set to true in the request, you'll get spans in the output. This can be a little tricky to interpret, though: you won't get an "element" that covers the whole paragraph. If, for example, you have a normal paragraph with a section of italic inside it, you'll get three spans, so the first element will be the first span, followed by the second and third. You then have to "reconstruct" the paragraph text from the spans based on the Path property. I'm actually in the process of writing a sample to do exactly that.
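To give a rough idea of what that reconstruction could look like, here is a minimal Python sketch (not the official sample). It assumes the extracted JSON has a list of elements, each with "Path", "Text" and optionally "Font" keys, and that a span's Path is its parent paragraph's Path plus a trailing "/StyleSpan" or "/StyleSpan[n]" segment. Those names and the helper `rebuild_paragraphs` are assumptions for illustration, so check them against your own output before relying on this.

```python
import re

# Assumption: a span's Path ends in "/StyleSpan" or "/StyleSpan[n]";
# everything before that segment identifies the parent paragraph.
SPAN_SUFFIX = re.compile(r"/StyleSpan(\[\d+\])?$")


def rebuild_paragraphs(elements):
    """Group text elements by parent path and join span text in document order.

    `elements` is assumed to be a list of dicts with "Path", "Text" and an
    optional "Font" dict, in the order they appear in the extracted JSON.
    """
    paragraphs = {}  # dicts keep insertion order, so document order is preserved
    for element in elements:
        text = element.get("Text")
        if text is None:
            continue  # skip non-text elements such as figures
        # Strip the span segment (if any) to get the paragraph's own path.
        parent_path = SPAN_SUFFIX.sub("", element.get("Path", ""))
        paragraph = paragraphs.setdefault(
            parent_path, {"path": parent_path, "text": "", "spans": []}
        )
        paragraph["text"] += text
        # Keep the per-span font so styling inside the paragraph is not lost.
        paragraph["spans"].append({"text": text, "font": element.get("Font")})
    return list(paragraphs.values())
```

Each returned entry holds the full paragraph text plus the individual spans with their font properties, which covers both the "this is one paragraph" and the "which font applies where" parts of the question. A paragraph with no styling changes simply comes back with a single span.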