PDF Embed API: Linearized PDFs not displaying first page before rest is loaded

New Here, Sep 22, 2021


Hi.

We are using the PDF embed API to display PDFs to our end users.

We are requesting the PDF from a client CDN, which delivers the PDF as a byte stream.

I have read both the documentation on linearized PDFs and the sample demo on GitHub. As many of the PDFs we request are large, fixing this issue is crucial for the user experience.

All the promises are resolved in what I believe is the correct order:
getInfo => Metadata (which returns fileSize)
getInitialBuffer => Initial 1024 bytes (which returns a buffer: [ArrayBuffer(1025)])
getFileBufferRanges => fetches the requested ranges (awaits all promises, returns bufferList: ArrayBuffer[])
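
For readers who cannot see the screenshots, the three callbacks listed above could be sketched roughly as follows. This is a minimal sketch, not the poster's code: the callback names and return shapes (`fileSize`, `buffer`, `bufferList`) are taken from the description above and have not been checked against the PDF Embed API reference, while `makeLinearizationCallbacks`, `buildRangeHeader` and the injectable `fetchImpl` parameter are hypothetical helpers introduced for illustration.

```javascript
// Sketch only: names and return shapes follow the post above, not the official API docs.

function buildRangeHeader(start, end) {
  // HTTP byte ranges are inclusive at both ends: bytes=0-1023 is 1024 bytes.
  return `bytes=${start}-${end}`;
}

// fetchImpl defaults to the global fetch; passing a stand-in makes the logic testable offline.
function makeLinearizationCallbacks(url, fetchImpl = fetch) {
  async function fetchRange(start, end) {
    const res = await fetchImpl(url, {
      headers: { Range: buildRangeHeader(start, end) },
    });
    // Anything other than 206 means the server did not honour the range.
    if (res.status !== 206) {
      throw new Error(`Expected 206 Partial Content, got ${res.status}`);
    }
    return res.arrayBuffer();
  }

  return {
    // Metadata: only the file size is returned, as in the post.
    async getInfo() {
      const res = await fetchImpl(url, { method: "HEAD" });
      return { fileSize: Number(res.headers.get("Content-Length")) };
    },
    // First 1024 bytes of the file.
    async getInitialBuffer() {
      return { buffer: await fetchRange(0, 1023) };
    },
    // One request per requested range, resolved together.
    // Assumes each range has inclusive numeric `start` and `end` fields.
    async getFileBufferRanges(ranges) {
      const bufferList = await Promise.all(
        ranges.map((r) => fetchRange(r.start, r.end))
      );
      return { bufferList };
    },
  };
}
```

Throwing on any non-206 status is a design choice in this sketch: it surfaces a CDN that silently answers range requests with the whole file, instead of passing an oversized buffer on to the viewer.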

Below are some images of our setup, as well as the header for the requested PDF.

Thanks in advance!

 

Header for the requested PDF:

Kristoffer5FFE_0-1632314524233.png

Code using the URL directly, letting the API handle needed promises:

Kristoffer5FFE_2-1632314708202.png

Code using a promise with linearizationObject (separated in two images):

Kristoffer5FFE_3-1632314817810.png

Kristoffer5FFE_4-1632314848092.png

 

New Here, Sep 22, 2021


Addition to this post: We are able to display the PDFs normally, but the goal is to display the first page before the whole PDF is loaded:)

Adobe Employee, Sep 22, 2021


Hi! Thank you for using PDF Embed API. 
To be able to display the first page of a linearized PDF before the whole file is loaded, the server hosting the PDF must support range requests. 
Could you please confirm through the network tab whether the range calls for initial buffer and further requested ranges are getting resolved correctly?
Also, is this issue observed only for specific files, or for every file you've tried?
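
One way to read the network tab for this check: a server that honours a range request answers with status 206 and a `Content-Range` header, whereas a server that ignores the `Range` header answers 200 with the whole file. The helper below is a minimal sketch of that rule; `describeRangeSupport` is a hypothetical name, and it assumes the response headers are given as a plain object with lower-case keys.

```javascript
// Classify a response to a Range request from its status and headers.
function describeRangeSupport(status, headers) {
  if (status === 200) {
    return { supported: false, reason: "Range header ignored; full file returned" };
  }
  if (status !== 206) {
    return { supported: false, reason: `unexpected status ${status}` };
  }
  // Expected form: "bytes <start>-<end>/<total>", where total may be "*".
  const match = /^bytes (\d+)-(\d+)\/(\d+|\*)$/.exec(headers["content-range"] || "");
  if (!match) {
    return { supported: false, reason: "206 without a parseable Content-Range" };
  }
  return {
    supported: true,
    start: Number(match[1]),
    end: Number(match[2]),
    total: match[3] === "*" ? null : Number(match[3]),
  };
}

// Example with made-up values:
console.log(describeRangeSupport(206, { "content-range": "bytes 0-1023/5000" }));
```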

New Here, Sep 23, 2021


Hi. 
 
Thanks for the fast reply. 
 
Below is the network tab for the requests:
Kristoffer5FFE_2-1632383488405.png
Requests: 
1. Whole file
2. getInfo (a metadata fetch to the server; I only return the file size from this)
3. getInitialBuffer (Range request: bytes=0-1024)
4. getFileBufferRanges (Range request: bytes=range.start-range.end)
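
A side note on the range in item 3: HTTP byte ranges are inclusive at both ends, so `bytes=0-1024` asks for 1025 bytes, which matches the `ArrayBuffer(1025)` reported earlier in the thread. Whether the extra byte matters to the Embed API is not established here. The arithmetic, as a small sketch with a hypothetical helper name:

```javascript
// Number of bytes requested by a simple "bytes=start-end" Range value.
function rangeByteCount(rangeHeader) {
  const match = /^bytes=(\d+)-(\d+)$/.exec(rangeHeader);
  if (!match) {
    throw new Error(`Unsupported Range value: ${rangeHeader}`);
  }
  // Both ends are inclusive, hence the + 1.
  return Number(match[2]) - Number(match[1]) + 1;
}

console.log(rangeByteCount("bytes=0-1024")); // 1025
console.log(rangeByteCount("bytes=0-1023")); // 1024
```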
 
I also included the resolved responses for each function in the linearizationObject (including the requested range from getFileBufferRanges):
Kristoffer5FFE_0-1632383280672.png

 

The issue occurs for every file I have tested. That is three in total as of now, all exported from Acrobat DC with the option "Optimize for fast web view". We have a lot more PDFs on the server, but I have chosen only a few to test with, all of which I know were saved with this option.

LEGEND, Sep 23, 2021


I have no knowledge of how the particular API used by Embed PDF works, but I can offer general observations on linearized PDF and how it works, gleaned over 20 years ago when it was first introduced. Some of this will be obvious/known.

 

The job of linearized PDF is to give a more responsive PDF experience on the web. Without linearization support BOTH in the PDF and the client, nothing will be visible, or known about the PDF, until the last byte has been read from the web.

A linearized PDF has been rearranged so that, when read from the start, the information arrives in the best order for displaying it as it is received. The info will include, in this order:

1. Organisational info e.g. page size.

2. Page text and line art (so text can be shown quickly in a substitution font)

3. Images.

4. Fonts. (you may well see the text rewritten in the correct font)

There are of course many other things in PDF pages and I am not sure of their order.

Clients will often initiate a request to read the whole file, and leave it running, storing all the info received. This means (a) only a single request may be seen in many cases (b) bandwidth is often consumed for the whole file, even if only the first page is shown. [There were heated discussions about this last point].

Byte-range support does allow the client to jump around in the file. Suppose a PDF shows page 1, which has a link to page 1000, and the user follows the link. In that case, the client will load page 1 and might continue loading on the same request, then cancel that request or add a second one to start reading page 1000.

The rules of PDF do not dictate any particular client behaviour, and so the implementors of linearisation in the Embed PDF Library could have made very different decisions to the (now largely dead) Adobe PDF browser plug-in. 

It's possible to imagine many PDFs that see no real improvement in load speed with linearization.

 

Further comment: with modern internet speeds, a 14 MB file hardly seems to need this benefit, and it may be hard to see or measure any improvement. On a 50 Mb/s line, the whole thing could be loaded in about 2 seconds, while a complex page might take that long just to render. I suggest testing with much larger linearized PDFs to check the function.
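
The 2-second estimate above can be checked with a line of arithmetic. This is a minimal sketch assuming decimal units (1 MB = 8 megabits) and an ideal link with no protocol overhead or latency; `transferSeconds` is a hypothetical helper.

```javascript
// Ideal transfer time: size in megabytes, link speed in megabits per second.
function transferSeconds(sizeMB, linkMbps) {
  return (sizeMB * 8) / linkMbps;
}

console.log(transferSeconds(14, 50).toFixed(2)); // "2.24"
```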

New Here, Sep 28, 2021


Hi.

Thank you for the in-depth information on linearized PDFs.

Because this problem is still unresolved, we have asked our content creators to split the PDFs into smaller pieces. For future compatibility, though, we really want to solve it so that the file can be previewed before the whole of it has loaded.

 

Do you guys have any idea where the error may be?

 

As the sample code on https://github.com/adobe/pdf-embed-api-samples/tree/master/More%20Samples/Linearization doesn't include an actual PDF or promise, could you provide me with such an example (and a URL for a properly linearized PDF) so I can compare the network requests and code, as well as the result of the first page being previewed before the rest?

 

Kind regards,

Kristoffer

LEGEND, Sep 30, 2021


What are your timings right now?

How long does the request to load the entire file take?

How long does it take before ANY info is shown on the page?

 

Why do you expect a different pattern of network requests, and what difference are you looking for?

You say "preview" - what are you expecting to see on screen, that is different from a non-linearized file?

New Here, Sep 30, 2021


Hi.

Thanks again for the reply.

As you suggested, I have tried with a much larger PDF.
Timings (network request below):

  • Whole file: 46s for 90.1MB
  • PDF metadata: 73ms for 8.3kB
  • Initial buffer: 116ms for initial 1024B
  • Range request: 86ms for 11.2kB (I see that the range requested by the Embed API is always 10240B. Is this a coincidence? I would expect it to match the byte count for the first page given in the PDF header.)

Network requests:

Screenshot 2021-09-30 at 14.17.12.png

 

The embedded window stays at 0% (see image below) until the whole file is received, then it displays the file. In other words: for this particular PDF, the file is shown after 46 seconds. The window with the loader is shown as soon as the script is loaded and attached to the div. 

Screenshot 2021-09-30 at 13.35.59.png

I am not necessarily looking for a different pattern of network requests. The current behaviour is what I would expect for non-linearized PDFs, but I expect different behaviour when linearization support is enabled for the API, as the documentation specifically says, and I quote, "Linearization is an approach to optimize PDFs for faster viewing by displaying the first page as quickly as possible before the entire PDF gets downloaded..... PDF Embed API supports the rendering of linearized PDFs which are hosted on servers with byte-range support." From this, I expect to see the first page almost immediately.

 

As the documentation suggests that my expectations are realistic, I would expect something to be wrong with either my code, the PDF Embed API, the PDF itself or the byte-range requests from the server.

 

I really don't know how to test this further, as I feel like I have tried it all, so I really appreciate any help.

 

 

 

 

Adobe Employee, Sep 30, 2021


Hi, thank you for sharing these findings. 
It looks like no further range calls are being made after the first one to fetch more of the PDF's content for the first-page render. That could mean that, based on the data returned for the first range request, the viewer decided to fall back to the usual workflow and wait for the entire PDF. This might be due either to the PDF's structure or to incorrect range data being returned.
Could you inspect the data being returned for the first range request and see if it seems correct?
Also, would it be possible to share the URL or the PDF?

LEGEND, Sep 30, 2021


Your tests do indeed suggest that the linearized file is not being used. One thing to note is that, once linearized, a PDF cannot be modified in any way, by any software, without breaking the linearization. No form filling, signing, securing, changing metadata, reader enabling, stamping etc. is possible, by the very design of linearization, even if the file still says it is linearized. It can be linearized again after edits of course (but not after signing).
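
One concrete check for broken linearization: the linearization dictionary near the start of the file records the total file length under `/L`, and the PDF specification says a mismatch with the actual length means the file should be treated as not linearized. The same dictionary holds the end-of-first-page offset (`/E`) and the page count (`/N`). The sketch below reads those values from the first bytes of the file, taken as text; the helper names and the sample numbers are hypothetical, and whether the Embed API performs this exact check is not confirmed in this thread.

```javascript
// Extract a few integer entries from the linearization dictionary, if present.
function parseLinearizationDict(headText) {
  const dict = /<<[^>]*\/Linearized[^>]*>>/.exec(headText);
  if (!dict) return null;
  const readInt = (key) => {
    const m = new RegExp(`/${key}\\s+(\\d+)`).exec(dict[0]);
    return m ? Number(m[1]) : null;
  };
  return {
    fileLength: readInt("L"),   // total file length recorded at linearization time
    firstPageEnd: readInt("E"), // offset of the end of the first page
    pageCount: readInt("N"),
  };
}

// True only if the recorded length still matches the real file size.
function linearizationLooksIntact(headText, actualFileSize) {
  const info = parseLinearizationDict(headText);
  return info !== null && info.fileLength === actualFileSize;
}

// Made-up header for illustration:
const sampleHead =
  "%PDF-1.6\n1 0 obj\n<< /Linearized 1 /L 2500000 /H [ 1042 312 ] /O 4 /E 58231 /N 300 /T 2499000 >>\nendobj\n";
console.log(parseLinearizationDict(sampleHead));
console.log(linearizationLooksIntact(sampleHead, 2500000)); // true
console.log(linearizationLooksIntact(sampleHead, 2500417)); // false: bytes were appended
```

Comparing `fileLength` with the `fileSize` returned by the metadata request is a cheap first test before looking any deeper at the file.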
