We are using the PDF embed API to display PDFs to our end users.
We are requesting the PDF from a client CDN, which delivers the PDF as a byte stream.
I have looked up both the documentation regarding linearized PDFs and the sample demo on GitHub. As many of our requested PDFs are large, it is crucial to fix this issue to improve the user experience.
All the promises are resolved in what I believe is the correct order:
getInfo => Metadata (which returns fileSize)
getInitialBuffer => Initial 1024 bytes (which returns a buffer: [ArrayBuffer(1025)])
getFileBufferRanges => fetches the requested ranges (awaits all promises, returns bufferList: ArrayBuffer)
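For comparison, here is a minimal sketch of the three callbacks listed above, built on plain HTTP range requests. It is not taken from the Adobe samples. The names `rangeHeader` and `makeRangeCallbacks` are hypothetical, the `{ start, end }` shape of each requested range is an assumption, and the return shapes (`fileSize`, `buffer`, `bufferList`) simply follow the description in this post. How the callbacks are handed to the viewer is left out on purpose.

```javascript
// Builds the Range header value for an inclusive byte span.
// "bytes=0-1024" covers 1025 bytes, which matches ArrayBuffer(1025).
function rangeHeader(start, end) {
  return `bytes=${start}-${end}`;
}

// url: location of the PDF. fetchImpl: a fetch-compatible function,
// injectable so the callbacks can be exercised without a network.
function makeRangeCallbacks(url, fetchImpl) {
  async function fetchRange(start, end) {
    const res = await fetchImpl(url, {
      headers: { Range: rangeHeader(start, end) },
    });
    // A 200 here would mean the server ignored the Range header.
    if (res.status !== 206) {
      throw new Error(`Expected 206 Partial Content, got ${res.status}`);
    }
    return res.arrayBuffer();
  }

  return {
    // Resolves to the total file size, read from Content-Length.
    getInfo: async () => {
      const res = await fetchImpl(url, { method: "HEAD" });
      return { fileSize: Number(res.headers.get("Content-Length")) };
    },
    // Resolves to the first bytes of the file.
    getInitialBuffer: async () => ({ buffer: await fetchRange(0, 1024) }),
    // Assumed input: an array of { start, end } objects (inclusive).
    getFileBufferRanges: async (ranges) => ({
      bufferList: await Promise.all(
        ranges.map((r) => fetchRange(r.start, r.end))
      ),
    }),
  };
}
```

In a browser this could be created with `makeRangeCallbacks(pdfUrl, fetch)`. The strict 206 check is a design choice: it surfaces a server that silently returns the whole file instead of the requested range.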
Below are some images of our setup, as well as the header for the requested PDF.
Thanks in advance!
Header for the requested PDF:
Code using the URL directly, letting the API handle needed promises:
Code using a promise with linearizationObject (separated in two images):
Addition to this post: We are able to display the PDFs normally, but the goal is to display the first page before the whole PDF is loaded. :)
Hi! Thank you for using PDF Embed API.
To be able to display the first page of a linearized PDF before the whole file is loaded, the server hosting the PDF must support range requests.
Could you please confirm through the network tab whether the range calls for initial buffer and further requested ranges are getting resolved correctly?
Also, is this issue observed for specific files or every file you've tried with?
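Besides the network tab, the range behaviour can also be checked from the console. The sketch below is only an illustration: `parseContentRange`, `diagnoseRangeResponse` and `probeRangeSupport` are hypothetical helper names, and the result codes are made up for this example. It relies only on standard HTTP semantics, where a server that honours a range request answers `206 Partial Content` with a `Content-Range` header, and a server that ignores it answers `200` with the whole file.

```javascript
// Parses a header such as "bytes 0-1024/14680064".
function parseContentRange(value) {
  const m = /^bytes (\d+)-(\d+)\/(\d+|\*)$/.exec(value || "");
  if (!m) return null;
  return {
    start: Number(m[1]),
    end: Number(m[2]),
    total: m[3] === "*" ? null : Number(m[3]),
  };
}

// Classifies a response to a range request for requested = { start, end }.
function diagnoseRangeResponse(status, contentRange, requested) {
  if (status === 200) return "ignored-range"; // whole file was sent
  if (status !== 206) return "unexpected-status";
  const parsed = parseContentRange(contentRange);
  if (!parsed) return "bad-content-range";
  if (parsed.start !== requested.start || parsed.end !== requested.end) {
    return "range-mismatch";
  }
  return "ok";
}

// Sends one range request with the global fetch and classifies the answer.
async function probeRangeSupport(url, start = 0, end = 1024) {
  const res = await fetch(url, {
    headers: { Range: `bytes=${start}-${end}` },
  });
  return diagnoseRangeResponse(
    res.status,
    res.headers.get("Content-Range"),
    { start, end }
  );
}
```

One caveat when running this cross-origin in a browser: `Content-Range` is only readable from script if the server lists it in `Access-Control-Expose-Headers`, so a `bad-content-range` result can come from CORS configuration rather than from missing range support.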
The issue occurs for every file I have tested. That is three in total so far, all exported from Acrobat DC with the option "Optimize for fast web view". We have many more PDFs on the server, but I have chosen only a few to test with, which I know were saved using this option.
I have no knowledge of how the particular API used by Embed PDF works, but I can offer general observations on linearized PDF and how it works, gleaned over 20 years ago when it was first introduced. Some of this will be obvious/known.
The job of linearized PDF is to give a more responsive PDF experience on the web. Without linearization support BOTH in the PDF and the client, nothing will be visible, or known about the PDF, until the last byte has been read from the web.
A linearized PDF has been rearranged so that, if read from the start, the information arrives in the best order for displaying it as it is received. The info will include, in this order:
1. Organisational info e.g. page size.
2. Page text and line art (so text can be shown quickly in a substitution font)
4. Fonts. (you may well see the text rewritten in the correct font)
There are of course many other things in PDF pages and I am not sure of their order.
Clients will often initiate a request to read the whole file, and leave it running, storing all the info received. This means (a) only a single request may be seen in many cases (b) bandwidth is often consumed for the whole file, even if only the first page is shown. [There were heated discussions about this last point].
Byte range support does allow the client to jump around in the file. For example, suppose a PDF shows page 1, which has a link to page 1000, and the user follows the link. In such a case, the client will load page 1 and might continue loading on the same request, then cancel the request or add a second one to start reading page 1000.
The rules of PDF do not dictate any particular client behaviour, and so the implementors of linearisation in the Embed PDF Library could have made very different decisions to the (now largely dead) Adobe PDF browser plug-in.
It's possible to imagine many PDFs that see no real improvement in load speed with linearization.
Further comment: with modern internet speeds, a 14 MB file hardly seems to need this benefit, and it may be hard to see or measure any improvement. On a 50 Mb/sec line, the whole thing could be loaded in about 2 seconds, while a complex page might take that long to render... I suggest testing with much larger linearized PDFs to check out the function.
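The estimate above can be reproduced with a couple of lines. This is only back-of-the-envelope arithmetic: `transferSeconds` is a hypothetical helper, decimal units are assumed (1 MB = 1,000,000 bytes, 1 Mb = 1,000,000 bits), and latency and protocol overhead are ignored.

```javascript
// Idealised transfer time: file size in bytes over a line rate
// given in megabits per second.
function transferSeconds(sizeBytes, lineMbps) {
  const bits = sizeBytes * 8;
  return bits / (lineMbps * 1e6);
}

// 14 MB over a 50 Mb/sec line: 112 Mb / 50 Mb/sec, a little over 2 seconds.
const seconds = transferSeconds(14e6, 50);
console.log(seconds.toFixed(2));
```

At that scale a first-page preview saves very little, which is why a much larger file makes the effect easier to observe.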
Thank you for the in-depth information on linearized PDFs.
We have asked our content-creators to split the PDFs into smaller pieces due to this problem still not being resolved, but for future compatibility we really wish to solve this issue to make it possible to preview the file before the whole file is ready.
Do you guys have any idea where the error may be?
As the sample code on https://github.com/adobe/pdf-embed-api-samples/tree/master/More%20Samples/Linearization doesn't include an actual PDF or promise, could you provide me with such an example (and a URL for a properly linearized PDF) so I can compare the network requests and code, as well as the result of the first page being previewed before the rest?
What are your timings right now?
How long does the request to load the entire file take?
How long does it take before ANY info is shown on the page?
Why do you expect a different pattern of network requests, and what difference are you looking for?
You say "preview" - what are you expecting to see on screen, that is different from a non-linearized file?
Thanks again for the reply.
As you suggested, I have tried with a much larger PDF:
Timings (network request below):
The embedded window stays at 0% (see image below) until the whole file is received, then it displays the file. In other words: for this particular PDF, the file is shown after 46 seconds. The window with the loader is shown as soon as the script is loaded and attached to the div.
I am not necessarily looking for a different pattern for the network request. The current behaviour is expected for non-linearized PDFs, but I am expecting a different behaviour when enabling linearization support for the API, as the documentation specifically says, and I quote, "Linearization is an approach to optimize PDFs for faster viewing by displaying the first page as quickly as possible before the entire PDF gets downloaded..... PDF Embed API supports the rendering of linearized PDFs which are hosted on servers with byte-range support." From this, I expect to see the first page almost immediately.
As the documentation suggests that my expectations are realistic, I would expect something to be wrong with either my code, the PDF Embed API, the PDF itself or the byte-range requests from the server.
I really don't know how to test this further, as I feel like I have tried it all, so I really appreciate all help.
Hi, thank you for sharing these findings.
It looks like no further range calls are being made after the first range call to fetch more of the PDF's content for the first page render. That could mean that, based on the data returned for the first range request, it was determined that we need to fall back to the usual workflow and wait for the entire PDF. This might be due either to the PDF's structure or to incorrect range data being returned.
Could you inspect the data being returned for the first range request and see if it seems correct?
Also, would it be possible to share the URL or the PDF?
Your tests do indeed suggest that the linearized file is not being used. One thing to note is that, once linearized, a PDF cannot be modified in any way, by any software, without breaking the linearization. No form filling, signing, securing, changing metadata, reader enabling, stamping etc. is possible, by the very design of linearization, even if the file still says it is linearized. It can be linearized again after edits of course (but not after signing).
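Both points above (inspecting the first range response, and a file that still claims to be linearized after being modified) can be checked roughly from the initial buffer. As far as I understand the PDF specification, a linearized file carries a linearization dictionary near the start of the file with a `/Linearized` key and an `/L` entry holding the total file length, and a reader treats the file as ordinary PDF if `/L` does not match the actual length. The sketch below is a text-matching heuristic, not a PDF parser, and `inspectInitialBuffer` is a hypothetical helper name.

```javascript
// buffer: ArrayBuffer holding the first bytes of the file.
// fileSize: actual file length in bytes (e.g. from Content-Length).
function inspectInitialBuffer(buffer, fileSize) {
  // Map each byte to one character so binary bytes cannot break decoding.
  const text = Array.from(new Uint8Array(buffer), (b) =>
    String.fromCharCode(b)
  ).join("");

  const startsWithHeader = text.startsWith("%PDF-");
  const hasLinearizedKey = /\/Linearized\s+[\d.]+/.test(text);

  // "/L <number>" is the file length recorded when the file was linearized.
  const m = /\/L\s+(\d+)/.exec(text);
  const declaredLength = m ? Number(m[1]) : null;

  return {
    startsWithHeader,
    hasLinearizedKey,
    declaredLength,
    // False when the file was changed (for example appended to) afterwards.
    lengthMatches: declaredLength !== null && declaredLength === fileSize,
  };
}
```

If `startsWithHeader` is false, the range response is probably not the start of the PDF at all (an error page or a wrong offset, for instance). If `hasLinearizedKey` is true but `lengthMatches` is false, the file was most likely modified after it was linearized.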