Extract API help: inconsistencies in parse (incorrect order, combined items, header issues)
- September 7, 2021
Hello,
I am using a PDF from John Deere. The PDF is well defined and clean, and I can't see anything that should trick the system. It is a parts document. I am noticing the following behaviors.
1. Although the documentation says that recurring headers won't be listed, that's not accurate. I see the headers, which are Part and Quantity Required.
2. I can see that the document word-wrapped Quantity Required, so the two words are stacked vertically. However, the variations I get per page are
--Quantity
--Required
--Quantity Required
These are headers, the same as Part (they are on the same line(s), but one is on one side of the page and the other is on the other side).
So it ends up causing confusion in parsing because of the variations. I just can't see why, as the documents are relatively the same except for the inner content, which is a simple structure.
3. After the download and parsing, I am using LINQ to get rid of things I do not need; however, the items themselves are often out of order.
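A minimal sketch of the kind of filter that copes with the header variants in item 2, written in Python for brevity rather than LINQ. It assumes the extracted text has already been flattened into a list of strings; the function names and the sample part line are hypothetical, not part of the Extract API.

```python
import re

# Header labels reported in this post, including the wrapped fragments.
HEADER_VARIANTS = {"part", "quantity", "required", "quantity required"}


def normalize(text: str) -> str:
    """Collapse whitespace runs (including line breaks) and lowercase."""
    return re.sub(r"\s+", " ", text).strip().lower()


def drop_headers(lines):
    """Return the lines that are not one of the recurring header fragments."""
    return [line for line in lines if normalize(line) not in HEADER_VARIANTS]


# Hypothetical sample: header fragments mixed in with real content.
sample = [
    "Part",
    "Quantity\nRequired",
    "1 - Example Part",
    "Quantity",
    "Required",
    "Supported Models",
]
cleaned = drop_headers(sample)
print(cleaned)
```

Matching on the normalized text means the wrapped and unwrapped forms of Quantity Required are treated the same, at the cost of also dropping any real content line that consists only of one of those words.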
The format is
# - {PartName}
Supported Models
Remarks
Is this item substituted
From a formatting perspective, there are the following variations:
1. The part, part number, and supported models are always there. However, sometimes in the document there is a space after the part number before the supported models, and most of the time there is no space in between.
2. Not all of the items are there all the time. Remarks and whether it's substituted don't have to be.
For the most part it works great, but then for no apparent reason I see:
Part - Part Number together
Supported Models - Remarks together
Remarks - Is substituted together
and so on.
I just cannot find a reason for it, but I will say that no matter how many times I submit the document, Adobe parses it the same way every time, so it's 100% reproducible.
It's not a complete pain, because you always need logic to make sure things are parsed correctly, but it is weird that things are out of order and/or combined for no apparent reason, and I thought Adobe might want to use it as an example for debugging.
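As an illustration of the ordering logic mentioned above, here is a rough Python sketch that re-sorts extracted text elements by position. It assumes each element in the JSON output carries `Text`, `Page`, and `Bounds` keys, with `Bounds` given as `[left, bottom, right, top]` in PDF coordinates; those field names and the coordinate order are my understanding of the output and should be checked against the actual file. The sample values are made up, and the sort only makes sense for a simple top-to-bottom layout.

```python
def reading_order(elements):
    """Sort text elements by page, then top-to-bottom, then left-to-right."""

    def sort_key(element):
        left, _bottom, _right, top = element["Bounds"]
        # PDF coordinates grow upward, so negate `top` for top-to-bottom order.
        return (element.get("Page", 0), -top, left)

    # Skip elements that have no text or no position (assumed possible).
    positioned = [e for e in elements if "Text" in e and "Bounds" in e]
    return sorted(positioned, key=sort_key)


# Hypothetical elements, deliberately out of order.
sample_elements = [
    {"Text": "Remarks", "Page": 0, "Bounds": [50, 600, 200, 612]},
    {"Text": "1 - Example Part", "Page": 0, "Bounds": [50, 700, 200, 712]},
    {"Text": "Supported Models", "Page": 0, "Bounds": [50, 650, 200, 662]},
    {"Text": "2 - Another Part", "Page": 1, "Bounds": [50, 700, 200, 712]},
]
ordered_text = [e["Text"] for e in reading_order(sample_elements)]
print(ordered_text)
```

This does not separate items that arrive combined in a single element; that still needs splitting logic specific to the document.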
The file is attached.
I would love to hear back.
Cheers!
