Extract API help: inconsistencies in parse (incorrect order, combined items, header issues)
Hello,
I am using a Pdc from John Deere. The Pdf is well defined and clean, and I can't see anything that should trick the system. It is a parts document. I am noticing the following behaviors.
1. Although the documentation says that recurring headers won't be listed, that is not accurate. I still see the headers, which are Part and Quantity Required.
2. I can see that the document word-wrapped Quantity Required, so the two words are stacked vertically. However, the variations I get per page are
--Quantity
--Required
--Quantity Required
These are headers, same as Part (they are on the same line(s), but one is on one side of the page and the other is on the other side).
So it ends up causing confusion in parsing because of the variations. I just can't see why, as the documents are relatively the same except for the inner content, which is a simple structure.
3. After the download and parsing, I am using LINQ to get rid of things I do not need; however, the items themselves are out of order a lot.
The format is
# - {PartName}
Supported Models
Remarks
Is this item substituted
From a formatting perspective, there are the following variations:
1. Part, part number, and supported models are always there. However, sometimes in the document there is a space after the part number before the supported models; most of the time there are no spaces in between.
2. Not all of the items are there all the time. Remarks and whether it's substituted don't have to be.
For the most part it works great, but then for no reason I see
Part - Part Number together
Supported Models - Remarks together
Remarks - Is substituted together
and so on
I just cannot find a reason for it, but I will say, no matter how many times I submit the document, Adobe parses it the same way every time, so it's 100% reproducible.
It's not a complete pain, because you always need logic to make sure things are parsed correctly, but it is weird that things are out of order and/or combined for no apparent reason, and I thought Adobe might want to use it as an example for debugging.
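As a stopgap for the header variants described in point 2 (Quantity, Required, Quantity Required, and Part), here is a minimal Python sketch of a filter applied after extraction. The function names and the set of fragments are assumptions made for illustration; nothing here comes from the Extract API itself, and a real part that is literally named "Part" would be dropped too.

```python
import re

# Header fragments reported in this thread. The set is an assumption about
# what needs filtering, not something the Extract API defines.
HEADER_FRAGMENTS = {"part", "quantity", "required", "quantity required"}


def is_header_fragment(text: str) -> bool:
    """Return True if text is one of the repeated column-header variants."""
    # Collapse runs of whitespace so wrapped or padded headers compare equal.
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return normalized in HEADER_FRAGMENTS


def drop_headers(lines):
    """Remove header fragments while keeping the remaining lines in order."""
    return [line for line in lines if not is_header_fragment(line)]


sample = ["Part", "Quantity", "Required", "20 - Spacer", "Quantity  Required", "R86675"]
print(drop_headers(sample))
```

Matching on the whole normalized string (rather than a substring) keeps lines such as "Part Number : R86675" intact.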
The file is attached
I would love to hear back
Cheers!
Ahh, to update: I meant a "Pdf", not a "Pdc".
It appears I only have one data-specific issue, which is the lines being combined together.
The other issue, "Out of Order", I believe I figured out, although it's still a little inconsistent (IMHO).
It appears Adobe is treating this as a two-column page. That's OK and makes sense. However, here is what it is doing.
Column 1 Data Scenario
If any of the possible lines (Part, Part Number, Supported Models, Remarks, IsSubstituted) is simply an empty line, so it looks like
Part
Part Number
Remarks
Substituted
it will place the value of Column 2 where that space is. So normal processing is
Part
Part Number
Supported Models
**Remarks
**IsSubstituted
**QuantityRequired Value (the #)
But with a space it's
Part
Part Number
**QuantityRequired Value (the #)
Supported Models
**Remarks
**IsSubstituted
and I believe this should be considered a bug. The reason is that if you had done this
Part
**QuantityRequired Value (the #)
Part Number
Supported Models
**Remarks
**IsSubstituted
where the value of Column 2 is placed next to the Column 1 item that sits on the same Y-axis, okay, I can go with that.
Column1Value or Column2Value if Column1 is null
Column1Value
Column2Value (repeated) or
Column1 Value
Column2 Value
Column1Value (this column has no corresponding Column 2 Value)
Column1 Value
would make sense. But placing it arbitrarily because there was a space between Column 1 values does not: since it goes into the first "space" (if any) it finds, it can end up after any Column 1 value.
Again, IMHO this is an inconsistency bug. Would love to know your thoughts.
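To illustrate the Y-axis pairing described above, here is a rough Python sketch that ignores the returned order and rebuilds rows from coordinates instead. It assumes each extracted element is a dict with a "Text" string and a "Bounds" list laid out as [left, bottom, right, top] in PDF coordinates (origin at the bottom-left); please verify that against your own JSON output. `pair_columns`, `split_x`, `tolerance`, and the sample values are all hypothetical, and the sketch handles a single page only.

```python
def pair_columns(elements, split_x, tolerance=3.0):
    """Attach each right-column value to the left-column item on the same row.

    Assumes every element has "Text" and "Bounds" = [left, bottom, right, top].
    """
    def mid_y(element):
        _, bottom, _, top = element["Bounds"]
        return (bottom + top) / 2

    left = [e for e in elements if e["Bounds"][0] < split_x]
    right = [e for e in elements if e["Bounds"][0] >= split_x]

    # PDF Y coordinates grow upward, so descending Y reads top to bottom.
    left.sort(key=mid_y, reverse=True)

    rows = []
    for item in left:
        # First right-column element whose vertical center is close enough.
        match = next(
            (r for r in right if abs(mid_y(r) - mid_y(item)) <= tolerance),
            None,
        )
        rows.append((item["Text"], match["Text"] if match else None))
    return rows


# Made-up coordinates that mimic the reported order: the quantity "3" arrives
# in the middle of the Column 1 values even though it sits on the Part row.
elements = [
    {"Text": "20 - Spacer", "Bounds": [50, 700, 200, 712]},
    {"Text": "R86675", "Bounds": [50, 686, 200, 698]},
    {"Text": "3", "Bounds": [400, 700, 410, 712]},
    {"Text": "9200, 9300, 9400", "Bounds": [50, 658, 200, 670]},
]

for row in pair_columns(elements, split_x=300):
    print(row)
```

With this approach the quantity always lands next to the Column 1 item it shares a row with, and Column 1 items without a counterpart get `None`, regardless of where the value appeared in the returned sequence.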
Here is a specific example of what I got back. You can see that every single item was combined; however, looking at the document, it makes no sense as to why.
20 | Spacer Part Number : R86675 Supported Models : 9200, 9300, 9400 Remarks : (SUB R186804) |
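For combined lines like the one above, a small Python sketch of splitting the string back into fields by its labels. The label list is taken from this one example only, and `split_combined` is a hypothetical helper; it assumes every label is followed by a colon and that the labels never occur inside a value.

```python
import re

# Labels seen in the combined line above. Extend as needed (assumption).
LABELS = ("Part Number", "Supported Models", "Remarks")
LABEL_RE = re.compile(
    r"\s*(" + "|".join(re.escape(label) for label in LABELS) + r")\s*:\s*"
)


def split_combined(text: str) -> dict:
    """Split one combined line into a dict keyed by field label."""
    cleaned = text.strip().strip("|").strip()
    # With one capture group, re.split returns
    # [leading text, label, value, label, value, ...].
    parts = LABEL_RE.split(cleaned)
    record = {"Part": parts[0].strip()}
    for label, value in zip(parts[1::2], parts[2::2]):
        record[label] = value.strip()
    return record


line = "20 | Spacer Part Number : R86675 Supported Models : 9200, 9300, 9400 Remarks : (SUB R186804) |"
print(split_combined(line))
```

Fields that are missing from a given line simply do not appear in the resulting dict, which matches the fact that Remarks and the substitution note are optional.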
I've looked at the output and I can actually see the logic that the AI applied. I just think it's never been trained for anything like this. It's not a bug exactly. It's more like a misunderstanding of how the content is organized... something that's easy for humans but hard for machines.
Can I share this document with our engineering team? I can't promise a quick fix but at least I can get this type of document onto the radar.
Sure. Like I said, I called it a consistency bug. I would love to understand the logic applied, specifically how it determines where to place data from different columns in the return.
Right now, lol, the parsing code I wrote to get around it is something I want to avoid, as I'd like to have a consistent way to automate without so many manual modifications (OK, I do it in code, but it is still custom code).
I will say, this leads to one really hard issue: if and when it puts the Quantity # (not the words) within the content of everything else, you can't just assume you know what it means.
Imagine the # (3) just randomly placed between X and Y. It just looks like an extension of X when it's a single string. When they are separate strings with bounds it's easy to tell, but otherwise it's guesswork.
Thanks!!!!

