Extract Api Help inconsistencies in parse (incorrect order, combined items, header issues)

Report · Sep 07, 2021

Hello,

I am using a Pdc from John Deere. The Pdf is well defined and clean, and I can't see anything that should trick the system. It is a parts document. I am noticing the following behaviors.

1. Although the documentation says that recurring headers won't be listed, that's not accurate. I see the headers, which are Part and Quantity Required.

2. I can see that the document word-wrapped Quantity Required, so the two words are stacked vertically. However, the iterations I get per page are

--Quantity

--Required

--Quantity Required

These are headers, same as Part (they are on the same line(s), but one is on one side of the page and the other is on the other).

So it ends up causing confusion in parsing because of the variations. I just can't see why, as the documents are relatively the same except for the inner content, which is a simple structure.
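For anyone hitting the same header variations, here is a minimal sketch of one way to filter them out after extraction. It assumes the extracted text is already available as a list of strings, and the set of header fragments is just the variants listed above. The names `HEADER_VARIANTS`, `is_header`, and `drop_headers` are hypothetical, not part of any Adobe SDK.

```python
import re

# Assumption: these are the only header fragments that show up,
# based on the variants described in this post.
HEADER_VARIANTS = {"part", "quantity", "required", "quantity required"}


def is_header(text):
    """Return True if the text is exactly one of the recurring header fragments."""
    # Collapse runs of whitespace (including line breaks from word wrap).
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return normalized in HEADER_VARIANTS


def drop_headers(lines):
    """Remove header fragments and keep every other line in its original order."""
    return [line for line in lines if not is_header(line)]
```

Only exact matches are dropped, so a content line such as "Part Number : R86675" is left alone.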

3. After the download and parsing, I am using LINQ to get rid of things I do not need; however, the items themselves are out of order a lot.

The format is

# - {PartName}

Supported Models

Remarks

Is this item substituted

From a formatting perspective, there are the following variations

1. Part, part number, and supported models are always there. However, sometimes in the document there is a space after the part number before the supported models, and most of the time there are no spaces in between.

2. Not all of the items are there all the time. Remarks and whether it's substituted don't have to be.

For the most part it works great but then for no reason I see

Part - Part Number together

Supported Models - Remarks together

Remarks - Is substituted together

and so on

I just cannot find a reason for it, but I will say, no matter how many times I submit the document, Adobe parses it the same way every time, so it's 100% reproducible.

It's not a complete pain, because you always need logic to make sure things are parsed correctly, but it is weird that things are out of order and/or combined for no apparent reason, and I thought Adobe might want to use it as an example for debugging.

The file is attached

I would love to hear back

Cheers!

Report · Sep 07, 2021

Ahh, to update: I meant a "Pdf", not a "Pdc".

It appears I only have 1 data-specific issue, which is the lines being combined together.

The other issue, "Out of Order", I believe I figured out, although it's still a little inconsistent (IMHO).

It appears Adobe is treating this as a 2-column page. That's OK and makes sense. However, here is what it is doing:

Column 1 Data Scenario

If any of the possible lines (Part, Part Number, Supported Models, Remarks, IsSubstituted) is simply an empty line, so it looks like

Part

Part Number

Remarks

Substituted

It will place the value of Column 2 where that space is. So normal processing is

Part

Part Number

Supported Models

**Remarks

**IsSubstituted

**QuantityRequired value (the #)

But with a space it's

Part

Part Number

**QuantityRequired value (the #)

Supported Models

**Remarks

**IsSubstituted

and I believe this should be considered a bug. The reason is that if you had done this

Part

**QuantityRequired value (the #)

Part Number

Supported Models

**Remarks

**IsSubstituted

If the value of Column 2 were placed next to the item in Column 1 that sits on the same Y-axis, okay, I can go with that

Column1Value or Column2Value if Column1 is null

Column1Value

Column2Value (repeated) or

Column1 Value

Column2 Value

Column1Value (this column has no corresponding Column 2 Value)

Column1 Value

would make sense. But it is placed arbitrarily because there was a space between Column 1 values, and because it goes into the first "space" (if any) it finds, it can end up after any Column 1 value.

Again, IMHO this is an inconsistency bug. Would love to know your thoughts.
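To make the "same Y-axis" idea concrete, here is a rough sketch of the pairing I would expect. It assumes each extracted element is a dict with a `Text` string and a `Bounds` list of `[left, bottom, right, top]` with a bottom-left origin; check that against the JSON you actually get back. The function names, the `split_x` column boundary, and the coordinates in the usage example are all made up for illustration.

```python
def vertical_overlap(a, b):
    """Shared vertical extent of two [left, bottom, right, top] boxes (0 if none)."""
    return max(0.0, min(a[3], b[3]) - max(a[1], b[1]))


def pair_columns(elements, split_x):
    """Pair each Column 1 element with the Column 2 element on the same line.

    Elements whose left edge is below split_x count as Column 1, the rest
    as Column 2. Column 1 rows are returned top to bottom, each with the
    overlapping Column 2 text or None when nothing lines up.
    """
    left = sorted(
        (e for e in elements if e["Bounds"][0] < split_x),
        key=lambda e: -e["Bounds"][3],  # larger top = higher on the page (assumed)
    )
    right = [e for e in elements if e["Bounds"][0] >= split_x]
    rows = []
    for item in left:
        best = max(
            right,
            key=lambda r: vertical_overlap(item["Bounds"], r["Bounds"]),
            default=None,
        )
        if best is not None and vertical_overlap(item["Bounds"], best["Bounds"]) == 0:
            best = None  # nothing in Column 2 shares this line
        rows.append((item["Text"], best["Text"] if best else None))
    return rows


# Hypothetical elements, listed in the "out of order" sequence described above.
elements = [
    {"Text": "20 Spacer", "Bounds": [50, 700, 200, 712]},
    {"Text": "Part Number : R86675", "Bounds": [50, 686, 200, 698]},
    {"Text": "3", "Bounds": [500, 700, 510, 712]},
    {"Text": "Remarks : (SUB R186804)", "Bounds": [50, 658, 200, 670]},
]
rows = pair_columns(elements, split_x=300)
```

With that, the quantity stays attached to the Column 1 line it sits next to, no matter where it shows up in the returned sequence.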

Report · Sep 07, 2021

Here is a specific example of what I got back. You can see that every single item was combined; however, looking at the document, it makes no sense as to why.

20	Spacer Part Number : R86675 Supported Models : 9200, 9300, 9400 Remarks : (SUB R186804)
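As a workaround for combined lines like the one above, here is a minimal sketch that splits on the known field labels. It assumes the labels are always followed by a colon, as in this example, and it only handles the three labels that appear here. The label list and the `split_combined` name are assumptions for illustration.

```python
import re

# Assumption: these labels, each followed by " : ", delimit the fields.
LABELS = ["Part Number", "Supported Models", "Remarks"]
LABEL_RE = re.compile(
    r"\s*(" + "|".join(re.escape(label) for label in LABELS) + r")\s*:\s*"
)


def split_combined(text):
    """Split one combined line into a dict of its fields."""
    # With one capturing group, re.split returns
    # [leading text, label, value, label, value, ...].
    parts = LABEL_RE.split(text)
    head = parts[0].split(None, 1)  # item number, then part name
    record = {
        "Item": head[0] if head else "",
        "Part": head[1] if len(head) > 1 else "",
    }
    for label, value in zip(parts[1::2], parts[2::2]):
        record[label] = value.strip()
    return record


record = split_combined(
    "20\tSpacer Part Number : R86675 Supported Models : 9200, 9300, 9400 "
    "Remarks : (SUB R186804)"
)
```

This works the same whether the fields arrive combined or on separate lines, but it cannot recover a quantity value that was merged into the middle of a field.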

Report · Sep 07, 2021

I've looked at the output and I can actually see the logic that the AI applied. I just think it's never been trained for anything like this. It's not a bug exactly. It's more like a misunderstanding of how the content is organized... something that's easy for humans but hard for machines.

Can I share this document with our engineering team? I can't promise a quick fix but at least I can get this type of document onto the radar.

Report · Sep 07, 2021

Sure. Like I said, I called it a consistency bug. I would sure love to understand the logic applied, i.e. how it determines when to place data from different columns in different positions in the return.

Right now, lol, my parsing code to get around it is something that I want to avoid, as I'd like to have a consistent way to automate without so many manual modifications (OK, I do it in code, but it's still custom code).

I will say, this leads to one really hard issue: if and when it puts the Quantity # (not the words) within the content of everything else, you can't just assume to know what it means.

Imagine the # (3) just randomly placed between X and Y. When it's a single string, it just looks like an extension of X. When it's separate strings with bounds it's easy to tell, but otherwise it's guesswork.

Thanks!!!!
