Participating Frequently
September 7, 2021
Question

Extract API help: inconsistencies in parsing (incorrect order, combined items, header issues)


Hello,

I am using a Pdc from John Deere. The PDF is well defined and clean, and I can't see anything that should trick the system. It is a parts document. I am noticing the following behaviors.

 

1. Although the documentation says that recurring headers won't be listed, that's not accurate. I still see the headers, which are Part and Quantity Required.

 

2. I can see that the document word-wraps Quantity Required, so the two words are stacked vertically. However, the variations I get per page are:

--Quantity

--Required

--Quantity Required

 

These are headers, the same as Part (they are on the same line(s), but one is on one side of the page and the other is on the other side).

 

So it ends up causing confusion in parsing because of the variations. I just can't see why, as the documents are the same, relatively speaking, except for the inner content, which is a simple structure.

 

3. After the download and parsing, I am using LINQ to get rid of things I do not need. However, the items themselves are frequently out of order.
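As a stopgap for the header variations in point 2, the stray header fragments can be filtered out after extraction. Below is a minimal sketch in Python (the same idea carries over to LINQ). It assumes the extracted text is already available as a plain list of strings; `build_header_variants` and `drop_headers` are hypothetical helper names, not part of any Adobe SDK.

```python
from __future__ import annotations

import re

# Header labels as they appear in this parts document (taken from the post).
HEADER_LABELS = ("Part", "Quantity Required")


def _normalize(text: str) -> str:
    """Collapse all whitespace and lowercase, so wrapped headers compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def build_header_variants(labels=HEADER_LABELS) -> set[str]:
    """Return each full label plus every single word of a multi-word label."""
    variants = set()
    for label in labels:
        norm = _normalize(label)
        variants.add(norm)                 # e.g. "quantity required"
        variants.update(norm.split(" "))   # e.g. "quantity", "required"
    return variants


def drop_headers(lines: list[str]) -> list[str]:
    """Keep only the lines that are not a header or a fragment of one."""
    variants = build_header_variants()
    return [line for line in lines if _normalize(line) not in variants]
```

For example, `drop_headers(["Part", "Quantity", "Required", "Quantity Required", "20", "Spacer"])` returns `["20", "Spacer"]`. One caveat: a real data line whose entire text is "Part" or "Quantity" would also be dropped, so this only works if such values never occur on their own.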

 

The format is

# - {PartName}

Supported Models

Remarks

Is this item substituted

 

From a formatting perspective, there are the following variations:

1. Part, part number, and supported models are always there. However, sometimes in the document there is a space after the part number before the supported models; most of the time there is no space in between.

 

2. Not all of the items are there all the time. Remarks and whether it's substituted don't have to be.

 

For the most part it works great, but then for no apparent reason I see:

Part - Part Number together

Supported Models - Remarks together

Remarks - Is substituted together

and so on

 

I just cannot find a reason for it, but I will say that no matter how many times I submit the document, Adobe parses it the same way every time, so it's 100% reproducible.

 

It's not a complete pain, because you always need logic to make sure things are parsed correctly, but it is weird that things are out of order and/or combined for no apparent reason, and I thought Adobe might want to use it as an example for debugging.

 

The file is attached

 

I would love to hear back

 

Cheers!

 

 

 


    1 reply

    Participating Frequently
    September 7, 2021

    Ahh, to update: I meant a "Pdf", not a "Pdc".

     

    It appears I only have one data-specific issue, which is the lines being combined together.

     

    As for the other issue, "Out of Order", I believe I figured it out, although it's still a little inconsistent (IMHO).

     

    It appears Adobe is treating this as a two-column page. That's OK and makes sense. However, here is what it is doing:

     

    Column 1 Data Scenario

    If any of the possible lines (Part, Part Number, Supported Models, Remarks, IsSubstituted) is simply an empty line, so that it looks like

     

    Part

    Part Number

     

    Remarks

    Substituted

     

    It will place the value of Column 2 where that space is. So normal processing is:

     

    Part

    Part Number

    Supported Models

    **Remarks

    **IsSubstituted

    **QuantityRequired Value (the #)

     

    But with a space it's:

     

    Part

    Part Number

    **QuantityRequired Value (the #)

    Supported Models

    **Remarks

    **IsSubstituted

     

    and I believe this should be considered a bug. The reason is that if you had done this:

    Part

    **QuantityRequired Value (the #)

    Part Number

    Supported Models

    **Remarks

    **IsSubstituted

    that is, if you had placed the value of Column 2 right after the Column 1 item that sits on the same Y-axis, then okay, I can go with that.

     

    Column1Value or Column2Value if Column1 is null

    Column1Value

    Column2Value (repeated) or

    Column1Value

    Column2Value

    Column1Value (this column has no corresponding Column2Value)

    Column1Value

     

    would make sense. But arbitrarily placing it because there was a space between Column 1 values does not, and because it is put in the first "space" (if any) that it finds, it can end up after any Column 1 value.

     

    Again, IMHO this is an inconsistency bug. Would love to know your thoughts.
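The Y-axis pairing described above can be sketched as follows. This only illustrates the behavior the post argues would make sense; it is not how Adobe's parser works. The element shape is an assumption: each element is a dict with `text` and `bounds` as `[left, bottom, right, top]`, with the origin at the bottom-left of the page. `split_x` (the boundary between the two columns) and `tolerance` are hypothetical parameters. Check the key names and coordinate order against the actual Extract JSON before relying on this.

```python
from __future__ import annotations


def pair_columns_by_y(elements: list[dict], split_x: float, tolerance: float = 2.0):
    """Pair each left-column item with the right-column value on the same row.

    Returns a list of (left_text, right_text_or_None) tuples in top-to-bottom
    order. Right-column values with no left-column item at the same height
    are ignored in this sketch.
    """
    # Left column, sorted top to bottom (a larger 'top' is higher on the page).
    left = sorted(
        (e for e in elements if e["bounds"][0] < split_x),
        key=lambda e: e["bounds"][3],
        reverse=True,
    )
    right = [e for e in elements if e["bounds"][0] >= split_x]

    rows = []
    for item in left:
        top = item["bounds"][3]
        # First right-column value whose top edge is within the tolerance.
        match = next(
            (r for r in right if abs(r["bounds"][3] - top) <= tolerance), None
        )
        if match is not None:
            right.remove(match)  # each right-column value is used at most once
        rows.append((item["text"], match["text"] if match else None))
    return rows
```

With this approach an empty line in Column 1 has no effect on where the Column 2 value lands, because placement depends only on vertical position.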

    Participating Frequently
    September 7, 2021

    Here is a specific example of what I got back. You can see that every single item was combined; however, looking at the document, it makes no sense as to why.

     

    20

    Spacer Part Number : R86675 Supported Models : 9200, 9300, 9400 Remarks : (SUB R186804)
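Until the combining is explained, a combined line like the one above can be split back into fields on the caller's side. Here is a minimal sketch in Python, assuming the labels always appear as "Part Number :", "Supported Models :" and "Remarks :" exactly as in this example. `split_combined` is a hypothetical helper name, and the label for the substituted field is left out because it does not appear in the sample.

```python
import re

# Field labels exactly as they appear in the combined sample line.
LABELS = ("Part Number", "Supported Models", "Remarks")

# One capturing group, so re.split keeps the matched label in the result.
_LABEL_RE = re.compile(
    r"\s*(" + "|".join(re.escape(label) for label in LABELS) + r")\s*:\s*"
)


def split_combined(text: str) -> dict:
    """Split 'Name Label : value Label : value ...' into a dict of fields."""
    parts = _LABEL_RE.split(text)
    # parts looks like [name, label1, value1, label2, value2, ...]
    record = {"Part": parts[0].strip()}
    for label, value in zip(parts[1::2], parts[2::2]):
        record[label] = value.strip()
    return record


line = ("Spacer Part Number : R86675 Supported Models : 9200, 9300, 9400 "
        "Remarks : (SUB R186804)")
record = split_combined(line)
```

Here `record` is `{"Part": "Spacer", "Part Number": "R86675", "Supported Models": "9200, 9300, 9400", "Remarks": "(SUB R186804)"}`. Fields that are missing from the line are simply absent from the dict, which matches the optional Remarks case described in the original post.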