
Why does Extract API output extra bounding boxes and treat lines as rectangles?

New Here, Sep 09, 2025

Hello Adobe Team,

I’m working with scanned invoices and using two APIs together:

  • OCR API → to make the scanned PDF editable/searchable.

  • Extract API → with parameters:

     
    const params = new ExtractPDFParams({
      elementsToExtract: [ExtractElementType.TEXT, ExtractElementType.TABLES],
      addCharInfo: true,
    });

This works, but I’ve noticed unexpected results when reviewing the JSON and trying to re-render the PDF:

  1. The JSON output includes BBox attributes that add rectangular boxes around text and table elements.

  2. When rendering from this JSON in Flutter, extra borders appear that do not exist in the original scanned PDF (e.g. double borders around tables, boxes around text).

  3. It seems the API is treating every detected line or text area as a bounding rectangle, not just the actual drawn table/line borders from the original file.

    • Example: a single drawn line in the PDF becomes a rectangle in the JSON.

    • This makes it impossible to distinguish between real visual borders vs. bounding boxes used for OCR positioning.
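
One way to avoid the extra borders when re-rendering is to treat every bounding box purely as placement data and never stroke it. The sketch below assumes each text element carries a `Bounds` array in the order `[left, bottom, right, top]`, measured in PDF points from the bottom-left corner of the page, plus `Text` and `Page` fields. The helper name `toPlacements` and the `pageHeight` argument are hypothetical, and the sample values are made up, so check the field names against your own JSON before relying on it.

```javascript
// Convert extracted elements into draw-text instructions for a canvas whose
// origin is the top-left corner (as in Flutter). Bounding boxes are used only
// for positioning; no rectangle is ever emitted, so no extra borders appear.
// Assumption: Bounds = [left, bottom, right, top] in PDF points.
function toPlacements(elements, pageHeight) {
  const placements = [];
  for (const el of elements) {
    // Skip elements without text or without a usable four-number box.
    if (typeof el.Text !== "string" || !Array.isArray(el.Bounds) || el.Bounds.length !== 4) {
      continue;
    }
    const [left, bottom, right, top] = el.Bounds;
    placements.push({
      page: el.Page ?? 0,
      text: el.Text,
      x: left,
      y: pageHeight - top, // flip the y-axis from bottom-left to top-left origin
      width: right - left,
      height: top - bottom,
    });
  }
  return placements;
}

// Example with made-up values: one text element and one table container.
const sample = [
  { Text: "Invoice 001", Page: 0, Bounds: [72, 700, 200, 714] },
  { Path: "//Document/Table", Page: 0, attributes: { BBox: [70, 400, 500, 690] } },
];
console.log(toPlacements(sample, 792));
```

The table container is skipped because it has no text of its own; its box describes where the table sits, not a border that was drawn in the scan.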

TOPICS
PDF Extract API, PDF Services API
Community Expert, Sep 09, 2025

You don't need the OCR step. Extract ignores the recognized text anyway, and the Extract API performs OCR on image-only PDFs automatically. But the output of the Extract API was never designed to recreate the PDF. Also, in the real world, a line is a rectangle. It might be a very thin rectangle, but it's a rectangle.
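
If the goal is to tell rules apart from boxes, the "thin rectangle" observation can be turned into a simple heuristic. This is only a sketch: the function name `classifyRect`, the `[left, bottom, right, top]` ordering, and the 2-point thickness threshold are all assumptions chosen for illustration, not something the Extract API defines.

```javascript
// Classify a rectangle as a horizontal rule, a vertical rule, or a box,
// based on how thin it is. maxThickness is a hypothetical threshold in points.
function classifyRect([left, bottom, right, top], maxThickness = 2) {
  const width = Math.abs(right - left);
  const height = Math.abs(top - bottom);
  if (height <= maxThickness && width > height) return "horizontal-line";
  if (width <= maxThickness && height > width) return "vertical-line";
  return "box";
}

console.log(classifyRect([72, 500, 540, 500.5])); // long and very thin
console.log(classifyRect([72, 400, 540, 690]));   // table-sized area
```

A threshold like this needs tuning per document set, since scan resolution and ruling thickness vary.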

New Here, Sep 09, 2025

I have two questions:
1. I first applied the OCR API and then ran the Extract API on the resulting PDF. Why does the extracted JSON still contain "is_scanned": true?

2. Does the extracted JSON contain enough information to recreate the original PDF with any technology?

Community Expert, Sep 10, 2025

1) Because, as I mentioned in my first response, Extract ignores the OCRed text.

2) No.
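
To check the flag from question 1 without assuming where it sits in the output, a recursive lookup is enough. The sketch below searches a parsed JSON object for the first `is_scanned` key; the helper name `findKey` is hypothetical, and the sample object is made up rather than real Extract output.

```javascript
// Depth-first search for the first occurrence of `key` in a parsed JSON value.
// Returns undefined when the key is absent.
function findKey(value, key) {
  if (value === null || typeof value !== "object") return undefined;
  if (!Array.isArray(value) && Object.prototype.hasOwnProperty.call(value, key)) {
    return value[key];
  }
  for (const child of Object.values(value)) {
    const found = findKey(child, key);
    if (found !== undefined) return found;
  }
  return undefined;
}

// Made-up structure for illustration only.
const parsed = { metadata: { is_scanned: true }, elements: [] };
console.log(findKey(parsed, "is_scanned"));
```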

New Here, Sep 11, 2025

Thanks for your answer. Is there any way to convert a scanned PDF into a searchable/editable one using the Adobe PDF API?

Community Expert, Sep 11, 2025

That's what the OCR API does.
