Language property of tags is lost when merging LibreOffice-generated PDF documents

Report · May 11, 2021

Hi,

My organization uses LibreOffice to create technically oriented multilingual documents. We have used Acrobat to merge these LibreOffice-generated PDF documents into other PDF documents. We have now found out that merging documents with Acrobat removes the documents' language tagging.

LibreOffice generates the language information as part of the document structure tagging, which is properly recognized by screen readers and passes Acrobat's accessibility check. However, after merging documents with Acrobat Pro DC, all of the language information is stripped from the document. The document structure tagging is otherwise preserved, but the language properties are removed in the process.

This problem does not happen when merging PDF documents generated with Word 2019. As I understand it, Word 2019 uses a different technical approach and marks the language as part of the content stream. However, as far as I understand, the LibreOffice way is completely valid with regard to the PDF standard, and you can even manually language-tag text in Acrobat this way.

Is there any way to prevent the language properties in document tag structure from being stripped away while merging documents with Acrobat?

I have attached two example PDF files generated with LibreOffice and the resulting merged document generated with Acrobat.

Report · May 11, 2021

Don't merge the files in Acrobat.

Use LibreOffice Draw to edit and merge the desired documents, and then click Export Directly to PDF.

Then see the results when you open that PDF with Acrobat.

You can use the Accessibility Checker tool and run a full report to see what Acrobat reports back.

Report · May 18, 2021

I really do not get this solution. I do not see how it resolves my issue in any way.

I could use LibreOffice Draw to merge PDF files, but it was not really designed for this purpose and it loses all of the original tagging, bookmarks and named destinations of the PDF files (in addition to having other issues). This is why people pay for Acrobat.

1. I need to combine LibreOffice-generated, standards-compliant PDF files with PDF documents generated with other tools such as MS Word or Adobe Illustrator.

2. I also need the merging process to retain the accessibility and other advanced PDF features, such as bookmarks and named destinations, of all of the files. I also do not want to regenerate these manually or use some automated guessing algorithm to do it (the result would be worse than the original).

3. Acrobat does everything else correctly, except that the language property of the content tagging is lost.

If this is a known issue, then I hope someone at Adobe listens and fixes it.

Report · May 18, 2021

Your original LibreOffice PDFs aren't tagged correctly for their respective languages. Here are the problems I found in the first sample you provided. I didn't go further into the other documents because I think the problem lies with your originals.

1. The PDF's overall language is mis-specified as "Finnish" when it really is an English document with two small portions of text in Finnish and Swedish. See File / Properties / Advanced tab / Language, which is the global language setting for the entire document.

2. The Finnish sample doesn't have the language attribute applied to it at all.

3. The Swedish sample has the Language attribute applied to the <Span> tag, which is a recommended method. But it's coded as sv-SE rather than SV (which is what you'll see if you select Swedish from the drop-down menu). I'm not familiar with Nordic languages in PDFs, but having worked with other languages, I know that the exact spelling of the language's abbreviation is sometimes critical. Since Adobe writes the PDF and PDF/UA standards, I'd trust what Acrobat gives, not LibreOffice, which is an open source program and might not be correctly encoding your PDFs to the PDF standards. Set the language on the TAG tab.

Note: If the entire paragraph is in a particular language, then set the Language attribute on the <P> or other block-level tag.
If only a few words within the paragraph are in a particular language, then wrap those words in a <Span> tag and put the Language attribute on the <Span> tag, under the TAG tab at the top. We do not recommend setting the language attribute on the CONTENT tab, as this has caused problems with some assistive technologies like screen readers. Although Adobe's engineers told me it's OK to do that, it doesn't work in the real world.

Conclusion: I don't believe LibreOffice is creating a compliant PDF for you per the PDF standard. Check how you're formatting the source document in LibreOffice, especially how you set the language of your text. You might need to manually correct the tags in Acrobat before combining the PDFs.

And keep in mind that this is a user-to-user, all-volunteer help forum for Adobe's products, so you might not find any experts on LibreOffice here. I've only played with it a few times (it does a decent job of basic office tasks), but I don't believe it's encoding PDFs as well as it should. The PDF specification (ISO 32000) is 1,000+ pages with exacting details for encoding. Not many volunteer open-source programmers have the time or experience to meet that specification well enough. Not saying it's not possible, just saying it might not be what programmers want to do in their spare time.

| Bevi Chagnon | Designer & Technologist for Accessible Documents
| Classes & Books for Accessible InDesign, PDFs & MS Office |

Report · May 20, 2021

First of all, thank you very much for looking into this issue in such great detail. I really do appreciate it. However, I disagree with many of your conclusions.

1. About the default language

In these kinds of problem cases, I always try to make the example document as simple as possible to highlight the real problem and to make troubleshooting easier. I also use English to make the content more widely understandable.

The default language of the document is Finnish, because I live in Finland and the default language of my documents is Finnish. It does not matter that the majority of the text is not in Finnish. For this document it only means that, unless specified otherwise, the language of the text is Finnish.

2. About ‘missing’ Finnish language attribute

See previous comment. The Finnish language paragraph does not have a language attribute because it is the default language of the document and that is why it is omitted.
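The fallback rule described here can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the structure elements are plain dictionaries, and the names `effective_language`, `lang` and `parent` are hypothetical and do not correspond to any PDF library.

```python
def effective_language(element, document_language):
    """Return the language that applies to a structure element.

    `element` is a plain dict with optional "lang" and "parent" keys. It is
    a toy stand-in for a tagged-PDF structure element, not a real PDF object.
    """
    node = element
    while node is not None:
        if node.get("lang"):
            # The nearest explicit language attribute wins.
            return node["lang"]
        node = node.get("parent")
    # Nothing set on the element or its ancestors: use the document default.
    return document_language


document_language = "fi"  # document-level default, as in the sample file
paragraph_fi = {"tag": "P", "parent": None}  # no language attribute at all
paragraph_sv = {"tag": "P", "parent": None}
span_sv = {"tag": "Span", "lang": "sv-SE", "parent": paragraph_sv}

print(effective_language(paragraph_fi, document_language))
print(effective_language(span_sv, document_language))
```

Under this model, a paragraph without its own attribute resolves to the document default, and removing the attribute from the span would make the Swedish text fall back to the document default as well.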

3. sv-SE language code

sv-SE is the code for Swedish as spoken in Sweden, in the same way as en-GB is for British English and en-US is for American English. It is shown as a code because Acrobat does not have a "clear-text" mapping for that language code, but it is a standard language code. Finland is a bilingual country, and the code for Swedish as spoken in Finland is sv-FI. The difference is small, so I often mark text as 'official' Swedish for compatibility reasons.

4. Yes, if the entire paragraph is in a particular language, then the language attribute should preferably be at block level. However, I think this strategy is still standards compliant.

5. This is very interesting! My experience is that only the language attributes in the TAG tab are stripped away! This is why I had thought that the CONTENT plane would be the preferred place for the language attribute, but you are actually suggesting otherwise.
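To illustrate how these codes relate to each other, here is a minimal Python sketch that splits a tag into its language and region parts. It assumes only the simple two-part form used in this thread; the function names `split_language_tag` and `same_language` are made up for the example.

```python
def split_language_tag(tag):
    """Split a simple 'language-REGION' tag into (language, region).

    Handles only the two-part form discussed in this thread. Full language
    tags can also carry script, variant and extension subtags, which this
    sketch does not try to interpret.
    """
    parts = tag.replace("_", "-").split("-")
    language = parts[0].lower()
    region = parts[1].upper() if len(parts) > 1 else None
    return language, region


def same_language(tag_a, tag_b):
    """True when two tags share the same primary language subtag."""
    return split_language_tag(tag_a)[0] == split_language_tag(tag_b)[0]


for tag in ("sv-SE", "sv-FI", "SV", "en-GB", "en-US"):
    print(tag, split_language_tag(tag))
```

In this sketch, sv-SE, sv-FI and SV all share the primary language "sv" and differ only in the optional region part.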

One strong piece of evidence for the validity of this and other LibreOffice PDFs is that they work perfectly with Acrobat and screen readers. If you install the JAWS screen reader, configure it properly to use a multilanguage-capable synthesizer and install voices for the required languages, the document is read correctly and the screen reader automatically changes pronunciation according to the language. It is actually pretty impressive how well it works when you have everything set up properly. The only problem is that if you merge the PDF file with other files using Acrobat Pro, it does not work any more. Note that the default Windows and Acrobat screen reader is not multilanguage capable and cannot be used for testing multilingual content.

Report · May 20, 2021

I have now tested setting language attributes using Adobe's own toolchain, and the result is that language attributes in the TAG plane are lost when merging PDF files with Acrobat.

Test_CONTENT-lang-attribute.pdf

- I created this document with Windows Notepad and printed it as a PDF file using the Adobe PDF printer.

- Set the document language to English using Acrobat Pro 2017.

- Tagged the document using Acrobat Pro and set the languages for text in the CONTENT plane.

- After merging the document with itself, the language attributes are correctly in place (Test_CONTENT-lang-attribute_merged.pdf).

Test_TAG-lang-attribute.pdf

- This document was also created with Windows Notepad and printed as a PDF file using the Adobe PDF printer.

- Set the document language to English using Acrobat Pro 2017.

- Tagged the document using Acrobat Pro and set the languages for text in the TAG plane.

- After merging the document with itself, the language attributes are LOST (Test_TAG-lang-attribute_merged.pdf).

MY CONCLUSION IS THAT LANGUAGE ATTRIBUTES IN THE TAG PLANE ARE ALWAYS LOST WHEN MERGING DOCUMENTS WITH ACROBAT.
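A before-and-after comparison like the one above could be roughly automated. The following Python sketch is an assumption-laden illustration, not a real PDF parser: it only scans raw bytes for literal-string `/Lang (...)` entries, so it will miss anything stored in compressed object streams or written as hex strings. The names `count_lang_entries` and `lost_languages` are hypothetical, and the byte strings are synthetic stand-ins for real files.

```python
import re
from collections import Counter

# Matches entries such as "/Lang (sv-SE)" written as literal strings.
LANG_PATTERN = re.compile(rb"/Lang\s*\(([^)]*)\)")


def count_lang_entries(pdf_bytes):
    """Count literal-string /Lang entries visible in the raw bytes."""
    matches = LANG_PATTERN.findall(pdf_bytes)
    return Counter(m.decode("latin-1") for m in matches)


def lost_languages(before, after):
    """Return the language entries that occur less often after merging."""
    # Counter subtraction keeps only entries with a positive difference.
    return count_lang_entries(before) - count_lang_entries(after)


# Synthetic snippets, not real PDF files: the document-level entry survives,
# while the entries on the structure elements are gone after the "merge".
before = (
    b"<< /Type /Catalog /Lang (en-US) >> "
    b"<< /S /Span /Lang (sv-SE) >> << /S /P /Lang (fi-FI) >>"
)
after = b"<< /Type /Catalog /Lang (en-US) >> << /S /Span >> << /S /P >>"

print(dict(lost_languages(before, after)))
```

For real files, the bytes would come from something like `open(path, "rb").read()`, and the files would likely need to be saved uncompressed first for the scan to see the structure tree.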

In my opinion, this is potentially a huge issue and could result in a lot of wasted work for organizations. A lot of work is being done in the EU because of the accessibility directive, which requires proper language tagging of PDF files. If it is later found out that the language tagging has been stripped from PDF files, it will be a lot of work to reprocess them.

Report · May 20, 2021

Well, you've answered your own questions via empirical method.

I hope that now, you can make sense of why I suggested to use LibreOffice Draw to edit and merge the desired documents instead of Adobe Acrobat.

To clarify, what I meant by "merge" was not aimed to open a technical discussion between PDF editing platforms.

I was actually trying to help by trial and error.

I am not a developer, but I've repaired a lot of computers in my last 20 years. So basically I troubleshoot a lot, I void warranties, and break things apart in that process.

As a PC repair person, I really don't care which programs are better than others.

If something breaks and doesn't work, I will learn and use whatever tools allow me to complete my work and meet my deadlines.

And since your initial inquiry clearly mentioned:

"LibreOffice generates the language information as part of the document structure tagging, which is properly recognized by screen readers and it passes Acrobats accessibility check"

It is irrelevant, in my personal opinion, which platform (open source or commercial) is designed to do better or more than the other because that wasn't the point.

Report · May 21, 2021

I apologize if I was a bit too harsh in my response. I was a little annoyed that your reply was so quickly marked as a correct solution without any acknowledgment on my part. That is why I also felt that I had to make it clear that it was not working for me.

As such, your reply was a completely reasonable suggestion that unfortunately did not work. I also want to thank you for it. Maybe I should also have been clearer that I meant to say that LibreOffice Writer can output properly tagged PDF files because it understands the data model of a word processing document.

Adobe Community
