My organization uses LibreOffice to create technically oriented multilingual documents. We have used Acrobat to merge these LibreOffice-generated PDF documents into other PDF documents. Now we have found out that merging documents with Acrobat removes the documents' language tagging.
LibreOffice generates the language information as part of the document structure tagging, which is properly recognized by screen readers and passes Acrobat's accessibility check. However, after merging documents with Acrobat Pro DC, all of the language information is stripped away from the document. Document structure tagging is otherwise preserved, but the language properties are removed in the process.
This problem does not happen when merging Word 2019 generated PDF documents. As I understand it, Word 2019 uses a different technical approach and marks the language as part of the content stream. However, the LibreOffice way is completely valid in my understanding, with regard to the PDF standard, and you can even manually language tag text in Acrobat this way.
Is there any way to prevent the language properties in document tag structure from being stripped away while merging documents with Acrobat?
I have attached two example PDF files generated with LibreOffice and the resulting merged document generated with Acrobat.
Don't merge the files in Acrobat.
Use LibreOffice Draw to edit and merge the desired documents, and then click Export Directly to PDF.
Then see the results when you open that PDF with Acrobat.
You can use the Accessibility Checker tool and run a full report to see what Acrobat reports back.
I really do not get this solution. I do not see how it resolves my issue.
I could use LibreOffice Draw to merge PDF-files, but it has not really been designed for this purpose and it loses all of the original tagging, bookmarks and named destinations of the PDF files (in addition to having other issues). This is why people pay for using Acrobat.
1. I need to combine LibreOffice-generated, standards-compliant PDF files with PDF documents generated with other tools such as MS Word or Adobe Illustrator.
2. I also need the merging process to retain the accessibility and other advanced PDF features, such as bookmarks and named destinations, of all of the files. I also do not want to regenerate these manually or use some automated guessing algorithm to do it (worse than the original).
3. Acrobat does everything else correctly except that the language property of content tagging is lost.
If this is a known issue, then I hope someone at Adobe listens and fixes it.
Your original LibreOffice PDFs aren't tagged correctly for their respective languages. Here are the problems I found in the first sample you provided. I didn't go further into the other documents because I think the problem lies with your originals.
Conclusion: I don't believe LibreOffice is creating a compliant PDF for you per the PDF standard. Check how you're formatting the source document in LibreOffice, especially to set the language of your text. You might need to manually correct the tags in Acrobat before combining the PDFs together.
And keep in mind that this is a user-to-user all-volunteer help forum for Adobe's products, so you might not find any experts on LibreOffice here. I've only played with it a few times (does a decent job of basic office tasks) but I don't believe it's encoding PDFs as well as it should. The PDF specification (ISO 32000) is 1,000+ pages with exacting details for encoding. Not many volunteer open-source programmers have the time or experience to meet that specification well enough. Not saying it's not possible, just saying it might not be what programmers want to do in their spare time.
First of all, thank you very much for looking into this issue in such great detail. I really do appreciate it. But I disagree with many of your conclusions.
1. About the default language
In these kinds of problem cases, I always try to make the example document as simple as possible to highlight the real problem and to make troubleshooting easier. I also use English to make the content more widely understandable.
The default language of the document is Finnish, because I live in Finland and the default language of my documents is the Finnish language. It does not matter that the majority of the text is not in the Finnish language. For this document it only means, that unless specified otherwise, the language of the text is Finnish.
2. About ‘missing’ Finnish language attribute
See previous comment. The Finnish language paragraph does not have a language attribute because it is the default language of the document and that is why it is omitted.
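To illustrate the inheritance rule described in points 1 and 2, here is a minimal sketch. The `Tag` class and the `fi-FI` default are made up for illustration only; this is not how Acrobat or LibreOffice store tags internally, just a model of "an element without its own language attribute takes the language of its parent, and ultimately the document default".

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Tag:
    """Hypothetical stand-in for a structure element (not a real PDF library type)."""
    name: str
    lang: Optional[str] = None            # None means "no language attribute set"
    children: List["Tag"] = field(default_factory=list)


def effective_langs(tag: Tag, inherited: str) -> Iterator[Tuple[str, str]]:
    """Yield (tag name, effective language) for a tag and all of its descendants."""
    current = tag.lang or inherited       # own attribute wins, otherwise inherit
    yield tag.name, current
    for child in tag.children:
        yield from effective_langs(child, current)


# Assumed example: document default is Finnish, two paragraphs override it.
doc = Tag("Document", children=[
    Tag("P-finnish"),                     # no attribute: falls back to the default
    Tag("P-english", lang="en-GB"),
    Tag("P-swedish", lang="sv-SE"),
])
result = dict(effective_langs(doc, "fi-FI"))
```

In this model the Finnish paragraph correctly resolves to Finnish even though it carries no attribute of its own, which is why the attribute can be omitted.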
3. sv-SE language code
sv-SE is the code for Swedish as spoken in Sweden, in the same way that en-GB is for British English and en-US is for American English. It is shown as a code because Acrobat does not have a "clear-text" mapping for that language code, but it is a standard language code. Finland is a bilingual country, and the code for Swedish as spoken in Finland is sv-FI. The difference is small, so I often mark text as 'official' Swedish for compatibility reasons.
4. Yes, if the entire paragraph is in a particular language, then the language attribute should preferably be at block level. However, I think this strategy is still standards compliant.
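As a small illustration of how these codes are put together, here is a sketch that splits the simple "language-REGION" form used in this thread. The function names are my own, and it deliberately ignores the more complex tag forms (scripts, variants and so on) that the language tag standard also allows.

```python
from typing import Optional, Tuple


def split_language_tag(tag: str) -> Tuple[str, Optional[str]]:
    """Split a simple 'language-REGION' tag into (language, region)."""
    language, _, region = tag.partition("-")
    # An absent region becomes None instead of an empty string.
    return language.lower(), (region.upper() or None)


def same_language(a: str, b: str) -> bool:
    """True if two tags share the same primary language, whatever the region."""
    return split_language_tag(a)[0] == split_language_tag(b)[0]
```

With this, `sv-SE` and `sv-FI` compare as the same language (Swedish) with different regions, while `en-GB` and `sv-SE` do not.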
5. This is very interesting! My experience is that only the language attributes in the TAG tab are stripped away! This is why I had thought, that the CONTENT plane would be the preferred place for the language attribute, but you are actually suggesting otherwise.
One strong piece of evidence for the validity of this and other LibreOffice PDFs is that they work perfectly with Acrobat and screen readers. If you install the JAWS screen reader, properly configure it to use a multilanguage-capable synthesizer, and install voices for the required languages, the document is read correctly and the screen reader automatically changes pronunciation according to the language. It is actually pretty impressive how well this works when you have everything set up properly. The only thing is that if you merge the PDF file with other files using Acrobat Pro, it does not work any more. Note that the default Windows and Acrobat screen reader is not multilanguage capable and cannot be used for testing multilingual content.
I have now tested setting language attributes using Adobe's own toolchain, and the result is that language attributes in the TAG plane are lost when merging PDF files with Acrobat.
- I created this document with Windows Notepad and printed it as a PDF file using the Adobe PDF printer.
- Set the document language to English using Acrobat Pro 2017
- Tagged the document using Acrobat Pro and set the languages for text in the CONTENT plane.
- After merging the document with itself, the language attributes are correctly in place (Test_CONTENT-lang-attribute_merged.pdf).
- This document is also created with Windows Notepad and printed as a PDF file using the Adobe PDF printer.
- Set the document language to English using Acrobat Pro 2017
- Tagged the document using Acrobat Pro and set the languages for text in the TAG plane.
- After merging the document with itself, the language attributes are LOST (Test_TAG-lang-attribute_merged.pdf).
MY CONCLUSION IS THAT LANGUAGE ATTRIBUTES IN THE TAG PLANE ARE ALWAYS LOST WHEN MERGING DOCUMENTS WITH ACROBAT.
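For anyone who wants a quick before/after check without opening each file in Acrobat, here is a very rough sketch. It only scans the raw file bytes for literal `/Lang (xx-XX)` entries, so it is an approximation under a big assumption: entries stored inside compressed object streams or compressed content streams will not be found, and a proper check would need a real PDF parser. The function names are my own.

```python
import re
from collections import Counter

# Matches a literal-string language entry such as "/Lang (sv-SE)".
LANG_ENTRY = re.compile(rb"/Lang\s*\(([^)]*)\)")


def count_lang_entries(pdf_bytes: bytes) -> Counter:
    """Count each language code found in uncompressed /Lang entries."""
    return Counter(m.decode("latin-1") for m in LANG_ENTRY.findall(pdf_bytes))


def lost_languages(before: bytes, after: bytes) -> Counter:
    """Language entries present in 'before' but missing (or fewer) in 'after'."""
    return count_lang_entries(before) - count_lang_entries(after)
```

You would read the original and the merged file with `open(path, "rb").read()` and pass both to `lost_languages`. An empty result does not prove that nothing was lost, because of the compression caveat above, but a non-empty result is a strong hint that language entries were dropped.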
In my opinion this is potentially a huge issue and could result in a lot of wasted work for organizations. A lot of work is being done in the EU because of the accessibility directive, which requires proper language tagging of PDF files. If it is later found out that the language tagging has been stripped away from PDF files, it will be a lot of work to reprocess them.
Well, you've answered your own questions via empirical method.
I hope that now, you can make sense of why I suggested to use LibreOffice Draw to edit and merge the desired documents instead of Adobe Acrobat.
To clarify, what I meant by "merge" was not aimed to open a technical discussion between PDF editing platforms.
I was actually trying to help by trial and error.
I am not a developer, but I've repaired a lot of computers in my last 20 years. So basically I troubleshoot a lot, I void warranties, and break things apart in that process.
As a PC repair person, I really don't care which programs are better than others.
If something breaks and doesn't work, I will learn and use whatever tools allow me to complete my work and meet my deadlines.
And since your initial inquiry clearly mentioned:
"LibreOffice generates the language information as part of the document structure tagging, which is properly recognized by screen readers and it passes Acrobats accessibility check"
It is irrelevant, in my personal opinion, which platform (open source or commercial) is designed to do better or more than the other because that wasn't the point.
I apologize if I was a bit too harsh in my response. I was a little annoyed that your reply was so quickly marked as a correct solution without any acknowledgment on my part. That is why I also felt that I had to make it clear that it was not working for me.
As such, your reply was a completely reasonable suggestion that unfortunately did not work, and I want to thank you for it. Maybe I should also have been clearer that I meant to say that LibreOffice Writer can output properly tagged PDF files because it understands the data model of a word processing document.