How to get special characters in Acrobat with JS?

Report · Mar 28, 2018

How can I read special characters in Acrobat with JS and write them to a text file?

Example: ® ™ è

I can read them,

but why is the ™ lost when I save these characters to a text file?

function getword2(doc) {
    var i, j, ckWord, numWords, aWords = [];
    for (i = 0; i < doc.numPages; i++) {
        var bWords = []; // Words found on page i
        numWords = doc.getPageNumWords(i);
        for (j = 0; j < numWords; j++) {
            ckWord = doc.getPageNthWord(i, j, false);
            if (ckWord) {
                bWords.push(ckWord); // Add word to array
            }
        }
        aWords.push(bWords); // One entry per page
    }
    return aWords.join('\r\n'); // One line per page, words separated by commas
}

function output_csv2() {
    var outputString = getword2(this);
    this.createDataObject("output.txt", outputString);
    this.exportDataObject({ cName: "output.txt", nLaunch: 2 });
}

Report · Mar 28, 2018

JavaScript text uses the Unicode encoding, stored as 16-bit codes (UTF-16), which can represent just about any character in existence. A plain text file, on the other hand, is commonly written as 8-bit text in the ASCII or ANSI encoding, which only provides for Western European characters plus punctuation and a few special characters used on early teletype machines. So, if the text scraped from the PDF page does not have an easy translation to ANSI, it will be replaced with garbage.

Note: by "easy translation" I mean that the high byte of the 16-bit Unicode code is 00, that is, the code is 0xFF or lower.
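
As a rough illustration of that rule, the sketch below scans a string and reports every character whose code is above 0xFF, meaning its high byte is not 00. The function name findWideChars is made up for this example and is not part of the Acrobat API.

```javascript
// Hypothetical helper (not an Acrobat API): returns the characters in `text`
// whose 16-bit code is above 0xFF, i.e. whose high byte is not 00.
function findWideChars(text) {
    var found = [];
    for (var i = 0; i < text.length; i++) {
        var code = text.charCodeAt(i);
        if (code > 0xFF) {
            found.push({ ch: text.charAt(i), hex: code.toString(16) });
        }
    }
    return found;
}

// Test string "® ™ è", written with escapes so the source file encoding
// cannot interfere with the example.
var wide = findWideChars("\u00AE \u2122 \u00E8");
```

Here only the ™ ends up in the result, since ® and è both have codes below 0x100.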

Thom Parker - Software Developer at PDFScripting
Use the Acrobat JavaScript Reference early and often

Report · Mar 30, 2018

What should I do?

Can I save it in another file format?

Thanks!

Report · Apr 07, 2018

UP

Report · Apr 07, 2018

Sorry about the late reply. I actually wrote this a week ago and it didn't get posted.

*********************************************************************************************

That's a really good question. Not one I've thought about before. The first thing to do is some more testing to determine your exact situation. Find out the exact character codes that are causing this issue. It's entirely possible that the problem may be in the text file viewer, and not with the character codes.

Modify your script to list the words and word indexes for a single page you know has this issue. Once you know the index of a word with problem characters, you can use this script in the console window to find the Unicode code of the problem character:

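// nPage = zero-based page number, nIndex = word index on that page, n = character position in the word
var cWord = this.getPageNthWord(nPage, nIndex, false);
var cCode = cWord.charCodeAt(n).toString(16);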
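
To check a whole word at once rather than one character position at a time, the same charCodeAt call can be put in a loop. This is only a sketch: wordToHexCodes is a name invented here, and the page and word numbers in the comment are placeholder values.

```javascript
// Hypothetical helper (not an Acrobat API): returns the hex code of every
// character in `word`, in order.
function wordToHexCodes(word) {
    var codes = [];
    for (var n = 0; n < word.length; n++) {
        codes.push("0x" + word.charCodeAt(n).toString(16));
    }
    return codes;
}

// In the Acrobat console this could be called on a scraped word, for example
// (page 0 and word 12 are placeholders):
//   wordToHexCodes(this.getPageNthWord(0, 12, false));
var sample = wordToHexCodes("caf\u00E8\u2122");
```

Any entry longer than two hex digits marks a character that does not fit in a single byte.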

For example, the character codes for the 3 characters you've listed in the post are:

® = 0xae, ANSI code

™ = 0x2122 in Unicode, also 0x99 in ANSI

è = 0xe8, ANSI code

Except for the trademark, these characters are coded as ANSI, which is an 8-bit extension to the 7-bit ASCII codes for covering special symbols. A good plain text viewer should display these symbols since they are still 8 bit. Maybe if you view the text in something different you'll see them.

The only other alternative is to create a different kind of file format, which is outside the scope of what we can do on the forum.

Thom Parker - Software Developer at PDFScripting

Report · Apr 07, 2018

Thanks very much.

Maybe I need to find other solutions.

Report · Apr 08, 2018

What exactly is it that you are trying to do? Perhaps we can suggest another approach.

Thom Parker - Software Developer at PDFScripting

Report · Apr 08, 2018

I have a PDF file and an XLS file.

The PDF file, which has hundreds of pages, is going to be printed.

I must check the content of every page against the XLS file.

If the special characters on every page can be read,

it will be easy to compare them.

Report · Apr 09, 2018

You might need to encode the string you're writing to the data file as UTF-8 for it to work. In order to do that you should use the setDataObjectContents method and a UTF-8 encoded stream, instead of setting the contents in the createDataObject method directly.
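
To see why UTF-8 helps here, it may be useful to look at the bytes it produces for the three characters in question. The sketch below hand-encodes characters up to U+FFFF the way UTF-8 does. It is for illustration only: the helper name utf8HexBytes is an assumption of this example, surrogate pairs are not handled, and in Acrobat the real encoding would be done by the stream rather than by code like this.

```javascript
// Hypothetical helper (not an Acrobat API): returns the UTF-8 bytes of `text`
// as a space-separated hex string. Handles codes up to 0xFFFF only.
function utf8HexBytes(text) {
    var bytes = [];
    for (var i = 0; i < text.length; i++) {
        var c = text.charCodeAt(i);
        if (c < 0x80) {
            // 7-bit ASCII: one byte, unchanged
            bytes.push(c);
        } else if (c < 0x800) {
            // Two bytes: 110xxxxx 10xxxxxx
            bytes.push(0xC0 | (c >> 6), 0x80 | (c & 0x3F));
        } else {
            // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
            bytes.push(0xE0 | (c >> 12), 0x80 | ((c >> 6) & 0x3F), 0x80 | (c & 0x3F));
        }
    }
    var hex = [];
    for (var j = 0; j < bytes.length; j++) {
        // Pad to two hex digits so every byte has the same width
        hex.push((bytes[j] < 16 ? "0" : "") + bytes[j].toString(16));
    }
    return hex.join(" ");
}

var regBytes = utf8HexBytes("\u00AE"); // ® gives "c2 ae"
var tmBytes = utf8HexBytes("\u2122");  // ™ gives "e2 84 a2"
var eBytes = utf8HexBytes("\u00E8");   // è gives "c3 a8"
```

Every character gets a valid byte sequence, so nothing has to be dropped or replaced. The text file viewer does need to open the file as UTF-8 for the characters to display correctly.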

Report · Apr 09, 2018

Can you give an example?

Please!

Report · Apr 09, 2018

Something like this:

this.createDataObject("output.txt", "");
this.setDataObjectContents("output.txt", util.streamFromString(outputString, "utf-8"));
this.exportDataObject({ cName: "output.txt", nLaunch: 2 });
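
Combined with the script from the original question, these three calls could be wrapped in one function. This is a sketch under assumptions: the function name exportTextAsUtf8 and its parameter list are invented here, and the document and util objects are passed in explicitly only so that the function has no hidden dependencies.

```javascript
// Hypothetical wrapper (the name and parameters are not part of the Acrobat API).
// doc:      the Acrobat Doc object, e.g. `this`
// utilObj:  the Acrobat `util` object
// fileName: name of the attachment to create
// text:     the string to write
function exportTextAsUtf8(doc, utilObj, fileName, text) {
    // Create an empty attachment first, then fill it from a UTF-8 stream.
    doc.createDataObject(fileName, "");
    doc.setDataObjectContents(fileName, utilObj.streamFromString(text, "utf-8"));
    // nLaunch 2: save to a temporary file and open it with the default application.
    doc.exportDataObject({ cName: fileName, nLaunch: 2 });
}
```

In Acrobat it would presumably be called as exportTextAsUtf8(this, util, "output.txt", getword2(this));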

Adobe Community
