How to get special characters in Acrobat with JavaScript?

New Here, Mar 28, 2018

How can I read special characters and write them to a txt file in Acrobat with JavaScript?

Example: ® ™ è

I can read them from the PDF, but why is the ™ lost when I save these characters to a txt file?

function getword2(doc){
  var i, j, ckWord, numWords, aWords = [];
  for (i = 0; i < doc.numPages; i++) {
    var bWords = [];
    numWords = doc.getPageNumWords(i);
    for (j = 0; j < numWords; j++) {
      ckWord = doc.getPageNthWord(i, j, false);
      if (ckWord) {
        bWords.push(ckWord);  // Add word to array
      }
    }
    aWords.push(bWords);
  }
  return aWords.join('\r\n');
}

function output_csv2(){
  var outputString = getword2(this);
  this.createDataObject("output.txt", outputString);
  this.exportDataObject({ cName:"output.txt", nLaunch: "2"});
}

TOPICS
Acrobat SDK and JavaScript

Adobe Community Professional, Mar 28, 2018

JavaScript strings are Unicode, stored as 16-bit code units, which can represent just about any character in existence. A plain text file, on the other hand, is often written with an 8-bit encoding such as ASCII or ANSI, which only provides for Western European characters plus punctuation and a few control characters used on early teletype machines. So, if the text scraped from the PDF page does not have an easy translation to that 8-bit encoding, it will be replaced with garbage or lost.

Note: by "easy translation" I mean that the high byte of the 16-bit Unicode code is 00, that is, the code is 0xFF or lower.
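
As a rough illustration of the point above, the following sketch scans a string and reports every character whose code is above 0xFF, which are the ones that would not survive a simple 8-bit write. The function name findWideChars is made up for this example and is not part of the Acrobat API. It uses only core JavaScript, so it should also run in the Acrobat console.

```javascript
// Hypothetical helper: list characters whose code does not fit in one byte.
function findWideChars(text) {
  var wide = [];
  for (var i = 0; i < text.length; i++) {
    var code = text.charCodeAt(i);
    if (code > 0xFF) {
      // Record the position, the character, and its code in hex
      wide.push({ index: i, ch: text.charAt(i), hex: code.toString(16) });
    }
  }
  return wide;
}

// "\u00AE \u2122 \u00E8" is the string "® ™ è" from the question
var problems = findWideChars("\u00AE \u2122 \u00E8");
```

For the three characters in the question, only the trademark sign (code 2122) is reported. The other two have codes below 0x100.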

Thom Parker - Software Developer at PDFScripting
Use the Acrobat JavaScript Reference early and often

Participant, Mar 30, 2018

What should I do?

Can I save the text in a different file format?

Thanks!

Participant, Apr 07, 2018

UP

Adobe Community Professional, Apr 07, 2018

Sorry about the late reply. I actually wrote this a week ago and it didn't get posted.

*********************************************************************************************

That's a really good question. Not one I've thought about before. The first thing to do is some more testing to determine your exact situation. Find out the exact character codes that are causing this issue. It's entirely possible that the problem is in the text file viewer, and not with the character codes.

Modify your script to list the words and word indexes for a single page you know has this issue. Once you know the index of a word with problem characters, you can use this script in the console window to find the Unicode code of the problem character. Here nPage is the zero-based page number, nIndex is the index of the word on that page, and n is the position of the character within the word:

var cWord = this.getPageNthWord(nPage, nIndex, false);
var cCode = cWord.charCodeAt(n).toString(16);
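
To look at every character of a word at once instead of one position at a time, a small helper along these lines could be used. The name charCodes is invented for this sketch. In Acrobat the input would be a word returned by getPageNthWord, but a literal string is used here so the snippet stands on its own.

```javascript
// Hypothetical helper: return the hex code of each character in a word.
function charCodes(word) {
  var codes = [];
  for (var i = 0; i < word.length; i++) {
    codes.push("0x" + word.charCodeAt(i).toString(16));
  }
  return codes;
}

// "TM\u2122" is "TM" followed by the trademark sign
var codes = charCodes("TM\u2122");
```

Any code longer than two hex digits marks a character that needs more than one byte.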

For example, the character codes for the 3 characters you've listed in the post are

® = 0xae, ANSI code

™ = 0x2122  Unicode, Also 0x99 in ANSI

è = 0xe8  ANSI code

Except for the trademark, these characters are coded as ANSI, which is an 8-bit extension of the 7-bit ASCII codes for covering special symbols. A good plain text viewer should display these symbols since they are still 8 bit. Maybe if you view the text in a different viewer you'll see them.

The only other alternative is to create a different kind of file format, which is outside the scope of what we can do on the forum.

Participant, Apr 07, 2018

Thanks very much.

Maybe I need to find other solutions.

Adobe Community Professional, Apr 08, 2018

What exactly is it that you are trying to do?  Perhaps we can suggest another approach.

Participant, Apr 08, 2018

I have a PDF file and an XLS file.

The PDF file, which has hundreds of pages, is going to be printed, and I must check the content of every page against the XLS file.

If the special characters on every page can be read, it will be easy to compare them.

Most Valuable Participant, Apr 09, 2018 (Correct Answer)

You might need to encode the string you're writing to the data file as UTF-8 for it to work. In order to do that you should use the setDataObjectContents method and a UTF-8 encoded stream, instead of setting the contents in the createDataObject method directly.
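
To see why UTF-8 helps here, the sketch below encodes a string to UTF-8 bytes by hand. It is only an illustration: utf8Bytes is a made-up name, and the code assumes every character is in the Basic Multilingual Plane (no surrogate pairs). In Acrobat the actual encoding would be done by the stream, not by code like this.

```javascript
// Hypothetical helper: UTF-8 encode a string, assuming BMP characters only.
function utf8Bytes(str) {
  var bytes = [];
  for (var i = 0; i < str.length; i++) {
    var c = str.charCodeAt(i);
    if (c < 0x80) {
      // 7-bit ASCII: one byte
      bytes.push(c);
    } else if (c < 0x800) {
      // Up to 11 bits: two bytes
      bytes.push(0xC0 | (c >> 6), 0x80 | (c & 0x3F));
    } else {
      // Up to 16 bits: three bytes
      bytes.push(0xE0 | (c >> 12), 0x80 | ((c >> 6) & 0x3F), 0x80 | (c & 0x3F));
    }
  }
  return bytes;
}

var tm = utf8Bytes("\u2122");   // trademark sign
var reg = utf8Bytes("\u00AE");  // registered sign
```

The trademark sign becomes the three bytes E2 84 A2 and the registered sign becomes C2 AE, so nothing has to be squeezed into a single byte.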

Participant, Apr 09, 2018

Can you give an example?

Please!

Most Valuable Participant, Apr 09, 2018

Something like this:

this.createDataObject("output.txt", "");
this.setDataObjectContents("output.txt", util.streamFromString(outputString, "utf-8"));
this.exportDataObject({ cName:"output.txt", nLaunch: "2"});
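
Put together with the original export function, the three lines above might look something like the sketch below. The wrapper name exportTextAsUtf8 and its parameters are invented for this example. It assumes it runs inside Acrobat, where util is a global object and doc is the document (for example this in a document-level script). It passes nLaunch as the number 2 rather than the string "2", and it has not been tested against a real PDF.

```javascript
// Sketch only: doc is assumed to be an Acrobat Doc object, util the Acrobat global.
function exportTextAsUtf8(doc, fileName, text) {
  // Create an empty attachment first
  doc.createDataObject(fileName, "");
  // Fill the attachment from a UTF-8 encoded stream
  var stream = util.streamFromString(text, "utf-8");
  doc.setDataObjectContents(fileName, stream);
  // Export the attachment and open it (nLaunch: 2, as in the thread)
  doc.exportDataObject({ cName: fileName, nLaunch: 2 });
}
```

In the original script, output_csv2 could then call exportTextAsUtf8(this, "output.txt", getword2(this)) instead of passing the text to createDataObject directly.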
