Participant
March 28, 2018
Answered

How to get special characters in Acrobat with JS?


How can I extract special characters from a PDF and write them to a text file using Acrobat JavaScript?

Example: ® ™ è

I can extract them, but why is the ™ lost when I save these characters to a .txt file?

function getword2(doc) {
  var i, j, ckWord, numWords, aWords = [];
  for (i = 0; i < doc.numPages; i++) {
    var bWords = [];  // Words found on page i
    numWords = doc.getPageNumWords(i);
    for (j = 0; j < numWords; j++) {
      ckWord = doc.getPageNthWord(i, j, false);  // false: keep punctuation and whitespace
      if (ckWord) {
        bWords.push(ckWord);  // Add word to array
      }
    }
    aWords.push(bWords);
  }
  // One line per page; each page's words are comma-separated by the implicit Array join
  return aWords.join('\r\n');
}

function output_csv2() {
  var outputString = getword2(this);
  this.createDataObject("output.txt", outputString);
  this.exportDataObject({ cName: "output.txt", nLaunch: 2 });
}

2 replies

try67
Community Expert
Correct answer
April 9, 2018

You might need to encode the string you're writing to the data file as UTF-8 for it to work. In order to do that you should use the setDataObjectContents method and a UTF-8 encoded stream, instead of setting the contents in the createDataObject method directly.

tt27079448
Inspiring
April 9, 2018

Can you give an example?

Please!

try67
Community Expert
April 9, 2018

Something like this:

this.createDataObject("output.txt", "");
this.setDataObjectContents("output.txt", util.streamFromString(outputString, "utf-8"));
this.exportDataObject({ cName: "output.txt", nLaunch: 2 });
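If it helps, the same three calls can be wrapped in a small reusable function. This is only a sketch: the name exportTextAsUtf8 and its parameters are made up for illustration, and it assumes it runs inside Acrobat, where util is a global object.

```javascript
// Hypothetical helper (the name and parameters are illustrative, not part of the Acrobat API).
// doc      - the Doc object, e.g. `this` in a document-level script
// fileName - name of the attachment to create, e.g. "output.txt"
// text     - the JavaScript string to save
function exportTextAsUtf8(doc, fileName, text) {
  // Create the attachment empty, so no implicit 8-bit conversion is applied to the text
  doc.createDataObject(fileName, "");
  // Replace its contents with a stream that is explicitly encoded as UTF-8
  doc.setDataObjectContents(fileName, util.streamFromString(text, "utf-8"));
  // nLaunch 2 saves the file to a temporary location and opens it
  doc.exportDataObject({ cName: fileName, nLaunch: 2 });
}
```

With the function from the question, it would be called as exportTextAsUtf8(this, "output.txt", getword2(this)).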

Thom Parker
Community Expert
March 28, 2018

JavaScript strings use Unicode, stored as 16-bit UTF-16 code units, which can represent just about any character in existence. A plain text file written without an explicit encoding, on the other hand, is typically 8-bit (ASCII or an ANSI code page), which only provides for Western European characters plus punctuation and a few control characters used on early teletype machines. So, if the text scraped from the PDF page does not have an easy translation to an 8-bit character, it will be lost or replaced with garbage.

Note: by "easy translation" I mean that the upper 8 bits of the 16-bit Unicode value are 00. That would explain why ® (U+00AE) and è (U+00E8) survive while ™ (U+2122) does not.
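To check which characters have a 00 high byte, something like the following could be run in the JavaScript console. It is a minimal sketch in plain JavaScript; the function name describeChars is made up for illustration.

```javascript
// Hypothetical helper: list each character's UTF-16 code unit and
// whether its upper 8 bits are 00 (i.e. it fits in a single byte).
function describeChars(text) {
  var rows = [];
  for (var i = 0; i < text.length; i++) {
    var code = text.charCodeAt(i);
    rows.push({
      ch: text.charAt(i),
      // Zero-pad the hex value to four digits, e.g. U+00AE
      hex: "U+" + ("0000" + code.toString(16).toUpperCase()).slice(-4),
      fitsInOneByte: (code >> 8) === 0
    });
  }
  return rows;
}

var rows = describeChars("\u00AE\u2122\u00E8"); // ® ™ è
// rows[0]: hex "U+00AE", fitsInOneByte true
// rows[1]: hex "U+2122", fitsInOneByte false
// rows[2]: hex "U+00E8", fitsInOneByte true
```

Only ™ has a non-zero high byte, which matches the character that went missing in the question.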

Thom Parker - Software Developer at PDFScripting
Use the Acrobat JavaScript Reference early and often
tt27079448
Inspiring
March 30, 2018

What should I do?

Can I save it in another file format?

Thanks!

tt27079448
Inspiring
April 7, 2018

UP