Using Web.jsxlib to read a Google Doc

Report · Jan 20, 2021

I'm using Marc Autret's Web.jsxlib in a script to grab text from our Google Docs on Google Drive ... it's fabulous!

But I have one problem. When the Google Doc contains text like this:

JerriAnne Boggis, BHTNH’s executive director, explained the background of the gift: “Marian Anderson was ... (note the typographer's apostrophe and the typographer's open quotation mark)

then the text I get in InDesign from Web.jsxlib looks like this:

JerriAnne Boggis, BHTNHâ€€s executive director, explained the background of the gift: â€€Marian Anderson was ...

In other words, all the "smart quotes" etc. (including em dashes and en dashes) in the Google Doc get turned into â€€ -- whatever that is.

I've tried Web.jsxlib with and without setting the wantText parameter to 1 ... it doesn't seem to affect the problem. So I'm afraid I'm stumped.

If Web.jsxlib can't be tweaked to deal with the problem, is there something I can do in JavaScript to turn the weird characters back to the "smart" characters that appeared in the Google Doc?

(FYI, when I "view source" on the Google Doc in my browser, the smart characters are rendered properly there.)

Thanks for any help you can suggest.

Report · Jan 20, 2021

Hello Andover Beacon,

I didn't look at Marc Autret's Web.jsxlib script, but you can use the script below to correct the encoding issue you described.

var doc = app.documents[0];

var findWhat = ['â€€s', 'â€€'];

var changeTo = ["'s", '"'];

for(var i = 0; i < findWhat.length; i++){
    app.findGrepPreferences = app.changeGrepPreferences = null;
    app.findChangeGrepOptions = NothingEnum.NOTHING;
    app.findChangeGrepOptions.includeFootnotes = true;
    app.findChangeGrepOptions.includeHiddenLayers = true;
    app.findChangeGrepOptions.includeLockedLayersForFind = true;
    app.findChangeGrepOptions.includeLockedStoriesForFind = false;
    app.findChangeGrepOptions.includeMasterPages = true;
    app.findGrepPreferences.findWhat = findWhat[i];
    app.changeGrepPreferences.changeTo = changeTo[i];
    doc.changeGrep();
}
app.findGrepPreferences = NothingEnum.NOTHING;
app.changeGrepPreferences = NothingEnum.NOTHING;

Regards,

Mike

Report · Jan 21, 2021

Thanks, Mike! Good suggestion. I'll get to work on that shortly.

Report · Jan 28, 2021

For those of you who have been waiting with bated breath, sorry for the delay. Here's how I implemented Mike's suggestion:

var story = app .activeWindow .activePage .textFrames[0] .parentStory ;  // turn the contents of the Google Doc into an InDesign story

// Have to clean up some problems with the text we pull from the Google Doc
fixUnicode ( story, "\u00e2\u0080\u0093",  "\u2013" ) ;  // Unicode EN DASH
fixUnicode ( story, "\u00e2\u0080\u0094",  "\u2014" ) ;  // Unicode EM DASH
fixUnicode ( story, "\u00e2\u0080\u0098",  "\u2018" ) ;  // Unicode LEFT SINGLE QUOTATION MARK
fixUnicode ( story, "\u00e2\u0080\u0099",  "\u2019" ) ;  // Unicode RIGHT SINGLE QUOTATION MARK
fixUnicode ( story, "\u00e2\u0080\u009c",  "\u201c" ) ;  // Unicode LEFT DOUBLE QUOTATION MARK
fixUnicode ( story, "\u00e2\u0080\u009d",  "\u201d" ) ;  // Unicode RIGHT DOUBLE QUOTATION MARK
fixUnicode ( story, "\u00e2\u0080\u00A6",  "\u2026" ) ;  // Unicode HORIZONTAL ELLIPSIS

function fixUnicode ( paramObj, paramFindWhat, paramReplaceWith )  {  
  // Web.jsxlib seems to return multibyte Unicode characters as regular single bytes
  var options = app.findChangeTextOptions.properties ;  // remember our current options
  app.findTextPreferences = NothingEnum.NOTHING ;  // clear out current options/preferences
  app.changeTextPreferences = NothingEnum.NOTHING ;

  // Set up options/preferences for our search
  app.findChangeTextOptions = null ;  // nothing special to set
  app.findTextPreferences.findWhat = paramFindWhat ;	// set the search string
  var arrayTextFound = paramObj .findText() ;  // find it, and store an array of matching text objects in arrayTextFound
  for ( var i = 0 ; i < arrayTextFound .length ; i++ )  {  // for each hit
    arrayTextFound[i] .select() ;  // select the found text
    app .selection[0] .contents = paramReplaceWith ;  // fix the selected text
  } // end for each hit

  // Clean up and continue
  app.findChangeTextOptions.properties = options ;  // put things back where we found them
  app.findTextPreferences = NothingEnum.NOTHING ;  // wipe off our fingerprints
  app.changeTextPreferences = NothingEnum.NOTHING ;
  return ;
}  // end function fixUnicode

I wanted to call out the offending Unicode characters explicitly, for the sake of clarity and ease of maintenance. And while I was at it, I expanded the list of offending characters to all the ones that Google Docs seems to use, at least in the types of documents I deal with.
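As a complement to the explicit list above, the repair can also be done generically on the JavaScript string before it reaches InDesign. This is only a minimal sketch: `repairMisdecodedUtf8` is a hypothetical helper, not part of Web.jsxlib, and it assumes (as the \u00e2\u0080\u00XX patterns above suggest) that each UTF-8 byte arrived as one character with a code below 0x100. It handles two- and three-byte sequences only and leaves anything that does not look like UTF-8 untouched.

```javascript
// Hypothetical helper: rebuild characters whose UTF-8 bytes were each
// read as a separate character (e.g. "\u00e2\u0080\u0099" -> "\u2019").
function repairMisdecodedUtf8(text) {
    var out = "";
    var i = 0;
    while (i < text.length) {
        var lead = text.charCodeAt(i);
        var need = 0;   // number of continuation bytes expected
        var cp = 0;     // code point being rebuilt
        if (lead >= 0xC2 && lead <= 0xDF) { need = 1; cp = lead & 0x1F; }
        else if (lead >= 0xE0 && lead <= 0xEF) { need = 2; cp = lead & 0x0F; }

        // All continuation bytes must exist and lie in 0x80..0xBF.
        var ok = need > 0 && i + need < text.length;
        for (var k = 1; ok && k <= need; k++) {
            var c = text.charCodeAt(i + k);
            if (c >= 0x80 && c <= 0xBF) { cp = (cp << 6) | (c & 0x3F); }
            else { ok = false; }
        }
        // Reject overlong three-byte forms.
        if (ok && need === 2 && cp < 0x800) { ok = false; }

        if (ok) { out += String.fromCharCode(cp); i += need + 1; }
        else { out += text.charAt(i); i++; }
    }
    return out;
}

var sample = repairMisdecodedUtf8("BHTNH\u00e2\u0080\u0099s executive director");
```

The trade-off against the explicit list is maintainability versus control: the generic version needs no per-character entries, but it will also rewrite any legitimate text that happens to look like a UTF-8 byte sequence.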

I switched from Mike's GREP find to a Text find just because, try as I might, I could not make a GREP expression using "\uXXXX" notation that fixUnicode() would find. I'd love to know how to do that, but it's currently above my pay grade, apparently.

Thanks, all, for your help. And if you have any comments on the coding, I'd welcome your critique. I'm pretty new at this, and eager to get more fluent.

Charlie

Report · Jan 28, 2021

Nice work, Andover.

To search by Unicode value in GREP, use \x{2013}. In a script's search string, escape the backslash: "\\x{2013}"
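To make the escaping concrete, here is a small sketch that builds such a search string from a run of characters. `toGrepCodePoints` is a hypothetical helper name, and the commented InDesign lines at the end are an untested illustration of how the result might be passed to a GREP find.

```javascript
// Hypothetical helper: turn each character of `text` into InDesign's
// \x{XXXX} GREP notation. The doubled backslash in the source produces
// a single backslash in the resulting string.
function toGrepCodePoints(text) {
    var pattern = "";
    for (var i = 0; i < text.length; i++) {
        var hex = text.charCodeAt(i).toString(16).toUpperCase();
        while (hex.length < 4) { hex = "0" + hex; }   // pad to four digits
        pattern += "\\x{" + hex + "}";
    }
    return pattern;
}

var enDashPattern = toGrepCodePoints("\u2013");   // the 8-character string \x{2013}

// Inside InDesign this might be used as (assumption, not tested here):
// app.findGrepPreferences.findWhat = toGrepCodePoints("\u00e2\u0080\u0093");
// app.changeGrepPreferences.changeTo = "\\x{2013}";
```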

> // turn the contents of the Google Doc into an InDesign story

Your first line just gets a reference to the story; it doesn't turn anything into something else.

A point of efficiency: the first few lines in the fixUnicode() function can be moved out of it. You need them only once, and it's inefficient to set these things every time the function is called. You can also combine them in one line:

var options = app.findChangeTextOptions.properties ; // remember our current options

app.findTextPreferences = app.changeTextPreferences = app.findChangeTextOptions = null ;

In 'var options' you store just the options, you probably want to store the properties of findTextPreferences and changeTextPreferences as well.

In your function, why find the items and then stick in another value? Use changeText() instead. Your script could then look something like this:

fixUnicode (. . .)
fixUnicode (. . .)
. . .

// Record the state of the find/change window
// and clear it

function fixUnicode ( paramObj, paramFindWhat, paramReplaceWith ) {
  app.findTextPreferences.findWhat = paramFindWhat ;
  app.changeTextPreferences.changeTo = paramReplaceWith ;
  paramObj.changeText();
}

// Restore the window's settings
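The outline above could be fleshed out roughly as follows. This is a sketch under assumptions rather than tested InDesign code: `runWithCleanTextFindChange` is a hypothetical name, `app` is passed in as a parameter only so the functions can be exercised outside InDesign, and restoring state through `.properties` follows the pattern already used earlier in the thread.

```javascript
// Hypothetical helper: record the text find/change state, clear it,
// run `work`, then restore the recorded state even if `work` throws.
function runWithCleanTextFindChange(app, work) {
    var saved = {
        find: app.findTextPreferences.properties,
        change: app.changeTextPreferences.properties,
        options: app.findChangeTextOptions.properties
    };
    // Clear everything once, before any replacement runs.
    app.findTextPreferences = app.changeTextPreferences = app.findChangeTextOptions = null;
    try {
        work();
    } finally {
        app.findTextPreferences.properties = saved.find;
        app.changeTextPreferences.properties = saved.change;
        app.findChangeTextOptions.properties = saved.options;
    }
}

// One replacement: set the search and replacement strings, then change.
function fixUnicode(app, target, findWhat, changeTo) {
    app.findTextPreferences.findWhat = findWhat;
    app.changeTextPreferences.changeTo = changeTo;
    target.changeText();
}

// Inside InDesign, with `story` obtained as earlier in the thread:
// runWithCleanTextFindChange(app, function () {
//     fixUnicode(app, story, "\u00e2\u0080\u0093", "\u2013");   // EN DASH
//     fixUnicode(app, story, "\u00e2\u0080\u0094", "\u2014");   // EM DASH
// });
```

Wrapping the calls this way keeps the save, clear and restore steps in one place, so each fixUnicode() call stays as short as in the outline.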

Peter

Report · Jan 29, 2021

Excellent, thanks very much Peter. I'll take all those comments on-board.

Could you point me to some resource that would help me understand when/where/why/how to use the \uXXXX notation and when to use the \x{XXXX} notation? Is this difference part of a broader issue that the ID Find/Change dialog box uses a somewhat different "flavor" of GREP than ExtendScript uses? Or is my confusion arising from some other difference I'm not aware of?

Charlie

Report · Jan 29, 2021

> Could you point me to some resource that would help me understand when/where/why/how to use the \uXXXX notation and when to use the \x{XXXX} notation?

You wrote that you have my JavaScript for InDesign guide. Look for the section "Unicode characters" (starts on p. 60) -- it lists all the (baffling) formats and contexts in which to use those formats.

> Is this difference part of a broader issue that the ID Find/Change dialog box uses a somewhat different "flavor" of GREP than ExtendScript uses?

There are several different dialects of regular expressions/GREP. All these dialects have a common core, which goes back to GREP's origin, but as is the case with natural languages, most of those dialects went their own way. InDesign uses the Boost GREP libraries, which are fairly common these days. It's a powerful library.

ExtendScript's GREP is the same as JavaScript's GREP. This is an entirely different affair than InDesign's GREP. JavaScript's GREP -- even in its latest incarnations -- is much less powerful than InDesign's GREP.

For JavaScript's GREP you can find many resources on the web. For InDesign's GREP you can consult various links on https://creativepro.com/indesign/ generally and my GREP book (which you mentioned) specifically.
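A short side-by-side may help show the two notations. This is an illustrative sketch only: the sample string is taken from the garbled text earlier in the thread, and the variable names are made up for the example.

```javascript
// A string containing the three-character garbled form of a
// left double quotation mark.
var garbled = "the gift: \u00e2\u0080\u009cMarian Anderson was ...";

// JavaScript regular expression: characters are written as \uXXXX,
// and the replacement happens on the string itself.
var fixed = garbled.replace(/\u00e2\u0080\u009c/g, "\u201c");

// InDesign GREP: characters are written as \x{XXXX}. Inside a JavaScript
// string literal every backslash has to be doubled.
var inDesignGrep = "\\x{00E2}\\x{0080}\\x{009C}";
```

The first form is evaluated by the JavaScript engine; the second is just a string that InDesign's own GREP engine would interpret after it is assigned to app.findGrepPreferences.findWhat.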

P.

Report · Jan 30, 2021

Wow, that explains a lot! I'm going to be wasting a lot less time now that I know there's one syntax for GREP in ID and a different syntax for GREP in JavaScript and ExtendScript.

Thanks!

Report · Jan 20, 2021

Or you could write to Marc Autret and ask him. In general, when you encounter a problem with a script, it's more useful to ask the script's author for help rather than a forum.

P.

Report · Jan 21, 2021

Thanks, Peter.

After reading this excellent introduction to Unicode, I think I've now learned that what I described above is an encoding issue, but I'm still not sure whether:

a) the encoding issue is something Web.jsxlib (specifically, HttpSecure.Win.jsxinc) "should" deal with, or

b) Web.jsxlib is returning exactly what it should, and my script needs to take steps (as Mike suggested above) to deal with it.

I'll try to contact Marc and pose the question to him. And in the meantime I'll try a GREP workaround as Mike suggests.

PS: Your InDesign scripting and GREP books have each been of immeasurable help to me over the past couple of months! I've gone from zero to doing really useful stuff thanks in no small part to your books. Much appreciated.
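For readers wondering how the garbled sequences arise, here is a minimal sketch of the likely mechanism, assuming the response is UTF-8 and each byte is then read as a character of its own. Both function names are hypothetical and exist only for this illustration.

```javascript
// Encode one code point below U+10000 as its UTF-8 bytes.
function utf8Bytes(codePoint) {
    if (codePoint < 0x80) {
        return [codePoint];
    }
    if (codePoint < 0x800) {
        return [0xC0 | (codePoint >> 6), 0x80 | (codePoint & 0x3F)];
    }
    return [
        0xE0 | (codePoint >> 12),
        0x80 | ((codePoint >> 6) & 0x3F),
        0x80 | (codePoint & 0x3F)
    ];
}

// Misread every byte as a separate one-byte character.
function misreadAsSingleBytes(bytes) {
    var out = "";
    for (var i = 0; i < bytes.length; i++) {
        out += String.fromCharCode(bytes[i]);
    }
    return out;
}

// RIGHT SINGLE QUOTATION MARK (U+2019) becomes the three bytes E2 80 99,
// which show up as three separate characters when misread.
var garbledApostrophe = misreadAsSingleBytes(utf8Bytes(0x2019));
```

Every character from U+2000 to U+2FFF encodes to three bytes starting with E2, which is consistent with all the smart punctuation in the sample beginning with the same â character.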

Report · Feb 03, 2021

Hi @Andover Beacon

Thanks for your feedback. (BTW, feel free to open a new issue in https://github.com/indiscripts/IdExtenso/issues)

The Web module has just been updated (today), so maybe you'll get safer results now regarding character encoding. Use $$.Web(url, 1) to grab text/xml data. (The `wantText` option wasn't working as expected in Win environment.)

Now, the encoding problem may persist because Microsoft's XMLHTTP component is not 100% deterministic regarding the data returned in an HTTP response (in `responseText` mode). Quoting https://docs.microsoft.com/en-us/previous-versions/windows/desktop/ms762275(v=vs.85):

> XMLHTTP attempts to decode the response into a Unicode string. It assumes the default encoding is UTF-8, but can decode any type of UCS-2 (big or little endian) or UCS-4 encoding as long as the server sends the appropriate Unicode byte-order mark. It does not process the <? XML coding declaration. If you know the response is going to be XML, use the responseXML property for full XML encoding support.

I'd love to investigate your issue from the original Google Doc URL you're testing. We may then discover at which point the character codes go crazy.

Let me know.

Best,

Marc