Why is the text in TextStyleRanges duplicated?

Report · Dec 07, 2013

I am using the InDesign COM API from C#. (I don't know whether that makes any difference; my guess is that this behaviour would be the same with other technologies as well.)

I have a story in a document that looks like the following (this is a simplified example, and also, this document was not created by me, but rather given to me by a designer):

"This is the first sentence. This is the second sentence."

When I process this document with the SDK, I process every TextStyleRange in every Paragraph of the story. Is that the right approach to use to convert the text to another format?

My problem is that the values of Paragraph.Contents and TextStyleRange.Contents are inconsistent. I created a little test app that iterates through the paragraphs and their textStyleRanges and writes out the value of each Contents property. The result is similar to the following:

Paragraph 1, Contents: "This is the first sentence."

TextStyleRange 1.1, Contents: "This is the first sentence."

Paragraph 2, Contents: "This is the "

TextStyleRange 2.1, Contents: "This "

TextStyleRange 2.2, Contents: "is the second sentence."

Paragraph 3, Contents: "second sentence."

TextStyleRange 3.1, Contents: "second sentence."

So as you can see, the textStyleRanges of the second paragraph contain some text which is not in the Contents property of the second Paragraph. And that piece of text is repeated in the last paragraph. So if I iterate through and process every TextStyleRange, then I end up with the following text:

"This is the first sentence. This is the second sentence. second sentence."

So the last two words are duplicated.

I don't know why this duplication is in the data. It does not show up if I open the indd file in InDesign.

Also, if I export the document as IDML, it somehow produces the correct result, something like this:

<Content>This is the first sentence.</Content>

</CharacterStyleRange>

<Br />

<Content> second sentence.</Content>

</CharacterStyleRange>

</ParagraphStyleRange>

</Story>

What can be the reason for this? How can I programmatically figure out that I don't have to process the last paragraph?

Report · Dec 07, 2013

This is due to an unfortunate change in behavior, possibly (from memory) between CS2 and CS3. "TextStyleRange" used to limit its contents to a single paragraph, but something changed and now it *may* extend into any following paragraphs that have the same formatting.

Since it also happens in JavaScript, it must be intentional and built in. Note that "Find text" in the UI *also* works the same way (perhaps it did not, long ago).

For JavaScript, I can work around it by comparing the index of the last text style range with the index of the last character in its parent paragraph. This works because the "index" indicates an offset in *characters* from the start of a story.

If you already work per paragraph anyway, it could be useful to first gather all starting indexes for each paragraph. Then you only have to test the last style range against that, and cull it if it overshoots.
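The overshoot test can be sketched on plain data, without the InDesign DOM. This is only an illustration: the offsets are made up to match the sample story, the paragraph returns and end-exclusive ranges are assumptions of the model, and `overshoots` is a hypothetical helper, not an InDesign API.

```javascript
// Plain-data model: offsets are character indexes from the start of the story.
// "end" is exclusive, so a range covers the characters start .. end - 1.
// Assumption: each paragraph except the last ends with one return character.
var paragraphs = [
    { start: 0,  end: 28 },  // "This is the first sentence." plus its return
    { start: 28, end: 41 },  // "This is the " plus its return
    { start: 41, end: 57 }   // "second sentence."
];

// Hypothetical helper: true when the range ends after the paragraph does.
function overshoots(range, paragraph) {
    return range.end > paragraph.end;
}

// Last style range reported for the second paragraph: "is the second sentence."
var lastRangeOfSecondParagraph = { start: 33, end: 57 };

console.log(overshoots(lastRangeOfSecondParagraph, paragraphs[1])); // true
console.log(overshoots({ start: 28, end: 33 }, paragraphs[1]));     // false
```

In a real script the offsets would come from the "index" values described above rather than from hard-coded numbers.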

... Annoying, isn't it?

Report · Dec 07, 2013

Thanks for the info! So I can get the index range of every paragraph or textStyleRange by getting the first and last items of the insertionPoints collection, and querying their index, right?

And isn't it possible for two textStyleRanges to overlap in a way that I cannot omit either of them? I am thinking of something similar to this:

Paragraph 2, Contents: "This is the "

TextStyleRange 2.1, Contents: "This is the second"

Paragraph 3, Contents: "second sentence."

TextStyleRange 3.1, Contents: "second sentence."

In the above example, I can't simply omit the last styleRange, because that way I would miss the last word. So I would have to break that range into two words and omit only the first word.

Should I prepare for the above scenario or is this impossible?

Report · Dec 07, 2013

Logically TextStyleRanges cannot overlap 🙂 It would defeat their purpose. So your scenario is, theoretically at least, not possible.

"InsertionPoints" -- mind that the final hard return is considered part of the "current" paragraph. In JavaScript you can also query the "index" of any character, and I think it's slightly safer to use that instead. "InsertionPoints" are sort of virtual constructs, whereas there is nothing more concrete than a character, which *is* or *is not* inside a paragraph/text style range.

Report · Dec 07, 2013

Wait: if you are iterating over only the Text Style Ranges, you can't be sure which *paragraphs* you get. It's better to iterate over only Paragraphs, and check the very last TextStyleRange for over-run.

The reason behind this is that TextStyleRanges are (also) "virtual" constructions: they are happy to run over the *end* of a paragraph, but if you ask a paragraph for its first TSR, it will always start at the start of that paragraph -- even if the last TSR of the paragraph above it already "runs" into this paragraph.

Report · Dec 07, 2013

Umm, I am starting to get a bit confused.

Here is pseudocode for what I plan to do:

int lastProcessedIndex = 0;
foreach (Paragraph p in story.Paragraphs)
{
    foreach (TextStyleRange tsr in p.TextStyleRanges)
    {
        if (tsr.InsertionPoints.LastItem.Index <= lastProcessedIndex)
        {
            // We have already processed the contents of this TSR in the previous paragraph, so omit it.
            continue;
        }
        Process(tsr);
        lastProcessedIndex = tsr.InsertionPoints.LastItem.Index;
    }
}

Would this approach be correct?

Report · Dec 07, 2013

C# is working at the same high-level object model as other scripting languages, so this discussion would be more appropriate at the scripting forum.

Anyway, paragraphs and text style ranges are two mostly independent ways to slice the whole sequence of characters that make up a story. If your unmentioned export format cannot handle those interwoven "strands" but requires a hierarchical approach, that would be the first structure to follow. For something similar to, let's say, HTML, you're right to start with paragraphs.

Depending on what else you need to express, you will have to subdivide the paragraphs further and take care not to extend beyond them. Your choice of text style runs is one obvious candidate, even though the (final) style run in a paragraph may extend beyond the end of the paragraph. In that case you have to intersect, that is, take the range where paragraph and TSR overlap. The result is no longer a TSR, but a possibly shorter unit. You can specify that either as a range of characters, which in the scripting object model means individual characters, or as a single "Text" object. In a similar project I used an expression such as the following JavaScript snippet to produce that arbitrary range: story.texts.itemByRange(story.characters.item(firstOffset),story.characters.item(lastOffset)).getElements()[0].
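The intersection step can be illustrated on plain data. The sketch below does not touch the InDesign DOM: the story string, the "\r" paragraph returns, the end-exclusive offsets and the helper names `intersect` and `piecesByParagraph` are all assumptions made for the example.

```javascript
// Story text; "\r" stands for a paragraph return (an assumption of this model).
var storyText = "This is the first sentence.\rThis is the \rsecond sentence.";

// Half-open [start, end) character ranges, counted from the start of the story.
var paragraphRanges = [
    { start: 0, end: 28 }, { start: 28, end: 41 }, { start: 41, end: 57 }
];
// Non-overlapping style runs; the last one crosses a paragraph boundary.
var styleRuns = [
    { start: 0, end: 28 }, { start: 28, end: 33 }, { start: 33, end: 57 }
];

// Overlap of two half-open ranges, or null when they do not overlap.
function intersect(a, b) {
    var start = Math.max(a.start, b.start);
    var end = Math.min(a.end, b.end);
    return start < end ? { start: start, end: end } : null;
}

// For every paragraph, collect the parts of the style runs that lie inside it.
function piecesByParagraph(paragraphs, runs) {
    var result = [];
    for (var p = 0; p < paragraphs.length; p++) {
        var pieces = [];
        for (var r = 0; r < runs.length; r++) {
            var overlap = intersect(paragraphs[p], runs[r]);
            if (overlap !== null) {
                pieces.push(overlap);
            }
        }
        result.push(pieces);
    }
    return result;
}

var pieces = piecesByParagraph(paragraphRanges, styleRuns);

// Concatenating all pieces in order reproduces the story exactly once.
var rebuilt = "";
for (var i = 0; i < pieces.length; i++) {
    for (var j = 0; j < pieces[i].length; j++) {
        rebuilt += storyText.substring(pieces[i][j].start, pieces[i][j].end);
    }
}
console.log(rebuilt === storyText); // true
```

Because every piece is clipped to its paragraph, the run that crosses the boundary contributes one piece to the second paragraph and one to the third, so nothing is processed twice.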

Note though that in practice it is not that simple: you might encounter breaks in the text style ranges that you want to ignore, e.g. run-in styles (GREP styles, line styles etc.) or changes in attributes that you won't export anyway. There will also be other points to break at: anchor characters for inline graphics, hyperlink sources, embedded notes, tables, special characters and so forth, some of which also extend across paragraphs. The basic operation remains the same: you just find the next offset for a split point and reduce your working range.
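That loop, finding the next split offset and shrinking the working range, might look roughly like this on plain data. The function names and the list of split offsets are hypothetical; in a real exporter the offsets would be gathered from whatever objects need their own unit.

```javascript
// Smallest split offset strictly inside (from, to), or "to" when there is none.
function nextSplit(from, to, splitOffsets) {
    var next = to;
    for (var i = 0; i < splitOffsets.length; i++) {
        var offset = splitOffsets[i];
        if (offset > from && offset < next) {
            next = offset;
        }
    }
    return next;
}

// Cut the half-open range [start, end) at every split offset that falls inside it.
function splitRange(start, end, splitOffsets) {
    var units = [];
    var from = start;
    while (from < end) {
        var to = nextSplit(from, end, splitOffsets);
        units.push({ start: from, end: to });
        from = to; // reduce the working range to what is left
    }
    return units;
}

// Hypothetical split points: a style change at 33, an anchored object at 36,
// and paragraph boundaries at 41 and 57.
var units = splitRange(28, 41, [33, 36, 41, 57]);
console.log(JSON.stringify(units));
// [{"start":28,"end":33},{"start":33,"end":36},{"start":36,"end":41}]
```

The split offsets do not need to be sorted or de-duplicated, since each step just picks the nearest one ahead of the current position.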

Hth,

Dirk
