access multiple documents’ textlayers’ content without opening files

Report · Aug 29, 2009

In the Macintosh forum the question was posed whether multiple Photoshop files can be searched for their text contents.

Now, setting up a selection interface, opening the selected files and doing a file-by-file, text-layer-by-text-layer search in Photoshop shouldn’t be too hard. But the real intent was indexing in an InDesign document, so avoiding the opening should be a time-saver if it’s possible to access textItem.contents externally.

With readIn() (a technique I’ve been introduced to here regarding pdf-pagecounts) I’ve gotten as far as noticing that the text seems to be situated between "Text" and a closing bracket, but quite frankly the code baffles me.

And apart from the problem of closing brackets in the text what I’ve gotten so far includes some texts twice and adds some numbers in the result.

Any help appreciated.

Report · Aug 29, 2009

My guess is that opening the files in Photoshop would be faster than string searches on the file stream using JavaScript.

But if you want to try, this should get you close.

var f = new File('/c/temp/text_small.psd');
var re = /Text \(þÿ\x00(.+)/g;
var textStrings = [];

// Read the whole file as raw bytes, one byte per character.
f.open('r');
f.encoding = 'BINARY';
var str = f.read();
f.close();

var m = re.exec( str );
while (m != null) {
     var zpt = m[1]; // zero-padded text captured after the marker
     var text = '';
     for(var p = 0; p < zpt.length; p++ ){
          // keep every character except the \x00 padding
          if( zpt.charCodeAt(p) != 0 ) text += zpt.charAt(p);
     }
     textStrings.push(text);
     m = re.exec( str );
}

Report · Aug 29, 2009

Nice one Mike, I was working on a similar one but got it all wrong; mine got all the layer names.

I see that it duplicates all the text, so here is a quick fix:

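var f = new File('filename.psd');
var re = /Text \(þÿ\x00(.+)/g;
var textStrings = [];

f.open('r');
f.encoding = 'BINARY'; // read raw bytes, one byte per character
var str = f.read();
f.close();

var m = re.exec( str );
while (m != null) {
     var zpt = m[1];
     var text = '';
     for(var p = 0; p < zpt.length; p++ ){
          // keep every character except the \x00 padding
          if( zpt.charCodeAt(p) != 0 ) text += zpt.charAt(p);
     }
     textStrings.push(text);
     m = re.exec( str );
}

// ReturnUniqueSortedList was not shown in the post; this stand-in is an assumption.
function ReturnUniqueSortedList(list) {
     var sorted = list.slice(0).sort();
     var unique = [];
     for (var i = 0; i < sorted.length; i++) {
          if (i == 0 || sorted[i] != sorted[i - 1]) unique.push(sorted[i]);
     }
     return unique;
}
textStrings = ReturnUniqueSortedList(textStrings);
alert(textStrings);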

Report · Aug 29, 2009

I didn't test it and knew it wasn't done. It took a long time even with the small psd I looked at to see the file structure so I gave it up as a wrong approach.

What size file did you test, and do you think this way would be fast enough to be usable? Maybe I was wrong.

Also I would bet from the RegExp I used that if the text has a linebreak, this will only get the first line.
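The line-break concern can be checked on a made-up string. This is only a sketch: it assumes the text sits between "Text (" and a closing parenthesis, that a line break inside a text layer is stored as a carriage return, and that parentheses inside the text are backslash-escaped. None of that is confirmed against the PSD format here, and the variable names are hypothetical.

```javascript
// Synthetic stand-in for a piece of engine data (an assumption, not real file bytes):
// "Text (" + \xFE\xFF + 16-bit big-endian "Hi\rYo" + ")"
var sample = 'Text (\xFE\xFF\x00H\x00i\x00\r\x00Y\x00o)';

// The pattern from the thread: "." never matches \r or \n,
// so the capture ends at the first line break.
var firstLineOnly = /Text \(\xFE\xFF\x00(.+)/.exec(sample)[1];

// Hypothetical alternative: capture everything up to the closing parenthesis,
// accepting backslash-escaped characters such as "\)" along the way.
var wholeText = /Text \(\xFE\xFF((?:\\[\s\S]|[^)\\])*)\)/.exec(sample)[1];

// Drop the \x00 padding so the two results can be compared.
var firstLineClean = firstLineOnly.replace(/\x00/g, '');
var wholeTextClean = wholeText.replace(/\x00/g, '');
```

With this sample the first pattern stops after "Hi", while the second one also picks up the text after the carriage return.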

Report · Aug 29, 2009

I only tested it on a 2 MB file, but it took less than a second on my laptop (1.8 GHz dual core).

Report · Aug 30, 2009

First of all thanks for Your time!

I still have to sort through that code though, because that’s a bit steep for me …

Anyway, Michael, if your assessment in post #1 regarding the relative speed of opening in Photoshop versus reading the file stream is correct, starting such an operation from InDesign would need some BridgeTalk, which I was trying to avoid, but maybe it is the better way.

Thanks again!

Report · Aug 30, 2009

Christoph, if you are on OS X Tiger or above, then by a country mile the quickest and easiest way to search ".psd" files for their text layer contents is to use the file's metadata, which is accessible to the Spotlight engine. No file opening required; just use the same method as the PDF page count route.

The item key that you are looking for in this case is "kMDItemLayerNames"

This will give you all layer and layer set name strings... Enjoy!!!

Paul R knows exactly how to call this shell from JavaScript
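A rough sketch of the string handling around such a Spotlight query. It assumes the `mdls` command-line tool with its `-name` option, and it assumes mdls prints list values one per line between "(" and ")" with quotes around names that need them. The helper names are hypothetical, escaped quotes inside names are not handled, and how the command is actually run from the host application is left out.

```javascript
// Quote a path for a POSIX shell by wrapping it in single quotes.
function shellQuote(path) {
  return "'" + String(path).replace(/'/g, "'\\''") + "'";
}

// Build the command line that asks Spotlight for the layer names of one file.
function buildLayerNamesCommand(path) {
  return 'mdls -name kMDItemLayerNames ' + shellQuote(path);
}

// Parse the text mdls prints, ASSUMING this layout:
//   kMDItemLayerNames = (
//       "My Text",
//       Background
//   )
// Anything else (for example "(null)") yields an empty list.
function parseLayerNames(output) {
  var names = [];
  var inList = false;
  var lines = String(output).split('\n');
  for (var i = 0; i < lines.length; i++) {
    var line = lines[i].replace(/^\s+|\s+$/g, '');
    if (!inList) {
      inList = /=\s*\($/.test(line); // the list starts after "= ("
      continue;
    }
    if (line === ')') break;          // end of the list
    line = line.replace(/,$/, '');    // drop the separator
    if (/^".*"$/.test(line)) line = line.slice(1, -1);
    names.push(line);
  }
  return names;
}
```

The returned array can then be searched like any other list of strings, keeping in mind that these are layer names and not necessarily the text contents.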

Report · Aug 30, 2009

I am currently working under Mac OS X 10.5.7, but "kMDItemLayerNames" seems to hint at the layernames, which could be different from the actual text-content if somebody should have taken the effort to rename text-layers, doesn’t it?

Report · Aug 30, 2009

Yes if someone had renamed the layers then this would be broken as far as searching for the text content. But if I was to bother renaming the layers for some reason then I would also know what name I would be looking for too. It was a possible option but may not be suitable for you.

Report · Aug 30, 2009

Thanks for pointing it out in any case!

I also considered copying all the texts into the File Info Description in Photoshop, so that they should be searchable as Contents in the Finder.

And I actually don’t need the thing at all at current, but was intrigued with a query by someone else and got to thinking how one could achieve it.

Report · Sep 01, 2009

Well, this is the time to let my ignorance shine …

In my tests »m« returns null immediately, so I don’t get any results.

But quite frankly I have no idea what »\(þÿ\x00(.+)« in the RegExp means anyway.

Could You please explain that part or how You arrived at it in a bit more detail?

On the other hand I might not comprehend it anyway, so please don’t inconvenience Yourself.

Report · Sep 01, 2009

The ignorance may be mine. Try var re = /Text\s\(\xFE\xFF\x00(.+)/g;

What that 'says' is: look for Text, one whitespace character, an opening parenthesis, then the next three chars given as hex codes, and then remember/store the rest of the line.

I first looked at a sample PSD using a hex editor to see if I could find a pattern to search for. As you pointed out in the first post, text appears many times, and I didn't have any luck finding the pattern that way.

So next I opened the sample with a text editor and found this (N.B. I have replaced the \x00 chars with '00' here so the pattern can be seen):

<<
     /EngineDict
     <<
          /Editor
          <<
               /Text (þÿ00M00y00T00e00x00t00

I think the text is stored as what is called zero-padded text, which is why, if a match is found, the code loops through the string to remove the \x00 chars.
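To make the match-then-strip idea concrete, here is a small sketch that runs on a made-up string instead of a real file. The fragment only imitates the excerpt above, the trailing carriage return and closing parenthesis are assumptions, and `extractEngineTexts` is a hypothetical name.

```javascript
// Find every "Text (" marker followed by \xFE\xFF\x00, capture the rest of the
// line, and drop the \x00 padding from the captured characters.
function extractEngineTexts(data) {
  var re = /Text\s\(\xFE\xFF\x00(.+)/g;
  var found = [];
  var m;
  while ((m = re.exec(data)) !== null) {
    var text = '';
    for (var p = 0; p < m[1].length; p++) {
      if (m[1].charCodeAt(p) !== 0) text += m[1].charAt(p);
    }
    found.push(text);
  }
  return found;
}

// Synthetic fragment shaped like the excerpt: "MyText" with a zero byte
// in front of every character.
var engineData = '/Editor\n<<\n/Text (\xFE\xFF\x00M\x00y\x00T\x00e\x00x\x00t\x00\r)\n>>';
var texts = extractEngineTexts(engineData);
```

On this fragment the capture stops at the carriage return, so the closing parenthesis never reaches the result.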

I would have thought it would work with the RegExp I posted. But I changed the re to match as described above in case the space or þÿ is the problem.

The only other thing I can think of is that I tested this on a PC. If you are on a Mac you might want to look at a PSD with a text editor to see if the þÿ part looks the same. I know that differences in the way the byte order is read can be a problem. I wouldn't think so here, but...

Also, as I noted, it runs slow on my machine after the first run. The first run is quick; every run after that is much slower. It may be just my system, but this might be very slow when scanning a folder in a loop.

Mike

Report · Sep 01, 2009

Here is yet another version..

Report · Sep 01, 2009

And here are my tweaks to Paul's version.

function main() {
  var file = File("~/Desktop/tmp/filename.psd");
  file.open("r");
  file.encoding = 'BINARY'; // one byte per character
  var dat = file.read();
  file.close();
  var result;
  var pos = [];
  var Text = [];
  // Remember the offset just past each 'TxLr ... Txt TEXT' marker.
  var rex = /TxLr.+Txt TEXT/g;
  while ((result = rex.exec(dat)) != null) {
    pos.push(result.index + result[0].length);
  }
  function readByte(str, ofs) {
    return str.charCodeAt(ofs);
  }
  function readInt16(str, ofs) {
    return (readByte(str, ofs) << 8) + readByte(str, ofs+1);
  }
  function readWord(str, ofs) {
    return (readInt16(str, ofs) << 16) + readInt16(str, ofs+2);
  }
  function readUnicodeChar(str, ofs) {
    return String.fromCharCode(readInt16(str, ofs));
  }
  for (var i = 0; i < pos.length; i++) {
    var ofs = pos[i]; // offset of this match, not the whole array
    // 4-byte length; subtract 1 to skip the terminator
    var textLength = readWord(dat, ofs) - 1;
    ofs += 4;
    var str = '';
    for (var j = 0; j < textLength; j++) {
      str += readUnicodeChar(dat, ofs);
      ofs += 2;
    }
    Text.push(str);
  }
  alert(Text);
}
main();

Report · Sep 01, 2009

\xFE\xFF marks the beginning of a UTF-16 string (or file) in Adobe-land. What follows is a 16-bit Unicode string, two bytes per character. Stripping the \x00's out of the byte array will work provided there are no Unicode characters in the string that use any bits in the high byte.
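A short sketch of the difference, assuming the bytes are held in a string with one byte per character and the high byte comes first. `decodeUtf16BE` and `stripNuls` are hypothetical helper names.

```javascript
// Combine each pair of bytes (high byte first) into one 16-bit character.
function decodeUtf16BE(bytes) {
  var out = '';
  for (var i = 0; i + 1 < bytes.length; i += 2) {
    out += String.fromCharCode((bytes.charCodeAt(i) << 8) | bytes.charCodeAt(i + 1));
  }
  return out;
}

// The shortcut: simply throw away every zero byte.
function stripNuls(bytes) {
  return bytes.replace(/\x00/g, '');
}

var plainBytes = '\x00\x4F\x00\x4B'; // "OK": both high bytes are zero
var euroBytes = '\x00\x41\x20\xAC';  // "A" followed by the euro sign U+20AC

var plainStripped = stripNuls(plainBytes);     // same as the decoded text
var plainDecoded = decodeUtf16BE(plainBytes);
var euroStripped = stripNuls(euroBytes);       // the euro sign falls apart into two bytes
var euroDecoded = decodeUtf16BE(euroBytes);
```

For plain Latin text both routes agree. As soon as a character uses the high byte, stripping leaves its two bytes behind as two separate characters.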

Report · Sep 01, 2009

Thanks X,

That explains why I am getting some strange results with some text in my sample.

Paul,

It's my turn to say 'Good job'. If I may make a suggestion: your code will not return all the text if the text is longer than 255 chars. You may want to check what's at pos+17 to see if you need to handle longer text strings.

These changes should take care of the text length

var test = parseInt(pos)+17;
var textLength = (dat.charCodeAt(parseInt(test ))<<8)+dat.charCodeAt(parseInt(test)+1);
var start = test + 3;

Report · Sep 01, 2009

Actually, the length is encoded using 4 bytes though I'm sure something else would break if a string got that long.

Report · Sep 01, 2009

Yes, I think Photoshop might choke if you tried to put more than 65536 chars in a text layer.

Report · Sep 01, 2009

Obviously we are a few timezones apart, so that’s a lot of help so early in the morning for me, and I’ll have to test the code You so generously provided.

Thanks a lot for now, You all!

(Incidentally, Michael, I work on a Mac, I should have mentioned that.)

Report · Sep 01, 2009

I hadn’t known one may only assign one post the Correct Answer-seal, so Paul and Michael please forgive me for attaching it to xbytor’s post, because the code You posted also works fine on my test-file.

Thanks again, all of You!

(But, dang, does that go over my head …)

Report · Sep 02, 2009

"(But, dang, does that go over my head …)"

Mike's approach pulls the text from one part of the psd file. I recognize the format but avoid working with it like the plague.

Paul's approach is something I am far more familiar with. Layers are (apparently) stored in PSD files as serialized ActionDescriptor objects, identical to the ones that you see in the Script Listener logs whenever you create a new text layer. His code (and my tweaks) take advantage of the fact that the key/ID for a text layer descriptor is 'TxLr' and that the first item in that descriptor is a string called 'Txt '. The first four bytes are the length of the string, and the string is encoded in 16-bit Unicode and has a \x00\x00 terminator.
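The layout described here can be tried on a hand-built record. This is a sketch under assumptions: the eight bytes between 'TxLr' and 'Txt ' are placeholders, the length is taken to count 16-bit characters including the terminator, the lazy `[\s\S]+?` pattern is a variation that also crosses line-break bytes, and all function names are hypothetical.

```javascript
// Read a 4-byte big-endian unsigned integer from a one-byte-per-character string.
function readUInt32BE(str, ofs) {
  return ((str.charCodeAt(ofs) << 24) | (str.charCodeAt(ofs + 1) << 16) |
          (str.charCodeAt(ofs + 2) << 8) | str.charCodeAt(ofs + 3)) >>> 0;
}

// Read a length-prefixed 16-bit string: 4-byte character count (terminator
// included), then two bytes per character, high byte first.
function readLengthPrefixedText(data, ofs) {
  var count = readUInt32BE(data, ofs);
  var start = ofs + 4;
  var text = '';
  for (var j = 0; j < count - 1; j++) { // count - 1 skips the \x00\x00 terminator
    var at = start + 2 * j;
    text += String.fromCharCode((data.charCodeAt(at) << 8) | data.charCodeAt(at + 1));
  }
  return text;
}

// Collect the string that follows every 'TxLr ... Txt TEXT' marker.
function findTextLayerStrings(data) {
  var rex = /TxLr[\s\S]+?Txt TEXT/g;
  var texts = [];
  var m;
  while ((m = rex.exec(data)) !== null) {
    texts.push(readLengthPrefixedText(data, m.index + m[0].length));
  }
  return texts;
}

// Two synthetic records: "Hi" (length 3 with terminator) and the euro sign (length 2).
var filler = '\x00\x00\x00\x01\x00\x00\x00\x00'; // placeholder bytes, an assumption
var recordA = 'TxLr' + filler + 'Txt TEXT' + '\x00\x00\x00\x03' + '\x00H\x00i' + '\x00\x00';
var recordB = 'TxLr' + filler + 'Txt TEXT' + '\x00\x00\x00\x02' + '\x20\xAC' + '\x00\x00';
var layerTexts = findTextLayerStrings('junk' + recordA + 'more' + recordB);
```

Because each character is rebuilt from both of its bytes, text outside the Latin range survives as well.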

Report · Sep 02, 2009

The only advantage I can see to using the text engine data as I did would be if you also wanted to know what font is used for the text. As far as I can tell that is only in the engine data.

Report · Sep 02, 2009

Thank You for the explanations!

Report · Sep 02, 2009

"I work on a Mac"

I do, too. The only time this might be a problem would be with docs created on a PPC Mac, but I'm not sure. I read through all of this a couple of years ago when I wrote the relevant code and promptly forgot about it. The XMP spec has a section on how XMP blocks are encoded when they are embedded in other docs, if you are really curious.
