Participant
December 10, 2017
Answered

Searching for non-arabic letters in a mixed document

Hi all,

Does anyone know whether it is possible to search specifically for non-Arabic letters in a mixed-language document?

Throughout the Arabic text (almost 70 pages), single words and whole sentences are written in Roman letters. Although they have the same font size as the Arabic letters, they appear much bigger. I could search for these cases manually and apply a different character style, but that is not efficient. I thought of using a GREP find and replace, but I couldn't figure out how to do it. Can anyone explain what to do, or suggest another way to achieve this?

Many thanks in advance

Mona

    Correct answer TᴀW

    The first thing to do is check the language setting that is applied to the text. If you're lucky, you'll find that the Arabic text has Arabic as its applied language and the non-Arabic text has a different language setting (English-US or similar). This often happens when the text came from Word, which by default automatically applies the correct language to text.

    If not, though, a GREP search is probably the best way. Searching for

    [\u\l]

    is a start, especially if we are only dealing with English letters. If there could be accented characters from other languages, a more inclusive GREP would be needed.
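
    To try out a "more inclusive" character class outside InDesign, here is a minimal Python sketch. It is not InDesign GREP: Python's re module has no \u and \l classes, so the letters are spelled out as explicit ranges. The choice of ranges (ASCII plus the accented Latin letters up to U+024F) and the names LATIN_RUN and find_latin_runs are assumptions made for illustration.

```python
import re

# Assumption: ASCII letters plus the accented Latin letters in the
# Latin-1 Supplement, Latin Extended-A and Latin Extended-B blocks are enough.
# U+00D7 (multiplication sign) and U+00F7 (division sign) are skipped on purpose.
LATIN_RUN = re.compile(r"[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F]+")


def find_latin_runs(text):
    """Return (offset, run) pairs for every unbroken run of Latin letters."""
    return [(m.start(), m.group()) for m in LATIN_RUN.finditer(text)]


# Sample: an Arabic word, "café", an Arabic letter, then "InDesign".
sample = "\u0647\u0630\u0627 caf\u00e9 \u0648 InDesign"
print(find_latin_runs(sample))
```

    Spaces and punctuation are deliberately left out of the class, so each word is reported separately.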

    Even with a GREP search, I recommend going through the found results one by one rather than clicking Change All, because you will probably want to mark the spaces and punctuation that belong to the English text as English as well, not just the letters.

    Ariel

    2 replies

    David W. Goodrich
    Participating Frequently
    December 11, 2017

    I routinely use a GREP search to find strings of CJK characters so I can apply a character style that includes the font I want:

    [\x{2E80}-\x{9FBB}]+

    This finds all strings of chars. encoded with hexadecimal values between 2E80 and 9FBB, and perhaps you can swap in one or more ranges that work for Arabic.  I assume Arabic includes word-spaces, so you may need a separate GREP to find spaces between strings of Arabic and apply the language attribute and font (perhaps by means of a char. style).  I have no idea how well GREP searches text running right-to-left -- hopefully just fine.
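
    As a rough illustration of swapping in Arabic ranges, the Python sketch below uses the Unicode Arabic blocks in the Basic Multilingual Plane. In the syntax of the CJK pattern above, the first of these would presumably be written [\x{0600}-\x{06FF}]+, though that form is untested in InDesign here. The names ARABIC_RUN and find_arabic_runs are made up for this example.

```python
import re

# Unicode blocks that hold Arabic-script characters in the first plane.
ARABIC_RUN = re.compile(
    "["
    "\u0600-\u06FF"  # Arabic
    "\u0750-\u077F"  # Arabic Supplement
    "\u08A0-\u08FF"  # Arabic Extended-A
    "\uFB50-\uFDFF"  # Arabic Presentation Forms-A
    "\uFE70-\uFEFC"  # Arabic Presentation Forms-B (stops before U+FEFF, the byte-order mark)
    "]+"
)


def find_arabic_runs(text):
    """Return (offset, run) pairs for every unbroken run of Arabic characters."""
    return [(m.start(), m.group()) for m in ARABIC_RUN.finditer(text)]


# Sample: "Adobe", two Arabic words separated by a space, then "2017".
sample = "Adobe \u0645\u0631\u062D\u0628\u0627 \u0628\u0643 2017"
for offset, run in find_arabic_runs(sample):
    print(offset, len(run))
```

    The two Arabic words come back as separate runs because the word-space between them is not in any Arabic block, which is the situation described above where a second search for the spaces may be needed.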

    My CJK search string isn't perfect.  It leaves out some stuff, including full-width punctuation and compatibility forms up at the top of Unicode's first plane, and of course doesn't get any CJK added in the second.  Nor can it distinguish C, J, and K, so I do that manually (you cannot rely on applied fonts -- I recently received Chinese files where most chars. had a Japanese font applied).

    Good luck!

    David

    Jongware
    Community Expert
    December 12, 2017

    > ... I have no idea how well GREP searches text running right-to-left -- hopefully just fine.

    Just as one would hope. If you search for a single Arabic character, it finds the first one of an Arabic word, which is the rightmost one on screen. Matching follows the correct reading order (a.k.a. "logical order", the order of the characters when viewed as an unformatted stream), even when mixing Arabic and an LTR language in a single query.
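
    The same logical-order behaviour can be seen in any regex engine that works on the stored character stream. The Python sketch below is only an analogy and does not run inside InDesign; the sample string and variable names are made up.

```python
import re

# "Word", a space, then an Arabic word stored in logical (reading) order:
# meem, reh, hah, beh, alef. On screen the meem is the rightmost letter.
text = "Word \u0645\u0631\u062D\u0628\u0627"

# A single-character search lands on the first Arabic letter in reading order.
first_arabic = re.search("[\u0600-\u06FF]", text)
print(first_arabic.start(), hex(ord(first_arabic.group())))  # 5 0x645

# One query can mix an LTR run and an Arabic run, written in reading order.
mixed = re.search("[A-Za-z]+ [\u0600-\u06FF]+", text)
print(mixed.span())  # (0, 10)
```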

    Participant
    December 10, 2017

    Thanks Ariel, you are great! I used your code and it found the Roman letters. By adding "+" I am able to search for a whole word. But I have no clue what to write if I want to highlight all the Roman letters of a phrase - something like "start with the first Roman letter and go to the last one before the first Arabic letter comes". Do you know how to do this? Is there a way?

    Thank you.

    TᴀW
    Legend
    December 10, 2017

    So, you can add a bunch of stuff inside the brackets:

    [\u\l ,.!;]

    whatever you want to include in your phrase -- a space, a comma, a period, an exclamation mark -- it's up to you. And then, as you say, add the + outside.

    The problem is that this can also catch commas and punctuation or spaces that strictly speaking are not part of the English -- it depends on the sentence. Sometimes, you can have an Arabic sentence that has a list of words in English separated by commas, but those commas are part of the Arabic sentence. On the other hand, you can sometimes have an Arabic sentence with an English phrase with commas between some of the words in English, and those commas are part of the English.

    It's impossible for a computer to tell which is the case. So you have to go one by one and check it with our superior human brain :-)

    Still, you can certainly add some extra stuff in the brackets as described, and that might make it quicker for most of the cases...
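
    One way to narrow the phrase search is to require the match to begin and end on a letter, so that spaces and punctuation are only picked up when they sit between Roman letters. The Python sketch below shows the idea with plain A-Z ranges. An InDesign translation would presumably look like [\u\l]([\u\l ,.!;]*[\u\l])? but that form is an untested assumption, and the name LATIN_PHRASE is made up.

```python
import re

# A letter, optionally followed by any mix of letters, spaces and the listed
# punctuation, as long as the match ends on a letter again.
LATIN_PHRASE = re.compile(r"[A-Za-z](?:[A-Za-z ,.!;]*[A-Za-z])?")


def find_latin_phrases(text):
    """Return (offset, phrase) pairs; each phrase starts and ends with a letter."""
    return [(m.start(), m.group()) for m in LATIN_PHRASE.finditer(text)]


# Sample: Arabic word, "Hello, world", Arabic word, "OK".
sample = "\u0642\u0627\u0644 Hello, world \u062B\u0645 OK"
print(find_latin_phrases(sample))
```

    The space after "world" is left out because the next letter is Arabic. A closing period or exclamation mark that belongs to the English is also left out, so the results still need to be checked one by one.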