Inspiring
March 8, 2017
Answered

Grep: find text between quotation marks when the number of words is more than 20?

  • March 8, 2017
  • 2 replies
  • 12356 views

(?<=“).*?(?=”)

is a grep expression to find text inside quotation marks.

But how can the search be limited to a given number of words inside the curly quotes?

For example, to detect only quotations that have more than n words, say 50?

This comes up because it is a publishing practice to style such quotations as «block quotations», set off from the text without quotation marks.

thanks

    This topic has been closed for replies.
    Correct answer Laubender

    Tested the greps:

    Laubender/9: the 3 options didn’t work. They found 0, 9 and 12 matches.

    Jongware/13 (second grep) and Laubender/16 both found 285 occurrences.

    (Are they the same...?)

    Obi/15 crashed. It works in small batches, apparently fine, but I have not tested it on the whole document. Very greedy.

    *****

    I checked how many opening quotation marks the file actually has: 321 (and 326 closing ones... but that is easy to fix. Obi resolved it in the past, here!)

    Changing {5,} to {1} in Jongware/13 and Laubender gives 311 matches

    and, to my surprise, changing the same {5,} to {0,} gives 316.

    Finally, both Laubender and Jongware were very fast, and no trace of greediness was perceived.

    Conclusion: the supergrep for this task is

    “([^\s“]+\s+){n,}[^\s\r\n”]+”
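A minimal sketch of how this expression behaves, using Python's `re` module as a stand-in for InDesign's GREP engine (an assumption: the two engines are not identical, but this particular pattern uses only syntax they share). The function name `find_long_quotes` and the sample text are hypothetical. Note that `{n,}` counts "word plus whitespace" groups, so a match contains at least n + 1 words.

```python
import re

def find_long_quotes(text, n):
    """Return quotations matched by the thread's final pattern for a given n."""
    # n or more "word + whitespace" groups, then one last word, inside curly quotes
    pattern = re.compile(r"“(?:[^\s“]+\s+){%d,}[^\s\r\n”]+”" % n)
    return [m.group(0) for m in pattern.finditer(text)]

# Hypothetical sample: one 3-word quotation and one 25-word quotation
short = "“one two three”"
long_quote = "“" + " ".join("w%d" % i for i in range(25)) + "”"
text = short + " and then " + long_quote

for quote in find_long_quotes(text, 20):
    print(len(quote.split()), "words:", quote[:30] + "...")
```

With n set to 20 only the long quotation is returned; lowering n to 2 returns both.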


    Did you try Laubender 12 ?

    If you copy/paste the expressions, you have to look carefully at the curly quotes in the expression.

    It is best to type the expression yourself, without using straight quotes.

    I only tested with English text. Not e.g. German text where the quotes for opening and closing are very different.

    The only GREP expression working straight for me is:
    Laubender 12

    “([()[\]]?\<[^“]+\>[,;:!?.…()[\]\h]*){21}”

    Obi-wan 15 is also working, but first I had to change the quotes, like this:

    “(([^ “]+)\h){21,}(?2)”

    Did not test with very long quotes. Just the examples you are seeing in my screenshots.

    Regards,
    Uwe

    2 replies

    Erica Gamet
    Inspiring
    March 8, 2017

    The issue here is that you'll have to account for any types of spaces and punctuations that may occur, while limiting the number of WORDS to 50. In theory, it's probably doable...it's just above my "pay grade," as they say.

    Erica Gamet
    Inspiring
    March 8, 2017

    To tell InDesign to match something a certain number of times, or a minimum number of times, use curly brackets. {50,} means 50 or more times.

    Inspiring
    March 8, 2017

    Erica,

    I am now using this grep, as it seems better than the one mentioned:

    “.*?\”

    And it is fine. Adding the curly brackets is the problem, because it doesn't work:

    “.*?\”{20}
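A likely reason, sketched here with Python's `re` module as a stand-in for InDesign's GREP (an assumption): a quantifier binds only to the single item just before it, so `{20}` here repeats the closing quote, not the words. The sample strings are hypothetical, and the backslash before the closing quote is dropped because it is not needed.

```python
import re

# {20} applies to the closing quote alone: twenty consecutive ” are required
misplaced = re.compile("“.*?”{20}")

ordinary = "“a quotation of several words”"   # one closing quote
odd = "“x" + "”" * 20                         # twenty closing quotes in a row

print(misplaced.search(ordinary) is None)     # the ordinary quotation is not found
print(misplaced.search(odd) is not None)      # only the artificial string is found
```

To count words, the quantifier has to be attached to a group that contains one word plus the space after it.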

    Another thing to resolve is that the premise asks for quotations of 20 or more words. Those with 1 to 19 words must not be considered.

    Thanks. I was upset that this thread seemed to be seen as irrelevant.

    And it is a superb tool when the author has put all the quotations in the text inside quotation marks.

    P.S. One possible method could be to extract all the quotations, place them in an ascending/descending list to filter them easily by number of words and then, with find/change, cross-reference that information to isolate one group and style it. But this seems a very dubious method, or one that needs a script. I think a grep should work.

    *******

    But grep has expressions to find words, like \w+…

    Carrying the idea over from digits, \d{20}, which is fine, to words does not work: \w+{20}
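One way to carry the counting idea over from digits to words is to group a word together with its separator and repeat the group. The sketch below uses Python's `re` as a stand-in for InDesign's GREP (an assumption); `has_at_least_words` is a hypothetical helper, and treating `\W+` as the separator is a simplification that ignores apostrophes and hyphens inside words.

```python
import re

def has_at_least_words(text, n):
    """True if text contains a run of at least n words (n >= 1)."""
    # (word + separator) repeated n - 1 times, followed by one more word
    pattern = re.compile(r"(?:\w+\W+){%d}\w+" % (n - 1))
    return pattern.search(text) is not None

print(has_at_least_words("one two three", 3))
print(has_at_least_words("one two three", 4))
```

The first call finds three words; the second does not find four.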

    Erica Gamet
    Inspiring
    March 8, 2017

    Maybe a combo of GREP and a script. Have you posted this question to the GREP group on Facebook? (Treasures of GREP)