Known Participant
November 2, 2010
Question

Finding the data between two different substrings in a string


Hey guys-

This is probably really easy for someone...I can't figure it out with CF.  I've searched a bunch but haven't found exactly what I'm looking for.

I'm searching an HTML document with ColdFusion and need to pull some data from it.  Just as a quick example, somewhere in the HTML document there might be some code that looks like this (obviously super simplified):

<tr><td>Grand Total: </td></tr><tr><td>$1000.00</td> </tr>

I need to do a couple of things.  1) I need to capture the $1000.00 (this amount is dynamic, it will be different on every HTML doc). 2) I need to delete the whole row (or two rows in this example).

I'm guessing it's something along the lines of setting " <tr><td>Grand Total:</td></tr><tr><td> " to a variable, then searching for everything after it until it finds </td> and storing that as a variable.... and then for step two storing everything from the  <tr><td>Grand Total:</td></tr><tr><td> to the next </tr> as a variable, and then using Replace to take it out of the document.  I just don't know how to do the "capture everything from one pre-established variable to the next time the code has variable number 2 (in this case, either </td> or </tr>)" part.

Any thoughts or help would be so greatly appreciated!

Thanks all!!
JE


    Inspiring
    November 3, 2010

    I've read this thread and I must be missing something.  If what you want to do is remove all occurrences of

    <tr><td>Grand Total: </td></tr><tr><td>something</td> </tr>

    then just use reReplace.  Something like this (might need a little tweaking):

    <cfset x=reReplaceNoCase(varname,"<tr>[\s]*?<td>Grand Total:[\s]*?</td>[\s]*?</tr>[\s]*?<tr>[\s]*?<td>[\w|\W]+?</td>[\s]*?</tr>","","ALL")>

    -reed

    jecultureAuthor
    Known Participant
    November 3, 2010

    Hi Reed -

    The problem is that this is part of a very large HTML doc, so the variable isn't just

    <tr><td>Grand Total: </td></tr><tr><td>something</td> </tr>; it's that plus a million more tds, trs, images, yada yada.  There is only one "Grand Total" on the page, so I'm using that as the search criterion, and the very next piece of data that isn't HTML is what I'm trying to capture.  Does that make sense?

    Inspiring
    November 3, 2010

    Then you want a combination of what ilssac suggested and what I suggested.

    <cfset x=reFindNoCase("<tr>[\s]*?<td>Grand Total:[\s]*?</td>[\s]*?</tr>[\s]*?<tr>[\s]*?<td>([\w|\W]+?)</td>[\s]*?</tr>",varname,1,"TRUE")>

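    will give you a structure in x whose pos and len arrays identify the position and length of what was matched. Element 1 describes the whole match and element 2 describes the parenthesized sub-expression ("$1000.00" in your example), which can then be extracted using mid()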

    <cfset x=reReplaceNoCase(varname,"<tr>[\s]*?<td>Grand Total:[\s]*?</td>[\s]*?</tr>[\s]*?<tr>[\s]*?<td>[\w|\W]+?</td>[\s]*?</tr>","","ALL")>

    Will then remove the two lines of HTML from the variable.
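
    Put together, the capture and the removal might look like the sketch below. It assumes the whole document fits in a variable (varname, as above); the names pattern, match, grandTotal and cleaned are made up for illustration, and the sample string stands in for the real page.

    ```cfml
    <!--- Sample input; in practice varname would hold the full HTML document --->
    <cfset varname = '<table><tr><td>Item</td></tr><tr><td>Grand Total: </td></tr><tr><td>$1000.00</td> </tr></table>'>

    <!--- One pattern is used for both the search and the removal --->
    <cfset pattern = "<tr>[\s]*?<td>Grand Total:[\s]*?</td>[\s]*?</tr>[\s]*?<tr>[\s]*?<td>([\w|\W]+?)</td>[\s]*?</tr>">

    <!--- The fourth argument asks for a structure with pos and len arrays --->
    <cfset match = reFindNoCase(pattern, varname, 1, true)>

    <!--- pos[1] is 0 when nothing matched --->
    <cfif match.pos[1] gt 0>
        <!--- Element 1 is the whole match; element 2 is the parenthesized group --->
        <cfset grandTotal = mid(varname, match.pos[2], match.len[2])>
        <!--- Remove both rows from the document --->
        <cfset cleaned = reReplaceNoCase(varname, pattern, "", "ALL")>
    </cfif>
    ```

    Checking pos[1] first avoids calling mid() with a zero position when "Grand Total" is not on the page.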


    Did you really mean to say that the var contained the image data? Maybe you meant just the IMG tags.  So long as the var can fit into CF memory, the above will work.  If it won't fit, then you can do a CFLOOP over the file, taking a chunk at a time, and building a sliding window through the file.  On each iteration after the first you run the above code.  As you slide a chunk out of the window you append it to an output file.  When you're done you'll have the input file with the desired changes.  Might sound slow but it isn't - CF can rip through huge files like this real fast.

    -reed

    ilssac
    Inspiring
    November 2, 2010

    Simple Answer: Regex using the reFind() ColdFusion function.

    Tricky answer: you may need some of the fancier pieces of regex like a look-ahead and/or look-behind, but hopefully not because, IIRC, ColdFusion regex does not support look-behind.

    But I am not the person to give you the actual regex syntax.  I always have to look up the syntax and figure it out step by step.

    jecultureAuthor
    Known Participant
    November 2, 2010

    Thank you for the response...I'm not sure what regex is, but I'm guessing this really is just a combo of something like "find position of variable 2, find position of variable 1, subtract variable 1 count from variable 2 count and capture that many characters after variable 1", so if I had to take a shot in the dark...it would just be a combo of LEFT() RIGHT() LEN(), etc, but I'm really not smart enough with the syntax to know what I should be putting where....

    Thoughts?
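
    For reference, the find-and-subtract idea described above can be sketched with plain string functions and no regex. This is only a sketch: it assumes the markup around "Grand Total" is exactly as in the sample (same spacing, no line breaks), and the variable names htmlDoc, startMarker, endMarker, grandTotal and cleaned are made up for illustration.

    ```cfml
    <!--- Sample input; in practice htmlDoc would hold the full HTML document --->
    <cfset htmlDoc = '<tr><td>Grand Total: </td></tr><tr><td>$1000.00</td> </tr>'>
    <cfset startMarker = '<tr><td>Grand Total: </td></tr><tr><td>'>
    <cfset endMarker = '</td>'>
    <cfset grandTotal = "">
    <cfset cleaned = htmlDoc>

    <!--- Position of marker 1; findNoCase returns 0 when it is not found --->
    <cfset startPos = findNoCase(startMarker, htmlDoc)>
    <cfif startPos gt 0>
        <!--- The value begins right after marker 1 --->
        <cfset valueStart = startPos + len(startMarker)>
        <!--- Position of marker 2, searching from where the value begins --->
        <cfset endPos = findNoCase(endMarker, htmlDoc, valueStart)>
        <cfif endPos gt valueStart>
            <!--- Capture the characters between the two markers --->
            <cfset grandTotal = mid(htmlDoc, valueStart, endPos - valueStart)>
            <!--- Step two: cut everything from marker 1 through the next </tr> --->
            <cfset rowClose = findNoCase("</tr>", htmlDoc, endPos)>
            <cfif rowClose gt 0>
                <cfset cleaned = removeChars(htmlDoc, startPos, rowClose + len("</tr>") - startPos)>
            </cfif>
        </cfif>
    </cfif>
    ```

    The catch is that any difference in whitespace between the real page and startMarker makes the first findNoCase return 0, which is the main reason to prefer a regular expression here.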

    ilssac
    Inspiring
    November 2, 2010

    Regex is short for "regular expression" http://www.regular-expressions.info/reference.html.

    It is a syntax that allows for sophisticated string pattern matching, including variable and/or unknown portions of the string to match.

    Here is a simple example that is sort of what you want, though yours is more complex.

    <cfdump var="#refind("<a[^>]*>[^<]*</a>",anHTMLStringVar,1,TRUE)#">

    The refind function will use the regular expression in the first parameter to search the string in the second parameter, starting at position one and returning a structure whose pos and len arrays give the position and length of the match and of any sub-expressions.

    The <a[^>]*>[^<]*</a> regex translates to this:

    <a -- find a string of "<a"

    [^>]* -- followed by zero or more characters that are NOT a ">".

    > -- followed by the ">" character.

    [^<]* -- followed by zero or more characters that are NOT a "<". (A character class such as [^</a>] excludes the individual characters "<", "/", "a" and ">", not the string "</a>", so it would stop at the first "a" in the link text.)

    </a> -- followed by the string "</a>".

    In other words, this would find <a...>...</a> tags in a page, no matter what they linked to or other parameters used in the tag, as long as the link text contains no nested tags.

    Regular expressions are another tool for the programmer's toolbox, like SQL, HTML, JavaScript, or CSS, to name a few.