BreakawayPaul
Inspiring
March 16, 2011
Question

Refind returning strange results

I'm having to read the <title> and <h1> contents of a group of web pages in order to insert them into a database.  My problem is that some of the H1 tags have attributes (like class= or align=).  I'm using the following code to pull the H1:

<cfset regex1 = "<h1[^>]*>(.+?)</h1>">

Then feeding that into a refindnocase().

But the result I'm getting isn't the tag contents, but the attribute itself.  How can I make the refind ignore the attributes of the H1 and just return the tag contents?  The code works great if there are no attributes, but not when there are.

The odd part is that I always thought with a regex, only the things in (parens) were saved for later use.  I don't get it.

    2 replies

    Inspiring
    March 16, 2011

    What reFindNoCase() returns by default is just where the first character of a match starts.

    E.g., this code:

    <cfset regex1 = "<h1[^>]*>(.+?)</h1>">
    <cfset s = 'before <h1 class="big">Hello World</h1> after'>
    <cfoutput>#reFindNoCase(regex1, s)#</cfoutput>

    returns 8, which is where the match starts.

    If, however, you tell reFindNoCase() to capture subexpressions:

    <cfdump var="#reFindNoCase(regex1, s, 1, true)#">

    You'll get something like this:

    struct
    LEN   array
          1   32
          2   11
    POS   array
          1   8
          2   24

    Here the second element of the pos array says where "Hello World" starts, and the second element of the len array says how long it is.  The first element of each array identifies where the whole match starts and how long it is.
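
    A minimal sketch of how those two arrays could be used, reusing the regex1 and s values from the example above (the variable names m and h1 are only illustrative): mid() cuts the captured text out of the original string, and pos[1] is 0 when nothing matched.

    ```cfml
    <cfset regex1 = "<h1[^>]*>(.+?)</h1>">
    <cfset s = 'before <h1 class="big">Hello World</h1> after'>
    <!--- Ask for subexpressions: the result is a struct holding pos and len arrays --->
    <cfset m = reFindNoCase(regex1, s, 1, true)>
    <!--- pos[1] is 0 when there is no match at all --->
    <cfif m.pos[1] GT 0>
        <!--- Element 2 of each array describes the first parenthesised subexpression --->
        <cfset h1 = mid(s, m.pos[2], m.len[2])>
        <cfoutput>#h1#</cfoutput>
    </cfif>
    ```

    With the sample string, pos[2] is 24 and len[2] is 11, so this should output Hello World regardless of the class attribute on the tag.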

    It's impossible to tell where you're going wrong, because you didn't post your code...

    --

    Adam

    BreakawayPaul
    Inspiring
    March 17, 2011

    Ok... I see now where I was going wrong.  I was using this rather novice code to pull out the H1 contents:

    <cfset regex1 = "<h1[^>]*>(.+?)</h1>">
    <cfset findspot = refindnocase(regex1,contents,10)>
    <cfset h1 = trim(listlast(listfirst(mid(contents,findspot,250),"<"),">"))>

    Basically finding the starting point of the H1, then chopping out the contents using the > and < as list delimiters.

    It worked great, EXCEPT that I finally noticed that someone stuck spans inside my H1s.  Like this:

    <h1 class="minor"><span class="space">Title</span></h1>

    As you can imagine, that boogered up my plans!

    I've globally removed the span tags, and now my script works.

    Thanks!

    Inspiring
    March 16, 2011

    Regarding parentheses: they are used to create subexpressions.  This is different from "saved for later use".

    See "Using Subexpressions" in http://help.adobe.com/en_US/ColdFusion/9.0/Developing/WSc3ff6d0ea77859461172e0811cbec0a38f-7ffb.html

    One approach you could take would be to create a function that works in two steps.

    Step 1: Find the elements you want using the regular expression you have posted. This will return both the tag and its content.

    Step 2: Remove the outer tags using a second regular expression.  You might use something like Ray Camden's stripHtml function as a starting point.  See: http://cflib.org/udf/stripHTML
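
    The two steps could be sketched roughly as follows.  This is not Ray Camden's stripHTML UDF, only a simplified stand-in: the function name getH1Text is hypothetical, and the second regular expression naively removes anything that looks like a tag.

    ```cfml
    <cffunction name="getH1Text" returntype="string" output="false">
        <cfargument name="contents" type="string" required="true">
        <!--- Step 1: locate the whole <h1>...</h1> element (tag plus content) --->
        <cfset var m = reFindNoCase("<h1[^>]*>(.+?)</h1>", arguments.contents, 1, true)>
        <cfif m.pos[1] EQ 0>
            <!--- No H1 found --->
            <cfreturn "">
        </cfif>
        <!--- Step 2: strip every tag from the matched element, including nested spans --->
        <cfreturn trim(reReplace(mid(arguments.contents, m.pos[1], m.len[1]), "<[^>]*>", "", "all"))>
    </cffunction>

    <cfset sample = '<h1 class="minor"><span class="space">Title</span></h1>'>
    <cfoutput>#getH1Text(sample)#</cfoutput>
    ```

    For the sample above this should output Title, since both the h1 and the nested span tags are removed in step 2.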