Question

Refind returning strange results

March 16, 2011

I'm having to read the <title> and <h1> contents of a group of web pages in order to insert them into a database. My problem is that some of the H1 tags have attributes (like class= or align=). I'm using the following code to pull the H1:


<cfset regex1 = "<h1[^>]*>(.+?)</h1>">

Then feeding that into a refindnocase().

But the result I'm getting isn't the tag contents, but the attribute itself. How can I make the refind ignore the attributes of the H1 and just return the tag contents? The code works great if there are no attributes, but not when there are.

The odd part is that I always thought with a regex, only the things in (parens) were saved for later use. I don't get it.

Adam Cameron.

What reFindNoCase() returns by default is just where the first character of a match starts.

E.g., this code:

<cfset regex1 = "<h1[^>]*>(.+?)</h1>">
<cfset s = 'before <h1 class="big">Hello World</h1> after'>
<cfoutput>#reFindNoCase(regex1, s)#</cfoutput>

Returns 8, which is where the match starts.

If, however, you tell reFindNoCase() to capture subexpressions (by passing true as the fourth argument, returnsubexpressions) and dump the result:

You'll get something like this:

struct
  LEN (array): 1 = 32, 2 = 11
  POS (array): 1 = 8, 2 = 24

Here the second element of the pos array says where "Hello World" starts, and the second element of the len array says how long it is. The first element of each array identifies where the whole match starts and how long it is.
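As a minimal sketch of how those positions can be turned into the tag contents: the variable names match and h1 are only illustrative, and the example assumes the same regex1 and s as above. The fourth argument to reFindNoCase() asks for the pos/len struct, and mid() then cuts out the first subexpression.

```cfml
<cfset regex1 = "<h1[^>]*>(.+?)</h1>">
<cfset s = 'before <h1 class="big">Hello World</h1> after'>

<!--- Fourth argument true: return a struct with pos and len arrays
      instead of just the starting position of the match --->
<cfset match = reFindNoCase(regex1, s, 1, true)>

<!--- pos[1] is 0 when nothing matched, so check before using pos[2] --->
<cfif match.pos[1] GT 0>
    <!--- Element 2 of each array describes the first (parenthesised) subexpression --->
    <cfset h1 = mid(s, match.pos[2], match.len[2])>
    <cfoutput>#h1#</cfoutput>
</cfif>
```

With the sample string above, pos[2] is 24 and len[2] is 11, so this outputs Hello World regardless of the attributes on the H1.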

It's impossible to tell where you're going wrong, because you didn't post your code...

--

Adam

BreakawayPaul (Author)

Ok... I see now where I was going wrong. I was using this rather novice code to pull out the H1 contents:

<cfset regex1 = "<h1[^>]*>(.+?)</h1>">
<cfset findspot = refindnocase(regex1,contents,10)>
<cfset h1 = trim(listlast(listfirst(mid(contents,findspot,250),"<"),">"))>

Basically finding the starting point of the H1, then chopping out the contents using the > and < as list delimiters.

It worked great, EXCEPT that I finally noticed that someone stuck spans inside my H1s. Like this:

<h1 class="minor"><span class="space">Title</span></h1>

As you can imagine, that boogered up my plans!
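To make the failure concrete, here is a rough trace of the list-based approach on both kinds of H1. The variable names plain, nested, and chunk are hypothetical, and the search starts at position 1 rather than 10 because the sample strings are short. It relies on ColdFusion list functions skipping empty elements by default.

```cfml
<cfset regex1 = "<h1[^>]*>(.+?)</h1>">
<cfset plain  = '<h1 class="minor">Title</h1>'>
<cfset nested = '<h1 class="minor"><span class="space">Title</span></h1>'>

<!--- Plain H1: the first "<"-delimited element is  h1 class="minor">Title
      and its last ">"-delimited element is  Title --->
<cfset chunk = mid(plain, reFindNoCase(regex1, plain, 1), 250)>
<cfoutput>#trim(listLast(listFirst(chunk, "<"), ">"))#</cfoutput>

<!--- Nested span: the first "<"-delimited element now ends at the span's "<",
      giving  h1 class="minor">  so the last ">"-delimited element is the
      tag name plus attribute:  h1 class="minor" --->
<cfset chunk = mid(nested, reFindNoCase(regex1, nested, 1), 250)>
<cfoutput>#trim(listLast(listFirst(chunk, "<"), ">"))#</cfoutput>
```

The regex itself matches in both cases; it is the chopping on < and > afterwards that picks up the attribute once another tag follows the opening H1.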

I've globally removed the span tags, and now my script works.

Thanks!

JR__Bob__Dobbs

Regarding parentheses: they are used to create subexpressions. This is different from "saved for later use".

See "Using Subexpressions" in http://help.adobe.com/en_US/ColdFusion/9.0/Developing/WSc3ff6d0ea77859461172e0811cbec0a38f-7ffb.html

One approach you could take would be to create a function that works in two steps.

Step 1: Find the elements you want using the regular expression you have posted. This will return both the tag and its content.

Step 2: Remove the outer tags using a second regular expression. You might use something like Ray Camden's stripHtml function as a starting point. See: http://cflib.org/udf/stripHTML
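The two steps could be sketched roughly as follows. The function name getH1Text is hypothetical, and step 2 uses a simple tag-stripping reReplace() as a stand-in. It is not the stripHTML UDF from cflib.org, which handles more cases.

```cfml
<cffunction name="getH1Text" returntype="string" output="false">
    <cfargument name="html" type="string" required="true">

    <!--- Ask for the pos/len struct rather than just the start position --->
    <cfset var match = reFindNoCase("<h1[^>]*>(.+?)</h1>", arguments.html, 1, true)>
    <cfset var element = "">

    <!--- No H1 found: return an empty string --->
    <cfif match.pos[1] EQ 0>
        <cfreturn "">
    </cfif>

    <!--- Step 1: take the whole match, i.e. the H1 tag and its content --->
    <cfset element = mid(arguments.html, match.pos[1], match.len[1])>

    <!--- Step 2: remove every tag, including nested ones such as span --->
    <cfreturn trim(reReplace(element, "<[^>]*>", "", "all"))>
</cffunction>

<cfoutput>#getH1Text('<h1 class="minor"><span class="space">Title</span></h1>')#</cfoutput>
```

Because step 2 strips every tag inside the match, the nested span case returns Title without having to remove the spans from the source pages first.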
