Hello, all,
I'm working on a project whereby we can run a script that checks all the links of a generated page to make sure that each link (local and off-site) is valid.
As of now (and I'm open to change), I'm using AJAX to call a CFC function that uses CFDIRECTORY to recursively get all .cfm and .htm pages of a site, swaps out the physical drive path of each entry for its FQDN address, loops through the results using CFHTTP to fetch each page, then uses REMatch to get an array of all anchor tags in the page. I am currently looping through the arrays and adding the values as KEY to a struct to eliminate duplicates, then CFRETURN the ArrayToList(StructKeyArray(array),"|"). CF is returning something like:
<a href="https://www.domain.com/index.cfm">Link One</a>|<a id="newlink" href="https://www.google.com">Link Two</a>|<a alt="test" id="testing" href="https://www.another.com/index.asp">Link Three</a>
So how can I get just the value of the href attributes??
V/r,
^ _ ^
Never mind. I managed to get CF to sort out the attribute values and send them as a list.
After inserting the anchor tags into an associative struct with the tags as the key to remove duplicates, I grabbed the StructKeyArray() of that and looped over it using REFindNoCase(), grabbed the position and length of the href attribute, and used MID() to get only the URL.
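<!--- The keys of the de-duplicating struct are the full anchor tags. --->
<cfset ska = StructKeyArray(variables.remDupes) />
<cfloop index="itm" from="1" to="#ArrayLen(variables.ska)#">
<!--- Locate the href attribute; idx.pos[1] and idx.len[1] describe the whole match. --->
<cfset idx = REFindNoCase("href\s*=\s*['""][^'""]+", variables.ska[itm], 1, true) />
<!--- Skip the six characters of href=" so that only the URL remains. --->
<cfset variables.ska[itm] = mid(variables.ska[itm], idx.pos[1] + 6, idx.len[1] - 6) />
</cfloop>
<cfreturn ArrayToList(variables.ska,"|") />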
HTH,
^ _ ^
Although I just realized that if there is any whitespace around the equals sign, like:
href = "https://www.domain.com"
then this won't work, because the fixed offset of 6 assumes exactly href=" in front of the URL. So I have to tweak this a bit.
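One possible tweak, offered only as a sketch: wrap the URL part of the pattern in parentheses and ask REFindNoCase() for subexpressions, so the position and length of the captured group are used instead of a fixed offset. The function name extractHref is hypothetical and does not come from this thread.

```cfml
<!--- Hypothetical helper: returns the href value of one anchor tag, or "" if none is found. --->
<cffunction name="extractHref" returntype="string" output="false">
    <cfargument name="tag" type="string" required="true" />
    <!--- The parentheses capture only the URL, so whitespace around = no longer matters. --->
    <cfset var m = REFindNoCase("href\s*=\s*['""]([^'""]+)", arguments.tag, 1, true) />
    <!--- pos[1]/len[1] describe the whole match; pos[2]/len[2] describe the captured group. --->
    <cfif arrayLen(m.pos) GTE 2 AND m.pos[2] GT 0>
        <cfreturn mid(arguments.tag, m.pos[2], m.len[2]) />
    </cfif>
    <cfreturn "" />
</cffunction>
```

Inside the loop, the MID() line would then become something like `<cfset variables.ska[itm] = extractHref(variables.ska[itm]) />`. This still only handles quoted attribute values, and it would also match an attribute whose name merely ends in href, so treat it as a starting point rather than a complete parser.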
V/r,
^ _ ^
Aaaaaaaaaaaand.. now I learn that there was an easier way to do this. Sigh.
V/r,
^ _ ^
If anyone ever tries something like what I'm doing, save yourself the trouble and just use Ben Nadel's solution. I copied/pasted his CFFUNCTION and placed it in my .cfc, and it works WONDERFULLY. So simple.
HTH,
^ _ ^
I don't have the code at hand at the moment, but there's a Java library called jsoup. You feed it an HTML string (a document, a fragment, whatever) and can query tags as you would in a DOM. No need to care for RegExp at all.
Bardnet wrote
No need to care for RegExp at all.
But I _love_ RegEx. I know, there are a lot of people who are all like "RegEx doesn't work well with HTML because it's.. " whatever. IDC. I use RegEx whenever possible. And I don't even fully understand it, but I still love it and use it whenever I can. I love REMatch(), and REreplaceNoCase(), and REFind(). I wish Adobe would take Ben Nadel's idea and make an actual REMultiMatch() native to the server. (Wish in one hand, spit in the other.. y'know.)
V/r,
^ _ ^
For completeness' sake:
<!--- Requires the jsoup JAR on ColdFusion's class path. --->
<cfset oJsoup = createObject( "java", "org.jsoup.Jsoup" )>
<cfset oFile = createObject( "java", "java.io.File" ).init( expandPath( "./file001.html" ) )>
<!--- Parse the local file as UTF-8 and select every anchor element. --->
<cfset oDoc = oJsoup.parse( oFile, "UTF-8" )>
<cfset arrLinks = oDoc.select( "a" )>
<cfoutput>
<cfloop array="#arrLinks#" index="oEl">
#oEl.attr( "href" )#<br>
</cfloop>
</cfoutput>
jsoup has plenty of parse methods, so the input does not need to be a local file; see the Jsoup class in the jsoup Java HTML Parser 1.11.3 API documentation.
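As a rough sketch of the non-file case, the content fetched by CFHTTP could be handed to jsoup as a string. This assumes the jsoup JAR is on the class path and that the response is text; pageUrl, httpResult and uniqueLinks are placeholder names, not anything from the posts above.

```cfml
<!--- pageUrl is a placeholder; substitute a page from your own site. --->
<cfset pageUrl = "https://www.domain.com/index.cfm">
<cfhttp url="#pageUrl#" method="get" result="httpResult">
<cfset oJsoup = createObject( "java", "org.jsoup.Jsoup" )>
<!--- parse(String html, String baseUri): the base URI lets jsoup resolve relative links. --->
<cfset oDoc = oJsoup.parse( javaCast( "string", httpResult.fileContent ), javaCast( "string", pageUrl ) )>
<!--- Select only anchors that actually carry an href attribute. --->
<cfset arrLinks = oDoc.select( "a[href]" )>
<!--- A struct keyed by URL drops duplicates, as in the original approach. --->
<cfset uniqueLinks = structNew()>
<cfloop array="#arrLinks#" index="oEl">
    <!--- absUrl() returns an empty string when the link cannot be made absolute. --->
    <cfset href = oEl.absUrl( "href" )>
    <cfif len( href )>
        <cfset uniqueLinks[ href ] = true>
    </cfif>
</cfloop>
<cfoutput>#structKeyList( uniqueLinks, "|" )#</cfoutput>
```

Because absUrl() resolves against the base URI, local links such as index.cfm come back as full addresses, which may save the separate path-to-FQDN swap when the list is later checked with CFHTTP. Note that struct keys are case-insensitive, so URLs that differ only in letter case collapse into one entry.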

