
How to? FFIND, regular expressions and double quotes

Folks,

I've been driving myself quietly doolally this morning trying to get a regular expression to work with FFIND. Fundamentally, what I'm trying to do is find hyperlinks containing ampersands that have not been encoded, across just shy of 400 HTML pages (held locally). The regular expression I came up with (which simply finds links with ampersands, without attempting to ignore those that are already encoded as &amp;, a secondary issue) is:

(.*) href=\"(.*)&(.*)\"

and running this through an online REGEX tester that I've used before (albeit it is skewed to use from PHP, and I had to add wrapping forward slashes) gives the result I'm expecting with the contents of the index.html file supplied as the data. Issuing an equivalent FFIND fails, however:
Code:
I:\websites\Badgers\new>ffind /l /p /v /e"(.*) href=\"(.*)&(.*)\"" index.html
Usage : FFIND [/+n /-n /8 /A[[:][-][+]rhsdaecjot] /BC /D[list] /E["xx"] /I"text" /FGIK /L[n] /M /N[dehjs] /O[[:][-]acdeginrsu] /PR /Sn /T"xx" /U /V /W
/X["xx..."] /Y] file...
TCC: (Sys) The system cannot find the path specified.
".*\"" index.html"

I:\websites\Badgers\new>
Perhaps unsurprisingly, since I seem to remember discussions about how to search for strings containing double quotes back in the days when JP Software's support was handled on CompuServe, it looks like the double quotes are problematic. I tried using the TCC escape character before each of the embedded double quotes, but that didn't give the expected result:
Code:
I:\websites\Badgers\new>ffind /l /p /v /e"(.*) href=\^"(.*)&(.*)\^"" index.html

0 lines in 0 files

I:\websites\Badgers\new>
There is one line in that file that matches my desired expression, as evidenced by this output:
Code:
I:\websites\Badgers\new>ffind /l /p /v /e"(.*) href=(.*)&(.*)" index.html

---- I:\websites\Badgers\new\index.html
[115] shake a stick at, the Badgers &lsquo;<a href="hall_of_fame.html">hall of fame</a>&rsquo;, a list of
[200] <a href="http://maps.google.co.uk/maps/ms?msa=0&amp;msid=202456031702105905110.0004bf1c657589aaecc27&amp;z=10" target="_blank">a 2012 version of
the Google Map showing opposition ground locations</a>
[210] <br><p>The <a href="whatsold_2011.html">older What&rsquo;s New entries</a> are
[229] <a class="mTlink" href="club_officers.html">[Club&nbsp;Officers]</a>
[230] <a class="mTlink" href="hall_of_fame.html">[Hall&nbsp;Of&nbsp;Fame]</a>
[233] Copyright ® 2000-2012 <a href="http://homepage.ntlworld.com/steve.pitts/" target="_blank">Steve Pitts</a>/Badgers Cricket Club &ndash; All right
s reserved

6 lines in 1 file

I:\websites\Badgers\new>
where five of the lines are spurious (because the ampersand is outside of the href attribute), but the link to Google on line 200 matches the original regex (albeit that the ampersands are encoded).
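
The difference between the quoted and unquoted expressions can be reproduced outside TCC. The following is only a rough Python sketch: it assumes Python's re module treats these simple constructs the same way FFIND's regex engine does, which has not been checked against FFIND itself. The two sample strings are copied from lines 115 and 200 of the output above.

```python
import re

# Two lines copied from the FFIND output above, keyed by reported line number.
LINES = {
    115: 'shake a stick at, the Badgers &lsquo;<a href="hall_of_fame.html">hall of fame</a>&rsquo;, a list of',
    200: '<a href="http://maps.google.co.uk/maps/ms?msa=0&amp;msid=202456031702105905110.0004bf1c657589aaecc27&amp;z=10" target="_blank">a 2012 version of',
}

# The expression that FFIND accepted (no double quotes).
LOOSE = re.compile(r'(.*) href=(.*)&(.*)')
# The expression originally intended (ampersand must sit before a later double quote).
QUOTED = re.compile(r'(.*) href="(.*)&(.*)"')


def matching_lines(pattern):
    """Return the line numbers whose text matches the pattern anywhere."""
    return [n for n, text in sorted(LINES.items()) if pattern.search(text)]


loose_hits = matching_lines(LOOSE)
quoted_hits = matching_lines(QUOTED)
print(loose_hits, quoted_hits)
```

Without the quotes, the greedy (.*) after href= is free to run past the end of the attribute and pick up an ampersand anywhere later on the line, which is why line 115 shows up.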

So I guess the simple question is, how do I escape double quote characters in the regular expression when using the /E option to FFIND??
 
Here ya go... looks like you gotta do the weird escape thing \^" and also put it in a character class [ ] .
Code:
ffind /l /p /v /e"href=[\^"][^\^"]*&.*?[\^"]" index.html

Edit: I figured out what the "weird escape thing" is doing. The \ is the regex escape that escapes the ^ which is the TCC escape that escapes the " and putting in the [] forces the regex to accept it as a character. Does that make any sense?
By the way, apparently epement and I submitted our responses at almost exactly the same time. I clicked the Post Reply button and when the screen refreshed both of our posts were there. Nice timing.
 
So I guess the simple question is, how do I escape double quote characters in the regular expression when using the /E option to FFIND??

Use the expression \x22. E.g.,
Code:
ffind /l /p /v /e"(.*) href=\x22(.*)&(.*)\x22" index.html

I don't know why you're using capture groups, though. I think you want something like this instead:
Code:
ffind /l /p /v /e"href=\x22[^^\x22]*&(?!amp;)[^^\x22]*\x22" index.html

I've tested it and it works for me.
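
As a cross-check outside TCC, the second expression can be tried in Python. This is a sketch under two assumptions: that TCC collapses the doubled ^^ to a single ^ before the regex engine sees it (so the engine receives [^\x22]), and that Python's re module handles \x22 and the (?!amp;) lookahead the same way FFIND's engine does. The "raw" and "mixed" samples are hypothetical; the other two are copied from lines 200 and 229 of the earlier output.

```python
import re

# The expression as the regex engine would presumably see it, with TCC's
# doubled ^^ collapsed to a single ^ (an assumption about TCC's escaping).
UNENCODED_AMP = re.compile(r'href=\x22[^\x22]*&(?!amp;)[^\x22]*\x22')

samples = {
    # From line 200 of index.html: every ampersand is already &amp;.
    "encoded": '<a href="http://maps.google.co.uk/maps/ms?msa=0&amp;msid=202456031702105905110.0004bf1c657589aaecc27&amp;z=10" target="_blank">',
    # Hypothetical link with a raw ampersand in the URL.
    "raw": '<a href="page.php?a=1&b=2">page</a>',
    # Hypothetical link mixing one encoded and one raw ampersand.
    "mixed": '<a href="page.php?a=1&amp;b=2&c=3">page</a>',
    # From line 229: the ampersand sits outside the href attribute.
    "outside": '<a class="mTlink" href="club_officers.html">[Club&nbsp;Officers]</a>',
}

# Collect the names of the samples the expression flags.
flagged = sorted(name for name, html in samples.items() if UNENCODED_AMP.search(html))
print(flagged)
```

Note that the lookahead only exempts &amp;, so any other entity inside an href value (for example &nbsp;) would also be reported.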
Eric
 
Thanks folks, both options answer the stated question perfectly and Eric has even been kind enough to solve my original requirement. Much appreciated.

I don't know why you're using capture groups, though
Habit, since you ask :) Most of my use of regular expressions has been when programming page scrapers in PHP, when in general I want to capture the runs of 'other' characters. I also find it easier to read things with the brackets separating distinct elements of the syntax. Like most powerful tools, I find that regexes are a pain to remember the syntax for if you don't use them regularly - something that I suspect is also going to apply to the new TPIPE command!?
 
Steve, I can add, because I have a sense of humor about it at the moment, that I also try to avoid using "full" regular expressions as much as possible. In my case, it's due to the combination of a bad memory and the probably half-dozen regular expression "syntaxes" I've had to use in my more than 40 years (I wrote my first program for pay when I was 17) of doing this kind of stuff.

- Dan


 
