Thanks in advance,
Dan
]]>As soon as you’re looking for pairs (), [], {}, <> … in a text containing unlimited nesting of the respective pairs (), [], {}, <>, you absolutely need to use a very powerful feature of PCRE regex: recursive patterns. If you don’t, it is quite impossible to handle an arbitrary nesting depth!
Generally speaking, you automatically create a recursive pattern when you add a call to a group, located INSIDE the very sub-pattern that it refers to.
For instance, let’s suppose the general regex ....(....(?n)....).... and that the group shown is group n. Then the form (?n), located inside group n, is a recursive call to that group n.
Concerning your problem, Daniel, the solution is the regex below:
(^([^()\r\n])*(\(((?2)|(?3))*\))(?2)*\r?\n)+
You may use the PCRE_EXTENDED option (?x) to get the more readable regex:
(?x) ( ^ ([^()\r\n])* ( \( ( (?2) | (?3) )* \) ) (?2)* \r?\n )+
These regexes look for any consecutive sequence of entire lines, including their End of Line character(s), each of which:
Is of the general form ....(.......).....
Contains well-balanced nested pairs () inside the top-level block (....)
So, Daniel, if you leave the Replace field empty and replace all the matches, ONLY the lines where the number of opening round brackets differs from the number of closing round brackets, or lines without any round bracket at all, will remain!
For instance, the regexes above don’t match any of the following lines:
abcdef
abc(def
a(b)def)ghi
a(bc(((d))ef)g
But they do match, in one go, the whole block of the following lines, with well-balanced pairs of ():
abc(de)f
(a(bdef)ghi)
a(bc(((d))e)f)g
a()bc
((ab(cde((fgh)ij)kl))mno)pqr
ab(c(de(fgh(ijk))lm)((()))n()()op)qrs
In short:
The form (?2) is a NON recursive call to the sub-pattern [^()\r\n]
The form (?3) is a recursive call to the sub-pattern \( ( (?2) | (?3) )* \)
The anchor ^ at the beginning, and \r?\n at the end, make the regex cover an entire line, which can be repeated thanks to the final + sign, applying to group 1
The opening and closing round brackets need to be escaped, as \( and \). Just notice that escaping round brackets inside the character class [....], at the beginning of the regex, is not mandatory!
Inside the block \(....\), the regex looks for any sequence, even empty, of:
(?2) Characters different from round brackets and from End of Line characters, OR
(?3) Other nested blocks of round brackets (....), and so on…
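To make the recursion above more concrete, here is a minimal Python sketch of the same idea written as plain functions. It is only an illustration, not the regex itself: Python's built-in re module does not support recursive calls such as (?3), so the function names skip_plain, match_block and is_well_formed are made up for this example, and each line is assumed to be passed in without its End of Line characters.

```python
def skip_plain(line, i):
    """Mirror of a repeated group 2: skip characters that are not round brackets."""
    while i < len(line) and line[i] not in "()":
        i += 1
    return i


def match_block(line, i):
    """Mirror of group 3: match one balanced block '(' ... ')' starting at index i.

    Returns the index just past the closing bracket, or -1 when no
    balanced block starts at i.
    """
    if i >= len(line) or line[i] != "(":
        return -1
    i += 1
    while True:
        i = skip_plain(line, i)            # plays the role of (?2)
        if i < len(line) and line[i] == "(":
            i = match_block(line, i)       # recursive call, like (?3)
            if i == -1:
                return -1
        else:
            break
    if i < len(line) and line[i] == ")":
        return i + 1
    return -1


def is_well_formed(line):
    """True when the line has the shape  text(balanced block)text."""
    i = skip_plain(line, 0)
    i = match_block(line, i)
    if i == -1:
        return False
    return skip_plain(line, i) == len(line)


if __name__ == "__main__":
    for candidate in ["abc(def", "a(b)def)ghi", "abc(de)f", "a(bc(((d))e)f)g"]:
        print(candidate, is_well_formed(candidate))
```

The function match_block calls itself exactly where the regex contains (?3), which is why any nesting depth is handled without listing the levels explicitly.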
I’ll gladly give further information about the recursion concept if anyone needs it!
Best regards,
guy038
P.S.
To finish, here is another regex, with a recursive pattern (?2), which can match the general case of the string ....(.........)............(..)...(....)...........
This regex, below, matches the longest sequence of characters, even across several lines, which contains as many opening round brackets as closing round brackets, with well-nested and/or juxtaposed blocks (....):
([^()]*(\(([^()]|(?2))*\))[^()]*)+
With the PCRE_EXTENDED option (?x), we get the regex:
(?x) ( [^()]* ( \( ( [^()] | (?2) )* \) ) [^()]* )+
And if you don’t intend to use group 1 in the replacement part, with the backreference \1, you may turn group 1 into a non-capturing group, with the syntax ?:, as in:
(?:[^()]*(\(([^()]|(?1))*\))[^()]*)+
(?x) (?: [^()]* ( \( ( [^()] | (?1) )* \) ) [^()]* )+
Of course, because the first group is now non-capturing, the former recursive group 2 becomes the recursive group 1.
]]>The insight here is that if you remove everything except parentheses, then all good lines will look the same. Since they all look the same, they can be replaced with nothing. That leaves only bad lines.
Replace [a-z0-9\.,-] with nothing.
Replace ^\(\(\)\(\)\(\)\)$ with nothing.
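The same two steps can be tried outside the editor. Below is a minimal Python sketch, assuming the file content is available as a string; the function name bad_lines and the damaged sample line are invented for the illustration. Instead of replacing good lines with nothing, it reports the line numbers whose parenthesis skeleton differs from the expected one.

```python
import re

# Step 1: everything that is not a parenthesis in this data set.
STRIP = re.compile(r"[a-z0-9.,-]")
# Step 2: the skeleton every good line is expected to reduce to.
GOOD_SHAPE = "(()()())"


def bad_lines(text):
    """Return (line number, skeleton) for lines whose skeleton is not GOOD_SHAPE."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        skeleton = STRIP.sub("", line)
        if skeleton != GOOD_SHAPE:
            bad.append((number, skeleton))
    return bad


if __name__ == "__main__":
    good = ("g(ccreekm1,xyh(9358.6227238,9897.5418358,673.91697223),"
            "dt(3,23,7,32,52,45),l3grd,"
            "xyh(-0.58809487458,0.80686194848,0.055840975887),1.4,1,1)")
    # Made-up damaged copy: the ')' closing dt(...) has been removed.
    damaged = good.replace("45),", "45,")
    print(bad_lines(good + "\n" + damaged + "\n"))
```

Reporting line numbers rather than deleting text keeps the original file untouched, which may be preferable with 100000 lines.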
Here’s a snippet from the file:
g(ccreekm1,xyh(9358.6227238,9897.5418358,673.91697223),dt(3,23,7,32,52,45),l3grd,xyh(-0.58809487458,0.80686194848,0.055840975887),1.4,1,1)
g(ccreekm1,xyh(9357.6569034,9898.7712656,673.98295962),dt(3,23,7,32,53,48),l3grd,xyh(-0.59079549957,0.80517318715,0.051544314686),1.4,1,1)
g(ccreekm1,xyh(9356.7105651,9900.0235405,674.01411243),dt(3,23,7,32,54,49),l3grd,xyh(-0.60135170686,0.79870482939,0.021135755628),1.4,1,1)
g(ccreekm1,xyh(9355.7366367,9901.5794355,674.00311243),dt(3,23,7,32,55,54),l3grd,xyh(-0.53407133705,0.84541817075,-0.0059936220705),1.4,1,1)
g(ccreekm1,xyh(9355.0445335,9903.0878211,674.04228009),dt(3,23,7,32,56,46),l3grd,xyh(-0.48097880186,0.87669403229,0.008183270413),1.4,1,1)
g(ccreekm1,xyh(9354.7094565,9904.8417723,674.03229694),dt(3,23,7,32,57,47),l3grd,xyh(-0.30673979981,0.95175419176,0.0086402362665),1.4,1,1)
g(ccreekm1,xyh(9353.94977,9906.3023111,674.01629694),dt(3,23,7,32,58,50),l3grd,xyh(-0.45809271816,0.8888512565,-0.0097213881814),1.4,1,1)
g(ccreekm1,xyh(9353.568582,9907.9593727,674.01729694),dt(3,23,7,32,59,51),l3grd,xyh(-0.2294316975,0.97332458606,0.00058851543608),1.4,1,1)
g(ccreekm1,xyh(9353.065188,9909.7433868,673.92329694),dt(3,23,7,33,0,54),l3grd,xyh(-0.27037942404,0.96142079994,-0.050645952426),1.4,1,1)
Multiply that by 100000 lines, and somewhere there is a missing right parenthesis. Can you help?
Dan
]]>1\.4,1,1[^\)]$
Dot has a special meaning for regular expressions, so it must be escaped with a backslash. Same with parentheses.
[^)] is a so-called negated character class and means “any character except a right parenthesis”.
$ means “match end of line”.
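As a rough way to try this pattern outside the editor, here is a small Python sketch. It rests on one assumption that is not stated above: each line is tested together with its line ending, so that the negated class can consume the End of Line character when the final parenthesis is missing. The function name suspicious_lines and the shortened sample lines are made up.

```python
import re

# Escaped dot, negated class, end-of-line anchor, as explained above.
PATTERN = re.compile(r"1\.4,1,1[^\)]$")


def suspicious_lines(text):
    """Return 1-based numbers of lines where '1.4,1,1' is followed by a
    single character other than ')' at the very end of the line."""
    hits = []
    # keepends=True keeps '\n' (or '\r\n') attached to each line.
    for number, line in enumerate(text.splitlines(keepends=True), start=1):
        if PATTERN.search(line):
            hits.append(number)
    return hits


if __name__ == "__main__":
    sample = (
        "g(a,xyh(1,2,3),1.4,1,1)\n"
        "g(a,xyh(1,2,3),1.4,1,1\n"
        "g(a,xyh(1,2,3),1.4,1,1)\n"
    )
    print(suspicious_lines(sample))
```

Note that a last line without any line ending has no character left for [^\)] to match, so it would need a separate check.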