Revision 1 as of 2009-02-16 17:22:58


The _findre primitive function, which was introduced in release/12.pre13/13, is really just the core matching capability. Using it often requires some extra code to do things like extract the matched sub-string.

There are some functions in bridges/generics which make it easier to get things done with _findre. (This page describes the functions in bridges/generics/11.)


You've called _findre and it returned a match. Now you have a list of lists of integers. How do you get the text that matched the regular expression? You could pull out the integers you need and call _sub, or you could just use ./generic/re_match_text:

    txt = "http://www.vestasys.org/doc/sdl-ref/primitive-functions/";
    match = _findre(txt, "/\\w+/");
    match_text = ./generic/re_match_text(txt, match);
    // match_text == "/doc/"

Or maybe you want the text of just one parenthesized sub-expression within the match. ./generic/re_match_text has an optional third argument for which piece of the match you want.

    txt = "http://www.vestasys.org/doc/sdl-ref/primitive-functions/";
    match = _findre(txt, "/(\\w+)/");
    match_text = ./generic/re_match_text(txt, match, 1);
    // match_text == "doc"

What if you ask for a parenthesized sub-expression that matched nothing? (This can happen if it's on the side of an alternation not taken during the match.) ./generic/re_match_text will return FALSE in that case by default:

    txt = "foo %bar";
    match = _findre(txt, "(%(\\w+)|\\w+)\\s*");
    match_text = ./generic/re_match_text(txt, match, 2);
    // match_text == FALSE

However, it also takes an optional fourth argument specifying what to return when there is no such match:

    txt = "foo %bar";
    match = _findre(txt, "(%(\\w+)|\\w+)\\s*");
    match_text = ./generic/re_match_text(txt, match, 2, "NOMATCH!");
    // match_text == "NOMATCH!"
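For readers who know Python's re module better than SDL, the behavior described above can be sketched roughly as follows. This is only an analogue and not the Vesta implementation: the name match_text and its parameters are hypothetical, and a Python match object stands in for the list of lists of integers that _findre returns.

```python
import re

def match_text(text, match, index=0, default=False):
    """Return the text of group `index` of `match`, or `default`
    if that group took no part in the match (hypothetical helper)."""
    start, end = match.span(index)
    # Python reports (-1, -1) for a group that matched nothing.
    if start < 0:
        return default
    return text[start:end]

txt = "foo %bar"
m = re.search(r"(%(\w+)|\w+)\s*", txt)
whole = match_text(txt, m)                    # the entire match
missing = match_text(txt, m, 2)               # group on the branch not taken
fallback = match_text(txt, m, 2, "NOMATCH!")  # same, with a default supplied
```

Slicing the original text by the positions in the match mirrors the "pull out the integers and call _sub" approach that ./generic/re_match_text saves you from writing by hand.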


Suppose that you need to split up a piece of text into a list and you have a regular expression that matches the separators between the pieces you're interested in. ./generic/re_splt_text can help.

    txt = "http://www.vestasys.org/doc/sdl-ref/primitive-functions/";
    word_list = ./generic/re_splt_text(txt, "\\W+");
    /*
    word_list ==
      < "http",
        "www",
        "vestasys",
        "org",
        "doc",
        "sdl",
        "ref",
        "primitive",
        "functions",
        "" >
    */

Note that if the regular expression matches right at the beginning of the string or right at the end of the string, you will get empty strings for the portions before or after those matches.

Also note that ./generic/re_splt_text is a recursive function. Deep recursion can cause a fatal error if the stack overflows, so you should use caution when applying ./generic/re_splt_text to very large text strings.
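Both notes can be illustrated with a rough Python analogue. This is a sketch under assumptions, not the actual SDL code: split_text is a hypothetical name, and it assumes the function recurses once per separator found and that the pattern never matches the empty string.

```python
import re

def split_text(text, pattern):
    """Split `text` at each match of `pattern`, recursively (hypothetical sketch)."""
    m = re.search(pattern, text)
    if m is None:
        # No more separators: the rest of the text is the last piece.
        return [text]
    # Keep the piece before the separator, then recurse on what follows it.
    return [text[:m.start()]] + split_text(text[m.end():], pattern)

pieces = split_text("/a/b/", r"\W+")
# The separators at both ends produce empty strings at both ends of the list.
```

In this sketch the recursion depth grows with the number of separators in the text, which is why a very large input can exhaust the stack.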


Once you've matched a regular expression, you may want to modify the original text by replacing the match with something else. ./generic/re_substitute is the function to help you with that.

    txt = "http://www.vestasys.org/doc/sdl-ref/primitive-functions/";
    new_txt = ./generic/re_substitute(txt, "\\W+", ",");
    // new_txt == "http,www.vestasys.org/doc/sdl-ref/primitive-functions/"

If you want to replace all matches, rather than just the first one, pass TRUE for the optional fourth parameter:

    txt = "http://www.vestasys.org/doc/sdl-ref/primitive-functions/";
    new_txt = ./generic/re_substitute(txt, "\\W+", ",", TRUE);
    // new_txt == "http,www,vestasys,org,doc,sdl,ref,primitive,functions,"

Suppose that rather than replacing a match with fixed text, you want to replace it with the text which matched a particular parenthesized sub-expression of the regular expression. Just pass an integer for the replacement:

    txt = "#:foo:# bar #:blah:# mog";
    new_txt = ./generic/re_substitute(txt, "#:(\\w+):#", 1, TRUE);
    // new_txt == "foo bar blah mog"

What if you want the replacement text to be made up of fixed text combined with the text matching one or more parenthesized sub-expressions? You can pass a list for the replacement:

    txt = "abc:123 def:456 ghi:789";
    new_txt = ./generic/re_substitute(txt, "(\\w+):(\\d+)", <2, "=", 1>, TRUE);
    // new_txt == "123=abc 456=def 789=ghi"

Note that you can use the integer zero for the entire match. You can use this to preserve the matched text and add your own:

    txt = "http://www.vestasys.org/doc/sdl-ref/primitive-functions/";
    new_txt = ./generic/re_substitute(txt, "\\w+", <"{", 0, "}">, TRUE);
    // new_txt == "{http}://{www}.{vestasys}.{org}/{doc}/{sdl}-{ref}/{primitive}-{functions}/"


In addition to text strings and integers, you can pass functions to be used for replacement. On each match, the function will be called with the original text and the result of _findre. The function can use those (e.g. with ./generic/re_match_text) and determine the replacement text however it likes.

Here's an example:

    /**nocache**/
    my_subst(orig_txt, match)
    {
      match_1 = ./generic/re_match_text(orig_txt, match, 1);
      // If the first sub-expression was "foo", replace with "bar"
      return if (match_1 == "foo") then "bar"
             // ...otherwise leave the match the same
             else ./generic/re_match_text(orig_txt, match);
    };
    txt = "%abc %foo %xyz";
    new_txt = ./generic/re_substitute(txt, "%(\\w+)", my_subst, TRUE);
    // new_txt == "%abc bar %xyz"

Note that you can put a function such as this in a list passed for a substitution along with other pieces (just like text strings and integers).
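The four kinds of replacement described so far (text, integer, function, and a list mixing them) can be modeled in a few lines of Python. This is a rough analogue for illustration, not how ./generic/re_substitute is written: the names expand and substitute are hypothetical, and treating an unmatched sub-expression as an empty string is an assumption, not documented behavior.

```python
import re

def expand(piece, text, match):
    """Turn one replacement piece into text for a given match (hypothetical)."""
    if isinstance(piece, str):
        return piece                        # fixed text
    if isinstance(piece, int):
        # 0 is the entire match, 1.. are parenthesized sub-expressions.
        # Assumption: an unmatched sub-expression contributes nothing.
        return match.group(piece) or ""
    if callable(piece):
        return piece(text, match)           # function of (original text, match)
    # Otherwise treat it as a list of pieces and concatenate them.
    return "".join(expand(p, text, match) for p in piece)

def substitute(text, pattern, replacement, replace_all=False):
    """Replace the first match of `pattern`, or every match if replace_all."""
    count = 0 if replace_all else 1         # re.sub treats count=0 as "all"
    return re.sub(pattern, lambda m: expand(replacement, text, m), text, count=count)

swapped = substitute("abc:123 def:456 ghi:789", r"(\w+):(\d+)", [2, "=", 1], True)
```

Because a list is expanded piece by piece, a function can sit in the list next to strings and integers, as noted above.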


Perhaps a more realistic use of a function passed to ./generic/re_substitute is for variable substitution. A portion of the match could be looked up in a binding, and the associated value could be substituted.

    my_vars = [foo="abc", bar="xyz"];
    /**nocache**/
    my_subst(orig_txt, match)
    {
      match_1 = ./generic/re_match_text(orig_txt, match, 1);
      // If this is one of our variables, replace it...
      return if my_vars!$match_1 then my_vars/$match_1
             // ...otherwise leave the match the same
             else ./generic/re_match_text(orig_txt, match);
    };
    txt = "%foo, %bar";
    new_txt = ./generic/re_substitute(txt, "%(\\w+)", my_subst, TRUE);
    // new_txt == "abc, xyz"

To make this a little easier, ./generic/re_replace_by_binding can generate such a function for you. Its arguments are the index of the sub-expression to use as the name and the binding of replacements.

    my_vars = [foo="abc", bar="xyz"];
    my_subst = ./generic/re_replace_by_binding(1, my_vars);
    txt = "%foo, %bar";
    new_txt = ./generic/re_substitute(txt, "%(\\w+)", my_subst, TRUE);
    // new_txt == "abc, xyz"

As in the hand-written example above, the function created by ./generic/re_replace_by_binding will leave the original text unmodified if there is no corresponding name in the binding.

    my_vars = [foo="abc", bar="xyz"];
    my_subst = ./generic/re_replace_by_binding(1, my_vars);
    txt = "%foo, %mog, %bar";
    new_txt = ./generic/re_substitute(txt, "%(\\w+)", my_subst, TRUE);
    // new_txt == "abc, %mog, xyz"
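The idea of generating the replacement function from a binding is an ordinary closure. A minimal Python sketch of the same idea follows. The name replace_by_binding is hypothetical here, a dict stands in for the SDL binding, and Python's re.sub stands in for ./generic/re_substitute.

```python
import re

def replace_by_binding(index, binding):
    """Build a replacement function that looks up group `index` in `binding`."""
    def subst(text, match):
        name = match.group(index)
        if name in binding:
            return binding[name]    # known name: substitute its value
        return match.group(0)       # unknown name: leave the match as it was
    return subst

my_vars = {"foo": "abc", "bar": "xyz"}
my_subst = replace_by_binding(1, my_vars)
txt = "%foo, %mog, %bar"
new_txt = re.sub(r"%(\w+)", lambda m: my_subst(txt, m), txt)
```

The returned function keeps the same (original text, match) calling convention as a hand-written replacement function, so the two are interchangeable.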