@@ -20,17 +20,20 @@ expressions will allow you to manipulate text with surprising ease.
2020What is a regular expression? At its most basic, a regular expression
2121is a template that is used to determine if a string has certain
2222characteristics. The string is most often some text, such as a line,
23- sentence, web page, or even a whole book, but less commonly it could be
24- some binary data as well.
23+ sentence, web page, or even a whole book, but it doesn't have to be. It
24+ could be binary data, for example. Biologists often use Perl to look
25+ for patterns in long DNA sequences.
26+
2527Suppose we want to determine if the text in the variable C<$var> contains
2628the sequence of characters S<C<m u s h r o o m>>
2729(blanks added for legibility). We can write in Perl
2830
2931 $var =~ m/mushroom/
3032
3133The value of this expression will be TRUE if C<$var> contains that
32- sequence of characters, and FALSE otherwise. The portion enclosed in
33- C<'E<sol>'> characters denotes the characteristic we are looking for.
34+ sequence of characters anywhere within it, and FALSE otherwise. The
35+ portion enclosed in C<'E<sol>'> characters denotes the characteristic we
36+ are looking for.
3437We use the term I<pattern> for it. The process of looking to see if the
3538pattern occurs in the string is called I<matching>, and the C<"=~">
3639operator along with the C<m//> tell Perl to try to match the pattern
@@ -60,7 +63,7 @@ many examples. The first part of the tutorial will progress from the
6063simplest word searches to the basic regular expression concepts. If
6164you master the first part, you will have all the tools needed to solve
6265about 98% of your needs. The second part of the tutorial is for those
63- comfortable with the basics and hungry for more power tools. It
66+ comfortable with the basics, and hungry for more power tools. It
6467discusses the more advanced regular expression operators and
6568introduces the latest cutting-edge innovations.
6669
@@ -135,7 +138,7 @@ And finally, the C<//> default delimiters for a match can be changed
135138to arbitrary delimiters by putting an C<'m'> out front:
136139
137140 "Hello World" =~ m!World!; # matches, delimited by '!'
138- "Hello World" =~ m{World}; # matches, note the matching '{}'
141+ "Hello World" =~ m{World}; # matches, note the paired '{}'
139142 "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
140143 # '/' becomes an ordinary char
141144
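As a minimal sketch of the delimiter choices above, the following runs the three matches from the example and reports each result. The C<@checks> array and its labels are illustrative names, not part of the tutorial.

```perl
use strict;
use warnings;

# Try the same kind of literal match with three different delimiters.
# scalar() forces a true/false result instead of a list of captures.
my @checks = (
    [ 'm!World!', scalar( "Hello World"   =~ m!World! ) ],
    [ 'm{World}', scalar( "Hello World"   =~ m{World} ) ],
    [ 'm"/perl"', scalar( "/usr/bin/perl" =~ m"/perl" ) ],
);

for my $check (@checks) {
    my ( $label, $ok ) = @$check;
    printf "%-9s %s\n", $label, $ok ? "matches" : "doesn't match";
}
```

All three lines should report a match, since the delimiter only changes how the pattern is quoted, not what it matches.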
@@ -151,7 +154,7 @@ Let's consider how different regexps would match C<"Hello World">:
151154 "Hello World" =~ /oW/; # doesn't match
152155 "Hello World" =~ /World /; # doesn't match
153156
154- The first regexp C<world> doesn't match because regexps are
157+ The first regexp C<world> doesn't match because regexps are by default
155158case-sensitive. The second regexp matches because the substring
156159S<C<'o W'>> occurs in the string S<C<"Hello World">>. The space
157160character C<' '> is treated like any other character in a regexp and is
@@ -169,8 +172,8 @@ always match at the earliest possible point in the string:
169172 "That hat is red" =~ /hat/; # matches 'hat' in 'That'
170173
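One way to see the "earliest possible point" rule in action is to ask Perl where the match began. This sketch assumes the built-in C<@-> array, whose first element holds the start offset of the last successful match; the variable names are illustrative.

```perl
use strict;
use warnings;

my $text   = "That hat is red";
my $offset = -1;    # stays -1 if nothing matches

if ( $text =~ /hat/ ) {
    # $-[0] holds the offset where the whole match begins.
    $offset = $-[0];
}

# The 'hat' inside 'That' starts at offset 1, before the word 'hat' at 5.
print "match starts at offset $offset\n";
```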
171174With respect to character matching, there are a few more points you
172- need to know about. First of all, not all characters can be used "as
173- is" in a match. Some characters, called I<metacharacters>, are
175+ need to know about. First of all, not all characters can be used
176+ "as is" in a match. Some characters, called I<metacharacters>, are
174177generally reserved for use in regexp notation. The metacharacters are
175178
176179 {}[]()^$.|*+?-#\
@@ -832,8 +835,8 @@ Counting the opening parentheses to get the correct number for a
832835backreference is error-prone as soon as there is more than one
833836capturing group. A more convenient technique became available
834837with Perl 5.10: relative backreferences. To refer to the immediately
835- preceding capture group one now may write C<\g{-1}>, the next but
836- last is available via C<\g{-2}>, and so on.
838+ preceding capture group one now may write C<\g-1> or C<\g{-1}>, the next but
839+ last is available via C<\g-2> or C<\g{-2}>, and so on.
837840
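A small sketch of a relative backreference, assuming Perl 5.10 or later: the pattern below looks for a doubled word, using C<\g{-1}> to refer to the capture group immediately before it. The sample strings and the C<$doubled> and C<%repeated> names are made up for illustration.

```perl
use strict;
use warnings;

# \g{-1} refers to the immediately preceding capture group, here (\w+),
# so the pattern matches a word followed by whitespace and the same word.
my $doubled = qr/\b(\w+)\s+\g{-1}\b/;

my %repeated;    # sample text => repeated word, or undef if none
for my $text ( "this is is a test", "no repeats here" ) {
    $repeated{$text} = $text =~ $doubled ? $1 : undef;
    if ( defined $repeated{$text} ) {
        print "'$text' repeats '$repeated{$text}'\n";
    }
    else {
        print "'$text' has no doubled word\n";
    }
}
```

Because the reference is relative, the pattern keeps working if it is later embedded in a larger regexp that adds capture groups in front of it.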
838841Another good reason in addition to readability and maintainability
839842for using relative backreferences is illustrated by the following example,
@@ -1989,10 +1992,11 @@ C<\x>I<XY> (without curly braces and I<XY> are two hex digits) doesn't
19891992go further than 255. (Starting in Perl 5.14, if you're an octal fan,
19901993you can also use C<\o{oct}>.)
19911994
1992- /\x{263a}/; # match a Unicode smiley face :)
1995+ /\x{263a}/; # match a Unicode smiley face :)
1996+ /\x{ 263a }/; # Same
19931997
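As a rough illustration of the C<\x{...}> escape, this sketch builds a string containing U+263A with C<chr> and tests it against the pattern above. The variable names are illustrative, and the character itself is deliberately not printed, to sidestep output-encoding concerns.

```perl
use strict;
use warnings;

# Build a string ending in U+263A (WHITE SMILING FACE) by code point.
my $text = "I am happy " . chr(0x263a);

# The pattern matches the code point itself, not the ASCII emoticon.
my $found_smiley = $text =~ /\x{263a}/            ? 1 : 0;
my $found_ascii  = "I am happy :)" =~ /\x{263a}/  ? 1 : 0;

print "smiley string: $found_smiley, ASCII emoticon: $found_ascii\n";
```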
19941998B<NOTE>: In Perl 5.6.0 it used to be that one needed to say C<use
1995- utf8> to use any Unicode features. This is no more the case: for
1999+ utf8> to use any Unicode features. This is no longer the case: for
19962000almost all Unicode processing, the explicit C<utf8> pragma is not
19972001needed. (The only case where it matters is if your Perl script is in
19982002Unicode and encoded in UTF-8, then an explicit C<use utf8> is needed.)
@@ -2070,16 +2074,16 @@ C<\p{Mark}>, meaning things like accent marks.
20702074
20712075The Unicode C<\p{Script}> and C<\p{Script_Extensions}> properties are
20722076used to categorize every Unicode character into the language script it
2073- is written in. (C<Script_Extensions> is an improved version of
2074- C<Script>, which is retained for backward compatibility, and so you
2075- should generally use C<Script_Extensions>.)
2076- For example,
2077+ is written in. For example,
20772078English, French, and a bunch of other European languages are written in
20782079the Latin script. But there is also the Greek script, the Thai script,
2079- the Katakana script, I<etc>. You can test whether a character is in a
2080- particular script (based on C<Script_Extensions>) with, for example
2081- C<\p{Latin}>, C<\p{Greek}>, or C<\p{Katakana}>. To test if it isn't in
2082- the Balinese script, you would use C<\P{Balinese}>.
2080+ the Katakana script, I<etc>. (C<Script> is an older, less advanced
2081+ form of C<Script_Extensions>, retained only for backwards
2082+ compatibility.) You can test whether a character is in a particular
2083+ script with, for example, C<\p{Latin}>, C<\p{Greek}>, or
2084+ C<\p{Katakana}>. To test if it isn't in the Balinese script, you would
2085+ use C<\P{Balinese}>. (These all use C<Script_Extensions> under the
2086+ hood, as that gives better results.)
20832087
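The script tests described above can be sketched as a small classifier. The C<script_of> helper and the sample characters are hypothetical, chosen only for illustration; the code points are written as C<\x{...}> escapes so the source stays ASCII.

```perl
use strict;
use warnings;

# Hypothetical helper: name the first of three scripts the character is in.
sub script_of {
    my ($ch) = @_;
    return 'Latin'    if $ch =~ /\p{Latin}/;
    return 'Greek'    if $ch =~ /\p{Greek}/;
    return 'Katakana' if $ch =~ /\p{Katakana}/;
    return 'other';
}

my @samples = (
    [ 'LATIN CAPITAL LETTER A', "A" ],
    [ 'GREEK SMALL LETTER ALPHA', "\x{3b1}" ],
    [ 'KATAKANA LETTER A',        "\x{30a2}" ],
);

for my $sample (@samples) {
    my ( $name, $ch ) = @$sample;
    printf "%-26s %s\n", $name, script_of($ch);
}

# \P{...} (capital P) negates the test: 'A' is not in the Balinese script.
print "A is not Balinese\n" if "A" =~ /\P{Balinese}/;
```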
20842088What we have described so far is the single form of the C<\p{...}> character
20852089classes. There is also a compound form which you may run into. These
@@ -2459,7 +2463,7 @@ substring delimited by parentheses. The problem with this regexp is
24592463that it is pathological: it has nested indeterminate quantifiers
24602464of the form C<(a+|b)+>. We discussed in Part 1 how nested quantifiers
24612465like this could take an exponentially long time to execute if there
2462- was no match possible. To prevent the exponential blowup, we need to
2466+ is no match possible. To prevent the exponential blowup, we need to
24632467prevent useless backtracking at some point. This can be done by
24642468enclosing the inner quantifier as an independent subexpression:
24652469
@@ -2645,8 +2649,8 @@ section L</"Pragmas and debugging"> below.
26452649
26462650More fun with C<?{}>:
26472651
2648- $x =~ /(?{print "Hi Mom!";})/; # matches,
2649- # prints 'Hi Mom!'
2652+ $x =~ /(?{print "Hi Mom!";})/; # matches,
2653+ # prints 'Hi Mom!'
26502654 $x =~ /(?{$c = 1;})(?{print "$c";})/; # matches,
26512655 # prints '1'
26522656 $x =~ /(?{$c = 1;})(?{print "$^R";})/; # matches,