Commit 05423e5

khwilliamson authored and jkeenan committed
perlretut: Grammar, clarifications, white-space
1 parent a19f662 commit 05423e5

1 file changed, +28 -24 lines changed

pod/perlretut.pod

Lines changed: 28 additions & 24 deletions
@@ -20,17 +20,20 @@ expressions will allow you to manipulate text with surprising ease.
 What is a regular expression? At its most basic, a regular expression
 is a template that is used to determine if a string has certain
 characteristics. The string is most often some text, such as a line,
-sentence, web page, or even a whole book, but less commonly it could be
-some binary data as well.
+sentence, web page, or even a whole book, but it doesn't have to be. It
+could be binary data, for example. Biologists often use Perl to look
+for patterns in long DNA sequences.
+
 Suppose we want to determine if the text in variable, C<$var> contains
 the sequence of characters S<C<m u s h r o o m>>
 (blanks added for legibility). We can write in Perl

     $var =~ m/mushroom/

 The value of this expression will be TRUE if C<$var> contains that
-sequence of characters, and FALSE otherwise. The portion enclosed in
-C<'E<sol>'> characters denotes the characteristic we are looking for.
+sequence of characters anywhere within it, and FALSE otherwise. The
+portion enclosed in C<'E<sol>'> characters denotes the characteristic we
+are looking for.
 We use the term I<pattern> for it. The process of looking to see if the
 pattern occurs in the string is called I<matching>, and the C<"=~">
 operator along with the C<m//> tell Perl to try to match the pattern
@@ -60,7 +63,7 @@ many examples. The first part of the tutorial will progress from the
 simplest word searches to the basic regular expression concepts. If
 you master the first part, you will have all the tools needed to solve
 about 98% of your needs. The second part of the tutorial is for those
-comfortable with the basics and hungry for more power tools. It
+comfortable with the basics, and hungry for more power tools. It
 discusses the more advanced regular expression operators and
 introduces the latest cutting-edge innovations.

@@ -135,7 +138,7 @@ And finally, the C<//> default delimiters for a match can be changed
 to arbitrary delimiters by putting an C<'m'> out front:

     "Hello World" =~ m!World!; # matches, delimited by '!'
-    "Hello World" =~ m{World}; # matches, note the matching '{}'
+    "Hello World" =~ m{World}; # matches, note the paired '{}'
     "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
     # '/' becomes an ordinary char

@@ -151,7 +154,7 @@ Let's consider how different regexps would match C<"Hello World">:
     "Hello World" =~ /oW/; # doesn't match
     "Hello World" =~ /World /; # doesn't match

-The first regexp C<world> doesn't match because regexps are
+The first regexp C<world> doesn't match because regexps are by default
 case-sensitive. The second regexp matches because the substring
 S<C<'o W'>> occurs in the string S<C<"Hello World">>. The space
 character C<' '> is treated like any other character in a regexp and is
@@ -169,8 +172,8 @@ always match at the earliest possible point in the string:
     "That hat is red" =~ /hat/; # matches 'hat' in 'That'

 With respect to character matching, there are a few more points you
-need to know about. First of all, not all characters can be used "as
-is" in a match. Some characters, called I<metacharacters>, are
+need to know about. First of all, not all characters can be used
+"as-is" in a match. Some characters, called I<metacharacters>, are
 generally reserved for use in regexp notation. The metacharacters are

     {}[]()^$.|*+?-#\
@@ -832,8 +835,8 @@ Counting the opening parentheses to get the correct number for a
 backreference is error-prone as soon as there is more than one
 capturing group. A more convenient technique became available
 with Perl 5.10: relative backreferences. To refer to the immediately
-preceding capture group one now may write C<\g{-1}>, the next but
-last is available via C<\g{-2}>, and so on.
+preceding capture group one now may write C<\g-1> or C<\g{-1}>, the next but
+last is available via C<\g-2> or C<\g{-2}>, and so on.

 Another good reason in addition to readability and maintainability
 for using relative backreferences is illustrated by the following example,
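
Note on this hunk: the two spellings it documents, C<\g-1> and C<\g{-1}>, can be compared with a small script. This is only a sketch for illustration and is not part of the commit; the helper names C<is_repeated_pair> and C<is_repeated_pair_unbraced> and the sample strings are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: true if a four-character string is a two-character
# pair followed by the same pair again, e.g. 'abab'.
# \g{-2} refers to the second-most-recent capture group before it,
# \g{-1} to the most recent one.
sub is_repeated_pair {
    my ($s) = @_;
    return $s =~ /^(.)(.)\g{-2}\g{-1}$/ ? 1 : 0;
}

# Same test written with the brace-less spelling.
sub is_repeated_pair_unbraced {
    my ($s) = @_;
    return $s =~ /^(.)(.)\g-2\g-1$/ ? 1 : 0;
}

# Sample strings chosen only for illustration.
for my $s ('abab', 'abba', 'abcd') {
    printf "%s: braced=%d unbraced=%d\n",
        $s, is_repeated_pair($s), is_repeated_pair_unbraced($s);
}
```

Both helpers should agree on every input; the braced form is mainly useful when the backreference is directly followed by a digit.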
@@ -1989,10 +1992,11 @@ C<\x>I<XY> (without curly braces and I<XY> are two hex digits) doesn't
 go further than 255. (Starting in Perl 5.14, if you're an octal fan,
 you can also use C<\o{oct}>.)

-    /\x{263a}/; # match a Unicode smiley face :)
+    /\x{263a}/; # match a Unicode smiley face :)
+    /\x{ 263a }/; # Same

 B<NOTE>: In Perl 5.6.0 it used to be that one needed to say C<use
-utf8> to use any Unicode features. This is no more the case: for
+utf8> to use any Unicode features. This is no longer the case: for
 almost all Unicode processing, the explicit C<utf8> pragma is not
 needed. (The only case where it matters is if your Perl script is in
 Unicode and encoded in UTF-8, then an explicit C<use utf8> is needed.)
@@ -2070,16 +2074,16 @@ C<\p{Mark}>, meaning things like accent marks.

 The Unicode C<\p{Script}> and C<\p{Script_Extensions}> properties are
 used to categorize every Unicode character into the language script it
-is written in. (C<Script_Extensions> is an improved version of
-C<Script>, which is retained for backward compatibility, and so you
-should generally use C<Script_Extensions>.)
-For example,
+is written in. For example,
 English, French, and a bunch of other European languages are written in
 the Latin script. But there is also the Greek script, the Thai script,
-the Katakana script, I<etc>. You can test whether a character is in a
-particular script (based on C<Script_Extensions>) with, for example
-C<\p{Latin}>, C<\p{Greek}>, or C<\p{Katakana}>. To test if it isn't in
-the Balinese script, you would use C<\P{Balinese}>.
+the Katakana script, I<etc>. (C<Script> is an older, less advanced,
+form of C<Script_Extensions>, retained only for backwards
+compatibility.) You can test whether a character is in a particular
+script with, for example C<\p{Latin}>, C<\p{Greek}>, or
+C<\p{Katakana}>. To test if it isn't in the Balinese script, you would
+use C<\P{Balinese}>. (These all use C<Script_Extensions> under the
+hood, as that gives better results.)

 What we have described so far is the single form of the C<\p{...}> character
 classes. There is also a compound form which you may run into. These
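
Note on this hunk: the script tests it describes (C<\p{Latin}>, C<\p{Greek}>, C<\p{Katakana}>, C<\P{Balinese}>) can be tried with a short script. This is a sketch for illustration and is not part of the commit; the helper name C<script_of> and the sample code points are hypothetical choices, and only three scripts are checked.

```perl
use strict;
use warnings;

# Hypothetical helper: report which of three scripts a single character
# belongs to, using the single-form \p{...} properties.
sub script_of {
    my ($char) = @_;
    return 'Latin'    if $char =~ /\p{Latin}/;
    return 'Greek'    if $char =~ /\p{Greek}/;
    return 'Katakana' if $char =~ /\p{Katakana}/;
    return 'other';
}

# Sample characters written as escapes so the source stays ASCII:
# 'a', GREEK SMALL LETTER ALPHA, KATAKANA LETTER KA, and the digit '1'.
for my $char ('a', "\x{3b1}", "\x{30ab}", '1') {
    printf "U+%04X: %s\n", ord($char), script_of($char);
}

# \P{...} (capital P) is the negated form: 'a' is not Balinese.
print "not Balinese\n" if 'a' =~ /\P{Balinese}/;
```

The digit falls through to C<'other'> because digits are shared across scripts rather than assigned to one of the three tested here.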
@@ -2459,7 +2463,7 @@ substring delimited by parentheses. The problem with this regexp is
 that it is pathological: it has nested indeterminate quantifiers
 of the form C<(a+|b)+>. We discussed in Part 1 how nested quantifiers
 like this could take an exponentially long time to execute if there
-was no match possible. To prevent the exponential blowup, we need to
+is no match possible. To prevent the exponential blowup, we need to
 prevent useless backtracking at some point. This can be done by
 enclosing the inner quantifier as an independent subexpression:
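
Note on this hunk: the effect of an independent subexpression C<< (?>...) >> on backtracking can be seen on a much smaller pattern than the one the tutorial discusses. This is a sketch for illustration and is not part of the commit; the helper names C<plain_match> and C<independent_match> and the test string are hypothetical.

```perl
use strict;
use warnings;

# Ordinary quantifier: a+ may give back characters so that the
# trailing 'ab' can still match.
sub plain_match {
    my ($s) = @_;
    return $s =~ /a+ab/ ? 1 : 0;
}

# Independent subexpression: once (?>a+) has matched, it does not
# give back any of the characters it consumed.
sub independent_match {
    my ($s) = @_;
    return $s =~ /(?>a+)ab/ ? 1 : 0;
}

my $str = 'aaab';   # sample string chosen only for illustration
printf "plain: %d, independent: %d\n",
    plain_match($str), independent_match($str);
```

The independent form fails on this string because C<a+> consumes every C<a> and nothing is left for the literal C<a> that follows. That refusal to backtrack is the same behaviour that avoids the exponential blowup described in the hunk.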

@@ -2645,8 +2649,8 @@ section L</"Pragmas and debugging"> below.

 More fun with C<?{}>:

-    $x =~ /(?{print "Hi Mom!";})/; # matches,
-    # prints 'Hi Mom!'
+    $x =~ /(?{print "Hi Mom!";})/; # matches,
+    # prints 'Hi Mom!'
     $x =~ /(?{$c = 1;})(?{print "$c";})/; # matches,
     # prints '1'
     $x =~ /(?{$c = 1;})(?{print "$^R";})/; # matches,
