=head1 NAME

perlrecharclass - Perl Regular Expression Character Classes

=head1 DESCRIPTION

The top level documentation about Perl regular expressions
is found in L<perlre>.

This manual page discusses the syntax and use of character
classes in Perl Regular Expressions.

A character class is a way of denoting a set of characters,
in such a way that one character of the set is matched.
It's important to remember that matching a character class
consumes exactly one character in the source string. (The source
string is the string the regular expression is matched against.)

There are three types of character classes in Perl regular
expressions: the dot, backslashed sequences, and the bracketed form.

=head2 The dot

The dot (or period), C<.>, is probably the most used, and certainly
the most well-known character class. By default, a dot matches any
character, except for the newline. The default can be changed to
add matching the newline with the I<single line> modifier: either
for the entire regular expression using the C</s> modifier, or
locally using C<(?s)>.

Here are some examples:

 "a"  =~  /./       # Match
 "."  =~  /./       # Match
 ""   =~  /./       # No match (dot has to match a character)
 "\n" =~  /./       # No match (dot does not match a newline)
 "\n" =~  /./s      # Match (global 'single line' modifier)
 "\n" =~  /(?s:.)/  # Match (local 'single line' modifier)
 "ab" =~  /^.$/     # No match (dot matches one character)


=head2 Backslashed sequences

Perl regular expressions contain many backslashed sequences that
constitute a character class. That is, they will match a single
character, if that character belongs to a specific set of characters
(defined by the sequence). A backslashed sequence is a sequence of
characters starting with a backslash. Not all backslashed sequences
are character classes; for a full list, see L<perlrebackslash>.

Here's a list of the backslashed sequences, which are discussed in
more detail below.

 \d             Match a digit character.
 \D             Match a non-digit character.
 \w             Match a "word" character.
 \W             Match a non-"word" character.
 \s             Match a white space character.
 \S             Match a non-white space character.
 \h             Match a horizontal white space character.
 \H             Match a character that isn't horizontal white space.
 \v             Match a vertical white space character.
 \V             Match a character that isn't vertical white space.
 \pP, \p{Prop}  Match a character matching a Unicode property.
 \PP, \P{Prop}  Match a character that doesn't match a Unicode property.

=head3 Digits

C<\d> matches a single character that is considered to be a I<digit>.
What is considered a digit depends on the internal encoding of
the source string. If the source string is in UTF-8 format, C<\d>
not only matches the digits '0' - '9', but also Arabic and Devanagari
digits, and digits from other scripts. Otherwise, if there is a locale
in effect, it will match whatever characters the locale considers
digits. Without a locale, C<\d> matches the digits '0' to '9'.
See L</Locale, Unicode and UTF-8>.

Any character that isn't matched by C<\d> will be matched by C<\D>.

=head3 Word characters

C<\w> matches a single I<word> character: an alphanumeric character
(that is, an alphabetic character, or a digit), or the underscore (C<_>).
What is considered a word character depends on the internal encoding
of the string. If it's in UTF-8 format, C<\w> matches those characters
that are considered word characters in the Unicode database. That is, it
not only matches ASCII letters, but also Thai letters, Greek letters, etc.
If the source string isn't in UTF-8 format, C<\w> matches those characters
that are considered word characters by the current locale. Without
a locale in effect, C<\w> matches the ASCII letters, digits and the
underscore.
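
As a minimal sketch of the UTF-8 case described above, the following
compares C<\w> with an explicit ASCII-only bracketed class on a Greek
letter. The variable names are illustrative only and do not come from
this document.

```perl

    use strict;
    use warnings;

    # "\x{03b1}" is GREEK SMALL LETTER ALPHA. Its code point is above
    # 255, so the string is stored in UTF-8 format.
    my $alpha = "\x{03b1}";

    # \w follows the Unicode database for a UTF-8 string.
    my $is_word  = $alpha =~ /^\w$/           ? 1 : 0;

    # An explicit ASCII set never matches a Greek letter.
    my $is_ascii = $alpha =~ /^[A-Za-z0-9_]$/ ? 1 : 0;

    print "word: $is_word, ascii: $is_ascii\n";  # prints "word: 1, ascii: 0"

```

If only ASCII word characters are wanted, spelling the set out as
C<[A-Za-z0-9_]> avoids the dependence on the string's encoding.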

Any character that isn't matched by C<\w> will be matched by C<\W>.

=head3 White space

C<\s> matches any single character that is considered white space. In the
ASCII range, C<\s> matches the horizontal tab (C<\t>), the new line
(C<\n>), the form feed (C<\f>), the carriage return (C<\r>), and the
space. (The vertical tab, C<\cK>, is not matched by C<\s>.) The exact set
of characters matched by C<\s> depends on whether the source string is
in UTF-8 format. If it is, C<\s> matches what is considered white space
in the Unicode database. Otherwise, if there is a locale in effect, C<\s>
matches whatever is considered white space by the current locale. Without
a locale, C<\s> matches the five characters mentioned in the beginning
of this paragraph. Perhaps the most notable difference is that C<\s>
matches a non-breaking space only if the non-breaking space is in a
UTF-8 encoded string.

Any character that isn't matched by C<\s> will be matched by C<\S>.

C<\h> will match any character that is considered horizontal white space;
this includes the space and the tab characters. C<\H> will match any character
that is not considered horizontal white space.

C<\v> will match any character that is considered vertical white space;
this includes the carriage return and line feed characters (newline).
C<\V> will match any character that is not considered vertical white space.

C<\R> matches anything that can be considered a newline under Unicode
rules. It's not a character class, as it can match a multi-character
sequence. Therefore, it cannot be used inside a bracketed character
class. Details are discussed in L<perlrebackslash>.

C<\h>, C<\H>, C<\v>, C<\V>, and C<\R> are new in perl 5.10.0.

Note that unlike C<\s>, C<\d> and C<\w>, C<\h> and C<\v> always match
the same characters, regardless of whether the source string is in UTF-8
format or not.
The set of characters they match is also not influenced
by locale.

One might think that C<\s> is equivalent to C<[\h\v]>. This is not true.
The vertical tab (C<"\x0b">) is not matched by C<\s>; it is, however,
considered vertical white space. Furthermore, if the source string is
not in UTF-8 format, the next line (C<"\x85">) and the no-break space
(C<"\xA0">) are not matched by C<\s>, but are by C<\v> and C<\h> respectively.
If the source string is in UTF-8 format, both the next line and the
no-break space are matched by C<\s>.

The following table is a complete listing of characters matched by
C<\s>, C<\h> and C<\v>.

The first column gives the code point of the character (in hex format),
the second column gives the (Unicode) name. The third column indicates
by which class(es) the character is matched.

 0x00009  CHARACTER TABULATION       h s
 0x0000a  LINE FEED (LF)              vs
 0x0000b  LINE TABULATION             v
 0x0000c  FORM FEED (FF)              vs
 0x0000d  CARRIAGE RETURN (CR)        vs
 0x00020  SPACE                      h s
 0x00085  NEXT LINE (NEL)             vs  [1]
 0x000a0  NO-BREAK SPACE             h s  [1]
 0x01680  OGHAM SPACE MARK           h s
 0x0180e  MONGOLIAN VOWEL SEPARATOR  h s
 0x02000  EN QUAD                    h s
 0x02001  EM QUAD                    h s
 0x02002  EN SPACE                   h s
 0x02003  EM SPACE                   h s
 0x02004  THREE-PER-EM SPACE         h s
 0x02005  FOUR-PER-EM SPACE          h s
 0x02006  SIX-PER-EM SPACE           h s
 0x02007  FIGURE SPACE               h s
 0x02008  PUNCTUATION SPACE          h s
 0x02009  THIN SPACE                 h s
 0x0200a  HAIR SPACE                 h s
 0x02028  LINE SEPARATOR              vs
 0x02029  PARAGRAPH SEPARATOR         vs
 0x0202f  NARROW NO-BREAK SPACE      h s
 0x0205f  MEDIUM MATHEMATICAL SPACE  h s
 0x03000  IDEOGRAPHIC SPACE          h s

=over 4

=item [1]

NEXT LINE and NO-BREAK SPACE only match C<\s> if the source string is in
UTF-8 format.

=back

It is worth noting that C<\d>, C<\w>, etc., match single characters, not
complete numbers or words.
To match a number (that consists of digits),
use C<\d+>; to match a word, use C<\w+>.


=head3 Unicode Properties

C<\pP> and C<\p{Prop}> are character classes to match characters that
fit given Unicode classes. One letter classes can be used in the C<\pP>
form, with the class name following the C<\p>; otherwise, the property
name is enclosed in braces, and follows the C<\p>. For instance, a
match for a number can be written as C</\pN/> or as C</\p{Number}/>.
Lowercase letters are matched by the property I<LowercaseLetter> which
has as short form I<Ll>. They have to be written as C</\p{Ll}/> or
C</\p{LowercaseLetter}/>. C</\pLl/> is valid, but means something different.
It matches a two character string: a letter (Unicode property C<\pL>),
followed by a lowercase C<l>.

For a list of possible properties, see
L<perlunicode/Unicode Character Properties>. It is also possible to
define your own properties. This is discussed in
L<perlunicode/User-Defined Character Properties>.


=head4 Examples

 "a"  =~  /\w/      # Match, "a" is a 'word' character.
 "7"  =~  /\w/      # Match, "7" is a 'word' character as well.
 "a"  =~  /\d/      # No match, "a" isn't a digit.
 "7"  =~  /\d/      # Match, "7" is a digit.
 " "  =~  /\s/      # Match, a space is white space.
 "a"  =~  /\D/      # Match, "a" is a non-digit.
 "7"  =~  /\D/      # No match, "7" is not a non-digit.
 " "  =~  /\S/      # No match, a space is not non-white space.

 " "  =~  /\h/      # Match, space is horizontal white space.
 " "  =~  /\v/      # No match, space is not vertical white space.
 "\r" =~  /\v/      # Match, a return is vertical white space.

 "a"  =~  /\pL/     # Match, "a" is a letter.
 "a"  =~  /\p{Lu}/  # No match, /\p{Lu}/ matches upper case letters.

 "\x{0e0b}" =~ /\p{Thai}/  # Match, \x{0e0b} is the character
                           # 'THAI CHARACTER SO SO', and that's in
                           # the Thai Unicode class.
 "a"  =~  /\P{Lao}/ # Match, as "a" is not a Lao character.
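
The difference between C</\p{Ll}/> and C</\pLl/> described in this
section can be checked directly. The following is a small sketch; the
variable names are illustrative only and do not come from this document.

```perl

    use strict;
    use warnings;

    # \p{Ll} is a single class: exactly one lowercase letter.
    my $brace_lower = "a"  =~ /^\p{Ll}$/ ? 1 : 0;
    my $brace_upper = "A"  =~ /^\p{Ll}$/ ? 1 : 0;

    # \pLl is \pL (any letter) followed by a literal "l",
    # so it needs two characters to match.
    my $bare_two    = "Al" =~ /^\pLl$/   ? 1 : 0;
    my $bare_one    = "a"  =~ /^\pLl$/   ? 1 : 0;

    # prints "1 0 1 0"
    print "$brace_lower $brace_upper $bare_two $bare_one\n";

```

Writing the braces even for one letter properties, as in C</\p{L}/>,
avoids this ambiguity.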

=head2 Bracketed Character Classes

The third form of character class you can use in Perl regular expressions
is the bracketed form. In its simplest form, it lists the characters
that may be matched inside square brackets, like this: C<[aeiou]>.
This matches one of C<a>, C<e>, C<i>, C<o> or C<u>. Just as with the other
character classes, exactly one character will be matched. To match
a longer string consisting of characters mentioned in the character
class, follow the character class with a quantifier. For instance,
C<[aeiou]+> matches a string of one or more lowercase ASCII vowels.

Repeating a character in a character class has no
effect; it's considered to be in the set only once.

Examples:

 "e"  =~  /[aeiou]/        # Match, as "e" is listed in the class.
 "p"  =~  /[aeiou]/        # No match, "p" is not listed in the class.
 "ae" =~  /^[aeiou]$/      # No match, a character class only matches
                           # a single character.
 "ae" =~  /^[aeiou]+$/     # Match, due to the quantifier.

=head3 Special Characters Inside a Bracketed Character Class

Most characters that are meta characters in regular expressions (that
is, characters that carry a special meaning like C<*> or C<(>) lose
their special meaning and can be used inside a character class without
the need to escape them. For instance, C<[()]> matches either an opening
parenthesis, or a closing parenthesis, and the parens inside the character
class don't group or capture.

Characters that may carry a special meaning inside a character class are:
C<\>, C<^>, C<->, C<[> and C<]>, and are discussed below. They can be
escaped with a backslash, although this is sometimes not needed, in which
case the backslash may be omitted.

The sequence C<\b> is special inside a bracketed character class.
While
outside the character class C<\b> is an assertion indicating a point
that does not have either two word characters or two non-word characters
on either side, inside a bracketed character class, C<\b> matches a
backspace character.

A C<[> is not special inside a character class, unless it's the start
of a POSIX character class (see below). It normally does not need escaping.

A C<]> is either the end of a POSIX character class (see below), or it
signals the end of the bracketed character class. Normally it needs
escaping if you want to include a C<]> in the set of characters.
However, if the C<]> is the I<first> (or the second if the first
character is a caret) character of a bracketed character class, it
does not denote the end of the class (as you cannot have an empty class)
and is considered part of the set of characters that can be matched without
escaping.

Examples:

 "+"   =~ /[+?*]/     #  Match, "+" in a character class is not special.
 "\cH" =~ /[\b]/      #  Match, \b inside a character class
                      #  is equivalent to a backspace.
 "]"   =~ /[][]/      #  Match, as the character class contains
                      #  both [ and ].
 "[]"  =~ /[[]]/      #  Match, the pattern contains a character class
                      #  containing just [, and the character class is
                      #  followed by a ].

=head3 Character Ranges

It is not uncommon to want to match a range of characters. Luckily, instead
of listing all the characters in the range, one may use the hyphen (C<->).
If inside a bracketed character class you have two characters separated
by a hyphen, it's treated as if all the characters between the two are in
the class. For instance, C<[0-9]> matches any ASCII digit, and C<[a-m]>
matches any lowercase letter from the first half of the ASCII alphabet.

Note that the two characters on either side of the hyphen are not
necessarily both letters or both digits.
Any character is possible,
although not advisable. C<['-?]> contains a range of characters, but
most people will not know which characters that will be. Furthermore,
such ranges may lead to portability problems if the code has to run on
a platform that uses a different character set, such as EBCDIC.

If a hyphen in a character class cannot be part of a range, for instance
because it is the first or the last character of the character class,
or if it immediately follows a range, the hyphen isn't special, and will be
considered a character that may be matched. You have to escape the hyphen
with a backslash if you want to have a hyphen in your set of characters to
be matched, and its position in the class is such that it can be considered
part of a range.

Examples:

 [a-z]       #  Matches a character that is a lower case ASCII letter.
 [a-fz]      #  Matches any letter between 'a' and 'f' (inclusive) or the
             #  letter 'z'.
 [-z]        #  Matches either a hyphen ('-') or the letter 'z'.
 [a-f-m]     #  Matches any letter between 'a' and 'f' (inclusive), the
             #  hyphen ('-'), or the letter 'm'.
 ['-?]       #  Matches any of the characters  '()*+,-./0123456789:;<=>?
             #  (But not on an EBCDIC platform).


=head3 Negation

It is also possible to instead list the characters you do not want to
match. You can do so by using a caret (C<^>) as the first character in the
character class. For instance, C<[^a-z]> matches a character that is not a
lowercase ASCII letter.

This syntax makes the caret a special character inside a bracketed character
class, but only if it is the first character of the class. So if you want
to have the caret as one of the characters you want to match, you either
have to escape the caret, or not list it first.

Examples:

 "e"  =~  /[^aeiou]/   # No match, the 'e' is listed.
 "x"  =~  /[^aeiou]/   # Match, as 'x' isn't a lowercase vowel.
 "^"  =~  /[^^]/       # No match, matches anything that isn't a caret.
 "^"  =~  /[x^]/       # Match, caret is not special here.

=head3 Backslash Sequences

You can put a backslash sequence character class inside a bracketed character
class, and it will act just as if you put all the characters matched by
the backslash sequence inside the character class. For instance,
C<[a-f\d]> will match any digit, or any of the lowercase letters between
'a' and 'f' inclusive.

Examples:

 /[\p{Thai}\d]/     # Matches a character that is either a Thai
                    # character, or a digit.
 /[^\p{Arabic}()]/  # Matches a character that is neither an Arabic
                    # character, nor a parenthesis.

Backslash sequence character classes cannot form one of the endpoints
of a range.

=head3 Posix Character Classes

POSIX character classes have the form C<[:class:]>, where I<class> is
the name, and C<[:> and C<:]> are the delimiters. POSIX character classes
appear I<inside> bracketed character classes, and are a convenient and
descriptive way of listing a group of characters. Be careful about the
syntax:

 # Correct:
 $string =~ /[[:alpha:]]/

 # Incorrect (will warn):
 $string =~ /[:alpha:]/

The latter pattern would be a character class consisting of a colon,
and the letters C<a>, C<l>, C<p> and C<h>.

Perl recognizes the following POSIX character classes:

 alpha   Any alphabetical character.
 alnum   Any alphanumerical character.
 ascii   Any ASCII character.
 blank   A GNU extension, equal to a space or a horizontal tab (C<\t>).
 cntrl   Any control character.
 digit   Any digit, equivalent to C<\d>.
 graph   Any printable character, excluding a space.
 lower   Any lowercase character.
 print   Any printable character, including a space.
 punct   Any punctuation character.
 space   Any white space character. C<\s> plus the vertical tab (C<\cK>).
 upper   Any uppercase character.
 word    Any "word" character, equivalent to C<\w>.
 xdigit  Any hexadecimal digit, '0' - '9', 'a' - 'f', 'A' - 'F'.

The exact set of characters matched depends on whether the source string
is internally in UTF-8 format or not. See L</Locale, Unicode and UTF-8>.

Most POSIX character classes have C<\p> counterparts. The difference
is that the C<\p> classes will always match according to the Unicode
properties, regardless of whether the string is in UTF-8 format or not.

The following table shows the relation between POSIX character classes
and the Unicode properties:

 [[:...:]]   \p{...}      backslash

 alpha       IsAlpha
 alnum       IsAlnum
 ascii       IsASCII
 blank
 cntrl       IsCntrl
 digit       IsDigit      \d
 graph       IsGraph
 lower       IsLower
 print       IsPrint
 punct       IsPunct
 space       IsSpace
             IsSpacePerl  \s
 upper       IsUpper
 word        IsWord
 xdigit      IsXDigit

Some character classes may have a non-obvious name:

=over 4

=item cntrl

Any control character. Usually, control characters don't produce output
as such, but instead control the terminal somehow: for example newline
and backspace are control characters. All characters with C<ord()> less
than 32 are usually classified as control characters (in ASCII, the ISO
Latin character sets, and Unicode), as is the character with an C<ord()>
value of 127 (C<DEL>).

=item graph

Any character that is I<graphical>, that is, visible. This class consists
of all the alphanumerical characters and all punctuation characters.

=item print

All printable characters, which is the set of all the graphical characters
plus the space.

=item punct

Any punctuation (special) character.

=back

=head4 Negation

A Perl extension to the POSIX character class is the ability to
negate it. This is done by prefixing the class name with a caret (C<^>).
Some examples:

 POSIX         Unicode      Backslash
 [[:^digit:]]  \P{IsDigit}  \D
 [[:^space:]]  \P{IsSpace}  \S
 [[:^word:]]   \P{IsWord}   \W

=head4 [= =] and [. .]

Perl will recognize the POSIX character classes C<[=class=]> and
C<[.class.]>, but does not (yet?) support these constructs. Use of
such constructs will lead to an error.


=head4 Examples

 /[[:digit:]]/            # Matches a character that is a digit.
 /[01[:lower:]]/          # Matches a character that is either a
                          # lowercase letter, or '0' or '1'.
 /[[:digit:][:^xdigit:]]/ # Matches a character that can be anything,
                          # but the letters 'a' to 'f' in either case.
                          # This is because the character class contains
                          # all digits, and anything that isn't a
                          # hex digit, resulting in a class containing
                          # all characters, but the letters 'a' to 'f'
                          # and 'A' to 'F'.


=head2 Locale, Unicode and UTF-8

Some of the character classes have a somewhat different behaviour depending
on the internal encoding of the source string, and the locale that is
in effect.

C<\w>, C<\d>, C<\s> and the POSIX character classes (and their negations,
including C<\W>, C<\D>, C<\S>) suffer from this behaviour.

The rule is that if the source string is in UTF-8 format, the character
classes match according to the Unicode properties. If the source string
isn't, then the character classes match according to whatever locale is
in effect. If there is no locale, they match the ASCII defaults
(52 letters, 10 digits and underscore for C<\w>, 0 to 9 for C<\d>, etc.).

This usually means that if you are matching against characters whose C<ord()>
values are between 128 and 255 inclusive, your character class may match
or not depending on the current locale, and whether the source string is
in UTF-8 format. The string will be in UTF-8 format if it contains
characters whose C<ord()> value exceeds 255.
But a string may be in UTF-8
format without it having such characters.

For portability reasons, it may be better to not use C<\w>, C<\d>, C<\s>
or the POSIX character classes, and use the Unicode properties instead.

=head4 Examples

 $str =  "\xDF";      # $str is not in UTF-8 format.
 $str =~ /^\w/;       # No match, as $str isn't in UTF-8 format.
 $str .= "\x{0e0b}";  # Now $str is in UTF-8 format.
 $str =~ /^\w/;       # Match! $str is now in UTF-8 format.
 chop $str;
 $str =~ /^\w/;       # Still a match! $str remains in UTF-8 format.

=cut