=encoding utf8

=head1 NAME

perlunicook - cookbookish examples of handling Unicode in Perl

=head1 DESCRIPTION

This manpage contains short recipes demonstrating how to handle common Unicode
operations in Perl, plus one complete program at the end. Any undeclared
variables in individual recipes are assumed to have a previous appropriate
value in them.

=head1 EXAMPLES

=head2 ℞ 0: Standard preamble

Unless otherwise noted, all examples below require this standard preamble
to work correctly, with the C<#!> adjusted to work on your system:

    #!/usr/bin/env perl

    use v5.36;     # or later to get "unicode_strings" feature,
                   #   plus strict, warnings
    use utf8;      # so literals and identifiers can be in UTF-8
    use warnings  qw(FATAL utf8);    # fatalize encoding glitches
    use open      qw(:std :encoding(UTF-8)); # undeclared streams in UTF-8
    use charnames qw(:full :short);  # unneeded in v5.16

This I<does> make even Unix programmers C<binmode> their binary streams,
or open them with C<:raw>, but that's the only way to get at them
portably anyway.

B<WARNING>: C<use autodie> (pre 2.26) and C<use open> do not get along with each
other.

=head2 ℞ 1: Generic Unicode-savvy filter

Always decompose on the way in, then recompose on the way out.

    use Unicode::Normalize;

    while (<>) {
        $_ = NFD($_);   # decompose + reorder canonically
        ...
    } continue {
        print NFC($_);  # recompose (where possible) + reorder canonically
    }

=head2 ℞ 2: Fine-tuning Unicode warnings

As of v5.14, Perl distinguishes three subclasses of UTF‑8 warnings.

    use v5.14;                  # subwarnings unavailable any earlier
    no warnings "nonchar";      # the 66 forbidden non-characters
    no warnings "surrogate";    # UTF-16/CESU-8 nonsense
    no warnings "non_unicode";  # for codepoints over 0x10_FFFF

=head2 ℞ 3: Declare source in utf8 for identifiers and literals

Without the all-critical C<use utf8> declaration, putting UTF‑8 in your
literals and identifiers won’t work right. If you used the standard
preamble just given above, this already happened. If you did, you can
do things like this:

    use utf8;

    my $measure   = "Ångström";
    my @μsoft     = qw( cp852 cp1251 cp1252 );
    my @ὑπέρμεγας = qw( ὑπέρ  μεγας );
    my @鯉        = qw( koi8-f koi8-u koi8-r );
    my $motto     = "👪 💗 🐪"; # FAMILY, GROWING HEART, DROMEDARY CAMEL

If you forget C<use utf8>, high bytes will be misunderstood as
separate characters, and nothing will work right.

=head2 ℞ 4: Characters and their numbers

The C<ord> and C<chr> functions work transparently on all codepoints,
not just on ASCII alone, nor in fact even just on Unicode alone.

    # ASCII characters
    ord("A")
    chr(65)

    # characters from the Basic Multilingual Plane
    ord("Σ")
    chr(0x3A3)

    # beyond the BMP
    ord("𝑛")               # MATHEMATICAL ITALIC SMALL N
    chr(0x1D45B)

    # beyond Unicode! (up to MAXINT)
    ord("\x{20_0000}")
    chr(0x20_0000)

=head2 ℞ 5: Unicode literals by character number

In an interpolated literal, whether a double-quoted string or a
regex, you may specify a character by its number using the
C<\x{I<HHHHHH>}> escape.
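
As a minimal sketch of the escape in use, assuming nothing beyond core Perl
(the variable names are illustrative only), the following checks that
C<\x{...}> and C<chr> agree, that a character beyond the BMP still counts as
one character, and that the same escape works inside a regex:

```perl
use v5.14;
use warnings;

# "\x{3a3}" and chr(0x3A3) denote the same character
my $sigma = "\x{3a3}";
my $same  = $sigma eq chr(0x3A3) ? 1 : 0;

# a codepoint beyond the BMP is still a single character
my $n   = "\x{1d45b}";
my $len = length $n;

# the escape works the same way inside a regex
my $matched = $sigma =~ /^\x{3a3}$/ ? 1 : 0;

printf "U+%04X same=%d len=%d matched=%d\n",
    ord($sigma), $same, $len, $matched;
```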

    String: "\x{3a3}"
    Regex:  /\x{3a3}/

    String: "\x{1d45b}"
    Regex:  /\x{1d45b}/

    # even non-BMP ranges in regex work fine
    /[\x{1D434}-\x{1D467}]/

=head2 ℞ 6: Get character name by number

    use charnames ();
    my $name = charnames::viacode(0x03A3);

=head2 ℞ 7: Get character number by name

    use charnames ();
    my $number = charnames::vianame("GREEK CAPITAL LETTER SIGMA");

=head2 ℞ 8: Unicode named characters

Use the C<< \N{I<charname>} >> notation to get the character
by that name for use in interpolated literals (double-quoted
strings and regexes). As of v5.16, there is an implicit

    use charnames qw(:full :short);

But prior to v5.16, you must be explicit about which set of charnames you
want. The C<:full> names are the official Unicode character name, alias, or
sequence, which all share a namespace.

    use charnames qw(:full :short latin greek);

    "\N{MATHEMATICAL ITALIC SMALL N}"      # :full
    "\N{GREEK CAPITAL LETTER SIGMA}"       # :full

Anything else is a Perl-specific convenience abbreviation. Specify one or
more scripts by name if you want short names that are script-specific.

    "\N{Greek:Sigma}"                      # :short
    "\N{ae}"                               #  latin
    "\N{epsilon}"                          #  greek

The v5.16 release also supports a C<:loose> import for loose matching of
character names, which works just like loose matching of property names:
that is, it disregards case, whitespace, and underscores:

    "\N{euro sign}"                        # :loose (from v5.16)

Starting in v5.32, you can also use

    qr/\p{name=euro sign}/

to get official Unicode named characters in regular expressions. Loose
matching is always done for these.

=head2 ℞ 9: Unicode named sequences

These look just like character names but return multiple codepoints.
Notice the C<%vx> vector-print functionality in C<printf>.
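
The C<%v> flag is not specific to named sequences. As a small sketch using
only core Perl (variable names are illustrative), it formats the ordinal of
each character in any string and joins the results with dots:

```perl
use v5.14;
use warnings;

# an "e" followed by COMBINING ACUTE ACCENT: two codepoints
my $decomposed = "e\x{301}";
my $hex = sprintf "U+%v04X", $decomposed;   # each ordinal as zero-padded hex

# plain ASCII works the same way, here in decimal
my $dec = sprintf "%vd", "abc";

say $hex;   # U+0065.0301
say $dec;   # 97.98.99
```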

    use charnames qw(:full);
    my $seq = "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}";
    printf "U+%v04X\n", $seq;
    U+0100.0300

=head2 ℞ 10: Custom named characters

Use C<:alias> to give your own lexically scoped nicknames to existing
characters, or even to give unnamed private-use characters useful names.

    use charnames ":full", ":alias" => {
        ecute => "LATIN SMALL LETTER E WITH ACUTE",
        "APPLE LOGO" => 0xF8FF, # private use character
    };

    "\N{ecute}"
    "\N{APPLE LOGO}"

=head2 ℞ 11: Names of CJK codepoints

Sinograms like “東京” come back with character names of
C<CJK UNIFIED IDEOGRAPH-6771> and C<CJK UNIFIED IDEOGRAPH-4EAC>,
because their “names” vary. The CPAN C<Unicode::Unihan> module
has a large database for decoding these (and a whole lot more), provided you
know how to understand its output.

    # cpan -i Unicode::Unihan
    use Unicode::Unihan;
    my $str = "東京";
    my $unhan = Unicode::Unihan->new;
    for my $lang (qw(Mandarin Cantonese Korean JapaneseOn JapaneseKun)) {
        printf "CJK $str in %-12s is ", $lang;
        say $unhan->$lang($str);
    }

prints:

    CJK 東京 in Mandarin     is DONG1JING1
    CJK 東京 in Cantonese    is dung1ging1
    CJK 東京 in Korean       is TONGKYENG
    CJK 東京 in JapaneseOn   is TOUKYOU KEI KIN
    CJK 東京 in JapaneseKun  is HIGASHI AZUMAMIYAKO

If you have a specific romanization scheme in mind,
use the specific module:

    # cpan -i Lingua::JA::Romanize::Japanese
    use Lingua::JA::Romanize::Japanese;
    my $k2r = Lingua::JA::Romanize::Japanese->new;
    my $str = "東京";
    say "Japanese for $str is ", $k2r->chars($str);

prints:

    Japanese for 東京 is toukyou

=head2 ℞ 12: Explicit encode/decode

On rare occasion, such as a database read, you may be
given encoded text you need to decode.
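
As a hedged sketch of a full round trip, assuming UTF-8 as the encoding and
using illustrative variable names: the check argument here is
C<Encode::FB_CROAK | Encode::LEAVE_SRC>, which makes malformed data fatal
while leaving the source string untouched. The L<Encode> documentation notes
that a check value given without C<LEAVE_SRC> may overwrite the source string
in place.

```perl
use v5.14;
use warnings;
use Encode qw(encode decode);

my $chars = "cr\x{e8}me";    # five characters, one of them non-ASCII
my $check = Encode::FB_CROAK | Encode::LEAVE_SRC;

my $bytes = encode("UTF-8", $chars, $check);   # characters -> bytes
my $back  = decode("UTF-8", $bytes, $check);   # bytes -> characters

printf "%d chars, %d bytes, round trip %s\n",
    length($chars), length($bytes), ($back eq $chars ? "ok" : "failed");
```

The byte string is one longer than the character string because the single
non-ASCII character takes two bytes in UTF-8.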

    use Encode qw(encode decode);

    my $chars = decode("shiftjis", $bytes, 1);
    # OR
    my $bytes = encode("MIME-Header-ISO_2022_JP", $chars, 1);

For streams all in the same encoding, don't use encode/decode; instead
set the file encoding when you open the file or immediately after with
C<binmode> as described later below.

=head2 ℞ 13: Decode program arguments as utf8

    $ perl -CA ...
       or
    $ export PERL_UNICODE=A
       or
    use Encode qw(decode);
    @ARGV = map { decode('UTF-8', $_, 1) } @ARGV;

=head2 ℞ 14: Decode program arguments as locale encoding

    # cpan -i Encode::Locale
    use Encode qw(decode);
    use Encode::Locale;

    # use "locale" as an arg to encode/decode
    @ARGV = map { decode(locale => $_, 1) } @ARGV;

=head2 ℞ 15: Declare STD{IN,OUT,ERR} to be utf8

Use a command-line option, an environment variable, or else
call C<binmode> explicitly:

    $ perl -CS ...
       or
    $ export PERL_UNICODE=S
       or
    use open qw(:std :encoding(UTF-8));
       or
    binmode(STDIN,  ":encoding(UTF-8)");
    binmode(STDOUT, ":utf8");
    binmode(STDERR, ":utf8");

=head2 ℞ 16: Declare STD{IN,OUT,ERR} to be in locale encoding

    # cpan -i Encode::Locale
    use Encode;
    use Encode::Locale;

    # or as a stream for binmode or open
    binmode STDIN,  ":encoding(console_in)"  if -t STDIN;
    binmode STDOUT, ":encoding(console_out)" if -t STDOUT;
    binmode STDERR, ":encoding(console_out)" if -t STDERR;

=head2 ℞ 17: Make file I/O default to utf8

Files opened without an encoding argument will be in UTF-8:

    $ perl -CD ...
       or
    $ export PERL_UNICODE=D
       or
    use open qw(:encoding(UTF-8));

=head2 ℞ 18: Make all I/O and args default to utf8

    $ perl -CSDA ...
       or
    $ export PERL_UNICODE=SDA
       or
    use open qw(:std :encoding(UTF-8));
    use Encode qw(decode);
    @ARGV = map { decode('UTF-8', $_, 1) } @ARGV;

=head2 ℞ 19: Open file with specific encoding

Specify stream encoding. This is the normal way
to deal with encoded text, not by calling low-level
functions.

    # input file
        open(my $in_file, "< :encoding(UTF-16)", "wintext");
    OR
        open(my $in_file, "<", "wintext");
        binmode($in_file, ":encoding(UTF-16)");
    THEN
        my $line = <$in_file>;

    # output file
        open(my $out_file, "> :encoding(cp1252)", "wintext");
    OR
        open(my $out_file, ">", "wintext");
        binmode($out_file, ":encoding(cp1252)");
    THEN
        print $out_file "some text\n";

More layers than just the encoding can be specified here. For example,
the incantation C<":raw :encoding(UTF-16LE) :crlf"> includes implicit
CRLF handling.

=head2 ℞ 20: Unicode casing

Unicode casing is very different from ASCII casing.

    uc("henry ⅷ")  # "HENRY Ⅷ"
    uc("tschüß")   # "TSCHÜSS"  notice ß => SS

    # both are true:
    "tschüß"  =~ /TSCHÜSS/i   # notice ß => SS
    "Σίσυφος" =~ /ΣΊΣΥΦΟΣ/i   # notice Σ,σ,ς sameness

=head2 ℞ 21: Unicode case-insensitive comparisons

Also available in the CPAN L<Unicode::CaseFold> module,
the new C<fc> “foldcase” function from v5.16 grants
access to the same Unicode casefolding as the C</i>
pattern modifier has always used:

    use feature "fc"; # fc() function is from v5.16

    # sort case-insensitively
    my @sorted = sort { fc($a) cmp fc($b) } @list;

    # both are true:
    fc("tschüß")  eq fc("TSCHÜSS")
    fc("Σίσυφος") eq fc("ΣΊΣΥΦΟΣ")

=head2 ℞ 22: Match Unicode linebreak sequence in regex

A Unicode linebreak matches the two-character CRLF
grapheme or any of seven vertical whitespace characters.
Good for dealing with textfiles coming from different
operating systems.
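
As a minimal sketch of why this matters, using only core Perl and an
illustrative sample string: splitting on C<\R> treats CRLF as a single break,
so mixed line endings produce no empty fields.

```perl
use v5.14;
use warnings;

# one string mixing Unix, Windows, old Mac, NEL, and LINE SEPARATOR endings
my $text = "unix\nwindows\r\nold mac\rnel\x{85}ls\x{2028}end";

# split on any linebreak; CRLF counts as one break, not two
my @lines = split /\R/, $text;

say scalar(@lines), " lines: ", join("|", @lines);
```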

    \R

    s/\R/\n/g;  # normalize all linebreaks to \n

=head2 ℞ 23: Get character category

Find the general category of a numeric codepoint.

    use Unicode::UCD qw(charinfo);
    my $cat = charinfo(0x3A3)->{category};  # "Lu"

=head2 ℞ 24: Disabling Unicode-awareness in builtin charclasses

Disable C<\w>, C<\b>, C<\s>, C<\d>, and the POSIX
classes from working correctly on Unicode either in this
scope, or in just one regex.

    use v5.14;
    use re "/a";

    # OR

    my($num) = $str =~ /(\d+)/a;

Or use specific un-Unicode properties, like C<\p{ahex}>
and C<\p{POSIX_Digit}>. Properties still work normally
no matter what charset modifiers (C</d /u /l /a /aa>)
are in effect.

=head2 ℞ 25: Match Unicode properties in regex with \p, \P

These all match a single codepoint with the given
property. Use C<\P> in place of C<\p> to match
one codepoint lacking that property.

    \pL, \pN, \pS, \pP, \pM, \pZ, \pC
    \p{Sk}, \p{Ps}, \p{Lt}
    \p{alpha}, \p{upper}, \p{lower}
    \p{Latin}, \p{Greek}
    \p{script_extensions=Latin}, \p{scx=Greek}
    \p{East_Asian_Width=Wide}, \p{EA=W}
    \p{Line_Break=Hyphen}, \p{LB=HY}
    \p{Numeric_Value=4}, \p{NV=4}

=head2 ℞ 26: Custom character properties

Define at compile-time your own custom character
properties for use in regexes.

    # using private-use characters
    sub In_Tengwar { "E000\tE07F\n" }

    if (/\p{In_Tengwar}/) { ... }

    # blending existing properties
    sub Is_GraecoRoman_Title {<<'END_OF_SET'}
    +utf8::IsLatin
    +utf8::IsGreek
    &utf8::IsTitle
    END_OF_SET

    if (/\p{Is_GraecoRoman_Title}/) { ... }

=head2 ℞ 27: Unicode normalization

Typically render into NFD on input and NFC on output. Using NFKC or NFKD
functions improves recall on searches, assuming you've already done the
same to the text to be searched.
Note that this is about much more than just pre-combined
compatibility glyphs; it also reorders marks according to their
canonical combining classes and weeds out singletons.

    use Unicode::Normalize;
    my $nfd  = NFD($orig);
    my $nfc  = NFC($orig);
    my $nfkd = NFKD($orig);
    my $nfkc = NFKC($orig);

=head2 ℞ 28: Convert non-ASCII Unicode numerics

Unless you’ve used C</a> or C</aa>, C<\d> matches more than
ASCII digits only, but Perl’s implicit string-to-number
conversion does not currently recognize these. Here’s how to
convert such strings manually.

    use v5.14;  # needed for num() function
    use Unicode::UCD qw(num);
    my $str = "got Ⅻ and ४५६७ and ⅞ and here";
    my @nums = ();
    while ($str =~ /(\d+|\pN)/g) {  # not just ASCII!
        push @nums, num($1);
    }
    say "@nums";   # 12 4567 0.875

    use charnames qw(:full);
    my $nv = num("\N{RUMI DIGIT ONE}\N{RUMI DIGIT TWO}");

=head2 ℞ 29: Match Unicode grapheme cluster in regex

Programmer-visible “characters” are codepoints matched by C</./s>,
but user-visible “characters” are graphemes matched by C</\X/>.

    # Find vowel *plus* any combining diacritics, underlining, etc.
    use Unicode::Normalize;
    my $nfd = NFD($orig);
    $nfd =~ / (?=[aeiou]) \X /xi

=head2 ℞ 30: Extract by grapheme instead of by codepoint (regex)

    # match and grab five first graphemes
    my($first_five) = $str =~ /^ ( \X{5} ) /x;

=head2 ℞ 31: Extract by grapheme instead of by codepoint (substr)

    # cpan -i Unicode::GCString
    use Unicode::GCString;
    my $gcs = Unicode::GCString->new($str);
    my $first_five = $gcs->substr(0, 5);

=head2 ℞ 32: Reverse string by grapheme

Reversing by codepoint messes up diacritics, mistakenly converting
C<crème brûlée> into C<éel̂urb em̀erc> instead of into C<eélûrb emèrc>;
so reverse by grapheme instead.
Both these approaches work
right no matter what normalization the string is in:

    $str = join("", reverse $str =~ /\X/g);

    # OR: cpan -i Unicode::GCString
    use Unicode::GCString;
    $str = reverse Unicode::GCString->new($str);

=head2 ℞ 33: String length in graphemes

The string C<brûlée> has six graphemes but up to eight codepoints.
This counts by grapheme, not by codepoint:

    my $str = "brûlée";
    my $count = 0;
    while ($str =~ /\X/g) { $count++ }

    # OR: cpan -i Unicode::GCString
    use Unicode::GCString;
    my $gcs = Unicode::GCString->new($str);
    my $count = $gcs->length;

=head2 ℞ 34: Unicode column-width for printing

Perl’s C<printf>, C<sprintf>, and C<format> think all
codepoints take up 1 print column, but many take 0 or 2.
Here to show that normalization makes no difference,
we print out both forms:

    use Unicode::GCString;
    use Unicode::Normalize;

    my @words = qw/crème brûlée/;
    @words = map { NFC($_), NFD($_) } @words;

    for my $str (@words) {
        my $gcs = Unicode::GCString->new($str);
        my $cols = $gcs->columns;
        my $pad = " " x (10 - $cols);
        say $str, $pad, " |";
    }

generates this to show that it pads correctly no matter
the normalization:

    crème      |
    crème      |
    brûlée     |
    brûlée     |

=head2 ℞ 35: Unicode collation

Text sorted by numeric codepoint follows no reasonable alphabetic order;
use the UCA for sorting text.

    use Unicode::Collate;
    my $col = Unicode::Collate->new();
    my @list = $col->sort(@old_list);

See the I<ucsort> program from the L<Unicode::Tussle> CPAN module
for a convenient command-line interface to this module.

=head2 ℞ 36: Case- I<and> accent-insensitive Unicode sort

Specify a collation strength of level 1 to ignore case and
diacritics, only looking at the basic character.
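
As a small sketch of what level 1 means in practice, using the collator's
C<eq> and C<cmp> methods with illustrative sample strings: text differing only
in case and accents compares equal, while different base letters still order
normally.

```perl
use v5.14;
use warnings;
use Unicode::Collate;

my $col = Unicode::Collate->new(level => 1);

# these differ only in case and accents, so level 1 treats them as equal
my $equal = $col->eq("r\x{e9}sum\x{e9}", "RESUME") ? 1 : 0;

# different base letters still sort apart: "e" comes before "o"
my $order = $col->cmp("resume", "rosette");

say "equal=$equal order=$order";
```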

    use Unicode::Collate;
    my $col = Unicode::Collate->new(level => 1);
    my @list = $col->sort(@old_list);

=head2 ℞ 37: Unicode locale collation

Some locales have special sorting rules.

    # either use v5.12, OR: cpan -i Unicode::Collate::Locale
    use Unicode::Collate::Locale;
    my $col = Unicode::Collate::Locale->new(locale => "de__phonebook");
    my @list = $col->sort(@old_list);

The I<ucsort> program mentioned above accepts a C<--locale> parameter.

=head2 ℞ 38: Making C<cmp> work on text instead of codepoints

Instead of this:

    @srecs = sort {
        $b->{AGE}   <=>  $a->{AGE}
                    ||
        $a->{NAME}  cmp  $b->{NAME}
    } @recs;

Use this:

    use Unicode::Collate;
    my $coll = Unicode::Collate->new();
    for my $rec (@recs) {
        $rec->{NAME_key} = $coll->getSortKey( $rec->{NAME} );
    }
    @srecs = sort {
        $b->{AGE}       <=>  $a->{AGE}
                        ||
        $a->{NAME_key}  cmp  $b->{NAME_key}
    } @recs;

=head2 ℞ 39: Case- I<and> accent-insensitive comparisons

Use a collator object to compare Unicode text by character
instead of by codepoint.

    use Unicode::Collate;
    my $es = Unicode::Collate->new(
        level => 1,
        normalization => undef
    );

    # now both are true:
    $es->eq("García",  "GARCIA" );
    $es->eq("Márquez", "MARQUEZ");

=head2 ℞ 40: Case- I<and> accent-insensitive locale comparisons

Same, but in a specific locale.

    my $de = Unicode::Collate::Locale->new(
        locale => "de__phonebook",
    );

    # now this is true:
    $de->eq("tschüß", "TSCHUESS");  # notice ü => UE, ß => SS

=head2 ℞ 41: Unicode linebreaking

Break up text into lines according to Unicode rules.

    # cpan -i Unicode::LineBreak
    use Unicode::LineBreak;
    use charnames qw(:full);

    my $para = "This is a super\N{HYPHEN}long string. " x 20;
    my $fmt = Unicode::LineBreak->new;
    print $fmt->break($para), "\n";

=head2 ℞ 42: Unicode text in DBM hashes, the tedious way

Using a regular Perl string as a key or value for a DBM
hash will trigger a wide character exception if any codepoints
won’t fit into a byte. Here’s how to manually manage the translation:

    use DB_File;
    use Encode qw(encode decode);
    tie %dbhash, "DB_File", "pathname";

    # STORE

    # assume $uni_key and $uni_value are abstract Unicode strings
    my $enc_key   = encode("UTF-8", $uni_key, 1);
    my $enc_value = encode("UTF-8", $uni_value, 1);
    $dbhash{$enc_key} = $enc_value;

    # FETCH

    # assume $uni_key holds a normal Perl string (abstract Unicode)
    my $enc_key   = encode("UTF-8", $uni_key, 1);
    my $enc_value = $dbhash{$enc_key};
    my $uni_value = decode("UTF-8", $enc_value, 1);

=head2 ℞ 43: Unicode text in DBM hashes, the easy way

Here’s how to implicitly manage the translation; all encoding
and decoding is done automatically, just as with streams that
have a particular encoding attached to them:

    use DB_File;
    use DBM_Filter;

    my $dbobj = tie %dbhash, "DB_File", "pathname";
    $dbobj->Filter_Push("utf8");  # this is the magic bit;
                                  # filters both keys and values

    # STORE

    # assume $uni_key and $uni_value are abstract Unicode strings
    $dbhash{$uni_key} = $uni_value;

    # FETCH

    # $uni_key holds a normal Perl string (abstract Unicode)
    my $uni_value = $dbhash{$uni_key};

=head2 ℞ 44: PROGRAM: Demo of Unicode collation and printing

Here’s a full program showing how to make use of locale-sensitive
sorting, Unicode casing, and managing print widths when some of the
characters take up zero or two columns, not just one column each time.
When run, the following program produces this nicely aligned output:

    Crème Brûlée....... €2.00
    Éclair............. €1.60
    Fideuà............. €4.20
    Hamburger.......... €6.00
    Jamón Serrano...... €4.45
    Linguiça........... €7.00
    Pâté............... €4.15
    Pears.............. €2.00
    Pêches............. €2.25
    Smørbrød........... €5.75
    Spätzle............ €5.50
    Xoriço............. €3.00
    Γύρος.............. €6.50
    막걸리............. €4.00
    おもち............. €2.65
    お好み焼き......... €8.00
    シュークリーム..... €1.85
    寿司............... €9.99
    包子............... €7.50

Here's that program.

    #!/usr/bin/env perl
    # umenu - demo sorting and printing of Unicode food
    #
    # (obligatory and increasingly long preamble)
    #
    use v5.36;
    use utf8;
    use warnings  qw(FATAL utf8);    # fatalize encoding faults
    use open      qw(:std :encoding(UTF-8)); # undeclared streams in UTF-8
    use charnames qw(:full :short);  # unneeded in v5.16

    # std modules
    use Unicode::Normalize;          # std perl distro as of v5.8
    use List::Util qw(max);          # std perl distro as of v5.10
    use Unicode::Collate::Locale;    # std perl distro as of v5.14

    # cpan modules
    use Unicode::GCString;           # from CPAN

    my %price = (
        "γύρος"             => 6.50, # gyros
        "pears"             => 2.00, # like um, pears
        "linguiça"          => 7.00, # spicy sausage, Portuguese
        "xoriço"            => 3.00, # chorizo sausage, Catalan
        "hamburger"         => 6.00, # burgermeister meisterburger
        "éclair"            => 1.60, # dessert, French
        "smørbrød"          => 5.75, # sandwiches, Norwegian
        "spätzle"           => 5.50, # Bayerisch noodles, little sparrows
        "包子"              => 7.50, # bao1 zi5, steamed pork buns, Mandarin
        "jamón serrano"     => 4.45, # country ham, Spanish
        "pêches"            => 2.25, # peaches, French
        "シュークリーム"    => 1.85, # cream-filled pastry like eclair
        "막걸리"            => 4.00, # makgeolli, Korean rice wine
        "寿司"              => 9.99, # sushi, Japanese
        "おもち"            => 2.65, # omochi, rice cakes, Japanese
        "crème brûlée"      => 2.00, # crema catalana
        "fideuà"            => 4.20, # more noodles, Valencian
                                     # (Catalan=fideuada)
        "pâté"              => 4.15, # gooseliver paste, French
        "お好み焼き"        => 8.00, # okonomiyaki, Japanese
    );

    my $width = 5 + max map { colwidth($_) } keys %price;

    # So the Asian stuff comes out in an order that someone
    # who reads those scripts won't freak out over; the
    # CJK stuff will be in JIS X 0208 order that way.
    my $coll = Unicode::Collate::Locale->new(locale => "ja");

    for my $item ($coll->sort(keys %price)) {
        print pad(entitle($item), $width, ".");
        printf " €%.2f\n", $price{$item};
    }

    sub pad ($str, $width, $padchar) {
        return $str . ($padchar x ($width - colwidth($str)));
    }

    sub colwidth ($str) {
        return Unicode::GCString->new($str)->columns;
    }

    sub entitle ($str) {
        $str =~ s{ (?=\pL)(\S)     (\S*) }
                 { ucfirst($1) . lc($2)  }xge;
        return $str;
    }

=head1 SEE ALSO

See these manpages, some of which are CPAN modules:
L<perlunicode>, L<perluniprops>,
L<perlre>, L<perlrecharclass>,
L<perluniintro>, L<perlunitut>, L<perlunifaq>,
L<PerlIO>, L<DB_File>, L<DBM_Filter>, L<DBM_Filter::utf8>,
L<Encode>, L<Encode::Locale>,
L<Unicode::UCD>,
L<Unicode::Normalize>,
L<Unicode::GCString>, L<Unicode::LineBreak>,
L<Unicode::Collate>, L<Unicode::Collate::Locale>,
L<Unicode::Unihan>,
L<Unicode::CaseFold>,
L<Unicode::Tussle>,
L<Lingua::JA::Romanize::Japanese>,
L<Lingua::ZH::Romanize::Pinyin>,
L<Lingua::KO::Romanize::Hangul>.

The L<Unicode::Tussle> CPAN module includes many programs
to help with working with Unicode, including
these programs to fully or partly replace standard utilities:
I<tcgrep> instead of I<egrep>,
I<uniquote> instead of I<cat -v> or I<hexdump>,
I<uniwc> instead of I<wc>,
I<unilook> instead of I<look>,
I<unifmt> instead of I<fmt>,
and
I<ucsort> instead of I<sort>.
For exploring Unicode character names and character properties,
see its I<uniprops>, I<unichars>, and I<uninames> programs.
It also supplies these programs, all of which are general filters that do Unicode-y things:
I<unititle> and I<unicaps>;
I<uniwide> and I<uninarrow>;
I<unisupers> and I<unisubs>;
I<nfd>, I<nfc>, I<nfkd>, and I<nfkc>;
and I<uc>, I<lc>, and I<tc>.

Finally, see the published Unicode Standard (page numbers are from version
6.0.0), including these specific annexes and technical reports:

=over

=item §3.13 Default Case Algorithms, page 113;
§4.2 Case, pages 120–122;
Case Mappings, pages 166–172, especially Caseless Matching starting on page 170.

=item UAX #44: Unicode Character Database

=item UTS #18: Unicode Regular Expressions

=item UAX #15: Unicode Normalization Forms

=item UTS #10: Unicode Collation Algorithm

=item UAX #29: Unicode Text Segmentation

=item UAX #14: Unicode Line Breaking Algorithm

=item UAX #11: East Asian Width

=back

=head1 AUTHOR

Tom Christiansen E<lt>tchrist@perl.comE<gt> wrote this, with occasional
kibitzing from Larry Wall and Jeffrey Friedl in the background.

=head1 COPYRIGHT AND LICENCE

Copyright © 2012 Tom Christiansen.

This program is free software; you may redistribute it and/or modify it
under the same terms as Perl itself.

Most of these examples are taken from the current edition of the “Camel Book”;
that is, from the 4ᵗʰ Edition of I<Programming Perl>, Copyright © 2012 Tom
Christiansen I<et al.>, 2012-02-13 by O’Reilly Media. The code itself is
freely redistributable, and you are encouraged to transplant, fold,
spindle, and mutilate any of the examples in this manpage however you please
for inclusion into your own programs without any encumbrance whatsoever.
Acknowledgement via code comment is polite but not required.

=head1 REVISION HISTORY

v1.0.0 – first public release, 2012-02-27