package encoding::warnings;
$encoding::warnings::VERSION = '0.12';

use strict;
use 5.007;

=head1 NAME

encoding::warnings - Warn on implicit encoding conversions

=head1 VERSION

This document describes version 0.11 of encoding::warnings, released
June 5, 2007.

=head1 SYNOPSIS

    use encoding::warnings; # or 'FATAL' to raise fatal exceptions

    utf8::encode($a = chr(20000));  # a byte-string (raw bytes)
    $b = chr(20000);                # a unicode-string (wide characters)

    # "Bytes implicitly upgraded into wide characters as iso-8859-1"
    $c = $a . $b;

=head1 DESCRIPTION

=head2 Overview of the problem

By default, there is a fundamental asymmetry in Perl's unicode model:
implicit upgrading from byte-strings to unicode-strings assumes that
they were encoded in I<ISO 8859-1 (Latin-1)>, but unicode-strings are
downgraded with UTF-8 encoding.  This happens because the first 256
codepoints in Unicode happen to agree with Latin-1.

However, this silent upgrading can easily cause problems if you happen
to mix unicode-strings with non-Latin1 data -- i.e. byte-strings encoded
in UTF-8 or other encodings.  The error will not manifest until the
combined string is written to output, at which point it is impossible
to see where the silent upgrading occurred.

=head2 Detecting the problem

This module simplifies the process of diagnosing such problems.  Just put
this line at the top of your main program:

    use encoding::warnings;

Afterwards, implicit upgrading of high-bit bytes will raise a warning.
For example: C<Bytes implicitly upgraded into wide characters as
iso-8859-1 at - line 7>.

However, strings composed purely of ASCII code points (C<0x00>..C<0x7F>)
will I<not> trigger this warning.

You can also make the warnings fatal by importing this module as:

    use encoding::warnings 'FATAL';

=head2 Solving the problem

Most of the time, this warning occurs when a byte-string is concatenated
with a unicode-string.  There are a number of ways to solve it:

=over 4

=item * Upgrade both sides to unicode-strings

If your program does not need compatibility with Perl 5.6 and earlier,
the recommended approach is to apply appropriate IO disciplines, so all
data in your program become unicode-strings.  See L<encoding>, L<open> and
L<perlfunc/binmode> for how.

=item * Downgrade both sides to byte-strings

The other way works too, especially if you are sure that all your data
are under the same encoding, or if compatibility with older versions
of Perl is desired.

You may downgrade strings with C<Encode::encode> and C<utf8::encode>.
See L<Encode> and L<utf8> for details.

=item * Specify the encoding for implicit byte-string upgrading

If you are confident that all byte-strings will be in a specific
encoding like UTF-8, I<and> need not support older versions of Perl,
use the C<encoding> pragma:

    use encoding 'utf8';

Similarly, this will silence warnings from this module, and preserve the
default behaviour:

    use encoding 'iso-8859-1';

However, note that C<use encoding> actually has three distinct effects:

=over 4

=item * PerlIO layers for B<STDIN> and B<STDOUT>

This is similar to what the L<open> pragma does.

=item * Literal conversions

This turns I<all> literal strings in your program into unicode-strings
(equivalent to a C<use utf8>), by decoding them using the specified
encoding.

=item * Implicit upgrading for byte-strings

This will silence warnings from this module, as shown above.

=back

Because literal conversions also work on empty strings, the result may
surprise some people:

    use encoding 'big5';

    my $byte_string = pack("C*", 0xA4, 0x40);
    print length $byte_string;  # 2 here.
    $byte_string .= "";         # concatenating with a unicode string...
    print length $byte_string;  # 1 here!

In other words, do not C<use encoding> unless you are certain that the
program will not deal with any raw, 8-bit binary data at all.

However, the C<Filter =E<gt> 1> flavor of C<use encoding> will I<not>
affect implicit upgrading for byte-strings, and is thus incapable of
silencing warnings from this module.  See L<encoding> for more details.

=back

=head1 CAVEATS

For Perl 5.9.4 or later, this module's effect is lexical.

For Perl versions prior to 5.9.4, this module affects the whole script,
rather than only its enclosing lexical block.

=cut

# Constants: indices into the blessed array installed in ${^ENCODING}.
sub ASCII  () { 0 }
sub LATIN1 () { 1 }
sub FATAL  () { 2 }

# Install a ${^ENCODING} handler if no other one is already in place.
sub import {
    my $class = shift;
    my $fatal = shift || '';

    local $@;
    return if ${^ENCODING} and ref(${^ENCODING}) ne $class;
    return unless eval { require Encode; 1 };

    my $ascii  = Encode::find_encoding('us-ascii') or return;
    my $latin1 = Encode::find_encoding('iso-8859-1') or return;

    # Have to undef explicitly here
    undef ${^ENCODING};

    # Install a warning handler for decode()
    my $decoder = bless(
        [
            $ascii,
            $latin1,
            (($fatal eq 'FATAL') ? 'Carp::croak' : 'Carp::carp'),
        ], $class,
    );

    no warnings 'deprecated';
    ${^ENCODING} = $decoder;
    use warnings 'deprecated';
    $^H{$class} = 1;
}

sub unimport {
    my $class = shift;
    $^H{$class} = undef;
    undef ${^ENCODING};
}

# Don't worry about source code literals.
sub cat_decode {
    my $self = shift;
    return $self->[LATIN1]->cat_decode(@_);
}

# Warn if the data is not purely US-ASCII.
sub decode {
    my $self = shift;

    DO_WARN: {
        # On 5.9.4 and later, skip the warning unless the calling
        # lexical scope has this module enabled in its hints hash.
        if ($] >= 5.009004) {
            my $hints = (caller(0))[10];
            $hints->{ref($self)} or last DO_WARN;
        }

        local $@;
        my $rv = eval { $self->[ASCII]->decode($_[0], Encode::FB_CROAK()) };
        return $rv unless $@;

        # $self->[FATAL] holds a sub name ('Carp::carp' or 'Carp::croak'),
        # so it is called as a symbolic reference.
        require Carp;
        no strict 'refs';
        $self->[FATAL]->(
            "Bytes implicitly upgraded into wide characters as iso-8859-1"
        );

    }

    return $self->[LATIN1]->decode(@_);
}

sub name { 'iso-8859-1' }

1;

__END__

=head1 SEE ALSO

L<perlunicode>, L<perluniintro>

L<open>, L<utf8>, L<encoding>, L<Encode>

=head1 AUTHORS

Audrey Tang

=head1 COPYRIGHT

Copyright 2004, 2005, 2006, 2007 by Audrey Tang E<lt>cpan@audreyt.orgE<gt>.

This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

See L<http://www.perl.com/perl/misc/Artistic.html>

=cut