# Perl should compile and reasonably run any version of Unicode.  That doesn't
# mean that the test suite will run without showing errors.  A few of the
# very Unicode-specific test files have been modified to account for different
# versions, but most have not.  For example, some tests use characters that
# aren't encoded in all Unicode versions; others have hard-coded the General
# Categories that were correct at the time the test was written.  Perl itself
# will not compile under Unicode releases prior to 3.0 without a simple change to
# Unicode::Normalize.  mktables contains instructions for this, as well as other
# hints for using older Unicode versions.

# The *.txt files were copied from

# 	ftp://www.unicode.org/Public/UNIDATA

# (which always points to the latest version) with subdirectories 'extracted' and
# 'auxiliary'.  Older versions are located under Public with an appropriate name.

# The Unihan files were not included due to space considerations.  Also NOT
# included were any *.html files.  It is possible to add the Unihan files, and
# edit mktables (see instructions near its beginning) to look at them.

# The file named 'version' should exist and be a single line with the Unicode
# version, like:
# 5.2.0

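# A minimal sketch of a sanity check for the 'version' file described above.
# The function name check_version_file is hypothetical (it is not part of
# Perl or mktables), and the MAJOR.MINOR.UPDATE pattern is an assumption
# based on the 5.2.0 example.

```shell
# Hypothetical helper: succeed only if FILE (default: ./version) is exactly
# one line of the form MAJOR.MINOR.UPDATE, for example 5.2.0.
check_version_file() {
    file=${1:-version}
    [ -f "$file" ] || { echo "$file: missing" >&2; return 1; }
    # awk counts a final line even when it lacks a trailing newline.
    lines=$(awk 'END { print NR }' "$file")
    [ "$lines" -eq 1 ] || {
        echo "$file: expected 1 line, found $lines" >&2
        return 1
    }
    grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+$' "$file" || {
        echo "$file: not a MAJOR.MINOR.UPDATE version" >&2
        return 1
    }
}
```

# It could be called as 'check_version_file version' before running mktables.
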
# To be 8.3 filesystem friendly, the names of some of the input files have been
# changed from the values that are in the Unicode DB.  Not all of the Test
# files are currently used, so some may not be present, and the corresponding
# mv's can fail.  The .html Test files are not touched.

mv PropertyValueAliases.txt PropValueAliases.txt
mv NamedSequencesProv.txt NamedSqProv.txt
mv NormalizationTest.txt NormTest.txt
mv DerivedAge.txt DAge.txt
mv DerivedCoreProperties.txt DCoreProperties.txt
mv DerivedNormalizationProps.txt DNormalizationProps.txt

# Some early releases don't have the extracted directory, and hence these files
# should be moved to it.
mkdir extracted 2>/dev/null
mv DerivedBidiClass.txt DerivedBinaryProperties.txt extracted 2>/dev/null
mv DerivedCombiningClass.txt DerivedDecompositionType.txt extracted 2>/dev/null
mv DerivedEastAsianWidth.txt DerivedGeneralCategory.txt extracted 2>/dev/null
mv DerivedJoiningGroup.txt DerivedJoiningType.txt extracted 2>/dev/null
mv DerivedLineBreak.txt DerivedNumericType.txt DerivedNumericValues.txt extracted 2>/dev/null

mv extracted/DerivedBidiClass.txt extracted/DBidiClass.txt
mv extracted/DerivedBinaryProperties.txt extracted/DBinaryProperties.txt
mv extracted/DerivedCombiningClass.txt extracted/DCombiningClass.txt
mv extracted/DerivedDecompositionType.txt extracted/DDecompositionType.txt
mv extracted/DerivedEastAsianWidth.txt extracted/DEastAsianWidth.txt
mv extracted/DerivedGeneralCategory.txt extracted/DGeneralCategory.txt
mv extracted/DerivedJoiningGroup.txt extracted/DJoinGroup.txt
mv extracted/DerivedJoiningType.txt extracted/DJoinType.txt
mv extracted/DerivedLineBreak.txt extracted/DLineBreak.txt
mv extracted/DerivedNumericType.txt extracted/DNumType.txt
mv extracted/DerivedNumericValues.txt extracted/DNumValues.txt

mv auxiliary/GraphemeBreakTest.txt auxiliary/GCBTest.txt
mv auxiliary/LineBreakTest.txt auxiliary/LBTest.txt
mv auxiliary/SentenceBreakTest.txt auxiliary/SBTest.txt
mv auxiliary/WordBreakTest.txt auxiliary/WBTest.txt

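# Because some of the files renamed above may be absent, a wrapper that skips
# missing files can make the output easier to read.  This is only a sketch;
# mv_if_present is a hypothetical name, not something shipped with Perl.

```shell
# Hypothetical helper: rename SRC to DST only when SRC exists, so that a
# missing optional file is noted on stderr instead of producing an mv error.
mv_if_present() {
    if [ -f "$1" ]; then
        mv "$1" "$2"
    else
        echo "skipping $1: not present" >&2
    fi
}
```

# For example: mv_if_present auxiliary/LineBreakTest.txt auxiliary/LBTest.txt
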
# If you have the Unihan database (5.2 and above), you should also do the
# following:

mv Unihan_DictionaryIndices.txt UnihanIndicesDictionary.txt
mv Unihan_DictionaryLikeData.txt UnihanDataDictionaryLike.txt
mv Unihan_IRGSources.txt UnihanIRGSources.txt
mv Unihan_NumericValues.txt UnihanNumericValues.txt
mv Unihan_OtherMappings.txt UnihanOtherMappings.txt
mv Unihan_RadicalStrokeCounts.txt UnihanRadicalStrokeCounts.txt
mv Unihan_Readings.txt UnihanReadings.txt
mv Unihan_Variants.txt UnihanVariants.txt

# If you download everything, the names of files that are not used by mktables
# are not changed by the above, and hence may not work correctly as-is on 8.3
# filesystems.

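# One way to spot leftover names that could cause trouble is sketched below.
# It assumes that "8.3 friendly" means the names in a directory stay distinct,
# ignoring case, once the base is cut to 8 characters and the extension to 3;
# that reading is inferred from the renames above rather than stated here.
# find_83_clashes is a hypothetical name.

```shell
# Hypothetical helper: print each truncated 8.3 name that more than one
# regular file directly inside DIR (default: .) would map to.
find_83_clashes() {
    dir=${1:-.}
    for path in "$dir"/*; do
        [ -f "$path" ] || continue
        name=${path##*/}
        case $name in
            *.*) base=${name%.*}; ext=${name##*.} ;;
            *)   base=$name; ext= ;;
        esac
        # %.8s and %.3s truncate the base and the extension.
        printf '%.8s.%.3s\n' "$base" "$ext"
    done | tr '[:upper:]' '[:lower:]' | sort | uniq -d
}
```

# Run it once per directory (., extracted, auxiliary); no output means no
# clash was found under the stated assumption.
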
# mktables is used to generate the tables used by the rest of Perl.  It will
# warn you about any *.txt files in the directory substructure that it doesn't
# know about.  You should remove any so-identified, or edit mktables to add
# them to its lists to process.  You can run
#
#    mktables -globlist
#
# to have it try to process these tables generically.
#
# FOR PUMPKINS
#
# The files are inter-related.  If you take the latest UnicodeData.txt, for
# example, but leave the older versions of other files, there can be subtle
# problems.  So get everything available from Unicode, and delete those which
# aren't needed.
#
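# A rough way to look for files left over from another Unicode version is
# sketched here.  It assumes that a data file's first line is a header of
# the form "# Name-X.Y.Z.txt"; files that have no such header will be
# listed as well and need a manual look.  list_version_mismatches is a
# hypothetical name.

```shell
# Hypothetical helper: list the *.txt files under DIR (default: .) whose
# first line does not mention the version recorded in DIR/version.
list_version_mismatches() {
    dir=${1:-.}
    want=$(head -n 1 "$dir/version") || return 1
    find "$dir" -name '*.txt' -type f | sort | while IFS= read -r f; do
        # -F: fixed string; -e: the pattern starts with a dash.
        head -n 1 "$f" | grep -Fq -e "-$want" || printf '%s\n' "$f"
    done
}
```

# An empty result only means that every header mentions the expected version;
# it does not prove that the set of files is complete.
#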
# When moving to a new version of Unicode, you need to update 'version' by hand:
#
#	p4 edit version
# 	...
#
# You should look in the Unicode release notes (which are probably towards the
# bottom of http://www.unicode.org/reports/tr44/) to see if any properties have
# newly been moved to be Obsolete, Deprecated, or Stabilized.  The full names
# for these should be added to the respective lists near the beginning of
# mktables, using an 'if' to add them for just this Unicode version going
# forward, so that mktables can continue to be used for earlier Unicode
# versions.
#
# When putting out a new Perl release, consider whether any of the Deprecated
# properties should be moved to Suppressed.
#
# perlrecharclass.pod has a list of all the characters that are white space,
# which needs to be updated if there are changes.  A quick way to check if
# there have been changes would be to see if the number of such characters
# listed in perluniprops.pod (generated by running mktables) for the property
# \p{White_Space} is no longer 26.  Further investigation would then be
# necessary to classify the new characters as horizontal or vertical.
#
# The code in regexec.c for the \X match construct is intimately tied to the
# regular expression in UAX #29 (http://www.unicode.org/reports/tr29/).  You
# should see if it has changed, and if so regexec.c should be modified.  The
# current one is
# ( CRLF
# | Prepend* ( Hangul-syllable | !Control )
#   ( Grapheme_Extend | Spacing_Mark)*
# | . )
#
# mktables has many checks to warn you if there are unexpected or novel things
# that it doesn't know how to handle.
#
# Module::CoreList should be changed to include the new release.
#
# Also, you should regenerate l1_char_class_tab.h by running
#
# perl regen/mk_L_charclass.pl
#
# and regenerate charclass_invlists.h by running
#
# perl regen/mk_invlists.pl
#
# Finally:
#
# 	p4 submit
#
# --
# jhi@iki.fi; updated by nick@ccl4.org, public@khwilliamson.com