original commit log:
=====
I had originally suspected the parsing of ctype definition files as being
the source of the ctype flag mis-definitions, but it wasn't. In the
process, I simplified the cc_list parsing so I'm committing the no-impact
improvement separately. It removes some parsing redundancies and
won't parse partial range definitions anymore.
====
Submitted by: marino
Obtained from: Dragonfly
MFC after: 1 month
This commit is from John Marino in dragonfly with the following commit log:
====
This was a CTYPE encoding error involving consecutive points of the same
ctype. It was reported by myself to Illumos over a year ago but I was
unsure if it was only happening on BSD. Given the cause, the bug is also
present on Illumos.
Basically, if consecutive points were of the exact same ctype, they would
be defined as a range regardless. For example, all of these would be
considered equivalent:
<A> ... <C>, <H> (converts to <A> .. <H>)
<A>, <B>, <H> (converts to <A> .. <H>)
<A>, <J> ... <H> (converts to <A> .. <H>)
So all the points that shouldn't have been defined got "bridged" by the
extreme points.
The effects were recently reported to FreeBSD on PR 213013. There are
countless places were the ctype flags are misdefined, so this is a major
fix that has to be MFC'd.
====
This reveals a bad change I did on the testsuite: while 0x07FF is a valid
unicode it is not used yet (reserved for future use)
PR: 213013
Submitted by: marino@
Reported by: Kurtis Rader <krader@skepticism.us>
Obtained from: Dragonfly
MFC after: 1 month
The first argument of calloc(3) should be an ordinal type, and the
second a size: split a multiplication to make better use of calloc(3)
and detect overflows.
Do some other re-ordering and style fixes while here.
MFC after: 3 weeks
Illumos recently included space in 'print' class. We already had
this but the code had slight sorting differences. Move it some
lines up to reduce diffs with Illumos.
No functional change.
Reference:
https://illumos.org/issues/5227
These are no longer needed after the recent 'beforebuild: depend' changes
and hooking DIRDEPS_BUILD into a subset of FAST_DEPEND which supports
skipping 'make depend'.
Sponsored by: EMC / Isilon Storage Division
The output of "locale charmap" is identical to the result of
nl_langinfo (CODESET) for any given locale. The logic for returning the
codeset was very simplistic. It just returned portion of the locale name
after the period (e.g. en_FR.ISO8859-1 returned "ISO8859-1").
When softlinks were added to locales, this broke. e.g.:
en_US returned ""
en_FR.UTF8 returned "UTF8"
en_FR.UTF-8 returned "UTF-8"
zh_Hant_HK.Big5HKSCS returned "Big5HKSCS"
zh_Hant_TW.Big5 returned "Big5"
es_ES@euro returned ""
In order to fix this properly, the named locale cannot be used to
determine the encoding. This information was almost available in the
rune data. Unfortunately, all the single byte encodings were listed
as "NONE" encoding.
So I adjusted localedef tool to provide more information about the
encoding. For example, instead of "NONE", the LC_CTYPE used by
fr_FR.ISO8859-15 is now encoded as "NONE:ISO8859-15". The locale
handlers now check if the first four characters of the encoding is
"NONE" and if so, treats it as a single-byte encoding.
The nl_langinfo handling of CODESET was adjusting accordingly. Now the
following is returned:
en_US returns "ISO8859-1"
fr_FR.UTF8 returns "UTF-8"
fr_FR.UTF-8 returns "UTF-8"
zh_Hant_HK.Big5HKSCS returns "Big5"
zh_Hant_TW.Big5 returns "Big5"
es_ES@euro returns "ISO8859-15"
as before, "C" and "POSIX" locales return "US-ASCII". This is a big
improvement. The result of nl_langinfo can never be a zero-length
string and it will always exclusively one of the values of the
character maps of /usr/src/tools/tools/locale/etc/final-maps.
Submitted by: marino
Obtained from: DragonflyBSD
of hexidecimal numbers) are all considered "numbers". (Note that while
all digits are numbers, not all numbers are digits).
Enhance localedef to automatically set the "number" characteristic when
it encounters a digit or xdigit definition. This fixes malfunctionning
isalnum(3)
Obtained from: DragonflyBSD
By having space automatically classified as "print" type, we can
eliminate the print section from ctype src files completely (they
are just "graph" plus "<space>".
Obtained from: Dragonfly
FreeBSD extended ctypes to include numbers (e.g. isnumber()) but never
actually implemented it. The isnumber() function was equivalent to the
isdigit() function in every case.
Now that DragonFly's ctype source files have number definitions, the
number ctype can finally be implemented. It's given a new flag _CTYPE_N.
The isalnum() and iswalnum() functions have been changed to use this
flag rather than the _CTYPE_D digit flag.
While isalnum(), isnumber(), and their wide equivalents now return
different values in locale cases, the ishexnumber() and iswhexnumber()
functions are unchanged. They are still aliases for isxdigit() and
iswxdigit().
Also change ctype.h for isdigit and isxdigit to use sbistype like the
other functions.
Obtained from: dragonfly
The localedef tool can read entire (and unmodified) CLDR posix definition
files, and generate all 6 LC categories: LC_COLLATE, LC_CTYPE, LC_TIME,
LC_NUMERIC, LC_MONETARY and LC_MESSAGES.
This tool has a long history with Solaris. The Nexenta developers
modified it to read CLDR files and created the much richer collation
formats. The libc collation functions have to be modified to read the
new format (called "BSD-1.0") and to handle the new data structures.
The result will be that locale-sensitive tools and functions will now
properly sort multibyte and unicode strings.
Obtained from: Dragonfly