Commit Graph

592 Commits

Author SHA1 Message Date
Andrey A. Chernov
e08c3b7c11 EUC-type encodings don't have single byte characters >= 128
This change should not be MFCed until new collate will be
MFCed first, because our old EUC tables have some hacks for
missing codesets.
2016-04-04 02:43:35 +00:00
Pedro F. Giffuni
45256214eb mbtowc(3): set errno to EILSEQ if an incomplete character is passed.
According to POSIX, The mbtowc() function shall fail if:
[EILSEQ] An invalid character sequence is detected.

Reviewed by:		bapt
Differential Revision:	https://reviews.freebsd.org/D5496

Obtained from:	OpenBSD (Ingo Schwarze)
MFC after:	1 month
2016-03-01 19:15:34 +00:00
Enji Cooper
0b5cc81d3b Link localeconv(3) to localeconv_l(3)
MFC after: 3 days
2015-11-25 09:12:30 +00:00
Baptiste Daroussin
87101cb572 return "US-ASCII" instead of "POSIX" for "C" and "POSIX" locales
as it used to be in previous version of the locales. Returning
"POSIX" has too many fallouts.
2015-11-10 08:11:27 +00:00
Baptiste Daroussin
403105944d nl_langinfo: Simplify case ladder
The NONE:US-ASCII case isn't necessary.  The "NONE:" case will handle
US-ASCII, so let's remove the redundant handling.

Submitted by:	marino
Obtained from:	DragonflyBSD
2015-11-09 22:29:47 +00:00
Baptiste Daroussin
22b87a3555 Readd ascii.c forgotten in r290618 2015-11-09 22:11:37 +00:00
Baptiste Daroussin
473aa0b7ee locales: Enforce US-ASCII encoding (limited to 7-bit)
The US-ASCII format was getting treated identically to POSIX.  It is
supposed to throw an ILSEQ errno if a value of 0x80 or greater is
encountered, so let's bring back the "ASCII" handling.

While here, change nl_codeset to return US-ASCII only when the encoding
really is "US-ASCII".  Before "C" and "POSIX" encoding returned this
string, so now they return "POSIX".

Discussed with:	ache
Submitted by:	marino
Obtained from:	DragonflyBSD
2015-11-09 22:06:22 +00:00
Baptiste Daroussin
e58504783b Fix mbtowc not setting EILSEQ on an Incomplete multibyte sequence for eucJP encoding 2015-11-02 22:56:24 +00:00
Baptiste Daroussin
d8ed03efe5 locales: Fix eucJP sorting (broken upstream?)
Sorting eucJP text with "sort" resulted in an illegal sequence while
"gsort" worked.  This was traced back to mbrtowc handling which was
broken for eucJP (probably eucCN, eucKR, and eucTW as well).  This
small fix took hours to figure out.  The OR operation to build the
wide character requires an unsigned character to work correctly.  The
euc wcrtowc conversion is probably broken upstream in Illumos as well.

Triggered by: misc/freebsd-doc-ja in ports (encoded in eucJP)

Submitted by:	marino
Obtained from:	DragonflyBSD
2015-11-01 21:02:30 +00:00
Baptiste Daroussin
d79cdd21de libc: Fix (and improve) nl_langinfo (CODESET)
The output of "locale charmap" is identical to the result of
nl_langinfo (CODESET) for any given locale.  The logic for returning the
codeset was very simplistic.  It just returned portion of the locale name
after the period (e.g. en_FR.ISO8859-1 returned "ISO8859-1").

When softlinks were added to locales, this broke.  e.g.:
   en_US returned ""
   en_FR.UTF8 returned "UTF8"
   en_FR.UTF-8 returned "UTF-8"
   zh_Hant_HK.Big5HKSCS returned "Big5HKSCS"
   zh_Hant_TW.Big5 returned "Big5"
   es_ES@euro returned ""

In order to fix this properly, the named locale cannot be used to
determine the encoding.  This information was almost available in the
rune data.  Unfortunately, all the single byte encodings were listed
as "NONE" encoding.

So I adjusted localedef tool to provide more information about the
encoding.  For example, instead of "NONE", the LC_CTYPE used by
fr_FR.ISO8859-15 is now encoded as "NONE:ISO8859-15".  The locale
handlers now check if the first four characters of the encoding is
"NONE" and if so, treats it as a single-byte encoding.

The nl_langinfo handling of CODESET was adjusting accordingly.  Now the
following is returned:
   en_US returns "ISO8859-1"
   fr_FR.UTF8 returns "UTF-8"
   fr_FR.UTF-8 returns "UTF-8"
   zh_Hant_HK.Big5HKSCS returns "Big5"
   zh_Hant_TW.Big5 returns "Big5"
   es_ES@euro returns "ISO8859-15"

as before, "C" and "POSIX" locales return "US-ASCII".  This is a big
improvement.  The result of nl_langinfo can never be a zero-length
string and it will always exclusively one of the values of the
character maps of /usr/src/tools/tools/locale/etc/final-maps.

Submitted by:	marino
Obtained from:	DragonflyBSD
2015-11-01 12:00:55 +00:00
Baptiste Daroussin
76e6db686e collate: Fix expansion substitions (broken upstream too)
Through testing, the user noted that some Cyrillic characters were not
sorting correctly, and this was confirmed.

After extensive testing and review, the localedef tool was eliminated
as the culprit.  The sustitutions were encoded correctly in LC_COLLATE.

The error was mainly in wcscoll where character expansions were
mishandled.  The main directive pass routines had to be written to
go back for a new collation value when the "state" variable was set.
Before pointers were being advanced, the second lookup was gettting
applied to the wrong character, etc.

The "eat expansion codes" section on collate.c also had a bug.  Later
own, the "state" variable logic was changed to only set if next
code was greater than zero (rather than >= 0).

Some additional cleanups got captured from previous work:
1) The previous commit moved the binary search comment from the
   correct location to a wrong location because it's wrong upstream
   in Illumos.  The comment has little value so I just removed it.
2) Don't check if pointers are null before freeing, this is
   redundant as free() handles null pointers.
3) The two binary search trees were standardized wrt initialization
4) On the binary search trees, a negative "high" exits rather than
   checking the table count again.

Submitted by:	marino
Obtained from:	DragonflyBSD
2015-10-23 23:24:03 +00:00
Baptiste Daroussin
332fe83717 libc/collate: minor tweaks / fix
The main "fix" here is properly setting a collate loading error for each
early return.  Tweaks include removing unnecessary null checks, adding
assertions (from Illumos) and a couple of variables to reduces code
differences and improve readability.  For normal use, there are no
functional changes here.

Obtained from:	DragonflyBSD, Illumos
2015-10-22 14:29:19 +00:00
Baptiste Daroussin
c25f5140e9 Include sys/*.h earlier
Reported by:	kib
2015-10-14 12:46:05 +00:00
Baptiste Daroussin
f5dde0166d Commit log from Dragonfly:
FreeBSD extended ctypes to include numbers (e.g. isnumber()) but never
actually implemented it.  The isnumber() function was equivalent to the
isdigit() function in every case.

Now that DragonFly's ctype source files have number definitions, the
number ctype can finally be implemented.  It's given a new flag _CTYPE_N.
The isalnum() and iswalnum() functions have been changed to use this
flag rather than the _CTYPE_D digit flag.

While isalnum(), isnumber(), and their wide equivalents now return
different values in locale cases, the ishexnumber() and iswhexnumber()
functions are unchanged.  They are still aliases for isxdigit() and
iswxdigit().

Also change ctype.h for isdigit and isxdigit to use sbistype like the
other functions.

Obtained from:	dragonfly
2015-10-13 20:43:49 +00:00
Baptiste Daroussin
becbad1f6e Merge from head 2015-10-13 19:44:36 +00:00
Craig Rodrigues
c83f3fc4b4 Use ANSI C prototypes. Eliminates -Wold-style-definition warnings. 2015-09-20 20:50:18 +00:00
Baptiste Daroussin
23a32822d2 Merge from HEAD 2015-08-25 20:14:50 +00:00
Ed Schouten
57c69b1478 Make UTF-8 parsing and generation more strict.
- in mbrtowc() we need to disallow codepoints above 0x10ffff.
- In wcrtomb() we need to disallow codepoints between 0xd800 and 0xdfff.

Reviewed by:	bapt
Differential Revision:	https://reviews.freebsd.org/D3399
2015-08-25 09:16:09 +00:00
Baptiste Daroussin
eaa94ab419 Fix typo 2015-08-09 12:20:22 +00:00
Baptiste Daroussin
28a20bb3f5 Use more asprintf
Plug memory leak introduced in previous asprintf addition
2015-08-09 12:13:30 +00:00
Baptiste Daroussin
b89704cee7 Use asprintf/free instead of snprintf 2015-08-09 11:50:50 +00:00
Baptiste Daroussin
5e4bbc69de Remove useless variable 2015-08-09 11:47:01 +00:00
Baptiste Daroussin
81eb7d7e4b Readd checking utf16 surrogates that are invalid in utf8 2015-08-09 10:36:25 +00:00
Baptiste Daroussin
a6d2922cbb Mark __collate_load_tables_l as static
Remove useless addition to Symbols.map
2015-08-09 10:24:24 +00:00
Baptiste Daroussin
536451f914 Fix typo
Fix bad location for include

Reported by:	jilles
2015-08-09 00:21:59 +00:00
Baptiste Daroussin
764a768e16 Merge from HEAD 2015-08-09 00:15:17 +00:00
Baptiste Daroussin
8bb93485fb Remove 5 and 6 bytes sequences which are illegal in UTF-8 space. (part2)
Per rfc3629 value greater than 0x10ffff should be rejected

Suggested by:	jilles
2015-08-09 00:06:56 +00:00
Baptiste Daroussin
c9d24bcfd5 Remove 5 and 6 bytes sequences which are illegal in UTF-8 space.
Per rfc3629 value greater than 0x10ffff should be rejected

Suggested by:	jilles
2015-08-08 23:59:15 +00:00
Baptiste Daroussin
d0a68f8d38 Fix typo 2015-08-08 23:17:10 +00:00
Baptiste Daroussin
7b2473410f Revamp CTYPE support (from Illumos & Dragonfly)
Obtained from:	Dragonfly
2015-08-08 18:22:14 +00:00
Baptiste Daroussin
2a6abeebef The collate functions within libc have been using version 1 and 1.2 of the
packed LC_COLLATE binary formats. These were generated with the colldef
tool, but the new LC_COLLATE files are going to be generated by the new
localedef tool using CLDR POSIX files as input.  The BSD-flavored
version of localedef identifies the format as "BSD 1.0".  Any
LC_COLLATE file with a different version will simply not be loaded, and
all LC* categories will get set to "C" (aka "POSIX") locale.

This work is based off of Nexenta's contribution to Illumos.
The integration with xlocale is John Marino's work for Dragonfly.

The following commits will enable localedef tool, disable the colldef
tool, add generated colldef directory, and finally remove colldef from
base.

The only difference with Dragonfly are:
- a few fixes to build with clang
- And identification of the flavor as "BSD 1.0" instead of "Dragonfly 4.4"

Obtained from:	Dragonfly
2015-08-07 23:41:26 +00:00
David Chisnall
710ec77e3a __xlocale_C_ctype should not be const. It contains a reference count that is modified by newlocale / duplocale / freelocale.
MFC after:	1 week
2015-04-24 10:21:20 +00:00
David Chisnall
58912ae7c3 Small changes to locale-related man pages.
Fix a missing .h and change the recommended include for the POSIX2008 functions from xlocale.h to locale.h.  Including xlocale.h is for legacy / Darwin compatibility so should not be encouraged.
2015-04-24 10:17:55 +00:00
Tijl Coosemans
1243a98e38 Remove the const qualifier from iconv(3) to comply with POSIX:
http://pubs.opengroup.org/onlinepubs/9699919799/functions/iconv.html

Adjust all code that calls iconv.

PR:		199099
Exp-run by:	antoine
MFC after:	2 weeks
2015-04-15 09:09:20 +00:00
Joel Dahl
4990a1c050 mdoc: improvements to SEE ALSO. 2014-12-27 08:31:52 +00:00
Pedro F. Giffuni
9908eab82e libc/locale: Remove a wrong comma.
This only had some effect when debugging.

Obtained from:	DragonflyBSD
MFC after:	3 days
2014-09-04 17:36:21 +00:00
Pedro F. Giffuni
0716c0ff7a minor perf enhancement for UTF-8
Reduce some duplicate code.

Reference:
https://www.illumos.org/issues/628

Obtained from:	Illumos
MFC after:	1 week
2014-07-04 22:39:39 +00:00
Pedro F. Giffuni
0f5132cd25 citrus: Avoid invalid code points.
From the OpenBSD log:
The UTF-8 decoder should not accept byte sequences which decode to unicode
code positions U+D800 to U+DFFF (UTF-16 surrogates), U+FFFE, and U+FFFF.

http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
http://unicode.org/faq/utf_bom.html#utf8-4

Reported by:	Stefan Sperling
Obtained from:	OpenBSD
MFC after:	5 days
2014-05-01 01:42:48 +00:00
Pedro F. Giffuni
97ecaa8907 citrus: Avoid invalid code points.
From the OpenBSD log:
The UTF-8 decoder should not accept byte sequences which decode to unicode
code positions U+D800 to U+DFFF (UTF-16 surrogates), U+FFFE, and U+FFFF.

http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
http://unicode.org/faq/utf_bom.html#utf8-4

Reported by:	Stefan Sperling
Obtained from:	OpenBSD
MFC after:	5 days
2014-04-29 15:25:57 +00:00
David Chisnall
8d07b7deff Fix an issue where the locale and rune locale could become out of sync,
causing mb* functions (and similar) to be called with the wrong data
(possibly a null pointer, causing a crash).

PR:		standards/188036
MFC after:	1 week
2014-04-02 11:10:46 +00:00
Marcel Moolenaar
8876613dc5 Replace use of ${.CURDIR} by ${LIBC_SRCTOP} and define ${LIBC_SRCTOP}
if not already defined. This allows building libc from outside of
lib/libc using a reach-over makefile.

A typical use-case is to build a standard ILP32 version and a COMPAT32
version in a single iteration by building the COMPAT32 version using a
reach-over makefile.

Obtained from:	Juniper Networks, Inc.
2014-03-04 02:19:39 +00:00
Peter Wemm
19dfd82d81 Replace the #define for "iconv" so it is for the function name instead of
a macro with parameters.  Remove a __DECONST hack and add consts instead
for gnu libiconv API compatability.  This makes it work with things like
devel/boost-libs that expects to use "iconv" as though it were a pointer.
2013-07-03 07:03:19 +00:00
Ed Schouten
49111f0092 Add libiconv based versions of *c16*() and *c32*().
I initially thought wchar_t was locale independent, but this seems to be
only the case on Linux. This means that we cannot depend on the *wc*()
routines to implement *c16*() and *c32*(). Instead, use the Citrus
libiconv that is part of libc.

I'll see if there is anything I can do to make the existing functions
somewhat useful in case the system is built without libiconv in the
nearby future. If not, I'll simply remove the broken implementations.

Reviewed by:	jilles, gabor
2013-06-03 17:17:56 +00:00
Ed Schouten
50c77c6e8b Add <uchar.h>.
The <uchar.h> header, part of C11, adds a small number of utility
functions for 16/32-bit "universal" characters, which may or may not be
UTF-16/32. As our wchar_t is already ISO 10646, simply add light-weight
wrappers around wcrtomb() and mbrtowc().

While there, also add (non-yet-standard) _l functions, similar to the
ones we already have for the other locale-dependent functions.

Reviewed by:	theraven
2013-05-21 19:59:37 +00:00
Sergey Kandaurov
424b842a5d Document that the return type is different from 1003.1-2008.
MFC after:	1 week
2013-05-04 17:21:44 +00:00
Sergey Kandaurov
4f79ce7b4c mdoc: missing comma in .Dd macro. 2013-05-04 17:06:47 +00:00
Sergey Kandaurov
67ff590b9e Also, add a missing period. 2013-05-03 13:27:13 +00:00
Sergey Kandaurov
cc7a693fa7 Remove an extra comma. 2013-05-03 12:45:45 +00:00
Sergey Kandaurov
88bbbe88d7 Remove the STANDARDS section.
querylocale is not part of IEEE Std 1003.1-2008.

MFC after:	3 days
2013-05-03 12:42:43 +00:00
Jilles Tjoelker
86bca8fb51 btowc(3), isblank(3): Correct prototypes for _l variants.
MFC after:	1 week
2013-03-27 21:31:40 +00:00