Sorting eucJP text with "sort" resulted in an illegal sequence while
"gsort" worked. This was traced back to mbrtowc handling which was
broken for eucJP (probably eucCN, eucKR, and eucTW as well). This
small fix took hours to figure out. The OR operation to build the
wide character requires an unsigned character to work correctly. The
euc wcrtowc conversion is probably broken upstream in Illumos as well.
Triggered by: misc/freebsd-doc-ja in ports (encoded in eucJP)
Submitted by: marino
Obtained from: DragonflyBSD
load of _l suffixed versions of various standard library functions that use
the global locale, making them take an explicit locale parameter. Also
adds support for per-thread locales. This work was funded by the FreeBSD
Foundation.
Please test any code you have that uses the C standard locale functions!
Reviewed by: das (gdtoa changes)
Approved by: dim (mentor)
for wide characters locales in the argument range >= 0x80 - they may
return false positives.
Example 1: for UTF-8 locale we currently have:
iswspace(0xA0)==1 and isspace(0xA0)==1
(because iswspace() and isspace() are the same code)
but must have
iswspace(0xA0)==1 and isspace(0xA0)==0
(because there is no such character and all others in the range
0x80..0xff for the UTF-8 locale, it keeps ASCII only in the single byte
range because our internal wchar_t representation for UTF-8 is UCS-4).
Example 2: for all wide character locales isalpha(arg) when arg > 0xFF may
return false positives (must be 0).
(because iswalpha() and isalpha() are the same code)
This change address this issue separating single byte and wide ctype
and also fix iswascii() (currently iswascii() is broken for
arguments > 0xFF).
This change is 100% binary compatible with old binaries.
Reviewied by: i18n@
. Replace inclusion of sys/param.h to sys/cdefs.h and sys/types.h where
appropriate.
. move _*_init() prototypes to mblocal.h, and remove these prototypes
from .c files
. use _none_init() in __setrunelocale() instead of duplicating code
. move __mb* variables from table.c to none.c allowing us to not to
export _none_*() externs, and appropriately remove them from mblocal.h
Ok'ed by: tjr
with ``__'' to avoid polluting the namespace. This doesn't change the
documented rune interface at all, but breaks applications that accessed
_RuneLocale directly.
multibyte representation in conversion state objects, store the
accumulated wide character, set number and number of bytes remaining
to avoid having to derive them every time mbrtowc() is called.
mbrtowc() and wcrtomb() directly. GB18030, GBK and UTF2 are left
unconverted; GB18030 will be done eventually, but GBK and UTF2 may just
be removed, as they are subsets of GB18030 and UTF-8 respectively.
currently cached data. It allows a number of nice things, like: removing
fallback code from single locale loading, remove memory leak when LC_CTYPE
data loaded again and again, efficient cache use, not only for
setlocale(locale1); setlocale(locale1), but for setlocale(locale1);
setlocale("C"); setlocale(locale1) too (i.e. data file loaded only once).
Satoshi NIIMI-san kindly explained that EUC does not limit the byte length to
any arbitrary number.
We now set the limit to the maximum octet length of the codeset and it is
locale-specific.
Submitted by: Yong-Jhen Hong <winard@ms11.url.com.tw>