Commit Graph

18 Commits

Author SHA1 Message Date
pfg
aa4f79bd1b citrus: Avoid invalid code points.
From the OpenBSD log:
The UTF-8 decoder should not accept byte sequences which decode to unicode
code positions U+D800 to U+DFFF (UTF-16 surrogates), U+FFFE, and U+FFFF.

http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
http://unicode.org/faq/utf_bom.html#utf8-4

Reported by:	Stefan Sperling
Obtained from:	OpenBSD
MFC after:	5 days
2014-04-29 15:25:57 +00:00
theraven
0f6ef690b3 Implement xlocale APIs from Darwin, mainly for use by libc++. This adds a
load of _l suffixed versions of various standard library functions that use
the global locale, making them take an explicit locale parameter.  Also
adds support for per-thread locales.  This work was funded by the FreeBSD
Foundation.

Please test any code you have that uses the C standard locale functions!

Reviewed by:    das (gdtoa changes)
Approved by:    dim (mentor)
2011-11-20 14:45:42 +00:00
ache
c44dc0d639 Add comment explaining __mb_sb_limit trick here. 2007-10-15 09:51:30 +00:00
ache
a5038f060d The problem is: currently our single byte ctype(3) functions are broken
for wide characters locales in the argument range >= 0x80 - they may
return false positives.

Example 1: for UTF-8 locale we currently have:
iswspace(0xA0)==1 and isspace(0xA0)==1
(because iswspace() and isspace() are the same code)
but must have
iswspace(0xA0)==1 and isspace(0xA0)==0
(because there is no such character and all others in the range
0x80..0xff for the UTF-8 locale, it keeps ASCII only in the single byte
range because our internal wchar_t representation for UTF-8 is UCS-4).

Example 2: for all wide character locales isalpha(arg) when arg > 0xFF may
return false positives (must be 0).
(because iswalpha() and isalpha() are the same code)

This change address this issue separating single byte and wide ctype
and also fix iswascii() (currently iswascii() is broken for
arguments > 0xFF).
This change is 100% binary compatible with old binaries.

Reviewied by: i18n@
2007-10-13 16:28:22 +00:00
trhodes
dba9e095d4 Fix a bug where, for 6-byte sequences, the top 6 bits get compared to
111111 rather than the top 7 bits being compared against 1111110 causing
illegal bytes fe and ff being treated the same as legal bytes fc and fd.
2006-03-30 09:04:12 +00:00
phantom
23d961a13f . Static'ize functions exported via function reference variables only.
. Replace inclusion of sys/param.h to sys/cdefs.h and sys/types.h where
  appropriate.
. move _*_init() prototypes to mblocal.h, and remove these prototypes
  from .c files
. use _none_init() in __setrunelocale() instead of duplicating code
. move __mb* variables from table.c to none.c allowing us to not to
  export _none_*() externs, and appropriately remove them from mblocal.h

Ok'ed by:	tjr
2005-02-27 15:11:09 +00:00
stefanf
d5719e89ef Fix comparisons that test if an unsigned value is < 0.
Reviewed by:	tjr
2005-02-12 08:45:12 +00:00
tjr
b9fa8ef024 Add UTF-8-specific implementations of mbsnrtowcs() and wcsnrtombs().
These convert plain ASCII characters in-line, making them only slightly
slower than the single-byte ("NONE" encoding) version when processing
ASCII strings.
2004-07-27 06:29:48 +00:00
tjr
0bea5c0108 Add fast paths for conversion of plain ASCII characters. 2004-07-09 15:46:06 +00:00
tjr
aee8349a0c Use conversion state objects to store the accumulated wide character,
low bound, and the number of bytes remaining instead of storing the
raw byte sequence and deriving them every time mbrtowc() is called.
This is much faster -- about twice as fast in some crude benchmarks.
2004-05-17 12:32:40 +00:00
tjr
e3f042f4af Move prototypes of various encoding-related functions into a new header
file to avoid extern'ing them all over the place.
2004-05-12 14:09:04 +00:00
tjr
8f8a2ad179 Perform some basic validation of multibyte conversion state objects. 2004-04-12 13:09:18 +00:00
tjr
17077e5ae6 Don't cast away const qualifiers.
Spotted by:	bde
2004-04-10 00:27:52 +00:00
tjr
54a18fa1d6 Allow partial multibyte characters to accumulate in conversion state
objects passed to mbrtowc(), mbsrtowcs(), and mbrlen(), as required
by C99.
2004-04-07 10:48:19 +00:00
tjr
042225384f Fix a typo that caused mbrtowc() to always return 0. 2003-11-11 07:25:05 +00:00
tjr
1c3a3f7e26 Convert the Big5, EUC, MSKanji and UTF-8 encoding methods to implement
mbrtowc() and wcrtomb() directly. GB18030, GBK and UTF2 are left
unconverted; GB18030 will be done eventually, but GBK and UTF2 may just
be removed, as they are subsets of GB18030 and UTF-8 respectively.
2003-11-02 10:09:33 +00:00
nectar
e369901c4d Whack 28 unused variables. 2003-02-18 13:39:52 +00:00
tjr
d62abf19a3 Add a UTF-8 encoding method, which will eventually replace the antique
"UTF2" method. Although UTF-8 and the old UTF2 encoding are compatible
for 16-bit characters, the new UTF-8 implementation is much more strict
about rejecting malformed input and also handles the full 31 bit range
of characters.
2002-10-10 22:56:18 +00:00