The regex code does not work with multibyte codesets like UTF-8.

In fact, it doesn't even work with single-byte codesets like ISO-8859-1. The comparison blows up at index 128 (the range is 0 to UCHAR_MAX, i.e. 255). As a temporary workaround, all comparisons are done in the C locale regardless of the environment setting. The regex library needs to be updated to handle all codesets.

Obtained from: Dragonfly
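
For illustration only, the idea of pinning a comparison to the C locale regardless of the environment can be sketched with the public POSIX.1-2008 locale API. This is not the code in this commit, which reaches into libc's internal `__xlocale_C_locale` object directly. The helper name `collate_in_c_locale` is hypothetical, and the `strcmp` fallback assumes (as POSIX specifies) that collation in the C locale is plain byte order.

```c
#define _POSIX_C_SOURCE 200809L
#include <locale.h>
#include <string.h>
#ifdef __APPLE__
#include <xlocale.h>	/* strcoll_l() is declared here on macOS */
#endif

/*
 * Hypothetical helper, not part of libc: compare two strings using the
 * collation rules of the C locale, ignoring whatever LC_ALL or LC_COLLATE
 * the process has selected with setlocale().
 */
static int
collate_in_c_locale(const char *a, const char *b)
{
	locale_t c_loc;
	int r;

	/* Build a locale object whose collation category is "C". */
	c_loc = newlocale(LC_COLLATE_MASK, "C", (locale_t)0);
	if (c_loc == (locale_t)0) {
		/* Assumed fallback: C-locale collation is byte order. */
		return (strcmp(a, b));
	}

	/* Collate with the explicit locale, not the global one. */
	r = strcoll_l(a, b, c_loc);
	freelocale(c_loc);
	return (r);
}
```

With this sketch, `collate_in_c_locale("B", "a")` is negative even after `setlocale(LC_ALL, "")` has picked up a locale in which "a" would sort before "B", because the C locale orders by byte value.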
2015-08-08 19:29:59 +00:00 · 1cd79ad208
commit 1cd79ad208
parent 1fbe2addc6
1 changed file with 9 additions and 1 deletion
--- a/lib/libc/regex/regcomp.c
+++ b/lib/libc/regex/regcomp.c
@@ -38,6 +38,13 @@
 *	@(#)regcomp.c	8.5 (Berkeley) 3/20/94
 */

+/*
+ * This implementation currently only works with C locale
+ * It's definitely limited by UCHAR_MAX, but not even ISO-8859 charsets
+ * are working.  The forced changing of locale to C for the comparison
+ * is considered a workaround until a better solution is found.
+ */
+
 #if defined(LIBC_SCCS) && !defined(lint)
 static char sccsid[] = "@(#)regcomp.c	8.5 (Berkeley) 3/20/94";
 #endif /* LIBC_SCCS and not lint */
@@ -768,8 +775,9 @@ p_b_term(struct parse *p, cset *cs)
 	char c;
 	wint_t start, finish;
 	wint_t i;
+	locale_t loc = &__xlocale_C_locale;  /* see note under license */
 	struct xlocale_collate *table =
-		(struct xlocale_collate*)__get_locale()->components[XLC_COLLATE];
+		(struct xlocale_collate*)loc->components[XLC_COLLATE];

 	/* classify what we've got */
 	switch ((MORE()) ? PEEK() : '\0') {