Previously the code to read from a local file or stdin was sperarated
After the change to remove the home made line reader used for stdin
(replaced by getdelim) it apprears that the rest of the code which is
used to read from any FILE * but stdin can benefit from the exact same
change.
The previous code had bug when reading lines with an unexpected
encoding, returning without the full line being captured.
This result in sort complaining with "sort: Illegal byte sequence"
Using getdelim(3) instead of the home made code, fixes the situation.
PR: 241679
Reported by: Ronald F. Guilmette <rfg-freebsd@tristatelogic.com>
MFC After: 1 week
Reviewed by: markj, imp
Differential Revision: https://reviews.freebsd.org/D36948
- Check that catopen() succeeded before calling catclose(). musl will
crash in the latter if the catalogue descriptor is -1.
- Keep the message catalogue open for most of sort(1)'s actual
operation.
- Don't use catgets(3) to print error messages if catopen(3) had failed.
Reviewed by: arichardson, emaste
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34081
bwsrawdata() is supposed to return the string buffer.
PR: 259451
Reported by: sigsys@gmail.com
Fixes: d053fb22f6 ("usr.bin/sort: Avoid UBSan errors")
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
UBSan complains about out-of-bounds accesses for zero-length arrays. To
avoid this we can use flexible array members. However, the C standard does
not allow for structures that only contain flexible array members, so we
move the length parameters into that structure too.
Split out from D28233.
Reviewed By: markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D31009
This results in a significant improvement in the runtime of sort(1) when
radix sort cannot be used. This comes at the expense of increased
memory usage, but this is small relative to sort's overall memory usage.
PR: 255551
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30319
Every usage of MB_CUR_MAX results in a call to __mb_cur_max. This is
inefficient and redundant. Caching the value of MB_CUR_MAX in a global
variable removes these calls and speeds up the runtime of sort. For
numeric sorting, runtime is almost halved in some tests.
PR: 255551
PR: 255840
Reviewed by: markj
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30170
Sort(1)'s radixsort implementation was broken for multibyte LC_CTYPEs in at
least two ways:
* In actual radix sort, it would only bucket the least significant
byte from each wchar, ignoring the 24 most-significant bits of each
unicode character.
* In degenerate cases / "fast paths," it would fall back to another
sorting algorithm (default: mergesort) with a bogus comparator
offset. The string comparison functions in sort(1) take an offset
in units of the operating character size. However, radixsort was
passing an offset in units of bytes. The byte offset must be
divided by sizeof(wchar_t).
This revision addresses both discovered issues.
Some example testcases:
$ (echo 耳 ; echo 脳 ; echo 耳) | \
LC_CTYPE=ja_JP.UTF-8 LC_COLLATE=C LANG=C sort --radixsort --debug
$ (echo 耳 ; echo 脳 ; echo 耳) | \
LC_CTYPE=C LC_COLLATE=C LANG=C sort --radixsort --debug
$ (for i in $(jot 34); do echo 耳耳耳耳耳; echo 耳耳耳耳脳; echo 耳耳耳耳脴; done) | \
LC_CTYPE=ja_JP.UTF-8 LC_COLLATE=C LANG=C sort --radixsort --debug
PR: 247494
Reported by: knu
MFC after: I do not intend to, but parties interested in stable might want to
Update a bunch of Makefile.depend files as
a result of adding Makefile.depend.options files
Reviewed by: bdrewery
MFC after: 1 week
Sponsored by: Juniper Networks
Differential Revision: https://reviews.freebsd.org/D22494
Leaf directories that have dependencies impacted
by options need a Makefile.depend.options file
to avoid churn in Makefile.depend
DIRDEPS for cases such as OPENSSL, TCP_WRAPPERS etc
can be set in local.dirdeps-options.mk
which can add to those set in Makefile.depend.options
See share/mk/dirdeps-options.mk
Reviewed by: bdrewery
MFC after: 1 week
Sponsored by: Juniper Networks
Differential Revision: https://reviews.freebsd.org/D22469
Experimentally, reduces sort -R time of a 148160 line corpus from about
3.15s to about 0.93s on this particular system.
There's probably room for improvement using some digest other than md5, but
I don't want to look at sort(1) anymore. Some discussion of other possible
improvements in the Test Plan section of the Differential.
PR: 230792
Reviewed by: jhb (earlier version)
Differential Revision: https://reviews.freebsd.org/D19885
Bound input file processing length to avoid the issue reported in [1]. For
simplicity, only allow regular file and character device inputs. For
character devices, only allow /dev/random (and /dev/urandom symblink).
32 bytes of random is perfectly sufficient to seed MD5; we don't need any
more. Users that want to use large files as seeds are encouraged to truncate
those files down to an appropriate input file via tools like sha256(1).
(This does not change the sort algorithm of sort -R.)
[1]: https://lists.freebsd.org/pipermail/freebsd-hackers/2018-August/053152.html
PR: 230792
Reported by: Ali Abdallah <aliovx AT gmail.com>
Relnotes: yes
There's no reason to order based on strcmp of ASCII digests instead of
memcmp of the raw digests.
While here, remove collision fallback. If you collide two MD5s, they're
probably the same string anyway. If robustness against MD5 collisions is
desired, maybe we shouldn't use MD5.
None of the behavior of sort -R is specified by POSIX, so we're free to
implement this however we like. E.g., using a 128-bit counter and block cipher
to generate unique indices for each line of input.
PR: 230792 (2/many)
Relnotes: This will change the sort order for a given dataset with a
given seed. Other similarly breaking changes are planned.
Sponsored by: Dell EMC Isilon
Observe:
printf "a\nb\nc\n" > /tmp/foo
# Next command results in no output
cat /tmp/foo | sort -m
# Next command results in proper output
cat /tmp/foo | sort -m -
# Also works:
sort -m /tmp/foo
Some const'ification was done to simplify the actual solution of adding "-"
explicitly to the file list if we didn't have any file arguments left over.
PR: 190099
MFC after: 1 week
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
No functional change intended.
Worker threads now use a pthread_cond_t to wait for work instead of
burning the cpu up.
Obtained from: DragonflyBSD (07774aea0ccf64a48fcfad8899e3bf7c8f18277a)
MFC after: 2 weeks
zero-length array are dynamically sized at run-time based on the use
of hints, compilers can't be expected to figure out these offsets on
their own. [1]
- Fix incorrect comparison in cmp_nans(). [2]
PR: 204571 [1], 202301 [2]
Submitted by: David Binderman [2]
MFC after: 3 days