arena_dalloc_lazy_hard() was split out of arena_dalloc_lazy() in revision
1.162.
Reduce thundering herd problems in lazy deallocation by randomly varying
how many probes a thread does before taking the slow path.
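A minimal sketch of the idea (helper names hypothetical; the
committed jemalloc code differs):

  #include <stdbool.h>
  #include <stdlib.h>

  #define PROBE_BASE   4
  #define PROBE_JITTER 4       /* probe count varies over [4, 8) */

  static bool
  try_lazy_dalloc(void *ptr)   /* placeholder lock-free fast path */
  {
          (void)ptr;
          return (false);
  }

  static void
  dalloc_slow(void *ptr)       /* placeholder locked slow path */
  {
          free(ptr);
  }

  void
  dalloc_lazy(void *ptr)
  {
          /* A per-thread PRNG would be used in practice. */
          int nprobes = PROBE_BASE + rand() % PROBE_JITTER;

          while (nprobes-- > 0) {
                  if (try_lazy_dalloc(ptr))
                          return;
          }
          dalloc_slow(ptr);
  }

Because threads draw different probe counts, they give up on the fast
path at different times instead of stampeding the lock all at once.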
assumptions about whether bits are set at various times. This makes
adding other flags safe.
Reorganize functions in order to inline i{m,c,p,s,re}alloc(). This
allows the entire fast-path call chains for malloc() and free() to be
inlined. [1]
Suggested by: [1] Stuart Parmenter <stuart@mozilla.com>
exponent bits of the reduced result, construct 2**k (hopefully in
parallel with the construction of the reduced result) and multiply by
it. This tends to be much faster if the construction of 2**k is
actually in parallel, and might be faster even with no parallelism
since adjustment of the exponent requires a read-modify-write at an
unfortunate time for pipelines.
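A hedged sketch of the 2**k construction for double precision
(illustrative, not the actual libm code):

  #include <stdint.h>
  #include <string.h>

  /* Build 2**k directly from its IEEE-754 bit pattern; assumes k is
   * in the normal range, -1022 <= k <= 1023. */
  static double
  exp2k(int k)
  {
          uint64_t bits = (uint64_t)(k + 1023) << 52; /* biased exponent */
          double twopk;

          memcpy(&twopk, &bits, sizeof(twopk));
          return (twopk);
  }

  /* ... then "return (r * exp2k(k));" instead of patching the
   * exponent bits of r after r has been computed. */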
In some cases involving exp2* on amd64 (A64), this change saves about
40 cycles or 30%. I think it is inherently only about 12 cycles faster
in these cases and the rest of the speedup is from partly-accidentally
avoiding compiler pessimizations (the construction of 2**k is now
manually scheduled for good results, and -O2 doesn't always mess this
up). In most cases on amd64 (A64) and i386 (A64) the speedup is about
20 cycles. The worst case that I found is expf on ia64 where this
change is a pessimization of about 10 cycles or 5%. The manual
scheduling for plain exp[f] is harder and not as tuned.
Details specific to expm1*:
- the saving is closer to 12 cycles than to 40 for expm1* on i386 (A64).
For some reason it is much larger for negative args.
- also convert to __FBSDID().
The change to ld128/s_exp2l.c has not been tested.
that is specialized for float precision. The new polynomial has degree
5 instead of 11, and a maximum error of 2**-27.74 ulps instead
of 2**-30.64. This doesn't affect the final error significantly; the
maximum error was and is about 0.9101 ulps on amd64 -O1 and the number
of cases with an error of > 0.5 ulps is actually reduced by epsilon
despite the larger error in the polynomial.
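For illustration, such a degree-5 kernel is a single Horner chain in
float precision; the coefficients below are Taylor-series stand-ins,
not the committed minimax coefficients:

  /* exp(x) ~= poly5(x) near 0 (illustrative coefficients only). */
  static float
  poly5(float x)
  {
          static const float C2 = 1.0F / 2, C3 = 1.0F / 6,
              C4 = 1.0F / 24, C5 = 1.0F / 120;

          return (1.0F + x * (1.0F + x * (C2 + x * (C3 +
              x * (C4 + x * C5)))));
  }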
This is about 15% faster on amd64 (A64), i386 (A64) and ia64. The asm
version is still used instead of this on i386 since it is faster and
more accurate.
threshold, according to the 'F' MALLOC_OPTIONS flag. This obsoletes the
'H' flag.
Try to realloc() large objects in place. This substantially speeds up
incremental large reallocations in the common case.
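A hedged sketch of the strategy (helper names hypothetical):

  #include <stdbool.h>
  #include <string.h>

  bool try_resize_in_place(void *ptr, size_t newsize); /* hypothetical */
  void *xmalloc(size_t size);                          /* hypothetical */
  void xfree(void *ptr);                               /* hypothetical */

  void *
  xrealloc(void *ptr, size_t oldsize, size_t newsize)
  {
          void *ret;

          /* Fast path: grow or shrink without moving the object. */
          if (try_resize_in_place(ptr, newsize))
                  return (ptr);
          /* Slow path: allocate, copy, free. */
          ret = xmalloc(newsize);
          memcpy(ret, ptr, oldsize < newsize ? oldsize : newsize);
          xfree(ptr);
          return (ret);
  }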
Fix a bug in arena_ralloc() that caused relocation of sub-page objects
even if the old and new sizes were in the same size class.
Maintain trees of runs and simplify the per-chunk page map. This allows
logarithmic-time searching for sufficiently large runs in
arena_run_alloc(), whereas the previous algorithm required linear time
in the worst case.
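A sketch of the logarithmic best-fit lookup using the <sys/tree.h>
red-black tree macros; the committed code differs in details such as
tie-breaking between equal-sized runs:

  #include <sys/tree.h>
  #include <stddef.h>

  struct run {
          RB_ENTRY(run) link;
          size_t size;            /* size of the free run */
  };

  static int
  run_cmp(struct run *a, struct run *b)
  {
          return ((a->size > b->size) - (a->size < b->size));
  }

  RB_HEAD(run_tree, run);
  RB_GENERATE_STATIC(run_tree, run, link, run_cmp)

  /* Smallest run that satisfies the request, in O(log n). */
  static struct run *
  run_best_fit(struct run_tree *avail, size_t size)
  {
          struct run key;

          key.size = size;
          return (RB_NFIND(run_tree, avail, &key));
  }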
Break various large functions into smaller sub-functions, and inline
only the functions that are in the fast path for small object
allocation/deallocation.
Remove an unnecessary check in base_pages_alloc_mmap().
Avoid integer division in choose_arena() for the NO_TLS case on
single-CPU systems.
the semantics of pthread_mutex_islocked_np() to return true if and only if
the mutex is held by the current thread.
Obviously, change the regression test to match.
MFC after: 2 weeks
locked. This is intended primarily to support the userland equivalent
of the various *_ASSERT_LOCKED() macros we have in the kernel.
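For example, a userland assertion macro built on these semantics
might look like this (macro name hypothetical):

  #include <assert.h>
  #include <pthread.h>

  /* Userland analogue of the kernel's *_ASSERT_LOCKED() macros:
   * fails unless the current thread holds the mutex. */
  #define MUTEX_ASSERT_OWNED(m) assert(pthread_mutex_islocked_np(m))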
MFC after: 2 weeks
referencing the file's VM pages are returned from the network stack,
making changes to the file safe.
This flag does not guarantee that the data has been transmitted to the
other end.
use. If it is in use, use the watched request; otherwise use the
lockuser's own request. Only allocate a lockuser request if both
requests are null.
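A hedged sketch of that selection order (struct layout and helper
names are illustrative, not the library's actual identifiers):

  struct lockreq;                          /* illustrative */
  struct lockuser {
          struct lockreq *lu_watchreq;     /* illustrative fields */
          struct lockreq *lu_myreq;
  };
  int lockreq_in_use(struct lockreq *);    /* hypothetical */
  struct lockreq *lockreq_alloc(void);     /* hypothetical */

  static struct lockreq *
  select_req(struct lockuser *lu)
  {
          if (lu->lu_watchreq != NULL && lockreq_in_use(lu->lu_watchreq))
                  return (lu->lu_watchreq);
          if (lu->lu_myreq == NULL && lu->lu_watchreq == NULL)
                  lu->lu_myreq = lockreq_alloc();  /* both were null */
          return (lu->lu_myreq);
  }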
PR: 119920
Tested by (6.x): Landon Fuller <landonf -at- bikemonkey -dot- org>
prerequisite for using this interface. However, the 'statinfo' struct
actually references CPUSTATES from <sys/resource.h>, so in fact it
requires <sys/resource.h> to compile. Use a nested include of
<sys/resource.h> to make the code match the docs.
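Illustrative shape of the fix; the struct is a stand-in for
'statinfo' and only the nested include matters:

  #include <sys/resource.h>        /* for CPUSTATES, used below */

  struct statinfo_example {        /* stand-in for the real struct */
          long cp_time[CPUSTATES];
  };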
Reported by: Pietro Cerutti gahr | gahr.ch
global header if nothing else has been written before the closing of
the archive. This changes the behaviour when creating archives without
members: instead of generating a 0-size archive file, an archive with
just the global header (8 bytes in total) will be created. Such an
archive is valid by libarchive's definition, so subsequent operations
on it will be accepted. In particular, this fixes the failure caused
by the following sequence (used by several ports):
% ar cru libfoo.a # without specifying obj files
% ranlib libfoo.a
Reviewed by: kientzle, jkoshy
Approved by: kientzle
Approved by: jkoshy (mentor)
Reported by: erwin
MFC after: 1 month
obey or ignore the size field on a hardlink entry. In particular,
if we're reading a non-POSIX archive, we should always ignore
the size field.
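The decision reduces to something like this hedged sketch (not
libarchive's actual code):

  #include <stdint.h>

  /* Honour the size field of a hardlink entry only for POSIX-format
   * archives; for non-POSIX archives, always treat the body as empty. */
  static int64_t
  hardlink_body_size(int is_posix_format, int64_t size_field)
  {
          return (is_posix_format ? size_field : 0);
  }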
This should fix both the audio/xmcd port and the math/unixstat port.
Thanks to: Pav Lucistnik for pointing these two ports out to me.
MFC after: 7 days
fields in FTS and FTSENT structs being too narrow. In addition,
the narrow types creep from there into fts.c. As a result, fts(3)
consumers, e.g., find(1) or rm(1), can't handle file trees an ordinary
user can create, which can have security implications.
To fix the historic implementation of fts(3), OpenBSD and NetBSD
have already changed <fts.h> in somewhat incompatible ways, so we
are free to do so, too. This change is a superset of changes from
the other BSDs with a few more improvements. It doesn't touch
fts(3) functionality; it just extends integer types used by it to
match modern reality and the C standard.
Here are its points (an illustrative sketch of the resulting FTSENT
field types follows the list):
o For C object sizes, use size_t unless it's 100% certain that
the object will be really small. (Note that fts(3) can construct
pathnames _much_ longer than PATH_MAX for its consumers.)
o Avoid the short types because on modern platforms using them
results in larger and slower code. Change shorts to ints as
follows:
- For variables that count simple, limited things like states,
use plain vanilla `int' as it's the type of choice in C.
- For a limited number of bit flags use `unsigned' because signed
bit-wise operations are implementation-defined, i.e., unportable,
in C.
o For things that should be at least 64 bits wide, use long long
and not int64_t, as the latter is an optional type. See
FTSENT.fts_number aka FTS.fts_bignum. Extending fts_number `to
satisfy future needs' is pointless because there is fts_pointer,
which can be used to link to arbitrary data from an FTSENT.
However, there already are fts(3) consumers that require fts_number
(or fts_bignum) to have at least 64 bits, so we must allow for them.
o For the tree depth, use `long'. This is a trade-off between making
this field too wide and allowing for 64-bit inode numbers and/or
chain-mounted filesystems. On the one hand, `long' is almost
enough for 32-bit filesystems on a 32-bit platform (our ino_t is
uint32_t now). On the other hand, platforms with a 64-bit (or
wider) `long' will be ready for 64-bit inode numbers, as well as
for several 32-bit filesystems mounted one under another. Note
that fts_level has to be signed because -1 is a magic value for it,
FTS_ROOTPARENTLEVEL.
o For the `nlinks' local var in fts_build(), use `long'. The logic
in fts_build() requires that `nlinks' be signed, but our nlink_t
currently is uint16_t. Therefore let's make the signed var wide
enough to be able to represent 2^16-1 in pure C99, and even 2^32-1
on a 64-bit platform. Perhaps the logic should be changed just
to use nlink_t, but it can be done later w/o breaking fts(3) ABI
any more because `nlinks' is just a local var.
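An illustrative sketch of the resulting FTSENT field types (not the
verbatim <fts.h> declarations):

  #include <stddef.h>

  struct ftsent_example {         /* stand-in for FTSENT */
          size_t    fts_pathlen;  /* strlen(fts_path); was u_short */
          long long fts_number;   /* numeric value; at least 64 bits */
          void     *fts_pointer;  /* local address value */
          long      fts_level;    /* depth; signed, -1 is magic */
          unsigned  fts_flags;    /* bit flags; unsigned on purpose */
  };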
This commit also includes supporting stuff for the fts change:
o Preserve the old versions of fts(3) functions through libc symbol
versioning because the old versions appeared in all our former releases.
o Bump __FreeBSD_version just in case. There is a small chance that
some ill-written third-party apps may fail to build or work correctly
if compiled after this change.
o Update the fts(3) manpage accordingly. In particular, remove
references to fts_bignum, which was a FreeBSD-specific hack to work
around the too narrow types of FTSENT members. Now fts_number is
at least 64 bits wide (long long) and fts_bignum is an undocumented
alias for fts_number kept around for compatibility reasons. According
to Google Code Search, the only big consumers of fts_bignum are in
our own source tree, so they can be fixed easily to use fts_number.
o Mention the change in src/UPDATING.
PR: bin/104458
Approved by: re (quite a while ago)
Discussed with: deischen (the symbol versioning part)
Reviewed by: -arch (mostly silence); das (generally OK, but we didn't
agree on some types used; assuming that no objections on
-arch led me to stick to my opinion)
Even though I believe this is a good change, it does
have the potential to break certain clients, so it's
good to document the reasoning behind the change.
cases which are used mainly by regression tests.
As usual, the cutoff for tiny args was not correctly translated to
float precision. It was 2**-54 but 2**-24 works. It must be about
2**-precision, since the error from approximating log(1+x) by x is
about the same as |x|. Exhaustive testing shows that 2**-24 gives
perfect rounding in round-to-nearest mode.
Similarly for the cutoff for being small, except this is not used by
so many other functions. It was 2**-29 but 2**-15 works. It must be
a bit smaller than sqrt(2**-precision), since the error from
approximating log(1+x) by x-x*x/2 is about the same as x*x. Exhaustive
testing shows that 2**-15 gives a maximum error of 0.5052 ulps in
round-to-nearest-mode. The algorithm for the general case is only good
for 0.8388 ulps, so this is sufficient (but it loses slightly on
i386, where extra precision gives 0.5032 ulps for the general case).
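In code, the two cutoffs amount to the following hedged skeleton; the
real float-precision implementation has more structure:

  #include <math.h>

  float
  log1pf_sketch(float x)
  {
          if (fabsf(x) < 0x1p-24F)  /* tiny: log(1+x) ~= x */
                  return (x);
          if (fabsf(x) < 0x1p-15F)  /* small: log(1+x) ~= x - x*x/2 */
                  return (x - x * x * 0.5F);
          return (log1pf(x));       /* general-case algorithm */
  }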
While investigating this, I noticed that optimizing the usual case by
falling into a middle case involving a simple polynomial evaluation
(return x-x*x/2 instead of x here) is not such a good idea since it
gives an enormous pessimization of tinier args on machines for which
denormals are slow. Float x*x/2 is denormal when |x| ~< 2**-64 and
x*x/2 is evaluated in float precision, so it can easily be denormal
for normal x. This is even more interesting for general polynomial
evaluations. Multiplying out large powers of x is normally a good
optimization since it reduces dependencies, but it creates denormals
starting with quite large x.
forget to translate "float" to "double".
ucbtest didn't detect the bug, but exhaustive testing of the float
case relative to the double case eventually did. The bug only affects
args x with |x| ~> 2**19*(pi/2) on non-i386 (i386 is broken in a
different way for large args).
it should never have existed and it has not been used for many years
(floats are reduced faster using doubles). All relevant changes (just
the workaround for broken assignment) have been merged to the double
version.