instead of 32+32+15+1) on all arches that have such long doubles (amd64,
ia64 and i386). Large objects should be be accessed in large units,
and the 32+32+15+1[+padding] decomposition asks for almost the opposite
of that, sometimes resulting in very slow accesses depending on how
well the compiler ignores what we ask for and converts to the best
units for the given machine. E.g., on Athlons, there is a 10-20 cycle
penalty for accessing the middle 32-bit word immediately after an
80-bit store.
Whether actually using the alternative view is better is very machine-
dependent. A 32+32+16 view is probably best with old 32-bit systems
and gcc through 4.2.1. The compiler should mostly avoid the view and
generate best accesses, but gcc-4.2.1 is far from doing that. I think
64+16 is best for now. Similarly for doubles -- they should be using
64+0 especially on 64-bit machines, but fdlibm uses 32+32 extensively
for them. Fortunately, in 64-bit mode for doubles, gcc already ignores
the 32+32-bit view and generates best accesses in many cases.
into slowsort for some sequences because different parts of the
code used 'r' to store two different things, one of which was
signed. Clean things up by splitting 'r' into two variables, and
use a more meaningful name.
implement shm_open(2) and shm_unlink(2) in the kernel:
- Each shared memory file descriptor is associated with a swap-backed vm
object which provides the backing store. Each descriptor starts off with
a size of zero, but the size can be altered via ftruncate(2). The shared
memory file descriptors also support fstat(2). read(2), write(2),
ioctl(2), select(2), poll(2), and kevent(2) are not supported on shared
memory file descriptors.
- shm_open(2) and shm_unlink(2) are now implemented as system calls that
manage shared memory file descriptors. The virtual namespace that maps
pathnames to shared memory file descriptors is implemented as a hash
table where the hash key is generated via the 32-bit Fowler/Noll/Vo hash
of the pathname.
- As an extension, the constant 'SHM_ANON' may be specified in place of the
path argument to shm_open(2). In this case, an unnamed shared memory
file descriptor will be created similar to the IPC_PRIVATE key for
shmget(2). Note that the shared memory object can still be shared among
processes by sharing the file descriptor via fork(2) or sendmsg(2), but
it is unnamed. This effectively serves to implement the getmemfd() idea
bandied about the lists several times over the years.
- The backing store for shared memory file descriptors are garbage
collected when they are not referenced by any open file descriptors or
the shm_open(2) virtual namespace.
Submitted by: dillon, peter (previous versions)
Submitted by: rwatson (I based this on his version)
Reviewed by: alc (suggested converting getmemfd() to shm_open())
default. This has the disadvantage of rendering the datasize resource
limit irrelevant, but without this change, legitimate uses of more
memory than will fit in the data segment are thwarted by default.
Fix chunk_alloc_mmap() to work correctly if initial mapping is not
chunk-aligned and mapping extension fails.
Clean up DSS-related locking and protect all pertinent variables with
dss_mtx (remove dss_chunks_mtx). This fixes race conditions that could
cause chunk leaks.
Reported by: [1] kris
This is a long-standing bug, but until recent changes it was difficult
to trigger, and even then its impact was non-catastrophic, with the
exception of revision 1.157.
Optimize chunk_alloc_mmap() to avoid the need for unmapping pages in the
common case. Thanks go to Kris Kennaway for a patch that inspired this
change.
Do not maintain a record of previously mmap'ed chunk address ranges.
The original intent was to avoid the extra system call overhead in
chunk_alloc_mmap(), which is no longer a concern. This also allows some
simplifications for the tree of unused DSS chunks.
Introduce huge_mtx and dss_chunks_mtx to replace chunks_mtx. There was
no compelling reason to use the same mutex for these disjoint purposes.
Avoid memset() for huge allocations when possible.
Maintain two trees instead of one for tracking unused DSS address
ranges. This allows scalable allocation of multi-chunk huge objects in
the DSS. Previously, multi-chunk huge allocation requests failed if the
DSS could not be extended.
order to support re-use of multi-chunk unused regions within the DSS for
huge allocations. This generalization is important to correct function
when mmap-based allocation is disabled.
Avoid zeroing re-used memory in the DSS unless it really needs to be
zeroed.
memory is acquired from the system via sbrk(2) and/or mmap(2). By default,
use sbrk(2) only, in order to support traditional use of resource limits.
Additionally, when both options are enabled, prefer the data segment to
anonymous mappings, in order to coexist better with large file mappings
in applications on 32-bit platforms. This change has the potential to
increase memory fragmentation due to the linear nature of the data
segment, but from a performance perspective this is mitigated by the use
of madvise(2). [1]
Add the ability to interpret integer prefixes in MALLOC_OPTIONS
processing. For example, MALLOC_OPTIONS=lllllllll can now be specified as
MALLOC_OPTIONS=9l.
Reported by: [1] rwatson
Design review: [1] alc, peter, rwatson
- Use PTY* for all pty(4) related constants.
- Use PTMX* for all pts(4) related constants.
- Consistently use _PATH_DEV PTMX rather than "/dev/ptmx".
- Revert 1.7 and properly fix it by using the correct prefix string for
pts(4) masters.
MFC after: 3 days
my original implementation made both use the same code. Unfortunately,
this meant libm depended on a vendor header at compile time and previously-
unexposed vendor bits in libc at runtime.
Hence, I just wrote my own version of the relevant vendor routine. As it
turns out, mine has a factor of 8 fewer of lines of code, and is a bit more
readable anyway. The strtod() and *scanf() routines still use vendor code.
Reviewed by: bde
calculating run sizes. Use of the floating point unit was a potential
pessimization to context switching for applications that do not otherwise
use floating point math. [1]
Reformat cpp macro-related comments to improve consistency.
Submitted by: das
deallocation and dynamic load balancing via the MALLOC_LAZY_FREE and
MALLOC_BALANCE knobs. This is a non-functional change, since these
features are still enabled when possible.
Clean up a few things that more pedantic compiler settings would cause
complaints over.
adds two new directories in msun: ld80 and ld128. These are for
long double functions specific to the 80-bit long double format
used on x86-derived architectures, and the 128-bit format used on
sparc64, respectively.
when particular function can't be found in nsswitch-module. For
example, getgrouplist(3) will use module-supplied 'getgroupmembership'
function (which can work in an optimal way for such source as LDAP) and
will fall back to the stanard iterate-through-all-groups implementation
otherwise.
PR: ports/114655
Submitted by: Michael Hanselmann <freebsd AT hansmi DOT ch>
Reviewed by: brooks (mentor)
is seems to be a problem for SUID applications, which we like to
prevent as much as possible.
PR: docs/39530
Submitted by: Soren Spies <sspies at apple dot com>
MFC After: 3 days
contention. The intent is to dynamically adjust to load imbalances, which
can cause severe contention.
Use pthread mutexes where possible instead of libc "spinlocks" (they aren't
actually spin locks). Conceptually, this change is meant only to support
the dynamic load balancing code by enabling the use of spin locks, but it
has the added apparent benefit of substantially improving performance due to
reduced context switches when there is moderate arena lock contention.
Proper tuning parameter configuration for this change is a finicky business,
and it is very much machine-dependent. One seemingly promising solution
would be to run a tuning program during operating system installation that
computes appropriate settings for load balancing. (The pthreads adaptive
spin locks should probably be similarly tuned.)
vector of slots for lazily freed objects. For each deallocation, before
doing the hard work of locking the arena and deallocating, try several times
to randomly insert the object into the vector using atomic operations.
This approach is particularly effective at reducing contention for
multi-threaded applications that use the producer-consumer model, wherein
one producer thread allocates objects, then multiple consumer threads
deallocate those objects.