The resolver in libc creates a kqueue for watching a single file descriptor.
This can be done using poll() which should be lighter on the kernel and
reduce possible problems with rlimits (file descriptors, kqueues).
Reviewed by: jhb
in the future in a backward compatible (API and ABI) way.
The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.
The structure definition looks like this:
struct cap_rights {
uint64_t cr_rights[CAP_RIGHTS_VERSION + 2];
};
The initial CAP_RIGHTS_VERSION is 0.
The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.
The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.
To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.
#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL)
We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:
#define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL)
#define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL)
#define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP)
There is new API to manage the new cap_rights_t structure:
cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
void cap_rights_set(cap_rights_t *rights, ...);
void cap_rights_clear(cap_rights_t *rights, ...);
bool cap_rights_is_set(const cap_rights_t *rights, ...);
bool cap_rights_is_valid(const cap_rights_t *rights);
void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);
Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:
cap_rights_t rights;
cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);
There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:
#define cap_rights_set(rights, ...) \
__cap_rights_set((rights), __VA_ARGS__, 0ULL)
void __cap_rights_set(cap_rights_t *rights, ...);
Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:
cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);
Providing several rights that belongs to the same array's element this way is
correct, but is not advised. It should only be used for aliases definition.
This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.
Sponsored by: The FreeBSD Foundation
This ensures strerror() and friends continue to work correctly even if a
(non-PIE) executable linked against an older libc imports sys_errlist (which
causes sys_errlist to refer to the executable's copy with a size fixed when
that executable was linked).
The executable's use of sys_errlist remains broken because it uses the
current value of sys_nerr and may access past the bounds of the array.
Different from the message "Using sys_errlist from executables is not
ABI-stable" on freebsd-arch, this change does not affect the static library.
There seems no reason to prevent overriding the error messages in the static
library.
- Capability is no longer separate descriptor type. Now every descriptor
has set of its own capability rights.
- The cap_new(2) system call is left, but it is no longer documented and
should not be used in new code.
- The new syscall cap_rights_limit(2) should be used instead of
cap_new(2), which limits capability rights of the given descriptor
without creating a new one.
- The cap_getrights(2) syscall is renamed to cap_rights_get(2).
- If CAP_IOCTL capability right is present we can further reduce allowed
ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
ioctls can be retrived with cap_ioctls_get(2) syscall.
- If CAP_FCNTL capability right is present we can further reduce fcntls
that can be used with the new cap_fcntls_limit(2) syscall and retrive
them with cap_fcntls_get(2).
- To support ioctl and fcntl white-listing the filedesc structure was
heavly modified.
- The audit subsystem, kdump and procstat tools were updated to
recognize new syscalls.
- Capability rights were revised and eventhough I tried hard to provide
backward API and ABI compatibility there are some incompatible changes
that are described in detail below:
CAP_CREATE old behaviour:
- Allow for openat(2)+O_CREAT.
- Allow for linkat(2).
- Allow for symlinkat(2).
CAP_CREATE new behaviour:
- Allow for openat(2)+O_CREAT.
Added CAP_LINKAT:
- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
- Allow to be target for renameat(2).
Added CAP_SYMLINKAT:
- Allow for symlinkat(2).
Removed CAP_DELETE. Old behaviour:
- Allow for unlinkat(2) when removing non-directory object.
- Allow to be source for renameat(2).
Removed CAP_RMDIR. Old behaviour:
- Allow for unlinkat(2) when removing directory.
Added CAP_RENAMEAT:
- Required for source directory for the renameat(2) syscall.
Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
- Allow for unlinkat(2) on any object.
- Required if target of renameat(2) exists and will be removed by this
call.
Removed CAP_MAPEXEC.
CAP_MMAP old behaviour:
- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
PROT_WRITE.
CAP_MMAP new behaviour:
- Allow for mmap(2)+PROT_NONE.
Added CAP_MMAP_R:
- Allow for mmap(PROT_READ).
Added CAP_MMAP_W:
- Allow for mmap(PROT_WRITE).
Added CAP_MMAP_X:
- Allow for mmap(PROT_EXEC).
Added CAP_MMAP_RW:
- Allow for mmap(PROT_READ | PROT_WRITE).
Added CAP_MMAP_RX:
- Allow for mmap(PROT_READ | PROT_EXEC).
Added CAP_MMAP_WX:
- Allow for mmap(PROT_WRITE | PROT_EXEC).
Added CAP_MMAP_RWX:
- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).
Renamed CAP_MKDIR to CAP_MKDIRAT.
Renamed CAP_MKFIFO to CAP_MKFIFOAT.
Renamed CAP_MKNODE to CAP_MKNODEAT.
CAP_READ old behaviour:
- Allow pread(2).
- Disallow read(2), readv(2) (if there is no CAP_SEEK).
CAP_READ new behaviour:
- Allow read(2), readv(2).
- Disallow pread(2) (CAP_SEEK was also required).
CAP_WRITE old behaviour:
- Allow pwrite(2).
- Disallow write(2), writev(2) (if there is no CAP_SEEK).
CAP_WRITE new behaviour:
- Allow write(2), writev(2).
- Disallow pwrite(2) (CAP_SEEK was also required).
Added convinient defines:
#define CAP_PREAD (CAP_SEEK | CAP_READ)
#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ)
#define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
#define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W)
#define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X)
#define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X)
#define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
#define CAP_RECV CAP_READ
#define CAP_SEND CAP_WRITE
#define CAP_SOCK_CLIENT \
(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
#define CAP_SOCK_SERVER \
(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
CAP_SETSOCKOPT | CAP_SHUTDOWN)
Added defines for backward API compatibility:
#define CAP_MAPEXEC CAP_MMAP_X
#define CAP_DELETE CAP_UNLINKAT
#define CAP_MKDIR CAP_MKDIRAT
#define CAP_RMDIR CAP_UNLINKAT
#define CAP_MKFIFO CAP_MKFIFOAT
#define CAP_MKNOD CAP_MKNODAT
#define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER)
Sponsored by: The FreeBSD Foundation
Reviewed by: Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with: rwatson, benl, jonathan
ABI compatibility discussed with: kib
clock_gettime(2) functions if supported. The speedup seen in
microbenchmarks is in range 4x-7x depending on the hardware.
Only amd64 and i386 architectures are supported. Libc uses rdtsc and
kernel data to calculate current time, if enabled by kernel.
Hopefully, this code is going to migrate into vdso in some future.
Discussed with: bde
Reviewed by: jhb
Tested by: flo
MFC after: 1 month
indicates the avaliability of FILE, to prevent possible reordering of
the writes as seen by other CPUs.
Reported by: Fengwei yin <yfw.bsd gmail com>
Reviewed by: jhb
MFC after: 1 week
initialize the cache of the system information as it was done for the
dynamic libc. This removes several sysctls from the static binary
startup.
Use the aux vector to fill the single struct dl_phdr_info describing
the static binary itself, to implement dl_iterate_phdr(3) for the
static binaries. [1]
Based on the submission by: John Marino <draco marino st> [1]
Tested by: flo (sparc64)
MFC after: 2 weeks
Add an API for alerting internal libc routines to the presence of
"unsafe" paths post-chroot, and use it in ftpd. [11:07]
Fix a buffer overflow in telnetd. [11:08]
Make pam_ssh ignore unpassphrased keys unless the "nullok" option is
specified. [11:09]
Add sanity checking of service names in pam_start. [11:10]
Approved by: so (cperciva)
Approved by: re (bz)
Security: FreeBSD-SA-11:06.bind
Security: FreeBSD-SA-11:07.chroot
Security: FreeBSD-SA-11:08.telnetd
Security: FreeBSD-SA-11:09.pam_ssh
Security: FreeBSD-SA-11:10.pam
calling thread's unique integral ID, which is similar to AIX function of
the same name. Bump __FreeBSD_version to note its introduction.
Reviewed by: kib
a silly rwlock deadlock problem, the deadlock is caused by writer
waiters, if a thread has already locked a reader lock, and wants to
acquire another reader lock, it will be blocked by writer waiters,
but we had already fixed it years ago.
for them, two functions _pthread_cancel_enter and _pthread_cancel_leave
are added to let thread enter and leave a cancellation point, it also
makes it possible that other functions can be cancellation points in
libraries without having to be rewritten in libthr.
atexit and __cxa_atexit handlers that are either installed by unloaded
dso, or points to the functions provided by the dso.
Use _rtld_addr_phdr to locate segment information from the address of
private variable belonging to the dso, supplied by crtstuff.c. Provide
utility function __elf_phdr_match_addr to do the match of address against
dso executable segment.
Call back into libthr from __cxa_finalize using weak
__pthread_cxa_finalize symbol to remove any atfork handler which
function points into unloaded object.
The rtld needs private __pthread_cxa_finalize symbol to not require
resolution of the weak undefined symbol at initialization time. This
cannot work, since rtld is relocated before sym_zero is set up.
Idea by: kan
Reviewed by: kan (previous version)
MFC after: 3 weeks
number of host CPUs and osreldate.
This eliminates the last sysctl(2) calls from the dynamically linked image
startup.
No objections from: kan
Tested by: marius (sparc64)
MFC after: 1 month
now type sema_t is a structure which can be put in a shared memory area,
and multiple processes can operate it concurrently.
User can either use mmap(MAP_SHARED) + sem_init(pshared=1) or use sem_open()
to initialize a shared semaphore.
Named semaphore uses file system and is located in /tmp directory, and its
file name is prefixed with 'SEMD', so now it is chroot or jail friendly.
In simplist cases, both for named and un-named semaphore, userland code
does not have to enter kernel to reduce/increase semaphore's count.
The semaphore is designed to be crash-safe, it means even if an application
is crashed in the middle of operating semaphore, the semaphore state is
still safely recovered by later use, there is no waiter counter maintained
by userland code.
The main semaphore code is in libc and libthr only has some necessary stubs,
this makes it possible that a non-threaded application can use semaphore
without linking to thread library.
Old semaphore implementation is kept libc to maintain binary compatibility.
The kernel ksem API is no longer used in the new implemenation.
Discussed on: threads@
a feature that libstdc++ depends on to simulate the behavior of libc's
internal '__isthreaded' variable. One benefit of this is that _libc_once()
is now private to _once_stub.c.
Requested by: kan
with the additional property that it is safe for routines in libc to use
in both single-threaded and multi-threaded processes. Multi-threaded
processes use the pthread_once() implementation from the threading library
while single-threaded processes use a simplified "stub" version internal
to libc. The libc stub-version of pthread_once() now also uses the
simplified "stub" version as well instead of being a nop.
Reviewed by: deischen, Matthew Fleming @ Isilon
Suggested by: alc
MFC after: 1 week
- The uid/cuid members of struct ipc_perm are now uid_t instead of unsigned
short.
- The gid/cgid members of struct ipc_perm are now gid_t instead of unsigned
short.
- The mode member of struct ipc_perm is now mode_t instead of unsigned short
(this is merely a style bug).
- The rather dubious padding fields for ABI compat with SV/I386 have been
removed from struct msqid_ds and struct semid_ds.
- The shm_segsz member of struct shmid_ds is now a size_t instead of an
int. This removes the need for the shm_bsegsz member in struct
shmid_kernel and should allow for complete support of SYSV SHM regions
>= 2GB.
- The shm_nattch member of struct shmid_ds is now an int instead of a
short.
- The shm_internal member of struct shmid_ds is now gone. The internal
VM object pointer for SHM regions has been moved into struct
shmid_kernel.
- The existing __semctl(), msgctl(), and shmctl() system call entries are
now marked COMPAT7 and new versions of those system calls which support
the new ABI are now present.
- The new system calls are assigned to the FBSD-1.1 version in libc. The
FBSD-1.0 symbols in libc now refer to the old COMPAT7 system calls.
- A simplistic framework for tagging system calls with compatibility
symbol versions has been added to libc. Version tags are added to
system calls by adding an appropriate __sym_compat() entry to
src/lib/libc/incldue/compat.h. [1]
PR: kern/16195 kern/113218 bin/129855
Reviewed by: arch@, rwatson
Discussed with: kan, kib [1]
FPA floating-point format is identical to the VFP format,
but is always stored in big-endian.
Introduce _IEEE_WORD_ORDER to describe the byte-order of
the FP representation.
Obtained from: Juniper Networks, Inc
It includes the following fix:
2426. [bug] libbind: inet_net_pton() can sometimes return the
wrong value if excessively large netmasks are
supplied. [RT #18512]
Reported by: Maksymilian Arciemowicz <cxib__at__securityreason.com>
This caching allows for completely lock-free allocation/deallocation in the
steady state, at the expense of likely increased memory use and
fragmentation.
Reduce the default number of arenas to 2*ncpus, since thread-specific
caching typically reduces arena contention.
Modify size class spacing to include ranges of 2^n-spaced, quantum-spaced,
cacheline-spaced, and subpage-spaced size classes. The advantages are:
fewer size classes, reduced false cacheline sharing, and reduced internal
fragmentation for allocations that are slightly over 512, 1024, etc.
Increase RUN_MAX_SMALL, in order to limit fragmentation for the
subpage-spaced size classes.
Add a size-->bin lookup table for small sizes to simplify translating sizes
to size classes. Include a hard-coded constant table that is used unless
custom size class spacing is specified at run time.
Add the ability to disable tiny size classes at compile time via
MALLOC_TINY.
Adding exevpe() has caused some ports to break. Even though execvpe() is
a useful routine, it does not conform to any standards.
This patch is a little bit different from the patch sent to the mailing
list. I forgot to remove execvpe from the Symbol.map (which does not
seem to miscompile libc, though).
Reviewed by: davidxu
Approved by: philip
call the pad-less versions of the corresponding syscalls if the running
kernel supports it. Check kern.osreldate once per program and cache the
result to select the appropriate syscall. This maintains userland
compatability with kernel.old's from quite a while back.
Approved by: re (kensmith)