freebsd-skq

Author	SHA1	Message	Date
oshogbo	20d273b44b	Introduce funlinkat syscall that always us to check if we are removing the file associated with the given file descriptor. Reviewed by: kib, asomers Reviewed by: cem, jilles, brooks (they reviewed previous version) Discussed with: pjd, and many others Differential Revision: https://reviews.freebsd.org/D14567	2019-04-06 09:34:26 +00:00
kib	48d91fd889	Add new file handle system calls. Namely, getfhat(2), fhlink(2), fhlinkat(2), fhreadlink(2). The syscalls are provided for a NFS userspace server (nfs-ganesha). Submitted by: Jack Halford <jack@gandi.net> Sponsored by: Gandi.net Tested by: pho Feedback from: brooks, markj MFC after: 1 week Differential revision: https://reviews.freebsd.org/D18359	2018-12-07 15:17:29 +00:00
kib	a77cb36882	Remove bits of the old NUMA. Remove numactl(1), edit numa(4) to bring it some closer to reality, provide libc ABI shims for old NUMA syscalls. Noted and reviewed by: brooks (previous version) Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D16142	2018-07-10 22:00:20 +00:00
brooks	228301019d	Get rid of netbsd_lchown and netbsd_msync syscall entries. No valid FreeBSD binary very called them (they would call lchown and msync directly) and we haven't supported NetBSD binaries in ages. This is a respin of r335983 with a workaround for the ancient BFD linker in the libc stubs. Reviewed by: kib Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D16193	2018-07-10 13:32:04 +00:00
brooks	b11bbbe3ee	Revert r335983. The bfd linker in tree doesn't support multiple names for the same symbol (at least with current flags).	2018-07-05 16:03:03 +00:00
brooks	3585a5abe3	Get rid of netbsd_lchown and netbsd_msync syscall entries. No valid FreeBSD binary ever called them (they would call lchown and msync directly) and we haven't supported NetBSD binaries in ages. Reviewed by: kib Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D15814	2018-07-05 14:12:56 +00:00
markj	9d9fd255d6	Reimplement brk() and sbrk() to avoid the use of _end. Previously, libc.so would initialize its notion of the break address using _end, a special symbol emitted by the static linker following the bss section. Compatibility issues between lld and ld.bfd could cause the wrong definition of _end (libc.so's definition rather than that of the executable) to be used, breaking the brk()/sbrk() interface. Avoid this problem and future interoperability issues by simply not relying on _end. Instead, modify the break() system call to return the kernel's view of the current break address, and have libc initialize its state using an extra syscall upon the first use of the interface. As a side effect, this appears to fix brk()/sbrk() usage in executables run with rtld direct exec, since the kernel and libc.so no longer maintain separate views of the process' break address. PR: 228574 Reviewed by: kib (previous version) MFC after: 2 months Differential Revision: https://reviews.freebsd.org/D15663	2018-06-04 19:35:15 +00:00
brooks	7477344224	Make vadvise compat freebsd11. The vadvise syscall (aka ovadvise) is undocumented and has always been implmented as returning EINVAL. Put the syscall under COMPAT11 and provide a userspace implementation. Reviewed by: kib Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D15557	2018-05-25 20:40:23 +00:00
jeff	3e6c614462	Document new NUMA related syscalls and utility options. Sponsored by: Netflix, Dell/EMC Isilon	2018-03-24 23:58:44 +00:00
cem	82710b55b6	Implement getrandom(2) and getentropy(3) The general idea here is to provide userspace programs with well-defined sources of entropy, in a fashion that doesn't require opening a new file descriptor (ulimits) or accessing paths (/dev/urandom may be restricted by chroot or capsicum). getrandom(2) is the more general API, and comes from the Linux world. Since our urandom and random devices are identical, the GRND_RANDOM flag is ignored. getentropy(3) is added as a compatibility shim for the OpenBSD API. truss(1) support is included. Tests for both system calls are provided. Coverage is believed to be at least as comprehensive as LTP getrandom(2) test coverage. Additionally, instructions for running the LTP tests directly against FreeBSD are provided in the "Test Plan" section of the Differential revision linked below. (They pass, of course.) PR: 194204 Reported by: David CARLIER <david.carlier AT hardenedbsd.org> Discussed with: cperciva, delphij, jhb, markj Relnotes: maybe Differential Revision: https://reviews.freebsd.org/D14500	2018-03-21 01:15:45 +00:00
imp	c5ddd11381	Support armv7 builds for userland Make armv7 as a new MACHINE_ARCH. Copy all the places we do armv6 and add armv7 as basically an alias. clang appears to generate code for armv7 by default. armv7 hard float isn't supported by the the in-tree gcc, so it hasn't been updated to have a new default. Support armv7 as a new valid MACHINE_ARCH (and by extension TARGET_ARCH). Add armv7 to the universe build. Differential Revision: https://reviews.freebsd.org/D12010	2017-10-05 23:01:33 +00:00
imp	6563dfe551	Forward compatibility for ino64. Add forward compatibility so that new binaries can run on old kernels. If the new system call from ino64 isn't available on your system, then the old one will be used and the results translated. The stat and statfs families of functions are fully emulated. While not required by policy, in this case it is helpful to our users to provide this compatibility. In this case, it allows rollback of the kernel after installing a new userland should a problem be discovered. It also prevents foot-shooting if a user does an install before rebooting with the new kernel. Finally, it allows the use case where one needs to run new binaries on an old kernel as part of an upgrade process. The getdirentries family uses tricks that may not work on remote filesystems. Specifically, it uses a buffer 1/4 the size requested to get the data from he old syscall. The code carefully uses direct syscalls for old system calls to avoid referencing freebsd11_* symbols, which contaminate ld-elf.so.1's export table due to its use of stat functions, which causes errno to be incorrect in client programs due to the wrong stat function being resolved in some cases. This code should removed sometime after 12 is branched. Tested on: 12-current binaries on a 10.3-beta kernel run and return consistent results. 12-current kernel and userland with packages from before ino64 was committed also work. Differential Revision: https://reviews.freebsd.org/D11185 Reviewed by: kib@, emaste@	2017-06-23 18:06:20 +00:00
cem	ebff38a197	pdwait4(2): Remove documentation of vaporware This syscall has never existed and is not at risk of existing any time soon. Remove documentation referencing it, which has been wrong since FreeBSD 9. Reported by: allanjude@	2017-06-17 17:32:40 +00:00
jilles	05d289b7eb	libc: Remove futimens() and utimensat() compat stubs. The futimens() and utimensat() compat stubs allowed using these functions on kernels that did not have the system calls yet (10.2, old 11-current). Also remove the documentation of the [ENOTSUP] error that could occur with an old kernel. A -DNO_CLEAN build may fail because the depend files refer to the deleted files.	2017-06-07 21:21:14 +00:00
jhb	c58377f2b2	Remove stale cap_rights_get(2) manpage. The documentation moved to section 3 several years ago, but 'man cap_rights_get' pulls up cap_rights_limit(2) (which is MLINKed to cap_rights_get.2) instead of cap_rights_get(3). MFC after: 1 week	2017-06-02 03:53:34 +00:00
kib	e75ba1d5c4	Commit the 64-bit inode project. Extend the ino_t, dev_t, nlink_t types to 64-bit ints. Modify struct dirent layout to add d_off, increase the size of d_fileno to 64-bits, increase the size of d_namlen to 16-bits, and change the required alignment. Increase struct statfs f_mntfromname[] and f_mntonname[] array length MNAMELEN to 1024. ABI breakage is mitigated by providing compatibility using versioned symbols, ingenious use of the existing padding in structures, and by employing other tricks. Unfortunately, not everything can be fixed, especially outside the base system. For instance, third-party APIs which pass struct stat around are broken in backward and forward incompatible ways. Kinfo sysctl MIBs ABI is changed in backward-compatible way, but there is no general mechanism to handle other sysctl MIBS which return structures where the layout has changed. It was considered that the breakage is either in the management interfaces, where we usually allow ABI slip, or is not important. Struct xvnode changed layout, no compat shims are provided. For struct xtty, dev_t tty device member was reduced to uint32_t. It was decided that keeping ABI compat in this case is more useful than reporting 64-bit dev_t, for the sake of pstat. Update note: strictly follow the instructions in UPDATING. Build and install the new kernel with COMPAT_FREEBSD11 option enabled, then reboot, and only then install new world. Credits: The 64-bit inode project, also known as ino64, started life many years ago as a project by Gleb Kurtsou (gleb). Kirk McKusick (mckusick) then picked up and updated the patch, and acted as a flag-waver. Feedback, suggestions, and discussions were carried by Ed Maste (emaste), John Baldwin (jhb), Jilles Tjoelker (jilles), and Rick Macklem (rmacklem). Kris Moore (kris) performed an initial ports investigation followed by an exp-run by Antoine Brodin (antoine). Essential and all-embracing testing was done by Peter Holm (pho). The heavy lifting of coordinating all these efforts and bringing the project to completion were done by Konstantin Belousov (kib). Sponsored by: The FreeBSD Foundation (emaste, kib) Differential revision: https://reviews.freebsd.org/D10439	2017-05-23 09:29:05 +00:00
vangyzen	d6de25428d	Add clock_nanosleep() Add a clock_nanosleep() syscall, as specified by POSIX. Make nanosleep() a wrapper around it. Attach the clock_nanosleep test from NetBSD. Adjust it for the FreeBSD behavior of updating rmtp only when interrupted by a signal. I believe this to be POSIX-compliant, since POSIX mentions the rmtp parameter only in the paragraph about EINTR. This is also what Linux does. (NetBSD updates rmtp unconditionally.) Copy the whole nanosleep.2 man page from NetBSD because it is complete and closely resembles the POSIX description. Edit, polish, and reword it a bit, being sure to keep any relevant text from the FreeBSD page. Reviewed by: kib, ngie, jilles MFC after: 3 weeks Relnotes: yes Sponsored by: Dell EMC Differential Revision: https://reviews.freebsd.org/D10020	2017-03-19 00:51:12 +00:00
ngie	32a5b43489	Replace dot-dot relative pathing with SRCTOP-relative paths where possible This reduces build output, need for recalculating paths, and makes it clearer which paths are relative to what areas in the source tree. The change in performance over a locally mounted UFS filesystem was negligible in my testing, but this may more positively impact other filesystems like NFS. LIBC_SRCTOP was left alone so Juniper (and other users) can continue to manipulate lib/libc/Makefile (and other Makefile.inc's under lib/libc) as include Makefiles with custom options. Discussed with: marcel, sjg MFC after: 1 week Reviewed by: emaste Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D9207	2017-01-20 03:23:24 +00:00
kib	e4920be6f2	Document thr_suspend(2) and thr_wake(2). Reviewed by: bjk, jilles Discussed with: emaste, wblock Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D8016	2016-09-26 08:18:34 +00:00
badger	ec84c9826b	Add manpage for rctl_* system calls Reviewed by: trasz, wblock Approved by: kib (mentor) MFC after: 3 days Sponsored by: Dell Technologies Differential Revision: https://reviews.freebsd.org/D7877	2016-09-19 02:25:30 +00:00
brooks	5072762802	Fix spelling in comment. Submitted by: brueffer	2016-09-09 16:18:44 +00:00
brooks	dc591d96b2	Reduce duplicate NOASM and PSEUDO definitions The initial value of NOASM is nearly the same in all cases and the initial value of PSEUDO is the same in all cases so reduce duplication (and hopefully, future merge conflicts) by machine independent defaults. Also document the PSEUDO variable. Reviewed by: jhb, kib Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D7820	2016-09-08 22:38:20 +00:00
kib	fffac29068	Rewrite ptrace(2) wrappers in C. Besides removing hand-translation to assembler, this also adds missing wrappers for arm64 and risc-v. Reviewed by: emaste, jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D7694	2016-08-29 18:47:51 +00:00
kib	21a721373f	Add fdatasync(2) man page, combined with fsync(2). Reviewed by: emaste, rpokala, wblock Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D7522	2016-08-17 10:16:42 +00:00
kib	e04600d300	The fdatasync(2) call must be cancellation point. Sponsored by: The FreeBSD Foundation MFC after: 13 days	2016-08-16 08:27:03 +00:00
brooks	ecead64b41	Replace use of the pipe(2) system call with pipe2(2) with a zero flags value. This eliminates the need for machine dependant assembly wrappers for pipe(2). It also make passing an invalid address to pipe(2) return EFAULT rather than triggering a segfault. Document this behavior (which was already true for pipe2(2), but undocumented). Reviewed by: andrew Approved by: re (gjb) Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D6815	2016-06-22 21:11:27 +00:00
kib	b8e76aa55c	Add thr*.2 and _umtx_op.2 manpages to the build. Sponsored by: The FreeBSD Foundation	2016-05-14 09:43:28 +00:00
kib	2b6ac44d5d	Annotate arm userspace assembler sources stating their tolerance to the non-executable stack. Reviewed by: andrew Sponsored by: The FreeBSD Foundation	2015-09-29 16:09:58 +00:00
bdrewery	3faab6f866	Avoid adding duplicates into OBJS. bsd.lib.mk already handles adding entries to OBJS based on SRCS. MFC after: 1 week Sponsored by: EMC / Isilon Storage Division	2015-09-22 04:55:28 +00:00
pfg	9b366d35ee	Move the stack protector to a new "secure" directory As part of the code refactoring to support FORTIFY_SOURCE we want a new subdirectory "secure" to keep the files related to security. Move the stack protector functions to this new directory. No functional change. Differential Review: https://reviews.freebsd.org/D3333	2015-08-14 03:03:13 +00:00
adrian	41db4b88e0	Add an initial NUMA affinity/policy configuration for threads and processes. This is based on work done by jeff@ and jhb@, as well as the numa.diff patch that has been circulating when someone asks for first-touch NUMA on -10 or -11. * Introduce a simple set of VM policy and iterator types. * tie the policy types into the vm_phys path for now, mirroring how the initial first-touch allocation work was enabled. * add syscalls to control changing thread and process defaults. * add a global NUMA VM domain policy. * implement a simple cascade policy order - if a thread policy exists, use it; if a process policy exists, use it; use the default policy. * processes inherit policies from their parent processes, threads inherit policies from their parent threads. * add a simple tool (numactl) to query and modify default thread/process policities. * add documentation for the new syscalls, for numa and for numactl. * re-enable first touch NUMA again by default, as now policies can be set in a variety of methods. This is only relevant for very specific workloads. This doesn't pretend to be a final NUMA solution. The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by 'sysctl vm.default_policy=rr'. This is only relevant if MAXMEMDOM is set to something other than 1. Ie, if you're using GENERIC or a modified kernel with non-NUMA, then this is a glorified no-op for you. Thank you to Norse Corp for giving me access to rather large (for FreeBSD!) NUMA machines in order to develop and verify this. Thank you to Dell for providing me with dual socket sandybridge and westmere v3 hardware to do NUMA development with. Thank you to Scott Long at Netflix for providing me with access to the two-socket, four-domain haswell v3 hardware. Thank you to Peter Holm for running the stress testing suite against the NUMA branch during various stages of development! Tested: * MIPS (regression testing; non-NUMA) * i386 (regression testing; non-NUMA GENERIC) * amd64 (regression testing; non-NUMA GENERIC) * westmere, 2 socket (thankyou norse!) * sandy bridge, 2 socket (thankyou dell!) * ivy bridge, 2 socket (thankyou norse!) * westmere-EX, 4 socket / 1TB RAM (thankyou norse!) * haswell, 2 socket (thankyou norse!) * haswell v3, 2 socket (thankyou dell) * haswell v3, 2x18 core (thankyou scott long / netflix!) * Peter Holm ran a stress test suite on this work and found one issue, but has not been able to verify it (it doesn't look NUMA related, and he only saw it once over many testing runs.) * I've tested bhyve instances running in fixed NUMA domains and cpusets; all seems to work correctly. Verified: * intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different NUMA policies for processes under test. Review: This was reviewed through phabricator (https://reviews.freebsd.org/D2559) as well as privately and via emails to freebsd-arch@. The git history with specific attributes is available at https://github.com/erikarn/freebsd/ in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy). This has been reviewed by a number of people (stas, rpaulo, kib, ngie, wblock) but not achieved a clear consensus. My hope is that with further exposure and testing more functionality can be implemented and evaluated. Notes: * The VM doesn't handle unbalanced domains very well, and if you have an overly unbalanced memory setup whilst under high memory pressure, VM page allocation may fail leading to a kernel panic. This was a problem in the past, but it's much more easily triggered now with these tools. * This work only controls the path through vm_phys; it doesn't yet strongly/predictably affect contigmalloc, KVA placement, UMA, etc. So, driver placement of memory isn't really guaranteed in any way. That's next on my plate. Sponsored by: Norse Corp, Inc.; Dell	2015-07-11 15:21:37 +00:00
kib	2254748ed0	The lseek(2), mmap(2), truncate(2), ftruncate(2), pread(2), and pwrite(2) syscalls are wrapped to provide compatibility with pre-7.x kernels which required padding before the off_t parameter. The fcntl(2) contains compatibility code to handle kernels before the struct flock was changed during the 8.x CURRENT development. The shims were reasonable to allow easier revert to the older kernel at that time. Now, two or three major releases later, shims do not serve any purpose. Such old kernels cannot handle current libc, so revert the compatibility code. Make padded syscalls support conditional under the COMPAT6 config option. For COMPAT32, the syscalls were under COMPAT6 already. Remove WITHOUT_SYSCALL_COMPAT build option, which only purpose was to (partially) disable the removed shims. Reviewed by: jhb, imp (previous versions) Discussed with: peter Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-04-18 21:50:13 +00:00
kib	9a774084c8	Make wait6(2), waitid(3) and ppoll(2) cancellation points. The waitid() function is required to be cancellable by the standard. The wait6() and ppoll() follow the other syscalls in their groups. Reviewed by: jhb, jilles (previous versions) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-04-18 21:35:41 +00:00
kib	6531ee3ae5	Make kevent(2) a cancellation point. Note that to cancel blocked kevent(2) call, changelist must be empty, since we cannot cancel a call which already made changes to the process state. And in reverse, call which only makes changes to the kqueue state, without waiting for an event, is not cancellable. This makes a natural usage model to migrate kqueue loop to support cancellation, where existing single kevent(2) call must be split into two: first uncancellable update of kqueue, then cancellable wait for events. Note that this is ABI-incompatible change, but it is believed that there is no cancel-safe code that relies on kevent(2) not being a cancellation point. Option to preserve the ABI would be to keep kevent(2) as is, but add new call with flags to specify cancellation behaviour, which only value seems to add complications. Suggested and reviewed by: jilles Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-03-29 19:14:41 +00:00
marius	c837ced420	Unbreak sparc64 after r276630 by calling __sparc_sigtramp_setup signal trampoline as part of the MD __sys_sigaction again. Submitted by: kib (initial versions) MFC after: 3 days	2015-02-16 22:13:03 +00:00
jilles	67db24d0f2	Add futimens and utimensat system calls. The core kernel part is patch file utimes.2008.4.diff from pluknet@FreeBSD.org. I updated the code for API changes, added the manual page and added compatibility code for old kernels. There is also audit and Capsicum support. A new UTIME_* constant might allow setting birthtimes in future. Differential Revision: https://reviews.freebsd.org/D1426 Submitted by: pluknet (partially) Reviewed by: delphij, pluknet, rwatson Relnotes: yes	2015-01-23 21:07:08 +00:00
kib	8d1dfb4106	Fix known issues which blow up the process after dlopen("libthr.so") (or loading a dso linked to libthr.so into process which was not linked against threading library). - Remove libthr interposers of the libc functions, including __error(). Instead, functions calls are indirected through the interposing table, similar to how pthread stubs in libc are already done. Libc by default points either to syscall trampolines or to existing libc implementations. On libthr load, libthr rewrites the pointers to the cancellable implementations already in libthr. The interposition table is separate from pthreads stubs indirection table to not pull pthreads stubs into static binaries. - Postpone the malloc(3) internal mutexes initialization until libthr is loaded. This avoids recursion between calloc(3) and static pthread_mutex_t initialization. - Reinstall signal handlers with wrapper on libthr load. The _rtld_is_dlopened(3) is used to avoid useless calls to sigaction(2) when libthr is statically referenced from the main binary. In the process, fix openat(2), swapcontext(2) and setcontext(2) interposing. The libc symbols were exported at different versions than libthr interposers. Export both libc and libthr versions from libc now, with default set to the higher version from libthr. Remove unused and disconnected swapcontext(3) userspace implementation from libc/gen. No objections from: deischen Tested by: pho, antoine (exp-run) (previous versions) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-01-03 18:38:46 +00:00
dchagin	162012051b	Add the ppoll() system call. Export kern_poll() needed by an upcoming Linuxulator change. Differential Revision: https://reviews.freebsd.org/D1133 Reviewed by: kib, wblock MFC after: 1 month	2014-11-13 05:26:14 +00:00
imp	9878392e1a	Convert from WITHOUT_SYSCALL_COMPAT to MK_SYSCALL_COMPAT.	2014-04-05 17:54:43 +00:00
marcel	99c9726a00	Replace use of ${.CURDIR} by ${LIBC_SRCTOP} and define ${LIBC_SRCTOP} if not already defined. This allows building libc from outside of lib/libc using a reach-over makefile. A typical use-case is to build a standard ILP32 version and a COMPAT32 version in a single iteration by building the COMPAT32 version using a reach-over makefile. Obtained from: Juniper Networks, Inc.	2014-03-04 02:19:39 +00:00
pluknet	903839e958	Provide the manual page for aio_fsync(2). Reviewed by: davidxu MFC after: 1 week	2013-12-26 19:16:30 +00:00
pluknet	71c1c69583	Fix extattr(2) MLINKS. MFC after: 1 week	2013-11-09 00:36:09 +00:00
jhb	d3ef75b6c7	Extend the support for exempting processes from being killed when swap is exhausted. - Add a new protect(1) command that can be used to set or revoke protection from arbitrary processes. Similar to ktrace it can apply a change to all existing descendants of a process as well as future descendants. - Add a new procctl(2) system call that provides a generic interface for control operations on processes (as opposed to the debugger-specific operations provided by ptrace(2)). procctl(2) uses a combination of idtype_t and an id to identify the set of processes on which to operate similar to wait6(). - Add a PROC_SPROTECT control operation to manage the protection status of a set of processes. MADV_PROTECT still works for backwards compatability. - Add a p_flag2 to struct proc (and a corresponding ki_flag2 to kinfo_proc) the first bit of which is used to track if P_PROTECT should be inherited by new child processes. Reviewed by: kib, jilles (earlier version) Approved by: re (delphij) MFC after: 1 month	2013-09-19 18:53:42 +00:00
glebius	9a02f3097d	Add new system call - aio_mlock(). The name speaks for itself. It allows to perform the mlock(2) operation, which can consume a lot of time, under control of aio(4). Reviewed by: kib, jilles Sponsored by: Nginx, Inc.	2013-06-08 13:27:57 +00:00
jilles	16772c421d	Add pipe2() system call. The pipe2() function is similar to pipe() but allows setting FD_CLOEXEC and O_NONBLOCK (on both sides) as part of the function. If p points to two writable ints, pipe2(p, 0) is equivalent to pipe(p). If the pointer is not valid, behaviour differs: pipe2() writes into the array from the kernel like socketpair() does, while pipe() writes into the array from an architecture-specific assembler wrapper. Reviewed by: kan, kib	2013-05-01 22:42:42 +00:00
jilles	299afd25fd	Add accept4() system call. The accept4() function, compared to accept(), allows setting the new file descriptor atomically close-on-exec and explicitly controlling the non-blocking status on the new socket. (Note that the latter point means that accept() is not equivalent to any form of accept4().) The linuxulator's accept4 implementation leaves a race window where the new file descriptor is not close-on-exec because it calls sys_accept(). This implementation leaves no such race window (by using falloc() flags). The linuxulator could be fixed and simplified by using the new code. Like accept(), accept4() is async-signal-safe, a cancellation point and permitted in capability mode.	2013-05-01 20:10:21 +00:00
pjd	01401cc9bc	Document chflagsat(2). Obtained from: jilles	2013-03-21 23:05:44 +00:00
pjd	702516e70b	- Implement two new system calls: int bindat(int fd, int s, const struct sockaddr addr, socklen_t addrlen); int connectat(int fd, int s, const struct sockaddr name, socklen_t namelen); which allow to bind and connect respectively to a UNIX domain socket with a path relative to the directory associated with the given file descriptor 'fd'. - Add manual pages for the new syscalls. - Make the new syscalls available for processes in capability mode sandbox. - Add capability rights CAP_BINDAT and CAP_CONNECTAT that has to be present on the directory descriptor for the syscalls to work. - Update audit(4) to support those two new syscalls and to handle path in sockaddr_un structure relative to the given directory descriptor. - Update procstat(1) to recognize the new capability rights. - Document the new capability rights in cap_rights_limit(2). Sponsored by: The FreeBSD Foundation Discussed with: rwatson, jilles, kib, des	2013-03-02 21:11:30 +00:00
pjd	f07ebb8888	Merge Capsicum overhaul: - Capability is no longer separate descriptor type. Now every descriptor has set of its own capability rights. - The cap_new(2) system call is left, but it is no longer documented and should not be used in new code. - The new syscall cap_rights_limit(2) should be used instead of cap_new(2), which limits capability rights of the given descriptor without creating a new one. - The cap_getrights(2) syscall is renamed to cap_rights_get(2). - If CAP_IOCTL capability right is present we can further reduce allowed ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed ioctls can be retrived with cap_ioctls_get(2) syscall. - If CAP_FCNTL capability right is present we can further reduce fcntls that can be used with the new cap_fcntls_limit(2) syscall and retrive them with cap_fcntls_get(2). - To support ioctl and fcntl white-listing the filedesc structure was heavly modified. - The audit subsystem, kdump and procstat tools were updated to recognize new syscalls. - Capability rights were revised and eventhough I tried hard to provide backward API and ABI compatibility there are some incompatible changes that are described in detail below: CAP_CREATE old behaviour: - Allow for openat(2)+O_CREAT. - Allow for linkat(2). - Allow for symlinkat(2). CAP_CREATE new behaviour: - Allow for openat(2)+O_CREAT. Added CAP_LINKAT: - Allow for linkat(2). ABI: Reuses CAP_RMDIR bit. - Allow to be target for renameat(2). Added CAP_SYMLINKAT: - Allow for symlinkat(2). Removed CAP_DELETE. Old behaviour: - Allow for unlinkat(2) when removing non-directory object. - Allow to be source for renameat(2). Removed CAP_RMDIR. Old behaviour: - Allow for unlinkat(2) when removing directory. Added CAP_RENAMEAT: - Required for source directory for the renameat(2) syscall. Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR): - Allow for unlinkat(2) on any object. - Required if target of renameat(2) exists and will be removed by this call. Removed CAP_MAPEXEC. CAP_MMAP old behaviour: - Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and PROT_WRITE. CAP_MMAP new behaviour: - Allow for mmap(2)+PROT_NONE. Added CAP_MMAP_R: - Allow for mmap(PROT_READ). Added CAP_MMAP_W: - Allow for mmap(PROT_WRITE). Added CAP_MMAP_X: - Allow for mmap(PROT_EXEC). Added CAP_MMAP_RW: - Allow for mmap(PROT_READ \| PROT_WRITE). Added CAP_MMAP_RX: - Allow for mmap(PROT_READ \| PROT_EXEC). Added CAP_MMAP_WX: - Allow for mmap(PROT_WRITE \| PROT_EXEC). Added CAP_MMAP_RWX: - Allow for mmap(PROT_READ \| PROT_WRITE \| PROT_EXEC). Renamed CAP_MKDIR to CAP_MKDIRAT. Renamed CAP_MKFIFO to CAP_MKFIFOAT. Renamed CAP_MKNODE to CAP_MKNODEAT. CAP_READ old behaviour: - Allow pread(2). - Disallow read(2), readv(2) (if there is no CAP_SEEK). CAP_READ new behaviour: - Allow read(2), readv(2). - Disallow pread(2) (CAP_SEEK was also required). CAP_WRITE old behaviour: - Allow pwrite(2). - Disallow write(2), writev(2) (if there is no CAP_SEEK). CAP_WRITE new behaviour: - Allow write(2), writev(2). - Disallow pwrite(2) (CAP_SEEK was also required). Added convinient defines: #define CAP_PREAD (CAP_SEEK \| CAP_READ) #define CAP_PWRITE (CAP_SEEK \| CAP_WRITE) #define CAP_MMAP_R (CAP_MMAP \| CAP_SEEK \| CAP_READ) #define CAP_MMAP_W (CAP_MMAP \| CAP_SEEK \| CAP_WRITE) #define CAP_MMAP_X (CAP_MMAP \| CAP_SEEK \| 0x0000000000000008ULL) #define CAP_MMAP_RW (CAP_MMAP_R \| CAP_MMAP_W) #define CAP_MMAP_RX (CAP_MMAP_R \| CAP_MMAP_X) #define CAP_MMAP_WX (CAP_MMAP_W \| CAP_MMAP_X) #define CAP_MMAP_RWX (CAP_MMAP_R \| CAP_MMAP_W \| CAP_MMAP_X) #define CAP_RECV CAP_READ #define CAP_SEND CAP_WRITE #define CAP_SOCK_CLIENT \ (CAP_CONNECT \| CAP_GETPEERNAME \| CAP_GETSOCKNAME \| CAP_GETSOCKOPT \| \ CAP_PEELOFF \| CAP_RECV \| CAP_SEND \| CAP_SETSOCKOPT \| CAP_SHUTDOWN) #define CAP_SOCK_SERVER \ (CAP_ACCEPT \| CAP_BIND \| CAP_GETPEERNAME \| CAP_GETSOCKNAME \| \ CAP_GETSOCKOPT \| CAP_LISTEN \| CAP_PEELOFF \| CAP_RECV \| CAP_SEND \| \ CAP_SETSOCKOPT \| CAP_SHUTDOWN) Added defines for backward API compatibility: #define CAP_MAPEXEC CAP_MMAP_X #define CAP_DELETE CAP_UNLINKAT #define CAP_MKDIR CAP_MKDIRAT #define CAP_RMDIR CAP_UNLINKAT #define CAP_MKFIFO CAP_MKFIFOAT #define CAP_MKNOD CAP_MKNODAT #define CAP_SOCK_ALL (CAP_SOCK_CLIENT \| CAP_SOCK_SERVER) Sponsored by: The FreeBSD Foundation Reviewed by: Christoph Mallon <christoph.mallon@gmx.de> Many aspects discussed with: rwatson, benl, jonathan ABI compatibility discussed with: kib	2013-03-02 00:53:12 +00:00
pjd	cc57b32cb6	Put one file per line so it is easier to read diffs against those files.	2013-02-16 22:21:46 +00:00

1 2 3 4 5

211 Commits