freebsd-dev

Author	SHA1	Message	Date
Brooks Davis	3a325dec32	Remove a needlessly clever hack to start init with sys_exec(). Construct a struct image_args with the help of new exec_args_*() helper functions and call kern_execve(). The previous code mapped a page in userspace, copied arguments out to it one at a time, and then constructed a struct execve_args all so that sys_execve() can call exec_copyin_args() to copy the data back in to a struct image_args. Opencode the part of pre_execve()/post_execve() that releases a reference to the initial vmspace. We don't need to stop threads like they do. Reviewed by: kib, jhb (prior version) Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D15469	2018-12-04 00:15:47 +00:00
Mark Johnston	02164d3603	Add a missing definition for the !COMPAT_FREEBSD32 case. Reported by: jenkins MFC with: r341442 Sponsored by: The FreeBSD Foundation	2018-12-03 21:07:10 +00:00
Mark Johnston	352aaa5122	Plug memory disclosures via ptrace(2). On some architectures, the structures returned by PT_GET*REGS were not fully populated and could contain uninitialized stack memory. The same issue existed with the register files in procfs. Reported by: Thomas Barabosch, Fraunhofer FKIE Reviewed by: kib MFC after: 3 days Security: kernel stack memory disclosure Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18421	2018-12-03 20:54:17 +00:00
Konstantin Belousov	200bf72793	Correct accuracy of the barrier writes accounting. Discussed with: mckusick MFC after: 1 week Sponsored by: The FreeBSD Foundation	2018-12-02 12:53:39 +00:00
Eric van Gyzen	5e38e3f5eb	Include path for tmpfs objects in vm.objects sysctl This applies the fix in r283924 to the vm.objects sysctl added by r283624 so the output will include the vnode information (i.e. path) for tmpfs objects. Reviewed by: kib, dab MFC after: 2 weeks Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D2724	2018-11-30 04:59:43 +00:00
Brooks Davis	f373437a01	Add helper functions to copy strings into struct image_args. Given a zeroed struct image_args with an allocated buf member, exec_args_add_fname() must be called to install a file name (or NULL). Then zero or more calls to exec_args_add_env() followed by zero or more calls to exec_args_add_env(). exec_args_adjust_args() may be called after args and/or env to allow an interpreter to be prepended to the argument list. To allow code reuse when adding arg and env variables, begin_envv should be accessed with the accessor exec_args_get_begin_envv() which handles the case when no environment entries have been added. Use these functions to simplify exec_copyin_args() and freebsd32_exec_copyin_args(). Reviewed by: kib Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D15468	2018-11-29 21:00:56 +00:00
Konstantin Belousov	7d2b0bd7d7	If BENEATH is specified, always latch the topping directory vnode. It is possible that we started with a relative path but during the lookup, found an absolute symlink. In this case, BENEATH handling code needs the latch, but it is too late to calculate it. While there, somewhat improve the assertions. Clear the NI_LCF_LATCH flag when the latch vnode is released, so that asserts know the state. Assert that there is a latch if we entered beneath+abs path mode, after the starting point is processed. Reported by: wulf With more input from: pho Sponsored by: The FreeBSD Foundation	2018-11-29 19:13:10 +00:00
Mateusz Guzik	1f6ad48c76	vfs: fix i386 build after r341220	2018-11-29 09:54:27 +00:00
Mateusz Guzik	22443809ff	cache: retire cache_enter compat schim It was added over 6 years ago for binary compat. cache_enter macro remains as it expands to cache_enter_time. Sponsored by: The FreeBSD Foundation	2018-11-29 09:32:59 +00:00
Mateusz Guzik	712775843f	vfs: drop spurious memcpy in stat Sponsored by: The FreeBSD Foundation	2018-11-29 09:04:10 +00:00
Mateusz Guzik	d47f3fdb0a	fd: unify fd range check across the routines While here annotate out of range as unlikely. Sponsored by: The FreeBSD Foundation	2018-11-29 08:53:39 +00:00
Mateusz Guzik	eec8d0a378	Convert racct_enable to bool and annotate as __read_frequently Sponsored by: The FreeBSD Foundation	2018-11-29 05:17:16 +00:00
Mateusz Guzik	64cf6a62d4	Deinline racct throttling out of syscall exit path. racct is not enabled by default and even when it is enabled processes are typically not throttled. The order of checks is left unchanged since racct_enable will be annotated as __read_frequently, while checking for the flag in the processes would probably require an extra fetch. Sponsored by: The FreeBSD Foundation	2018-11-29 05:08:46 +00:00
Mateusz Guzik	e272bf479b	Annotate td_cowgen check as unlikely. Sponsored by: The FreeBSD Foundation	2018-11-29 04:48:22 +00:00
Mateusz Guzik	3277792bde	Tidy up hardclock. - use fcmpset for updating ticks - move (rarely used) itimer handling to a dedicated function Sponsored by: The FreeBSD Foundation	2018-11-29 03:44:02 +00:00
Mateusz Guzik	1e9a1bf589	proc: create a dedicated lock for zombproc to ligthen the load on allproc_lock waitpid always takes proctree to evaluate the list, but only takes allproc if it can reap. With this patch allproc is no longer taken, which helps during poudriere -j 128. Discussed with: kib Sponsored by: The FreeBSD Foundation	2018-11-29 02:52:08 +00:00
Konstantin Belousov	affd918514	Improve sigonstack(). Avoid relying on unsigned overflow for the test. Simplify expressions to avoid duplicate check for the range. Style. Add herald comment. Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D18361	2018-11-27 19:50:58 +00:00
Jamie Gritton	b307954481	In hardened systems, where the security.bsd.unprivileged_proc_debug sysctl node is set, allow setting security.bsd.unprivileged_proc_debug per-jail. In part, this is needed to create jails in which the Address Sanitizer (ASAN) fully works as ASAN utilizes libkvm to inspect the virtual address space. Instead of having to allow unprivileged process debugging for the entire system, allow setting it on a per-jail basis. The sysctl node is still security.bsd.unprivileged_proc_debug and the jail(8) param is allow.unprivileged_proc_debug. The sysctl code is now a sysctl proc rather than a sysctl int. This allows us to determine setting the flag for the corresponding jail (or prison0). As part of the change, the dynamic allow.* API needed to be modified to take into account pr_allow flags which may now be disabled in prison0. This prevents conflicts with new pr_allow flags (like that of vmm(4)) that are added (and removed) dynamically. Also teach the jail creation KPI to allow differences for certain pr_allow flags between the parent and child jail. This can happen when unprivileged process debugging is disabled in the parent prison, but enabled in the child. Submitted by: Shawn Webb <lattera at gmail.com> Obtained from: HardenedBSD (45b3625edba0f73b3e3890b1ec3d0d1e95fd47e1, deba0b5078cef0faae43cbdafed3035b16587afc, ab21eeb3b4c72f2500987c96ff603ccf3b6e7de8) Relnotes: yes Sponsored by: HardenedBSD and G2, Inc Differential Revision: https://reviews.freebsd.org/D18319	2018-11-27 17:51:50 +00:00
Eric van Gyzen	607a0eb2f1	Remove superfluous bzero in getcontext/swapcontext/sendsig We zero the whole structure; we don't need to zero the __spare__ field again. Remove trailing whitespace. MFC after: 2 weeks Sponsored by: Dell EMC Isilon	2018-11-26 20:56:05 +00:00
Alan Somers	72bce9fff6	vfs_aio.c: rename "physio" symbols to "bio". aio has two paths: an asynchronous "physio" path and a synchronous path. Confusingly, physio(9) isn't actually used by the "physio" path, and never has been. In fact, it may even be called by the synchronous path! Rename the "physio" path to the "bio" path to reflect what it actually does: directly compose BIOs and send them to character devices. MFC after: 2 weeks	2018-11-26 18:31:00 +00:00
Alan Cox	ee73fef96e	blist_meta_alloc assumes that mask=scan->bm_bitmap is nonzero. But if the cursor lies in the middle of the space that the meta node represents, then blanking the low bits of mask may make it zero, and break later code that expects a nonzero value. Add a test that returns failure if the mask has been cleared. Submitted by: Doug Moore <dougm@rice.edu> Reported by: pho Tested by: pho X-MFC with: r340402 Differential Revision: https://reviews.freebsd.org/D18058	2018-11-24 21:52:10 +00:00
Mark Johnston	792843c38f	Pass malloc flags directly through kevent(2) subroutines. Some kevent functions have a boolean "waitok" parameter for use when calling malloc(9). Replace them with the corresponding malloc() flags: the desired behaviour is known at compile-time, so this eliminates a couple of conditional branches, and makes the code easier to read. No functional change intended. Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18318	2018-11-24 17:06:01 +00:00
Mark Johnston	36c4960ef8	Plug some kernel memory disclosures via kevent(2). The kernel may register for events on behalf of a userspace process, in which case it must be careful to zero the kevent struct that will be copied out to userspace. Reviewed by: kib MFC after: 3 days Security: kernel stack memory disclosure Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18317	2018-11-24 17:02:31 +00:00
Mark Johnston	a2afae524a	Ensure that knotes do not get registered when KQ_CLOSING is set. KQ_CLOSING is set before draining the knotes associated with a kqueue, so we must ensure that new knotes are not added after that point. In particular, some kernel facilities may register for events on behalf of a userspace process and race with a close of the kqueue. PR: 228858 Reviewed by: kib Tested by: pho MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18316	2018-11-24 16:58:34 +00:00
Mark Johnston	1eeab857a3	Lock the knlist before releasing the in-flux state in knote_fork(). Otherwise there is a window, before iteration is resumed, during which the knote may be freed. The in-flux state ensures that the knote will not be removed from the knlist while locks are dropped. PR: 228858 Reviewed by: kib Tested by: pho MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18316	2018-11-24 16:41:29 +00:00
Konstantin Belousov	cefb93f253	Parse FreeBSD Feature Control note on the ELF image activation. Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2018-11-23 23:33:55 +00:00
Konstantin Belousov	92328a3251	Generalize ELF parse_notes(). Remove the knowledge of the ABI note type and brandnote from it, instead provide it with a callback to do note-specific matching and data fetching. Implement callback to match against ELF brand. Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2018-11-23 23:29:14 +00:00
Konstantin Belousov	eda8fe63c9	Trivial reduction of the code duplication, reuse the return FALSE code. Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2018-11-23 23:16:01 +00:00
Mark Johnston	96fdfb3649	Honour the waitok parameter in kevent_expand(). Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18316	2018-11-23 23:10:03 +00:00
Konstantin Belousov	f5cf758998	Provide storage for the process feature control flags in struct proc. The flags are cleared on exec, it is up to the image activator to set them. Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2018-11-23 23:07:57 +00:00
Mark Johnston	6d2e2df764	Ensure that directory entry padding bytes are zeroed. Directory entries must be padded to maintain alignment; in many filesystems the padding was not initialized, resulting in stack memory being copied out to userspace. With the ino64 work there are also some explicit pad fields in struct dirent. Add a subroutine to clear these bytes and use it in the in-tree filesystems. The NFS client is omitted for now as it was fixed separately in r340787. Reported by: Thomas Barabosch, Fraunhofer FKIE Reviewed by: kib MFC after: 3 days Sponsored by: The FreeBSD Foundation	2018-11-23 22:24:59 +00:00
Mateusz Guzik	e3d3e8289b	Revert "fork: fix use-after-free with vfork" This unreliably breaks libc handling of vfork where forking succeded, but execve did not. vfork code in libc performs waitpid with WNOHANG in case of failed exec. With the fix exit codepath was waking up the parent before the child fully transitioned to a zombie. Woken up parent would waitpid, which could find a not-yet-zombie child and fail to reap it due to the WNOHANG flag. While removing the flag fixes the problem, it is not an option due to older releases which would still suffer from the kernel change. Revert the fix until a solution can be worked out. Note that while use-after-free which gets back due to the revert is a real bug, it's side-effects are limited due to the fact that struct proc memory is never released by UMA.	2018-11-23 04:38:50 +00:00
Mateusz Guzik	adce241981	Annotate TDP_RFPPWAIT as unlikely. The flag is only set on vfork, but is tested for all syscalls. On amd64 this shortens common-case (not vfork) code.	2018-11-22 21:38:24 +00:00
Mateusz Guzik	a5ac8272c0	fork: remove avoidable proc lock/unlock pair We don't have to access the process after making it runnable, so there is no need to hold it either. Sponsored by: The FreeBSD Foundation	2018-11-22 21:29:36 +00:00
Mateusz Guzik	b00b27e925	fork: fix use-after-free with vfork The pointer to the child is stored without any reference held. Then it is blindly used to wait until P_PPWAIT is cleared. However, if the child is autoreaped it could have exited and get freed before the parent started waiting. Use the existing hold mechanism to mitigate the problem. Most common case of doing exec remains unchanged. The corner case of doing exit performs wake up before waiting for holds to clear. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18295	2018-11-22 21:08:37 +00:00
Mark Johnston	79db6fe7aa	Plug some networking sysctl leaks. Various network protocol sysctl handlers were not zero-filling their output buffers and thus would export uninitialized stack memory to userland. Fix a number of such handlers. Reported by: Thomas Barabosch, Fraunhofer FKIE Reviewed by: tuexen MFC after: 3 days Security: kernel memory disclosure Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18301	2018-11-22 20:49:41 +00:00
Mateusz Guzik	f218ac5087	uipc_usrreq: fix inode number assignment The code was incrementing a global variable in an unsafe manner. Two different threads stating two different sockets could have resulted in the same inode numbers assigned to both. Creation is protected with a global lock, move the assigment there. Since inode numbers are 64-bit now drop the check for overflows. Sponsored by: The FreeBSD Foundation	2018-11-21 22:25:05 +00:00
Mateusz Guzik	a627b4629d	proc: update list manipulation comment on process exit Processes stay in the hash until they get reaped. This code does not unlink the child from the parent, so remove the claim that it does. Sponsored by: The FreeBSD Foundation	2018-11-21 22:16:10 +00:00
Mateusz Guzik	7883ce1f26	uipc_shm: use unr64 for inode numbers Sponsored by: The FreeBSD Foundation	2018-11-21 22:01:06 +00:00
Mateusz Guzik	53011553fa	proc: convert pfind & friends to use pidhash locks and other cleanup pfind_locked is retired as it relied on allproc which unnecessarily restricts locking of the hash. Sponsored by: The FreeBSD Foundation	2018-11-21 20:15:56 +00:00
Mateusz Guzik	3d3e6793f6	proc: implement pid hash locks and an iterator forks, exits and waits are frequently stalled during poudriere -j 128 runs due to killpg and process list exports performed for each package. Both uses take the allproc lock. The latter case can be modified to iterate over the hash with finer grained locking instead. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17817	2018-11-21 18:56:15 +00:00
Mark Johnston	d5e494fee4	Avoid unsynchronized updates to kn_status. kn_status is protected by the kqueue's lock, but we were updating it without the kqueue lock held. For EVFILT_TIMER knotes, there is no knlist lock, so the knote activation could occur during the kn_status update and result in KN_QUEUED being lost, in which case we'd enqueue an already-enqueued knote, corrupting the queue. Fix the problem by setting or clearing KN_DISABLED before dropping the kqueue lock to call into the filter. KN_DISABLED is used only by the core kevent code, so there is no side effect from setting it earlier. Reported and tested by: Sylvain GALLIANO <sg@efficientip.com> Reviewed by: kib MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18060	2018-11-21 17:32:09 +00:00
Mark Johnston	45aecd0422	Remove KN_HASKQLOCK. It is a write-only flag whose last use was removed in r302235. No functional change intended. Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18059	2018-11-21 17:28:10 +00:00
Mark Johnston	bb58b5d670	Add a taskqueue_quiesce(9) KPI. This is similar to taskqueue_drain_all(9) but will wait for the queue to become idle before returning instead of only waiting for already-enqueued tasks to finish. This will be used in the opensolaris compat layer. PR: 227784 Reviewed by: cem MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17975	2018-11-21 17:18:27 +00:00
Mark Johnston	c7dc361d6f	Clear pad bytes in the struct exported by kern.ntp_pll.gettime. Reported by: Thomas Barabosch, Fraunhofer FKIE MFC after: 3 days Sponsored by: The FreeBSD Foundation	2018-11-20 20:32:10 +00:00
Mateusz Guzik	737037f6c0	pipe: use unr64 Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18054	2018-11-20 14:59:27 +00:00
Mateusz Guzik	435bef7a2f	Implement unr64 Important users of unr like tmpfs or pipes can get away with just ever-increasing counters, making the overhead of managing the state for 32 bit counters a pessimization. Change it to an atomic variable. This can be further sped up by making the counts variable "allocate" ranges and store them per-cpu. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18054	2018-11-20 14:58:41 +00:00
Ben Widawsky	1a305bda15	acpi: fix acpi_ec_probe to only check EC devices This patch utilizes the fixed_devclass attribute in order to make sure other acpi devices with params don't get confused for an EC device. The existing code assumes that acpi_ec_probe is only ever called with a dereferencable acpi param. Aside from being incorrect because other devices of ACPI_TYPE_DEVICE may be probed here which aren't ec devices, (and they may have set acpi private data), it is even more nefarious if another ACPI driver uses private data which is not dereferancable. This will result in a pointer deref during boot and therefore boot failure. On X86, as it stands today, no other devices actually do this (acpi_cpu checks for PROCESSOR type devices) and so there is no issue. I ran into this because I am adding such a device which gets probed before acpi_ec_probe and sets private data. If ARM ever has an EC, I think they'd run into this issue as well. There have been several iterations of this patch. Earlier iterations had ECDT enumerated ECs not call into the probe/attach functions of this driver. This change was Suggested by: jhb@. Reviewed by: jhb Approved by: emaste (mentor) Differential Revision: https://reviews.freebsd.org/D16635	2018-11-19 18:29:03 +00:00
Hans Petter Selasky	90acd1d139	Minor code factoring. No functional change. MFC after: 1 week Sponsored by: Mellanox Technologies	2018-11-19 09:36:09 +00:00
Hans Petter Selasky	2205f61a31	Be more verbose when a sysctl fails to unregister. Print name of sysctl in question. MFC after: 1 week Sponsored by: Mellanox Technologies	2018-11-19 09:35:16 +00:00
Kevin Bowling	2a24f4d911	Retire sbsndptr() KPI As of r340465 all consumers use sbsndptr_adv and sbsndptr_noadv Reviewed by: gallatin Approved by: krion (mentor) Differential Revision: https://reviews.freebsd.org/D17998	2018-11-19 00:54:31 +00:00
Mateusz Guzik	2c054ce924	proc: always store parent pid in p_oppid Doing so removes the dependency on proctree lock from sysctl process list export which further reduces contention during poudriere -j 128 runs. Reviewed by: kib (previous version) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17825	2018-11-16 17:07:54 +00:00
Mark Johnston	aeb7a84ee1	Remove mostly-useless proc provider probes. For some reason the proc UMA zone's ctor, dtor and init functions are instrumented, but these functions are always available through FBT. Moreover, the probes are not part of the original Solaris proc provider, aren't documented, have no uses (e.g., in dwatch(8)) and have no clear use to begin with. Therefore, remove them. Reviewed by: rpaulo Differential Revision: https://reviews.freebsd.org/D2169	2018-11-15 23:02:59 +00:00
Warner Losh	36173f6976	Do proper conversion to/from sbt. Doh! sbttoX and Xtosbt were backwards. While they ran, they produced bogus results. Pointy hat to: imp@	2018-11-15 16:02:24 +00:00
Gleb Smirnoff	905837ebe7	Initialize compatibility epoch tracker for thread0. Fixes panics for drivers that call if_maddr_lock() during startup. Reported by: cy	2018-11-14 19:10:35 +00:00
Brooks Davis	5b1df30051	Use the main capabilities.conf for freebsd32. Allow the location of capabilities.conf to be configured. Also allow a per-abi syscall prefix to be configured with the abi_func_prefix syscalls.conf variable and check syscalls against entries in capabilities.conf with and without the prefix amended. Take advantage of these two features to allow use shared capabilities.conf between the default syscall vector and the freebsd32 compatability layer. We've been inconsistent about keeping the two in sync as evidenced by the bugs fixed in r340294. This eliminates that problem going forward. Reviewed by: kib Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D17932	2018-11-14 00:46:02 +00:00
Gleb Smirnoff	6febf18036	Fix build on some architectures after r340413. On amd64 epoch.h appeared to be included implicitly.	2018-11-14 00:33:03 +00:00
Matt Macy	91cf497515	epoch(9) revert r340097 - no longer a need for multiple sections per cpu I spoke with Samy Bahra and recent changes to CK to make ck_epoch_call and ck_epoch_poll not modify the record have eliminated the need for this.	2018-11-14 00:12:04 +00:00
Gleb Smirnoff	635c18840a	style(9), mostly adjusting overly long lines.	2018-11-13 23:57:34 +00:00
Gleb Smirnoff	a760c50c9e	With epoch not inlined, there is no point in using _lite KPI. While here, remove some unnecessary casts.	2018-11-13 23:45:38 +00:00
Gleb Smirnoff	9f360eecf9	The dualism between epoch_tracker and epoch_thread is fragile and unnecessary. So, expose CK types to kernel and use a single normal structure for epoch_tracker. Reviewed by: jtl, gallatin	2018-11-13 23:20:55 +00:00
Gleb Smirnoff	b79aa45e0e	For compatibility KPI functions like if_addr_rlock() that used to have mutexes but now are converted to epoch(9) use thread-private epoch_tracker. Embedding tracker into ifnet(9) or ifnet derived structures creates a non reentrable function, that will fail miserably if called simultaneously from two different contexts. A thread private tracker will provide a single tracker that would allow to call these functions safely. It doesn't allow nested call, but this is not expected from compatibility KPIs. Reviewed by: markj	2018-11-13 22:58:38 +00:00
Mateusz Guzik	f183fb162c	locks: plug warnings about unitialized variables They only showed up after I redefined LOCKSTAT_ENABLED to 0. doing_lockprof in mutex.c is a real (but harmless) bug. Should the value be non-zero it will do checks for lock profiling which would otherwise be skipped. state in rwlock.c is a wart from the compiler, the value can't be used if lock profiling is not enabled. Sponsored by: The FreeBSD Foundation	2018-11-13 21:29:56 +00:00
Eric van Gyzen	d54474e63b	Make no assertions about lock state when the scheduler is stopped. Change the assert paths in rm, rw, and sx locks to match the lock and unlock paths. I did this for mutexes in r306346. Reported by: Travis Lane <tlane@isilon.com> MFC after: 2 weeks Sponsored by: Dell EMC Isilon	2018-11-13 20:48:05 +00:00
Gleb Smirnoff	a82296c2df	Uninline epoch(9) entrance and exit. There is no proof that modern processors would benefit from avoiding a function call, but bloating code. In fact, clang created an uninlined real function for many object files in the network stack. - Move epoch_private.h into subr_epoch.c. Code copied exactly, avoiding any changes, including style(9). - Remove private copies of critical_enter/exit. Reviewed by: kib, jtl Differential Revision: https://reviews.freebsd.org/D17879	2018-11-13 19:02:11 +00:00
Mark Johnston	bb4a27f927	Allow allocations across meta boundaries. Remove restrictions that prevent allocation requests to cross the boundary between two meta nodes. Replace the bmu_avail field in meta nodes with a bitmap that identifies which subtrees have some free memory, and iterate over the nonempty subtrees only in blst_meta_alloc. If free memory is scarce, this should make searching for it faster. Put the code for handling the next-leaf allocation in a separate function. When taking blocks from the next leaf empties the leaf, be sure to clear the appropriate bit in its parent, and so on, up to the least-common ancestor of this leaf and the next. Eliminate special terminator nodes, and rely instead on the fact that there is a 0-bit at the end of the bitmask at the root of the tree that will stop a meta_alloc search, or a next-leaf search, before the search falls off the end of the tree. Make sure that the tree is big enough to have space for that 0-bit. Eliminate special all-free indicators. Lazy initialization of subtrees stands in the way of having an allocation span a meta-node boundary, so a subtree of all free blocks is not treated specially. Subtrees of all-allocated blocks are still recognized by looking at the bitmask at the root and finding 0. Don't print all-allocated subtrees. Do print the bitmasks for meta nodes, when tree-printing. Submitted by: Doug Moore <dougm@rice.edu> Reviewed by: alc MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D12635	2018-11-13 18:40:01 +00:00
Kyle Evans	75beb4d46a	Add dynamic_kenv assertion to init_static_kenv Both to formally document the requirement that this not be called after the dynamic kenv is setup, and to perhaps help static analyzers figure out what's going on. While calling init_static_kenv this late isn't fatal, there are some caveats that the caller should be aware of: - Late calls are effectively a no-op, as far as default FreeBSD is concerned, as everything will switch to searching the dynamic kenv once it's available. - Each of the kern_getenv calls will leak memory, as it's assumed that these are searching static environment and allocations will not be made. As such, this usage is not sensible and should be detected.	2018-11-13 04:34:30 +00:00
Konstantin Belousov	389474c122	Allow set ether/vlan PCP operation from the VNET jails. The vlan interfaces can be created from vnet jails, it seems, so it sounds logical to allow pcp configuration as well. Reviewed by: bz, hselasky (previous version) Sponsored by: Mellanox Technologies MFC after: 1 week Differential revision: https://reviews.freebsd.org/D17777	2018-11-12 15:59:32 +00:00
Conrad Meyer	0d1467b199	netdump: Fix netdumping with INVARIANTS kernels Correct boneheaded assertion I added in r339501. Mea culpa. The intent is to notice when an M_WAITOK zone allocation would fail during netdump, not to prevent all use of mbufs during netdump. Reviewed by: markj X-MFC-With: r339501 Differential Revision: https://reviews.freebsd.org/D17957	2018-11-12 05:24:20 +00:00
Konstantin Belousov	8782eef46f	Remove one-use variable. This also removes a lot of #ifdefs and cleans up a warning when the AUDIT kernel option is defined, but neither KDTRACE_HOOKS nor MAC are. Reported and tested by: danger Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-11-11 00:21:28 +00:00
Konstantin Belousov	ade85c5eec	Allow absolute paths for O_BENEATH. The path must have a tail which does not escape starting/topping directory. The documentation will come shortly, see the man pages commit message for the reason of separate commit. Reviewed by: jilles (previous version) Discussed with: emaste Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D17714	2018-11-11 00:04:36 +00:00
Brooks Davis	9a38df59e9	Fix freebsd32 mknod(at). As dev_t is now a 64-bit integer, it requires special handling as a system call argument. 64-bit arguments are split between two 64-bit integers due to the way arguments are promoted to allow reuse of most system call implementations. They must be reassembled before use. Further, 64-bit arguments at an odd offset (counting from zero) are padded and slid to the next slot on powerpc and mips. Fix the non-COMPAT11 system call by adding a freebsd32_mknodat() and appropriately padded declerations. The COMPAT11 system calls are fully compatible with the 64-bit implementations so remove the freebsd32_ versions. Use uint32_t consistently as the type of the old dev_t. This matches the old definition. Reviewed by: kib MFC after: 3 days Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D17928	2018-11-09 21:01:16 +00:00
Brooks Davis	b34f4419fb	Make freebsd32_umtx_op follow the freebsd32_foo convention. Sponsored by: DARPA, AFRL	2018-11-09 00:46:10 +00:00
John Baldwin	4bf4b0f139	Enable non-executable stacks by default on RISC-V. Reviewed by: markj Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D17878	2018-11-07 18:32:02 +00:00
Brooks Davis	5577e44bf4	Regen after r340221: allow pointer return types. Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D17873	2018-11-07 16:56:07 +00:00
Brooks Davis	e56ec0e519	makesyscalls.sh: allow pointer return types. The previous code required that the return type be a single word. This allows it to be a pointer without using a typedef. Update the return types of break, mmap, and shmat to be void * as declared. This only effects systrace output in-tree, but can aid in generating system call wrappers from syscalls.master. Reviewed by: kib Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D17873	2018-11-07 16:55:04 +00:00
Mark Johnston	f8a222010f	Avoid fixing the tty_info() buffer size in tty.h. Different compilation units may otherwise get a different view of the layout of struct tty depending on whether they include opt_printf.h. This caused a blowup in the number of types defined in the kernel's CTF file after r339468; thanks to dim@ for bisecting down to that revision. PR: 232675 Reported by: dim Reviewed by: cem (previous version) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17877	2018-11-06 23:41:44 +00:00
Mark Johnston	07702f72e5	Avoid specifying VM_PROT_EXECUTE in mappings from pipe_map and exec_map. These submaps are used for mapping pipe buffers and execv() argument strings respectively, so there's no need for such mappings to have execute permissions. Reported by: jhb Reviewed by: alc, jhb, kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17827	2018-11-06 21:57:03 +00:00
Mark Johnston	6741ea083f	We need opt_stack.h after r339605. Reviewed by: cem Sponsored by: The FreeBSD Foundation	2018-11-06 21:47:22 +00:00
Brooks Davis	dd4d2f216f	Update some comments made obsolete by recent commits.	2018-11-06 20:45:15 +00:00
Brooks Davis	938e8dcf60	Regen after r340199: Use declared types for caddr_t arguments. Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D17852	2018-11-06 18:47:29 +00:00
Brooks Davis	318f0d7720	Use declared types for caddr_t arguments. Leave ptrace(2) alone for the moment as it's defined to take a caddr_t. Reviewed by: kib Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D17852	2018-11-06 18:46:38 +00:00
Mariusz Zaborski	f4a035b8df	Regenerate after r340129. Pointed out by: brooks	2018-11-06 18:03:04 +00:00
Mark Johnston	f71ef9b686	Use plain atomic_{add,subtract} when that's sufficient. CID: 1386920 MFC after: 2 weeks	2018-11-06 17:32:25 +00:00
Andrew Turner	4ea56599e8	Port the NetBSD ubsan runtime to the FreeBSD kernel. This allows us to build the ubsan code added in r340189 into the kernel with the KUBSAN option. This will report when undefined behaviour is detected in the currently running kernel. As it can be large, the kernel is 65MB on arm64, loader may not be able to load the kernel on all architectures so is disabled by default for now. Sponsored by: DARPA, AFRL	2018-11-06 17:32:07 +00:00
Andrew Turner	0645126fae	Import the NetBSD micro ubsan code for the kernel. This imports revision 1.3 of common/lib/libc/misc/ubsan.c from NetBSD, the micro-ubsan code. It is an implementation of the Undefined Behavior Sanitizer runtime for use with recent clang and gcc. The uubsan code will be used in a later commit to implement kubsan to help find undefined behavior in the kernel. Sponsored by: DARPA, AFRL	2018-11-06 16:56:49 +00:00
Brooks Davis	44cbc1c2b7	Fix a couple indentation errors in r339958.	2018-11-06 00:09:43 +00:00
John Baldwin	4cbbb74888	Add a KPI for the delay while spinning on a spin lock. Replace a call to DELAY(1) with a new cpu_lock_delay() KPI. Currently cpu_lock_delay() is defined to DELAY(1) on all platforms. However, platforms with a DELAY() implementation that uses spin locks should implement a custom cpu_lock_delay() doesn't use locks. Reviewed by: kib MFC after: 3 days	2018-11-05 21:34:17 +00:00
Mariusz Zaborski	82560231d3	capsicum: allow ppoll(2) in capability mode We already allow to use poll(2). There is no reason to disallow ppoll(2). PR: 232495 Submitted by: Stefan Grundmann <sg2342@googlemail.com> Reviewed by: cem, oshogbo MFC after: 2 weeks	2018-11-04 17:12:53 +00:00
Matt Macy	10f42d244b	Convert epoch to read / write records per cpu In discussing D17503 "Run epoch calls sooner and more reliably" with sbahra@ we came to the conclusion that epoch is currently misusing the ck_epoch API. It isn't safe to do a "write side" operation (ck_epoch_call or ck_epoch_poll) in the middle of a "read side" section. Since, by definition, it's possible to be preempted during the middle of an EPOCH_PREEMPT epoch the GC task might call ck_epoch_poll or another thread might call ck_epoch_call on the same section. The right solution is ultimately to change the way that ck_epoch works for this use case. However, as a stopgap for 12 we agreed to simply have separate records for each use case. Tested by: pho@ MFC after: 3 days	2018-11-03 03:43:32 +00:00
Brooks Davis	4e8c73eb20	Regen after r340080: Add const to input-only char * arguments. Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D17812	2018-11-02 20:56:19 +00:00
Brooks Davis	12e69f96a2	Add const to input-only char * arguments. These arguments are mostly paths handled by NAMEI() macros which already take const char arguments. This change improves the match between syscalls.master and the public declerations of system calls. Reviewed by: kib (prior version) Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D17812	2018-11-02 20:50:22 +00:00
Warner Losh	003ffd57fe	Add sysctl_usec_to_sbintime and sysctl_msec_to_sbintime. These functions are used to present a sbintime_t as either a number of microseconds or a number of milliseconds respectively. Sponsored by: Netflix	2018-11-02 17:50:57 +00:00
Mark Johnston	2203c46d87	Initialize the eflags field of vm_map headers. Initializing the eflags field of the map->header entry to a value with a unique new bit set makes a few comparisons to &map->header unnecessary. Submitted by: Doug Moore <dougm@rice.edu> Reviewed by: alc, kib Tested by: pho MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D14005	2018-11-02 16:26:44 +00:00
Brooks Davis	1493c2ee62	Make vop_symlink take a const target path. This will enable callers to take const paths as part of syscall decleration improvements. Where doing so is easy and non-distruptive carry the const through implementations. In UFS the value is passed to an interface that must take non-const values. In ZFS, const poisoning would touch code shared with upstream and it's not worth adding diffs. Bump __FreeBSD_version for external API consumers. Reviewed by: kib (prior version) Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D17805	2018-11-02 14:42:36 +00:00
Conrad Meyer	78c2a9806e	kern_poll: Restore explanatory comment removed in r177374 The comment isn't stale. The check is bogus in the sense that poll(2) does not require pollfd entries to be unique in fd space, so there is no reason there cannot be more pollfd entries than open or even allowed fds. The check is mostly a seatbelt against accidental misuse or abuse. FD_SETSIZE, while usually unrelated to poll, is used as an arbitrary floor for systems with very low kern.maxfilesperproc. Additionally, document this possible EINVAL condition in the poll.2 manual. No functional change. Reviewed by: markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D17671	2018-11-01 23:46:23 +00:00
Brooks Davis	f7e5ce325f	Regent after r340034: Use mode_t when the documented signature does. Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D17784	2018-11-01 23:10:53 +00:00
Brooks Davis	2105ac07d7	Use mode_t when the documented signature does. This is more clear and produces better results when generating function stubs from syscalls.master. Reviewed by: kib, emaste Obtained from: CheribSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D17784	2018-11-01 23:06:50 +00:00
John Baldwin	b317cfd4c0	Don't enter DDB for fatal traps before panic by default. Add a new 'debugger_on_trap' knob separate from 'debugger_on_panic' and make the calls to kdb_trap() in MD fatal trap handlers prior to calling panic() conditional on this new knob instead of 'debugger_on_panic'. Disable the new knob by default. Developers who wish to recover from a fatal fault by adjusting saved register state and retrying the faulting instruction can still do so by enabling the new knob. However, for the more common case this makes the user experience for panics due to a fatal fault match the user experience for other panics, e.g. 'c' in DDB will generate a crash dump and reboot the system rather than being stuck in an infinite loop of fatal fault messages and DDB prompts. Reviewed by: kib, avg MFC after: 2 months Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D17768	2018-11-01 21:34:17 +00:00
Brooks Davis	e3e5481326	Reformat syscalls.master for better readability. This takes advantage of two recents changes to makesyscalls.sh: r328598: Permit a range of syscall numbers for UNIMPL r339624: Remove the need for backslashes in syscalls.master Syscall declerations are now split across multiple lines with the syscall name and variables each on seperate lines (with an exception for syscalls taking no arguments.) Reviewed by: imp Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D17706	2018-10-31 16:17:45 +00:00
Bjoern A. Zeeb	9afc56849a	Fix mips build after r339931. I erroneously thought that it was two 64bit platforms which use link_elf_obj.c. PR: 228854 Reported by: ci.f.o. MFC after: 3 days X-MFC with: r339931 Pointyhat to: bz	2018-10-30 21:35:56 +00:00
Bjoern A. Zeeb	0f823b6497	As a follow-up to r339930 and various reports implement logging in case we fail during module load because the pcpu or vnet module sections are full. We did return a proper error but not leaving any indication to the user as to what the actual problem was. Even worse, on 12/13 currently we are seeing an unrelated error (ENOSYS instead of ENOSPC, which gets skipped over in kern_linker.c) to be printed which made problem diagnostics even harder. PR: 228854 MFC after: 3 days	2018-10-30 20:51:03 +00:00
Mark Johnston	9978bd996b	Add malloc_domainset(9) and _domainset variants to other allocator KPIs. Remove malloc_domain(9) and most other _domain KPIs added in r327900. The new functions allow the caller to specify a general NUMA domain selection policy, rather than specifically requesting an allocation from a specific domain. The latter policy tends to interact poorly with M_WAITOK, resulting in situations where a caller is blocked indefinitely because the specified domain is depleted. Most existing consumers of the _domain KPIs are converted to instead use a DOMAINSET_PREF() policy, in which we fall back to other domains to satisfy the allocation request. This change also defines a set of DOMAINSET_FIXED() policies, which only permit allocations from the specified domain. Discussed with: gallatin, jeff Reported and tested by: pho (previous version) MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17418	2018-10-30 18:26:34 +00:00
Mark Johnston	920239efde	Fix some problems that manifest when NUMA domain 0 is empty. - In uma_prealloc(), we need to check for an empty domain before the first allocation attempt, not after. Fix this by switching uma_prealloc() to use a vm_domainset iterator, which addresses the secondary issue of using a signed domain identifier in round-robin iteration. - Don't automatically create a page daemon for domain 0. - In domainset_empty_vm(), recompute ds_cnt and ds_order after excluding empty domains; otherwise we may frequently specify an empty domain when calling in to the page allocator, wasting CPU time. Convert DOMAINSET_PREF() policies for empty domains to round-robin. - When freeing bootstrap pages, don't count them towards the per-domain total page counts for now: some vm_phys segments are created before the SRAT is parsed and are thus always identified as being in domain 0 even when they are not. Then, when bootstrap pages are freed, they are added to a domain that we had previously thought was empty. Until this is corrected, we simply exclude them from the per-domain page count. Reported and tested by: Rajesh Kumar <rajfbsd@gmail.com> Reviewed by: gallatin MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17704	2018-10-30 17:57:40 +00:00
Eric van Gyzen	fcbb889fdb	Always stop the scheduler when entering kdb Set curthread->td_stopsched when entering kdb via any vector. Previously, it was only set when entering via panic, so when entering kdb another way, mutexes and such were still "live", and an attempt to lock an already locked mutex would panic. Reviewed by: kib, cem Discussed with: jhb Tested by: pho MFC after: 2 months Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D17687	2018-10-30 14:54:15 +00:00
Stephen Hurd	5201e0f110	Drain grouptaskqueue of the gtask before detaching it. taskqgroup_detach() would remove the task even if it was running or enqueued, which could lead to panics (see D17404). With this change, taskqgroup_detach() drains the task and sets a new flag which prevents the task from being scheduled again. I've added grouptask_block() and grouptask_unblock() to allow control over the flag from other locations as well. Reviewed by: Jeffrey Pieper <jeffrey.e.pieper@intel.com> MFC after: 1 week Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D17674	2018-10-29 14:36:03 +00:00
Mark Johnston	4aed5937db	Use M_WAITOK in init_hwpmc(). No functional change intended. MFC after: 2 weeks	2018-10-27 18:48:49 +00:00
Conrad Meyer	2384981b61	poll: Unify userspace pollfd pointer name Some of the poll code used 'fds' and some used 'ufds' to refer to the uap->fds userspace pointer that was passed around to subroutines. Some of the poll code used 'fds' to refer to the kernel memory pollfd arrays, which seemed unnecessarily confusing. Unify on 'ufds' to refer to the userspace pollfd array. Additionally, 'bits' is not an accurate description of the kernel pollfd array in kern_poll, so rename that to 'kfds'. Finally, clean up some logic with mallocarray() and nitems(). No functional change. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D17670	2018-10-26 20:07:46 +00:00
Brooks Davis	ed34a7fcf2	Move 32-bit compat support for FIODGNAME to the right place. ioctl(2) commands only have meaning in the context of a file descriptor so translating them in the syscall layer is incorrect. The new handler users an accessor to retrieve/construct a pointer from the last member of the passed structure and relies on type punning to access the other member which requires no translation. Unlike r339174 this change supports both places FIODGNAME is handled. Reviewed by: kib Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D17475	2018-10-26 17:59:25 +00:00
Konstantin Belousov	4f77f48884	Implement O_BENEATH and AT_BENEATH. Flags prevent open(2) and *at(2) vfs syscalls name lookup from escaping the starting directory. Supposedly the interface is similar to the same proposed Linux flags. Reviewed by: jilles (code, previous version of manpages), 0mp (manpages) Discussed with: allanjude, emaste, jonathan Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D17547	2018-10-25 22:16:34 +00:00
Mark Johnston	ad054101eb	Remove a dead store. CID: 1304878 MFC after: 1 week	2018-10-25 17:36:28 +00:00
Mark Johnston	970a174f3b	Add FALLTHROUGH comments to appease Coverity. CID: 1017862-1017864, 1017866-1017868 MFC after: 2 weeks	2018-10-25 15:43:21 +00:00
Mark Johnston	a0a18fd46b	Remove a redundant check. CID: 1042100 MFC after: 2 weeks	2018-10-25 15:40:59 +00:00
Konstantin Belousov	4fceda6206	Correct condition to detect mount(2) support by a filesystem. Reported and tested by: cy Sponsored by: The FreeBSD Foundation Approved by: re (rgrimes)	2018-10-24 19:40:09 +00:00
Konstantin Belousov	8ff7fad1d7	Only call sigdeferstop() for NFS. Use bypass to catch any NFS VOP dispatch and route it through the wrapper which does sigdeferstop() and then dispatches original VOP. NFS does not need a bypass below it, which is not supported. The vop offset in the vop_vector is added since otherwise it is impossible to get vop_op_t from the internal table, and I did not wanted to create the layered fs only to wrap NFS VOPs. VFS_OP()s wrap is straightforward. Requested and reviewed by: mjg (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D17658	2018-10-23 21:43:41 +00:00
Eric Joyner	46fa0c2552	Revert r339634. That commit is causing kernel panics in em(4), so this will be reverted until those are fixed. Reported by: ae@, pho@, et al Sponsored by: Intel Corporation	2018-10-23 17:06:36 +00:00
Eric Joyner	940f62d616	iflib: drain enqueued tasks before detaching from taskqgroup The taskqgroup_detach function does not check if task is already enqueued when detaching it. This may lead to kernel panic if enqueued task starts after context state lock is destroyed. Ensure that the already enqueued admin tasks are executed before detaching them. The issue was discovered during validation of D16429. Unloading of if_ixlv followed by immediate removal of VFs with iovctl -D may lead to panic on NODEBUG kernel. As well, check if iflib is in detach before enqueueing new admin or iov tasks, to prevent new tasks from executing while the taskqgroup tasks are being drained. Submitted by: Krzysztof Galazka <krzysztof.galazka@intel.com> Reviewed by: shurd@, erj@ Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D17404	2018-10-23 04:37:29 +00:00
Brooks Davis	9d7051d920	Remove the need for backslashes in syscalls.master. Join non-special lines together until we hit a line containing a '}' character. This allows the function declaration body to be split across multiple lines without backslash continuation characters. Continue to join lines ending with backslashes to allow gradual migration and to support out-of-tree syscall vectors Reviewed by: emaste, kib Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D17488	2018-10-22 22:13:00 +00:00
Brooks Davis	22c0c9a481	Remove __restrict qualifiers from syscalls.master. The restruct qualifier is intended to aid code generation in the compiler, but the only access to storage through these pointers is via structs using copyin/copyout and the like which can not be written in C or C++ and thus the compiler gains nothing from the qualifiers. As such, the qualifiers add no value in current usage. Reviewed by: kib Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D17574	2018-10-22 21:50:43 +00:00
Mark Johnston	b61f314290	Make it possible to disable NUMA support with a tunable. This provides a chicken switch for anyone negatively impacted by enabling NUMA in the amd64 GENERIC kernel configuration. With NUMA disabled at boot-time, information about the NUMA topology is not exposed to the rest of the kernel, and all of physical memory is viewed as coming from a single domain. This method still has some performance overhead relative to disabling NUMA support at compile time. PR: 231460 Reviewed by: alc, gallatin, kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17439	2018-10-22 20:13:51 +00:00
Conrad Meyer	93940996fd	Conditionalize kern.tty_info_kstacks feature on STACKS option Fix tinderbox (mips XLPN32) after r339471. Reported by: tinderbox X-MFC-With: r339471 Sponsored by: Dell EMC Isilon	2018-10-22 17:42:57 +00:00
Mark Johnston	21744c825f	Don't import 0 into vmem quantum caches. vmem uses UMA cache zones to implement the quantum cache. Since uma_zalloc() returns 0 (NULL) to signal an allocation failure, UMA should not be used to cache resource 0. Fix this by ensuring that 0 is never cached in UMA in the first place, and by modifying vmem_alloc() to fall back to a search of the free lists if the cache is depleted, rather than blocking in qc_import(). Reported by and discussed with: Brett Gutstein <bgutstein@rice.edu> Reviewed by: alc MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D17483	2018-10-22 16:16:42 +00:00
Warner Losh	e9b5375b04	Retire dpt(4) Marked as gone in 12 and not relevant since the early 90s. No sightings in nycbug's dmesg database. Relnotes: yes	2018-10-22 02:35:12 +00:00
Warner Losh	a1db7455b7	Remove bt(4) driver The buslogic scsi driver has been tagged as gone in 12 for some time now. Remove it. The nycbug dmesg database shows only one sighting in 6 for this driver. It was very popular in the early days of the project, but that popularity seems to have died by 2004 when the nycbug database started up. Relnotes: yes	2018-10-22 02:34:59 +00:00
Warner Losh	43b16da804	Remove adv(4) and adw(4) Remove the advanssy drivers (both adv and adw). They were tagged as gone in 12 a while qgo. The nycbug dmesg database shows this was last seen in 6 and there were only a few adv sightings then (none for adw). Relnotes: yes	2018-10-22 02:34:47 +00:00
Warner Losh	39c362e0b0	Remove aha(4) from the tree. We tagged aha as gone in 12 a while ago. Proceed with its removal. Data from nycbug's database shows the last sighting of this driver in 6, with the prior one in 4.x show its popularity had died prior to 4.x. Relnotes: yes	2018-10-22 02:34:25 +00:00
Warner Losh	c4b23051af	Remove stray refernce to pdq. Like the infamous twenty first of Johan Sebastian Bach's twenty children, it hasn't been seen in many years.	2018-10-21 16:49:49 +00:00
Conrad Meyer	3937ee7557	netdump: Zone mbufs should be allocated before dump Reviewed by: markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D17306	2018-10-20 22:24:58 +00:00
Conrad Meyer	767bc248de	ZSTDIO: Correctly initialize zstd context with provided 'level' Prior to this revision, we allocated sufficient context space for 'level' but never actually set the compress level parameter, so we would always get the default '3'. Reviewed by: markj, vangyzen MFC after: 12 hours Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D17144	2018-10-20 21:49:44 +00:00
Conrad Meyer	ec86f8b28b	dev_refthread: Do not initialize ref when reference was not acquired Like the companion API devvn_refthread, leave ref uninitialized when a reference was not acquired. Initializing to 1 provides a vaguely correct-looking but bogus value for broken callers to (mistakenly) pass to dev_relthread() when refthread fails. Make it even more clear to consumers that dev_relthread is only valid when dev_refthread succeeds. Reviewed by: kib, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D16885	2018-10-20 19:42:38 +00:00
Conrad Meyer	d7aa89c363	tty info (^T): Add optional kernel stack(9) traces It is often useful for developers and administrators to determine a running thread's stack for debugging purposes. With this feature, using ^T will print that information For now, the feature is disabled by default. Enable with sysctl kern.tty_info_kstacks=1. Discussed with: markj Reviewed by: oshogbo Relnotes: yes Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D17621	2018-10-20 18:42:28 +00:00
Conrad Meyer	6858c2cc8f	Replace ttyprintf with sbuf_printf and tty drain routine Add string variants of cnputc and tty_putchar, and use them from the tty sbuf drain routine. Suggested by: ed@ Sponsored by: Dell EMC Isilon	2018-10-20 18:31:36 +00:00
Conrad Meyer	d158fa4ade	Add flags variants to linker_files / stack(9) symbol resolution Some best-effort consumers may find trylock behavior for stack(9) symbol resolution acceptable. Expose that behavior to such consumers. This API is ugly. If in the future the modules and linker file list locking is cleaned up such that the linker_files list can be iterated safely without acquiring a sleepable lock, this API should be removed. However, most of the time nothing will be holding the linker files lock exclusive and the acquisition can proceed. Reviewed by: markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D17620	2018-10-20 18:08:43 +00:00
Mark Johnston	662e7fa8d9	Create some global domainsets and refactor NUMA registration. Pre-defined policies are useful when integrating the domainset(9) policy machinery into various kernel memory allocators. The refactoring will make it easier to add NUMA support for other architectures. No functional change intended. Reviewed by: alc, gallatin, jeff, kib Tested by: pho (part of a larger patch) MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17416	2018-10-20 17:36:00 +00:00
Bjoern A. Zeeb	9ae7bc39c2	In r78161 the lookup_set linker method was introduced which optionally returns the section start and stop locations as well as a count if the caller asks for them. There was only one out-of-file consumer of count which did not actually use it and hence was eliminated in r339407. In r194784 parse_dpcpu(), and in r195699 parse_vnet() (a copy of the former) started to use the link_elf_lookup_set() interface internally also asking for the count. count is computed as the difference of the void stop - void start locations and as such, if the absoulte numbers (stop - start) % sizeof(void ) != 0 a round-down happens, e.g., stop 0x1003 - start 0x1000 => count 0. To get the section size instead of "count is the number of pointer elements in the section", the parse_() functions do a count = sizeof(void ). They use the result to allocate memory and copy the section data into the "master" and per-instance memory regions with a size of count. As a result of count possibly round-down this can miss the last bytes of the section. The good news is that we do not touch out of bounds memory during these operations (we may at a later stage if the last bytes would overflow the master sections). Given relocation in elf_relocaddr() works based on the absolute numbers of start and stop, this means that we can possibly try to access relocated data which was never copied and hence we get random garbage or at best zeroed memory. Stop the two (last) consumers of count (the parse_*() functions) from using count as well, and calculate the section size based on the absolute numbers of stop and start and use the proper size for the memory allocation and data copies. This will make the symbols in the last bytes of the pcpu or vnet sections be presented as expected. PR: 232289 Approved by: re (gjb) MFC after: 2 weeks	2018-10-18 20:20:41 +00:00
Jamie Gritton	4520f617c9	Fix typos from r339409. Reported by: maxim Approved by: re (gjb)	2018-10-18 15:02:57 +00:00
Jonathan T. Looney	e77f0bdcb5	r334853 added a "socket destructor" callback. However, as implemented, it was really a "socket close" callback. Update the socket destructor functionality to run when a socket is destroyed (rather than when it is closed). The original submitter has confirmed that this change satisfies the intended use case. Suggested by: rwatson Submitted by: Michio Honda <micchie at sfc.wide.ad.jp> Tested by: Michio Honda <micchie at sfc.wide.ad.jp> Approved by: re (kib) Differential Revision: https://reviews.freebsd.org/D17590	2018-10-18 14:20:15 +00:00
Jamie Gritton	b19d66fd5a	Add a new jail permission, allow.read_msgbuf. When true, jailed processes can see the dmesg buffer (this is the current behavior). When false (the new default), dmesg will be unavailable to jailed users, whether root or not. The security.bsd.unprivileged_read_msgbuf sysctl still works as before, controlling system-wide whether non-root users can see the buffer. PR: 211580 Submitted by: bz Approved by: re@ (kib@) MFC after: 3 days	2018-10-17 16:11:43 +00:00
Bjoern A. Zeeb	0455a92bcb	The countp argument passed to linker_file_lookup_set() in linker_load_dependencies() is unused, so no need to ask for the value in first place. Remove the unused "count" variable. Approved by: re (kib)	2018-10-17 10:31:08 +00:00
Mark Johnston	ddab8c351a	Reparent a child of pdfork(2) to its reaper when the procdesc is closed. Unconditionally reparenting to PID 1 breaks the procctl(2) reaper functionality. Add a regression test for this case. Reviewed by: kib Approved by: re (gjb) MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17589	2018-10-16 20:06:56 +00:00
Gleb Smirnoff	47f1ea5109	Plug sendfile(2) on a listening socket with proper error code. Reported by: ngie Reviewed by: ngie Approved by: re (delphij)	2018-10-16 15:57:16 +00:00
Kyle Evans	29bf3a7ba8	Correct COMPAT* macro names in syscalls.master Both ^/sys/compat/freebsd32/syscalls.master and ^/sys/kern/syscalls.master cited "COMPAT[n] #ifdef" instead of "COMPAT_FREEBSD[n] #ifdef" in places. Approved by: re (glebius)	2018-10-15 21:35:57 +00:00
Mateusz Guzik	98fca94d22	capsicum: provide cap_rights_fde_inline Reading caps is in the hot path (on each successful fd lookup), but completely unnecessarily requires a function call. Approved by: re (gjb) Sponsored by: The FreeBSD Foundation	2018-10-12 23:48:10 +00:00
Mateusz Guzik	c9964045a0	Add a file missed in r339321 Approved by: re (implicit) Sad face: mjg	2018-10-12 00:32:45 +00:00
Mateusz Guzik	3f102f5881	Provide string functions for use before ifuncs get resolved. The change is a no-op for architectures which don't ifunc memset, memcpy nor memmove. Convert places which need them. Xen bits by royger. Reviewed by: kib Approved by: re (gjb) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17487	2018-10-11 23:28:04 +00:00
Jamie Gritton	08b4333399	Fix the test prohibiting jails from sharing IP addresses. It's not supposed to be legal for two jails to contain the same IP address, unless both jails contain only that one address. This is the behavior documented in jail(8), and is there to prevent confusion when multiple jails are listening on IADDR_ANY. VIMAGE jails (now the default for GENERIC kernels) test this correctly, but non-VIMAGE jails have been performing an incomplete test when nested jails are used. Approved by: re@ (kib@) MFC after: 5 days	2018-10-06 02:10:32 +00:00
Matt Macy	e8bb589d56	eliminate locking surrounding ui_vmsize and swap reserve by using atomics Change swap_reserve and swap_total to be in units of pages so that swap reservations can be done using only atomics instead of using a single global mutex for swap_reserve and a single mutex for all processes running under the same uid for uid accounting. Results in mmap speed up and a 70% increase in brk calls / second. Reviewed by: alc@, markj@, kib@ Approved by: re (delphij@) Differential Revision: https://reviews.freebsd.org/D16273	2018-10-05 05:50:56 +00:00
Gleb Smirnoff	ad7eb8cad5	In PR 227259, a user is reporting that they have code which is using shutdown() to wakeup another thread blocked on a stream listen socket. This code is failing, while it used to work on FreeBSD 10 and still works on Linux. It seems reasonable to add another exception to support something users are actually doing, which used to work on FreeBSD 10, and still works on Linux. And, it seems like it should be acceptable to POSIX, as we still return ENOTCONN. This patch is different to what had been committed to stable/11, since code around listening sockets is different. Patch in D15019 is written by jtl@, slightly modified by me. PR: 227259 Obtained from: jtl Approved by: re (kib) Differential Revision: D15019	2018-10-03 17:40:04 +00:00
Andrew Turner	8696dcdacf	Add kernel ifunc support on arm64. Tested with ifunc resolvers in the kernel and module with calls from kernel to kernel, module to kernel, and module to module. Reviewed by: kib (previous version) Approved by: re (gjb) Differential Revision: https://reviews.freebsd.org/D17370	2018-10-01 18:51:08 +00:00
Andrew Gallatin	30c5525b3c	Allow empty NUMA memory domains to support Threadripper2 The AMD Threadripper 2990WX is basically a slightly crippled Epyc. Rather than having 4 memory controllers, one per NUMA domain, it has only 2 memory controllers enabled. This means that only 2 of the 4 NUMA domains can be populated with physical memory, and the others are empty. Add support to FreeBSD for empty NUMA domains by: - creating empty memory domains when parsing the SRAT table, rather than failing to parse the table - not running the pageout deamon threads in empty domains - adding defensive code to UMA to avoid allocating from empty domains - adding defensive code to cpuset to avoid binding to an empty domain Thanks to Jeff for suggesting this strategy. Reviewed by: alc, markj Approved by: re (gjb@) Differential Revision: https://reviews.freebsd.org/D1683	2018-10-01 14:14:21 +00:00
John Baldwin	ff13c0a24f	Regenerate after UNIMPL -> OBSOL changes in r339001. Approved by: re (gjb)	2018-09-28 17:25:28 +00:00
John Baldwin	46e2054905	Mark various removed system calls as OBSOL instead of UNIMPL. This is mostly a cosmetic change except that obsolete system calls are assigned meaningful names in the names arrays which means that using tools like kdump or truss against binaries invoking these system calls will print out the name instead of the number. The script I use to generate the XML list of syscalls for GDB also ignores UNIMPL but not OBSOL entries. In general UNIMPL should only be used to reserve placeholders for system calls that have never been implemented while system calls that existed at one time in FreeBSD but were removed should be marked OBSOL instead. Reviewed by: brooks, kib, imp Approved by: re (gjb) Differential Revision: https://reviews.freebsd.org/D17344	2018-09-28 17:23:54 +00:00
Gordon Tetlow	89f652d149	Clear stack allocated data structure to prevent kernel memory leak. Reported by: Thomas Barabosch, Fraunhofer FKIE Reviewed by: wes@ Approved by: re (implicit) Approved by: so Security: FreeBSD-EN-18:12.mem Security: CVE-2018-17155	2018-09-27 18:39:54 +00:00
Mateusz Guzik	9afff6b1c0	Eliminate false sharing in malloc due to statistic collection Currently stats are collected in a MAXCPU-sized array which is not aligned and suffers enormous false-sharing. Fix the problem by utilizing per-cpu allocation. The counter(9) API is not used here as it is too incomplete and does not provide a win over per-cpu zone sized for malloc stats struct. In particular stats are being reported for each cpu separately by just copying what is supposed to be an array element for given cpu. This eliminates significant false-sharing during malloc-heavy tests e.g. on Skylake. See the review for details. Reviewed by: markj Approved by: re (kib) Differential Revision: https://reviews.freebsd.org/D17289	2018-09-23 19:00:06 +00:00
Mateusz Guzik	d6fda03a64	select: stop doing zero-sized memsets Approved by: re (kib)	2018-09-21 13:20:41 +00:00
Mark Johnston	969e147aff	Ensure that imports into per-domain kmem arenas are KVA_QUANTUM-aligned. The old code appears to assume that vmem_alloc() would import size-aligned KVA chunks from the parent kernel_arena, but vmem doesn't provide this guarantee. Also remove the unused global RWX arena and add comments explaining why we have per-domain arenas. Reported by: alc Reviewed by: alc, kib (previous version) Approved by: re (gjb) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17249	2018-09-20 18:29:55 +00:00
Mateusz Guzik	c396945b74	vfs: remove lookup_shared tunable Reviewed by: kib, jhb Approved by: re (gjb) Differential Revision: https://reviews.freebsd.org/D17253	2018-09-20 18:25:26 +00:00
Mateusz Guzik	51e13c93b6	fd: prevent inlining of _fdrop thorough kern_descrip.c fdrop is used in several places in the file and almost never has to call _fdrop. Thus inlining it is a pure waste of space. Approved by: re (kib)	2018-09-20 13:32:40 +00:00
Konstantin Belousov	bf94d6c78b	Fix state of dquot-less vnodes after failed quotaoff. UFS quotaoff iterates over all mp vnodes, and derefences and clears the pointers to corresponding dquots. If SU work items transiently reference some of dquots,quotaoff() would eventually fail, but all processed vnodes are already stripped from dquots. The state is problematic, since quotas are left enabled, but there is no dquots where blocks and inodes can be accounted. The result is assertion failures and NULL pointer dereferences. Fix it by suspending writes around quotaoff() call. Since the filesystem is synced, no dandling references to dquots from SU workitems can left behind, which means that quotaoff succeeds. The complication there is that quotaoff VFS op is performed with the mount point busied, while to suspend, we need to start write on the mp. If vn_start_write() is called on busied mp, system might deadlock against parallel unmount request. Handle this by unbusy-ing mp before starting write, which in turn requires changing the quotaoff() interface to return with the mount point not busied, same as was done for quotaon(). Reviewed by: mckusick Reported and tested by: pho Sponsored by: The FreeBSD Foundation Approved by: re (gjb) MFC after: 1 week Differential revision: https://reviews.freebsd.org/D17208	2018-09-19 14:36:57 +00:00
Gordon Tetlow	c9e562b188	Correct ELF header parsing code to prevent invalid ELF sections from disclosing memory. Submitted by: markj Reported by: Thomas Barabosch, Fraunhofer FKIE Approved by: re (implicit) Approved by: so Security: FreeBSD-SA-18:12.elf Security: CVE-2018-6924 Sponsored by: The FreeBSD Foundation	2018-09-12 04:57:34 +00:00
Mark Johnston	cc4f3d0ae2	Rename hardclock_cnt() to hardclock() and remove the old implementation. Also remove some related and unused subroutines. They have long been replaced by variants that handle multiple coalesced events with a single call. No functional change intended. Reviewed by: cem, kib Approved by: re (gjb) Differential Revision: https://reviews.freebsd.org/D17029	2018-09-06 02:10:59 +00:00
Mark Johnston	ce9eea6425	Correct the condition under which we allocate a terminator node. We will have last_block < blocks if the block count is divisible by BLIST_BMAP_RADIX, but a terminator node is still needed if the tree isn't balanced. In this case we were overruning the blist array by 16 bytes during initialization. While here, add a check for the invalid blocks == 0 case. PR: 231116 Reviewed by: alc, kib (previous version), Doug Moore <dougm@rice.edu> Approved by: re (gjb) MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17020	2018-09-05 19:05:30 +00:00
Konstantin Belousov	1565fb29a7	Add amd64 mdthread fields needed for the upcoming EFI RT exception handling. This is split into a separate commit from the main change to make it easier to handle possible revert after upcoming KBI freeze. Reviewed by: kevans Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (rgrimes) Differential revision: https://reviews.freebsd.org/D16972	2018-09-02 21:16:43 +00:00
Konstantin Belousov	420382368a	Improve error messages from clock_if.m method failures. Print error message in verbose mode when CLOCK_SETTIME() clock_if.m method failed. For EFIRT RTC clock, add error code for the failure of CLOCK_GETTIME() report. Reviewed by: kevans Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (rgrimes) Differential revision: https://reviews.freebsd.org/D16972	2018-09-02 20:17:51 +00:00
Mark Murray	19fa89e938	Remove the Yarrow PRNG algorithm option in accordance with due notice given in random(4). This includes updating of the relevant man pages, and no-longer-used harvesting parameters. Ensure that the pseudo-unit-test still does something useful, now also with the "other" algorithm instead of Yarrow. PR: 230870 Reviewed by: cem Approved by: so(delphij,gtetlow) Approved by: re(marius) Differential Revision: https://reviews.freebsd.org/D16898	2018-08-26 12:51:46 +00:00
Alan Cox	49bfa624ac	Eliminate the arena parameter to kmem_free(). Implicitly this corrects an error in the function hypercall_memfree(), where the wrong arena was being passed to kmem_free(). Introduce a per-page flag, VPO_KMEM_EXEC, to mark physical pages that are mapped in kmem with execute permissions. Use this flag to determine which arena the kmem virtual addresses are returned to. Eliminate UMA_SLAB_KRWX. The introduction of VPO_KMEM_EXEC makes it redundant. Update the nearby comment for UMA_SLAB_KERNEL. Reviewed by: kib, markj Discussed with: jeff Approved by: re (marius) Differential Revision: https://reviews.freebsd.org/D16845	2018-08-25 19:38:08 +00:00
Warner Losh	d36967bd2b	Add a new device flag: DF_ATTACHED_ONCE This flag is set once the device has been successfully attached. When set, it inhibits devmatch from trying to match the device. This in turn allows kldunload to work as expected. Prior to the change, the driver would immediately reload because devmatch had no notion that the driver had once been attached, and therefore shouldn't participate in further matching. Differential Revision: https://reviews.freebsd.org/D16735	2018-08-23 05:06:16 +00:00
Warner Losh	5fa2979791	Create devctl freeze/thaw. This adds it to devctl, libdevctl, defines the two IOCTLs and implements the kernel bits. causes any new drivers that are added via kldload to be deferred until a 'thaw' comes in. These do not stack: it is an error to freeze while frozen, or thaw while thawed. Differential Revision: https://reviews.freebsd.org/D16735	2018-08-23 05:05:47 +00:00
Conrad Meyer	b6c7d9c345	devstat(9): Constify function parameters that can be const No functional change. When attempting to document the changed argument types in devstat.9, I discovered the 20 year old manual page severely mismatched reality even prior to my simple change. So I took a first cut pass cleaning that up to match reality. I'm sure I've missed some things; the goal was just to leave it better than when I started. Sponsored by: Dell EMC Isilon	2018-08-23 01:42:45 +00:00
Conrad Meyer	4ca8c1efe4	KASSERT: Make runtime optionality optional Add an option, KASSERT_PANIC_OPTIONAL, that allows runtime KASSERT() behavior changes. When this option is not enabled, code that allows KASSERTs to become optional is not enabled, and all violated assertions cause termination. The runtime KASSERT behavior was added in r243980. One important distinction here is that panic has __dead2 ("attribute((noreturn))"), while kassert_panic does not. Static analyzers like Coverity understand __dead2. Without it, KASSERTs go misunderstood, resulting in many false positives that result from violation of program invariants. Reviewed by: jhb, jtl, np, vangyzen Relnotes: yes Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D16835	2018-08-22 22:19:42 +00:00
Mark Johnston	36716fe2e6	Prepare the kernel linker to handle PC-relative ifunc relocations. The boot-time ifunc resolver assumes that it only needs to apply IRELATIVE relocations to PLT entries. With an upcoming optimization, this assumption no longer holds, so add the support required to handle PC-relative relocations targeting GNU_IFUNC symbols. - Provide a custom symbol lookup routine that can be used in early boot. The default lookup routine uses kobj, which is not functional at that point. - Apply all existing relocations during boot rather than filtering IRELATIVE relocations. - Ensure that we continue to apply ifunc relocations in a second pass when loading a kernel module. Reviewed by: kib MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D16749	2018-08-22 20:44:30 +00:00
Michael Tuexen	6b01d4d433	Add SOL_SOCKET level socket option with name SO_DOMAIN to get the domain of a socket. This is helpful when testing and Solaris and Linux have the same socket option using the same name. Reviewed by: bcr@, rrs@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16791	2018-08-21 14:04:30 +00:00
Alan Cox	44d0efb215	Eliminate kmem_alloc_contig()'s unused arena parameter. Reviewed by: hselasky, kib, markj Discussed with: jeff Differential Revision: https://reviews.freebsd.org/D16799	2018-08-20 15:57:27 +00:00
Kyle Evans	d529de874b	res_find: Fix fallback logic The fallback logic was broken if hints were found in multiple environments. If we found a hint in either the loader environment or the static environment, fallback would be incremented excessively when we returned to the environment-selection bits. These checks should have also been guarded by the fbacklvl checks. As a result, fbacklvl could quickly get to a point where we skip either the static environment and/or the static hints depending on which environments contained valid hints. The impact of this bug is minimal, mostly affecting mips boards that use static hints and may have hints in either the loader environment or the static environment. There may be better ways to express the searchable environments and describing their characteristics (immutable, already searched, etc.) but this may be revisited after 12 branches. Reported by: Dan Nelson <dnelson_1901@yahoo.com> Triaged by: Dan Nelson <dnelson_1901@yahoo.com> MFC after: 3 days	2018-08-18 19:45:56 +00:00
Xin LI	ed1fa01ac4	Regen after r337998.	2018-08-18 06:33:51 +00:00
Xin LI	0362ec1e8e	getrandom(2) should not be restricted in capability mode.	2018-08-18 06:31:49 +00:00
Mark Johnston	1436ff1ebb	Typo. X-MFC with: r337974	2018-08-17 16:07:06 +00:00
Mark Johnston	3ccbdc8254	Add INVARIANTS-only fences around lockless vnode refcount updates. Some internal KASSERTs access the v_iflag field without the vnode interlock held after such a refcount update. The fences are needed for the assertions to be correct in the face of store reordering. Reported and tested by: jhibbits Reviewed by: kib, mjg MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D16756	2018-08-17 15:41:01 +00:00
Mariusz Zaborski	2fe6aefff8	capsicum: allow the setproctitle(3) function in capability mode Capsicum in past allowed to change the process title. This was broken with r335939. PR: 230584 Submitted by: Yuichiro NAITO <naito.yuichiro@gmail.com> Reported by: ian@niw.com.au MFC after: 1 week	2018-08-17 14:35:10 +00:00
Kyle Evans	45625675e7	subr_prf: Don't write kern.boot_tag if it's empty This change allows one to set kern.boot_tag="" and not get a blank line preceding other boot messages. While this isn't super critical- blank lines are easy to filter out both mentally and in processing dmesg later- it allows for a mode of operation that matches previous behavior. I intend to MFC this whole series to stable/11 by the end of the month with boot_tag empty by default to make this effectively a nop in the stable branch.	2018-08-17 03:42:57 +00:00
Jamie Gritton	c542c43ef1	Revert r337922, except for some documention-only bits. This needs to wait until user is changed to stop using jail(2). Differential Revision: D14791	2018-08-16 19:09:43 +00:00
Jamie Gritton	284001a222	Put jail(2) under COMPAT_FREEBSD11. It has been the "old" way of creating jails since FreeBSD 7. Along with the system call, put the various security.jail.allow_foo and security.jail.foo_allowed sysctls partly under COMPAT_FREEBSD11 (or BURN_BRIDGES). These sysctls had two disparate uses: on the system side, they were global permissions for jails created via jail(2) which lacked fine-grained permission controls; inside a jail, they're read-only descriptions of what the current jail is allowed to do. The first use is obsolete along with jail(2), but keep them for the second-read-only use. Differential Revision: D14791	2018-08-16 18:40:16 +00:00
Edward Tomasz Napierala	e77b6cfe34	In the help message at the mountroot prompt, suggest something that actually works and matches the bsdinstall(8) default. MFC after: 2 weeks Sponsored by: DARPA, AFRL	2018-08-15 12:12:21 +00:00
Alan Cox	c65ed2ff53	Eliminate a redundant assignment. MFC after: 1 week	2018-08-11 19:21:53 +00:00
Kyle Evans	0915d9d070	subr_prf: remove think-o that had returned to local patch Reported by: cognet	2018-08-10 15:35:02 +00:00
Kyle Evans	170bc29131	boot tagging: minor fixes msgbufinit may be called multiple times as we initialize the msgbuf into a progressively larger buffer. This doesn't happen as of now on head, but it may happen in the future and we generally support this. As such, only print the boot tag if we've just initialized the buffer for the first time. The boot tag also now has a newline appended to it for better visibility, and has been switched to a normal printf, by requesto f bde, after we've denoted that the msgbuf is mapped.	2018-08-10 15:29:06 +00:00
Kyle Evans	240fcda1e8	subr_prf: style(9) the sizeof Reported by: jkim, ian	2018-08-09 19:09:06 +00:00
Kyle Evans	4c793b68da	subr_prf: Use "sizeof current_boot_tag" instead	2018-08-09 17:53:18 +00:00
Kyle Evans	2a4650cc11	BOOT_TAG: Make a config(5) option, expose as sysctl and loader tunable BOOT_TAG lived shortly in sys/msgbuf.h, but this wasn't necessarily great for changing it or removing it. Move it into subr_prf.c and add options for it to opt_printf.h. One can specify both the BOOT_TAG and BOOT_TAG_SZ (really, size of the buffer that holds the BOOT_TAG). We expose it as kern.boot_tag and also add a loader tunable by the same name that we'll fetch upon initialization of the msgbuf. This allows for flexibility and also ensures that there's a consistent way to figure out the boot tag of the running kernel, rather than relying on headers to be in-sync. Prodded super-super-lightly by: imp	2018-08-09 17:47:47 +00:00
Kyle Evans	21aa6e8345	msgbuf: Light detailing (const'ify and bool'itize)	2018-08-09 17:42:27 +00:00
Leandro Lupori	c8e2123b6a	[ppc] Fix kernel panic when using BOOTP_NFSROOT On PowerPC (and possibly other architectures), that doesn't use EARLY_AP_STARTUP, the config task queue may be used initialized. This was observed while trying to mount the root fs from NFS, as reported here: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=230168. This patch has 2 main changes: 1- Perform a basic initialization of qgroup_config, similar to what is done in taskqgroup_adjust, but simpler. This makes qgroup_config ready to be used during NFS root mount. 2- When EARLY_AP_STARTUP is not used, call inm_init() and in6m_init() right before SI_SUB_ROOT_CONF, because bootp needs to send multicast packages to request an IP. PR: Bug 230168 Reported by: sbruno Reviewed by: jhibbits, mmacy, sbruno Approved by: jhibbits Differential Revision: D16633	2018-08-09 14:04:51 +00:00
Matt Macy	9fec45d8e5	epoch_block_wait: don't check TD_RUNNING struct epoch_thread is not type safe (stack allocated) and thus cannot be dereferenced from another CPU Reported by: novel@	2018-08-09 05:18:27 +00:00
Kyle Evans	2834d61202	kern: Add a BOOT_TAG marker at the beginning of boot dmesg From the "newly licensed to drive" PR department, add a BOOT_TAG marker (by default, --<<BOOT>>--, to the beginning of each boot's dmesg. This makes it easier to do textproc magic to locate the start of each boot and, of particular interest to some, the dmesg of the current boot. The PR has a dmesg(8) component as well that I've opted not to include for the moment- it was the more contentious part of this PR. bde@ also made the statement that this boot tag should be written with an ordinary printf, which I've- for the moment- declined to change about this patch to keep it more transparent to observer of the boot process. PR: 43434 Submitted by: dak <aurelien.nephtali@wanadoo.fr> (basically rewritten) MFC after: maybe never	2018-08-09 01:32:09 +00:00
Konstantin Belousov	8f94195022	Followup to r337430: only call elf_reloc_ifunc on x86. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-08-07 20:43:50 +00:00
Konstantin Belousov	289ead7cb0	Add missed handling of local relocs against ifunc target in the obj modules. Reported and tested by: wulf Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-08-07 18:26:46 +00:00
Mark Johnston	c7902fbeae	Improve handling of control message truncation. If a recvmsg(2) or recvmmsg(2) caller doesn't provide sufficient space for all control messages, the kernel sets MSG_CTRUNC in the message flags to indicate truncation of the control messages. In the case of SCM_RIGHTS messages, however, we were failing to dispose of the rights that had already been externalized into the recipient's file descriptor table. Add a new function and mbuf type to handle this cleanup task, and use it any time we fail to copy control messages out to the recipient. To simplify cleanup, control message truncation is now only performed at control message boundaries. The change also fixes a few related bugs: - Rights could be leaked to the recipient process if an error occurred while copying out a message's contents. - We failed to set MSG_CTRUNC if the truncation occurred on a control message boundary, e.g., if the caller received two control messages and provided only the exact amount of buffer space needed for the first. PR: 131876 Reviewed by: ed (previous version) MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D16561	2018-08-07 16:36:48 +00:00
Konstantin Belousov	a70e9a1388	Swap in WKILLED processes. Swapped-out process that is WKILLED must be swapped in as soon as possible. The reason is that such process can be killed by OOM and its pages can be only freed if the process exits. To exit, the kernel stack of the process must be mapped. When allocating pages for the stack of the WKILLED process on swap in, use VM_ALLOC_SYSTEM requests to increase the chance of the allocation to succeed. Add counter of the swapped out processes to avoid unneeded iteration over the allprocs list when there is no work to do, reducing the allproc_lock ownership. Reviewed by: alc, markj (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D16489	2018-08-04 20:45:43 +00:00
Mark Johnston	5b0480f2cc	Don't check rcv sockbuf limits when sending on a unix stream socket. sosend_generic() performs an initial comparison of the amount of data (including control messages) to be transmitted with the send buffer size. When transmitting on a unix socket, we then compare the amount of data being sent with the amount of space in the receive buffer size; if insufficient space is available, sbappendcontrol() returns an error and the data is lost. This is easily triggered by sending control messages together with an amount of data roughly equal to the send buffer size, since the control message size may change in uipc_send() as file descriptors are internalized. Fix the problem by removing the space check in sbappendcontrol(), whose only consumer is the unix sockets code. The stream sockets code uses the SB_STOP mechanism to ensure that senders will block if the receive buffer fills up. PR: 181741 MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D16515	2018-08-04 20:26:54 +00:00
Mark Johnston	e62ca80bde	Style.	2018-08-04 20:16:36 +00:00
Andriy Gapon	e0fa977ea5	safer wait-free iteration of shared interrupt handlers The code that iterates a list of interrupt handlers for a (shared) interrupt, whether in the ISR context or in the context of an interrupt thread, does so in a lock-free fashion. Thus, the routines that modify the list need to take special steps to ensure that the iterating code has a consistent view of the list. Previously, those routines tried to play nice only with the code running in the ithread context. The iteration in the ISR context was left to a chance. After commit r336635 atomic operations and memory fences are used to ensure that ie_handlers list is always safe to navigate with respect to inserting and removal of list elements. There is still a question of when it is safe to actually free a removed element. The idea of this change is somewhat similar to the idea of the epoch based reclamation. There are some simplifications comparing to the general epoch based reclamation. All writers are serialized using a mutex, so we do not need to worry about concurrent modifications. Also, all read accesses from the open context are serialized too. So, we can get away just two epochs / phases. When a thread removes an element it switches the global phase from the current phase to the other and then drains the previous phase. Only after the draining the removed element gets actually freed. The code that iterates the list in the ISR context takes a snapshot of the global phase and then increments the use count of that phase before iterating the list. The use count (in the same phase) is decremented after the iteration. This should ensure that there should be no iteration over the removed element when its gets freed. This commit also simplifies the coordination with the interrupt thread context. Now we always schedule the interrupt thread when removing one of handlers for its interrupt. This makes the code both simpler and safer as the interrupt thread masks the interrupt thus ensuring that there is no interaction with the ISR context. P.S. This change matters only for shared interrupts and I realize that those are becoming a thing of the past (and quickly). I also understand that the problem that I am trying to solve is extremely rare. PR: 229106 Reviewed by: cem Discussed with: Samy Al Bahra MFC after: 5 weeks Differential Revision: https://reviews.freebsd.org/D15905	2018-08-03 14:27:28 +00:00
Alan Somers	da4465506d	Fix LOCAL_PEERCRED with socketpair(2) Enable the LOCAL_PEERCRED socket option for unix domain stream sockets created with socketpair(2). Previously, it only worked with unix domain stream sockets created with socket(2)/listen(2)/connect(2)/accept(2). PR: 176419 Reported by: Nicholas Wilson <nicholas@nicholaswilson.me.uk> MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D16350	2018-08-03 01:37:00 +00:00
Andriy Gapon	a260971458	fix a typo resulting in a wrong variable in kern_syscall_deregister The difference is between sysent, a global, and sysents, a function parameter.	2018-08-02 09:41:55 +00:00
Mark Johnston	4765717321	Remove a redundant check. MFC after: 3 days Sponsored by: The FreeBSD Foundation	2018-07-30 17:58:41 +00:00
Alan Somers	6040822c4e	Make timespecadd(3) and friends public The timespecadd(3) family of macros were imported from NetBSD back in r35029. However, they were initially guarded by #ifdef _KERNEL. In the meantime, we have grown at least 28 syscalls that use timespecs in some way, leading many programs both inside and outside of the base system to redefine those macros. It's better just to make the definitions public. Our kernel currently defines two-argument versions of timespecadd and timespecsub. NetBSD, OpenBSD, and FreeDesktop.org's libbsd, however, define three-argument versions. Solaris also defines a three-argument version, but only in its kernel. This revision changes our definition to match the common three-argument version. Bump _FreeBSD_version due to the breaking KPI change. Discussed with: cem, jilles, ian, bde Differential Revision: https://reviews.freebsd.org/D14725	2018-07-30 15:46:40 +00:00
Andrew Turner	cd2106eaea	Ensure the DPCPU and VNET module spaces are aligned to hold a pointer. Previously they may have been aligned to a char, leading to misaligned DPCPU and VNET variables. Sponsored by: DARPA, AFRL	2018-07-30 14:25:17 +00:00
David E. O'Brien	455d358977	Correct copyright dates.	2018-07-30 07:01:00 +00:00
Antoine Brodin	ccd6ac9f6e	Add allow.mlock to jail parameters It allows locking or unlocking physical pages in memory within a jail This allows running elasticsearch with "bootstrap.memory_lock" inside a jail Reviewed by: jamie@ Differential Revision: https://reviews.freebsd.org/D16342	2018-07-29 12:41:56 +00:00
Don Lewis	290d906084	Fix the long term ULE load balancer so that it actually works. The initial call to sched_balance() during startup is meant to initialize balance_ticks, but does not actually do that since smp_started is still zero at that time. Since balance_ticks does not get set, there are no further calls to sched_balance(). Fix this by setting balance_ticks in sched_initticks() since we know the value of balance_interval at that time, and eliminate the useless startup call to sched_balance(). We don't need to randomize the intial value of balance_ticks. Since there is now only one call to sched_balance(), we can hoist the tests at the top of this function out to the caller and avoid the overhead of the function call when running a SMP kernel on UP hardware. PR: 223914 Reviewed by: kib MFC after: 2 weeks	2018-07-29 00:30:06 +00:00
David Bright	95c05062ec	Allow a EVFILT_TIMER kevent to be updated. If a timer is updated (re-added) with a different time period (specified in the .data field of the kevent), the new time period has no effect; the timer will not expire until the original time has elapsed. This violates the documented behavior as the kqueue(2) man page says (in part) "Re-adding an existing event will modify the parameters of the original event, and not result in a duplicate entry." This modification, adapted from a patch submitted by cem@ to PR214987, fixes the kqueue system to allow updating a timer entry. The kevent timer behavior is changed to: * When a timer is re-added, update the timer parameters to and re-start the timer using the new parameters. * Allow updating both active and already expired timers. * When the timer has already expired, dequeue any undelivered events and clear the count of expirations. All of these changes address the original PR and also bring the FreeBSD and macOS kevent timer behaviors into agreement. A few other changes were made along the way: * Update the kqueue(2) man page to reflect the new timer behavior. * Fix man page style issues in kqueue(2) diagnosed by igor. * Update the timer libkqueue system test to test for the updated timer behavior. * Fix the (test) libkqueue common.h file so that it includes config.h which defines various HAVE_* feature defines, before the #if tests for such variables in common.h. This enables the use of the actual err(3) family of functions. * Fix the usages of the err(3) functions in the tests for incorrect type of variables. Those were formerly undiagnosed due to the disablement of the err(3) functions (see previous bullet point). PR: 214987 Reported by: Brian Wellington <bwelling@xbill.org> Reviewed by: kib MFC after: 1 week Relnotes: yes Sponsored by: Dell EMC Differential Revision: https://reviews.freebsd.org/D15778	2018-07-27 13:49:17 +00:00
Andriy Gapon	111b043cdf	change interrupt event's list of handlers from TAILQ to CK_SLIST The primary reason for this commit is to separate mechanical and nearly mechanical code changes from an upcoming fix for unsafe teardown of shared interrupt handlers that have only filters (see D15905). The technical rationale is that SLIST is sufficient. The only operation that gets worse performance -- O(n) instead of O(1) is a removal of a handler, but it is not a critical operation and the list is expected to be rather short. Additionally, it is easier to reason about SLIST when considering the concurrent lock-free access to the list from the interrupt context and the interrupt thread. CK_SLIST is used because the upcoming change depends on the memory order provided by CK_SLIST insert and the fact that CL_SLIST remove does not trash the linkage in a removed element. While here, I also fixed a couple of whitespace issues, made code under ifdef notyet compilable, added a lock assertion to ithread_update() and made intr_event_execute_handlers() static as it had no external callers. Reviewed by: cem (earlier version) MFC after: 4 weeks Differential Revision: https://reviews.freebsd.org/D16016	2018-07-23 12:51:23 +00:00
Emmanuel Vadot	c54fe25dcb	Raise the size of L3 table for early devmap on arm64 Some driver (like efifb) needs to map more than the current L2_SIZE Raise the size so we can map the framebuffer setup by the bootloader. Reviewed by: cognet	2018-07-19 21:58:06 +00:00
Mark Johnston	bf923a556d	Delete an XXX comment addressed by r336505. X-MFC with: r336505 Sponsored by: The FreeBSD Foundation	2018-07-19 20:11:08 +00:00
Mark Johnston	483f692ea6	Have preload_delete_name() free pages backing preloaded data. On i386 and amd64, add a vm_phys segment for physical memory used to store the kernel binary and other preloaded data. This makes it possible to free such memory back to the system once it is no longer needed, e.g., when a preloaded kernel module is unloaded. Previously, it would have remained unused. Reviewed by: kib, royger MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D16330	2018-07-19 20:00:28 +00:00
Mark Johnston	73624a804a	Provide the full module path to preload_delete_name(). The basename will never match against the preload metadata, so these calls previously had no effect. Reviewed by: kib, royger MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D16330	2018-07-19 19:50:42 +00:00
Konstantin Belousov	53e20b2702	When reporting an error, print the errno value. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2018-07-19 19:03:18 +00:00
Emmanuel Vadot	326867616f	kern_cpu: When adding abs frequency allow for unordered insertion Keep the list ordered as some code assume that it is but allow for unordered cf_settings sets.	2018-07-19 11:28:14 +00:00
Mark Johnston	9295517ac9	Add a FALLTHROUGH comment to kvprintf(). Submitted by: Sebastian Huber <sebastian.huber@embedded-brains.de> MFC after: 3 days	2018-07-17 14:56:54 +00:00
Mariusz Zaborski	f1fe1e020f	Extend amount of possible coredumps from 10 to 100000 when using index format. The amount of digits in the name of corefile is assigned dynamically. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D16118	2018-07-15 17:10:12 +00:00
Mateusz Guzik	95ab076d6e	lockmgr: tidy up slock/sunlock similar to other locks	2018-07-13 22:40:14 +00:00
Warner Losh	25bc561e68	There's two files in the sys tree named inflate.c, in addition to it being a common name elsewhere. Rename the old kzip one to subr_inflate.c. This actually fixes the build issues on sparc64 that my inclusion of .PATH ${SYSDIR}/kern created in r336244, so also revert the broken workaround I committed in r336249. This slipped passed me because apparently, I never did a clean build.	2018-07-13 17:41:28 +00:00
Warner Losh	52379d36a9	Create helper functions for parsing boot args. boot_parse_arg to parse a single arg boot_parse_cmdline to parse a command line string boot_parse_args to parse all the args in a vector boot_howto_to_env Convert howto bits to env vars boot_env_to_howto Return howto mask mased on what's set in the environment. All these routines return an int that's the bitmask of the args translated to RB_* flags. As a special case, the 'S' flag sets the comconsole_speed env var. Any arg that looks like a=b will set the env key 'a' to value 'b'. If =b is omitted, 'a' is set to '1'. This should help us reduce the number of redundant copies of these routines in the tree. It should also give a more uniform experience between platforms. Also, invent a new flag RB_PROBE that's set when 'P' is parsed. On x86 + BIOS, this means 'probe for the keyboard, and if it's not there set both RB_MULTIPLE and RB_SERIAL (which means show the output on both video and serial consoles, but make serial primary). Others it may be some similar concept of probing, but it's loader dependent what, exactly, it means. These routines are suitable for /boot/loader and/or the kernel, though they may not be suitable for the tightly hand-rolled-for-space environments like boot2. Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D16205	2018-07-13 16:43:05 +00:00
Brooks Davis	d92da75941	Round down the location of execpathp to slightly improve copyout speed. In practice, this moves the padding from below the canary to above execpathp has no impact on stack consumption. Submitted by: Wuyang-Chung (via github pull request #159) MFC after: 1 week	2018-07-13 11:32:27 +00:00
Mateusz Guzik	bcbc8d35eb	fd: stop passing M_ZERO to uma_zalloc The optimisation seen with malloc cannot be used here as zone sizes are now known at compilation. Thus bzero by hand to get the optimisation instead.	2018-07-12 22:48:18 +00:00
Kyle Evans	44314c3509	kern_environment: Give the static environment a chance to disable MD env This variable has been given the name "loader_env.disabled" as it's the primary way most people will have an MD environment. This restores the previously-default behavior of ignoring the loader(8) environment, which may be useful for vendor distributions or other scenarios where inheriting the loader environment may be considered a security issue or potentially breaking of a more locked-down environment. As the change to config(5) indicates, disabling the loader environment should not be a choice made lightly since it may provide ACPI hints and other useful things that the system can rely on to boot. An UPDATING entry has been added to mention an upgrade path for those that may have relied on the previous behavior. Discussed with: bde Relnotes: yes (maybe)	2018-07-12 02:51:50 +00:00
Alan Somers	8a894c1aa1	Don't acquire evclass_lock with a spinlock held When the "pc" audit class is enabled and auditd is running, witness will panic during thread exit because au_event_class tries to lock an rwlock while holding a spinlock acquired upstack by thread_exit. To fix this, move AUDIT_SYSCALL_EXIT futher upstack, before the spinlock is acquired. Of thread_exit's 16 callers, it's only necessary to call AUDIT_SYSCALL_EXIT from two, exit1 (for exiting processes) and kern_thr_exit (for exiting threads). The other callers are all kernel threads, which needen't call AUDIT_SYSCALL_EXIT because since they can't make syscalls there will be nothing to audit. And exit1 already does call AUDIT_SYSCALL_EXIT, making the second call in thread_exit redundant for that case. PR: 228444 Reported by: aniketp Reviewed by: aniketp, kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D16210	2018-07-11 19:38:42 +00:00
Brooks Davis	942ae5c8b8	Regen after r336171.	2018-07-10 14:04:52 +00:00
Brooks Davis	7cc923f8a8	Get rid of netbsd_lchown and netbsd_msync syscall entries. No valid FreeBSD binary very called them (they would call lchown and msync directly) and we haven't supported NetBSD binaries in ages. This is a respin of r335983 with a workaround for the ancient BFD linker in the libc stubs. Reviewed by: kib Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D16193	2018-07-10 13:32:04 +00:00
Brooks Davis	3a20f06a1c	Use uintptr_t alone when assigning to kvaddr_t variables. Suggested by: jhb	2018-07-10 13:03:06 +00:00
Kyle Evans	c7a82b9c6c	kern_environment: bool'itize dynamic_kenv; fix small style(9) nit	2018-07-10 02:43:22 +00:00
Kyle Evans	0f46005e4b	subr_hints: Skip static_env and static_hints if they don't contain hints This is possible because, well, they're static. Both the dynamic environment and the MD-environment (generally loader(8) environment) can potentially have room for new variables to be set, and thus do not receive this treatment.	2018-07-10 00:36:37 +00:00
Kyle Evans	dc4446df5f	subr_hints: Convert some bool-like ints to bools	2018-07-10 00:34:19 +00:00
Kyle Evans	5768da6c21	subr_hints: Use goto/label instead of series of conditionals	2018-07-10 00:33:31 +00:00
Mark Johnston	013072f04c	Fix pre-SI_SUB_CPU initialization of per-CPU counters. r336020 introduced pcpu_page_alloc(), replacing page_alloc() as the backend allocator for PCPU UMA zones. Unlike page_alloc(), it does not honour malloc(9) flags such as M_ZERO or M_NODUMP, so fix that. r336020 also changed counter(9) to initialize each counter using a CPU_FOREACH() loop instead of an SMP rendezvous. Before SI_SUB_CPU, smp_rendezvous() will only execute the callback on the current CPU (i.e., CPU 0), so only one counter gets zeroed. The rest are zeroed by virtue of the fact that UMA gratuitously zeroes slabs when importing them into a zone. Prior to SI_SUB_CPU, all_cpus is clear, so with r336020 we weren't zeroing vm_cnt counters during boot: the CPU_FOREACH() loop had no effect, and pcpu_page_alloc() didn't honour M_ZERO. Fix this by iterating over the full range of CPU IDs when zeroing counters, ignoring whether the corresponding bits in all_cpus are set. Reported and tested by: pho (previous version) Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D16190	2018-07-10 00:18:12 +00:00
Jamie Gritton	0a1724045e	Change prison_add_vfs() to the more generic prison_add_allow(), which can add any dynamic allow.* or allow.. parameter. Also keep prison_add_vfs() as a wrapper. Differential Revision: D16146	2018-07-06 18:50:22 +00:00
Kyle Evans	cae22dd904	kern_environment: Fix SYSINIT ordering The dynamic environment was being initialized at SI_SUB_KMEM, SI_ORDER_ANY. I added the hint-merging at SI_SUB_KMEM, SI_ORDER_ANY as well in r335998 - this can only work by coincidence. Re-do both to operate at SI_SUB_KMEM + 1, SI_ORDER_FIRST and SI_ORDER_SECOND respectively to be safe. It's sufficiently obfuscated away as to when in SU_SUB_KMEM malloc will be available, and the dynamic environment cannot be relied upon there anyways since it's initialized at SI_ORDER_ANY. Reported by: bde Discussed with: bde X-MFC-With: r335998	2018-07-06 16:51:35 +00:00
Brooks Davis	7524b4c14b	Correct breakage on 32-bit platforms from r335979.	2018-07-06 10:03:33 +00:00
Matt Macy	822e50e3f6	epoch(9): simplify initialization replace manual NUMA aware allocation with a pcpu zone	2018-07-06 06:20:03 +00:00
Matt Macy	ab3059a8e7	Back pcpu zone with domain correct pages - Change pcpu zone consumers to use a stride size of PAGE_SIZE. (defined as UMA_PCPU_ALLOC_SIZE to make future identification easier) - Allocate page from the correct domain for a given cpu. - Don't initialize pc_domain to non-zero value if NUMA is not defined There are some misconceptions surrounding this field. It is the _VM_ NUMA domain and should only ever correspond to valid domain values as understood by the VM. The former slab size of sizeof(struct pcpu) was somewhat arbitrary. The new value is PAGE_SIZE because that's the smallest granularity which the VM can allocate a slab for a given domain. If you have fewer than PAGE_SIZE/8 counters on your system there will be some memory wasted, but this is obviously something where you want the cache line to be coming from the correct domain. Reviewed by: jeff Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D15933	2018-07-06 02:06:03 +00:00
Andrew Turner	2bf9501287	Create a new macro for static DPCPU data. On arm64 (and possible other architectures) we are unable to use static DPCPU data in kernel modules. This is because the compiler will generate PC-relative accesses, however the runtime-linker expects to be able to relocate these. In preparation to fix this create two macros depending on if the data is global or static. Reviewed by: bz, emaste, markj Sponsored by: ABT Systems Ltd Differential Revision: https://reviews.freebsd.org/D16140	2018-07-05 17:13:37 +00:00
Bjoern A. Zeeb	1534cd19b5	Split up deadlkres() to make it more readable in anticipation of further changes adding another level of indentation. Some of the logic got simplified with the break out functions. There should be no functional changes. Reviewed by: kib Sponsored by: iXsystems, Inc. Differential Revision: https://reviews.freebsd.org/D15914	2018-07-05 17:06:54 +00:00
Kyle Evans	39d44f7f15	kern_environment: use any provided environments, evict hintmode/envmode At the moment, hintmode and envmode are used to indicate whether static hints or static env have been provided in the kernel config(5) and the static versions are mutually exclusive with loader(8)-provided environment. hintmode can be reconfigured later to pull from the dynamic environment, thus taking advantage of the loader(8) or post-kmem environment setting. This changeset fixes both problems at once to move us from a semi-confusing state to a consistent state: if an environment file, hints file, or loader(8) environment are provided, we use them in a well-known order of precedence: - loader(8) environment - static environment - static hints file Once the dynamic environment is setup this becomes a moot point. The loader(8) and static environments are merged (respecting the above order of precedence), and the static hints are merged in on an as-needed basis after the dynamic environment has been setup. Hints lookup are changed to respect all of the above. Before the dynamic environment is setup, lookups use the above-mentioned order and fallback to the next environment if a matching hint is not found. Once the dynamic environment is setup, that is used on its own since it captures all of the above information plus any dynamic kenv settings that came up later in boot. The following tangentially related changes were made to res_find: - A hintp cookie is now passed in so that related searches continue using the chain of environments (or dynamic environment) without relying on global state - All three environments will be searched if they actually have valid hints to use, rather than just choosing the first environment that actually had a hint and rolling with that only The hintmode sysctl has been ripped out. static_{env,hints}.disabled are still honored and will disable their respective environments from being used for hint lookups and from being merged into the dynamic environment, as expected. MFC after: 1 month (maybe) Differential Revision: https://reviews.freebsd.org/D15953	2018-07-05 16:30:32 +00:00
Kyle Evans	e28687347f	Revert r335995 due to accidental changes snuck in	2018-07-05 16:28:43 +00:00
Kyle Evans	8ef5886303	kern_environment: use any provided environments, evict hintmode/envmode At the moment, hintmode and envmode are used to indicate whether static hints or static env have been provided in the kernel config(5) and the static versions are mutually exclusive with loader(8)-provided environment. hintmode can be reconfigured later to pull from the dynamic environment, thus taking advantage of the loader(8) or post-kmem environment setting. This changeset fixes both problems at once to move us from a semi-confusing state to a consistent state: if an environment file, hints file, or loader(8) environment are provided, we use them in a well-known order of precedence: - loader(8) environment - static environment - static hints file Once the dynamic environment is setup this becomes a moot point. The loader(8) and static environments are merged (respecting the above order of precedence), and the static hints are merged in on an as-needed basis after the dynamic environment has been setup. Hints lookup are changed to respect all of the above. Before the dynamic environment is setup, lookups use the above-mentioned order and fallback to the next environment if a matching hint is not found. Once the dynamic environment is setup, that is used on its own since it captures all of the above information plus any dynamic kenv settings that came up later in boot. The following tangentially related changes were made to res_find: - A hintp cookie is now passed in so that related searches continue using the chain of environments (or dynamic environment) without relying on global state - All three environments will be searched if they actually have valid hints to use, rather than just choosing the first environment that actually had a hint and rolling with that only The hintmode sysctl has been ripped out. static_{env,hints}.disabled are still honored and will disable their respective environments from being used for hint lookups and from being merged into the dynamic environment, as expected. MFC after: 1 month (maybe) Differential Revision: https://reviews.freebsd.org/D15953	2018-07-05 16:25:48 +00:00
Bjoern A. Zeeb	0fb9f29bae	With the introduction of reapers and reaplists in r275800, proc0 and init are setup as a circular dependency. create_init() calls fork1() which calls do_fork(). There the newproc (initproc) is setup with a reaper of proc0 who's reaper points to itself. The newproc (initproc) is then put on its reaper's (proc0) p_reaplist (initproc is a descendants of proc0 for proc0 to reap). Upon return to create_init(), proc0 is added to initproc's p_reaplist (which would mean proc0 is a descendant of init, for init to reap). This creates a circular dependency which eventually leads to LIST corruptions when trying to kill init and a proc0. For the base system we never really hit this case during reboot. The problem only became visible after adding more virtual process spaces which could go away cleanly (work existing in an experimental branch). Reviewed by: kib Sponsored by: iXsystems, Inc. Differential Revision: https://reviews.freebsd.org/D15924	2018-07-05 16:16:28 +00:00
Brooks Davis	714c03c81e	Revert r335983. The bfd linker in tree doesn't support multiple names for the same symbol (at least with current flags).	2018-07-05 16:03:03 +00:00
Brooks Davis	5b04a71dae	Get rid of netbsd_lchown and netbsd_msync syscall entries. No valid FreeBSD binary ever called them (they would call lchown and msync directly) and we haven't supported NetBSD binaries in ages. Reviewed by: kib Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D15814	2018-07-05 14:12:56 +00:00
Konstantin Belousov	dbadb01591	Silence warnings about unused variables when RACCT is defined but RCTL is not. Reported by: Dries Michiels <driesm.michiels@gmail.com> Sponsored by: The FreeBSD Foundation MFC after: 3 days	2018-07-05 13:37:31 +00:00
Brooks Davis	f38b68ae8a	Make struct xinpcb and friends word-size independent. Replace size_t members with ksize_t (uint64_t) and pointer members (never used as pointers in userspace, but instead as unique idenitifiers) with kvaddr_t (uint64_t). This makes the structs identical between 32-bit and 64-bit ABIs. On 64-bit bit systems, the ABI is maintained. On 32-bit systems, this is an ABI breaking change. The ABI of most of these structs was previously broken in r315662. This also imposes a small API change on userspace consumers who must handle kernel pointers becoming virtual addresses. PR: 228301 (exp-run by antoine) Reviewed by: jtl, kib, rwatson (various versions) Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D15386	2018-07-05 13:13:48 +00:00
Matt Macy	10b8cd7f55	epoch(9): make nesting assert in epoch_wait_preempt more specific Reported by: markj	2018-07-04 21:34:08 +00:00
Mariusz Zaborski	6cad1a5d14	Add description to debug.ncores sysctl. Reviewed by: bcr Differential Revision: https://reviews.freebsd.org/D16123	2018-07-04 17:06:51 +00:00

... 3 4 5 6 7 ...

16605 Commits