freebsd-dev

Author	SHA1	Message	Date
Gordon Tetlow	89f652d149	Clear stack allocated data structure to prevent kernel memory leak. Reported by: Thomas Barabosch, Fraunhofer FKIE Reviewed by: wes@ Approved by: re (implicit) Approved by: so Security: FreeBSD-EN-18:12.mem Security: CVE-2018-17155	2018-09-27 18:39:54 +00:00
Mateusz Guzik	9afff6b1c0	Eliminate false sharing in malloc due to statistic collection Currently stats are collected in a MAXCPU-sized array which is not aligned and suffers enormous false-sharing. Fix the problem by utilizing per-cpu allocation. The counter(9) API is not used here as it is too incomplete and does not provide a win over per-cpu zone sized for malloc stats struct. In particular stats are being reported for each cpu separately by just copying what is supposed to be an array element for given cpu. This eliminates significant false-sharing during malloc-heavy tests e.g. on Skylake. See the review for details. Reviewed by: markj Approved by: re (kib) Differential Revision: https://reviews.freebsd.org/D17289	2018-09-23 19:00:06 +00:00
Mateusz Guzik	d6fda03a64	select: stop doing zero-sized memsets Approved by: re (kib)	2018-09-21 13:20:41 +00:00
Mark Johnston	969e147aff	Ensure that imports into per-domain kmem arenas are KVA_QUANTUM-aligned. The old code appears to assume that vmem_alloc() would import size-aligned KVA chunks from the parent kernel_arena, but vmem doesn't provide this guarantee. Also remove the unused global RWX arena and add comments explaining why we have per-domain arenas. Reported by: alc Reviewed by: alc, kib (previous version) Approved by: re (gjb) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17249	2018-09-20 18:29:55 +00:00
Mateusz Guzik	c396945b74	vfs: remove lookup_shared tunable Reviewed by: kib, jhb Approved by: re (gjb) Differential Revision: https://reviews.freebsd.org/D17253	2018-09-20 18:25:26 +00:00
Mateusz Guzik	51e13c93b6	fd: prevent inlining of _fdrop throughout kern_descrip.c fdrop is used in several places in the file and almost never has to call _fdrop. Thus inlining it is a pure waste of space. Approved by: re (kib)	2018-09-20 13:32:40 +00:00
Konstantin Belousov	bf94d6c78b	Fix state of dquot-less vnodes after failed quotaoff. UFS quotaoff iterates over all mp vnodes, and dereferences and clears the pointers to corresponding dquots. If SU work items transiently reference some of dquots, quotaoff() would eventually fail, but all processed vnodes are already stripped from dquots. The state is problematic, since quotas are left enabled, but there are no dquots where blocks and inodes can be accounted. The result is assertion failures and NULL pointer dereferences. Fix it by suspending writes around quotaoff() call. Since the filesystem is synced, no dangling references to dquots from SU workitems can be left behind, which means that quotaoff succeeds. The complication there is that quotaoff VFS op is performed with the mount point busied, while to suspend, we need to start write on the mp. If vn_start_write() is called on busied mp, system might deadlock against parallel unmount request. Handle this by unbusy-ing mp before starting write, which in turn requires changing the quotaoff() interface to return with the mount point not busied, same as was done for quotaon(). Reviewed by: mckusick Reported and tested by: pho Sponsored by: The FreeBSD Foundation Approved by: re (gjb) MFC after: 1 week Differential revision: https://reviews.freebsd.org/D17208	2018-09-19 14:36:57 +00:00
Gordon Tetlow	c9e562b188	Correct ELF header parsing code to prevent invalid ELF sections from disclosing memory. Submitted by: markj Reported by: Thomas Barabosch, Fraunhofer FKIE Approved by: re (implicit) Approved by: so Security: FreeBSD-SA-18:12.elf Security: CVE-2018-6924 Sponsored by: The FreeBSD Foundation	2018-09-12 04:57:34 +00:00
Mark Johnston	cc4f3d0ae2	Rename hardclock_cnt() to hardclock() and remove the old implementation. Also remove some related and unused subroutines. They have long been replaced by variants that handle multiple coalesced events with a single call. No functional change intended. Reviewed by: cem, kib Approved by: re (gjb) Differential Revision: https://reviews.freebsd.org/D17029	2018-09-06 02:10:59 +00:00
Mark Johnston	ce9eea6425	Correct the condition under which we allocate a terminator node. We will have last_block < blocks if the block count is divisible by BLIST_BMAP_RADIX, but a terminator node is still needed if the tree isn't balanced. In this case we were overrunning the blist array by 16 bytes during initialization. While here, add a check for the invalid blocks == 0 case. PR: 231116 Reviewed by: alc, kib (previous version), Doug Moore <dougm@rice.edu> Approved by: re (gjb) MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17020	2018-09-05 19:05:30 +00:00
Konstantin Belousov	1565fb29a7	Add amd64 mdthread fields needed for the upcoming EFI RT exception handling. This is split into a separate commit from the main change to make it easier to handle possible revert after upcoming KBI freeze. Reviewed by: kevans Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (rgrimes) Differential revision: https://reviews.freebsd.org/D16972	2018-09-02 21:16:43 +00:00
Konstantin Belousov	420382368a	Improve error messages from clock_if.m method failures. Print error message in verbose mode when CLOCK_SETTIME() clock_if.m method failed. For EFIRT RTC clock, add error code for the failure of CLOCK_GETTIME() report. Reviewed by: kevans Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (rgrimes) Differential revision: https://reviews.freebsd.org/D16972	2018-09-02 20:17:51 +00:00
Mark Murray	19fa89e938	Remove the Yarrow PRNG algorithm option in accordance with due notice given in random(4). This includes updating of the relevant man pages, and no-longer-used harvesting parameters. Ensure that the pseudo-unit-test still does something useful, now also with the "other" algorithm instead of Yarrow. PR: 230870 Reviewed by: cem Approved by: so(delphij,gtetlow) Approved by: re(marius) Differential Revision: https://reviews.freebsd.org/D16898	2018-08-26 12:51:46 +00:00
Alan Cox	49bfa624ac	Eliminate the arena parameter to kmem_free(). Implicitly this corrects an error in the function hypercall_memfree(), where the wrong arena was being passed to kmem_free(). Introduce a per-page flag, VPO_KMEM_EXEC, to mark physical pages that are mapped in kmem with execute permissions. Use this flag to determine which arena the kmem virtual addresses are returned to. Eliminate UMA_SLAB_KRWX. The introduction of VPO_KMEM_EXEC makes it redundant. Update the nearby comment for UMA_SLAB_KERNEL. Reviewed by: kib, markj Discussed with: jeff Approved by: re (marius) Differential Revision: https://reviews.freebsd.org/D16845	2018-08-25 19:38:08 +00:00
Warner Losh	d36967bd2b	Add a new device flag: DF_ATTACHED_ONCE This flag is set once the device has been successfully attached. When set, it inhibits devmatch from trying to match the device. This in turn allows kldunload to work as expected. Prior to the change, the driver would immediately reload because devmatch had no notion that the driver had once been attached, and therefore shouldn't participate in further matching. Differential Revision: https://reviews.freebsd.org/D16735	2018-08-23 05:06:16 +00:00
Warner Losh	5fa2979791	Create devctl freeze/thaw. This adds it to devctl, libdevctl, defines the two IOCTLs and implements the kernel bits. A freeze causes any new drivers that are added via kldload to be deferred until a 'thaw' comes in. These do not stack: it is an error to freeze while frozen, or thaw while thawed. Differential Revision: https://reviews.freebsd.org/D16735	2018-08-23 05:05:47 +00:00
Conrad Meyer	b6c7d9c345	devstat(9): Constify function parameters that can be const No functional change. When attempting to document the changed argument types in devstat.9, I discovered the 20 year old manual page severely mismatched reality even prior to my simple change. So I took a first cut pass cleaning that up to match reality. I'm sure I've missed some things; the goal was just to leave it better than when I started. Sponsored by: Dell EMC Isilon	2018-08-23 01:42:45 +00:00
Conrad Meyer	4ca8c1efe4	KASSERT: Make runtime optionality optional Add an option, KASSERT_PANIC_OPTIONAL, that allows runtime KASSERT() behavior changes. When this option is not enabled, code that allows KASSERTs to become optional is not enabled, and all violated assertions cause termination. The runtime KASSERT behavior was added in r243980. One important distinction here is that panic has __dead2 ("__attribute__((noreturn))"), while kassert_panic does not. Static analyzers like Coverity understand __dead2. Without it, KASSERTs go misunderstood, resulting in many false positives that result from violation of program invariants. Reviewed by: jhb, jtl, np, vangyzen Relnotes: yes Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D16835	2018-08-22 22:19:42 +00:00
Mark Johnston	36716fe2e6	Prepare the kernel linker to handle PC-relative ifunc relocations. The boot-time ifunc resolver assumes that it only needs to apply IRELATIVE relocations to PLT entries. With an upcoming optimization, this assumption no longer holds, so add the support required to handle PC-relative relocations targeting GNU_IFUNC symbols. - Provide a custom symbol lookup routine that can be used in early boot. The default lookup routine uses kobj, which is not functional at that point. - Apply all existing relocations during boot rather than filtering IRELATIVE relocations. - Ensure that we continue to apply ifunc relocations in a second pass when loading a kernel module. Reviewed by: kib MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D16749	2018-08-22 20:44:30 +00:00
Michael Tuexen	6b01d4d433	Add SOL_SOCKET level socket option with name SO_DOMAIN to get the domain of a socket. This is helpful when testing and Solaris and Linux have the same socket option using the same name. Reviewed by: bcr@, rrs@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16791	2018-08-21 14:04:30 +00:00
Alan Cox	44d0efb215	Eliminate kmem_alloc_contig()'s unused arena parameter. Reviewed by: hselasky, kib, markj Discussed with: jeff Differential Revision: https://reviews.freebsd.org/D16799	2018-08-20 15:57:27 +00:00
Kyle Evans	d529de874b	res_find: Fix fallback logic The fallback logic was broken if hints were found in multiple environments. If we found a hint in either the loader environment or the static environment, fallback would be incremented excessively when we returned to the environment-selection bits. These checks should have also been guarded by the fbacklvl checks. As a result, fbacklvl could quickly get to a point where we skip either the static environment and/or the static hints depending on which environments contained valid hints. The impact of this bug is minimal, mostly affecting mips boards that use static hints and may have hints in either the loader environment or the static environment. There may be better ways to express the searchable environments and describing their characteristics (immutable, already searched, etc.) but this may be revisited after 12 branches. Reported by: Dan Nelson <dnelson_1901@yahoo.com> Triaged by: Dan Nelson <dnelson_1901@yahoo.com> MFC after: 3 days	2018-08-18 19:45:56 +00:00
Xin LI	ed1fa01ac4	Regen after r337998.	2018-08-18 06:33:51 +00:00
Xin LI	0362ec1e8e	getrandom(2) should not be restricted in capability mode.	2018-08-18 06:31:49 +00:00
Mark Johnston	1436ff1ebb	Typo. X-MFC with: r337974	2018-08-17 16:07:06 +00:00
Mark Johnston	3ccbdc8254	Add INVARIANTS-only fences around lockless vnode refcount updates. Some internal KASSERTs access the v_iflag field without the vnode interlock held after such a refcount update. The fences are needed for the assertions to be correct in the face of store reordering. Reported and tested by: jhibbits Reviewed by: kib, mjg MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D16756	2018-08-17 15:41:01 +00:00
Mariusz Zaborski	2fe6aefff8	capsicum: allow the setproctitle(3) function in capability mode In the past, Capsicum allowed changing the process title. This was broken with r335939. PR: 230584 Submitted by: Yuichiro NAITO <naito.yuichiro@gmail.com> Reported by: ian@niw.com.au MFC after: 1 week	2018-08-17 14:35:10 +00:00
Kyle Evans	45625675e7	subr_prf: Don't write kern.boot_tag if it's empty This change allows one to set kern.boot_tag="" and not get a blank line preceding other boot messages. While this isn't super critical- blank lines are easy to filter out both mentally and in processing dmesg later- it allows for a mode of operation that matches previous behavior. I intend to MFC this whole series to stable/11 by the end of the month with boot_tag empty by default to make this effectively a nop in the stable branch.	2018-08-17 03:42:57 +00:00
Jamie Gritton	c542c43ef1	Revert r337922, except for some documentation-only bits. This needs to wait until user is changed to stop using jail(2). Differential Revision: D14791	2018-08-16 19:09:43 +00:00
Jamie Gritton	284001a222	Put jail(2) under COMPAT_FREEBSD11. It has been the "old" way of creating jails since FreeBSD 7. Along with the system call, put the various security.jail.allow_foo and security.jail.foo_allowed sysctls partly under COMPAT_FREEBSD11 (or BURN_BRIDGES). These sysctls had two disparate uses: on the system side, they were global permissions for jails created via jail(2) which lacked fine-grained permission controls; inside a jail, they're read-only descriptions of what the current jail is allowed to do. The first use is obsolete along with jail(2), but keep them for the second-read-only use. Differential Revision: D14791	2018-08-16 18:40:16 +00:00
Edward Tomasz Napierala	e77b6cfe34	In the help message at the mountroot prompt, suggest something that actually works and matches the bsdinstall(8) default. MFC after: 2 weeks Sponsored by: DARPA, AFRL	2018-08-15 12:12:21 +00:00
Alan Cox	c65ed2ff53	Eliminate a redundant assignment. MFC after: 1 week	2018-08-11 19:21:53 +00:00
Kyle Evans	0915d9d070	subr_prf: remove think-o that had returned to local patch Reported by: cognet	2018-08-10 15:35:02 +00:00
Kyle Evans	170bc29131	boot tagging: minor fixes msgbufinit may be called multiple times as we initialize the msgbuf into a progressively larger buffer. This doesn't happen as of now on head, but it may happen in the future and we generally support this. As such, only print the boot tag if we've just initialized the buffer for the first time. The boot tag also now has a newline appended to it for better visibility, and has been switched to a normal printf, by request of bde, after we've denoted that the msgbuf is mapped.	2018-08-10 15:29:06 +00:00
Kyle Evans	240fcda1e8	subr_prf: style(9) the sizeof Reported by: jkim, ian	2018-08-09 19:09:06 +00:00
Kyle Evans	4c793b68da	subr_prf: Use "sizeof current_boot_tag" instead	2018-08-09 17:53:18 +00:00
Kyle Evans	2a4650cc11	BOOT_TAG: Make a config(5) option, expose as sysctl and loader tunable BOOT_TAG lived shortly in sys/msgbuf.h, but this wasn't necessarily great for changing it or removing it. Move it into subr_prf.c and add options for it to opt_printf.h. One can specify both the BOOT_TAG and BOOT_TAG_SZ (really, size of the buffer that holds the BOOT_TAG). We expose it as kern.boot_tag and also add a loader tunable by the same name that we'll fetch upon initialization of the msgbuf. This allows for flexibility and also ensures that there's a consistent way to figure out the boot tag of the running kernel, rather than relying on headers to be in-sync. Prodded super-super-lightly by: imp	2018-08-09 17:47:47 +00:00
Kyle Evans	21aa6e8345	msgbuf: Light detailing (const'ify and bool'itize)	2018-08-09 17:42:27 +00:00
Leandro Lupori	c8e2123b6a	[ppc] Fix kernel panic when using BOOTP_NFSROOT On PowerPC (and possibly other architectures) that don't use EARLY_AP_STARTUP, the config task queue may be used uninitialized. This was observed while trying to mount the root fs from NFS, as reported here: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=230168. This patch has 2 main changes: 1- Perform a basic initialization of qgroup_config, similar to what is done in taskqgroup_adjust, but simpler. This makes qgroup_config ready to be used during NFS root mount. 2- When EARLY_AP_STARTUP is not used, call inm_init() and in6m_init() right before SI_SUB_ROOT_CONF, because bootp needs to send multicast packets to request an IP. PR: Bug 230168 Reported by: sbruno Reviewed by: jhibbits, mmacy, sbruno Approved by: jhibbits Differential Revision: D16633	2018-08-09 14:04:51 +00:00
Matt Macy	9fec45d8e5	epoch_block_wait: don't check TD_RUNNING struct epoch_thread is not type safe (stack allocated) and thus cannot be dereferenced from another CPU Reported by: novel@	2018-08-09 05:18:27 +00:00
Kyle Evans	2834d61202	kern: Add a BOOT_TAG marker at the beginning of boot dmesg From the "newly licensed to drive" PR department, add a BOOT_TAG marker (by default, --<<BOOT>>--) to the beginning of each boot's dmesg. This makes it easier to do textproc magic to locate the start of each boot and, of particular interest to some, the dmesg of the current boot. The PR has a dmesg(8) component as well that I've opted not to include for the moment- it was the more contentious part of this PR. bde@ also made the statement that this boot tag should be written with an ordinary printf, which I've- for the moment- declined to change about this patch to keep it more transparent to observers of the boot process. PR: 43434 Submitted by: dak <aurelien.nephtali@wanadoo.fr> (basically rewritten) MFC after: maybe never	2018-08-09 01:32:09 +00:00
Konstantin Belousov	8f94195022	Followup to r337430: only call elf_reloc_ifunc on x86. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-08-07 20:43:50 +00:00
Konstantin Belousov	289ead7cb0	Add missed handling of local relocs against ifunc target in the obj modules. Reported and tested by: wulf Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-08-07 18:26:46 +00:00
Mark Johnston	c7902fbeae	Improve handling of control message truncation. If a recvmsg(2) or recvmmsg(2) caller doesn't provide sufficient space for all control messages, the kernel sets MSG_CTRUNC in the message flags to indicate truncation of the control messages. In the case of SCM_RIGHTS messages, however, we were failing to dispose of the rights that had already been externalized into the recipient's file descriptor table. Add a new function and mbuf type to handle this cleanup task, and use it any time we fail to copy control messages out to the recipient. To simplify cleanup, control message truncation is now only performed at control message boundaries. The change also fixes a few related bugs: - Rights could be leaked to the recipient process if an error occurred while copying out a message's contents. - We failed to set MSG_CTRUNC if the truncation occurred on a control message boundary, e.g., if the caller received two control messages and provided only the exact amount of buffer space needed for the first. PR: 131876 Reviewed by: ed (previous version) MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D16561	2018-08-07 16:36:48 +00:00
Konstantin Belousov	a70e9a1388	Swap in WKILLED processes. Swapped-out process that is WKILLED must be swapped in as soon as possible. The reason is that such process can be killed by OOM and its pages can be only freed if the process exits. To exit, the kernel stack of the process must be mapped. When allocating pages for the stack of the WKILLED process on swap in, use VM_ALLOC_SYSTEM requests to increase the chance of the allocation to succeed. Add counter of the swapped out processes to avoid unneeded iteration over the allprocs list when there is no work to do, reducing the allproc_lock ownership. Reviewed by: alc, markj (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D16489	2018-08-04 20:45:43 +00:00
Mark Johnston	5b0480f2cc	Don't check rcv sockbuf limits when sending on a unix stream socket. sosend_generic() performs an initial comparison of the amount of data (including control messages) to be transmitted with the send buffer size. When transmitting on a unix socket, we then compare the amount of data being sent with the amount of space in the receive buffer size; if insufficient space is available, sbappendcontrol() returns an error and the data is lost. This is easily triggered by sending control messages together with an amount of data roughly equal to the send buffer size, since the control message size may change in uipc_send() as file descriptors are internalized. Fix the problem by removing the space check in sbappendcontrol(), whose only consumer is the unix sockets code. The stream sockets code uses the SB_STOP mechanism to ensure that senders will block if the receive buffer fills up. PR: 181741 MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D16515	2018-08-04 20:26:54 +00:00
Mark Johnston	e62ca80bde	Style.	2018-08-04 20:16:36 +00:00
Andriy Gapon	e0fa977ea5	safer wait-free iteration of shared interrupt handlers The code that iterates a list of interrupt handlers for a (shared) interrupt, whether in the ISR context or in the context of an interrupt thread, does so in a lock-free fashion. Thus, the routines that modify the list need to take special steps to ensure that the iterating code has a consistent view of the list. Previously, those routines tried to play nice only with the code running in the ithread context. The iteration in the ISR context was left to chance. After commit r336635 atomic operations and memory fences are used to ensure that ie_handlers list is always safe to navigate with respect to insertion and removal of list elements. There is still a question of when it is safe to actually free a removed element. The idea of this change is somewhat similar to the idea of the epoch based reclamation. There are some simplifications compared to the general epoch based reclamation. All writers are serialized using a mutex, so we do not need to worry about concurrent modifications. Also, all read accesses from the open context are serialized too. So, we can get away with just two epochs / phases. When a thread removes an element it switches the global phase from the current phase to the other and then drains the previous phase. Only after the draining does the removed element actually get freed. The code that iterates the list in the ISR context takes a snapshot of the global phase and then increments the use count of that phase before iterating the list. The use count (in the same phase) is decremented after the iteration. This should ensure that there is no iteration over the removed element when it gets freed. This commit also simplifies the coordination with the interrupt thread context. Now we always schedule the interrupt thread when removing one of handlers for its interrupt. This makes the code both simpler and safer as the interrupt thread masks the interrupt thus ensuring that there is no interaction with the ISR context. P.S. This change matters only for shared interrupts and I realize that those are becoming a thing of the past (and quickly). I also understand that the problem that I am trying to solve is extremely rare. PR: 229106 Reviewed by: cem Discussed with: Samy Al Bahra MFC after: 5 weeks Differential Revision: https://reviews.freebsd.org/D15905	2018-08-03 14:27:28 +00:00
Alan Somers	da4465506d	Fix LOCAL_PEERCRED with socketpair(2) Enable the LOCAL_PEERCRED socket option for unix domain stream sockets created with socketpair(2). Previously, it only worked with unix domain stream sockets created with socket(2)/listen(2)/connect(2)/accept(2). PR: 176419 Reported by: Nicholas Wilson <nicholas@nicholaswilson.me.uk> MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D16350	2018-08-03 01:37:00 +00:00
Andriy Gapon	a260971458	fix a typo resulting in a wrong variable in kern_syscall_deregister The difference is between sysent, a global, and sysents, a function parameter.	2018-08-02 09:41:55 +00:00
Mark Johnston	4765717321	Remove a redundant check. MFC after: 3 days Sponsored by: The FreeBSD Foundation	2018-07-30 17:58:41 +00:00
Alan Somers	6040822c4e	Make timespecadd(3) and friends public The timespecadd(3) family of macros were imported from NetBSD back in r35029. However, they were initially guarded by #ifdef _KERNEL. In the meantime, we have grown at least 28 syscalls that use timespecs in some way, leading many programs both inside and outside of the base system to redefine those macros. It's better just to make the definitions public. Our kernel currently defines two-argument versions of timespecadd and timespecsub. NetBSD, OpenBSD, and FreeDesktop.org's libbsd, however, define three-argument versions. Solaris also defines a three-argument version, but only in its kernel. This revision changes our definition to match the common three-argument version. Bump _FreeBSD_version due to the breaking KPI change. Discussed with: cem, jilles, ian, bde Differential Revision: https://reviews.freebsd.org/D14725	2018-07-30 15:46:40 +00:00
Andrew Turner	cd2106eaea	Ensure the DPCPU and VNET module spaces are aligned to hold a pointer. Previously they may have been aligned to a char, leading to misaligned DPCPU and VNET variables. Sponsored by: DARPA, AFRL	2018-07-30 14:25:17 +00:00
David E. O'Brien	455d358977	Correct copyright dates.	2018-07-30 07:01:00 +00:00
Antoine Brodin	ccd6ac9f6e	Add allow.mlock to jail parameters It allows locking or unlocking physical pages in memory within a jail This allows running elasticsearch with "bootstrap.memory_lock" inside a jail Reviewed by: jamie@ Differential Revision: https://reviews.freebsd.org/D16342	2018-07-29 12:41:56 +00:00
Don Lewis	290d906084	Fix the long term ULE load balancer so that it actually works. The initial call to sched_balance() during startup is meant to initialize balance_ticks, but does not actually do that since smp_started is still zero at that time. Since balance_ticks does not get set, there are no further calls to sched_balance(). Fix this by setting balance_ticks in sched_initticks() since we know the value of balance_interval at that time, and eliminate the useless startup call to sched_balance(). We don't need to randomize the initial value of balance_ticks. Since there is now only one call to sched_balance(), we can hoist the tests at the top of this function out to the caller and avoid the overhead of the function call when running a SMP kernel on UP hardware. PR: 223914 Reviewed by: kib MFC after: 2 weeks	2018-07-29 00:30:06 +00:00
David Bright	95c05062ec	Allow a EVFILT_TIMER kevent to be updated. If a timer is updated (re-added) with a different time period (specified in the .data field of the kevent), the new time period has no effect; the timer will not expire until the original time has elapsed. This violates the documented behavior as the kqueue(2) man page says (in part) "Re-adding an existing event will modify the parameters of the original event, and not result in a duplicate entry." This modification, adapted from a patch submitted by cem@ to PR214987, fixes the kqueue system to allow updating a timer entry. The kevent timer behavior is changed to: * When a timer is re-added, update the timer parameters and re-start the timer using the new parameters. * Allow updating both active and already expired timers. * When the timer has already expired, dequeue any undelivered events and clear the count of expirations. All of these changes address the original PR and also bring the FreeBSD and macOS kevent timer behaviors into agreement. A few other changes were made along the way: * Update the kqueue(2) man page to reflect the new timer behavior. * Fix man page style issues in kqueue(2) diagnosed by igor. * Update the timer libkqueue system test to test for the updated timer behavior. * Fix the (test) libkqueue common.h file so that it includes config.h which defines various HAVE_* feature defines, before the #if tests for such variables in common.h. This enables the use of the actual err(3) family of functions. * Fix the usages of the err(3) functions in the tests for incorrect type of variables. Those were formerly undiagnosed due to the disablement of the err(3) functions (see previous bullet point). PR: 214987 Reported by: Brian Wellington <bwelling@xbill.org> Reviewed by: kib MFC after: 1 week Relnotes: yes Sponsored by: Dell EMC Differential Revision: https://reviews.freebsd.org/D15778	2018-07-27 13:49:17 +00:00
Andriy Gapon	111b043cdf	change interrupt event's list of handlers from TAILQ to CK_SLIST The primary reason for this commit is to separate mechanical and nearly mechanical code changes from an upcoming fix for unsafe teardown of shared interrupt handlers that have only filters (see D15905). The technical rationale is that SLIST is sufficient. The only operation that gets worse performance -- O(n) instead of O(1) -- is a removal of a handler, but it is not a critical operation and the list is expected to be rather short. Additionally, it is easier to reason about SLIST when considering the concurrent lock-free access to the list from the interrupt context and the interrupt thread. CK_SLIST is used because the upcoming change depends on the memory order provided by CK_SLIST insert and the fact that CK_SLIST remove does not trash the linkage in a removed element. While here, I also fixed a couple of whitespace issues, made code under ifdef notyet compilable, added a lock assertion to ithread_update() and made intr_event_execute_handlers() static as it had no external callers. Reviewed by: cem (earlier version) MFC after: 4 weeks Differential Revision: https://reviews.freebsd.org/D16016	2018-07-23 12:51:23 +00:00
Emmanuel Vadot	c54fe25dcb	Raise the size of L3 table for early devmap on arm64 Some drivers (like efifb) need to map more than the current L2_SIZE. Raise the size so we can map the framebuffer set up by the bootloader. Reviewed by: cognet	2018-07-19 21:58:06 +00:00
Mark Johnston	bf923a556d	Delete an XXX comment addressed by r336505. X-MFC with: r336505 Sponsored by: The FreeBSD Foundation	2018-07-19 20:11:08 +00:00
Mark Johnston	483f692ea6	Have preload_delete_name() free pages backing preloaded data. On i386 and amd64, add a vm_phys segment for physical memory used to store the kernel binary and other preloaded data. This makes it possible to free such memory back to the system once it is no longer needed, e.g., when a preloaded kernel module is unloaded. Previously, it would have remained unused. Reviewed by: kib, royger MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D16330	2018-07-19 20:00:28 +00:00
Mark Johnston	73624a804a	Provide the full module path to preload_delete_name(). The basename will never match against the preload metadata, so these calls previously had no effect. Reviewed by: kib, royger MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D16330	2018-07-19 19:50:42 +00:00
Konstantin Belousov	53e20b2702	When reporting an error, print the errno value. Sponsored by: The FreeBSD Foundation MFC after: 3 days	2018-07-19 19:03:18 +00:00
Emmanuel Vadot	326867616f	kern_cpu: When adding abs frequency allow for unordered insertion Keep the list ordered, as some code assumes that it is, but allow for unordered cf_settings sets.	2018-07-19 11:28:14 +00:00
Mark Johnston	9295517ac9	Add a FALLTHROUGH comment to kvprintf(). Submitted by: Sebastian Huber <sebastian.huber@embedded-brains.de> MFC after: 3 days	2018-07-17 14:56:54 +00:00
Mariusz Zaborski	f1fe1e020f	Extend amount of possible coredumps from 10 to 100000 when using index format. The amount of digits in the name of corefile is assigned dynamically. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D16118	2018-07-15 17:10:12 +00:00
Mateusz Guzik	95ab076d6e	lockmgr: tidy up slock/sunlock similar to other locks	2018-07-13 22:40:14 +00:00
Warner Losh	25bc561e68	There are two files in the sys tree named inflate.c, in addition to it being a common name elsewhere. Rename the old kzip one to subr_inflate.c. This actually fixes the build issues on sparc64 that my inclusion of .PATH ${SYSDIR}/kern created in r336244, so also revert the broken workaround I committed in r336249. This slipped past me because apparently, I never did a clean build.	2018-07-13 17:41:28 +00:00
Warner Losh	52379d36a9	Create helper functions for parsing boot args. boot_parse_arg to parse a single arg boot_parse_cmdline to parse a command line string boot_parse_args to parse all the args in a vector boot_howto_to_env Convert howto bits to env vars boot_env_to_howto Return howto mask based on what's set in the environment. All these routines return an int that's the bitmask of the args translated to RB_* flags. As a special case, the 'S' flag sets the comconsole_speed env var. Any arg that looks like a=b will set the env key 'a' to value 'b'. If =b is omitted, 'a' is set to '1'. This should help us reduce the number of redundant copies of these routines in the tree. It should also give a more uniform experience between platforms. Also, invent a new flag RB_PROBE that's set when 'P' is parsed. On x86 + BIOS, this means 'probe for the keyboard, and if it's not there set both RB_MULTIPLE and RB_SERIAL (which means show the output on both video and serial consoles, but make serial primary). Others it may be some similar concept of probing, but it's loader dependent what, exactly, it means. These routines are suitable for /boot/loader and/or the kernel, though they may not be suitable for the tightly hand-rolled-for-space environments like boot2. Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D16205	2018-07-13 16:43:05 +00:00
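The a=b rule described in the commit above can be illustrated with a small standalone sketch. This is not the committed boot_parse_arg(); split_env_arg() and its buffer-based interface are hypothetical names invented for illustration, and the sketch only shows the "key=value, or key alone means '1'" convention.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the FreeBSD tree): split an argument of
 * the form "a=b" into key "a" and value "b". If "=b" is omitted, the value
 * defaults to "1", matching the rule stated in the commit message.
 * Returns 0 on success, -1 if the key is empty or a buffer is too small.
 */
static int
split_env_arg(const char *arg, char *key, size_t keylen, char *val,
    size_t vallen)
{
	const char *eq = strchr(arg, '=');
	size_t klen = (eq != NULL) ? (size_t)(eq - arg) : strlen(arg);
	const char *v = (eq != NULL) ? eq + 1 : "1";

	if (klen == 0 || klen >= keylen || strlen(v) >= vallen)
		return (-1);
	memcpy(key, arg, klen);
	key[klen] = '\0';
	strcpy(val, v);		/* length was checked above */
	return (0);
}
```

A caller would pass each non-flag argument through such a helper and then set the resulting key and value in the environment; how the real routines do that is not shown here.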
Brooks Davis	d92da75941	Round down the location of execpathp to slightly improve copyout speed. In practice, this moves the padding from below the canary to above execpathp and has no impact on stack consumption. Submitted by: Wuyang-Chung (via github pull request #159) MFC after: 1 week	2018-07-13 11:32:27 +00:00
Mateusz Guzik	bcbc8d35eb	fd: stop passing M_ZERO to uma_zalloc The optimisation seen with malloc cannot be used here as zone sizes are not known at compilation. Thus bzero by hand to get the optimisation instead.	2018-07-12 22:48:18 +00:00
Kyle Evans	44314c3509	kern_environment: Give the static environment a chance to disable MD env This variable has been given the name "loader_env.disabled" as it's the primary way most people will have an MD environment. This restores the previously-default behavior of ignoring the loader(8) environment, which may be useful for vendor distributions or other scenarios where inheriting the loader environment may be considered a security issue or potentially breaking of a more locked-down environment. As the change to config(5) indicates, disabling the loader environment should not be a choice made lightly since it may provide ACPI hints and other useful things that the system can rely on to boot. An UPDATING entry has been added to mention an upgrade path for those that may have relied on the previous behavior. Discussed with: bde Relnotes: yes (maybe)	2018-07-12 02:51:50 +00:00
Alan Somers	8a894c1aa1	Don't acquire evclass_lock with a spinlock held When the "pc" audit class is enabled and auditd is running, witness will panic during thread exit because au_event_class tries to lock an rwlock while holding a spinlock acquired upstack by thread_exit. To fix this, move AUDIT_SYSCALL_EXIT further upstack, before the spinlock is acquired. Of thread_exit's 16 callers, it's only necessary to call AUDIT_SYSCALL_EXIT from two, exit1 (for exiting processes) and kern_thr_exit (for exiting threads). The other callers are all kernel threads, which needn't call AUDIT_SYSCALL_EXIT: since they can't make syscalls, there will be nothing to audit. And exit1 already does call AUDIT_SYSCALL_EXIT, making the second call in thread_exit redundant for that case. PR: 228444 Reported by: aniketp Reviewed by: aniketp, kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D16210	2018-07-11 19:38:42 +00:00
Brooks Davis	942ae5c8b8	Regen after r336171.	2018-07-10 14:04:52 +00:00
Brooks Davis	7cc923f8a8	Get rid of netbsd_lchown and netbsd_msync syscall entries. No valid FreeBSD binary ever called them (they would call lchown and msync directly) and we haven't supported NetBSD binaries in ages. This is a respin of r335983 with a workaround for the ancient BFD linker in the libc stubs. Reviewed by: kib Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D16193	2018-07-10 13:32:04 +00:00
Brooks Davis	3a20f06a1c	Use uintptr_t alone when assigning to kvaddr_t variables. Suggested by: jhb	2018-07-10 13:03:06 +00:00
Kyle Evans	c7a82b9c6c	kern_environment: bool'itize dynamic_kenv; fix small style(9) nit	2018-07-10 02:43:22 +00:00
Kyle Evans	0f46005e4b	subr_hints: Skip static_env and static_hints if they don't contain hints This is possible because, well, they're static. Both the dynamic environment and the MD-environment (generally loader(8) environment) can potentially have room for new variables to be set, and thus do not receive this treatment.	2018-07-10 00:36:37 +00:00
Kyle Evans	dc4446df5f	subr_hints: Convert some bool-like ints to bools	2018-07-10 00:34:19 +00:00
Kyle Evans	5768da6c21	subr_hints: Use goto/label instead of series of conditionals	2018-07-10 00:33:31 +00:00
Mark Johnston	013072f04c	Fix pre-SI_SUB_CPU initialization of per-CPU counters. r336020 introduced pcpu_page_alloc(), replacing page_alloc() as the backend allocator for PCPU UMA zones. Unlike page_alloc(), it does not honour malloc(9) flags such as M_ZERO or M_NODUMP, so fix that. r336020 also changed counter(9) to initialize each counter using a CPU_FOREACH() loop instead of an SMP rendezvous. Before SI_SUB_CPU, smp_rendezvous() will only execute the callback on the current CPU (i.e., CPU 0), so only one counter gets zeroed. The rest are zeroed by virtue of the fact that UMA gratuitously zeroes slabs when importing them into a zone. Prior to SI_SUB_CPU, all_cpus is clear, so with r336020 we weren't zeroing vm_cnt counters during boot: the CPU_FOREACH() loop had no effect, and pcpu_page_alloc() didn't honour M_ZERO. Fix this by iterating over the full range of CPU IDs when zeroing counters, ignoring whether the corresponding bits in all_cpus are set. Reported and tested by: pho (previous version) Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D16190	2018-07-10 00:18:12 +00:00
Jamie Gritton	0a1724045e	Change prison_add_vfs() to the more generic prison_add_allow(), which can add any dynamic allow.* or allow.. parameter. Also keep prison_add_vfs() as a wrapper. Differential Revision: D16146	2018-07-06 18:50:22 +00:00
Kyle Evans	cae22dd904	kern_environment: Fix SYSINIT ordering The dynamic environment was being initialized at SI_SUB_KMEM, SI_ORDER_ANY. I added the hint-merging at SI_SUB_KMEM, SI_ORDER_ANY as well in r335998 - this can only work by coincidence. Re-do both to operate at SI_SUB_KMEM + 1, SI_ORDER_FIRST and SI_ORDER_SECOND respectively to be safe. It's sufficiently obfuscated away as to when in SI_SUB_KMEM malloc will be available, and the dynamic environment cannot be relied upon there anyways since it's initialized at SI_ORDER_ANY. Reported by: bde Discussed with: bde X-MFC-With: r335998	2018-07-06 16:51:35 +00:00
Brooks Davis	7524b4c14b	Correct breakage on 32-bit platforms from r335979.	2018-07-06 10:03:33 +00:00
Matt Macy	822e50e3f6	epoch(9): simplify initialization replace manual NUMA aware allocation with a pcpu zone	2018-07-06 06:20:03 +00:00
Matt Macy	ab3059a8e7	Back pcpu zone with domain correct pages - Change pcpu zone consumers to use a stride size of PAGE_SIZE. (defined as UMA_PCPU_ALLOC_SIZE to make future identification easier) - Allocate page from the correct domain for a given cpu. - Don't initialize pc_domain to non-zero value if NUMA is not defined There are some misconceptions surrounding this field. It is the _VM_ NUMA domain and should only ever correspond to valid domain values as understood by the VM. The former slab size of sizeof(struct pcpu) was somewhat arbitrary. The new value is PAGE_SIZE because that's the smallest granularity which the VM can allocate a slab for a given domain. If you have fewer than PAGE_SIZE/8 counters on your system there will be some memory wasted, but this is obviously something where you want the cache line to be coming from the correct domain. Reviewed by: jeff Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D15933	2018-07-06 02:06:03 +00:00
Andrew Turner	2bf9501287	Create a new macro for static DPCPU data. On arm64 (and possibly other architectures) we are unable to use static DPCPU data in kernel modules. This is because the compiler will generate PC-relative accesses, however the runtime-linker expects to be able to relocate these. In preparation to fix this create two macros depending on if the data is global or static. Reviewed by: bz, emaste, markj Sponsored by: ABT Systems Ltd Differential Revision: https://reviews.freebsd.org/D16140	2018-07-05 17:13:37 +00:00
Bjoern A. Zeeb	1534cd19b5	Split up deadlkres() to make it more readable in anticipation of further changes adding another level of indentation. Some of the logic got simplified with the break out functions. There should be no functional changes. Reviewed by: kib Sponsored by: iXsystems, Inc. Differential Revision: https://reviews.freebsd.org/D15914	2018-07-05 17:06:54 +00:00
Kyle Evans	39d44f7f15	kern_environment: use any provided environments, evict hintmode/envmode At the moment, hintmode and envmode are used to indicate whether static hints or static env have been provided in the kernel config(5) and the static versions are mutually exclusive with loader(8)-provided environment. hintmode can be reconfigured later to pull from the dynamic environment, thus taking advantage of the loader(8) or post-kmem environment setting. This changeset fixes both problems at once to move us from a semi-confusing state to a consistent state: if an environment file, hints file, or loader(8) environment are provided, we use them in a well-known order of precedence: - loader(8) environment - static environment - static hints file Once the dynamic environment is setup this becomes a moot point. The loader(8) and static environments are merged (respecting the above order of precedence), and the static hints are merged in on an as-needed basis after the dynamic environment has been setup. Hints lookup are changed to respect all of the above. Before the dynamic environment is setup, lookups use the above-mentioned order and fallback to the next environment if a matching hint is not found. Once the dynamic environment is setup, that is used on its own since it captures all of the above information plus any dynamic kenv settings that came up later in boot. The following tangentially related changes were made to res_find: - A hintp cookie is now passed in so that related searches continue using the chain of environments (or dynamic environment) without relying on global state - All three environments will be searched if they actually have valid hints to use, rather than just choosing the first environment that actually had a hint and rolling with that only The hintmode sysctl has been ripped out. 
static_{env,hints}.disabled are still honored and will disable their respective environments from being used for hint lookups and from being merged into the dynamic environment, as expected. MFC after: 1 month (maybe) Differential Revision: https://reviews.freebsd.org/D15953	2018-07-05 16:30:32 +00:00
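The fallback order described in the commit above (loader(8) environment, then static environment, then static hints) can be sketched as a lookup over a chain of environments. This is an illustration only, not the committed res_find() logic: env_find() and chain_lookup() are hypothetical names, and each environment is assumed here to be a NULL-terminated array of "name=value" strings rather than the kernel's actual representation.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: find "name" in one environment, modelled as a
 * NULL-terminated array of "name=value" strings. Returns the value or NULL.
 */
static const char *
env_find(const char *const *env, const char *name)
{
	size_t n = strlen(name);

	if (env == NULL)
		return (NULL);
	for (; *env != NULL; env++) {
		if (strncmp(*env, name, n) == 0 && (*env)[n] == '=')
			return (*env + n + 1);
	}
	return (NULL);
}

/*
 * Hypothetical helper: search the environments in order of precedence and
 * fall back to the next one when the name is not found.
 */
static const char *
chain_lookup(const char *const *const envs[], size_t nenvs, const char *name)
{
	for (size_t i = 0; i < nenvs; i++) {
		const char *v = env_find(envs[i], name);

		if (v != NULL)
			return (v);
	}
	return (NULL);
}
```

Passing the environments as an explicit array, rather than consulting global state, loosely mirrors the commit's stated goal of letting related searches continue along the same chain.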
Kyle Evans	e28687347f	Revert r335995 due to accidental changes snuck in	2018-07-05 16:28:43 +00:00
Kyle Evans	8ef5886303	kern_environment: use any provided environments, evict hintmode/envmode At the moment, hintmode and envmode are used to indicate whether static hints or static env have been provided in the kernel config(5) and the static versions are mutually exclusive with loader(8)-provided environment. hintmode can be reconfigured later to pull from the dynamic environment, thus taking advantage of the loader(8) or post-kmem environment setting. This changeset fixes both problems at once to move us from a semi-confusing state to a consistent state: if an environment file, hints file, or loader(8) environment are provided, we use them in a well-known order of precedence: - loader(8) environment - static environment - static hints file Once the dynamic environment is setup this becomes a moot point. The loader(8) and static environments are merged (respecting the above order of precedence), and the static hints are merged in on an as-needed basis after the dynamic environment has been setup. Hints lookup are changed to respect all of the above. Before the dynamic environment is setup, lookups use the above-mentioned order and fallback to the next environment if a matching hint is not found. Once the dynamic environment is setup, that is used on its own since it captures all of the above information plus any dynamic kenv settings that came up later in boot. The following tangentially related changes were made to res_find: - A hintp cookie is now passed in so that related searches continue using the chain of environments (or dynamic environment) without relying on global state - All three environments will be searched if they actually have valid hints to use, rather than just choosing the first environment that actually had a hint and rolling with that only The hintmode sysctl has been ripped out. 
static_{env,hints}.disabled are still honored and will disable their respective environments from being used for hint lookups and from being merged into the dynamic environment, as expected. MFC after: 1 month (maybe) Differential Revision: https://reviews.freebsd.org/D15953	2018-07-05 16:25:48 +00:00
Bjoern A. Zeeb	0fb9f29bae	With the introduction of reapers and reaplists in r275800, proc0 and init are setup as a circular dependency. create_init() calls fork1() which calls do_fork(). There the newproc (initproc) is setup with a reaper of proc0 whose reaper points to itself. The newproc (initproc) is then put on its reaper's (proc0) p_reaplist (initproc is a descendant of proc0 for proc0 to reap). Upon return to create_init(), proc0 is added to initproc's p_reaplist (which would mean proc0 is a descendant of init, for init to reap). This creates a circular dependency which eventually leads to LIST corruptions when trying to kill init and a proc0. For the base system we never really hit this case during reboot. The problem only became visible after adding more virtual process spaces which could go away cleanly (work existing in an experimental branch). Reviewed by: kib Sponsored by: iXsystems, Inc. Differential Revision: https://reviews.freebsd.org/D15924	2018-07-05 16:16:28 +00:00
Brooks Davis	714c03c81e	Revert r335983. The bfd linker in tree doesn't support multiple names for the same symbol (at least with current flags).	2018-07-05 16:03:03 +00:00
Brooks Davis	5b04a71dae	Get rid of netbsd_lchown and netbsd_msync syscall entries. No valid FreeBSD binary ever called them (they would call lchown and msync directly) and we haven't supported NetBSD binaries in ages. Reviewed by: kib Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D15814	2018-07-05 14:12:56 +00:00
Konstantin Belousov	dbadb01591	Silence warnings about unused variables when RACCT is defined but RCTL is not. Reported by: Dries Michiels <driesm.michiels@gmail.com> Sponsored by: The FreeBSD Foundation MFC after: 3 days	2018-07-05 13:37:31 +00:00
Brooks Davis	f38b68ae8a	Make struct xinpcb and friends word-size independent. Replace size_t members with ksize_t (uint64_t) and pointer members (never used as pointers in userspace, but instead as unique identifiers) with kvaddr_t (uint64_t). This makes the structs identical between 32-bit and 64-bit ABIs. On 64-bit systems, the ABI is maintained. On 32-bit systems, this is an ABI breaking change. The ABI of most of these structs was previously broken in r315662. This also imposes a small API change on userspace consumers who must handle kernel pointers becoming virtual addresses. PR: 228301 (exp-run by antoine) Reviewed by: jtl, kib, rwatson (various versions) Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D15386	2018-07-05 13:13:48 +00:00
Matt Macy	10b8cd7f55	epoch(9): make nesting assert in epoch_wait_preempt more specific Reported by: markj	2018-07-04 21:34:08 +00:00
Mariusz Zaborski	6cad1a5d14	Add description to debug.ncores sysctl. Reviewed by: bcr Differential Revision: https://reviews.freebsd.org/D16123	2018-07-04 17:06:51 +00:00
Konstantin Belousov	d6eff0832c	Add a way for the process to request cleanup of the kernel cache of the process arguments. New arguments length zero causes the drop of the pargs instead of allocation of useless zero-length buffer. Submitted by: Thomas Munro MFC after: 1 week Differential revision: https://reviews.freebsd.org/D16111	2018-07-04 13:22:48 +00:00
Andriy Gapon	b0af06052c	remove unneeded inclusion of sys/interrupt.h from several files It's likely that the header was needed in the past for swi(9). But now that code does not use swi(9) or any other interfaces defined in sys/interrupt.h. MFC after: 1 week	2018-07-04 09:07:18 +00:00
Matt Macy	6573d7580b	epoch(9): allow preemptible epochs to compose - Add tracker argument to preemptible epochs - Inline epoch read path in kernel and tied modules - Change in_epoch to take an epoch as argument - Simplify tfb_tcp_do_segment to not take a ti_locked argument, there's no longer any benefit to dropping the pcbinfo lock and trying to do so just adds an error prone branchfest to these functions - Remove cases of same function recursion on the epoch as recursing is no longer free. - Remove the TAILQ_ENTRY and epoch_section from struct thread as the tracker field is now stack or heap allocated as appropriate. Tested by: pho and Limelight Networks Reviewed by: kbowling at llnw dot com Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D16066	2018-07-04 02:47:16 +00:00
Matt Macy	8bedbb4d42	expose thread_lite definition to tied modules	2018-07-03 02:50:07 +00:00
Matt Macy	6443773dab	make critical_{enter, exit} inline Avoid pulling in all of the <sys/proc.h> dependencies by automatically generating a stripped down thread_lite exporting only the fields of interest. The field declarations are type checked against the original and the offsets of the generated result is automatically checked. kib has expressed disagreement and would have preferred to simply use genassym style offsets (which loses type check enforcement). jhb has expressed dislike of it due to header pollution and a duplicate structure. He would have preferred to just have defined thread in _thread.h. Nonetheless, he admits that this is the only viable solution at the moment. The impetus for this came from mjg's D15331: "Inline critical_enter/exit for amd64" Reviewed by: jeff Differential Revision: https://reviews.freebsd.org/D16078	2018-07-03 01:55:09 +00:00
Mariusz Zaborski	0dea6e3c98	core(5): overwrite the oldest core dump The '%I' format in the kern.corefile sysctl limits the number of core files that a process can generate to the number stored in the debug.ncores sysctl. The '%I' format is replaced by the single digit index. Previously, if all indexes were taken the kernel would overwrite only a core file with the highest index in a filename. Currently the system will create a new core file if there is a free index or if all slots are taken it will overwrite the oldest one. Reviewed by: kib(code), bcr (updating) Differential Revision: https://reviews.freebsd.org/D15991 Differential Revision: https://reviews.freebsd.org/D16084	2018-07-01 17:28:46 +00:00
Gleb Smirnoff	95dce07dea	Correct r335242. Use unsigned cast instead of abs(). Using abs() gives incorrect result when ticks has already wrapped, and are about to reach the cr_ticks value (cr_ticks - ticks < hz). Submitted by: bde	2018-06-27 22:00:50 +00:00
Warner Losh	bc6cb3f6b4	Remove devctl_safe_quote since it's now unused. Sponsored by: Netflix Differential Review: https://reviews.freebsd.org/D16026	2018-06-27 04:11:19 +00:00
Warner Losh	349fcda430	Fix devctl generation for core files. We have a problem with vn_fullpath_global when the file exists. Work around it by printing the full path if the core file name starts with /, or current working directory followed by the filename if not. Sponsored by: Netflix Differential Review: https://reviews.freebsd.org/D16026	2018-06-27 04:11:09 +00:00
Warner Losh	ab531b8825	Create new devctl_safe_quote_sb to copy a source string into a struct sbuf to make it safe. Callers are expected to add the " " around it, if needed. Sponsored by: Netflix Differential Review: https://reviews.freebsd.org/D16026	2018-06-27 04:10:48 +00:00
Matt Macy	74333b3dee	fix assert and conditionally allow mutexes to be held across epoch_wait_preempt	2018-06-24 18:57:06 +00:00
Matt Macy	0bcfb47363	epoch(9): Don't trigger taskq enqueue before the grouptaskqs are setup If EARLY_AP_STARTUP is not defined it is possible for an epoch to be allocated prior to it being possible to call epoch_call without issue. Based on patch by andrew@ PR: 229014 Reported by: andrew	2018-06-23 07:14:08 +00:00
Colin Percival	7e8db78116	Improve the accuracy of the POSIX "process CPU-time" clocks by adding the used portion of the current thread's time slice if the current thread belongs to the process being queried (i.e., if clock_gettime is invoked with a clock ID of CLOCK_PROCESS_CPUTIME_ID or the value provided by passing getpid(2) to clock_getcpuclockid(3)). The CLOCK_VIRTUAL and CLOCK_PROF timers already make this adjustment via long-standing code in calcru(), but since those timers are not specified by POSIX it seems useful to add it here so that the higher accuracy is available to code which aims to be portable. PR: 228669 Reported by: Graham Percival Reviewed by: kib MFC after: 1 week	2018-06-22 10:23:32 +00:00
Matt Macy	ae25f40b72	epoch(9): make non-preemptible variant work early boot	2018-06-22 00:47:18 +00:00
Kyle Evans	03d7aee8a7	subr_hints: Fix acpi unit hinting (at the very least) The refactoring in r335479 overlooked the fact that the dynamic kenv can also be switched to if hintmode == 0. This is problematic because the checkmethod bits are only ever ran once, but it worked previously because the use_kenv was a global state and the first lookup would enable it if occurring after the dynamic environment has been setup. Extending our local definition of use_kenv to include all non-STATIC hintmodes as long as the dynamic_kenv is setup fixes this. We still have potential issues if the dynamic kenv comes up while we're doing an anchored search through the environment, but this is not much of a concern right now because: 1.) The dynamic environment comes up super early in boot, just after kmem 2.) This is going to get rewritten to provide a safer mechanism for the anchored searches, ensuring that we continue using the same environment chain (dynamic env or static fallback) for all anchored search invocations Reported by: mmamcy X-MFC-With: r335479	2018-06-21 21:50:00 +00:00
Konstantin Belousov	6e22bbf66e	fork: avoid endless wait with PTRACE_FORK and RFSTOPPED. An RFSTOPPED thread can't clean TDB_STOPATFORK, which is done in the fork_return() in its context, so parent is stuck forever. Triggered when trying to ptrace linux process. Instead of waiting for the new thread to clear TDB_STOPATFORK, tag it as traced and reparent to the debugger in do_fork(), and let it only notify the debugger when run. Submitted by: Yanko Yankulov <yanko.yankulov@gmail.com> Reviewed by: jhb MFC after: 1 week X-MFC-Note: keep p_dbgwait placeholder intact Differential revision: https://reviews.freebsd.org/D15857	2018-06-21 21:12:49 +00:00
Konstantin Belousov	ac4bc0c171	Update proc->p_ptevents annotation to reflect the actual locking. Submitted by: Yanko Yankulov <yanko.yankulov@gmail.com> Reviewed by: jhb MFC after: 1 week Differential revision: https://reviews.freebsd.org/D15954	2018-06-21 21:07:25 +00:00
Justin Hibbits	22c1b4c0f1	Introduce PMCR-based cpufreq(4) driver, for IBM POWER8 and POWER9 systems Summary: POWER8 and POWER9 use a single CPU register, per core, to change clock speed. Everything else is handled by the on-chip controller. This change necessitates a change to the cpufreq global kernel driver to bump supported levels, as the device tree for these systems can have theoretically 256 different options. On my POWER9 Talos, the list consists of 100 items. At 16.67MHz intervals, that allows for a change of roughly 1.67GHz between lowest and highest. This has only been tested on the POWER9. However, since they're similar, this should work on POWER8 as well. Reviewed By: nwhitehorn Differential Revision: https://reviews.freebsd.org/D15932	2018-06-21 14:26:43 +00:00
Kyle Evans	770488d202	subr_hints: simplify a little bit Some complexity exists in these bits that isn't needed. The sysctl handler, upon change to '2', runs through the current set of hints and sets them in the kenv. However, this isn't at all necessary if we're pulling hints from the kenv, static or dynamic, as the former will get added to the latter in init_dynamic_kenv (see: kern_environment.c). We can reduce this configuration to just adding static_hints to the kenv if we were previously using them. The changes in res_find are minimal and based on the observation that once use_kenv gets set to '1' it will never be reset to '0', and it gets set to '1' as soon as we hit fallback mode. Later work will refactor res_find a little bit and eliminate this now-local, because it's become clear that there's some funkiness revolving around use_kenv=1 and it being used to imply that we're certainly looking at the dynamic_kenv. Reviewed by: ray MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D15940	2018-06-21 14:04:02 +00:00
Hans Petter Selasky	ce70c57262	Permit the kernel environment to set an array of numeric values for a single sysctl(9) node. Reviewed by: kib@, imp@, jhb@ Differential Revision: https://reviews.freebsd.org/D15802 MFC after: 1 week Sponsored by: Mellanox Technologies	2018-06-20 20:04:20 +00:00
Kyle Evans	c7962400c9	Add debug.verbose_sysinit tunable for VERBOSE_SYSINIT VERBOSE_SYSINIT is currently an all-or-nothing option. debug.verbose_sysinit adds an option to have the code compiled in but quiet by default so that getting this information from a device in the field doesn't necessarily require distributing a recompiled kernel. Its default is VERBOSE_SYSINIT's value as defined in the kernconf. As such, the default behavior for simply omitting or including this option is unchanged. MFC after: 1 week	2018-06-20 19:23:56 +00:00
Emmanuel Vadot	78442297f5	Add pmap_mapdev_attr for arm64 This is needed for efifb. arm and riscv pmap (the two other arches that, like arm64, use subr_devmap) have very different implementations, so for now only add this for arm64. Tested with efifb on Pine64 with a few other patches. Reviewed by: cognet Differential Revision: https://reviews.freebsd.org/D15294	2018-06-20 16:07:35 +00:00
Bjoern A. Zeeb	7938a4425a	Instead of using hand-rolled loops where not needed switch them to FOREACH_PROC_IN_SYSTEM() to have a single pattern to look for. Reviewed by: kib MFC after: 2 weeks Sponsored by: iXsystems, Inc. Differential Revision: https://reviews.freebsd.org/D15916	2018-06-20 11:42:06 +00:00
Bjoern A. Zeeb	7ffbcfe281	Sometimes it is helpful to get the path for a vnode. Implement a ddb function walking the namecache to do this. Reviewed by: jhb, mjg Inspired by: gdb macro from jhb (old version) Sponsored by: iXsystems, Inc. Differential Revision: https://reviews.freebsd.org/D14898	2018-06-20 08:34:29 +00:00
Matt Macy	9e58ff6ff9	convert inpcbinfo hash and info rwlocks to epoch + mutex - Convert inpcbinfo info & hash locks to epoch for read and mutex for write - Garbage collect code that handled INP_INFO_TRY_RLOCK failures as INP_INFO_RLOCK which can no longer fail When running 64 netperfs sending minimal sized packets on a 2x8x2 reduces unhalted core cycles samples in rwlock rlock/runlock in udp_send from 51% to 3%. Overall packet throughput rate limited by CPU affinity and NIC driver design choices. On the receiver unhalted core cycles samples in in_pcblookup_hash went from 13% to 1.6% Tested by LLNW and pho@ Reviewed by: jtl Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D15686	2018-06-19 01:54:00 +00:00
Andrey V. Elsukov	20efcfc602	Switch RIB and RADIX_NODE_HEAD lock from rwlock(9) to rmlock(9). Using rwlock with multiqueue NICs for IP forwarding at high pps produces high lock contention and is inefficient. Rmlock fits better for such workloads. Reviewed by: melifaro, olivier Obtained from: Yandex LLC Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D15789	2018-06-16 08:26:23 +00:00
Gleb Smirnoff	61f63f47b3	Since 'ticks' is an int, it may wrap around and cr_ticks at a certain counter_rate will be greater than ticks, resulting in counter_ratecheck() failure. To fix this take an absolute value of the difference between ticks and cr_ticks. Reported by: jtl Sponsored by: Netflix	2018-06-15 21:36:16 +00:00
Bryan Drewery	03bd1b693e	proc0_post: Fix some locking issues - Filter out PRS_NEW procs as rufetch() tries taking the thread lock which may not yet be initialized. - Hold PROC_LOCK to ensure stability of iterating the threads. - p_rux fields are protected by the process statlock as well. MFC after: 2 weeks Reviewed by: kib Sponsored by: Dell EMC Differential Revision: https://reviews.freebsd.org/D15809	2018-06-15 00:36:41 +00:00
Olivier Houchard	78bcf87e3e	Use M_EXEC when calling malloc() to allocate the memory to store the module, as it'll contain executable code.	2018-06-14 23:10:10 +00:00
Brooks Davis	7d87c005da	Regen after 335177 (rename sys_obreak to sys_break).	2018-06-14 21:29:31 +00:00
Brooks Davis	9da5364ed9	Name the implementation of brk and sbrk sys_break(). The break() system call was renamed (several times) starting in v3 AT&T UNIX when C was invented and break was a language keyword. The last vestige of a need for it to be called something else (eg obreak) was removed in r225617 which consistently prefixed all syscall implementations. Reviewed by: emaste, kib (older version) Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D15638	2018-06-14 21:27:25 +00:00
Jonathan T. Looney	0766f278d8	Make UMA and malloc(9) return non-executable memory in most cases. Most kernel memory that is allocated after boot does not need to be executable. There are a few exceptions. For example, kernel modules do need executable memory, but they don't use UMA or malloc(9). The BPF JIT compiler also needs executable memory and did use malloc(9) until r317072. (Note that a side effect of r316767 was that the "small allocation" path in UMA on amd64 already returned non-executable memory. This meant that some calls to malloc(9) or the UMA zone(9) allocator could return executable memory, while others could return non-executable memory. This change makes the behavior consistent.) This change makes malloc(9) return non-executable memory unless the new M_EXEC flag is specified. After this change, the UMA zone(9) allocator will always return non-executable memory, and a KASSERT will catch attempts to use the M_EXEC flag to allocate executable memory using uma_zalloc() or its variants. Allocations that do need executable memory have various choices. They may use the M_EXEC flag to malloc(9), or they may use a different VM interface to obtain executable pages. Now that malloc(9) again allows executable allocations, this change also reverts most of r317072. PR: 228927 Reviewed by: alc, kib, markj, jhb (previous version) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D15691	2018-06-13 17:04:41 +00:00
Warner Losh	a971acbc25	Implement a 'car limit' for bioq. Allow one to implement a 'car limit' for bioq_disksort. debug.bioq_batchsize sets the size of car limit. Every time we queue that many requests, we start over so that we limit the latency for requests when the software queue depths are large. A value of '0', the default, means to revert to the old behavior. Sponsored by: Netflix	2018-06-13 16:48:07 +00:00
Bruce Evans	ab35e1c71b	Fix the encoding of major and minor numbers in 64-bit dev_t by restoring the old encodings for the lower 16 and 32 bits and only using the higher 32 bits for unusually large major and minor numbers. This change breaks compatibility with the previous encoding (which was only used in -current). Fix truncation to (essentially) 16-bit dev_t in newnfs v3. Any encoding of device numbers gives an ABI, so it can't be changed without translations for compatibility. Extra bits give the much larger complication that the translations need to compress into fewer bits. Fortunately, more than 32 bits are rarely needed, so compression is rarely needed except for 16-bit linux dev_t where it was always needed but never done. The previous encoding moved the major number into the top 32 bits. Almost no translation code handled this, so the major number was blindly truncated away in most 32-bit encodings. E.g., for ffs, mknod(8) with major = 1 and minor = 2 gave dev_t = 0x10000002; ffs cannot represent this and blindly truncated it to 2. But if this mknod was run on any released version of FreeBSD, it gives dev_t = 0x102. ffs can represent this, but in the previous encoding it was not decoded, giving major = 0, minor = 0x102. The presence of bugs was most obvious for exporting dev_t's from an old system to -current, since bugs in newnfs augment them. I fixed oldnfs to support 32-bit dev_t in 1996 (r16634), but this regressed to 16-bit dev_t in newnfs, first to the old 16-bit encoding and then further in -current. E.g., old ad0 with major = 234, minor = 0x10002 had the correct (major, minor) number on the wire, but newnfs truncated this to (234, 2) and then the previous encoding shifted the major number into oblivion as seen by ffs or old applications. I first tried to fix this by translating on every ABI/API boundary, but there are too many boundaries and too many sloppy translations by blind truncation. 
So use the old encoding for the low 32 bits so that sloppy translations work no worse than before provided the high 32 bits are not set. Add some error checking for when bits are lost. Keep not doing any error checking for translations for almost everything in compat/linux. compat/freebsd32/freebsd32_misc.c: Optionally check for losing bits after possibly-truncating assignments as before. compat/linux/linux_stats.c: Depend on the representation being compatible with Linux's (or just with itself for local use) and spell some of the translations as assignments in a macro that hides the details. fs/nfsclient/nfs_clcomsubs.c: Essentially the same fix as in 1996, except there is now no possible truncation in makedev() itself. Also fix nearby style bugs. kern/vfs_syscalls.c: As for freebsd32. Also update the sysctl description to include file numbers, and change it to describe device ids as device numbers. sys/types.h: Use inline functions (wrapped by macros) since the expressions are now a bit too complicated for plain macros. Describe the encoding and some of the reasons for it. 16-bit compatibility didn't leave many reasonable choices for the 32-bit encoding, and 32-bit compatibility doesn't leave many reasonable choices for the 64-bit encoding. My choice is to put the 8 new minor bits in the low 8 bits of the top 32 bits. This minimizes discontiguities. Reviewed by: kib (except for rewrite of the comment in linux_stats.c)	2018-06-13 12:22:00 +00:00
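As a rough illustration of the encoding this commit message describes, the following minimal sketch packs and unpacks a 64-bit device number in userland C. It is not the code from sys/types.h: the type `dev64_t` and the helpers `dev64_make()`, `dev64_major()` and `dev64_minor()` are hypothetical names. The placement of the low 32 bits (major in bits 8-15, minor in bits 0-7 and 16-31) and of the 8 new minor bits (bits 32-39) follows the message; putting the remaining major bits in bits 40-63 is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a 64-bit dev_t. */
typedef uint64_t dev64_t;

/* Major: low 8 bits live in bits 8-15 (old encoding);
 * the high 24 bits are assumed to live in bits 40-63. */
static inline uint32_t
dev64_major(dev64_t d)
{
	return (uint32_t)(((d >> 32) & 0xffffff00u) | ((d >> 8) & 0xffu));
}

/* Minor: bits 0-7 and 16-31 stay in place (old encoding);
 * the 8 new minor bits (8-15) live in bits 32-39. */
static inline uint32_t
dev64_minor(dev64_t d)
{
	return (uint32_t)(((d >> 24) & 0xff00u) | (d & 0xffff00ffu));
}

/* Pack a (major, minor) pair using the layout above. */
static inline dev64_t
dev64_make(uint32_t major, uint32_t minor)
{
	return ((dev64_t)(major & 0xffffff00u) << 32) |
	    ((dev64_t)(major & 0xffu) << 8) |
	    ((dev64_t)(minor & 0xff00u) << 24) |
	    (dev64_t)(minor & 0xffff00ffu);
}
```

Under these assumptions, major = 1 and minor = 2 pack to 0x102, matching the value the message gives for released versions of FreeBSD, and the high 32 bits stay zero unless the major exceeds 8 bits or the minor uses bits 8-15.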
Bruce Evans	372639f944	Fix some bugs found while fixing the representation and translation of 64-bit dev_t's (but not ones involving dev_t's). st_size was supposed to be clamped in cvtstat() and linux's copy_stat(), but the clamping code wasn't aware that st_size is signed, and also had an obfuscated off-by-1 value for the unsigned limit, so its effect was to produce a bizarre negative size instead of clamping. Change freebsd32's copy_ostat() to be no worse than cvtstat(). It was missing clamping and bzero()ing of padding. Reviewed by: kib (except a final fix of the clamp to the signed maximum)	2018-06-13 08:50:43 +00:00
Ed Maste	00ce0c6258	makesyscalls: simplify capenabled pipeline Replace cat + 2x grep with one grep. Sponsored by: Turing Robotic Industries	2018-06-11 18:57:40 +00:00
Matt Macy	0ea9d9376e	limit change to fixing controlp handling pending review	2018-06-11 17:10:19 +00:00
Matt Macy	c34bf30069	soreceive_stream: correctly handle edge cases - non NULL controlp is not an error, returning EINVAL would cause X forwarding to fail - MSG_PEEK and MSG_WAITALL are fairly exceptional, but we still want to handle them - punt to soreceive_generic	2018-06-11 16:31:42 +00:00
Mateusz Guzik	0001edb823	counter: add a bit missed in r334858 It happens to be a noop.	2018-06-08 22:06:32 +00:00
Matt Macy	a62b4665f4	AF_UNIX: bring uipc_ready in compliance with new locking protocol PR: 228742 Submitted by: markj Reviewed by: markj	2018-06-08 20:31:59 +00:00
Jonathan T. Looney	1fbe13cf4b	Add a socket destructor callback. This allows kernel providers to set callbacks to perform additional cleanup actions at the time a socket is closed. Michio Honda presented a use for this at BSDCan 2018. (See https://www.bsdcan.org/2018/schedule/events/965.en.html .) Submitted by: Michio Honda <micchie at sfc.wide.ad.jp> (previous version) Reviewed by: lstewart (previous version) Differential Revision: https://reviews.freebsd.org/D15706	2018-06-08 19:35:24 +00:00
Mateusz Guzik	b8af2820f6	uma: fix up r334824 Turns out there is code which ends up passing M_ZERO to counters. Since counters zero unconditionally on their own, just drop the flag in that place.	2018-06-08 05:40:36 +00:00
Matt Macy	eb7c901995	hwpmc: simplify calling convention for hwpmc interrupt handling pmc_process_interrupt takes 5 arguments when only 3 are needed. cpu is always available in curcpu and inuserspace can always be derived from the passed trapframe. While facially a reasonable cleanup this change was motivated by the need to workaround a compiler bug. core2_intr(cpu, tf) -> pmc_process_interrupt(cpu, ring, pmc, tf, inuserspace) -> pmc_add_sample(cpu, ring, pm, tf, inuserspace) In the process of optimizing the tail call the tf pointer was getting clobbered: (kgdb) up at /storage/mmacy/devel/freebsd/sys/dev/hwpmc/hwpmc_mod.c:4709 4709 pmc_save_kernel_callchain(ps->ps_pc, (kgdb) up 1205 error = pmc_process_interrupt(cpu, PMC_HR, pm, tf, resulting in a crash in pmc_save_kernel_callchain.	2018-06-08 04:58:03 +00:00
Randall Stewart	89e560f441	This commit brings in a new refactored TCP stack called Rack. Rack includes the following features: - A different SACK processing scheme (the old sack structures are not used). - RACK (Recent acknowledgment) where counting dup-acks is no longer done; instead time is used to know when to retransmit. (see the I-D) - TLP (Tail Loss Probe) where we will probe for tail-losses to try to avoid taking a retransmit time-out. (see the I-D) - Burst mitigation using TCPHPTS - PRR (proportional rate reduction) see the RFC. Once built into your kernel, you can select this stack either by a socket option with the stack name "rack" or by setting the global sysctl so the default is rack. Note that any connection that does not support SACK will be kicked back to the base FreeBSD stack (currently known as "default"). To build this into your kernel you will need to enable in your kernel: makeoptions WITH_EXTRA_TCP_STACKS=1 options TCPHPTS Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D15525	2018-06-07 18:18:13 +00:00
Alan Cox	5b274055d1	When pidctrl_daemon() is called multiple times within an interval, it should use the cumulative error to calculate the output.	2018-06-07 07:48:50 +00:00
Matt Macy	fcabd54160	AF_UNIX: check for unp == unp2 on disconnect	2018-06-07 04:57:40 +00:00
Alan Cox	e768070ca9	pidctrl_daemon() implements a variation on the classical, discrete PID controller that tries to handle early invocations of the controller, in other words, invocations before the expected end of the interval. However, there were some calculation errors in this early invocation case. Notably, if an early invocation occurred while the error was negative, the derivative term was off by a large amount. One visible effect of this error was that processes were being killed by the virtual memory system's OOM killer when in fact there was plentiful free memory. Correct a couple minor errors in the sysctl descriptions, and apply some style fixes. Reviewed by: jeff, markj	2018-06-07 02:54:11 +00:00
Sean Bruno	1a43cff92a	Load balance sockets with new SO_REUSEPORT_LB option. This patch adds a new socket option, SO_REUSEPORT_LB, which allows multiple programs or threads to bind to the same port, with incoming connections load balanced using a hash function. Most of the code was copied from a similar patch for DragonflyBSD. However, in DragonflyBSD, load balancing is a global on/off setting and cannot be set per socket. This patch allows for simultaneous use of both the current SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system. Required changes to structures: Globally change so_options from 16 to 32 bit value to allow for more options. Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets. Limitations: As in DragonflyBSD, a load balance group is limited to 256 pcbs (256 programs or threads sharing the same socket). This is a substantially different contribution as compared to its original incarnation at svn r332894 and reverted at svn r332967. Thanks to rwatson@ for the substantive feedback that is included in this commit. Submitted by: Johannes Lundberg <johalun0@gmail.com> Obtained from: DragonflyBSD Relnotes: Yes Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D11003	2018-06-06 15:45:57 +00:00
Justin Hibbits	3f9e1fc8ee	Revert r334708 This is the wrong place to put the barrier. Requested by: kib,mjg	2018-06-06 15:12:19 +00:00
Justin Hibbits	32c369f40c	Add a memory barrier after taking a reference on the vnode holdcnt in _vhold This is needed to avoid a race between the VNASSERT() below, and another thread updating the VI_FREE flag, on weakly-ordered architectures. On a 72-thread POWER9, without this barrier a 'make -j72 buildworld' would panic on the assert regularly. It may be possible to use a weaker barrier, and I'll investigate that once all stability issues are worked out on POWER9.	2018-06-06 12:57:11 +00:00
Matt Macy	ebfaf69cc0	hwpmc: log name->pid, name->tid mappings By logging all threads and processes 'pmc filter' can now filter on process or thread name, relieving the user of the burden of determining which tid or pid was which when the sample was taken. % pmc filter -T if_io_tqg -P nginx pmc.log pmc-iflib.log % pmc filter -x -T idle pmc.log pmc-noidle.log	2018-06-05 04:26:40 +00:00
Mark Johnston	97bc9a9384	Regen after r334626.	2018-06-04 19:36:47 +00:00

16353 Commits