freebsd-dev

Author	SHA1	Message	Date
Matt Macy	6573d7580b	epoch(9): allow preemptible epochs to compose - Add tracker argument to preemptible epochs - Inline epoch read path in kernel and tied modules - Change in_epoch to take an epoch as argument - Simplify tfb_tcp_do_segment to not take a ti_locked argument, there's no longer any benefit to dropping the pcbinfo lock and trying to do so just adds an error prone branchfest to these functions - Remove cases of same function recursion on the epoch as recursing is no longer free. - Remove the TAILQ_ENTRY and epoch_section from struct thread as the tracker field is now stack or heap allocated as appropriate. Tested by: pho and Limelight Networks Reviewed by: kbowling at llnw dot com Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D16066	2018-07-04 02:47:16 +00:00
Matt Macy	8bedbb4d42	expose thread_lite definition to tied modules	2018-07-03 02:50:07 +00:00
Matt Macy	6443773dab	make critical_{enter, exit} inline Avoid pulling in all of the <sys/proc.h> dependencies by automatically generating a stripped down thread_lite exporting only the fields of interest. The field declarations are type checked against the original and the offsets of the generated result is automatically checked. kib has expressed disagreement and would have preferred to simply use genassym style offsets (which loses type check enforcement). jhb has expressed dislike of it due to header pollution and a duplicate structure. He would have preferred to just have defined thread in _thread.h. Nonetheless, he admits that this is the only viable solution at the moment. The impetus for this came from mjg's D15331: "Inline critical_enter/exit for amd64" Reviewed by: jeff Differential Revision: https://reviews.freebsd.org/D16078	2018-07-03 01:55:09 +00:00
Mariusz Zaborski	0dea6e3c98	core(5): overwrite the oldest core dump The '%I' format in the kern.corefile sysctl limits the number of core files that a process can generate to the number stored in the debug.ncores sysctl. The '%I' format is replaced by the single digit index. Previously, if all indexes were taken the kernel would overwrite only a core file with the highest index in a filename. Currently the system will create a new core file if there is a free index or if all slots are taken it will overwrite the oldest one. Reviewed by: kib(code), bcr (updating) Differential Revision: https://reviews.freebsd.org/D15991 Differential Revision: https://reviews.freebsd.org/D16084	2018-07-01 17:28:46 +00:00
Gleb Smirnoff	95dce07dea	Correct r335242. Use unsigned cast instead of abs(). Using abs() gives an incorrect result when ticks has already wrapped and is about to reach the cr_ticks value (cr_ticks - ticks < hz). Submitted by: bde	2018-06-27 22:00:50 +00:00
Warner Losh	bc6cb3f6b4	Remove devctl_safe_quote since it's now unused. Sponsored by: Netflix Differential Review: https://reviews.freebsd.org/D16026	2018-06-27 04:11:19 +00:00
Warner Losh	349fcda430	Fix devctl generation for core files. We have a problem with vn_fullpath_global when the file exists. Work around it by printing the full path if the core file name starts with /, or current working directory followed by the filename if not. Sponsored by: Netflix Differential Review: https://reviews.freebsd.org/D16026	2018-06-27 04:11:09 +00:00
Warner Losh	ab531b8825	Create new devctl_safe_quote_sb to copy a source string into a struct sbuf to make it safe. Callers are expected to add the " " around it, if needed. Sponsored by: Netflix Differential Review: https://reviews.freebsd.org/D16026	2018-06-27 04:10:48 +00:00
Matt Macy	74333b3dee	fix assert and conditionally allow mutexes to be held across epoch_wait_preempt	2018-06-24 18:57:06 +00:00
Matt Macy	0bcfb47363	epoch(9): Don't trigger taskq enqueue before the grouptaskqs are setup If EARLY_AP_STARTUP is not defined it is possible for an epoch to be allocated prior to it being possible to call epoch_call without issue. Based on patch by andrew@ PR: 229014 Reported by: andrew	2018-06-23 07:14:08 +00:00
Colin Percival	7e8db78116	Improve the accuracy of the POSIX "process CPU-time" clocks by adding the used portion of the current thread's time slice if the current thread belongs to the process being queried (i.e., if clock_gettime is invoked with a clock ID of CLOCK_PROCESS_CPUTIME_ID or the value provided by passing getpid(2) to clock_getcpuclockid(3)). The CLOCK_VIRTUAL and CLOCK_PROF timers already make this adjustment via long-standing code in calcru(), but since those timers are not specified by POSIX it seems useful to add it here so that the higher accuracy is available to code which aims to be portable. PR: 228669 Reported by: Graham Percival Reviewed by: kib MFC after: 1 week	2018-06-22 10:23:32 +00:00
Matt Macy	ae25f40b72	epoch(9): make non-preemptible variant work early boot	2018-06-22 00:47:18 +00:00
Kyle Evans	03d7aee8a7	subr_hints: Fix acpi unit hinting (at the very least) The refactoring in r335479 overlooked the fact that the dynamic kenv can also be switched to if hintmode == 0. This is problematic because the checkmethod bits are only ever ran once, but it worked previously because the use_kenv was a global state and the first lookup would enable it if occurring after the dynamic environment has been setup. Extending our local definition of use_kenv to include all non-STATIC hintmodes as long as the dynamic_kenv is setup fixes this. We still have potential issues if the dynamic kenv comes up while we're doing an anchored search through the environment, but this is not much of a concern right now because: 1.) The dynamic environment comes up super early in boot, just after kmem 2.) This is going to get rewritten to provide a safer mechanism for the anchored searches, ensuring that we continue using the same environment chain (dynamic env or static fallback) for all anchored search invocations Reported by: mmamcy X-MFC-With: r335479	2018-06-21 21:50:00 +00:00
Konstantin Belousov	6e22bbf66e	fork: avoid endless wait with PTRACE_FORK and RFSTOPPED. An RFSTOPPED thread can't clean TDB_STOPATFORK, which is done in the fork_return() in its context, so parent is stuck forever. Triggered when trying to ptrace linux process. Instead of waiting for the new thread to clear TDB_STOPATFORK, tag it as traced and reparent to the debugger in do_fork(), and let it only notify the debugger when run. Submitted by: Yanko Yankulov <yanko.yankulov@gmail.com> Reviewed by: jhb MFC after: 1 week X-MFC-Note: keep p_dbgwait placeholder intact Differential revision: https://reviews.freebsd.org/D15857	2018-06-21 21:12:49 +00:00
Konstantin Belousov	ac4bc0c171	Update proc->p_ptevents annotation to reflect the actual locking. Submitted by: Yanko Yankulov <yanko.yankulov@gmail.com> Reviewed by: jhb MFC after: 1 week Differential revision: https://reviews.freebsd.org/D15954	2018-06-21 21:07:25 +00:00
Justin Hibbits	22c1b4c0f1	Introduce PMCR-based cpufreq(4) driver, for IBM POWER8 and POWER9 systems Summary: POWER8 and POWER9 use a single CPU register, per core, to change clock speed. Everything else is handled by the on-chip controller. This change necessitates a change to the cpufreq global kernel driver to bump supported levels, as the device tree for these systems can have theoretically 256 different options. On my POWER9 Talos, the list consists of 100 items. At 16.67MHz intervals, that allows for a change of roughly 1.67GHz between lowest and highest. This has only been tested on the POWER9. However, since they're similar, this should work on POWER8 as well. Reviewed By: nwhitehorn Differential Revision: https://reviews.freebsd.org/D15932	2018-06-21 14:26:43 +00:00
Kyle Evans	770488d202	subr_hints: simplify a little bit Some complexity exists in these bits that isn't needed. The sysctl handler, upon change to '2', runs through the current set of hints and sets them in the kenv. However, this isn't at all necessary if we're pulling hints from the kenv, static or dynamic, as the former will get added to the latter in init_dynamic_kenv (see: kern_environment.c). We can reduce this configuration to just adding static_hints to the kenv if we were previously using them. The changes in res_find are minimal and based on the observation that once use_kenv gets set to '1' it will never be reset to '0', and it gets set to '1' as soon as we hit fallback mode. Later work will refactor res_find a little bit and eliminate this now-local, because it's become clear that there's some funkiness revolving around use_kenv=1 and it being used to imply that we're certainly looking at the dynamic_kenv. Reviewed by: ray MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D15940	2018-06-21 14:04:02 +00:00
Hans Petter Selasky	ce70c57262	Permit the kernel environment to set an array of numeric values for a single sysctl(9) node. Reviewed by: kib@, imp@, jhb@ Differential Revision: https://reviews.freebsd.org/D15802 MFC after: 1 week Sponsored by: Mellanox Technologies	2018-06-20 20:04:20 +00:00
Kyle Evans	c7962400c9	Add debug.verbose_sysinit tunable for VERBOSE_SYSINIT VERBOSE_SYSINIT is currently an all-or-nothing option. debug.verbose_sysinit adds an option to have the code compiled in but quiet by default so that getting this information from a device in the field doesn't necessarily require distributing a recompiled kernel. Its default is VERBOSE_SYSINIT's value as defined in the kernconf. As such, the default behavior for simply omitting or including this option is unchanged. MFC after: 1 week	2018-06-20 19:23:56 +00:00
Emmanuel Vadot	78442297f5	Add pmap_mapdev_attr for arm64 This is needed for efifb. arm and riscv pmap (the two archs that, along with arm64, use subr_devmap) have very different implementations, so for now only add this for arm64. Tested with efifb on Pine64 with a few other patches. Reviewed by: cognet Differential Revision: https://reviews.freebsd.org/D15294	2018-06-20 16:07:35 +00:00
Bjoern A. Zeeb	7938a4425a	Instead of using hand-rolled loops where not needed switch them to FOREACH_PROC_IN_SYSTEM() to have a single pattern to look for. Reviewed by: kib MFC after: 2 weeks Sponsored by: iXsystems, Inc. Differential Revision: https://reviews.freebsd.org/D15916	2018-06-20 11:42:06 +00:00
Bjoern A. Zeeb	7ffbcfe281	Sometimes it is helpful to get the path for a vnode. Implement a ddb function walking the namecache to do this. Reviewed by: jhb, mjg Inspired by: gdb macro from jhb (old version) Sponsored by: iXsystems, Inc. Differential Revision: https://reviews.freebsd.org/D14898	2018-06-20 08:34:29 +00:00
Matt Macy	9e58ff6ff9	convert inpcbinfo hash and info rwlocks to epoch + mutex - Convert inpcbinfo info & hash locks to epoch for read and mutex for write - Garbage collect code that handled INP_INFO_TRY_RLOCK failures as INP_INFO_RLOCK which can no longer fail When running 64 netperfs sending minimal sized packets on a 2x8x2, this reduces unhalted core cycles samples in rwlock rlock/runlock in udp_send from 51% to 3%. Overall packet throughput rate limited by CPU affinity and NIC driver design choices. On the receiver unhalted core cycles samples in in_pcblookup_hash went from 13% to 1.6% Tested by LLNW and pho@ Reviewed by: jtl Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D15686	2018-06-19 01:54:00 +00:00
Andrey V. Elsukov	20efcfc602	Switch RIB and RADIX_NODE_HEAD lock from rwlock(9) to rmlock(9). Using rwlock with multiqueue NICs for IP forwarding at high pps produces high lock contention and is inefficient. Rmlock fits better for such workloads. Reviewed by: melifaro, olivier Obtained from: Yandex LLC Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D15789	2018-06-16 08:26:23 +00:00
Gleb Smirnoff	61f63f47b3	Since 'ticks' is an int, it may wrap around and cr_ticks at a certain counter_rate will be greater than ticks, resulting in counter_ratecheck() failure. To fix this take an absolute value of the difference between ticks and cr_ticks. Reported by: jtl Sponsored by: Netflix	2018-06-15 21:36:16 +00:00
Bryan Drewery	03bd1b693e	proc0_post: Fix some locking issues - Filter out PRS_NEW procs as rufetch() tries taking the thread lock which may not yet be initialized. - Hold PROC_LOCK to ensure stability of iterating the threads. - p_rux fields are protected by the process statlock as well. MFC after: 2 weeks Reviewed by: kib Sponsored by: Dell EMC Differential Revision: https://reviews.freebsd.org/D15809	2018-06-15 00:36:41 +00:00
Olivier Houchard	78bcf87e3e	Use M_EXEC when calling malloc() to allocate the memory to store the module, as it'll contain executable code.	2018-06-14 23:10:10 +00:00
Brooks Davis	7d87c005da	Regen after 335177 (rename sys_obreak to sys_break).	2018-06-14 21:29:31 +00:00
Brooks Davis	9da5364ed9	Name the implementation of brk and sbrk sys_break(). The break() system call was renamed (several times) starting in v3 AT&T UNIX when C was invented and break was a language keyword. The last vestige of a need for it to be called something else (eg obreak) was removed in r225617 which consistently prefixed all syscall implementations. Reviewed by: emaste, kib (older version) Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D15638	2018-06-14 21:27:25 +00:00
Jonathan T. Looney	0766f278d8	Make UMA and malloc(9) return non-executable memory in most cases. Most kernel memory that is allocated after boot does not need to be executable. There are a few exceptions. For example, kernel modules do need executable memory, but they don't use UMA or malloc(9). The BPF JIT compiler also needs executable memory and did use malloc(9) until r317072. (Note that a side effect of r316767 was that the "small allocation" path in UMA on amd64 already returned non-executable memory. This meant that some calls to malloc(9) or the UMA zone(9) allocator could return executable memory, while others could return non-executable memory. This change makes the behavior consistent.) This change makes malloc(9) return non-executable memory unless the new M_EXEC flag is specified. After this change, the UMA zone(9) allocator will always return non-executable memory, and a KASSERT will catch attempts to use the M_EXEC flag to allocate executable memory using uma_zalloc() or its variants. Allocations that do need executable memory have various choices. They may use the M_EXEC flag to malloc(9), or they may use a different VM interface to obtain executable pages. Now that malloc(9) again allows executable allocations, this change also reverts most of r317072. PR: 228927 Reviewed by: alc, kib, markj, jhb (previous version) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D15691	2018-06-13 17:04:41 +00:00
Warner Losh	a971acbc25	Implement a 'car limit' for bioq. Allow one to implement a 'car limit' for bioq_disksort. debug.bioq_batchsize sets the size of car limit. Every time we queue that many requests, we start over so that we limit the latency for requests when the software queue depths are large. A value of '0', the default, means to revert to the old behavior. Sponsored by: Netflix	2018-06-13 16:48:07 +00:00
Bruce Evans	ab35e1c71b	Fix the encoding of major and minor numbers in 64-bit dev_t by restoring the old encodings for the lower 16 and 32 bits and only using the higher 32 bits for unusually large major and minor numbers. This change breaks compatibility with the previous encoding (which was only used in -current). Fix truncation to (essentially) 16-bit dev_t in newnfs v3. Any encoding of device numbers gives an ABI, so it can't be changed without translations for compatibility. Extra bits give the much larger complication that the translations need to compress into fewer bits. Fortunately, more than 32 bits are rarely needed, so compression is rarely needed except for 16-bit linux dev_t where it was always needed but never done. The previous encoding moved the major number into the top 32 bits. Almost no translation code handled this, so the major number was blindly truncated away in most 32-bit encodings. E.g., for ffs, mknod(8) with major = 1 and minor = 2 gave dev_t = 0x10000002; ffs cannot represent this and blindly truncated it to 2. But if this mknod was run on any released version of FreeBSD, it gives dev_t = 0x102. ffs can represent this, but in the previous encoding it was not decoded, giving major = 0, minor = 0x102. The presence of bugs was most obvious for exporting dev_t's from an old system to -current, since bugs in newnfs augment them. I fixed oldnfs to support 32-bit dev_t in 1996 (r16634), but this regressed to 16-bit dev_t in newnfs, first to the old 16-bit encoding and then further in -current. E.g., old ad0 with major = 234, minor = 0x10002 had the correct (major, minor) number on the wire, but newnfs truncated this to (234, 2) and then the previous encoding shifted the major number into oblivion as seen by ffs or old applications. I first tried to fix this by translating on every ABI/API boundary, but there are too many boundaries and too many sloppy translations by blind truncation. 
So use the old encoding for the low 32 bits so that sloppy translations work no worse than before provided the high 32 bits are not set. Add some error checking for when bits are lost. Keep not doing any error checking for translations for almost everything in compat/linux. compat/freebsd32/freebsd32_misc.c: Optionally check for losing bits after possibly-truncating assignments as before. compat/linux/linux_stats.c: Depend on the representation being compatible with Linux's (or just with itself for local use) and spell some of the translations as assignments in a macro that hides the details. fs/nfsclient/nfs_clcomsubs.c: Essentially the same fix as in 1996, except there is now no possible truncation in makedev() itself. Also fix nearby style bugs. kern/vfs_syscalls.c: As for freebsd32. Also update the sysctl description to include file numbers, and change it to describe device ids as device numbers. sys/types.h: Use inline functions (wrapped by macros) since the expressions are now a bit too complicated for plain macros. Describe the encoding and some of the reasons for it. 16-bit compatibility didn't leave many reasonable choices for the 32-bit encoding, and 32-bit compatibility doesn't leave many reasonable choices for the 64-bit encoding. My choice is to put the 8 new minor bits in the low 8 bits of the top 32 bits. This minimizes discontiguities. Reviewed by: kib (except for rewrite of the comment in linux_stats.c)	2018-06-13 12:22:00 +00:00
Bruce Evans	372639f944	Fix some bugs found while fixing the representation and translation of 64-bit dev_t's (but not ones involving dev_t's). st_size was supposed to be clamped in cvtstat() and linux's copy_stat(), but the clamping code wasn't aware that st_size is signed, and also had an obfuscated off-by-1 value for the unsigned limit, so its effect was to produce a bizarre negative size instead of clamping. Change freebsd32's copy_ostat() to be no worse than cvtstat(). It was missing clamping and bzero()ing of padding. Reviewed by: kib (except a final fix of the clamp to the signed maximum)	2018-06-13 08:50:43 +00:00
Ed Maste	00ce0c6258	makesyscalls: simplify capenabled pipeline Replace cat + 2x grep with one grep. Sponsored by: Turing Robotic Industries	2018-06-11 18:57:40 +00:00
Matt Macy	0ea9d9376e	limit change to fixing controlp handling pending review	2018-06-11 17:10:19 +00:00
Matt Macy	c34bf30069	soreceive_stream: correctly handle edge cases - non NULL controlp is not an error, returning EINVAL would cause X forwarding to fail - MSG_PEEK and MSG_WAITALL are fairly exceptional, but we still want to handle them - punt to soreceive_generic	2018-06-11 16:31:42 +00:00
Mateusz Guzik	0001edb823	counter: add a bit missed in r334858 It happens to be a noop.	2018-06-08 22:06:32 +00:00
Matt Macy	a62b4665f4	AF_UNIX: bring uipc_ready in compliance with new locking protocol PR: 228742 Submitted by: markj Reviewed by: markj	2018-06-08 20:31:59 +00:00
Jonathan T. Looney	1fbe13cf4b	Add a socket destructor callback. This allows kernel providers to set callbacks to perform additional cleanup actions at the time a socket is closed. Michio Honda presented a use for this at BSDCan 2018. (See https://www.bsdcan.org/2018/schedule/events/965.en.html .) Submitted by: Michio Honda <micchie at sfc.wide.ad.jp> (previous version) Reviewed by: lstewart (previous version) Differential Revision: https://reviews.freebsd.org/D15706	2018-06-08 19:35:24 +00:00
Mateusz Guzik	b8af2820f6	uma: fix up r334824 Turns out there is code which ends up passing M_ZERO to counters. Since counters zero unconditionally on their own, just drop the flag in that place.	2018-06-08 05:40:36 +00:00
Matt Macy	eb7c901995	hwpmc: simplify calling convention for hwpmc interrupt handling pmc_process_interrupt takes 5 arguments when only 3 are needed. cpu is always available in curcpu and inuserspace can always be derived from the passed trapframe. While facially a reasonable cleanup this change was motivated by the need to workaround a compiler bug. core2_intr(cpu, tf) -> pmc_process_interrupt(cpu, ring, pmc, tf, inuserspace) -> pmc_add_sample(cpu, ring, pm, tf, inuserspace) In the process of optimizing the tail call the tf pointer was getting clobbered: (kgdb) up at /storage/mmacy/devel/freebsd/sys/dev/hwpmc/hwpmc_mod.c:4709 4709 pmc_save_kernel_callchain(ps->ps_pc, (kgdb) up 1205 error = pmc_process_interrupt(cpu, PMC_HR, pm, tf, resulting in a crash in pmc_save_kernel_callchain.	2018-06-08 04:58:03 +00:00
Randall Stewart	89e560f441	This commit brings in a new refactored TCP stack called Rack. Rack includes the following features: - A different SACK processing scheme (the old sack structures are not used). - RACK (Recent acknowledgment) where counting dup-acks is no longer done instead time is used to know when to retransmit. (see the I-D) - TLP (Tail Loss Probe) where we will probe for tail-losses to attempt to try not to take a retransmit time-out. (see the I-D) - Burst mitigation using TCPHPTS - PRR (proportional rate reduction) see the RFC. Once built into your kernel, you can select this stack by either socket option with the name of the stack is "rack" or by setting the global sysctl so the default is rack. Note that any connection that does not support SACK will be kicked back to the "default" base FreeBSD stack (currently known as "default"). To build this into your kernel you will need to enable in your kernel: makeoptions WITH_EXTRA_TCP_STACKS=1 options TCPHPTS Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D15525	2018-06-07 18:18:13 +00:00
Alan Cox	5b274055d1	When pidctrl_daemon() is called multiple times within an interval, it should use the cumulative error to calculate the output.	2018-06-07 07:48:50 +00:00
Matt Macy	fcabd54160	AF_UNIX: check for unp == unp2 on disconnect	2018-06-07 04:57:40 +00:00
Alan Cox	e768070ca9	pidctrl_daemon() implements a variation on the classical, discrete PID controller that tries to handle early invocations of the controller, in other words, invocations before the expected end of the interval. However, there were some calculation errors in this early invocation case. Notably, if an early invocation occurred while the error was negative, the derivative term was off by a large amount. One visible effect of this error was that processes were being killed by the virtual memory system's OOM killer when in fact there was plentiful free memory. Correct a couple minor errors in the sysctl descriptions, and apply some style fixes. Reviewed by: jeff, markj	2018-06-07 02:54:11 +00:00
Sean Bruno	1a43cff92a	Load balance sockets with new SO_REUSEPORT_LB option. This patch adds a new socket option, SO_REUSEPORT_LB, which allows multiple programs or threads to bind to the same port and incoming connections will be load balanced using a hash function. Most of the code was copied from a similar patch for DragonflyBSD. However, in DragonflyBSD, load balancing is a global on/off setting and can not be set per socket. This patch allows for simultaneous use of both the current SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system. Required changes to structures: Globally change so_options from 16 to 32 bit value to allow for more options. Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets. Limitations: As in DragonflyBSD, a load balance group is limited to 256 pcbs (256 programs or threads sharing the same socket). This is a substantially different contribution as compared to its original incarnation at svn r332894 and reverted at svn r332967. Thanks to rwatson@ for the substantive feedback that is included in this commit. Submitted by: Johannes Lundberg <johalun0@gmail.com> Obtained from: DragonflyBSD Relnotes: Yes Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D11003	2018-06-06 15:45:57 +00:00
Justin Hibbits	3f9e1fc8ee	Revert r334708 This is the wrong place to put the barrier. Requested by: kib,mjg	2018-06-06 15:12:19 +00:00
Justin Hibbits	32c369f40c	Add a memory barrier after taking a reference on the vnode holdcnt in _vhold This is needed to avoid a race between the VNASSERT() below, and another thread updating the VI_FREE flag, on weakly-ordered architectures. On a 72-thread POWER9, without this barrier a 'make -j72 buildworld' would panic on the assert regularly. It may be possible to use a weaker barrier, and I'll investigate that once all stability issues are worked out on POWER9.	2018-06-06 12:57:11 +00:00
Matt Macy	ebfaf69cc0	hwpmc: log name->pid, name->tid mappings By logging all threads and processes 'pmc filter' can now filter on process or thread name, relieving the user of the burden of determining which tid or pid was which when the sample was taken. % pmc filter -T if_io_tqg -P nginx pmc.log pmc-iflib.log % pmc filter -x -T idle pmc.log pmc-noidle.log	2018-06-05 04:26:40 +00:00
Mark Johnston	97bc9a9384	Regen after r334626.	2018-06-04 19:36:47 +00:00
Mark Johnston	9f9c9b22ec	Reimplement brk() and sbrk() to avoid the use of _end. Previously, libc.so would initialize its notion of the break address using _end, a special symbol emitted by the static linker following the bss section. Compatibility issues between lld and ld.bfd could cause the wrong definition of _end (libc.so's definition rather than that of the executable) to be used, breaking the brk()/sbrk() interface. Avoid this problem and future interoperability issues by simply not relying on _end. Instead, modify the break() system call to return the kernel's view of the current break address, and have libc initialize its state using an extra syscall upon the first use of the interface. As a side effect, this appears to fix brk()/sbrk() usage in executables run with rtld direct exec, since the kernel and libc.so no longer maintain separate views of the process' break address. PR: 228574 Reviewed by: kib (previous version) MFC after: 2 months Differential Revision: https://reviews.freebsd.org/D15663	2018-06-04 19:35:15 +00:00
Alan Cox	3e7cb27cdd	Use a single, consistent approach to returning success versus failure in vm_map_madvise(). Previously, vm_map_madvise() used a traditional Unix-style "return (0);" to indicate success in the common case, but Mach-style return values in the edge cases. Since KERN_SUCCESS equals zero, the only problem with this inconsistency was stylistic. vm_map_madvise() has exactly two callers in the entire source tree, and only one of them cares about the return value. That caller, kern_madvise(), can be simplified if vm_map_madvise() consistently uses Unix-style return values. Since vm_map_madvise() uses the variable modify_map as a Boolean, make it one. Eliminate a redundant error check from kern_madvise(). Add a comment explaining where the check is performed. Explicitly note that exec_release_args_kva() doesn't care about vm_map_madvise()'s return value. Since MADV_FREE is passed as the behavior, the return value will always be zero. Reviewed by: kib, markj MFC after: 7 days	2018-06-04 16:28:06 +00:00
Matt Macy	5de96e33d6	hwpmc: support sampling both kernel and user stacks when interrupted in kernel This adds the -U option to pmcstat which will attribute in-kernel samples back to the user stack that invoked the system call. It is not the default, because when looking at kernel profiles it is generally more desirable to merge all instances of a given system call together. Although heavily revised, this change is directly derived from D7350 by Jonathan T. Looney. Obtained from: jtl Sponsored by: Juniper Networks, Limelight Networks	2018-06-04 01:10:23 +00:00
Mateusz Guzik	d0a22279db	Remove an unused argument to turnstile_unpend. PR: 228694 Submitted by: Julian Pszczołowski <julian.pszczolowski@gmail.com>	2018-06-02 22:37:53 +00:00
Mateusz Guzik	34c538c356	malloc: try to use builtins for zeroing at the callsite Plenty of allocation sites pass M_ZERO and sizes which are small and known at compilation time. Handling them internally in malloc loses this information and results in avoidable calls to memset. Instead, let the compiler take the advantage of it whenever possible. Discussed with: jeff	2018-06-02 22:20:09 +00:00
Mark Johnston	3fb14f61e1	Avoid completing I/O when dumping core after a panic. Filesystem or pager completion callbacks are generally non-functional after a panic and may trigger deadlocks if invoked in this context (e.g., by attempting to destroy a buffer mapping). To avoid this situation, short-circuit I/O completion in biodone(). Reviewed by: imp Discussed with: mav MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D15592	2018-06-01 23:49:32 +00:00
Ed Maste	b8d908b71e	ANSIfy sys/kern	2018-06-01 13:26:45 +00:00
Warner Losh	c580ca4cf4	Make the data returned by devinfo harder to overflow. Rather than using fixed-length strings, pack them into a string table to return. Also expand the buffer from ~300 characters to 3k. This should be enough, even for USB. This fixes a problem where USB pnp info is truncated on return to userland. Differential Revision: https://reviews.freebsd.org/D15629	2018-05-31 02:57:58 +00:00
Brooks Davis	64b378f1e1	Remove alternative names that are identical to the default. Verified by make sysent producing no changes.	2018-05-30 22:22:58 +00:00
Ed Maste	f912a970e6	link_elf_obj: correct an error message Previously we'd report that a file has "no valid symbol table" if it in fact had two or more. Change the message to report that there must be exactly one.	2018-05-30 12:55:27 +00:00
Matt Macy	e445381f13	epoch(9): make epoch closer to style(9)	2018-05-30 03:39:57 +00:00
Stephen Hurd	3e0e6330b5	iflib: mark irq allocation name parameter as constant The name parameter passed to iflib_irq_alloc_generic and iflib_softirq_alloc_generic is never modified. Many places in code pass string literals and thus should not be modified. Mark the name parameter as a const char * instead, so that we enforce that the name is not modified before passing to bus_describe_intr() Submitted by: Jacob Keller <jacob.e.keller@intel.com> Reviewed by: kmacy Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D15343	2018-05-29 21:56:39 +00:00
Gleb Smirnoff	147cd40fe7	Revert second chunk of r333860. The warning from gcc is false positive. The npages won't be ever used in no space case.	2018-05-29 21:45:15 +00:00
Matt Macy	b99aa0fbb2	hwpmc: don't enter epoch section across mmap hook	2018-05-29 18:03:48 +00:00
Brooks Davis	d8b2f0790b	Correct pointer subtraction in KASSERT(). The assertion would never fire without truly spectacular future programming errors. Reported by: Coverity CID: 1391367, 1391368 Sponsored by: DARPA, AFRL	2018-05-29 17:49:03 +00:00
Andriy Gapon	ec6faf94c4	add support for console resuming, implement it for uart, use on x86 This change adds a new optional console method cn_resume and a kernel console interface cnresume. Consoles that may need to re-initialize their hardware after suspend (e.g., because firmware does not care to do it) will implement cn_resume. Note that it is called in rather early environment not unlike early boot, so the same restrictions apply. Platform specific code, for platforms that support hardware suspend, should call cnresume early after resume, before any console output is expected. This change fixes a problem with a system of mine failing to resume when a serial console is used. I found that the serial port was in a strange configuration and an attempt to write to it likely resulted in an infinite loop. To avoid adding cn_resume method to every console driver, CONSOLE_DRIVER macro has been extended to support optional methods. Reviewed by: imp, mav MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D15552	2018-05-29 16:16:24 +00:00
Matt Macy	552b3e1798	witness/hwpmc: fix locking order for pmc locks	2018-05-28 23:14:38 +00:00
Eric van Gyzen	70d66bcf28	kern_cpuset: fix small leak on error path The "mask" was leaked on some error paths. Reported by: Coverity CID: 1384683 Sponsored by: Dell EMC	2018-05-26 14:23:11 +00:00
Eric van Gyzen	16b51429d2	kdb_trap: Fix use of uninitialized data In some cases, other_cpus was used without being initialized. Thankfully, it was harmless. Reported by: Coverity CID: 1385265 Sponsored by: Dell EMC	2018-05-26 14:01:44 +00:00
Brooks Davis	659a2e9243	Regen after r334223: make vadvise compat freebsd11.	2018-05-25 20:41:26 +00:00
Brooks Davis	7351a8bdb5	Make vadvise compat freebsd11. The vadvise syscall (aka ovadvise) is undocumented and has always been implemented as returning EINVAL. Put the syscall under COMPAT11 and provide a userspace implementation. Reviewed by: kib Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D15557	2018-05-25 20:40:23 +00:00
Matt Macy	acf9fd05d8	AF_UNIX: It is possible for UNIX datagram sockets to be connected to themselves. The updated code assumed that that could not happen and would try to lock the unp mutex twice. There may be a lingering issue here but this fixes it for the reporter. PR: 228458 Reported by: marieheleneka at gmail.com	2018-05-24 21:13:46 +00:00
Matt Macy	c684c14ce3	AF_UNIX: evidently Samba likes to connect a unix socket to itself, fix locking	2018-05-24 18:22:13 +00:00
Matt Macy	a3a734908b	AF_UNIX in connectat unp and unp2 can be the same	2018-05-24 18:22:05 +00:00
Conrad Meyer	a0638b33f7	Yank crufty INTR_FILTER option It was introduced to the tree in r169320 and r169321 in May 2007. It never got much use and never became a kernel default. The code duplicates the default path quite a bit, with slight modifications. Just yank out the cruft. Whatever goals were being aimed for can probably be met within the existing framework, without a flag day option. Mostly mechanical change: 'unifdef -m -UINTR_FILTER'. Reviewed by: mmacy Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D15546	2018-05-24 17:06:00 +00:00
Brooks Davis	5f77b8a88b	Avoid two suword() calls per auxarg entry. Instead, construct an auxargs array and copy it out all at once. Use an array of Elf_Auxinfo rather than pairs of Elf_Addr * to represent the array. This is the correct type where pairs of words just happened to work. To reduce the size of the diff, AUXARGS_ENTRY is altered to act on this array rather than introducing a new macro. Return errors on copyout() and suword() failures and handle them in the caller. Incidentally fixes AT_RANDOM and AT_EXECFN in 32-bit linux on amd64 which incorrectly used AUXARG_ENTRY instead of AUXARGS_ENTRY_32 (now removed due to the use of proper types). Reviewed by: kib Comments from: emaste, jhb Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D15485	2018-05-24 16:25:18 +00:00
Bjoern A. Zeeb	bb8f162363	Try to be consistent and spell "vnet" lower case like all the other options (and as we do on command line). Sponsored by: iXsystems, Inc.	2018-05-24 15:31:05 +00:00
Bjoern A. Zeeb	36b41cc336	Improve the KASSERT to also have the prison pointer. Helpful when debugging from ddb. Sponsored by: iXsystems, Inc.	2018-05-24 15:28:21 +00:00
Matt Macy	16529dace8	AF_UNIX: assert that we're not acquiring the same lock	2018-05-24 15:28:16 +00:00
Mateusz Guzik	6fee84e35e	Remove incorrect owepreempt assertion added in r334062 Yet another preemption request hitting between the counter being 0 and the check being reached will result in the flag no longer being set. Note the situation was already present prior to r334062 and is harmless. Reported by: pho Reviewed by: kib	2018-05-23 10:13:17 +00:00
Matt Macy	8a656309b3	kern_sendit: use pre-initialized rights	2018-05-23 01:48:09 +00:00
Mateusz Guzik	748b15fc02	Move preemption handling out of critical_exit. In preparation for making the enter/exit pair inline. Reviewed by: kib	2018-05-22 19:24:57 +00:00
Fabien Thomas	f8e73c47d8	Add a SPD cache to speed up lookups. When large SPDs are used, we face two problems: - too many CPU cycles are spent during the linear searches in the SPD for each packet - too much contention on multi socket systems, since we use a single shared lock. Main changes: - added the sysctl tree 'net.key.spdcache' to control the SPD cache (disabled by default). - cache the sp indexes that are used to perform SP lookups. - use a range of dedicated mutexes to protect the cache lines. Submitted by: Emeric Poupon <emeric.poupon@stormshield.eu> Reviewed by: ae Sponsored by: Stormshield Differential Revision: https://reviews.freebsd.org/D15050	2018-05-22 15:54:25 +00:00
Mateusz Guzik	ee252fc995	sx: fixup a braino in r334024 If a thread waiting on sx dropped Giant it would not be properly reacquired on exit from the routine, later resulting in panics indicating Giant is not held (when it should be). The bug was not present in the original patch sent to pho, I wittingly added it just prior to the commit and only smoke-tested it. Reported by: pho	2018-05-22 15:13:25 +00:00
Mateusz Guzik	99ece3a9cd	Reduce sdt-related branch-fest in mi_switch. The code was evaluating flags before resorting to checking if dtrace is enabled. This was inducing forward jumps in the common case.	2018-05-22 08:27:33 +00:00
Mateusz Guzik	2466d12b09	sx: port over writer starvation prevention measures from rwlock A constant stream of readers could completely starve writers and this is not a hypothetical scenario. The 'poll2_threads' test from the will-it-scale suite reliably starves writers even with concurrency < 10 threads. The problem was run into and diagnosed by dillon@backplane.com There was next to no change in lock contention profile during -j 128 pkg build, despite an sx lock being at the top. Tested by: pho	2018-05-22 07:20:22 +00:00
Mateusz Guzik	9feec7ef69	rw: decrease writer starvation Writers waiting on readers to finish can set the RW_LOCK_WRITE_SPINNER bit. This prevents most new readers from coming on. However, the last reader to unlock also clears the bit which means new readers can sneak in and the cycle starts over. Change the code to keep the bit after last unlock. Note that starvation potential is still there: no matter how many write spinners are there, there is one bit. After the writer unlocks, the lock is free to get raided by readers again. It is good enough for the time being. The real fix would include counting writers. This runs into a caveat: the writer which set the bit may now be preempted. In order to get rid of the problem all attempts to set the bit are preceded with critical_enter. The bit gets cleared when the thread which set it goes to sleep. This way an invariant holds that if the bit is set, someone is actively spinning and will grab the lock soon. In particular this means that readers which find the lock in this transient state can safely spin until the lock finds itself an owner (i.e. they don't need to block nor speculate how long to spin speculatively). Tested by: pho	2018-05-22 07:16:39 +00:00
Andriy Gapon	27dca831a6	stop and restart kernel event timers in the suspend / resume cycle I have a system that is very unstable after resuming from suspend-to-RAM but only if HPET is used as the event timer. The theory is that SMM code / firmware could be enabling HPET for its own uses and unexpected interrupts cause a trouble for it. Originally I wanted to solve the problem in hpet_suspend() method, but that was insufficient as the event timer could get reprogrammed again. So, it's better, for my case and in general, to stop the event timer(s) before entering the hardware suspend. MFC after: 4 weeks Differential Revision: https://reviews.freebsd.org/D15413	2018-05-21 20:23:04 +00:00
Mark Johnston	13679ebac9	Don't pass a section cookie to CK for non-preemptible epoch sections. They're only useful when multiple threads may share an epoch record, and that can't happen with non-preemptible sections. Reviewed by: mmacy Differential Revision: https://reviews.freebsd.org/D15507	2018-05-21 16:03:51 +00:00
Matt Macy	9725c9cef7	AF_UNIX gc unused label ...sigh	2018-05-20 21:37:34 +00:00
Matt Macy	cb8f450b94	AF_UNIX: Don't unlock unp/unp2 if they're not locked Reported by: mjg	2018-05-20 21:20:26 +00:00
Matt Macy	e10ef65d23	AF_UNIX: fix LOR introduced by the locking rewrite	2018-05-20 05:50:53 +00:00
Matt Macy	7118990962	Add additional preinitialized cap_rights	2018-05-20 05:13:12 +00:00
Mateusz Guzik	2186ee6e72	vfs: simplify vop_stdlock/unlock The interlock pointer is non-NULL by definition and the compiler sees through that and eliminates the NULL checks. Just remove them from the code as they play no role. No difference in generated assembly.	2018-05-20 04:45:05 +00:00
Matt Macy	d95253403f	AF_UNIX: make unpcb lock name line up with what's in witness	2018-05-20 04:32:48 +00:00
Warner Losh	f344fb0b4b	Restore the all rights reserved language. Put it on each of the prior two copyrights. The line originated with the Berkeley Regents, who we have not approached about removing it (it's honestly too trivial to be worth that fight). Restore it to rwatson's line as well. He can decide if he wants it or not on his own. Matt clearly doesn't want it, per project preference and his own statements on IRC. Noticed by: rgrimes@	2018-05-19 17:29:57 +00:00
Ed Maste	1b30e10e48	Remove duplicate cap_no_rights from r333874 Archs using in-tree gcc were broken with `warning: redundant redeclaration of 'cap_no_rights' [-Wredundant-decls]`. Sponsored by: The FreeBSD Foundation	2018-05-19 11:37:02 +00:00
Matt Macy	f6a1a10613	Unbreak BeagleBone Black boot by collapsing 29 SYSINITs in to 1 Reported by: ilya at bakulin.de	2018-05-19 07:31:35 +00:00
Matt Macy	fc2e87be2b	intr unbreak KTR/LINT build	2018-05-19 07:04:43 +00:00
Matt Macy	4b06dee1e5	AF_UNIX: switch to annotations to avoid warnings	2018-05-19 05:37:58 +00:00
Matt Macy	acbde29858	capsicum: propagate const correctness	2018-05-19 05:14:05 +00:00
Matt Macy	ba3f7276c0	intr: eliminate / annotate unused stack locals	2018-05-19 05:12:18 +00:00
Matt Macy	7fd6841438	sendfile: annotate unused value and ensure that npages is actually initialized	2018-05-19 05:10:51 +00:00
Matt Macy	e1a92f058f	umtx: don't call umtxq_getchain unless the value is needed	2018-05-19 05:09:10 +00:00
Matt Macy	a6c7423a92	cpuset: revert and annotate instead	2018-05-19 05:07:31 +00:00
Matt Macy	6fa5abfdda	conf: revert last change and annotate unused var instead	2018-05-19 05:07:03 +00:00
Matt Macy	1c0336c1c1	kevent: annotate unused stack local	2018-05-19 05:06:18 +00:00
Matt Macy	788390df0a	lockf: annotate LOCKF_DEBUG only var	2018-05-19 05:04:38 +00:00
Matt Macy	d1230b1159	capsicum: annotate variable only used by debug	2018-05-19 05:02:40 +00:00
Matt Macy	3adccf38e3	turnstile / sleepqueue: annotate variables only used by debug builds	2018-05-19 05:00:16 +00:00
Matt Macy	84482abd21	vfs: annotate variables only used by debug builds as __unused	2018-05-19 04:59:39 +00:00
Matt Macy	a2bb4e080e	tty: use __unused annotation instead to silence warnings	2018-05-19 04:48:26 +00:00
Matt Macy	5072a5f465	malloc: avoid possibly returning stack garbage if MALLOC_DEBUG is defined	2018-05-19 04:43:49 +00:00
Matt Macy	39eef2f45a	cpuset_thread0: avoid unused assignment on non debug build	2018-05-19 04:14:00 +00:00
Matt Macy	926cfdb8da	make_dev: avoid unused assignments on non debug builds	2018-05-19 04:13:20 +00:00
Matt Macy	4949ad7264	mqueue: avoid unused variables	2018-05-19 04:10:53 +00:00
Matt Macy	cd6ba3f086	physio: avoid uninitialized variables	2018-05-19 04:09:58 +00:00
Matt Macy	e9b1074bc7	cache_lookup remove unused variable and initialize used	2018-05-19 04:08:11 +00:00
Matt Macy	ec8d23352b	filt_timerdetach: only assign to old if we're going to check it in a KASSERT	2018-05-19 04:07:00 +00:00
Matt Macy	5cc2d25a2b	getnextevent: put variable only used by KTR under ifdef KTR	2018-05-19 04:05:36 +00:00
Matt Macy	bfd0eacb02	simplify control flow so that gcc knows we never pass save to curthread_pflags_restore without initializing	2018-05-19 04:04:44 +00:00
Matt Macy	3ef78c9c96	tty: conditionally assign to ret value only used by MPASS statement	2018-05-19 04:02:29 +00:00
Matt Macy	02fe8a2409	remove unused locked variable in lockmgr_unlock_fast_path	2018-05-19 03:58:40 +00:00
Matt Macy	ddd4d15ecd	signotify: don't create a stack local that isn't used on non-debug builds	2018-05-19 03:57:41 +00:00
Matt Macy	46117e1f0c	sysv_msg initialize saved_msgsz	2018-05-19 03:56:39 +00:00
Matt Macy	11d4f748d7	remove unused variable	2018-05-19 03:55:42 +00:00
Matt Macy	1dce110f63	fix uninitialized variable warning in reader locks	2018-05-19 03:52:55 +00:00
Matt Macy	b203713694	fix uninitialized variable warning	2018-05-19 03:49:36 +00:00
Matt Macy	ac8b2d5cb1	sys_process.c fix set but not used warning	2018-05-19 03:48:35 +00:00
Matt Macy	e339e43685	subr_epoch.c fix unused variable warnings	2018-05-19 03:47:37 +00:00
Matt Macy	ae6be8e6f7	pidctrl Actually use the variables that we assign to as seatbelts to prevent divide by zero Reviewed by: jeffr	2018-05-19 02:17:18 +00:00
Matt Macy	c0874c3468	fix gcc8 unused variable and set but not used variable in unix sockets add copyright from lock rewrite while here	2018-05-19 02:15:40 +00:00
Mateusz Guzik	10391db530	lockmgr: avoid atomic on unlock in the slow path The code is pretty much guaranteed not to be able to unlock. This is a minor nit. The code still performs way too many reads. The altered exclusive-locked condition is supposed to be always true as well, to be cleaned up at a later date.	2018-05-18 22:57:52 +00:00
Matt Macy	d7c5a620e2	ifnet: Replace if_addr_lock rwlock with epoch + mutex Run on LLNW canaries and tested by pho@ gallatin: Using a 14-core, 28-HTT single socket E5-2697 v3 with a 40GbE MLX5 based ConnectX 4-LX NIC, I see an almost 12% improvement in received packet rate, and a larger improvement in bytes delivered all the way to userspace. When the host receiving 64 streams of netperf -H $DUT -t UDP_STREAM -- -m 1, I see, using nstat -I mce0 1 before the patch: InMpps OMpps InGbs OGbs err TCP Est %CPU syscalls csw irq GBfree 4.98 0.00 4.42 0.00 4235592 33 83.80 4720653 2149771 1235 247.32 4.73 0.00 4.20 0.00 4025260 33 82.99 4724900 2139833 1204 247.32 4.72 0.00 4.20 0.00 4035252 33 82.14 4719162 2132023 1264 247.32 4.71 0.00 4.21 0.00 4073206 33 83.68 4744973 2123317 1347 247.32 4.72 0.00 4.21 0.00 4061118 33 80.82 4713615 2188091 1490 247.32 4.72 0.00 4.21 0.00 4051675 33 85.29 4727399 2109011 1205 247.32 4.73 0.00 4.21 0.00 4039056 33 84.65 4724735 2102603 1053 247.32 After the patch InMpps OMpps InGbs OGbs err TCP Est %CPU syscalls csw irq GBfree 5.43 0.00 4.20 0.00 3313143 33 84.96 5434214 1900162 2656 245.51 5.43 0.00 4.20 0.00 3308527 33 85.24 5439695 1809382 2521 245.51 5.42 0.00 4.19 0.00 3316778 33 87.54 5416028 1805835 2256 245.51 5.42 0.00 4.19 0.00 3317673 33 90.44 5426044 1763056 2332 245.51 5.42 0.00 4.19 0.00 3314839 33 88.11 5435732 1792218 2499 245.52 5.44 0.00 4.19 0.00 3293228 33 91.84 5426301 1668597 2121 245.52 Similarly, netperf reports 230Mb/s before the patch, and 270Mb/s after the patch Reviewed by: gallatin Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D15366	2018-05-18 20:13:34 +00:00
Matt Macy	20ba6811e6	epoch(9): assert that epoch is allocated post-configure	2018-05-18 18:27:17 +00:00
Ed Maste	891cf3ed44	Use NULL for SYSINIT's last arg, which is a pointer type Sponsored by: The FreeBSD Foundation	2018-05-18 17:58:09 +00:00
Matt Macy	70398c2f86	epoch(9): Make epochs non-preemptible by default There are risks associated with waiting on a preemptible epoch section. Change the name to make them not be the default and document the issue under CAVEATS. Reported by: markj	2018-05-18 17:29:43 +00:00
Matt Macy	60b7b90d65	epoch: actually allocate the counters we've assigned sysctls to Approved by: sbruno	2018-05-18 02:57:39 +00:00
Matt Macy	5e68a3dfe3	epoch: add non-preemptible "critical" variant adds: - epoch_enter_critical() - can be called inside a different epoch, starts a section that will not acquire any MTX_DEF mutexes or do anything that might sleep. - epoch_exit_critical() - corresponding exit call - epoch_wait_critical() - wait variant that is guaranteed that any threads in a section are running. - epoch_global_critical - an epoch_wait_critical safe epoch instance Requested by: markj Approved by: sbruno	2018-05-18 01:52:51 +00:00
Brooks Davis	dedc82ae26	Use strsep() to parse init_path in start_init(). This simplifies the use of the path variable by making it NUL terminated. This is a prerequisite for further cleanups. Reviewed by: imp Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D15467	2018-05-17 23:07:51 +00:00
Matt Macy	a5f1042498	epoch: skip poll function call in hardclock unless there are callbacks pending Reported by: mjg Approved by: sbruno	2018-05-17 21:39:15 +00:00
Matt Macy	c4d901e9bd	epoch(9): schedule pcpu callback task in hardclock if there are callbacks pending Approved by: sbruno	2018-05-17 19:57:07 +00:00
Matt Macy	2a45e8282a	epoch(9): eliminate the need to wait when polling for callbacks to run by using ck's own callback handling mechanism we can simply check which callbacks have had a grace period elapse Approved by: sbruno	2018-05-17 19:50:55 +00:00
Matt Macy	d1bcb409f6	epoch(9): fix potential deadlock Don't acquire a waiting thread's lock while holding our own Approved by: sbruno	2018-05-17 19:41:58 +00:00
Matt Macy	766d225326	epoch(9): restore thread priority on exit if it was changed by a waiter Reported by: markj Approved by: sbruno	2018-05-17 19:08:28 +00:00
Matt Macy	75a67bf3d0	AF_UNIX: make unix socket locking finer grained This change moves to using a reference count across lock drop / reacquire to guarantee liveness. Currently sends on unix sockets contend heavily on read locking the list lock. unix1_processes in will-it-scale peaks at 6 processes and then declines. With this change I get a substantial improvement in number of operations per second with 96 processes: x before + after N Min Max Median Avg Stddev x 11 1688420 1696389 1693578 1692766.3 2971.1702 + 10 63417955 71030114 70662504 69576423 2374684.6 Difference at 95.0% confidence 6.78837e+07 +/- 1.49463e+06 4010.22% +/- 88.4246% (Student's t, pooled s = 1.63437e+06) And even for 2 processes shows a ~18% improvement. "Small" iron changes (1, 2, and 4 processes): x before1 + after1.2 [ministat distribution plot omitted] N Min Max Median Avg Stddev x 10 1131648 1197750 1197138.5 1190369.3 20651.839 + 10 1203840 1205056 1204919 1204827.9 353.27404 Difference at 95.0% confidence 14458.6 +/- 13723 1.21463% +/- 1.16683% (Student's t, pooled s = 14605.2) x before2 + after2.2 [ministat distribution plot omitted] N Min Max Median Avg Stddev x 10 1972843 2045866 2038186.5 2030443.8 21367.694 + 10 2400853 2402196 2401043.5 2401172.7 385.40024 Difference at 95.0% confidence 370729 +/- 14198.9 18.2585% +/- 0.826943% (Student's t, pooled s = 15111.7) x before4 + after4.2 N Min Max Median Avg Stddev x 10 3986994 3991728 3990137.5 3989985.2 1300.0164 + 10 4799990 4806664 4806116.5 4805194 1990.6625 Difference at 95.0% confidence 815209 +/- 1579.64 20.4314% +/- 0.0421713% (Student's t, pooled s = 1681.19) Tested by: pho Reported by: mjg Approved by: sbruno Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D15430	2018-05-17 17:59:35 +00:00
Matt Macy	fdf71aeb54	epoch(9): make recursion lighter weight There isn't any real work to do except bump td_epochnest when recursing. Skip the additional work in this case. Approved by: sbruno	2018-05-17 01:13:40 +00:00
Matt Macy	b8205686b4	epoch(9): Guarantee forward progress on busy sections Add epoch section to struct thread. We can use this to enable the epoch counter to advance even if a section is perpetually occupied by a thread. Approved by: sbruno	2018-05-17 00:45:35 +00:00
Matt Macy	6161b98c99	hwpmc: Implement per-thread counters for PMC sampling This implements per-thread counters for PMC sampling. The thread descriptors are stored in a list attached to the process descriptor. These thread descriptors can store any per-thread information necessary for current or future features. For the moment, they just store the counters for sampling. The thread descriptors are created when the process descriptor is created. Additionally, thread descriptors are created or freed when threads are started or stopped. Because the thread exit function is called in a critical section, we can't directly free the thread descriptors. Hence, they are freed to a cache, which is also used as a source of allocations when needed for new threads. Approved by: sbruno Obtained from: jtl Sponsored by: Juniper Networks, Limelight Networks Differential Revision: https://reviews.freebsd.org/D15335	2018-05-16 22:29:20 +00:00
Jean-Sébastien Pédron	547e74a8be	teken, vt(4): New callbacks to lock the terminal once ... to process input, instead of inside each smaller operations such as appending a character or moving the cursor forward. In other words, before we were doing (oversimplified): teken_input() <for each input character> vtterm_putchar() VTBUF_LOCK() VTBUF_UNLOCK() vtterm_cursor_position() VTBUF_LOCK() VTBUF_UNLOCK() Now, we are doing: vtterm_pre_input() VTBUF_LOCK() teken_input() <for each input character> vtterm_putchar() vtterm_cursor_position() vtterm_post_input() VTBUF_UNLOCK() The situation was even worse when the vtterm_copy() and vtterm_fill() callbacks were involved. The new callbacks are: * struct terminal_class->tc_pre_input() * struct terminal_class->tc_post_input() They are called in teken_input(), surrounding the while() loop. The goal is to improve input processing speed of vt(4). As a benchmark, here is the time taken to write a text file of 360 000 lines (26 MiB) on `ttyv0`: * vt(4), unmodified: 1500 ms * vt(4), with this patch: 1200 ms * syscons(4): 700 ms This is on a Haswell laptop with a GENERIC-NODEBUG kernel. At the same time, the locking is changed in the vt_flush() function which is responsible to draw the text on screen. So instead of (indirectly) using VTBUF_LOCK() just to read and reset the dirty area of the internal buffer, the lock is held for about the entire function, including the drawing part. The change is mostly visible while content is scrolling fast: before, lines could appear garbled while scrolling because the internal buffer was accessed without locks (once the scrolling was finished, the output was correct). Now, the scrolling appears correct. In the end, the locking model is closer to what syscons(4) does. Differential Revision: https://reviews.freebsd.org/D15302	2018-05-16 09:01:02 +00:00
Ed Maste	ea0939f0af	subr_pidctrl: use standard 2-Clause FreeBSD license and disclaimer Approved by: jeff	2018-05-15 00:50:09 +00:00
Matt Macy	0f00315cb3	hwpmc: fix load/unload race and vm map LOR - fix load/unload race by allocating the per-domain list structure at boot - fix long extant vm map LOR by replacing pmc_sx sx_slock with global_epoch to protect the liveness of elements of the pmc_ss_owners list Reported by: pho Approved by: sbruno	2018-05-14 00:21:04 +00:00
Matt Macy	0c58f85b8d	epoch(9): allow sx locks to be held across epoch_wait() The INVARIANTS checks in epoch_wait() were intended to prevent the block handler from returning with locks held. What it in fact did was preventing anything except Giant from being held across it. Check that the number of locks held has not changed instead. Approved by: sbruno@	2018-05-14 00:14:00 +00:00
Matt Macy	1f4beb6312	epoch(9): cleanups, additional debug checks, and add global_epoch - GC the _nopreempt routines - to really benefit we'd need a separate routine - they're not currently in use - they complicate the API for no benefit at this time - check that we're actually in a epoch section at exit - handle epoch_call() early in boot - Fix copyright declaration language Approved by: sbruno@	2018-05-13 23:24:48 +00:00
Konstantin Belousov	2ebc882927	Detect and optimize reads from the hole on UFS. - Create getblkx(9) variant of getblk(9) which can return error. - Add GB_NOSPARSE flag for getblk()/getblkx() which requests that BMAP was performed before the buffer is created, and EJUSTRETURN returned in case the requested block does not exist. - Make ffs_read() use GB_NOSPARSE to avoid instantiating buffer (and allocating the pages for it), copying from zero_region instead. The end result is less page allocations and buffer recycling when a hole is read, which is important for some benchmarks. Requested and reviewed by: jeff Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D14917	2018-05-13 09:47:28 +00:00
Matt Macy	f1401123c5	hwpmc/epoch - don't reference domain if NUMA is not set It appears that domain information is set correctly independent of whether or not NUMA is defined. However, there is no memory backing secondary domains leading to allocation failure. Reported by: pho@, np@ Approved by: sbruno@	2018-05-12 20:00:29 +00:00
Matt Macy	e6b475e0af	hwpmc(9): Make pmclog buffer pcpu and update constants On non-trivial SMP systems the contention on the pmc_owner mutex leads to a substantial number of samples captured being from the pmc process itself. This change a) makes buffers larger to avoid contention on the global list b) makes the working sample buffer per cpu. Run pmcstat in the background (default event rate of 64k): pmcstat -S UNHALTED_CORE_CYCLES -O /dev/null sleep 600 & Before: make -j96 buildkernel -s >&/dev/null 3336.68s user 24684.10s system 7442% cpu 6:16.50 total After: make -j96 buildkernel -s >&/dev/null 2697.82s user 1347.35s system 6058% cpu 1:06.77 total For more realistic overhead measurement set the sample rate for ~2khz on a 2.1Ghz processor: pmcstat -n 1050000 -S UNHALTED_CORE_CYCLES -O /dev/null sleep 6000 & Collecting 10 samples of `make -j96 buildkernel` from each: x before + after real time: N Min Max Median Avg Stddev x 10 76.4 127.62 84.845 88.577 15.100031 + 10 59.71 60.79 60.135 60.179 0.29957192 Difference at 95.0% confidence -28.398 +/- 10.0344 -32.0602% +/- 7.69825% (Student's t, pooled s = 10.6794) system time: N Min Max Median Avg Stddev x 10 2277.96 6948.53 2949.47 3341.492 1385.2677 + 10 1038.7 1081.06 1070.555 1064.017 15.85404 Difference at 95.0% confidence -2277.47 +/- 920.425 -68.1574% +/- 8.77623% (Student's t, pooled s = 979.596) x no pmc + pmc running real time: HEAD: N Min Max Median Avg Stddev x 10 58.38 59.15 58.86 58.847 0.22504567 + 10 76.4 127.62 84.845 88.577 15.100031 Difference at 95.0% confidence 29.73 +/- 10.0335 50.5208% +/- 17.0525% (Student's t, pooled s = 10.6785) patched: N Min Max Median Avg Stddev x 10 58.38 59.15 58.86 58.847 0.22504567 + 10 59.71 60.79 60.135 60.179 0.29957192 Difference at 95.0% confidence 1.332 +/- 0.248939 2.2635% +/- 0.426506% (Student's t, pooled s = 0.264942) system time: HEAD: N Min Max Median Avg Stddev x 10 1010.15 1073.31 1025.465 1031.524 18.135705 + 10 2277.96 6948.53 2949.47 3341.492 1385.2677 Difference at 95.0% confidence 2309.97 +/- 920.443 223.937% +/- 89.3039% (Student's t, pooled s = 979.616) patched: N Min Max Median Avg Stddev x 10 1010.15 1073.31 1025.465 1031.524 18.135705 + 10 1038.7 1081.06 1070.555 1064.017 15.85404 Difference at 95.0% confidence 32.493 +/- 16.0042 3.15% +/- 1.5794% (Student's t, pooled s = 17.0331) Reviewed by: jeff@ Approved by: sbruno@ Differential Revision: https://reviews.freebsd.org/D15155	2018-05-12 01:26:34 +00:00
Matt Macy	8dcbd0eae6	epoch(9): always set inited in epoch_init - set inited in the !usedomains case Reported by: jhibbits Approved by: sbruno	2018-05-11 18:37:14 +00:00
Matt Macy	4aa302dfc9	epoch(9): callback task fixes - initialize the pcpu STAILQ in the NUMA case - don't enqueue the callback task if there isn't sufficient work to be done Reported by: pho@ Approved by: sbruno@	2018-05-11 08:16:56 +00:00
Mateusz Guzik	85c1b3c1cb	rmlock: partially depessimize lock/unlock fastpath Previously the slow path was folded in and partially jumped over in the common case.	2018-05-11 06:59:54 +00:00
Matt Macy	b2cb28963b	epoch(9): fix priority handling, make callback lists pcpu, and other fixes - Lend priority to preempted threads in epoch_wait to handle the case in which we've had priority lent to us. Previously we borrowed the priority of the lowest priority preempted thread. (pointed out by mjg@) - Don't attempt to allocate memory per-domain on powerpc, we don't currently handle empty sockets (as is the case on jhibbits Talos' board). - Handle deferred callbacks as pcpu lists and poll the lists periodically. Currently the interval is 1/hz. - Drop the thread lock when adaptive spinning. Holding the lock starves other threads and can even lead to lockups. - Keep a generation count pcpu so that we don't keep spinning if a thread has left and re-entered an epoch section. - Actually removed the callback from the callback list so that we don't double free. Sigh ... Approved by: sbruno@	2018-05-11 04:54:12 +00:00
Matt Macy	06bf2a6aef	Add simple preempt safe epoch API Read locking is overused in the kernel to guarantee liveness. This API makes it easy to provide liveness guarantees without atomics. Includes epoch_test kernel module to stress test the API. Documentation will follow initial use case. Test case and improvements to preemption handling in response to discussion with mjg@ Reviewed by: imp@, shurd@ Approved by: sbruno@	2018-05-10 17:55:24 +00:00
Andrew Gallatin	d5cdcc3a06	Fix the build after r333457 In r333457, the arguments to kern_pwritev() were accidentally re-ordered as part of ANSIfication, breaking the build.	2018-05-10 13:19:42 +00:00
Ed Maste	cc3c9df80f	ANSIfy sys_generic.c	2018-05-10 11:36:16 +00:00
Matt Macy	36688f706e	Add taskqgroup_config_gtask_deinit to support teardown after taskqgroup_config_gtask_init. Approved by: sbruno	2018-05-09 18:51:35 +00:00
Matt Macy	cbd92ce62e	Eliminate the overhead of gratuitous repeated reinitialization of cap_rights - Add macros to allow preinitialization of cap_rights_t. - Convert most commonly used code paths to use preinitialized cap_rights_t. A 3.6% speedup in fstat was measured with this change. Reported by: mjg Reviewed by: oshogbo Approved by: sbruno MFC after: 1 month	2018-05-09 18:47:24 +00:00
Konstantin Belousov	55c9d75e6b	Avoid calls to bzero() before ireloc. Evaluate cpu_stdext_feature early to have moved link_elf_ireloc() see correct flags, most important is SMAP. Tested by: mjg Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D15367	2018-05-09 14:39:24 +00:00
Matt Macy	ad738f3791	Reduce overhead of ktrace checks in the common case. KTRPOINT() checks both if we are tracing _and_ if we are recursing within ktrace. The second condition is only ever executed if ktrace is actually enabled. This change moves the check out of the hot path in to the functions themselves. Discussed with mjg@ Reported by: mjg@ Approved by: sbruno@	2018-05-09 00:00:47 +00:00
Mateusz Guzik	2824088536	Inlined sched_userret. The tested condition is rarely true and it induces a function call on each return to userspace. Bumps getuid rate by about 1% on Broadwell.	2018-05-07 23:36:16 +00:00
Mateusz Guzik	75e9b455a9	Change trap_enotcap to bool and annotate with __read_frequently It is read on each return to user space.	2018-05-07 23:10:12 +00:00
Mateusz Guzik	79ca7cbf09	Avoid calls to syscall_thread_enter/exit for statically defined syscalls The entire mechanism is rarely used and is not very performant due to atomic ops on the syscall table. It also has added overhead for completely unrelated syscalls. Reduce it by avoiding the func calls if possible (which constitutes the vast majority of cases). Provides about 3% syscall rate speed up for getuid on Broadwell.	2018-05-07 22:29:32 +00:00
Warner Losh	ad7142757b	Add device_quiet_children() and device_has_quiet_children() If you add a child to a device that has quiet children, we'll automatically set the quiet flag on the children, and its children. This is indended for things like CPU that have a large amount of repetition in booting that adds nothing.	2018-05-07 21:09:08 +00:00
Andrew Gallatin	e7bd0750af	Boost thread priority while changing CPU frequency Boost the priority of user-space threads when they set their affinity to a core to adjust its frequency. This avoids a situation where a CPU bound kernel thread with the same affinity is running on a down-clocked core, and will "block" powerd from up-clocking the core until the kernel thread yields. This can lead to poor perfomance, and to things potentially getting stuck on Giant. Reviewed by: kib (imp reviewed earlier version) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D15246	2018-05-07 15:24:03 +00:00
Mark Johnston	bd92e6b6f5	Refactor some of the MI kernel dump code in preparation for netdump. - Add clear_dumper() to complement set_dumper(). - Drain netdump's preallocated mbuf pool when clearing the dumper. - Don't do bounds checking for dumpers with mediasize 0. - Add dumper callbacks for initialization for writing out headers. Reviewed by: sbruno MFC after: 1 month Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D15252	2018-05-06 00:22:38 +00:00
Mark Johnston	5475ca5aca	Add an mbuf allocator for netdump. The aim is to permit mbuf allocations after a panic without calling into the page allocator, without imposing any runtime overhead during regular operation of the system, and without modifying driver code. The approach taken is to preallocate a number of mbufs and clusters, storing them in linked lists, and using the lists to back some UMA cache zones. At panic time, the mbuf and cluster zone pointers are overwritten with those of the cache zones so that the mbuf allocator returns preallocated items. Using this scheme, drivers which cache mbuf zone pointers from m_getzone() require special handling when implementing netdump support. Reviewed by: cem (earlier version), julian, sbruno MFC after: 1 month Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D15251	2018-05-06 00:19:48 +00:00
Mark Johnston	c2ba2d1b0e	Style. MFC after: 3 days	2018-05-06 00:11:30 +00:00
Andriy Gapon	bd3afae0ca	for bus suspend, detach and shutdown iterate children in reverse order For most buses all children are equal, so the order does not matter. Other buses, such as acpi, carefully order their child devices to express implicit dependencies between them. For such buses it is safer to bring down devices in the reverse order. I believe that this is the reason why hpet_suspend had to be disabled. Some drivers depend on a working event timer until they are suspended. But previously we would suspend hpet very early. I tested this change by making hpet_suspend actually stop HPET timers and tested that too. Note that this change is not a complete solution as it does not take into account bus passes. A better approach would be to track the actual attach order of the devices and to use the reverse of that. Reviewed by: imp, mav MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D15291	2018-05-05 05:19:32 +00:00
Mateusz Guzik	5ec2c93667	tc: bcopy -> memcpy	2018-05-04 22:48:10 +00:00
Jamie Gritton	0e5c6bd436	Make it easier for filesystems to count themselves as jail-enabled, by doing most of the work in a new function prison_add_vfs in kern_jail.c Now a jail-enabled filesystem need only mark itself with VFCF_JAIL, and the rest is taken care of. This includes adding a jail parameter like allow.mount.foofs, and a sysctl like security.jail.mount_foofs_allowed. Both of these used to be a static list of known filesystems, with predefined permission bits. Reviewed by: kib Differential Revision: D14681	2018-05-04 20:54:27 +00:00
Mark Johnston	1b5c869d64	Fix some races introduced in r332974. With r332974, when performing a synchronized access of a page's "queue" field, one must first check whether the page is logically dequeued. If so, then the page lock does not prevent the page from being removed from its page queue. Introduce vm_page_queue(), which returns the page's logical queue index. In some cases, direct access to the "queue" field is still required, but such accesses should be confined to sys/vm. Reported and tested by: pho Reviewed by: kib Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D15280	2018-05-04 17:17:30 +00:00
Matt Macy	748ff486b0	`dup1_processes -t 96 -s 5` on a dual 8160 (x = dup_before, + = dup_after) [ASCII distribution plot omitted] N Min Max Median Avg Stddev; x: 5 1.514954e+08 1.5230351e+08 1.5206157e+08 1.5199371e+08 341205.71; +: 5 1.5494336e+08 1.5519569e+08 1.5511982e+08 1.5508323e+08 96232.829. Difference at 95.0% confidence: 3.08952e+06 +/- 365604, 2.03266% +/- 0.245071% (Student's t, pooled s = 250681) Reported by: mjg@ MFC after: 1 week	2018-05-04 06:51:01 +00:00
Konstantin Belousov	7035cf14ee	Implement support for ifuncs in the kernel linker. Required MD bits are only provided for x86. Reviewed by: jhb (previous version, as part of the larger patch) Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D13838	2018-05-03 21:37:46 +00:00
Stephen Hurd	f3e1324b41	Separate list manipulation locking from state change in multicast Multicast incorrectly calls in to drivers with a mutex held causing drivers to have to go through all manner of contortions to use a non sleepable lock. Serialize multicast updates instead. Submitted by: mmacy <mmacy@mattmacy.io> Reviewed by: shurd, sbruno Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D14969	2018-05-02 19:36:29 +00:00
Mark Johnston	20f85b1ddd	Print the dump progress indicator after calling dump_start(). Dumpers may wish to print messages from an initialization hook; this change ensures that such messages aren't mixed with output from the generic dump code. MFC after: 1 week	2018-05-01 17:32:43 +00:00
Nathan Whitehorn	ee900504cf	Report the kernel base address properly in kldstat when using PowerPC kernels loaded at addresses other than their link address.	2018-05-01 04:06:59 +00:00
Ed Maste	2216c6933c	Disable connectat/bindat with AT_FDCWD in capmode Previously it was possible to connect a socket (which had the CAP_CONNECT right) by calling "connectat(AT_FDCWD, ...)" even in capabilities mode. This combination should be treated the same as a call to connect (i.e. forbidden in capabilities mode). Similarly for bindat. Disable connectat/bindat with AT_FDCWD in capabilities mode, fix up the documentation and add tests. PR: 222632 Submitted by: Jan Kokemüller <jan.kokemueller@gmail.com> Reviewed by: Domagoj Stolfa MFC after: 1 week Relnotes: Yes Differential Revision: https://reviews.freebsd.org/D15221	2018-04-30 17:31:06 +00:00
Mateusz Guzik	9d68f7741f	systrace: track it like sdt probes While here predict false. Note the code is wrong (regardless of this change). Dereference of the pointer can race with module unload. A fix would set the probe to a nop stub instead of NULL.	2018-04-27 15:16:34 +00:00
Emmanuel Vadot	ee710ecf32	clk: Put the sysctls under hw.clock instead of clock This is more consistent with hw.regulator and other hardware related sysctls.	2018-04-27 00:12:00 +00:00
Mark Johnston	5cd29d0f3c	Improve VM page queue scalability. Currently both the page lock and a page queue lock must be held in order to enqueue, dequeue or requeue a page in a given page queue. The queue locks are a scalability bottleneck in many workloads. This change reduces page queue lock contention by batching queue operations. To detangle the page and page queue locks, per-CPU batch queues are used to reference pages with pending queue operations. The requested operation is encoded in the page's aflags field with the page lock held, after which the page is enqueued for a deferred batch operation. Page queue scans are similarly optimized to minimize the amount of work performed with a page queue lock held. Reviewed by: kib, jeff (previous versions) Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D14893	2018-04-24 21:15:54 +00:00
Sean Bruno	7875017ca9	Revert r332894 at the request of the submitter. Submitted by: Johannes Lundberg <johalun0_gmail.com> Sponsored by: Limelight Networks	2018-04-24 19:55:12 +00:00
Conrad Meyer	65df124845	Do not totally silence suppressed secondary kasserts unless debug.kassert.do_log is disabled To totally silence and ignore secondary kassert violations after a primary panic, set debug.kassert.do_log=0 and debug.kassert.suppress_in_panic=1. Additional assertion warnings shouldn't block core dump and may alert the developer to another erroneous condition. Secondary stack traces may be printed, identically to the unsuppressed case where panic() is reentered -- controlled via debug.trace_all_panics. Sponsored by: Dell EMC Isilon	2018-04-24 19:10:51 +00:00
Conrad Meyer	07aa6ea677	Fix debug.kassert.do_log description text This has been an (incorrect) copy-paste duplicate of debug.kassert.warn_only since it was originally committed in r243980. Sponsored by: Dell EMC Isilon	2018-04-24 18:59:40 +00:00
Conrad Meyer	ad1fc31570	panic: Optionally, trace secondary panics To diagnose and fix secondary panics, it is useful to have a stack trace. When panic tracing is enabled, optionally trace secondary panics as well. The option is configured with the tunable/sysctl debug.trace_all_panics. (The original concern that inspired only tracing the primary panic was likely that the secondary trace may scroll the original panic message or trace off the screen. This is less of a concern for serial consoles with logging. Not everything has a serial console, though, so the behavior is optional.) Discussed with: jhb Sponsored by: Dell EMC Isilon	2018-04-24 18:54:20 +00:00
Jonathan T. Looney	18959b695d	Update r332860 by changing the default from suppressing post-panic assertions to not suppressing post-panic assertions. There are some post-panic assertions that are valuable and we shouldn't default to disabling them. However, when a user trips over them, the user can still adjust the tunable/sysctl to suppress them temporarily to conduct troubleshooting (e.g. get a core dump). Reported by: cem, markj	2018-04-24 18:47:35 +00:00
Conrad Meyer	b543c98cab	lockmgr: Add missed neutering during panic r313683 introduced new lockmgr APIs that missed the panic-time neutering present in the rest of our locks. Correct that by adding the usual check. Additionally, move the __lockmgr_args neutering above the assertions at the top of the function. Drop the interlock unlock because we shouldn't have an unneutered interlock either. No point trying to unlock it. PR: 227749 Reported by: jtl Sponsored by: Dell EMC Isilon	2018-04-24 18:41:14 +00:00
Mateusz Guzik	d357c16adc	lockf: change the owner hash from pid to vnode-based This adds a bit missed due to the patch split, see r332882 Tested by: pho	2018-04-24 06:10:36 +00:00
Mateusz Guzik	7cd794214a	dtrace: depessimize dtmalloc when dtrace is active Each malloc/free was testing dtrace_malloc_enabled and forcing extra reads from the malloc type struct to see if perhaps a dtmalloc probe was on. Treat it like lockstat and sdt: have a global boolean.	2018-04-24 01:06:20 +00:00
Mateusz Guzik	4c5209cb21	lockstat: track lockstat just like sdt probes In particular flip the frequently tested var to bool.	2018-04-24 01:04:10 +00:00
Mateusz Guzik	c9e05ccd62	malloc: stop reading the subzone if MALLOC_DEBUG_MAXZONES == 1 (the default) malloc was showing at the top of the profile while running microbenchmarks. #define DTMALLOC_PROBE_MAX 2 struct malloc_type_internal { uint32_t mti_probes[DTMALLOC_PROBE_MAX]; u_char mti_zone; struct malloc_type_stats mti_stats[MAXCPU]; }; Reading mti_zone wastes a cacheline to hold mti_probes + mti_zone (which we know is 0) + part of malloc stats of the first cpu which on top induces false-sharing. In particular will-it-scale lock1_processes -t 128 -s 10: before: average:45879692 after: average:51655596 Note the counters can be padded but the right fix is to move them to counter(9), leaving the struct read-only after creation (modulo dtrace probes).	2018-04-23 22:28:49 +00:00
Sean Bruno	7b7796eea5	Load balance sockets with new SO_REUSEPORT_LB option This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple programs or threads to bind to the same port and incoming connections will be load balanced using a hash function. Most of the code was copied from a similar patch for DragonflyBSD. However, in DragonflyBSD, load balancing is a global on/off setting and can not be set per socket. This patch allows for simultaneous use of both the current SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system. Required changes to structures Globally change so_options from 16 to 32 bit value to allow for more options. Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets. Limitations As DragonflyBSD, a load balance group is limited to 256 pcbs (256 programs or threads sharing the same socket). Submitted by: Johannes Lundberg <johanlun0@gmail.com> Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D11003	2018-04-23 19:51:00 +00:00
Mateusz Guzik	833dc05a6e	lockf: add per-chain locks to the owner hash This combined with previous changes significantly depessimizes the behaviour under contention. In particular the lock1_processes test (locking/unlocking separate files) from the will-it-scale suite was executed with 128 concurrency on a 4-socket Broadwell with 128 hardware threads. Operations/second (lock+unlock) go from ~750000 to ~45000000 (6000%) For reference single-process is ~1680000 (i.e. on stock kernel the resulting perf is less than half of the single-threaded run), Note this still does not really scale all that well as the locks were just bolted on top of the current implementation. Significant room for improvement is still here. In particular the top performance fluctuates depending on the extent of false sharing in given run (which extends beyond the file). Added chain+lock pairs were not padded w.r.t. cacheline size. One big ticket item is the hash used for spreading threads: it used to be the process pid (which basically serialized all threaded ops). Temporarily the vnode addr was slapped in instead. Tested by: pho	2018-04-23 08:23:10 +00:00
Mateusz Guzik	63286976b5	lockf: skip locking the graph if not necessary (common case) Tested by: pho	2018-04-23 07:54:02 +00:00
Mateusz Guzik	717df0b0e8	lockf: perform wakeup only when there is anybody waiting Tested by: pho	2018-04-23 07:52:56 +00:00
Mateusz Guzik	c72ead2815	lockf: skip the hard work in lf_purgelocks if possible Tested by: pho	2018-04-23 07:52:10 +00:00
Mateusz Guzik	0d3323f557	lockf: free state only when recycling the vnode This avoids malloc/free cycles when locking/unlocking the vnode when nobody is contending. Tested by: pho	2018-04-23 07:51:19 +00:00
Tijl Coosemans	7dfbbc613b	Make bufdaemon and bufspacedaemon use kthread_suspend_check instead of kproc_suspend_check. In r329612 bufspacedaemon was turned into a thread of the bufdaemon process causing both to call kproc_suspend_check with the same proc argument and that function contains the following while loop: while (SIGISMEMBER(p->p_siglist, SIGSTOP)) { wakeup(&p->p_siglist); msleep(&p->p_siglist, &p->p_mtx, PPAUSE, "kpsusp", 0); } So one thread wakes up the other and the other wakes up the first again, locking up UP machines on shutdown. Also register the shutdown handlers with SHUTDOWN_PRI_LAST + 100 so they run after the syncer has shutdown, because the syncer can cause a situation where bufdaemon help is needed to proceed. PR: 227404 Reviewed by: kib Tested by: cy, rmacklem	2018-04-22 16:05:29 +00:00
Mateusz Guzik	7d853f62bf	lockf: slightly depessimize 1. check if P_ADVLOCK is already set and if so, don't lock to set it (stolen from DragonFly) 2. when trying for fast path unlock, check that we are doing unlock first instead of taking the interlock for no reason (e.g. if we want to lock). While here, make it more likely that falling fast path will not take the interlock either by checking for state Note the code is severely pessimized both single- and multithreaded.	2018-04-22 09:30:07 +00:00
Jonathan T. Looney	44b71282b5	When running with INVARIANTS, the kernel contains extra checks. However, these assumptions may not hold true once we've panic'd. Therefore, the checks hold less value after a panic. Additionally, if one of the checks fails while we are already panic'd, this creates a double-panic which can interfere with debugging the original panic. Therefore, this commit allows an administrator to suppress a response to KASSERT checks after a panic by setting a tunable/sysctl. The tunable/sysctl (debug.kassert.suppress_in_panic) defaults to being enabled. Reviewed by: kib Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D12920	2018-04-21 17:05:00 +00:00
Konstantin Belousov	1302eea7bb	Rename PROC_PDEATHSIG_SET -> PROC_PDEATHSIG_CTL and PROC_PDEATHSIG_GET -> PROC_PDEATHSIG_STATUS for consistency with other procctl(2) operations names. Requested by: emaste Sponsored by: The FreeBSD Foundation MFC after: 13 days	2018-04-20 15:19:27 +00:00
Andriy Gapon	f87beb93e8	call racct_proc_ucred_changed() under the proc lock The lock is required to ensure that the switch to the new credentials and the transfer of the process's accounting data from the old credentials to the new ones is done atomically. Otherwise, some updates may be applied to the new credentials and then additionally transferred from the old credentials if the updates happen after proc_set_cred() and before racct_proc_ucred_changed(). The problem is especially pronounced for RACCT_RSS because - there is a strict accounting for this resource (it's reclaimable) - it's updated asynchronously by the vm daemon - it's updated by setting an absolute value instead of applying a delta I had to remove a call to rctl_proc_ucred_changed() from racct_proc_ucred_changed() and make all callers of latter call the former as well. The reason is that rctl_proc_ucred_changed, as it is implemented now, cannot be called while holding the proc lock, so the lock is dropped after calling racct_proc_ucred_changed. Additionally, I've added calls to crhold / crfree around the rctl call, because without the proc lock there is no guarantee that the new credentials, owned by the process, will stay stable. That does not eliminate a possibility that the credentials passed to the rctl will get stale. Ideally, rctl_proc_ucred_changed should be able to work under the proc lock. Many thanks to kib for pointing out the above problems. PR: 222027 Discussed with: kib No comment: trasz MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D15048	2018-04-20 13:08:04 +00:00
John Baldwin	73c8686e91	Simplify the code to allocate stack for auxv, argv[], and environment vectors. Remove auxarg_size as it was only used once right after a confusing assignment in each of the variants of exec_copyout_strings(). Reviewed by: emaste MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D15123	2018-04-19 16:00:34 +00:00
Konstantin Belousov	b940886338	Add PROC_PDEATHSIG_SET to procctl interface. Allow processes to request the delivery of a signal upon death of their parent process. Supposed consumer of the feature is PostgreSQL. Submitted by: Thomas Munro Reviewed by: jilles, mjg MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D15106	2018-04-18 21:31:13 +00:00
John Baldwin	8ce99bb405	Properly do a deep copy of the ioctls capability array for fget_cap(). fget_cap() tries to do a cheaper snapshot of a file descriptor without holding the file descriptor lock. This snapshot does not do a deep copy of the ioctls capability array, but instead uses a different return value to inform the caller to retry the copy with the lock held. However, filecaps_copy() was returning 1 to indicate that a retry was required, and fget_cap() was checking for 0 (actually '!filecaps_copy()'). As a result, fget_cap() did not do a deep copy of the ioctls array and just reused the original pointer. This cause multiple file descriptor entries to think they owned the same pointer and eventually resulted in duplicate frees. The only code path that I'm aware of that triggers this is to create a listen socket that has a restricted list of ioctls and then call accept() which calls fget_cap() with a valid filecaps structure from getsock_cap(). To fix, change the return value of filecaps_copy() to return true if it succeeds in copying the caps and false if it fails because the lock is required. I find this more intuitive than fixing the caller in this case. While here, change the return type from 'int' to 'bool'. Finally, make filecaps_copy() more robust in the failure case by not copying any of the source filecaps structure over. This avoids the possibility of leaking a pointer into a structure if a similar future caller doesn't properly handle the return value from filecaps_copy() at the expense of one more branch. I also added a test case that panics before this change and now passes. Reviewed by: kib Discussed with: mjg (not a fan of the extra branch) MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D15047	2018-04-17 18:07:40 +00:00
Brooks Davis	cee61c8cac	Stop using fuswintr() and suswintr() in the profiler. Always take the AST path rather than calling MD functions which are often implemented as always failing. This is the case on amd64, arm, i386, and powerpc. This optimization (inherited from 4.4 Lite) is a pessimization on those architectures and is the sole use of these functions. They will be removed in a separate commit. Reviewed by: kib Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D15101	2018-04-17 16:36:53 +00:00
Alan Somers	52c0983128	lio_listio: return EAGAIN instead of EIO when out of resources This behavior is already documented by the man page, and suggested by POSIX. Reviewed by: jhb MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D15099	2018-04-16 18:12:15 +00:00
Konstantin Belousov	d86c1f0dc1	i386 4/4G split. The change makes the user and kernel address spaces on i386 independent, giving each almost the full 4G of usable virtual addresses except for one PDE at top used for trampoline and per-CPU trampoline stacks, and system structures that must be always mapped, namely IDT, GDT, common TSS and LDT, and process-private TSS and LDT if allocated. By using 1:1 mapping for the kernel text and data, it appeared possible to eliminate assembler part of the locore.S which bootstraps initial page table and KPTmap. The code is rewritten in C and moved into the pmap_cold(). The comment in vmparam.h explains the KVA layout. There is no PCID mechanism available in protected mode, so each kernel/user switch forth and back completely flushes the TLB, except for the trampoline PTD region. The TLB invalidations for userspace becomes trivial, because IPI handlers switch page tables. On the other hand, context switches no longer need to reload %cr3. copyout(9) was rewritten to use vm_fault_quick_hold(). An issue for new copyout(9) is compatibility with wiring user buffers around sysctl handlers. This explains two kind of locks for copyout ptes and accounting of the vslock() calls. The vm_fault_quick_hold() AKA slow path, is only tried after the 'fast path' failed, which temporary changes mapping to the userspace and copies the data to/from small per-cpu buffer in the trampoline. If a page fault occurs during the copy, it is short-circuit by exception.s to not even reach C code. The change was motivated by the need to implement the Meltdown mitigation, but instead of KPTI the full split is done. The i386 architecture already shows the sizing problems, in particular, it is impossible to link clang and lld with debugging. I expect that the issues due to the virtual address space limits would only exaggerate and the split gives more liveness to the platform. Tested by: pho Discussed with: bde Sponsored by: The FreeBSD Foundation MFC after: 1 month Differential revision: https://reviews.freebsd.org/D14633	2018-04-13 20:30:49 +00:00
Mateusz Guzik	e0e259a888	locks: extend speculative spin waiting for readers to drain Now that 10 years have passed since the original limit of 10000 was committed, bump it a little bit. Spinning waiting for writers is semi-informed in the sense that we always know if the owner is running and base the decision to spin on that. However, no such information is provided for read-locking. In particular this means that it is possible for a write-spinner to completely waste cpu time waiting for the lock to be released, while the reader holding it was preempted and is now waiting for the spinner to go off cpu. Nonetheless, in majority of cases it is an improvement to spin instead of instantly giving up and going to sleep. The current approach is pretty simple: snatch the number of current readers and performs that many pauses before checking again. The total number of pauses to execute is limited to 10k. If the lock is still not free by that time, go to sleep. Given the previously noted problem of not knowing whether spinning makes any sense to begin with the new limit has to remain rather conservative. But at the very least it should also be related to the machine. Waiting for writers uses parameters selected based on the number of activated hardware threads. The upper limit of pause instructions to be executed in-between re-reads of the lock is typically 16384 or 32678. It was selected as the limit of total spins. The lower bound is set to already present 10000 as to not change it for smaller machines. Bumping the limit reduces system time by few % during benchmarks like buildworld, buildkernel and others. Tested on 2 and 4 socket machines (Broadwell, Skylake). Figuring out how to make a more informed decision while not pessimizing the fast path is left as an exercise for the reader.	2018-04-11 01:43:29 +00:00
Ian Lepore	97603f1da2	Use explicit_bzero() when cleaning values out of the kernel environment. Sometimes the values contain geli passphrases being communicated from loader(8) to the kernel, and some day the compiler may decide to start eliding calls to memset() for a pointer which is not dereferenced again before being passed to free().	2018-04-10 22:57:56 +00:00
Mateusz Guzik	04457342a3	rw: whack avoidable re-reads in try_upgrade	2018-04-10 22:32:31 +00:00
Stephen Hurd	f422673e10	Make BPF global lock an SX This allows NIC drivers to sleep on polling config operations. Submitted by: Matthew Macy <mmacy@mattmacy.io> Reviewed by: shurd Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D14982	2018-04-10 19:42:50 +00:00
Mateusz Guzik	a045941bd2	locks: tweak backoff a little bit Previous limits were chosen when locking primitives had spurious lock accesses. Flipping the starting point to 1 (or rather 2 as the first call shifts it) provides a modest win when mild contention is seen while not hurting worse cases. Tested on a bunch of one, two and four socket old and new systems (Westmere, Skylake, Threadripper and others) by doing concurrent page faults, buildkernel/buildworld and other stuff (although not all systems got all the tests). Another thing is the upper limit. It is semi-arbitrarily chosen as it was getting out of hand for slightly less small systems (e.g. a 128-thread one). Note that backoff is fundamentally a speculative bandaid and this change just makes it fit a little bit better. It remains completely oblivious to the hardware topology or the contention pattern. This is being experimented with.	2018-04-08 16:34:10 +00:00
Brooks Davis	6469bdcdb6	Move most of the contents of opt_compat.h to opt_global.h. opt_compat.h is mentioned in nearly 180 files. In-progress network driver compatibility improvements may add over 100 more so this is closer to "just about everywhere" than "only some files" per the guidance in sys/conf/options. Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensure opt_compat.h is created on all architectures. Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the set of compiled files. Reviewed by: kib, cem, jhb, jtl Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14941	2018-04-06 17:35:35 +00:00
Brooks Davis	89ea4a30d6	Added SAL annotations to system calls. Modify makesyscalls.sh to strip out SAL annotations. No functional change. This is based on work I started in CheriBSD and use to validate fat pointers at the syscall boundary. Tal Garfinkel reviewed the changes, added annotations to COMPAT* syscalls and is using them in a record and playback framework. One can envision other uses such as a WITNESS-like validator for copyin/out as speculated on in the review. At this time we are only annotating sys/kern/syscalls.master as that is sufficient for userspace work. If kernel use cases materialize, we can annotate other syscalls.master as needed. Submitted by: Tal Garfinkel <talg@cs.stanford.edu> Sponsored by: DARPA, AFRL (in part) Differential Revision: https://reviews.freebsd.org/D14285	2018-04-05 20:31:45 +00:00
Jeff Roberson	e5818a53db	Implement several enhancements to NUMA policies. Add a new "interleave" allocation policy which stripes pages across domains with a stride or width keeping contiguity within a multi-page region. Move the kernel to the dedicated numbered cpuset #2 making it possible to assign kernel threads and memory policy separately from user. This also eliminates the need for the complicated interrupt binding code. Add a sysctl API for viewing and manipulating domainsets. Refactor some of the cpuset_t manipulation code using the generic bitset type so that it can be used for both. This probably belongs in a dedicated subr file. Attempt to improve the include situation. Reviewed by: kib Discussed with: jhb (cpuset parts) Tested by: pho (before review feedback) Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14839	2018-03-29 02:54:50 +00:00
Jeff Roberson	27a3c9d710	Restore r331606 with a bugfix to setup cpuset_domain[] earlier on all platforms. Original commit message as follows: Only use CPUs in the domain the device is attached to for default assignment. Device drivers are able to override the default assignment if they bind directly. There are severe performance penalties for handling interrupts on remote CPUs and this should only be done in very controlled circumstances. Reviewed by: jhb, kib Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14838	2018-03-28 18:47:35 +00:00
Andriy Gapon	f4043145f2	ZFS vn_rele_async: catch up with the use of refcount(9) for the vnode use count It's not sufficient nor required to use the vnode interlock when checking if we are going to drop the last use count as the code in vputx() uses refcount (atomic) operations for both checking and decrementing the use code. Apply the same method to vn_rele_async(). While here, remove vn_rele_inactive(), a wrapper around vrele() that didn't add any value. Also, the change required making vfs_refcount_release_if_not_last() public. I've made vfs_refcount_acquire_if_not_zero() public as well. They are in sys/refcount.h now. While making the move I've dropped the vfs_ prefix. Reviewed by: mjg MFC after: 2 weeks Sponsored by: Panzura Differential Revision: https://reviews.freebsd.org/D14869	2018-03-28 08:55:31 +00:00
Mateusz Guzik	179da98f71	fd: tighten seq protected areas to not contain malloc/free	2018-03-28 03:07:02 +00:00
Konstantin Belousov	fb441a8829	Fix several leaks of kernel stack data through paddings. It is a random collection of fixes for issues not yet corrected, reported at https://tsyrklevi.ch/clang_analyzer/freebsd_013017/. Many issues from that list were already corrected. Most of them are for compat32, old compat32 or affect both primary host ABI and compat32. The freebsd32_kldstat(), for instance, was already fixed by using malloc(M_ZERO). Patch includes correction to report the supplied version back, which is just pedantic. Reviewed by: brooks, emaste (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D14868	2018-03-27 18:05:51 +00:00
Brooks Davis	34a77b9741	Move uio enums to sys/_uio.h. Include _uio.h instead of uio.h in several headers to reduce header pollution. Fix a few places that relied on header pollution to get the uio.h header. I have not moved struct uio as many more things that use it rely on header pollution to get other definitions from uio.h. Reviewed by: cem, kib, markj Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14811	2018-03-27 15:20:03 +00:00
Andriy Gapon	31260bf042	vfs_donmount: in certain cases try r/o mount if r/w mount fails If the operation is not an update, if neither r/w nor r/o mode is explicitly requested, if the error code hints at the possibility of the media being read-only, and if the fallback is allowed, then we can try to automatically downgrade to the readonly mode. This is especially useful for auto-mounting of removable media that sometimes can happen to be write-protected. The fallback to r/o is not enabled by default. It can be requested on a per-mount basis with a new mount option, 'autoro'. Or it can be globally allowed by setting vfs.default_autoro. Reviewed by: cem, kib MFC after: 3 weeks Relnotes: yes Differential Revision: https://reviews.freebsd.org/D13361	2018-03-27 14:31:42 +00:00
Jeff Roberson	e8cbe51a04	Fix a bug introduced in r329612 that slowly invalidates all clean bufs. Reported by: bde Reviewed by: bde Sponsored by: Netflix, Dell/EMC Isilon	2018-03-26 18:36:17 +00:00
Mark Johnston	803c11a3a6	Use LIST_FOREACH_SAFE in sleepq_chains_remove_matching(). We may remove a sleepqueue from the hash table in sleepq_resume_thread(). Reviewed by: kib MFC after: 3 days Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D14847	2018-03-25 20:12:14 +00:00
Konstantin Belousov	ed9e8bc468	Account the size of the vslock-ed memory by the thread. Assert that all such memory is unwired on return to usermode. The count of the wired memory will be used to detect the copyout mode. Tested by: pho (as part of the larger patch) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-03-24 13:51:27 +00:00
Konstantin Belousov	161bf65f8a	In vn_io_fault1(), reduce the scope where pagefaults are disabled. Most important for the future use, do not call vm_fault_quick_hold_pages() with disabled pagefaults. Reported and tested by: pho (as part of the larger patch) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-03-24 13:13:52 +00:00
Konstantin Belousov	c398200721	Do not send signals to init directly from shutdown_nice(9), do it from the task context. shutdown_nice() is used from the fast interrupt handlers, mostly for console drivers, where we cannot lock blockable locks. Schedule the task in the fast queue to send the signal from the proper context. Reviewed by: imp Discussed with: bde Sponsored by: The FreeBSD Foundation MFC after: 1 week	2018-03-22 20:47:25 +00:00
Jeff Roberson	9a4b4cd3bc	Start witness much earlier in boot so that we can shrink the pend list and make it more immune to further change. Reviewed by: markj, imp (Part of D14707) Sponsored by: Netflix, Dell/EMC Isilon	2018-03-22 19:11:43 +00:00
Warner Losh	f0d847af61	Drop any recursed taking of Giant once and for all at the top of kern_reboot(). The shutdown path is now safe to run without Giant. Discussed with: kib@ Sponsored by: Netflix	2018-03-22 15:34:37 +00:00
Jonathan T. Looney	2529f56ed3	Add the "TCP Blackbox Recorder" which we discussed at the developer summits at BSDCan and BSDCam in 2017. The TCP Blackbox Recorder allows you to capture events on a TCP connection in a ring buffer. It stores metadata with the event. It optionally stores the TCP header associated with an event (if the event is associated with a packet) and also optionally stores information on the sockets. It supports setting a log ID on a TCP connection and using this to correlate multiple connections that share a common log ID. You can log connections in different modes. If you are doing a coordinated test with a particular connection, you may tell the system to put it in mode 4 (continuous dump). Or, if you just want to monitor for errors, you can put it in mode 1 (ring buffer) and dump all the ring buffers associated with the connection ID when we receive an error signal for that connection ID. You can set a default mode that will be applied to a particular ratio of incoming connections. You can also manually set a mode using a socket option. This commit includes only basic probes. rrs@ has added quite an abundance of probes in his TCP development work. He plans to commit those soon. There are user-space programs which we plan to commit as ports. These read the data from the log device and output pcapng files, and then let you analyze the data (and metadata) in the pcapng files. Reviewed by: gnn (previous version) Obtained from: Netflix, Inc. Relnotes: yes Differential Revision: https://reviews.freebsd.org/D11085	2018-03-22 09:40:08 +00:00
Gleb Smirnoff	27cd06b391	Redo r331328. We need to fix not only type but also format. While here again notice that we are fixing regression from r331106.	2018-03-22 05:26:27 +00:00
Gleb Smirnoff	5aab68f24a	Fix sysctl types broken in r329612.	2018-03-21 23:21:32 +00:00
Mark Johnston	a7defaea9a	Elide the object lock in the common case in vfs_vmio_unwire(). The object lock was only needed when attempting to free B_DIRECT buffer pages, and for testing for invalid pages (and freeing them if so). Handle the latter by instead moving invalid pages near the head of the inactive queue, where they will be reclaimed quickly. Reviewed by: alc, kib, jeff MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D14778	2018-03-21 21:15:43 +00:00
Warner Losh	3e867f24cb	bufshutdown is no longer called with Giant held, so there's no need to drop or pickup Giant anymore. Remove that code and adjust comments.	2018-03-21 14:46:59 +00:00
Warner Losh	d5292812f8	Remove Giant from init creation and vfs_mountroot. Sponsored by: Netflix Discussed with: kib@, mckusick@ Differential Revision: https://reviews.freebsd.org/D14712	2018-03-21 14:46:54 +00:00
Conrad Meyer	c37125d9e5	Add missed sys/limits.h include Apparently header pollution on x86 hid its absence. Sorry, other arch users. Fix the missed header introduced in r331279. Reported by: tinderbox	2018-03-21 03:43:40 +00:00
Conrad Meyer	4948f7bf11	Regenerate sysent files after r331279.	2018-03-21 01:17:01 +00:00
Conrad Meyer	e9ac27430c	Implement getrandom(2) and getentropy(3) The general idea here is to provide userspace programs with well-defined sources of entropy, in a fashion that doesn't require opening a new file descriptor (ulimits) or accessing paths (/dev/urandom may be restricted by chroot or capsicum). getrandom(2) is the more general API, and comes from the Linux world. Since our urandom and random devices are identical, the GRND_RANDOM flag is ignored. getentropy(3) is added as a compatibility shim for the OpenBSD API. truss(1) support is included. Tests for both system calls are provided. Coverage is believed to be at least as comprehensive as LTP getrandom(2) test coverage. Additionally, instructions for running the LTP tests directly against FreeBSD are provided in the "Test Plan" section of the Differential revision linked below. (They pass, of course.) PR: 194204 Reported by: David CARLIER <david.carlier AT hardenedbsd.org> Discussed with: cperciva, delphij, jhb, markj Relnotes: maybe Differential Revision: https://reviews.freebsd.org/D14500	2018-03-21 01:15:45 +00:00
Jamie Gritton	672756aa9f	Represent boolean jail options as an array of structures containing the flag and both the regular and "no" names, instead of two different string arrays whose indices need to match the flag's bit position. This makes them similar to the way "jailsys" options are represented. Loop through either kind of option array with a structure pointer rather than an integer index.	2018-03-20 23:08:42 +00:00
Gleb Smirnoff	83fc34ea0d	At this point iwmesg isn't initialized yet, so print pointer to lock rather than panic before panicking.	2018-03-20 22:05:21 +00:00
Mark Johnston	8c7549da2b	Drop KTR_CONTENTION. It is incomplete, has not been adopted in the other locking primitives, and we have other means of measuring lock contention (lock_profiling, lockstat, KTR_LOCK). Drop it to slightly de-clutter the mutex code and free up a precious KTR class index. Reviewed by: jhb, mjg MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D14771	2018-03-20 15:51:05 +00:00
Justin Hibbits	2acde6a85a	Cast through uintptr_t to narrow the buf domain pointer on 32-bit archs arg2 is an intmax_t, which on 32-bit architectures is 64 bits, wider than a pointer. When &bdomain[i] is added to arg2 it widens from uintptr_t to intmax_t, then gcc whines when it gets cast to a pointer. Casting through uintptr_t silences this warning.	2018-03-20 02:01:30 +00:00


16353 Commits