freebsd-skq

Author	SHA1	Message	Date
kevans	dd71f380da	Add debug.verbose_sysinit tunable for VERBOSE_SYSINIT VERBOSE_SYSINIT is currently an all-or-nothing option. debug.verbose_sysinit adds an option to have the code compiled in but quiet by default so that getting this information from a device in the field doesn't necessarily require distributing a recompiled kernel. Its default is VERBOSE_SYSINIT's value as defined in the kernconf. As such, the default behavior for simply omitting or including this option is unchanged. MFC after: 1 week	2018-06-20 19:23:56 +00:00
manu	b5d43b277c	Add pmap_mapdev_attr for arm64 This is needed for efifb. arm and riscv pmap (the two other archs that, with arm64, use subr_devmap) have very different implementations so for now only add this for arm64. Tested with efifb on Pine64 with a few other patches. Reviewed by: cognet Differential Revision: https://reviews.freebsd.org/D15294	2018-06-20 16:07:35 +00:00
bz	299899c6fd	Instead of using hand-rolled loops where not needed switch them to FOREACH_PROC_IN_SYSTEM() to have a single pattern to look for. Reviewed by: kib MFC after: 2 weeks Sponsored by: iXsystems, Inc. Differential Revision: https://reviews.freebsd.org/D15916	2018-06-20 11:42:06 +00:00
bz	49043c2660	Sometimes it is helpful to get the path for a vnode. Implement a ddb function walking the namecache to do this. Reviewed by: jhb, mjg Inspired by: gdb macro from jhb (old version) Sponsored by: iXsystems, Inc. Differential Revision: https://reviews.freebsd.org/D14898	2018-06-20 08:34:29 +00:00
mmacy	79793784f7	convert inpcbinfo hash and info rwlocks to epoch + mutex - Convert inpcbinfo info & hash locks to epoch for read and mutex for write - Garbage collect code that handled INP_INFO_TRY_RLOCK failures as INP_INFO_RLOCK which can no longer fail When running 64 netperfs sending minimal sized packets on a 2x8x2 this reduces unhalted core cycles samples in rwlock rlock/runlock in udp_send from 51% to 3%. Overall packet throughput rate limited by CPU affinity and NIC driver design choices. On the receiver unhalted core cycles samples in in_pcblookup_hash went from 13% to 1.6% Tested by LLNW and pho@ Reviewed by: jtl Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D15686	2018-06-19 01:54:00 +00:00
ae	a58623ba71	Switch RIB and RADIX_NODE_HEAD lock from rwlock(9) to rmlock(9). Using rwlock with multiqueue NICs for IP forwarding at high pps produces high lock contention and is inefficient. Rmlock fits better for such workloads. Reviewed by: melifaro, olivier Obtained from: Yandex LLC Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D15789	2018-06-16 08:26:23 +00:00
glebius	dfbc255945	Since 'ticks' is an int, it may wrap around and cr_ticks at a certain counter_rate will be greater than ticks, resulting in counter_ratecheck() failure. To fix this take an absolute value of the difference between ticks and cr_ticks. Reported by: jtl Sponsored by: Netflix	2018-06-15 21:36:16 +00:00
bdrewery	16a3710079	proc0_post: Fix some locking issues - Filter out PRS_NEW procs as rufetch() tries taking the thread lock which may not yet be initialized. - Hold PROC_LOCK to ensure stability of iterating the threads. - p_rux fields are protected by the process statlock as well. MFC after: 2 weeks Reviewed by: kib Sponsored by: Dell EMC Differential Revision: https://reviews.freebsd.org/D15809	2018-06-15 00:36:41 +00:00
cognet	79d2fc0f98	Use M_EXEC when calling malloc() to allocate the memory to store the module, as it'll contain executable code.	2018-06-14 23:10:10 +00:00
brooks	8e419faaf8	Regen after 335177 (rename sys_obreak to sys_break).	2018-06-14 21:29:31 +00:00
brooks	ad6bae500f	Name the implementation of brk and sbrk sys_break(). The break() system call was renamed (several times) starting in v3 AT&T UNIX when C was invented and break was a language keyword. The last vestige of a need for it to be called something else (eg obreak) was removed in r225617 which consistently prefixed all syscall implementations. Reviewed by: emaste, kib (older version) Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D15638	2018-06-14 21:27:25 +00:00
jtl	8222f5cb7c	Make UMA and malloc(9) return non-executable memory in most cases. Most kernel memory that is allocated after boot does not need to be executable. There are a few exceptions. For example, kernel modules do need executable memory, but they don't use UMA or malloc(9). The BPF JIT compiler also needs executable memory and did use malloc(9) until r317072. (Note that a side effect of r316767 was that the "small allocation" path in UMA on amd64 already returned non-executable memory. This meant that some calls to malloc(9) or the UMA zone(9) allocator could return executable memory, while others could return non-executable memory. This change makes the behavior consistent.) This change makes malloc(9) return non-executable memory unless the new M_EXEC flag is specified. After this change, the UMA zone(9) allocator will always return non-executable memory, and a KASSERT will catch attempts to use the M_EXEC flag to allocate executable memory using uma_zalloc() or its variants. Allocations that do need executable memory have various choices. They may use the M_EXEC flag to malloc(9), or they may use a different VM interface to obtain executable pages. Now that malloc(9) again allows executable allocations, this change also reverts most of r317072. PR: 228927 Reviewed by: alc, kib, markj, jhb (previous version) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D15691	2018-06-13 17:04:41 +00:00
imp	28723d54b4	Implement a 'car limit' for bioq. Allow one to implement a 'car limit' for bioq_disksort. debug.bioq_batchsize sets the size of car limit. Every time we queue that many requests, we start over so that we limit the latency for requests when the software queue depths are large. A value of '0', the default, means to revert to the old behavior. Sponsored by: Netflix	2018-06-13 16:48:07 +00:00
bde	8bff5fd956	Fix the encoding of major and minor numbers in 64-bit dev_t by restoring the old encodings for the lower 16 and 32 bits and only using the higher 32 bits for unusually large major and minor numbers. This change breaks compatibility with the previous encoding (which was only used in -current). Fix truncation to (essentially) 16-bit dev_t in newnfs v3. Any encoding of device numbers gives an ABI, so it can't be changed without translations for compatibility. Extra bits give the much larger complication that the translations need to compress into fewer bits. Fortunately, more than 32 bits are rarely needed, so compression is rarely needed except for 16-bit linux dev_t where it was always needed but never done. The previous encoding moved the major number into the top 32 bits. Almost no translation code handled this, so the major number was blindly truncated away in most 32-bit encodings. E.g., for ffs, mknod(8) with major = 1 and minor = 2 gave dev_t = 0x10000002; ffs cannot represent this and blindly truncated it to 2. But if this mknod was run on any released version of FreeBSD, it gives dev_t = 0x102. ffs can represent this, but in the previous encoding it was not decoded, giving major = 0, minor = 0x102. The presence of bugs was most obvious for exporting dev_t's from an old system to -current, since bugs in newnfs augment them. I fixed oldnfs to support 32-bit dev_t in 1996 (r16634), but this regressed to 16-bit dev_t in newnfs, first to the old 16-bit encoding and then further in -current. E.g., old ad0 with major = 234, minor = 0x10002 had the correct (major, minor) number on the wire, but newnfs truncated this to (234, 2) and then the previous encoding shifted the major number into oblivion as seen by ffs or old applications. I first tried to fix this by translating on every ABI/API boundary, but there are too many boundaries and too many sloppy translations by blind truncation. So use the old encoding for the low 32 bits so that sloppy translations work no worse than before provided the high 32 bits are not set. Add some error checking for when bits are lost. Keep not doing any error checking for translations for almost everything in compat/linux. compat/freebsd32/freebsd32_misc.c: Optionally check for losing bits after possibly-truncating assignments as before. compat/linux/linux_stats.c: Depend on the representation being compatible with Linux's (or just with itself for local use) and spell some of the translations as assignments in a macro that hides the details. fs/nfsclient/nfs_clcomsubs.c: Essentially the same fix as in 1996, except there is now no possible truncation in makedev() itself. Also fix nearby style bugs. kern/vfs_syscalls.c: As for freebsd32. Also update the sysctl description to include file numbers, and change it to describe device ids as device numbers. sys/types.h: Use inline functions (wrapped by macros) since the expressions are now a bit too complicated for plain macros. Describe the encoding and some of the reasons for it. 16-bit compatibility didn't leave many reasonable choices for the 32-bit encoding, and 32-bit compatibility doesn't leave many reasonable choices for the 64-bit encoding. My choice is to put the 8 new minor bits in the low 8 bits of the top 32 bits. This minimizes discontiguities. Reviewed by: kib (except for rewrite of the comment in linux_stats.c)	2018-06-13 12:22:00 +00:00
bde	216f1ebfa0	Fix some bugs found while fixing the representation and translation of 64-bit dev_t's (but not ones involving dev_t's). st_size was supposed to be clamped in cvtstat() and linux's copy_stat(), but the clamping code wasn't aware that st_size is signed, and also had an obfuscated off-by-1 value for the unsigned limit, so its effect was to produce a bizarre negative size instead of clamping. Change freebsd32's copy_ostat() to be no worse than cvtstat(). It was missing clamping and bzero()ing of padding. Reviewed by: kib (except a final fix of the clamp to the signed maximum)	2018-06-13 08:50:43 +00:00
emaste	e452e96446	makesyscalls: simplify capenabled pipeline Replace cat + 2x grep with one grep. Sponsored by: Turing Robotic Industries	2018-06-11 18:57:40 +00:00
mmacy	b55c8854cf	limit change to fixing controlp handling pending review	2018-06-11 17:10:19 +00:00
mmacy	1db1491876	soreceive_stream: correctly handle edge cases - non NULL controlp is not an error, returning EINVAL would cause X forwarding to fail - MSG_PEEK and MSG_WAITALL are fairly exceptional, but we still want to handle them - punt to soreceive_generic	2018-06-11 16:31:42 +00:00
mjg	617832d02f	counter: add a bit missed in r334858 It happens to be a noop.	2018-06-08 22:06:32 +00:00
mmacy	54ac1282ce	AF_UNIX: bring uipc_ready in compliance with new locking protocol PR: 228742 Submitted by: markj Reviewed by: markj	2018-06-08 20:31:59 +00:00
jtl	7cf8a13d28	Add a socket destructor callback. This allows kernel providers to set callbacks to perform additional cleanup actions at the time a socket is closed. Michio Honda presented a use for this at BSDCan 2018. (See https://www.bsdcan.org/2018/schedule/events/965.en.html .) Submitted by: Michio Honda <micchie at sfc.wide.ad.jp> (previous version) Reviewed by: lstewart (previous version) Differential Revision: https://reviews.freebsd.org/D15706	2018-06-08 19:35:24 +00:00
mjg	08fabf55c9	uma: fix up r334824 Turns out there is code which ends up passing M_ZERO to counters. Since counters zero unconditionally on their own, just drop the flag in that place.	2018-06-08 05:40:36 +00:00
mmacy	33d22ed3f8	hwpmc: simplify calling convention for hwpmc interrupt handling pmc_process_interrupt takes 5 arguments when only 3 are needed. cpu is always available in curcpu and inuserspace can always be derived from the passed trapframe. While facially a reasonable cleanup this change was motivated by the need to workaround a compiler bug. core2_intr(cpu, tf) -> pmc_process_interrupt(cpu, ring, pmc, tf, inuserspace) -> pmc_add_sample(cpu, ring, pm, tf, inuserspace) In the process of optimizing the tail call the tf pointer was getting clobbered: (kgdb) up at /storage/mmacy/devel/freebsd/sys/dev/hwpmc/hwpmc_mod.c:4709 4709 pmc_save_kernel_callchain(ps->ps_pc, (kgdb) up 1205 error = pmc_process_interrupt(cpu, PMC_HR, pm, tf, resulting in a crash in pmc_save_kernel_callchain.	2018-06-08 04:58:03 +00:00
rrs	e4ec942fc5	This commit brings in a new refactored TCP stack called Rack. Rack includes the following features: - A different SACK processing scheme (the old sack structures are not used). - RACK (Recent acknowledgment) where counting dup-acks is no longer done; instead time is used to know when to retransmit. (see the I-D) - TLP (Tail Loss Probe) where we will probe for tail-losses to try to avoid taking a retransmit time-out. (see the I-D) - Burst mitigation using TCPHPTS - PRR (Proportional Rate Reduction) see the RFC. Once built into your kernel, you can select this stack either by socket option, where the name of the stack is "rack", or by setting the global sysctl so the default is rack. Note that any connection that does not support SACK will be kicked back to the "default" base FreeBSD stack (currently known as "default"). To build this into your kernel you will need to enable in your kernel: makeoptions WITH_EXTRA_TCP_STACKS=1 options TCPHPTS Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D15525	2018-06-07 18:18:13 +00:00
alc	dc7928340b	When pidctrl_daemon() is called multiple times within an interval, it should use the cumulative error to calculate the output.	2018-06-07 07:48:50 +00:00
mmacy	2b52b582f3	AF_UNIX: check for unp == unp2 on disconnect	2018-06-07 04:57:40 +00:00
alc	ffaddd3ad9	pidctrl_daemon() implements a variation on the classical, discrete PID controller that tries to handle early invocations of the controller, in other words, invocations before the expected end of the interval. However, there were some calculation errors in this early invocation case. Notably, if an early invocation occurred while the error was negative, the derivative term was off by a large amount. One visible effect of this error was that processes were being killed by the virtual memory system's OOM killer when in fact there was plentiful free memory. Correct a couple minor errors in the sysctl descriptions, and apply some style fixes. Reviewed by: jeff, markj	2018-06-07 02:54:11 +00:00
sbruno	d0aeaa5af7	Load balance sockets with new SO_REUSEPORT_LB option. This patch adds a new socket option, SO_REUSEPORT_LB, which allows multiple programs or threads to bind to the same port, and incoming connections will be load balanced using a hash function. Most of the code was copied from a similar patch for DragonflyBSD. However, in DragonflyBSD, load balancing is a global on/off setting and can not be set per socket. This patch allows for simultaneous use of both the current SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system. Required changes to structures: Globally change so_options from 16 to 32 bit value to allow for more options. Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets. Limitations: As in DragonflyBSD, a load balance group is limited to 256 pcbs (256 programs or threads sharing the same socket). This is a substantially different contribution as compared to its original incarnation at svn r332894 and reverted at svn r332967. Thanks to rwatson@ for the substantive feedback that is included in this commit. Submitted by: Johannes Lundberg <johalun0@gmail.com> Obtained from: DragonflyBSD Relnotes: Yes Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D11003	2018-06-06 15:45:57 +00:00
jhibbits	ddc936d650	Revert r334708 This is the wrong place to put the barrier. Requested by: kib,mjg	2018-06-06 15:12:19 +00:00
jhibbits	7449055198	Add a memory barrier after taking a reference on the vnode holdcnt in _vhold This is needed to avoid a race between the VNASSERT() below, and another thread updating the VI_FREE flag, on weakly-ordered architectures. On a 72-thread POWER9, without this barrier a 'make -j72 buildworld' would panic on the assert regularly. It may be possible to use a weaker barrier, and I'll investigate that once all stability issues are worked out on POWER9.	2018-06-06 12:57:11 +00:00
mmacy	2b8bb3bc50	hwpmc: log name->pid, name->tid mappings By logging all threads and processes 'pmc filter' can now filter on process or thread name, relieving the user of the burden of determining which tid or pid was which when the sample was taken. % pmc filter -T if_io_tqg -P nginx pmc.log pmc-iflib.log % pmc filter -x -T idle pmc.log pmc-noidle.log	2018-06-05 04:26:40 +00:00
markj	d11c50b901	Regen after r334626.	2018-06-04 19:36:47 +00:00
markj	9d9fd255d6	Reimplement brk() and sbrk() to avoid the use of _end. Previously, libc.so would initialize its notion of the break address using _end, a special symbol emitted by the static linker following the bss section. Compatibility issues between lld and ld.bfd could cause the wrong definition of _end (libc.so's definition rather than that of the executable) to be used, breaking the brk()/sbrk() interface. Avoid this problem and future interoperability issues by simply not relying on _end. Instead, modify the break() system call to return the kernel's view of the current break address, and have libc initialize its state using an extra syscall upon the first use of the interface. As a side effect, this appears to fix brk()/sbrk() usage in executables run with rtld direct exec, since the kernel and libc.so no longer maintain separate views of the process' break address. PR: 228574 Reviewed by: kib (previous version) MFC after: 2 months Differential Revision: https://reviews.freebsd.org/D15663	2018-06-04 19:35:15 +00:00
alc	2d5adeeb3e	Use a single, consistent approach to returning success versus failure in vm_map_madvise(). Previously, vm_map_madvise() used a traditional Unix- style "return (0);" to indicate success in the common case, but Mach- style return values in the edge cases. Since KERN_SUCCESS equals zero, the only problem with this inconsistency was stylistic. vm_map_madvise() has exactly two callers in the entire source tree, and only one of them cares about the return value. That caller, kern_madvise(), can be simplified if vm_map_madvise() consistently uses Unix-style return values. Since vm_map_madvise() uses the variable modify_map as a Boolean, make it one. Eliminate a redundant error check from kern_madvise(). Add a comment explaining where the check is performed. Explicitly note that exec_release_args_kva() doesn't care about vm_map_madvise()'s return value. Since MADV_FREE is passed as the behavior, the return value will always be zero. Reviewed by: kib, markj MFC after: 7 days	2018-06-04 16:28:06 +00:00
mmacy	73041f23e1	hwpmc: support sampling both kernel and user stacks when interrupted in kernel This adds the -U option to pmcstat which will attribute in-kernel samples back to the user stack that invoked the system call. It is not the default, because when looking at kernel profiles it is generally more desirable to merge all instances of a given system call together. Although heavily revised, this change is directly derived from D7350 by Jonathan T. Looney. Obtained from: jtl Sponsored by: Juniper Networks, Limelight Networks	2018-06-04 01:10:23 +00:00
mjg	cdca29b9c6	Remove an unused argument to turnstile_unpend. PR: 228694 Submitted by: Julian Pszczołowski <julian.pszczolowski@gmail.com>	2018-06-02 22:37:53 +00:00
mjg	fe4195ffb4	malloc: try to use builtins for zeroing at the callsite Plenty of allocation sites pass M_ZERO and sizes which are small and known at compilation time. Handling them internally in malloc loses this information and results in avoidable calls to memset. Instead, let the compiler take the advantage of it whenever possible. Discussed with: jeff	2018-06-02 22:20:09 +00:00
markj	746978a065	Avoid completing I/O when dumping core after a panic. Filesystem or pager completion callbacks are generally non-functional after a panic and may trigger deadlocks if invoked in this context (e.g., by attempting to destroy a buffer mapping). To avoid this situation, short-circuit I/O completion in biodone(). Reviewed by: imp Discussed with: mav MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D15592	2018-06-01 23:49:32 +00:00
emaste	98274e3f11	ANSIfy sys/kern	2018-06-01 13:26:45 +00:00
imp	2b71962940	Make the data returned by devinfo harder to overflow. Rather than using fixed-length strings, pack them into a string table to return. Also expand the buffer from ~300 characters to 3k. This should be enough, even for USB. This fixes a problem where USB pnp info is truncated on return to userland. Differential Revision: https://reviews.freebsd.org/D15629	2018-05-31 02:57:58 +00:00
brooks	e56e8a49a8	Remove alternative names that are identical to the default. Verified by make sysent producing no changes.	2018-05-30 22:22:58 +00:00
emaste	c2e419ed96	link_elf_obj: correct an error message Previously we'd report that a file has "no valid symbol table" if it in fact had two or more. Change the message to report that there must be exactly one.	2018-05-30 12:55:27 +00:00
mmacy	b697e1087d	epoch(9): make epoch closer to style(9)	2018-05-30 03:39:57 +00:00
shurd	40a1e4b33c	iflib: mark irq allocation name parameter as constant The name parameter passed to iflib_irq_alloc_generic and iflib_softirq_alloc_generic is never modified. Many places in code pass string literals and thus should not be modified. Mark the name parameter as a const char * instead, so that we enforce that the name is not modified before passing to bus_describe_intr() Submitted by: Jacob Keller <jacob.e.keller@intel.com> Reviewed by: kmacy Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D15343	2018-05-29 21:56:39 +00:00
glebius	6f677ae721	Revert second chunk of r333860. The warning from gcc is a false positive. The npages won't ever be used in the no-space case.	2018-05-29 21:45:15 +00:00
mmacy	9e319ee6ad	hwpmc: don't enter epoch section across mmap hook	2018-05-29 18:03:48 +00:00
brooks	474f6cc8eb	Correct pointer subtraction in KASSERT(). The assertion would never fire without truly spectacular future programming errors. Reported by: Coverity CID: 1391367, 1391368 Sponsored by: DARPA, AFRL	2018-05-29 17:49:03 +00:00
avg	db453a7a34	add support for console resuming, implement it for uart, use on x86 This change adds a new optional console method cn_resume and a kernel console interface cnresume. Consoles that may need to re-initialize their hardware after suspend (e.g., because firmware does not care to do it) will implement cn_resume. Note that it is called in rather early environment not unlike early boot, so the same restrictions apply. Platform specific code, for platforms that support hardware suspend, should call cnresume early after resume, before any console output is expected. This change fixes a problem with a system of mine failing to resume when a serial console is used. I found that the serial port was in a strange configuration and an attempt to write to it likely resulted in an infinite loop. To avoid adding cn_resume method to every console driver, CONSOLE_DRIVER macro has been extended to support optional methods. Reviewed by: imp, mav MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D15552	2018-05-29 16:16:24 +00:00
mmacy	666cb83479	witness/hwpmc: fix locking order for pmc locks	2018-05-28 23:14:38 +00:00
vangyzen	7293d43c89	kern_cpuset: fix small leak on error path The "mask" was leaked on some error paths. Reported by: Coverity CID: 1384683 Sponsored by: Dell EMC	2018-05-26 14:23:11 +00:00
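
The 64-bit dev_t layout described in commit 8bff5fd956 can be illustrated with a short sketch. This is not copied from sys/types.h: the names sk_dev_t, sk_makedev, sk_major and sk_minor are hypothetical, and the masks are inferred from the commit text (old encoding kept in the low 32 bits, the 8 new minor bits in the low 8 bits of the top 32 bits, the remaining major bits above them).

```c
#include <stdint.h>

/* Hypothetical type and function names; layout inferred from the commit message. */
typedef uint64_t sk_dev_t;

static inline sk_dev_t
sk_makedev(uint32_t major, uint32_t minor)
{
	return (((sk_dev_t)(major & 0xffffff00u) << 32) | /* major bits 8..31 -> dev bits 40..63 */
	    ((sk_dev_t)(major & 0xffu) << 8) |            /* major bits 0..7  -> dev bits 8..15 */
	    ((sk_dev_t)(minor & 0xff00u) << 24) |         /* minor bits 8..15 -> dev bits 32..39 */
	    (sk_dev_t)(minor & 0xffff00ffu));             /* other minor bits stay in place */
}

static inline uint32_t
sk_major(sk_dev_t d)
{
	return ((uint32_t)((d >> 32) & 0xffffff00u) | (uint32_t)((d >> 8) & 0xffu));
}

static inline uint32_t
sk_minor(sk_dev_t d)
{
	return ((uint32_t)((d >> 24) & 0xff00u) | (uint32_t)(d & 0xffff00ffu));
}
```

Under these assumptions, major = 1 and minor = 2 encode to 0x102, matching the value the commit message gives for released versions of FreeBSD, and any pair that fits the old 32-bit encoding leaves the high 32 bits clear.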
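
The wraparound problem described in commit dfbc255945 can be sketched as follows. This is an illustration of the idea (compare the absolute distance between two tick values), not the kernel's counter_ratecheck() code; sk_interval_elapsed is a hypothetical name, and the sketch assumes a 32-bit int and two's-complement conversion.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true once at least `hz` ticks separate `now`
 * from `last`, regardless of which of the two is numerically larger.
 */
static inline bool
sk_interval_elapsed(int now, int last, int hz)
{
	/* Subtract in unsigned arithmetic so that wraparound is well defined. */
	int32_t delta = (int32_t)((uint32_t)now - (uint32_t)last);
	/* Take the absolute value in 64 bits so INT32_MIN cannot overflow. */
	int64_t dist = delta < 0 ? -(int64_t)delta : (int64_t)delta;

	return (dist >= hz);
}
```

With a plain signed comparison, a `last` value recorded just before the counter wrapped would look far in the future and the check would keep failing; using the distance avoids that.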
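
The st_size clamping bug described in commit 216f1ebfa0 comes down to clamping against an unsigned limit when the destination field is signed. A minimal sketch of clamping to the signed maximum, assuming a 32-bit signed destination field; sk_clamp_size32 is a hypothetical name and this is not the code from cvtstat() or copy_stat():

```c
#include <stdint.h>

/*
 * Hypothetical helper: narrow a 64-bit file size to a signed 32-bit
 * field, saturating at INT32_MAX instead of truncating.
 */
static inline int32_t
sk_clamp_size32(int64_t size)
{
	if (size > INT32_MAX)
		return (INT32_MAX);
	return ((int32_t)size);
}
```

A size of 0xffffffff passes a clamp against the unsigned 32-bit limit unchanged and then reads back as -1 from a signed field; clamping to INT32_MAX yields the largest representable size instead.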


16262 Commits