- Convert inpcbinfo info & hash locks to epoch for read and mutex for write
- Garbage collect code that handled INP_INFO_TRY_RLOCK failures as
INP_INFO_RLOCK which can no longer fail
When running 64 netperfs sending minimal sized packets on a 2x8x2 reduces
unhalted core cycles samples in rwlock rlock/runlock in udp_send from 51% to
3%.
Overall packet throughput rate limited by CPU affinity and NIC driver design
choices.
On the receiver unhalted core cycles samples in in_pcblookup_hash went from
13% to to 1.6%
Tested by LLNW and pho@
Reviewed by: jtl
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15686
Using of rwlock with multiqueue NICs for IP forwarding on high pps
produces high lock contention and inefficient. Rmlock fits better for
such workloads.
Reviewed by: melifaro, olivier
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D15789
Multicast incorrectly calls in to drivers with a mutex held causing drivers
to have to go through all manner of contortions to use a non sleepable lock.
Serialize multicast updates instead.
Submitted by: mmacy <mmacy@mattmacy.io>
Reviewed by: shurd, sbruno
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D14969
Currently both the page lock and a page queue lock must be held in
order to enqueue, dequeue or requeue a page in a given page queue.
The queue locks are a scalability bottleneck in many workloads. This
change reduces page queue lock contention by batching queue operations.
To detangle the page and page queue locks, per-CPU batch queues are
used to reference pages with pending queue operations. The requested
operation is encoded in the page's aflags field with the page lock
held, after which the page is enqueued for a deferred batch operation.
Page queue scans are similarly optimized to minimize the amount of
work performed with a page queue lock held.
Reviewed by: kib, jeff (previous versions)
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14893
The change makes the user and kernel address spaces on i386
independent, giving each almost the full 4G of usable virtual addresses
except for one PDE at top used for trampoline and per-CPU trampoline
stacks, and system structures that must be always mapped, namely IDT,
GDT, common TSS and LDT, and process-private TSS and LDT if allocated.
By using 1:1 mapping for the kernel text and data, it appeared
possible to eliminate assembler part of the locore.S which bootstraps
initial page table and KPTmap. The code is rewritten in C and moved
into the pmap_cold(). The comment in vmparam.h explains the KVA
layout.
There is no PCID mechanism available in protected mode, so each
kernel/user switch forth and back completely flushes the TLB, except
for the trampoline PTD region. The TLB invalidations for userspace
becomes trivial, because IPI handlers switch page tables. On the other
hand, context switches no longer need to reload %cr3.
copyout(9) was rewritten to use vm_fault_quick_hold(). An issue for
new copyout(9) is compatibility with wiring user buffers around sysctl
handlers. This explains two kind of locks for copyout ptes and
accounting of the vslock() calls. The vm_fault_quick_hold() AKA slow
path, is only tried after the 'fast path' failed, which temporary
changes mapping to the userspace and copies the data to/from small
per-cpu buffer in the trampoline. If a page fault occurs during the
copy, it is short-circuit by exception.s to not even reach C code.
The change was motivated by the need to implement the Meltdown
mitigation, but instead of KPTI the full split is done. The i386
architecture already shows the sizing problems, in particular, it is
impossible to link clang and lld with debugging. I expect that the
issues due to the virtual address space limits would only exaggerate
and the split gives more liveness to the platform.
Tested by: pho
Discussed with: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D14633
This allows NIC drivers to sleep on polling config operations.
Submitted by: Matthew Macy <mmacy@mattmacy.io>
Reviewed by: shurd
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D14982
summits at BSDCan and BSDCam in 2017.
The TCP Blackbox Recorder allows you to capture events on a TCP connection
in a ring buffer. It stores metadata with the event. It optionally stores
the TCP header associated with an event (if the event is associated with a
packet) and also optionally stores information on the sockets.
It supports setting a log ID on a TCP connection and using this to correlate
multiple connections that share a common log ID.
You can log connections in different modes. If you are doing a coordinated
test with a particular connection, you may tell the system to put it in
mode 4 (continuous dump). Or, if you just want to monitor for errors, you
can put it in mode 1 (ring buffer) and dump all the ring buffers associated
with the connection ID when we receive an error signal for that connection
ID. You can set a default mode that will be applied to a particular ratio
of incoming connections. You can also manually set a mode using a socket
option.
This commit includes only basic probes. rrs@ has added quite an abundance
of probes in his TCP development work. He plans to commit those soon.
There are user-space programs which we plan to commit as ports. These read
the data from the log device and output pcapng files, and then let you
analyze the data (and metadata) in the pcapng files.
Reviewed by: gnn (previous version)
Obtained from: Netflix, Inc.
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D11085
global to per-domain state. Protect reservations with the free lock
from the domain that they belong to. Refactor to make vm domains more
of a first class object.
Reviewed by: markj, kib, gallatin
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14000
This reduces noise when kernel is compiled by newer GCC versions,
such as one used by external toolchain ports.
Reviewed by: kib, andrew(sys/arm and sys/arm64), emaste(partial), erj(partial)
Reviewed by: jhb (sys/dev/pci/* sys/kern/vfs_aio.c and sys/kern/kern_synch.c)
Differential Revision: https://reviews.freebsd.org/D10385
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
Add a DDB command that mirrors sysctl debug.witness.badstacks.
Reapply r323935 after fixing trivial deficiency. I forgot to compile with
WITNESS enabled. Thanks emaste@ for fixing the build while I was asleep.
Reported by: rstone
Reviewed by: rstone (previous version)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D12468
When all instances of a lock type are destroyed (for example, after a
module unload), the corresponding witness entry remains associated with
that lock type. In this case, we shouldn't panic if a new instance of the
lock type is created and its lock class does not match that recorded in the
witness entry.
Reviewed by: jhb
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D11788
witness_warn() either breaks into the debugger or panics the system, so its
output should go to the console regardless of the witness(4) output channel
configuration.
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Add a table of vnode locks and use them along with bucketlocks to provide
concurrent modification support. The approach taken is to preserve the
current behaviour of the namecache and just lock all relevant parts before
any changes are made.
Lookups still require the relevant bucket to be locked.
Discussed with: kib
Tested by: pho
An array of bucket locks is added.
All modifications still require the global cache_lock to be held for
writing. However, most readers only need the relevant bucket lock and in
effect can run concurrently to the writer as long as they use a
different lock. See the added comment for more details.
This is an intermediate step towards removal of the global lock.
Reviewed by: kib
Tested by: pho
syscons spinlock for the output routine alone. It is better to extend
the coverage of the first syscons spinlock added in r162285. 2 locks
might work with complicated juggling, but no juggling was done. What
the 2 locks actually did was to cover some of the missing locking in
each other and deadlock less often against each other than a single
lock with larger coverage would against itself. Races are preferable
to deadlocks here, but 2 locks are still worse since they are harder
to understand and fix.
Prefer deadlocks to races and merge the second lock into the first one.
Extend the scope of the spinlocking to all of sc_cnputc() instead of
just the sc_puts() part. This further prefers deadlocks to races.
Extend the kdb_active hack from sc_puts() internals for the second lock
to all spinlocking. This reduces deadlocks much more than the other
changes increases them. The s/p,10* test in ddb gets much further now.
Hide this detail in the SC_VIDEO_LOCK() macro. Add namespace pollution
in 1 nested #include and reduce namespace pollution in other nested
#includes to pay for this.
Move the first lock higher in the witness order. The second lock was
unnaturally low and the first lock was unnaturally high. The second
lock had to be above "sleepq chain" and/or "callout" to avoid spurious
LORs for visual bells in sc_puts(). Other console driver locks are
already even higher (but not adjacent like they should be) except when
they are missing from the table. Audio bells also benefit from the
syscons lock being high so that audio mutexes have chance of being
lower. Otherwise, console drviver locks should be as low as possible.
Non-spurious LORs now occur if the bell code calls printf() or is
interrupted (perhaps by an NMI) and the interrupt handler calls
printf(). Previous commits turned off many bells in console i/o but
missed ones done by the teken layer.
This is useful in environments where system configuration is performed by
automated interaction with the system console, since unexpected witness
output makes such automation difficult. With this change, the new
debug.witness.output_channel sysctl allows one to specify that witness
output is to be printed to the kernel log (using log(9)) rather than the
console.
Reviewed by: cem, jhb
MFC after: 2 weeks
Relnotes: yes
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D4183
from x86 to use smp_ipi_mtx spin lock not only for smp_rendezvous_cpus()
but also for the MD cache invalidation, TLB demapping and remote register
reading IPIs due to the following reasons:
- The cross-IPI SMP deadlock x86 otherwise is subject to can't happen on
sparc64. That's because on sparc64, spin locks don't disable interrupts
completely but only raise the processor interrupt level to PIL_TICK. This
means that IPIs still get delivered and direct dispatch IPIs such as the
cache invalidation etc. IPIs in question are still executed.
- In smp_rendezvous_cpus(), smp_ipi_mtx is held not only while sending an
IPI_RENDEZVOUS, but until all CPUs have processed smp_rendezvous_action().
Consequently, smp_ipi_mtx may be locked for an extended amount of time as
queued IPIs (as opposed to the direct ones) such as IPI_RENDEZVOUS are
scheduled via a soft interrupt. Moreover, given that this soft interrupt
is only delivered at PIL_RENDEZVOUS, processing of smp_rendezvous_action()
on a target may be interrupted by f. e. a tick interrupt at PIL_TICK, in
turn leading to the target in question trying to send an IPI by itself
while IPI_RENDEZVOUS isn't fully handled, yet, and, thus, resulting in a
deadlock.
o As mentioned in the commit message of r245850, on least some sun4u platforms
concurrent sending of IPIs by different CPUs is fatal. Therefore, hold the
reintroduced MD ipi_mtx also while delivering cross-traps via MI helpers,
i. e. ipi_{all_but_self,cpu,selected}().
o Akin to x86, let the last CPU to process cpu_mp_bootstrap() set smp_started
instead of the BSP in cpu_mp_unleash(). This ensures that all APs actually
are started, when smp_started is no longer 0.
o In all MD and MI IPI helpers, check for smp_started == 1 rather than for
smp_cpus > 1 or nothing at all. This avoids races during boot causing IPIs
trying to be delivered to APs that in fact aren't up and running, yet.
While at it, move setting of the cpu_ipi_{selected,single}() pointers to
the appropriate delivery functions from mp_init() to cpu_mp_start() where
it's better suited and allows to get rid of the global isjbus variable.
o Given that now concurrent IPI delivery no longer is possible, also nuke
the delays before completely disabling interrupts again in the CPU-specific
cross-trap delivery functions, previously giving other CPUs a window for
sending IPIs on their part. Actually, we now should be able to entirely get
rid of completely disabling interrupts in these functions. Such a change
needs more testing, though.
o In {s,}tick_get_timecount_mp(), make the {s,}tick variable static. While not
necessary for correctness, this avoids page faults when accessing the stack
of a foreign CPU as {s,}tick now is locked into the TLBs as part of static
kernel data. Hence, {s,}tick_get_timecount_mp() always execute as fast as
possible, avoiding jitter.
PR: 201245
MFC after: 3 days
The number of available lock list entries for a thread is LOCK_CHILDCOUNT,
and each entry can record up to LOCK_NCHILDREN locks. When iterating over
the locks held by a thread, a bound on the loop index is therefore given
by LOCK_CHILDCOUNT * LOCK_NCHILDREN; WITNESS_COUNT is an unrelated
constant.
Reviewed by: jhb
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D2974
Lock order checking is done without the witness mutex held, so multiple
threads that are racing to establish a new lock order may read matrix
entries that are in an inconsistent state. Don't print a warning in this
case, but instead just redo the check after taking the witness lock.
Differential Revision: https://reviews.freebsd.org/D2713
Reviewed by: jhb
MFC after: 2 weeks
Sponsored by: EMC / Isilon Storage Division
The replacement started at r283088 was necessarily incomplete without
replacing boolean_t with bool. This also involved cleaning some type
mismatches and ansifying old C function declarations.
Pointed out by: bde
Discussed with: bde, ian, jhb
currently a spin lock. Apparently, the only reason for this is that
umtx_thread_exit() is called under the process spinlock, which put the
requirement on the umtx_lock. Note that the witness static order list
is wrong for the umtx_lock, umtx_lock is explicitely before any thread
lock, so it is also before sleepq locks.
Change umtx_lock to be the sleepable mutex. For the reason above, the
calls to umtx_thread_exit() are moved from thread_exit() earlier in
each caller, when the process spin lock is not yet taken.
Discussed with: jhb
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
1. Remove initializer for badstack_sbuf_size; it gets set unconditionally.
2. Remove meaningless comment.
3. Group witness_count and its sysctl together.
4. Fix spacing in for statements (space after for and within condition).
5. Change *all* M_NOWAIT usages in witness_initialize() to M_WAITOK; not
just those that were newly introduced -- the allocation is assumed to
succeed for all allocations.
6. Avoid using uint8_t as the base type in sizeof() expressions; Use the
variable name (w_rmatrix) as much as possible.
Pointed out by: jhb@ (thanks!)
the value without recompiling the kernel. This is useful when
recompiling is not possible as an immediate solution. When we run out
of witness objects, witness is completely disabled. Not having an
immediate solution can therefore be problematic.
Submitted by: Sreekanth Rupavatharam <rupavath@juniper.net>
Obtained from: Juniper Networks, Inc.
This includes:
o All directories named *ia64*
o All files named *ia64*
o All ia64-specific code guarded by __ia64__
o All ia64-specific makefile logic
o Mention of ia64 in comments and documentation
This excludes:
o Everything under contrib/
o Everything under crypto/
o sys/xen/interface
o sys/sys/elf_common.h
Discussed at: BSDcan
These changes prevent sysctl(8) from returning proper output,
such as:
1) no output from sysctl(8)
2) erroneously returning ENOMEM with tools like truss(1)
or uname(1)
truss: can not get etype: Cannot allocate memory
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.
Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.
MFC after: 2 weeks
Sponsored by: Mellanox Technologies
pmap pvlist locks which are scaled by MAXCPU.
This allows an amd64 system to boot with MAXCPU set
to 256, which is currently FreeBSD's hard limit without
x2apic support.
Compile-tested for other arch's.
PR: 185831
Discussed with: jhb
MFC after: 3 weeks
AppleTalk was a network transport protocol for Apple Macintosh devices
in 80s and then 90s. Starting with Mac OS X in 2000 the AppleTalk was
a legacy protocol and primary networking protocol is TCP/IP. The last
Mac OS X release to support AppleTalk happened in 2009. The same year
routing equipment vendors (namely Cisco) end their support.
Thus, AppleTalk won't be supported in FreeBSD 11.0-RELEASE.
is operational. init_sleepqueues() initializes 256 mutexes, which,
due to witness still being cold, started to overflow the pending_locks
array.
As stated in the reported panic message, increase WITNESS_PENDLIST
from 768 to 1024, which provides space for additional 256 locks.
Reported by: many
Tested by: rakuco, bdrewery