Some stack frames are too large for a store pair instruction we already
detect in the arm64 fbt code. Add support for handling subtracting the
stack pointer directly.
Sponsored by: Innovate UK
When searching for an instruction to patch out in the arm64 function
boundary trace we search for a store pair with a write back. This
instruction is commonly used to store two registers to the stack
and update the stack pointer to hold space for more.
This works in many cases, however not all functions use this, e.g.
when the stack frame is too large. In these cases we may find another
instruction of the same type that doesn't store through the stack
pointer. Filter these instructions out and assume if we see one we
are past the function prologue.
Reported by: rwatson
Sponsored by: Innovate UK
With newer AMD GPUs (>=Navi,Renoir) there is FPU context usage in the
amdgpu driver.
The `kernel_fpu_begin/end` implementations in drm did not even allow nested
begin-end blocks.
Submitted by: Greg V
Reviewed By: manu, hselasky
Differential Revision: https://reviews.freebsd.org/D28061
A driver can register a shrinker that will be called when the kernel
wants to free some memory.
Add support for that in linuxkpi and call the registered shrinkers
when the lowmem event is triggered.
Reviewed by: bz
Differential Revision: https://reviews.freebsd.org/D27728
-pci_get_class : This function search for a matching pci device based on
the class/subclass and returns a newly created pci_dev.
- pci_{save,restore}_state : This is analogous to ours with the same name
- pci_is_root_bus : Return true if this is the root bus
- pci_get_domain_bus_and_slot : This function search for a matching pci
device based on domain, bus and slot/function concat into a single
unsigned int (devfn) and returns a newly created pci_dev
- pci_bus_{read,write}_config* : Read/Write to the config space.
While here add some helper function to alloc and fill the pci_dev struct.
Reviewed by: hselasky, bz (older version)
Differential Revision: https://reviews.freebsd.org/D27550
pci_find_class_from help finding one or multiple device matching
a class and subclass.
If the from argument is not null we will first loop in the device list
until we find the matching device and only then start to check if the
class/subclass matches.
Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D27549
Even if sigfastblock block is non-zero, non-blockable signals must be
checked on ast and delivered now. This also affects debugger ability
to attach, because issignal() also calls ptracestop() if there is
a pending stop for debugee.
Instead of checking for sigfastblock, and either setting PENDING flag
for usermode or doing signal delivery loop, always do the loop after
checking, and then handle PENDING bit. issignal() already does the right
thing for fast-blocked case, allowing only STOPs and SIGKILL delivery to
happen.
Reported by: Vasily Postnicov <shamaz.mazum@gmail.com>, markj
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28089
User pending bit should not be set if kernel did not noted a pending signal.
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28089
Right now the routine leaves the current CPU in the map, later tripping
on an assert when filling in the scoreboard: panic: IPI scoreboard is
zero, initiator 1 target 1
Instead pre-check if all CPUs are present in the map and remember that
outcome for later.
Fixes: 7eaea04a5b ("amd64: compare TLB shootdown target to all_cpus")
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D28111
Previously, we would accept any kind of LIO_* opcode, including ones
that were intended for in-kernel use only like LIO_SYNC (which is not
defined in userland). The situation became more serious with
022ca2fc7f. After that revision, setting
aio_lio_opcode to LIO_WRITEV or LIO_READV would trigger an assertion.
Note that POSIX does not specify what should happen if aio_lio_opcode is
invalid.
MFC-with: 022ca2fc7f
Reviewed by: jhb, tmunro, 0mp
Differential Revision: <https://reviews.freebsd.org/D28078
On amd64, the pmap code passes all_cpus to
smp_targeted_tlb_shootdown() when unmapping from the
kernel pmap. This function has an optimized path to send IPIs
to all but itself, which it intends to do when the target
is all cpus. However, we need to compare the target cpu mask
with all_cpus, rather than using CPU_ISFULLSET(). Comparing with
CPU_ISFULLSET() will only work when we have MAXCPU cpus active in
the system, otherwise, we'll be sending repeated IPIs, rather than
a single IPI to all CPUs but ourself.
Fixing this should reduce the time spent in native_lapic_ipi_wait()
as we will be sending ipis in parallel, rather than one-by-one.
This is confirmed by dtrace.
Reviewed by: alc, jhb, kib, markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D28102
UFS uses a new "mntfs" pseudo file system which provides private
device vnodes for a file system to safely access its disk device.
The original device vnode is saved in um_odevvp to hold the exclusive
lock on the device so that any attempts to open it for writing will
fail. But it is otherwise unused and has its BO_NOBUFS flag set to
enforce that file systems using mntfs vnodes do not accidentally
use the original devfs vnode. When the file system is unmounted,
um_odevvp is no longer needed and is released.
The lock order reversal happens because device vnodes must be locked
before UFS vnodes. During unmount, the root directory vnode lock
is held. When when calling vrele() on um_odevvp, vrele() attempts to
exclusive lock um_odevvp causing the lock order reversal. The problem
is eliminated by doing a non-blocking exclusive lock on um_odevvp
which will always succeed since there are no users of um_odevvp.
With um_odevvp locked, it can be released using vput which does not
attempt to do a blocking exclusive lock request and thus avoids the
lock order reversal.
Sponsored by: Netflix
For rate-based resources that support throttling (e.g.
readiops/writeips), this fixes a divide-by-zero panic when rctl(8)
passes 0 as the throttle value. For these resources, treat
zero-throttle requests as requests to suspend forward progress as long
as possible using the duration specified in
kern.racct.rctl.throttle_max.
PR: 251803
Reported by: chris@cretaforce.gr
Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D27858
Relevant inet/inet6 code has the control over deciding what
the RIB lookup function currently is. With that in mind,
explicitly set it to the current value (rn_match) in the
datapath lookups. This avoids cost on indirect call.
Differential Revision: https://reviews.freebsd.org/D28066
After old vmspace is destroyed during execve(2), but before the new space
is fully constructed, an error during image activation cannot be returned
because there is no executing program to receive it.
In the relatively common case of failure to map stack, print some hints
on the control terminal. Note that user has enough knobs to cause stack
mapping error, and this is the most common reason for execve(2) aborting
the process.
Requested by: jhb
Reviewed by: emaste, jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28050
It is checked in vm_map_insert() and vm_map_protect() that PROT_WRITE |
PROT_EXEC are never specified together, if vm_map has MAP_WX flag set.
FreeBSD control flag allows specific binary to request WX exempt, and
there are per ABI boolean sysctls kern.elf{32,64}.allow_wx to enable/
disable globally.
Reviewed by: emaste, jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28050
Otherwise parallel pmap_allocpte_alloc() for nearby va might also fail
allocating page table page and free the page under us. The end result is
that we could dereference unmapped pte when doing cleanup after sleep.
Instead, on allocation failure, first free everything, only then we can
drop pmap mutex and sleep safely, right before returning to caller.
Split inner non-sleepable part of the pmap_allocpte_alloc() into a new
helper pmap_allocpte_nosleep().
Reviewed by: markj
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27956
The function performs actual allocation of pte, as opposed to
pmap_allocpte() that uses existing free pte if pt page is already
there. This also moves function out of namespace similar to a language
reserved.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27956
pmap_pdpe() might return NULL, check for it.
Reviewed by: markj
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27956
Currently default behaviour is to keep only 1 packet per unresolved entry.
Ability to queue more than one packet was added 10 years ago, in r215207,
though the default value was kep intact.
Things have changed since that time. Systems tend to initiate multiple
connections at once for a variety of reasons.
For example, recent kern/252278 bug report describe happy-eyeball DNS
behaviour sending multiple requests to the DNS server.
The primary driver for upper value for the queue length determination is
memory consumption. Remote actors should not be able to easily exhaust
local memory by sending packets to unresolved arp/ND entries.
For now, bump value to 16 packets, to match Darwin implementation.
The proper approach would be to switch the limit to calculate memory
consumption instead of packet count and limit based on memory.
We should MFC this with a variation of D22447.
Reviewers: #manpages, #network, bz, emaste
Reviewed By: emaste, gbe(doc), jilles(doc)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D28068
The cgem(4) driver was updated to support 64-bit bus addressing in
facdd1cd20. However, the committed version determines this in an
un-idiomatic way. Change the compile-time conditional to check
BUS_SPACE_MAXADDR, rather than comparing int and pointer sizes.
Reported by: jrtc27
- Implement a dtrace_getnanouptime(), matching the existing
dtrace_getnanotime(), to avoid DTrace calling out to a potentially
instrumentable function.
(These should probably both be under KDTRACE_HOOKS. Also, it's not clear
to me that they are correct implementations for the DTrace thread time
functions they are used in .. fixes for another commit.)
- Don't allow FBT to instrument functions involved in EL1 exception handling
that are involved in FBT trap processing: handle_el1h_sync() and
do_el1h_sync().
- Don't allow FBT to instrument DDB and KDB functions, as that makes it
rather harder to debug FBT problems.
Prior to these changes, use of FBT on FreeBSD/arm64 rapidly led to kernel
panics due to recursion in DTrace.
Reliable FBT on FreeBSD/arm64 is reliant on another change from @andrew to
have the aarch64 instrumentor more carefully check that instructions it
replaces are against the stack pointer, which can otherwise lead to memory
corruption. That change remains under review.
MFC after: 2 weeks
Reviewed by: andrew, kp, markj (earlier version), jrtc27 (earlier version)
Differential revision: https://reviews.freebsd.org/D27766
Use an interface compatible with the Linux one so that the user-space
libraries already using the Linux interface can be used without much
modifications.
This allows an open privcmd instance to limit against which domains it
can act upon.
Sponsored by: Citrix Systems R&D
Use an interface compatible with the Linux one so that the user-space
libraries already using the Linux interface can be used without much
modifications.
This allows user-space to make use of the dm_op family of hypercalls,
which are used by device models.
Sponsored by: Citrix Systems R&D
The interface is mostly the same as the Linux ioctl, so that we don't
need to modify the user-space libraries that make use of it.
The ioctl is just a proxy for the XENMEM_acquire_resource hypercall.
Sponsored by: Citrix Systems R&D
In e86bddea9f sys/netpfil/pf/pf.h grew a
declaration of pf_get_ruleset_number. Now delete the old declaration
from sys/net/pfvar.h.
Reviewed by: kp
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D28081
Restore the hwofs functionality temporarily disabled by
7ba6ecf216 to prevent issues with iflib.
This patch brings the necessary changes to iflib to
enable howfs to allow interface restarts without
disrupting netmap applications actively using its
rings.
After this change, it becomes possible for multiple
non-cooperating netmap applications to use non-overlapping
subsets of the available netmap rings without clashing
with each other.
PR: 252453
MFC after: 1 week
i386 codegen insists on preloading tc_priv into register on i386, and
this register cannot be %eax because RDTSCP instruction clobbers it
before it is used.
Reported and tested by: dim
MFC after: 6 days
Sponsored by: The FreeBSD Foundation
Add KASSERTS to nfsm_trimtrailing() to confirm the sanity of
the arguments for the M_EXTPG case.
Suggested by: kib
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D28053
Add 64-bit address support to Cadence CGEM Ethernet driver for use in
other SoCs such as the Zynq UltraScale+ and SiFive HighFive Unleashed.
Reviewed by: philip, 0mp (manpages)
Differential Revision: https://reviews.freebsd.org/D24304
When we already have the vm page in hand, use vm_page_domain() instead
of vm_phys_domain(). The former has a trivial constant-time
implementation whereas the latter iterates over the mem_affinity array.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D28005
The iflib_queues_alloc() allocates isc_nrxqs iflib_dma_info structs
for each rxqset, and links each struct to a different free list.
As a result, it must be isc_nrxqs >= isc_nfl (plus the completion
queue, if present).
Add an assertion to make this constraint explicit.
MFC after: 2 weeks
Since 1d238b07d5, krings are disabled before
a reinit cycle triggered by iflib_netmap_register.
However, this operation is actually necessary also for
any interface reinit triggered by other causes (i.e.,
ifconfig commands).
We achieve this goal by moving the krings enable/disable
operation inside iflib_stop() and iflib_init_locked().
Once here, this change also removes some redundant operations
from iflib_netmap_register(), that are already performed by
iflib_stop().
PR: 252453
MFC after: 1 week
Track the the current lock/reference state in a single variable,
rather than deducing the proper prison_deref() flags from a
combination of equations and hard-coded values.
Despite TMPFS_UNLOCK() is done in both paths later, unlocking not locked
mutex provides different failure mode.
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Use it in preference of Xfenced RDTSC if RDTSCP is supported. It is
recommended by both Intel and AMD. But, on AMD Zens and newer use
LFENCE, as recommended by AMD [*]. In particular, this means that now
AMD CPUs use more appropriate fence instead of too harsh MFENCe.
Add comment explaining the intent of the selection logic.
Reported by: gallatin [*]
Reviewed by: gallatin, markj
Tested by: gallatin, pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27986
Instead of trying to maintain pg_jobc counter on each process group
update (and sometimes before), just calculate the counter when needed.
Still, for the benefit of the signal delivery code, explicitly mark
orphaned groups as such with the new process group flag.
This way we prevent bugs in the corner cases where updates to the counter
were missed due to complicated configuration of p_pptr/p_opptr/real_parent
(debugger).
Since we need to iterate over all children of the process on exit, this
change mostly affects the process group entry and leave, where we need
to iterate all process group members to detect orpaned status.
(For MFC, keep pg_jobc around but unused).
Reported by: jhb
Reviewed by: jilles
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27871
This improves code structure and allows to put the lock asserts right
into place where the locks are needed.
Also move zeroing of the kinfo_proc structure from fill_kinfo_proc_only()
to fill_kinfo_proc(), this looks more symmetrical.
Reviewed by: jilles
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27871
Proctree lock is needed for correct calculation and collection of the
job-control related data in kinfo_proc. There was even an XXX comment
about it.
Satisfy locking and lock ordering requirements by taking proctree lock
around pass over each bucket in proc_iterate(), and in sysctl_kern_proc()
and note_procstat_proc() for individual process reporting.
Reviewed by: jilles
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27871
Increase the scope of the process group lock ownership. This ensures that
we are consistent in returning EIO for tty write from an orphan and delivery
of TTYOUT signals.
Reviewed by: jilles
Tested by: pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27871
Often, we have a process locked and need to get locked process group.
In this case, because progress group lock is before process lock,
unlocking process allows the group to be freed. See for instance
tty_wait_background().
Make pgrp structures allocated from nofree zone, and ensure type stability
of the pgrp mutex.
Reviewed by: jilles
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27871
Similarly to what done for iflib in 1d238b07d5,
this patch prevents access to the krings during the interface
reset triggered by netmap_register().
MFC after: 1 week
The netmap_reset() function is meant to be called by the driver
when they initialize (or re-initialize) a hardware ring.
However, since the introduction of support for opening (in
netmap mode) a subset of the available rings, netmap_reset()
may be called multiple times on actively used rings, causing
both kring and netmap ring to transition to an inconsistent
state.
This changes improves the situation by resetting all the
indices fields of the kring to 0, as expected after the
reinitialization of a hardware ring.
PR: 252518
MFC after: 1 week
When netmap_fl_refill() is called at initialization time (e.g.,
during netmap_iflib_register()), nic_i must be 0, since the
free list is reinitialized. At the end of the refill cycle, nic_i
must still be zero, because exactly N descriptors (N is the ring size)
are refilled.
This patch therefore fixes the assertions to check on nic_i rather
than on nm_i. The current netmap_reset() may in fact cause nm_i
to be != 0 while the device is resetting: this may happen when
multiple non-cooperating processes open different subsets of the
available netmap rings.
PR: 252518
MFC after: 1 week
When different processes open separate subsets of the
available rings of a same netmap interface, a device
reset may be performed while one of the processes
is actively using some rings (e.g., caused by another
process executing a nmport_open()).
With this patch, such situation will cause the
active process to get a POLLERR, so that it can
have a chance to detect the situation.
We also guarantee that no process is running a txsync
or rxsync (ioctl or poll) while an iflib device reset
is in progress.
PR: 252453
MFC after: 1 week
This imposes a fairly severe limitation on space available for mmap that
was not noticed prior to commit. Unfixed mmap will only map from
[data + MAXSIZE, end of user VA space], bringing the amount of usable space
down way too low for non-trivial link jobs (for instance).
Reported by: mmel
When using NOTE_NSECONDS in the kevent(2) API, US_TO_SBT should be
used instead of NS_TO_SBT, otherwise the timeout results are
misleading.
PR: 252539
Reviewed by: kevans, kib
Approved by: kevans
MFC after: 3 weeks
This module intended to measure performance of routing lookups.
Uses a list of IP addresses specified by sysctl one-by-one.
Performance testing is triggered by changing sysctl OID with a number of lookups to execute.
Lookups are done by the chunks of 10K routes, entering/exiting epoch on
chunk granularity to amortise cost.
Example:
make -C sys/modules/test/fib_lookup unload load
for i in `cat ~/ip4.txt`; do sysctl net.route.test.add_inet_addr=$i; done
for i in `cat ~/ip6.txt`; do sysctl net.route.test.add_inet6_addr=$i; done
sysctl net.route.test.run_inet=10000000
dmesg | tail
Dec 13 23:24:05 current kernel: 10000000 packets in 417240173 nanoseconds, 23967011 pps
Dec 13 23:24:06 current kernel: run: 10000000 packets vnet 0xfffff80003073f00
Dec 13 23:24:07 current kernel: 10000000 packets in 423086254 nanoseconds, 23635842 pps
Differential Revision: https://reviews.freebsd.org/D27604
This change introduces loadable fib lookup modules based on
DPDK rte_lpm lib targeted for high-speed lookups in large-scale tables.
It is based on the lookup framework described in D27401.
IPv4 module is called dpdk_lpm4. It wraps around rte_lpm [1] library.
This library implements variation of DIR24-8 [2] lookup algorithm.
Module provide lockless route lookups and in-place incremental updates,
allowing for good RIB performance.
IPv6 module is called dpdk_lpm6. It wraps around rte_lpm6 [3] library.
Implementation can be seen as multi-bit trie where the stride or number of bits
inspected on each level varies from level to level.
It can vary from 1 to 14 memory accesses, with 5 being the average value
for the lengths that are most commonly used in IPv6.
Module provide lockless route lookups for global unicast addresses
and in-place incremental updates, allowing for good RIB performance.
Implementation details:
* wrapper code lives in `sys/contrib/dpdk_rte_lpm/dpdk_lpm[6].c`.
* rte_lpm[6] implementation contains both RIB and FIB code.
. RIB ("rule_") code, backed by array of hash tables part has been commented out,
as base radix already provides all the necessary primitives.
* link-local lookups are currently implemented as base radix lookup.
This part should be converted to something like read-only radix trie.
Usage detail:
Compile kernel with option FIB_ALGO and load dpdk_lpm4/dpdk_lpm6
module at any time. They will be picked up automatically when
amount of routes raises to several thousand.
[1]: https://doc.dpdk.org/guides/prog_guide/lpm_lib.html
[2]: http://yuba.stanford.edu/~nickm/papers/Infocom98_lookup.pdf
[3]: https://doc.dpdk.org/guides/prog_guide/lpm6_lib.html
Differential Revision: https://reviews.freebsd.org/D27412
The NVMe byte-swap routines for big-endian platforms used memcpy() to
move the unaligned 64-bit value into a temp register to byte swap it.
Instead of introducing a dependency, manually byte-swap the values in
place.
Point hat: me
A test of this is funcs/tst.strtok.d which has this filter:
BEGIN
/(this->field = strtok(this->str, ",")) == NULL/
{
exit(1);
}
The test will randomly fail with exit status of 1 indicating that this->field
was NULL even though printing it out shows it is not.
This is compiled to the DTrace instruction set:
// Pushed arguments not shown here
// call strtok() and set result into %r1
07: 2f001f01 call DIF_SUBR(31), %r1 ! strtok
// set thread local scalar this->field from %r1
08: 39050101 stls %r1, DT_VAR(1281) ! DT_VAR(1281) = "field"
// Prepare for the == comparison
// Set right side of %r2 to NULL
09: 25000102 setx DT_INTEGER[1], %r2 ! 0x0
// string compare %r1 (strtok result) to %r2
10: 27010200 scmp %r1, %r2
In this case only %r1 is loaded with a string limit set to lim1. %r2 being
NULL does not get loaded and does not set lim2. Then we call dtrace_strncmp()
with MIN(lim1, lim2) resulting in passing 0 and comparing neither side.
dtrace_strncmp() handles this case fine and it already has been while
being lucky with what lim2 was [un]initialized as.
Reviewed by: markj, Don Morris <dgmorris AT earthlink.net>
Sponsored by: Dell EMC
Differential Revision: https://reviews.freebsd.org/D27671
Add an additional set of braces to clarify intention. The '&' operator
has a higher precedence than '|', but the reader may not always remember
this. No functional change.
This is just like debug.kdb.panic, except the string that's passed in
is reported in the panic message. This allows people with automated
systems to collect kernel panics over a large fleet of machines to
flag panics better. Strings like "Warner look at this hang" or "see
JIRA ABC-1234 for details" allow these automated systems to route the
forced panic to the appropriate engineers like you can with other
types of panics. Other users are likely possible.
Relnotes: Yes
Sponsored by: Netflix
Reviewed by: allanjude (earlier version)
Suggestions from review folded in by: 0mp, emaste, lwhsu
Differential Revision: https://reviews.freebsd.org/D28041
Stop trying to manually calculate RID, which cannot be done correctly
by PCI_DEVFN(). Use PCI_GET_RID() method instead.
Do not use pci_find_dbsf() to go from the linux pci_dev to freebsd
device_t. First, device is readily available as dev.bsddev. Second,
using pci_find_dbsf() fails for ARI-enabled functions with large
function numbers, because PCI_SLOT()/PCI_FUNC() are for non-ARI.
Reviewed by: bz, hselasky, manu
Tested by: manu (drm)
Sponsored by: Mellanox Technologies/NVidia Networking
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D27960
Everything required for remote kernel debugging over a serial
connection. For FDT-based systems, a debug port can be specified by
setting hw.fdt.dbgport to the desired device tree node in loader.conf.
For example, hw.fdt.dbgport="uart1", or
hw.fdt.dbgport="serial@ff1a0000".
Looks good: emaste
Tested by: rwatson
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27727
The program counter field in the PCB is written in exactly one place,
makectx(), upon entry to the debugger. For threads other than curthread,
its value will be empty, or bogus. Rather than writing to this field in
more places, it can be removed in favor of using the value in the link
register.
To make this clearer, pcb->pcb_x[30] is renamed to pcb->pcb_lr, similar
to what already exists in struct trapframe. Also, prefer lr to x30 in
assembly, as it better conveys intention.
This improves PC_REGS() for kdb_thread != curthread. It is required for
a functional gdb(4) stub, fixing the output of `info threads`, in
particular.
The space occupied by pcb_pc is retained, for compatibility with kgdb.
Reviewed by: markj, jhb
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27720
This effectively undoes the changes made in r321571. While useful, it is
inconsistent with how other architectures pass trapframes to kdb. This
change is also required to get a working gdb(4) stub on arm64, as
otherwise the backtrace will begin too early.
As of 088a7eef95, this information can still be obtained via
"show registers/u".
Reviewed by: jhb (slightly earlier version)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Pull Request: https://reviews.freebsd.org/D27719
The debugger is always entered after some kind of kernel trap, often a
breakpoint in kdb_enter(). This means that the most recent trapframe
will include kernel state at the time of the trap, when often it is
desirable to the developer to view the contents of the previous
trapframe. This trapframe often corresponds to the entry from userspace.
The ddb(4) man page claims the ability to display user register state
via the 'u' modifier to `show registers`, but this appears untrue. It is
not obvious from a quick search of the history when this feature was
added, or when it was removed. (Re)implement this feature in
db_show_regs, noting that it is not necessarily populated with userspace
state.
Reviewed by: jhb (earlier version), markj, bcr (manpages)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27705
Ext_pg mbufs allow carrying multiple pages per mbuf. This
reduces mbuf linked list traversals, especially in socket
buffers, thereby reducing cache misses and CPU use for
applications using sendfile. Note that ext_pages use
unmapped pages, eliminating KVA mapping costs on 32-bit
platforms.
Ext_pg mbufs are also required for ktls (KERN_TLS), and having
them disabled by default is a stumbling block for those
wishing to enable ktls.
Reviewed-by: jhb, glebius
Sponsored by: Netfix
The device mapping table contains sc->max_devices entries, so only
indices in [0, sc->max_devices) are valid.
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27964
Previously we copied in the request into a stack-allocated structure
that could be smaller than the request size. Furthermore, we checked
the request size only after doing the copyin.
Fix this by allocating a buffer to hold the request, then copying the
buffer's contents into a command descriptor. This is a bit heavy-handed
but I expect the overhead will not be noticeable. The approach of
coping the header in first is susceptible to TOCTOU problems.
Reviewed by: imp
Reported by: maxpl0it@protonmail.com
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27963
safexcel_ring_intr() could fail to observed that sc_blocked is set after
completing all outstanding ops for a ring, in which case blocked ops
would be deferred forever.
Request structures are managed by individual rings, so move the
"blocked" flag into the per-ring state block and use the ring lock to
synchronize with safexcel_process(). Remove sc_mtx since it is now
unused.
MFC after: 3 days
Sponsored by: Rubicon Communications, LLC (Netgate)
mtx_init() does not make a copy of the name so the buffer must be valid
for the lifetime of the driver instance. Store each ring's lock's name
in the ring structure.
MFC after: 3 days
Sponsored by: Rubicon Communications, LLC (Netgate)
The new UAR API already offsets the UAR map pointer the mlx5en(4) is using.
While at it remove some no longer needed variables for keeping track
of the current BF offset.
This fixes a regression issue after the new UAR allocation APIs
were introduced.
MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking
Add decoding of the Device Self-test log page and the ability to start
or abort a test.
Reviewed by: imp, mav
Tested by: Muhammad Ahmad <muhammad.ahmad@seagate.com>
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D27517
This ioctl would instantly induce a panic, likely since near inception, up
until 0861c7d3e0. Lack of previous interest in fixing it combined with
the problematic interface (exports a pointer, really a physical address)
brings us to the natural conclusion: remove it until a useful consumer
forward.
If it eventually gets resurrected, the interface should definitely not
return in this exact form and likely needs to be reimagined.
The associated KPI, efi_get_table, is left intact for the time being.
Reviewed by: imp, jrtc27
Also discussed with: brooks, jhb
Differential Revision: https://reviews.freebsd.org/D28030
Virtual PMCs could be running on multiple CPUs so this needs to be
a per-CPU value.
Submitted by: rwatson (earlier version)
Reviewed by: gnn
Sponsored by: Innovate UK
Differential Revision: https://reviews.freebsd.org/D27973
When testing hwpmc on arm64 we found the counter could overflow while
reading the event count. Handle this case in the armv7 code by also
checking if the overflow bit is set and incrementing the overflow
cound as needed.
Sponsored by: Innovate UK
Differential Revision: https://reviews.freebsd.org/D27969
This change include several changes as listed below all related to UAR.
UAR is a special PCI memory area where the so-called doorbell register and
blue flame register live. Blue flame is a feature for sending small packets
more efficiently via a PCI memory page, instead of using PCI DMA.
- All structures and functions named xxx_uuars were renamed into xxx_bfreg.
- Remove partially implemented Blueflame support from mlx5en(4) and mlx5ib.
- Implement blue flame register allocator.
- Use blue flame register allocator in mlx5ib.
- A common UAR page is now allocated by the core to support doorbell register
writes for all of mlx5en and mlx5ib, instead of allocating one UAR per
sendqueue.
- Add support for DEVX query UAR.
- Add support for 4K UAR for libmlx5.
Linux commits:
7c043e908a74ae0a935037cdd984d0cb89b2b970
2f5ff26478adaff5ed9b7ad4079d6a710b5f27e7
0b80c14f009758cefeed0edff4f9141957964211
30aa60b3bd12bd79b5324b7b595bd3446ab24b52
5fe9dec0d045437e48f112b8fa705197bd7bc3c0
0118717583cda6f4f36092853ad0345e8150b286
a6d51b68611e98f05042ada662aed5dbe3279c1e
MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking
- call pci_iov_detach() on detaching from PCI device to take care of hang
on destroying VFs after PF is down.
- disable eswitch SRIOV support right after pci_iov_detach(),
else the eswitch cleanup sometimes occur while the SRIOV flow table
is still present.
Submitted by: kib@
MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking
O_RESOLVE_BENEATH recently took value 0x00800000, but I failed to spot
that while rebasing. Let's use 0x01000000 for the new O_DSYNC flag.
Reported by: kevans
Remove wi(4). pccard is going away, and wi only supports PC Card
devices, though it has a minor amount of glue to also support
PCI cards. However, removing the one without removing the other
is hard, so the whole driver is being removed.
Relnotes: Yes
pccard is being removed, so remove bt3c driver since it only has PC
Card attachment. Also remove bt3cfw(8) since it's the firmware for this
driver.
Relnotes: Yes
PC Card support is being removed, so remove its attachment here. ndis
is slated to be removed entirely for 13, but that's not been done yet.
Relnotes: Yes
The originally chosen numbers interfere with downstream projects'
syscalls. Move them to the end of the syscall table instead.
Reported by: jrtc27
Reviewed by: brooks
MFC-With: 022ca2fc7f
Differential Revision: 022ca2fc7f
aio_fsync(O_DSYNC, ...) is the asynchronous version of fdatasync(2).
Reviewed by: kib, asomers, jhb
Differential Review: https://reviews.freebsd.org/D25071
Respect the new IO_DATASYNC flag when performing synchronous writes.
Compared to O_SYNC, O_DSYNC lets us skip updating the inode in some
cases, matching the behaviour of fdatasync(2).
Reviewed by: kib
Differential Review: https://reviews.freebsd.org/D25160
POSIX O_DSYNC means that writes include an implicit fdatasync(2), just
as O_SYNC implies fsync(2).
VOP_WRITE() functions that understand the new IO_DATASYNC flag can act
accordingly, but we'll still pass down IO_SYNC so that file systems that
don't understand it will continue to provide the stronger O_SYNC
behaviour.
Flag also applies to fcntl(2).
Reviewed by: kib, delphij
Differential Revision: https://reviews.freebsd.org/D25090
This change includes:
hpen - Generic / MS Windows compatible HID pen tablet driver.
hgame - Generic game controller and joystick driver.
xb360gp - Xbox360-compatible game controller driver.
Submitted by: Greg V <greg_unrelenting.technology>
Reviewed by: hselasky (as part of D27993)
hidmap is a kernel module that maps HID input usages to evdev events.
Following dependent drivers is included in the commit:
hms - HID mouse driver.
hcons - Consumer page AKA Multimedia keys driver.
hsctrl - System Controls page (Power/Sleep keys) driver.
ps4dshock - Sony DualShock 4 gamepad driver.
Reviewed by: hselasky
Differential revision: https://reviews.freebsd.org/D27993
which installs /dev/uhid# alias to hidraw character device for
compatibility with some existing uhid(4) users like Firefox.
As side effect it renames traditional uhid(4) driver to hidraw
to make possible using of common unit number allocator.
Requested by: Greg V <greg_unrelenting.technology>
Reviewed by: hselasky (as part of D27992)
This driver provides raw access to HID devices through uhid(4)-compatible
interface and is based on pre-8.x uhid(4) code. Unlike uhid(4) it does
not take devices in to monopoly ownership and allows parallel access
from other drivers.
hidraw supports Linux's hidraw-compatible interface as well.
Reviewed by: hselasky
Differential revision: https://reviews.freebsd.org/D27992
This allows to mark HID-device interrupt handlers as MP-SAFE.
Atomics-based lockless key event queue with swi_giant taskqueue is used
to pass key-press events into Giant-protected system console.
Reviewed by: hselasky (as part of D27991)
This change implements hid_if.m methods for HID-over-USB protocol [1].
Also, this change adds USBHID_ENABLED kernel option which changes
device_probe() priority and adds/removes PnP records to prefer usbhid
over ums, ukbd, wmt and other USB HID device drivers and vice-versa.
The module is based on uhid(4) driver. It is disabled by default for
now due to conflicts with existing USB HID drivers.
[1] https://www.usb.org/sites/default/files/hid1_11.pdf
Reviewed by: hselasky
Differential revision: https://reviews.freebsd.org/D27893
hidquirk(4) is derived from usb_quirk(4) and inherits all its HID-related
functionality. It does not support ioctl(2) interface yet.
Reviewed by: hselasky
Differential revision: https://reviews.freebsd.org/D27890
This driver provides support for multiple HID driver attachments
to single HID transport backend. This ability existed in Net/OpenBSD
(uhidev and ihidev drivers) but has never been ported to FreeBSD.
Unlike Net/OpenBSD we do not use report number alone to distinct report
source but we follow MS way and use a top level collection (TLC) usage
index that report belongs to as a location key.
The driver performs child device autodiscovery based on HID report
descriptor data, proxying of HID requests from child devices to parent
transport backends and broadcasting of interrupts in backward direction.
Differential revision: https://reviews.freebsd.org/D27888
Create an abstract HID interface that provides hardware independent
access to HID capabilities and functions through the device tree.
hid_if.m resembles existing USBHID KPI and consist of next methods:
HID method USBHID variant
-----------------------------------------------------------------------
hid_intr_setup usbd_transfer_setup (INTERRUPT IN xfer)
hid_intr_unsetup usbd_transfer_unsetup (INTERRUPT IN xfer)
hid_intr_start usbd_transfer_start (INTERRUPT IN xfer)
hid_intr_stop usbd_transfer_drain (INTERRUPT IN xfer)
hid_intr_poll usbd_transfer_poll (INTERRUPT IN xfer)
hid_get_rdesc usbd_req_get_report_descriptor
hid_read No direct analog. Not intended for common use.
hid_write uhid(4) write()
hid_get_report usbd_req_get_report
hid_set_report usbd_req_set_report
hid_set_idle usbd_req_set_idle
hid_set_protocol usbd_req_set_protocol
This change is part of D27888
Also hide shim code added in a previous commit under COMPAT_USBHID12.
Note: it is enough to add -DCOMPAT_USBHID12 to CFLAGS to compile old
code with new HID subsystem, but it is not enough to link it at runtime.
HID dependency has to be added explicitly with MODULE_DEPEND macro.
Reviewed by: manu, hselasky (as part of D27887)
This does an import of quirk stubs, debugging macros from USB code and
numerous usage constants used by dependent drivers.
Besides, this change renames some functions to get a better matching
with userland library and NetBSD/OpenBSD HID code. Namely:
- Old hid_report_size() renamed to hid_report_size_max()
- New hid_report_size() calculates size of given report rather than
maximum size of all reports.
- hid_get_data_unsigned() renamed to hid_get_udata()
- hid_put_data_unsigned() renamed to hid_put_udata()
Compat shim functions are provided in usbhid.h to make possible compile
of legacy code unmodified after this change.
Reviewed by: manu, hselasky
Differential revision: https://reviews.freebsd.org/D27887
It will be used by the upcoming HID-over-i2C implementation. Should be
no-op, except hid.ko module dependency is to be added to affected drivers.
Reviewed by: hselasky, manu
Differential revision: https://reviews.freebsd.org/D27867
It is possible that the client list lock is taken by other process for too
long due to e.g. IO timeouts. Allow user to terminate open() in this case.
Reviewed by: markj (as part of D27865)
At the beginning of evdev there was a LOR between hardware driver's and
evdev client list locks as they were taken in different order at
driver's interrupt and evdev open()/close() handlers.
The LOR was fixed with introduction of evdev_register_mtx() function
which allowed to use a hardware driver's lock as evdev client list lock.
While this works good with PS/2 and USB, this does not work with I2C.
Unlike PS/2 and USB, I2C open()/close() handlers do unbound sleeps
while waiting for I2C bus to release and while performing IO.
This change uses epoch(9) for traversing evdev client list in interrupt
handler to avoid the LOR thus making possible to convert evdev client
list lock to sleepable sx.
While here add brief locking protocol description.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D27865
hid_locate() currently ignores all HID items which tagged as constant,
i.e. bit 0 of main item data is set to 1. See p.6.2.2.4 of
hid1_11.pdf [1]. Such an items are unconditionally treated as
byte-alignment padding. While that may be right decision for input and
output reports that is wrong for features reports. Feature reports can
contain constant capabilities e.g. 'Contact Count Maximum'.
See: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=232040
Remove check for constant from hid_locate() to make possible parsing of
such a reports.
[1] https://www.usb.org/sites/default/files/documents/hid1_11.pdf
Reviewed by: hselasky
Obtained from: sysutils/iichid
Differential Revision: https://reviews.freebsd.org/D27747
This was overlooked in the pfi_kkif/pfi_kif splitup and as a result
userspace could no longer tell which interfaces had the skip flag
applied.
MFC after: 2 weeks
Doing a 'dd' over iscsi will reliably cause stalls. Tx
cleaning _should_ reliably happen as data is sent.
However, currently if the transmit queue fills it will
wait until the iflib timer (hz/2) runs.
This change causes the the tx taskq thread to be run
if there are completed descriptors.
While here:
- make timer interrupt delay a sysctl
- simplify txd_db_check handling
- comment on INTR types
Background on the change:
Initially doorbell updates were minimized by only writing to the register
on every fourth packet. If txq_drain would return without writing to the
doorbell it scheduled a callout on the next tick to do the doorbell write
to ensure that the write otherwise happened "soon". At that time a sysctl
was added for users to avoid the potential added latency by simply writing
to the doorbell register on every packet. This worked perfectly well for
e1000 and ixgbe ... and appeared to work well on ixl. However, as it
turned out there was a race to this approach that would lockup the ixl MAC.
It was possible for a lower producer index to be written after a higher one.
On e1000 and ixgbe this was harmless - on ixl it was fatal. My initial
response was to add a lock around doorbell writes - fixing the problem but
adding an unacceptable amount of lock contention.
The next iteration was to use transmit interrupts to drive delayed doorbell
writes. If there were no packets in the queue all doorbell writes would be
immediate as the queue started to fill up we could delay doorbell writes
further and further. At the start of drain if we've cleaned any packets we
know we've moved the state machine along and we write the doorbell (an
obvious missing optimization was to skip that doorbell write if db_pending
is zero). This change required that tx interrupts be scheduled periodically
as opposed to just when the hardware txq was full. However, that just leads
to our next problem.
Initially dedicated msix vectors were used for both tx and rx. However, it
was often possible to use up all available vectors before we set up all the
queues we wanted. By having rx and tx share a vector for a given queue we
could halve the number of vectors used by a given configuration. The problem
here is that with this change only e1000 passed the necessary value to have
the fast interrupt drive tx when appropriate.
Reported by: mav@
Tested by: mav@
Reviewed by: gallatin@
MFC after: 1 month
Sponsored by: iXsystems
Differential Revision: https://reviews.freebsd.org/D27683
vn_rdwr() must lock the entire file range for IO_APPEND
just like vn_io_fault() does for O_APPEND.
Reviewed by: kib, imp, mckusick
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D28008
Only ACPI attachment is supported for now, some others depend on the
presence of smbios(4) support, which we lack on arm64.
Reviewed by: emaste
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28009
A straightforward(ish) port from aesni(4). This implementation does not
perform loop unrolling on the input blocks, so this is left as a future
performance improvement.
Submitted by: Greg V <greg AT unrelenting.technology>
Looks good: jhb, jmg
Tested by: mhorne
Differential Revision: https://reviews.freebsd.org/D21017
In the conversion into a tunable, we converted the
size of the s/g list used by the driver to be based
off of a hardcoded size of 128k rather than maxphys,
this caused performance problems for us. Revert this
to use the maxphys tunable.
Note that this constant is used to size dynamically allocated
things, and not static data structs, so this is safe.
Reviewed By: imp, kib, mav
Tested By:i dhw
Differential Revision: https://reviews.freebsd.org/D28023
Sponsored by: Netflix
No functional change - only moved lines, changed whitespace, and
updated comments.
Reviewed by: allanjude
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28001
Code changes in this commit were obtained from straight from OpenBSD's
uplcom.c with almost no modification, the list of chip names and USB
IDs was obtained from Linux.
Differential Revision: https://reviews.freebsd.org/D27952
Submitted by: tomli_tomli.me (Yifeng Li)
MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking
This makes the minimum amount of changes to allow inclusion of dtrace.h
without all the solaris compatibility headers. Installing dtrace.h allows
compiling consumers of libdtrace (e.g. https://github.com/tmetsch/python-dtrace)
without requiring a copy of the source tree.
For python-dtrace I worked around this in 58019c9a12
but being able to build the library without installed sources would be
extremely useful.
Reviewed By: gnn
Differential Revision: https://reviews.freebsd.org/D27884
There is no reason this driver can't return default probe value.
Submitted by: Artur Rojek <ar@semihalf.com>
Reviewed by: emaste, mmel
Obtained from: Semihalf
Sponsored by: Alstom Group
Differential Revision: https://reviews.freebsd.org/D26869
Replace various hw reg bit set/clear helpers with a universal
`qoriq_gpio_set` function.
Submitted by: Artur Rojek <ar@semihalf.com>
Reviewed by: mmel
Obtained from: Semihalf
Sponsored by: Alstom Group
Differential Revision: https://reviews.freebsd.org/D26868
Make the code more conformant to style(9) and improve the general
readability.
This patch does not alter the driver logic.
Submitted by: Artur Rojek <ar@semihalf.com>
Reviewed by: mmel
Obtained from: Semihalf
Sponsored by: Alstom Group
Differential Revision: https://reviews.freebsd.org/D26867
The routine can be used when the caller does not expect to ever have to
wait for anything. Checking later with seqc_consistent retains all the
guarantees.