The VT-x VMCS only stores the base address of the GDTR and IDTR. As a
result, VM exits use a fixed limit of 0xffff for the host GDTR and
IDTR, losing the smaller limits set when the initial GDT is loaded
on each CPU during boot. Explicitly save and restore the full GDTR
and IDTR contents around VM entries and exits to restore the correct
limit.
Similarly, explicitly save and restore the LDT selector. VM exits
always clear the host LDTR as if the LDT was loaded with a NULL
selector and a userspace hypervisor is probably using a NULL selector
anyway, but save and restore the LDT explicitly just to be safe.
PR: 230773
Reported by: John Levon <levon@movementarian.org>
Reviewed by: kib
Tested by: araujo
Approved by: re (rgrimes)
MFC after: 1 week
resilver (r334844)
MFV/ZoL: Fix deadlock in IO pipeline
commit a76f3d0437
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date: Fri Mar 16 16:46:06 2018 -0700
Fix deadlock in IO pipeline
In vdev_queue_aggregate() the zio_execute() bypass should not be
called under the vdev queue lock. This can result in a deadlock
as shown in the stack traces below.
Drop the vdev queue lock then walk the parents of the aggregate IO
to determine the list of component IOs to be bypassed. This can
be done safely without holding the io_lock since the new aggregate
IO has not yet been returned and its parents cannot change.
--- THREAD 1 ---
arc_read()
zio_nowait()
zio_vdev_io_start()
vdev_queue_io() <--- mutex_enter(vq->vq_lock)
vdev_queue_io_to_issue()
vdev_queue_aggregate()
zio_execute()
vdev_queue_io_to_issue()
vdev_queue_aggregate()
zio_execute()
zio_vdev_io_assess()
zio_wait_for_children() <- mutex_enter(zio->io_lock)
--- THREAD 2 --- (inverse order)
arc_read()
zio_change_priority() <- mutex_enter(zio->zio_lock)
vdev_queue_change_io_priority() <- mutex_enter(vq->vq_lock)
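The fix pattern described above can be sketched in plain C. This is only
an illustration: struct queue, struct io, io_execute() and bypass_all()
are hypothetical stand-ins, not ZFS code, and the lock is modeled as a
simple flag so the ordering can be checked.

```c
#include <stddef.h>

/* Hypothetical stand-ins; none of these names come from the ZFS sources. */
struct queue {
	int locked;		/* 1 while the queue lock is "held" */
};

struct io {
	struct io *next;	/* next component I/O to bypass */
	int done;
	int ran_unlocked;	/* set if the lock was free when executed */
};

/* Models the bypass call; it records whether the queue lock was held. */
static void
io_execute(struct queue *q, struct io *io)
{
	io->ran_unlocked = !q->locked;
	io->done = 1;
}

/*
 * Build the aggregate under the lock, drop the lock, and only then walk
 * the component list and execute each entry.
 */
static int
bypass_all(struct queue *q, struct io *head)
{
	int n = 0;

	q->locked = 1;		/* models mutex_enter(vq->vq_lock) */
	/* ... aggregation work that needs the lock would happen here ... */
	q->locked = 0;		/* models mutex_exit(vq->vq_lock) */

	for (struct io *io = head; io != NULL; io = io->next) {
		io_execute(q, io);
		n++;
	}
	return (n);
}
```

Because no execute call runs with the queue lock held, a second thread
taking the two locks in the opposite order cannot deadlock against it.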
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reported by: ZFS Leadership Meeting
Reviewed by: mav
Approved by: re (kib)
Obtained from: ZFS-on-Linux
MFC after: 2 weeks
Sponsored by: Klara Systems
Differential Revision: https://reviews.freebsd.org/D17495
This connects new tunables that were added but not exposed in:
r329502 (zpool remove)
r337007 (zpool initialize)
Reviewed by: avg
Approved by: re (kib)
MFC after: 2 weeks
Sponsored by: Klara Systems
Differential Revision: https://reviews.freebsd.org/D17494
This is caused by a deadlock between zil_commit() and zfs_zget()
Add a way for zfs_zget() to break out of the retry loop in the common case
PR: 229614
Reported by: grembo, Andreas Sommer, many others
Tested by: Andreas Sommer, Vicki Pfau
Reviewed by: avg (no objection)
Approved by: re (gjb)
MFC after: 2 months
Sponsored by: Klara Systems
Differential Revision: https://reviews.freebsd.org/D17460
spa_condense_indirect_thread() is no longer a thread function, but just
a callback for new zthr KPI.
Submitted by: allanjude
Approved by: re (gjb)
MFC after: 3 days
Recent changes in Linux updated Marvell Armada 38x
UART compatible string. As a result the FreeBSD driver
(uart_dev_snps) does not probe. This commit fixes the
situation without any functional change to the driver methods.
Approved by: re (kib)
Obtained from: Semihalf
- Update OpenSSL to version 1.1.1.
- Update Kerberos/Heimdal API for OpenSSL 1.1.1 compatibility.
- Bump __FreeBSD_version.
Approved by: re (kib)
Sponsored by: The FreeBSD Foundation
epoch section without exiting that epoch section. This is bad for two
reasons: the epoch section won't exit, and we will leave the epoch tracker
from the stack on the epoch list.
Fix the epoch leak by making sure we exit epoch sections before returning.
Reviewed by: ae, gallatin, mmacy
Approved by: re (gjb, kib)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D17450
The pNFS server would report the total disk space used and free for all
of the DSs, even when certain DSs are assigned to the file system via
the "#<path>" suffix used in the "nfsd -p" option argument.
This patch fixes this case. It only reports usage for the file system
that the argument vnode resides on. This is consistent with the non-pNFS
NFSv4 server. In NFSv4 it is possible to have subtrees on other file
systems, but these are not included in the usage information for NFSv4.
Approved by: re (gjb)
r339213 was cherry-picked back to head from the project branch, which
caused a conflict. This commit properly records the mergeinfo from
head.
r339205 was missed, and r339214 is required for reintegration.
Sponsored by: The FreeBSD Foundation
When acting as a VF it is required to add steering rules for all unicast
addresses, even if promiscuous mode is selected. Otherwise, incoming data packets
will be dropped.
MFC after: 3 days
Approved by: re (gjb)
Sponsored by: Mellanox Technologies
These messages are totally redundant with the iflib messages.
They're also not very useful, since they don't include the
interface name.
Discussed with: shurd
Approved by: re (rgrimes)
Sponsored by: Dell EMC Isilon
configuring kernels for i386, amd64, and arm64.
The 'GEOM_PART_GPT' option was added to the DEFAULTS configuration
in r337967.
Approved by: re (kib@)
Reviewed by: ler@
Differential Revision: https://reviews.freebsd.org/D17458
Sponsored by: Netflix, Inc.
locally generated SCTP packets sent over IPv4. This makes
the behaviour consistent with IPv6.
Reviewed by: ae@, bz@, jtl@
Approved by: re (kib@)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D17406
Discussing with Benjamin Herrenschmidt, OPAL_INT_GET_XIRR masks the
returned priority, so must be resumed before more interrupts can be
handled at this priority. Since there are only two priorities used in
FreeBSD, we know that the previous priority in an EOI will always be
0xff (lowest priority).
Reviewed by: nwhitehorn
Approved by: re(rgrimes)
Differential Revision: https://reviews.freebsd.org/D17361
Summary:
Discussing with Benjamin Herrenschmidt, MSIs, and edge-triggered
interrupts in general, must not be masked in XICS and XIVE, else
subsequent interrupts may be ignored.
Testing locally on my Talos II (single CPU, 18-core POWER9), NVMe now
works with MSI, improving read throughput by ~70% (900MB/s -> 1.67GB/s,
with 64MB block size) over INTx interrupts, and snd_hda(4) now will
actually play music with MSI. Previously, snd_hda(4) would not receive
interrupts, timing out, and declaring the channels dead.
This has also been tested by Kevin Bowling, and others, with great
success. Kevin reported NVMe unusable on his Talos II prior to this
patch.
Reviewed by: nwhitehorn, kbowling
Approved by: re(rgrimes)
Differential Revision: https://reviews.freebsd.org/D17356
It's not supposed to be legal for two jails to contain the same IP address,
unless both jails contain only that one address. This is the behavior
documented in jail(8), and is there to prevent confusion when multiple
jails are listening on INADDR_ANY.
VIMAGE jails (now the default for GENERIC kernels) test this correctly,
but non-VIMAGE jails have been performing an incomplete test when nested
jails are used.
Approved by: re@ (kib@)
MFC after: 5 days
When using a vlan with igb and the vlanhwcsum option, any mbufs which
already had the TCP, UDP, or SCTP checksum calculated and therefore don't
have the CSUM_[IP|IP6]_[TCP|UDP|SCTP] bits set in the csum_flags field would
have the L4 checksum corrupted by the hardware.
This was caused by the driver setting E1000_TXD_POPTS_TXSM any time a
checksum bit was set OR a vlan tag was present.
The patched driver only sets E1000_TXD_POPTS_TXSM when an offload is
requested.
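The difference between the old and new conditions can be sketched as
follows. The X_ constants are hypothetical placeholder values chosen for
illustration; the real flag definitions live in the driver and mbuf
headers and are not reproduced here.

```c
#include <stdint.h>

/* Hypothetical flag values, for illustration only. */
#define X_CSUM_TCP	0x01u
#define X_CSUM_UDP	0x02u
#define X_CSUM_SCTP	0x04u
#define X_CSUM_L4	(X_CSUM_TCP | X_CSUM_UDP | X_CSUM_SCTP)
#define X_POPTS_TXSM	0x02u

/* Old behaviour: request L4 checksum insertion for any VLAN-tagged frame. */
static uint32_t
popts_old(uint32_t csum_flags, int has_vlan)
{
	return (((csum_flags & X_CSUM_L4) != 0 || has_vlan) ?
	    X_POPTS_TXSM : 0);
}

/* New behaviour: request it only when an L4 offload was asked for. */
static uint32_t
popts_new(uint32_t csum_flags, int has_vlan)
{
	(void)has_vlan;		/* a VLAN tag alone no longer matters */
	return ((csum_flags & X_CSUM_L4) != 0 ? X_POPTS_TXSM : 0);
}
```

With no offload bits set and a VLAN tag present, popts_old() still asks
the hardware to insert a checksum, which is the corruption case described
above; popts_new() does not.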
PR: 231416
Reported by: pi
Approved by: re (gjb)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D17404
rep stos has a high startup time even on modern microarchitectures like
Skylake. Intel optimization manuals discuss how for small sizes it is
beneficial to go for streaming stores. Since those cannot be used without
extra penalty in the kernel I investigated performance impact of just
regular movs.
The patch below implements a very simple scheme: a 32-byte loop followed
by filling in the remainder of at most 31 bytes. It has a breaking point
at 256 bytes, at which it falls back to rep stos. It provides a significant
win over the current primitive on several machines I tested (both Intel and
AMD). A 64-byte loop did not provide any benefit, even for sizes that are
multiples of 64.
See the review for benchmark data.
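A rough C rendering of the scheme, for readers who do not want to read
the assembly. This is a sketch, not the committed routine: sketch_memset()
is a hypothetical name, libc memset() stands in for the rep stos path, and
the threshold is assumed to mean "256 bytes or more".

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SMALL_LIMIT	256	/* assumed meaning of the breaking point */

static void *
sketch_memset(void *dst, int c, size_t len)
{
	unsigned char *p = dst;
	/* Replicate the fill byte into all eight bytes of a 64-bit word. */
	uint64_t pattern = 0x0101010101010101ULL * (unsigned char)c;

	if (len >= SMALL_LIMIT)
		return (memset(dst, c, len));	/* stands in for rep stos */

	/* Main loop: four 8-byte stores per iteration, 32 bytes total. */
	while (len >= 32) {
		memcpy(p, &pattern, 8);
		memcpy(p + 8, &pattern, 8);
		memcpy(p + 16, &pattern, 8);
		memcpy(p + 24, &pattern, 8);
		p += 32;
		len -= 32;
	}
	/* Tail: at most 31 bytes remain. */
	while (len > 0) {
		*p++ = (unsigned char)c;
		len--;
	}
	return (dst);
}
```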
Reviewed by: kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17398
When getting the number of bytes to checksum make sure to convert the UDP
length to host byte order when the entire header is not in the first mbuf.
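As a reminder of why the conversion matters, here is a minimal userland
sketch. struct udp_hdr and udp_len_host() are hypothetical names used for
illustration; the kernel's own header type is not reproduced here.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Minimal UDP header layout; all fields are big-endian on the wire. */
struct udp_hdr {
	uint16_t sport;
	uint16_t dport;
	uint16_t ulen;
	uint16_t sum;
};

/* Return the UDP length in host byte order from raw header bytes. */
static uint16_t
udp_len_host(const unsigned char *wire)
{
	struct udp_hdr uh;

	memcpy(&uh, wire, sizeof(uh));	/* avoids unaligned access */
	return (ntohs(uh.ulen));	/* network -> host byte order */
}
```

On a little-endian machine, using uh.ulen without ntohs() would turn a
300-byte datagram (0x012c on the wire) into 0x2c01, so the wrong number
of bytes would be checksummed.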
Reviewed by: jtl, tuexen, ae
Approved by: re (gjb), jtl (mentor)
Differential Revision: https://reviews.freebsd.org/D17357
Refactor sample ring buffer ring handling to make it more robust to
long running callchain collection handling
r338112 introduced a (now fixed) regression that exposed a number of race
conditions within the management of the sample buffers. This
simplifies the handling and moves the decision to overwrite a
callchain sample that has taken too long out of the NMI into the
hardclock handler. With this change the problem no longer shows up as a
ring corruption but as the code spending all of its time in callchain
collection.
- Makes the producer / consumer index incrementing monotonic, making it
easier (for me at least) to reason about.
- Moves the decision to overwrite a sample from NMI context to interrupt
context where we can enforce serialization.
- Puts a time limit on waiting to collect a user callchain - putting a
bound on head-of-line blocking causing samples to be dropped
- Removes the flush routine which was previously needed to purge
dangling references to the pmc from the sample buffers but now is only
a source of a race condition on unload.
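The monotonic-index idea from the first bullet can be sketched as below.
This is a single-threaded illustration with hypothetical names (struct
ring, ring_put(), ring_get()); it is not the hwpmc code and has none of
the NMI or interrupt-context handling.

```c
#include <stdint.h>

#define RING_SIZE	8u		/* must be a power of two */
#define RING_MASK	(RING_SIZE - 1)

struct ring {
	uint64_t prod;	/* total items ever produced; only increments */
	uint64_t cons;	/* total items ever consumed; only increments */
	int slots[RING_SIZE];
};

/* Returns 1 on success, 0 if the ring is full. */
static int
ring_put(struct ring *r, int v)
{
	if (r->prod - r->cons == RING_SIZE)
		return (0);
	r->slots[r->prod & RING_MASK] = v;	/* mask only on access */
	r->prod++;
	return (1);
}

/* Returns 1 and stores the oldest item in *v, or 0 if the ring is empty. */
static int
ring_get(struct ring *r, int *v)
{
	if (r->prod == r->cons)
		return (0);
	*v = r->slots[r->cons & RING_MASK];
	r->cons++;
	return (1);
}
```

Since the indices never wrap back, prod - cons is always the number of
queued items, and full versus empty needs no extra flag.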
Previously one could lock up or crash HEAD by running:
pmcstat -S inst_retired.any_p -T and then hitting ^C
After this change it is no longer possible.
PR: 231793
Reviewed by: markj@
Approved by: re (gjb@)
Differential Revision: https://reviews.freebsd.org/D17011
Change swap_reserve and swap_total to be in units of pages so that
swap reservations can be done using only atomics instead of using a single
global mutex for swap_reserve and a single mutex for all processes running
under the same uid for uid accounting.
Results in mmap speed up and a 70% increase in brk calls / second.
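A lock-free reservation counter in page units might look like the
following C11 sketch. The names and the add-then-roll-back strategy are
assumptions made for illustration; the kernel code may be structured
differently.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical global: number of pages currently reserved. */
static _Atomic uint64_t reserved_pages;

/* Try to reserve npages; fail without side effects if over the limit. */
static bool
reserve_pages(uint64_t npages, uint64_t limit_pages)
{
	uint64_t prev = atomic_fetch_add(&reserved_pages, npages);

	if (prev + npages > limit_pages) {
		/* Over the limit: undo the optimistic add. */
		atomic_fetch_sub(&reserved_pages, npages);
		return (false);
	}
	return (true);
}

static void
release_pages(uint64_t npages)
{
	atomic_fetch_sub(&reserved_pages, npages);
}
```

Keeping the counter in pages means it fits in a single word that can be
updated with one atomic operation, so no mutex is needed on this path.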
Reviewed by: alc@, markj@, kib@
Approved by: re (delphij@)
Differential Revision: https://reviews.freebsd.org/D16273
With the new route cache feature udp_notify() will modify the inp when it
needs to invalidate the route cache. Ensure that we hold a write lock on
the inp before calling the function to ensure that multiple threads don't
race while trying to invalidate the cache (which previously led to a page
fault).
Differential Revision: https://reviews.freebsd.org/D17246
Reviewed by: sbruno, bz, karels
Sponsored by: Dell EMC Isilon
Approved by: re (gjb)
This change is a no-op in terms of semantics, but has a side effect
of removing a perfectly useless nop sled for CPUs with ERMS.
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
ioctl(2) commands only have meaning in the context of a file descriptor
so translating them in the syscall layer is incorrect.
The new handler uses an accessor to retrieve/construct a pointer from
the last member of the passed structure and relies on type punning to
access the other member which requires no translation.
Reviewed by: kib
Approved by: re (rgrimes, gjb)
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Review: https://reviews.freebsd.org/D17388
shutdown() to wakeup another thread blocked on a stream listen socket.
This code is failing, while it used to work on FreeBSD 10 and still
works on Linux.
It seems reasonable to add another exception to support something users are
actually doing, which used to work on FreeBSD 10, and still works on Linux.
And, it seems like it should be acceptable to POSIX, as we still return
ENOTCONN.
This patch is different to what had been committed to stable/11, since
code around listening sockets is different. Patch in D15019 is written
by jtl@, slightly modified by me.
PR: 227259
Obtained from: jtl
Approved by: re (kib)
Differential Revision: D15019
This caused microcode to be updated only on the BSP if hyperthreading
was disabled, typically resulting in a hang or reset.
Approved by: re (kib)
Sponsored by: The FreeBSD Foundation
ioctl(2) commands only have meaning in the context of a file descriptor
so translating them in the syscall layer is incorrect.
The new handler uses an accessor to retrieve/construct a pointer from
the last member of the passed structure and relies on type punning to
access the other members which require no translation.
Reviewed by: kib (prior version), jhb
Approved by: re (rgrimes)
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Review: https://reviews.freebsd.org/D17378
using an application trying to use a v4mapped destination address on a
kernel without INET support or on a v6only socket.
Catch this case and prevent the packet from going anywhere;
else, without the KASSERT() armed, a v4mapped destination
address might go out on the wire or other undefined behaviour
might happen, while with the KASSERT() we panic.
PR: 231728
Reported by: Jeremy Faulkner (gldisater gmail.com)
Approved by: re (kib)
system-call entry and whenever audit arguments or return values are
captured:
1. Expose a single global, audit_syscalls_enabled, which controls
whether the audit framework is entered, rather than exposing
components of the policy -- e.g., if the trail is enabled,
suspended, etc.
2. Introduce a new function audit_syscalls_enabled_update(), which is
called to update audit_syscalls_enabled whenever an aspect of the
policy changes, so that the value can be updated.
3. Remove a check of trail enablement/suspension from audit_new() --
at the point where this function has been entered, we believe that
system-call auditing is already in force, or we wouldn't get here,
so simply proceed to more expensive policy checks.
4. Use an audit-provided global, audit_dtrace_enabled, rather than a
dtaudit-provided global, to provide policy indicating whether
dtaudit would like system calls to be audited.
5. Do some minor cosmetic renaming to clarify what various variables
are for.
These changes collectively arrange it so that traditional audit
(trail, pipes) or the DTrace audit provider can enable system-call
probes without the other configured. Otherwise, dtaudit cannot
capture system-call data without auditd(8) started.
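Points 1 and 2 amount to caching a policy decision in one flag. A sketch
under assumptions: every variable below is a hypothetical stand-in, and
the exact policy expression is a guess based on the description above,
not the audit framework's real logic.

```c
#include <stdbool.h>

/* Hypothetical policy inputs. */
static bool trail_enabled;	/* audit trail configured */
static bool trail_suspended;	/* trail temporarily suspended */
static bool pipes_open;		/* at least one audit pipe open */
static bool dtrace_enabled;	/* DTrace audit provider wants records */

/* The single flag that the system-call entry path would test. */
static bool syscalls_enabled;

/* Recompute the flag; to be called whenever any policy input changes. */
static void
syscalls_enabled_update(void)
{
	syscalls_enabled = (trail_enabled && !trail_suspended) ||
	    pipes_open || dtrace_enabled;
}
```

The hot path then reads one boolean instead of re-evaluating each policy
component, and either consumer can enable auditing without the other.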
Reviewed by: gnn
Sponsored by: DARPA, AFRL
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17348
In the probe case for SCSI SMR Host Aware or Host Managed drives, be sure
to free allocated memory.
sys/cam/scsi/scsi_da.c:
In dadone_probezone(), free the data pointer before returning.
MFC after: 3 days
Sponsored by: Spectra Logic
Approved by: re (kib)
Otherwise (iter % ds->ds_cnt) is not guaranteed to lie in the range
[0, MAXMEMDOM).
Reported by: pho
Reviewed by: kib
Approved by: re (rgrimes)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17374
Tested with ifunc resolvers in the kernel and module with calls from
kernel to kernel, module to kernel, and module to module.
Reviewed by: kib (previous version)
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17370
Belatedly add a comment to the amd64 pmap explaining why we initialize
the kernel pmap's resident page count.
Reviewed by: alc, kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17377
Such data may later be unmapped. This occurs, for example, when a
loader-provided microcode update file is discarded.
Reviewed by: alc, kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17340
The initial raise in r336519 wasn't enough for large resolutions
(1920 x 1200 for example). Raise it again.
Reported by: bob prohaska <fbsd@www.zefox.net>
Tested by: bob prohaska <fbsd@www.zefox.net>
Approved by: re (gjb@)
The AMD Threadripper 2990WX is basically a slightly crippled Epyc.
Rather than having 4 memory controllers, one per NUMA domain, it has
only 2 memory controllers enabled. This means that only 2 of the
4 NUMA domains can be populated with physical memory, and the
others are empty.
Add support to FreeBSD for empty NUMA domains by:
- creating empty memory domains when parsing the SRAT table,
rather than failing to parse the table
- not running the pageout daemon threads in empty domains
- adding defensive code to UMA to avoid allocating from empty domains
- adding defensive code to cpuset to avoid binding to an empty domain
Thanks to Jeff for suggesting this strategy.
Reviewed by: alc, markj
Approved by: re (gjb@)
Differential Revision: https://reviews.freebsd.org/D1683
This removes two assignments for the flags field being done
twice and adds one, which was missing.
Thanks to Felix Weinrank for reporting the issue he found
by using fuzz testing of the userland stack.
Approved by: re (kib@)
MFC after: 1 week
INP_INFO_UNLOCK_ASSERT() in TCP-related code. For encapsulated traffic
it is possible that the code is running in a net_epoch_preempt section,
and INP_INFO_UNLOCK_ASSERT() is a very strict assertion for such a case.
PR: 231428
Reviewed by: mmacy, tuexen
Approved by: re (kib)
Differential Revision: https://reviews.freebsd.org/D17335
arguments wrong in r339020.
PR: 231625
Reported by: Yuri Pankov (yuripv yuripv.net)
Reviewed by: cem, Yuri Pankov (yuripv yuripv.net)
Approved by: re (kib)
Pointyhat to: bz (a rather big one for this one)
sctp_process_cmsgs_for_init() and sctp_findassociation_cmsgs()
similar to sctp_find_cmsg() to improve consistency and avoid
the signed/unsigned issues in sctp_process_cmsgs_for_init()
and sctp_findassociation_cmsgs().
Thanks to andrew@ for reporting the problem he found using
syzkaller.
Approved by: re (kib@)
MFC after: 1 week
sending UDP encapsulated SCTP packets.
This is consistent with the behaviour that when such packets are received,
the corresponding UDP stats counter (udps_ipackets) is incremented.
Thanks to Peter Lei for making me aware of this inconsistency.
Approved by: re (kib@)
MFC after: 1 week
is done via using ifconfig, which uses a SIOCSIFMTU ioctl() command, or
doing it using a TUNSIFINFO/TAPSIFINFO ioctl() command.
Without this patch, for IPv6 the new MTU is not used when creating routes.
Especially, when initiating TCP connections after increasing the MTU,
the old MTU is still used to compute the MSS.
Thanks to ae@ and bz@ for helping to improve the patch.
Reviewed by: ae@, bz@
Approved by: re (kib@)
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D17180
This is mostly a cosmetic change except that obsolete system calls are
assigned meaningful names in the names arrays which means that using
tools like kdump or truss against binaries invoking these system calls
will print out the name instead of the number. The script I use to
generate the XML list of syscalls for GDB also ignores UNIMPL but not
OBSOL entries. In general UNIMPL should only be used to reserve
placeholders for system calls that have never been implemented while
system calls that existed at one time in FreeBSD but were removed
should be marked OBSOL instead.
Reviewed by: brooks, kib, imp
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17344
after the file mapping was wired.
If a wired map entry is backed by a vnode and the file is truncated,
the corresponding pages are invalidated. vm_fault_copy_entry() should be
aware of this and allow for invalid pages past the end of the file. Also,
such pages should not be mapped into userspace. If userspace later
accesses the truncated part of the mapping, it gets a signal; there is no
way the kernel can prevent the page fault.
Reported by: andrew using syzkaller
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D17323
if the dst_object is not of swap type.
It can only happen when entry does not require copy, otherwise
vm_map_protect() already adds the charge. So the assert was right for
the case where swap object was allocated in the vm_fault_copy_entry(),
but not when it was just copied from src_entry and its type is not
swap.
Reported by: andrew using syzkaller
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D17323
allocated dst_object in a single place.
Suggested and reviewed by: alc
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D17323
For PCID case, there is a dependency between pm_gen zeroing and
reading pm_active for IPI target selection, to ensure that the
invalidation is not missed.
Reported and tested by: mjg
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
As with r338962 also export the instruction set attribute register. This
will allow userland to identify optional instructions the hardware
supports, for example in a future ifunc handler to decide which
implementation of a function to return.
Approved by: re (kib)
The pre-7.x compat for both native and 32-bit code was already in
pci_user.c. Use this infrastructure to implement 32-bit support.
This is more correct as ioctl(2) commands only have meaning in the
context of a file descriptor.
Reviewed by: kib
Approved by: re (gjb)
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential revision: https://reviews.freebsd.org/D17324
The function stopped swapping rdi and rsi, but the error handling
code was not updated with the new register name.
Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation
This reverts part of r333368. The attempt to clear DR6 was occurring
too soon as trapsignal() does not pause to let the debugger notice the
SIGTRAP and query DR6. The signal exchange does not occur until much
later during ast(). As a result, GDB was no longer recognizing
hardware breakpoints and watchpoints on x86.
In addition, any userland programs that want to inspect DR6 in a
SIGTRAP handler don't have a way to do this if we clear DR6 in the
exception handler.
Instead of relying on the kernel to clear DR6, debuggers will have to
explicitly clear it after a trace trap (which they needed to do on
older kernels anyway).
Reviewed by: kib
Approved by: re (delphij)
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D17319
once we have a lock, make sure the inp is not marked freed.
This can happen since the list traversal and locking was
converted to epoch(9). If the inp is marked "freed", skip it.
This prevents a NULL pointer deref panic later on.
Reported by: slavash (Mellanox)
Tested by: slavash (Mellanox)
Reviewed by: markj (no formal review but caught my unlock mistake)
Approved by: re (kib)
- remove a forward branch in the common case
- replace xchg + lodsb/stosb loop with simple movs
A simple test on Intel(R) Core(TM) i7-4600U CPU @ 2.10GHz copying
/foo/bar/baz in a loop goes from 295715863 ops/s to 465807408.
Further changes are pending.
Reviewed by: kib
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17281
- move the PSL.AC comment to the fault handler
- stop testing for zero-sized ops. after several minutes of package
building there were no copyin calls with zero bytes and very few
copyout. the semantic of returning 0 in this case is preserved
- shorten exit paths by clearing %eax earlier
- replace xchg with 3 movs. this is what compilers do. a naive
benchmark on EPYC suggests about 1% increase in throughput thanks to
this change.
- remove the useless movb %cl,%al from copyout. it looks like a
leftover from many years ago
Reviewed by: kib
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17286
Both the in-kernel C variant and libc asm variant have very poor performance.
The former compiles to a single byte comparison loop, which breaks down even
for small sizes. The latter uses rep cmpsq/b which turn out to have very poor
throughput and are slower than a hand-coded 32-byte comparison loop.
Depending on size this is about 3-4 times faster than the current routines.
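The shape of a 32-byte comparison loop can be shown in C. This is only a
sketch with a hypothetical name, sketch_memcmp(), and it is assumed to
have memcmp-style return semantics; the committed routine is hand-written
assembly and will differ in detail.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static int
sketch_memcmp(const void *a, const void *b, size_t len)
{
	const unsigned char *p = a, *q = b;

	/* Compare 32 bytes per iteration as four 64-bit words. */
	while (len >= 32) {
		uint64_t x[4], y[4];

		memcpy(x, p, 32);
		memcpy(y, q, 32);
		if (((x[0] ^ y[0]) | (x[1] ^ y[1]) |
		    (x[2] ^ y[2]) | (x[3] ^ y[3])) != 0)
			break;	/* mismatch is somewhere in this block */
		p += 32;
		q += 32;
		len -= 32;
	}
	/* Locate the first differing byte, or finish the tail. */
	for (; len > 0; p++, q++, len--) {
		if (*p != *q)
			return ((int)*p - (int)*q);
	}
	return (0);
}
```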
Reviewed by: kib
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17328
Create a user view of the ID_AA64PFR0_EL1 register with values common
across all CPUs.
Approved by: re (kib)
Sponsored by: ABT Systems Ltd
Differential Revision: https://reviews.freebsd.org/D17301
Building the kernel in Git repositories when git-svn is not available and
the "help.autocorrect" Git parameter is enabled results in Git trying to
replace the "svn" command (it does not know) with "serve". As a result the
output of the "git serve" command is appended to the value of the
environmental variable VERINFO, which causes the auto generated vers.c
file to contain invalid C syntax (missing newline escapes):
#define "@(#)FreeBSD 12.0-ALPHA7 r000eversion 2
0015agent=git/2.19.0
000cls-refs
0012fetch=shallow
0012server-option
0000=5e2272613fa(splash-vt)"
#define VERSTR "FreeBSD 12.0-ALPHA7 r000eversion 2
0015agent=git/2.19.0
000cls-refs
0012fetch=shallow
0012server-option
0000=5e2272613fa(splash-vt)\n"
Using `-c help.autocorrect=0` seems to be a good solution as it does not
modify user's environment. I am not sure, however, if we should use
programs (or Git commands), which we are not sure exist (we never check if
git-svn is available on the host), as there may be more unexpected
behaviors like this one.
Reviewed by: eadler, emaste, krion
Approved by: re (gjb), krion (mentor)
Sponsored by: Bally Wulff Games & Entertainment GmbH
Differential Revision: https://reviews.freebsd.org/D17271
undefined instruction exception. Previously we would exit the guest,
however an unprivileged user could execute these.
Found with: syzkaller
Reviewed by: araujo, tychon (previous version)
Approved by: re (kib)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D17192
As part of ZFS Crypto, I started getting a series of panics when I did not
have AESNI loaded. Adding locking fixed it, and I concluded that the
Reinit function altered the AES key schedule. This locking is not as
fine-grained as it could be (AESNI uses per-cpu locking), but
it's minimally invasive.
Sponsored by: iXsystems Inc
Reviewed by: cem, mav
Approved by: re (gjb), mav (mentor)
Differential Revision: https://reviews.freebsd.org/D17307
Remove unused and easy to misuse PNP macro parameter
Inspired by r338025, just remove the element size parameter to the
MODULE_PNP_INFO macro entirely. The 'table' parameter is now required to
have correct pointer (or array) type. Since all invocations of the macro
already had this property and the emitted PNP data continues to include the
element size, there is no functional change.
Mostly done with the coccinelle 'spatch' tool:
$ cat modpnpsize0.cocci
@normaltables@
identifier b,c;
expression a,d,e;
declarer MODULE_PNP_INFO;
@@
MODULE_PNP_INFO(a,b,c,d,
-sizeof(d[0]),
e);
@singletons@
identifier b,c,d;
expression a;
declarer MODULE_PNP_INFO;
@@
MODULE_PNP_INFO(a,b,c,&d,
-sizeof(d),
1);
$ rg -l MODULE_PNP_INFO -- sys | \
xargs spatch --in-place --sp-file modpnpsize0.cocci
(Note that coccinelle invokes diff(1) via a PATH search and expects diff to
tolerate the -B flag, which BSD diff does not. So I had to link gdiff into
PATH as diff to use spatch.)
Tinderbox'd (-DMAKE_JUST_KERNELS).
Approved by: re (glen)
Do not call crypto_newsession() while holding xforms_lock mutex.
Release mutex before invoking crypto_newsession(), and use
ipsec_kmod_enter()/ipsec_kmod_exit() functions to protect against
accessing unloaded kernel module memory.
Move xform-related functions into subr_ipsec.c to be able to use
ipsec_kmod_* functions. Also unconditionally build ipsec_kmod_*
functions, since now they are always used by IPSec code.
Add xf_cntr field to struct xformsw, it is used by ipsec_kmod_*
functions. Also constify xf_name field, since it is not expected to be
modified.
Approved by: re (kib)
Differential Revision: https://reviews.freebsd.org/D17302
From PCI Spec rev 2.2, 6.2.1. Device Identification:
Vendor ID This field identifies the manufacturer of the device. Valid
vendor identifiers are allocated by the PCI SIG to ensure uniqueness.
0FFFFh is an invalid value for Vendor ID.
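In code, the rule quoted above reduces to a single comparison. The names
VENDOR_ID_INVALID and vendor_id_valid() are hypothetical and used here
only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* 0FFFFh is the invalid Vendor ID value named in the PCI spec excerpt. */
#define VENDOR_ID_INVALID	0xFFFFu

/* A device identification is usable only if the vendor ID is not 0xFFFF. */
static bool
vendor_id_valid(uint16_t vendor)
{
	return (vendor != VENDOR_ID_INVALID);
}
```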
MFC after: 3 days
Approved by: re (Glen), hselasky (mentor), kib (mentor)
Sponsored by: Mellanox Technologies
dmaplimit is the first byte after the end of DMAP.
Reported by: "Johnson, Archna" <Archna.Johnson@netapp.com>
Reviewed by: alc, markj
Approved by: re (gjb)
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D17318
For pmap_invalidate_all_pcid(), only reset pm_gen for non-kernel
pmaps, as it was done before the conversion to ifuncs. The reset is
useless but innocent for kernel_pmap. Coverity reported that cpuid is
used uninitialized in this case.
Reported by: cem
Reviewed by: alc, cem, markj
CID: 1395807
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
Differential revision: https://reviews.freebsd.org/D17314
Currently vfs calls the root method on each absolute lookup and when
crossing mount points.
zfs_root ends up looking up the inode internally as if it was not
instantiated which results in significant lock contention on systems
like EPYC.
Store the vnode in the mount point and protect the access with rmlocks.
This is a temporary hack for 12.0.
Sample result:
before:
make -s -j 128 buildkernel 2778.09s user 3319.45s system 8370% cpu 1:12.85 total
after:
make -s -j 128 buildkernel 3199.57s user 1772.78s system 8232% cpu 1:00.40 total
Tested by: pho (zfs mount/unmount tests)
Reviewed by: kib, mav, sef (different parts)
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17233
- Switch to using 32b port/link capabilities in the driver. The 32b
format is used internally by firmwares > 1.16.45.0 and the driver will
now interact with the firmware in its native format, whether it's 16b
or 32b. Note that the 16b format doesn't have room for 50G, 200G, or
400G speeds.
- Add a bit in the pause_settings knobs to allow negotiated PAUSE
settings to override manual settings.
- Ensure that manual link settings persist across an administrative
down/up as well as transceiver unplug/replug.
- Remove unused is_*G_port() functions.
Approved by: re@ (gjb@)
MFC after: 1 month
Sponsored by: Chelsio Communications
The PHB4 host bridge used by the POWER9 uses a 64kB range in 32-bit
space at the address 0xffff0000-0xffffffff. Reserve this range so that
DMA memory cannot be allocated within this range. This fixes seemingly
random crashes on a POWER9 system. Ideally this range will have been
reserved by the firmware, but as of now this is not the case.
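A check for whether a candidate DMA region touches the reserved window
could be sketched as follows. The macro and function names are
hypothetical; only the address range comes from the text above.

```c
#include <stdbool.h>
#include <stdint.h>

#define PHB4_RSVD_START	0xffff0000ULL
#define PHB4_RSVD_END	0xffffffffULL	/* inclusive last byte */

/* True if [addr, addr + size) intersects the reserved 64kB window. */
static bool
overlaps_reserved(uint64_t addr, uint64_t size)
{
	uint64_t last;

	if (size == 0)
		return (false);
	last = addr + size - 1;		/* inclusive end of the region */
	return (addr <= PHB4_RSVD_END && last >= PHB4_RSVD_START);
}
```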
Submitted by: git_bdragon.rtk0.net
Reviewed by: nwhitehorn
Approved by: re(kib)
Differential Revision: https://reviews.freebsd.org/D17183
Use these predicates instead of inline references to vm_min_domains.
Also add a global all_domains set, akin to all_cpus.
Reviewed by: alc, jeff, kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17278
In 11.x and earlier these were accessible as direct members of 'struct
kinfo_file'. Existing code already knows about the new location of
these members as well, so wrapper macros did not work for these
fields. Instead, define an anonymous struct containing the fields
from 'struct kinfo_file' in FreeBSD 11 that were not part of the
'kf_un' union. This anonymous struct is then placed in an anonymous
union along with the new 'kf_un' union. This preserves the API of
both structure layouts without requiring any wrapper macros.
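The layout trick described above can be sketched with C11 anonymous members. All type and field names below (demo_file, demo_un, old_sock_domain and so on) are hypothetical stand-ins, not the real 'struct kinfo_file' members; the point is only that the old flat name and the new nested name refer to the same storage.

```c
#include <stddef.h>

/* Hypothetical stand-in for the new-style union of per-type data. */
union demo_un {
	struct {
		int	sock_domain;
	} sock;
	struct {
		int	file_type;
	} file;
};

/*
 * Hypothetical record: an anonymous struct holding the old flat field
 * shares an anonymous union with the new 'un' member, so both
 * spellings compile and alias the same bytes.
 */
struct demo_file {
	int	type;
	union {
		struct {			/* old layout: direct member */
			int	old_sock_domain;
		};
		union demo_un	un;		/* new layout */
	};
};

/* Write through the new name, read back through the old one. */
static int
demo_roundtrip(int domain)
{
	struct demo_file f = { .type = 1 };

	f.un.sock.sock_domain = domain;
	return (f.old_sock_domain);
}
```

Code written against either layout keeps compiling, and no wrapper macro is involved.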
PR: 231525
Reviewed by: kib
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17262
This invokes "fence" on the hart performing the write followed by an IPI
to execute "fence.i" on all harts.
This is required to support userland debuggers setting breakpoints in
user processes.
Reviewed by: br (earlier version), markj
Approved by: re (gjb)
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D17139
redundant, because uma_zone_reserve_kva() is performed on both zones and it
sets this same flag on the zone. (Moreover, the implementation of the swap
pager does not itself require these zones to be UMA_ZONE_NOFREE.)
Reviewed by: kib, markj
Approved by: re (gjb)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D17296
Currently stats are collected in a MAXCPU-sized array which is not
aligned and suffers from enormous false sharing. Fix the problem by
utilizing per-cpu allocation.
The counter(9) API is not used here as it is too incomplete and does
not provide a win over per-cpu zone sized for malloc stats struct. In
particular stats are being reported for each cpu separately by just
copying what is supposed to be an array element for given cpu.
This eliminates significant false-sharing during malloc-heavy tests
e.g. on Skylake. See the review for details.
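The false-sharing problem can be illustrated in userland C. The commit uses a per-cpu allocation in the kernel; the sketch below swaps in a simpler stand-in, a statically sized array whose elements are each aligned to an assumed 64-byte cache line, so that counters for different CPUs never share a line. NCPU, CACHE_LINE, and all function names are assumptions for illustration.

```c
#include <stdint.h>

#define CACHE_LINE	64	/* assumed cache line size */
#define NCPU		4	/* hypothetical CPU count */

/* One stats slot per CPU; the alignment pads each slot to a full line. */
struct cpu_stats {
	_Alignas(CACHE_LINE) uint64_t allocs;
	uint64_t frees;
};

static struct cpu_stats stats[NCPU];

/* Each CPU only ever touches its own slot. */
static void
record_alloc(int cpu)
{
	stats[cpu].allocs++;
}

/* Readers sum the slots when a total is wanted. */
static uint64_t
total_allocs(void)
{
	uint64_t sum = 0;

	for (int i = 0; i < NCPU; i++)
		sum += stats[i].allocs;
	return (sum);
}
```

Reporting per-CPU values, as the text describes, amounts to copying out one slot at a time instead of summing.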
Reviewed by: markj
Approved by: re (kib)
Differential Revision: https://reviews.freebsd.org/D17289
syncache_respond(). There is no functional change. The
parameter became unused in r313330, but wasn't removed.
Approved by: re (kib@)
MFC after: 1 month
Sponsored by: Netflix, Inc.
Split calculation of mask for shootdown IPI and local
invalidation. Reorder IPI before local.
Suggested by: alc
Reviewed by: alc, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
Approved by: re (rgrimes)
Differential revision: https://reviews.freebsd.org/D17277
NFS is the only in-tree filesystem using the feature, but all ops test
for it.
Currently the resulting sigdefer calls have to be jumped over in the
common case.
This is a bandaid, longer term fix will move this feature away.
Approved by: re (kib)
illumos/illumos-gate@82f63c3c2b
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Author: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: re (delphij)
- _fault handlers for both primitives are identical, provide just one
- change the copying scheme to match memcpy (in particular jump
avoidance for the most common case of a multiple of 8)
- stop re-reading pcb address on exit, just store it locally (in r9)
Reviewed by: kib
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17265
In non-reproducible mode we have the kernel ident as a side effect of
including the build directory. Explicitly add it to the ident string in
reproducible mode.
Reported by: mjg
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
If the size is 15 bytes or less avoid spinning up rep just to copy the 8
bytes. In my tests on EPYC and old Intel microarchs without ERMS (like
Westmere) it provided a nice win over the current version (e.g. for EPYC
memset with 15 bytes of size goes from 59712651 ops/s to 70600095) all
while almost not pessimizing the other cases.
Data collected during package building shows that < 16 sizes are pretty
common.
Verified with the glibc test suite.
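The commit itself changes assembly. The following is only a C sketch of one common way to handle short fills without a rep-prefixed loop, using two possibly overlapping fixed-size stores. The function name small_memset and the exact size classes are assumptions for illustration and are not taken from the commit.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Fill len bytes (len < 16) with byte c.  For each size class, one
 * store covers the head and a second, possibly overlapping, store
 * covers the tail, so no loop is needed.
 */
static void
small_memset(unsigned char *dst, int c, size_t len)
{
	/* Replicate the byte into all 8 lanes of a 64-bit pattern. */
	uint64_t pat = UINT64_C(0x0101010101010101) * (unsigned char)c;

	if (len >= 8) {
		memcpy(dst, &pat, 8);
		memcpy(dst + len - 8, &pat, 8);
	} else if (len >= 4) {
		memcpy(dst, &pat, 4);
		memcpy(dst + len - 4, &pat, 4);
	} else if (len >= 2) {
		memcpy(dst, &pat, 2);
		memcpy(dst + len - 2, &pat, 2);
	} else if (len == 1) {
		dst[0] = (unsigned char)c;
	}
}
```

Compilers typically lower the fixed-size memcpy calls to plain stores, which is what makes the short path cheap.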
Approved by: re (kib)
Fix a fat-fingered typo with a "funny" side-effect: when doing copyin on a
cpu without ERMS and with size being a multiple of 8 a page fault would be
triggered resulting in EFAULT.
Pointy hat: mjg
Approved by: re (implicit)
It seems igb supports TSO6, but the capability got lost in
the iflib update. Restore this capability.
PR: 231476
Reported by: lev
Reviewed by: erj
Approved by: re (gjb)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D17242
It is currently unused and reserved for future use to keep KBI/KPI.
Also add several spare pointers to be able to extend the structure if
needed.
Approved by: re (gjb)
Various capabilities were not being handled correctly in the
SIOCSIFCAP handler. Specifically:
- IFCAP_RXCSUM and IFCAP_RXCSUM_IPV6 could be set even if not supported.
- It was impossible to disable IFCAP_RXCSUM and/or IFCAP_RXCSUM_IPV6 via
ifconfig since it does ioctl() per command-line flag rather than combine
them into a single call.
- IFCAP_VLAN_HWCSUM could not be modified via the ioctl().
- Setting any combination of the three IFCAP_WOL flags would set only
IFCAP_WOL_MCAST | IFCAP_WOL_MAGIC. For example, setting only
IFCAP_WOL_UCAST would result in both IFCAP_WOL_MCAST and IFCAP_WOL_MAGIC
being enabled, but IFCAP_WOL_UCAST would not be enabled.
- Because if_vlancap() was called before if_togglecapenable(), vlan flags
were sometimes not applied correctly.
- Interfaces were being unnecessarily stopped and restarted for WoL.
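The capability handling problems listed above come down to bit arithmetic on the requested, enabled, and supported masks. The following is a minimal sketch of one way to toggle exactly the requested bits while ignoring unsupported ones. The CAP_* values and the apply_caps helper are hypothetical; the real IFCAP_* constants live in <net/if.h> and the driver's actual handler is not reproduced here.

```c
#include <stdint.h>

/* Hypothetical bit values standing in for the IFCAP_WOL_* flags. */
#define CAP_WOL_UCAST	0x1u
#define CAP_WOL_MCAST	0x2u
#define CAP_WOL_MAGIC	0x4u
#define CAP_WOL		(CAP_WOL_UCAST | CAP_WOL_MCAST | CAP_WOL_MAGIC)

/*
 * Return the new enabled mask: flip only the bits that differ between
 * the request and the current state, and only if they are supported.
 */
static uint32_t
apply_caps(uint32_t supported, uint32_t enabled, uint32_t requested)
{
	uint32_t changed = (requested ^ enabled) & supported;

	return (enabled ^ changed);
}
```

With this shape, requesting only the unicast flag enables only that flag, and a request for an unsupported flag leaves the enabled mask unchanged.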
PR: 231151
Submitted by: Kaho Toshikazu <kaho@elam.kais.kyoto-u.ac.jp>
Reported by: Shirkdog <mshirk@daemon-security.com>
Reviewed by: galladin
Approved by: re (gjb)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D17158
The old code appears to assume that vmem_alloc() would import
size-aligned KVA chunks from the parent kernel_arena, but vmem doesn't
provide this guarantee.
Also remove the unused global RWX arena and add comments explaining why
we have per-domain arenas.
Reported by: alc
Reviewed by: alc, kib (previous version)
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17249
not freed. This can happen since the list traversal and locking
was converted to epoch(9). If the inp is marked "freed", skip it.
This prevents a NULL pointer deref panic in ip6_savecontrol_v4()
trying to access the socket hanging off the inp, which was gone
by the time we got there.
Reported by: andrew
Tested by: andrew
Approved by: re (gjb)
Ensure that pages backing the same virtual large page come from the
same physical domain, as kmem_malloc_domain() does.
PR: 231038
Reviewed by: alc, kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17248
A lot of functions have the following check:
cmpq %rax,%rdi /* verify address is valid */
ja fusufault
The label is present earlier in kernel .text, which means this is a jump
backwards. Absent any information in branch predictor, the cpu predicts it
as taken. Since it is almost never taken in practice, this results in a
completely avoidable misprediction.
Move it past all consumers, so that it is predicted as not taken.
Approved by: re (kib)
- Explicitly load an empty initial state into FP registers when taking
the fault on the first FP instruction in a thread. Setting
SSTATUS.FS to INITIAL is just a marker to let context switch restore
code know that it can load FP registers with zeroes instead of
memory loads. It does not imply that the hardware will reset all
registers to zero on first access. In addition, set the state to
CLEAN instead of INITIAL after the first FP instruction.
cpu_switch() doesn't do anything for INITIAL and only restores from
the pcb if the state is CLEAN. We could perhaps change cpu_switch
to call fpe_state_clear if the state was INITIAL and leave SSTATUS.FS
set to INITIAL instead of CLEAN after the first FP instruction.
However, adding this complexity to cpu_switch() doesn't seem worth
the supposed gain.
- Only save the current FPU registers in fill_fpregs() if the request
is made to save the current thread's registers. Previously if a
debugger requested FP registers via ptrace() it was getting a copy
of the debugger's FP registers rather than the debuggee's.
- Zero the entire FP register set structure returned for ptrace() if a
thread hasn't used FP registers rather than leaking garbage in the
fp_fcsr field.
- If a debugger writes FP registers via ptrace(), always mark the pcb
as having valid FP registers and set SSTATUS.FS_MASK to CLEAN so
that the registers will be restored when the debugged thread
resumes.
- Be more explicit about clearing the SSTATUS.FS field before setting
it to CLEAN on the first FP instruction trap.
Submitted by: br, markj
Approved by: re (rgrimes)
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D17141
Zero the entire FP register set structure returned for ptrace() if a
thread hasn't used FP registers rather than leaking garbage in the
fp_sr and fp_cr fields.
Reviewed by: emaste, andrew
Approved by: re (rgrimes)
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D17140
This simplifies the runtime logic and reduces the number of
runtime-constant branches.
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
Differential revision: https://reviews.freebsd.org/D16736
This keeps the initialization coupled together with the kmem_* KPI
implementation, which is the main user of these arenas.
No functional change intended.
Reviewed by: alc
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17247
route cache updates.
Bring over locking changes applied to udp_output() for the route cache
in r297225 and fixed in r306559 which achieve multiple things:
(1) acquire an exclusive inp lock earlier depending on the expected
conditions; we add a comment explaining this in udp6,
(2) having acquired the exclusive lock earlier eliminates a slight
possible chance for a race condition which was present in v4 for
multiple years as well and is now gone, and
(3) only pass the inp_route6 to ip6_output() if we are holding an
exclusive inp lock, so that possible route cache updates in case
of routing table generation number changes can happen safely.
In addition this change (as the legacy IP counterpart) decomposes the
tracking of inp and pcbinfo lock and adds extra assertions, that the
two together are acquired correctly.
PR: 230950
Reviewed by: karels, markj
Approved by: re (gjb)
Pointyhat to: bz (for completely missing this bit)
Differential Revision: https://reviews.freebsd.org/D17230
Since ifunc-capable linker is now required on i386, bring this code in
line with the amd64 counterpart.
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
Differential revision: https://reviews.freebsd.org/D16736
The current cache logic checks the total number of stacks in the kernel,
which even on small boxes significantly exceeds the 128 limit (e.g. an
8-way box with zfs has almost 800 stacks allocated).
Stacks are cached earlier for each main thread.
As a result the code is rarely executed, but when it is then (on boxes like
the above) it always fails. Since there are no provisions made for NUMA and
release time is approaching, just do a quick check to avoid acquiring the
lock.
Approved by: re (kib)
pm_pcid is unsigned.
Reviewed by: cem, markj
CID: 1395727
Noted by: cem
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
MFC after: 3 days
Differential revision: https://reviews.freebsd.org/D17235
UFS quotaoff iterates over all mp vnodes, and dereferences and clears
the pointers to corresponding dquots. If SU work items transiently
reference some of the dquots, quotaoff() would eventually fail, but all
processed vnodes are already stripped from dquots. The state is
problematic, since quotas are left enabled, but there are no dquots
where blocks and inodes can be accounted. The result is assertion
failures and NULL pointer dereferences.
Fix it by suspending writes around the quotaoff() call. Since the
filesystem is synced, no dangling references to dquots from SU
workitems can be left behind, which means that quotaoff succeeds.
The complication there is that quotaoff VFS op is performed with the
mount point busied, while to suspend, we need to start write on the
mp. If vn_start_write() is called on busied mp, system might deadlock
against parallel unmount request. Handle this by unbusy-ing mp before
starting write, which in turn requires changing the quotaoff()
interface to return with the mount point not busied, same as was done
for quotaon().
Reviewed by: mckusick
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D17208
We drop the keg lock when we go to actually allocate the slab, allowing
other threads to advance the cursor. This can cause us to exit the
round-robin loop before having attempted allocations from all domains,
resulting in a hang during a subsequent blocking allocation attempt from
a depleted domain.
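The failure mode above arises when loop termination depends on a cursor that other threads can move. A minimal sketch of one way to avoid it is to snapshot the cursor and count attempts locally, so every domain is tried exactly once. The names and the free-slab array are hypothetical, and locking is omitted; this is not the UMA code.

```c
#define NDOMAINS	4	/* hypothetical domain count */

/*
 * Try each domain once, starting from a snapshot of the shared cursor.
 * Returns the domain that satisfied the request, or -1 if all domains
 * are depleted.
 */
static int
alloc_round_robin(int free_slabs[NDOMAINS], int *cursor)
{
	int start = *cursor;	/* snapshot; later cursor changes are ignored */

	for (int i = 0; i < NDOMAINS; i++) {
		int d = (start + i) % NDOMAINS;

		if (free_slabs[d] > 0) {
			free_slabs[d]--;
			*cursor = (d + 1) % NDOMAINS;
			return (d);
		}
	}
	return (-1);
}
```

Because the loop bound is the local counter, a concurrent cursor update cannot cause a domain to be skipped.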
Reported and tested by: Jan Bramkamp <crest@bultmann.eu>
Reviewed by: alc, cem
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17209
The amd64 kernel started using ifunc for a variety of functions with
arch-specific implementations, and we would like to make use of the
same functionality on i386 and as much as possible avoid divergence
between i386 and amd64. In particular, future changes for security
improvements and mitigations may rely on ifunc support.
Approved by: re (kib)
Sponsored by: The FreeBSD Foundation
Limits can be safely obtained with lim_cur from the thread. racct is compiled
in but disabled by default. Note that racct enablement is a boot-only tunable.
This eliminates the second most common place of taking the lock while pkg building.
While here don't take the lock in mlockall either.
Reviewed by: kib
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17210
State check before enqueuing transmit task in bxe_link_attn() routine.
State check before invoking bxe_nic_unload in bxe_shutdown().
Submitted by: Vaishali.Kulkarni@cavium.com
Approved by: re (gjb)
Doing so can deadlock when the thread already owns another vnode lock,
e.g. during a rename, as was demonstrated by the reporter. In fact,
there seems to be no need to force the call to getinoquota() always,
because vn_open() locks vnode exclusively, and this is the most
important case. To add to the point, directories where the dirent is
added or removed, are locked exclusively as well.
Reported by: bwidawsk
Tested by: bwidawsk, pho (as part of the larger patch)
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
MFC after: 1 week
The atpic_register_sources callback tries to avoid registering interrupt
sources that would collide with an I/O APIC. However, the previous
implementation was failing to register IRQs 8-15 since the slave PIC
saw valid IRQs from the master and assumed an I/O APIC was present. To
fix, go back to registering all 8259A interrupt sources in one loop when
the master's register_sources method is invoked.
PR: 231291
Approved by: re (kib)
MFC after: 1 month
Also change the behaviour slightly: instead of freeing "config" if the
last nvlist doesn't pass the tests, return the last config that did pass
those tests. This matches the comment at the beginning of the function.
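The changed behaviour can be pictured as a scan that remembers the most recent candidate to pass validation instead of discarding everything when the final one fails. This is a sketch under assumed types: struct candidate and last_valid are hypothetical and stand in for the nvlist handling in the real function.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical candidate config with a precomputed validity flag. */
struct candidate {
	int	id;
	bool	valid;
};

/*
 * Return the last candidate that passed the tests, or NULL if none
 * did.  A failing candidate never clears an earlier good result.
 */
static const struct candidate *
last_valid(const struct candidate *c, size_t n)
{
	const struct candidate *best = NULL;

	for (size_t i = 0; i < n; i++) {
		if (c[i].valid)
			best = &c[i];
	}
	return (best);
}
```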
PR: 230704
Diagnosed by: avg
Reviewed by: asomers, avg
Tested by: Mark Martinec <Mark.Martinec@ijs.si>
Approved by: re (gjb)
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D17202
Patch removes all checks for pti/pcid/invpcid from the context switch
path. I verified this by looking at the generated code, compiling with
the in-tree clang. The invpcid_works1 trick required inline attribute
for pmap_activate_sw_pcid_pti() to work.
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
Differential revision: https://reviews.freebsd.org/D17181
There is no need to use %rax for temporary values and avoiding doing
so shortens the func.
Handle the explicit 'check for tail' depessimization for backwards copying.
This reduces the diff against userspace.
Tested with the glibc test suite.
Approved by: re (kib)
This will be used in following conversion of pmap_activate_sw().
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
Differential revision: https://reviews.freebsd.org/D17181
There is a braino in the non-erms variant which breaks the
functionality.
Will be fixed at a later time with a different patch.
Reported by: Manfred Antar
Approved by: re (implicit)
There is no need to use %rax for temporary values and avoiding doing
so shortens the func.
Handle the explicit 'check for tail' depessimization for backwards copying.
This reduces the diff against userspace.
Approved by: re (kib)
when there is work to do. This reduces CPU consumption to one
third on systems. This will help keep the thread CPU usage under
control now that the default hash size has increased.
Reviewed by: kp
Approved by: re (kib)
Differential Revision: https://reviews.freebsd.org/D17097
Intel docs claim such a memset (rep stosb + 4096 bytes) is
special-cased by microarchs. They also switched Linux to use
it for this purpose.
Approved by: re (gjb)
Not all event descriptions have a sample rate (such as inst_retired.any);
this restores the legacy behavior of using 65536 in that case. It also
prevents accidental API misuse that could lead to a panic.
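The fallback can be sketched in a few lines. The sketch assumes a missing sample rate is represented as 0, which is an assumption and not something the text states; the function and macro names are likewise hypothetical. Only the value 65536 comes from the commit message.

```c
#include <stdint.h>

#define DEFAULT_SAMPLE_RATE	65536	/* legacy default named in the text */

/*
 * Return the rate to program: the event's own rate if it has one,
 * otherwise the legacy default.  A rate of 0 is assumed to mean
 * "not provided by the event description".
 */
static uint64_t
effective_rate(uint64_t described_rate)
{
	return (described_rate != 0 ? described_rate : DEFAULT_SAMPLE_RATE);
}
```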
PR: 230985
Reported by: markj
Reviewed by: markj
Approved by: re (gjb)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16958
We ought to be consistent across our Tier-1 and nearly-Tier-1
architectures, so enable Capsicum for 32-bit armv6/armv7 by default.
PR: 204008
Reviewed by: ian, oshogbo
Approved by: re (gjb)
Relnotes: Yes
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17023
The previous default of "balanced" appears to have caused pathological
behavior, including very poor performance and 100% CPU load in the
arc_reclaim_thread.
The symptoms appeared when the daily periodic run started.
With this change, the system--and the ARC in particular--behaved
normally during a manual daily periodic run.
From Mark Johnston: The port of the balanced strategy is incomplete,
since arc_prune_async() is a no-op on FreeBSD. (This also seems
to imply that r337653 is a no-op.) After 12 is branched we can
port the remaining bits and consider changing the default back.
Submitted by: markj (essentially)
Reviewed by: markj
Approved by: re (gjb)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D17156
r338642 toggled the REPRODUCIBLE_BUILD knob but missed the
corresponding kern.opts.mk change.
We want to build the 12.0 release artifacts with reproducible builds
mode enabled. Switch it on in HEAD now to enable testing with upcoming
ALPHA builds. We can revisit the default setting for HEAD after the
branch is created.
This change eliminates the build metadata (user, hostname, timestamp,
etc.) from the kernel and loader. If the src tree is a git, svn or p4
checkout with changes then the metadata is retained.
The WITHOUT_REPRODUCIBLE_BUILD src.conf(5) knob can be used to revert
to the previous behaviour.
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Both drivers use this interface so add a dependency on it.
Since awg uses aw_sid for generating the MAC address, make it
depend on both aw_sid and nvmem so when only removing nvmem from
kernel config it will not include this driver.
Reported by: sbruno
Approved by: re (gjb)
The Xen page-table walker used to resolve the virtual addresses in the
hypercalls will refuse to access user-space pages when SMAP is enabled
unless the AC flag in EFLAGS is set (just like normal hardware with
SMAP support would do).
Since privcmd allows forwarding hypercalls (and buffers) from
user-space into Xen make sure SMAP is temporarily disabled for the
duration of the hypercall from user-space.
Approved by: re (gjb)
Sponsored by: Citrix Systems R&D
Register interrupts using the PIC pic_register_sources method instead
of doing it in apic_setup_io. This is now required, since the internal
interrupt structures are not yet setup when calling apic_setup_io.
Approved by: re (gjb)
Sponsored by: Citrix Systems R&D
Instead of panicking. Legacy PVH mode doesn't provide a lapic, and
since native_lapic_intrcnt is called unconditionally this would cause
the assert to trigger. Change the assert into a continue in order to
take into account the possibility of systems without a lapic.
Reviewed by: jhb
Approved by: re (gjb)
Sponsored by: Citrix Systems R&D
Differential revision: https://reviews.freebsd.org/D17015
The recommended way to obtain the vcpu id is using the cpuid
instruction with a specific leaf value. This leaf value must be
obtained at runtime, and it's done when populating the hypercall page.
Legacy PVH however will get the hypercall page populated by the
hypervisor itself before booting, so the cpuid leaf was not actually
set, thus preventing setting the vcpu id value from cpuid.
Fix this by making sure the cpuid leaf has been probed before
attempting to set the vcpu id.
Approved by: re (gjb)
Sponsored by: Citrix Systems R&D
That's the only mode in FreeBSD that requires the usage of PIRQs, so
there's no need to attach the PIRQ PIC when running in other modes.
Approved by: re (gjb)
Sponsored by: Citrix Systems R&D
When adding support for the new PVH mode the kenv handling was
switched to use a boot time allocated scratch space, however the
legacy PVH early boot code was not modified to allocate such space.
Approved by: re (gjb)
Sponsored by: Citrix Systems R&D
The vcpu_id for legacy PVH mode can be set from the output of cpuid,
so there's no need to have a special function to set it.
Also note that xenpv_set_ids should have been executed only for PV
guests, but was executed for all guest types and vcpu_id was later
fixed up for HVM guests.
Reported by: cperciva
Approved by: re (gjb)
Sponsored by: Citrix Systems R&D
So that it's done when the vcpu_id has been set. For the BSP the
vcpu_id is set at SUB_INTR, while for the APs it's done in
init_secondary_tail that's called at SUB_SMP order FIRST.
Reported and tested by: cperciva
Approved by: re (gjb)
Sponsored by: Citrix Systems R&D
Differential revision: https://reviews.freebsd.org/D17013
When running as a specific type of Xen guest the hypervisor won't
provide any emulated IO-APICs or legacy PICs at all, thus hitting the
following assert in the MSI code:
panic: Assertion num_io_irqs > 0 failed at /usr/src/sys/x86/x86/msi.c:334
cpuid = 0
time = 1
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xffffffff826ffa70
vpanic() at vpanic+0x1a3/frame 0xffffffff826ffad0
panic() at panic+0x43/frame 0xffffffff826ffb30
msi_init() at msi_init+0xed/frame 0xffffffff826ffb40
apic_setup_io() at apic_setup_io+0x72/frame 0xffffffff826ffb50
mi_startup() at mi_startup+0x118/frame 0xffffffff826ffb70
start_kernel() at start_kernel+0x10
Fix this by removing the assert in the MSI code, since it's possible
to get to the MSI initialization without having registered any other
interrupt sources.
Reviewed by: jhb
Approved by: re (gjb)
Sponsored by: Citrix Systems R&D
Differential revision: https://reviews.freebsd.org/D17001
Or else it triggers the following bug:
APIC: CPU 6 has ACPI ID 6
APIC: CPU 7 has ACPI ID 7
panic: vm_wait in early boot
cpuid = 0
time = 1
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xffffffff826ff8d0
vpanic() at vpanic+0x1a3/frame 0xffffffff826ff930
panic() at panic+0x43/frame 0xffffffff826ff990
vm_wait_domain() at vm_wait_domain+0xf9/frame 0xffffffff826ff9c0
kmem_alloc_contig_domain() at kmem_alloc_contig_domain+0x252/frame 0xffffffff826ffa50
kmem_alloc_contig() at kmem_alloc_contig+0x6c/frame 0xffffffff826ffad0
contigmalloc() at contigmalloc+0x2e/frame 0xffffffff826ffb00
x86bios_modevent() at x86bios_modevent+0x225/frame 0xffffffff826ffb20
module_register_init() at module_register_init+0xc0/frame 0xffffffff826ffb50
mi_startup() at mi_startup+0x118/frame 0xffffffff826ffb70
start_kernel() at start_kernel+0x10
While there also make x86bios_unmap_mem idempotent.
Reviewed by: kib
Approved by: re (gjb)
Sponsored by: Citrix Systems R&D
Differential revision: https://reviews.freebsd.org/D17000
* Fix a bug where the SYN handling during established state was
applied to a front state.
* Move a check for retransmission after the timer handling.
This was suppressing timer based retransmissions.
* Fix an off-by-one-byte error in the sequence number of retransmissions.
* Apply fixes corresponding to
https://svnweb.freebsd.org/changeset/base/336934
Reviewed by: rrs@
Approved by: re (kib@)
MFC after: 1 month
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D16912
relocation.
elf_relocaddr() has a hook to handle VIMAGE data addresses.
This fixes VIMAGE support for RISC-V when built as a module.
Approved by: re (gjb)
Sponsored by: DARPA, AFRL
Similar to arm64, the riscv compiler uses PC-relative loads/stores,
and for static data the compiler does not emit relocations.
As a result, the kernel module linker has nothing to fix up and data is
accessed from the wrong location.
Approved by: re (gjb)
Sponsored by: DARPA, AFRL
newvers.sh supports two modes for reproducible builds:
-r Reproducible build. Do not embed directory names, user
names, time stamps or other dynamic information into
the output file. This is intended to allow two builds
done at different times and even by different people on
different hosts to produce identical output.
-R Reproducible build if the tree represents an unmodified
checkout from a version control system. Metadata is
included if the tree is modified.
Switch to the second mode when reproducible builds are enabled.
The value of a reproducible build is much less when building from an
uncontrolled, modified src tree, and -R likely provides the best
compromise in allowing the REPRODUCIBLE_BUILD knob to be enabled by
default for the release.
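The difference between the two modes quoted above reduces to a small decision table. The sketch below encodes it in C purely for illustration; newvers.sh itself is a shell script, and the enum and function names here are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical encoding of the newvers.sh modes described above. */
enum repro_mode {
	REPRO_OFF,		/* neither flag given */
	REPRO_ALWAYS,		/* -r */
	REPRO_IF_CLEAN		/* -R */
};

/* Should build metadata (user, host, timestamp, ...) be embedded? */
static bool
embed_metadata(enum repro_mode mode, bool tree_modified)
{
	switch (mode) {
	case REPRO_ALWAYS:
		return (false);
	case REPRO_IF_CLEAN:
		return (tree_modified);
	default:
		return (true);
	}
}
```

Under -R an unmodified checkout builds reproducibly, while a locally modified tree still records who built it and when.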
Approved by: re (kib)
Sponsored by: The FreeBSD Foundation
From Piotr:
ix(4), ixv(4): Add VLAN tag strip check when receiving packets
ixv(4): Fix support for VLAN_HWTAGGING and VLAN_HWFILTER flags
This change will prevent driver from passing VLAN tags when
interface configuration is not expecting them. VF driver will
check for VLAN_HWTAGGING and VLAN_HWFILTER flags and act adequately.
This patch resolves problem occurring on EC2 platforms.
Submitted by: Piotr Pietruszewski <piotr.pietruszewski@intel.com>
Reported by: cperciva@
Reviewed by: cperciva@, Intel Networking
Approved by: re
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D17061
of the fix/workaround for the "ctld hanging on reload" problem.
PR: 220175
Reported by: Eugene M. Zheganin <emz at norma.perm.ru>
Tested by: Eugene M. Zheganin <emz at norma.perm.ru>
Approved by: re (kib)
MFC after: 2 weeks
Sponsored by: playkey.net
Lookups are protected by an epoch section, so the LB group linkage must
be a CK_LIST rather than a plain LIST. Furthermore, we were not
deferring LB group frees, so in_pcbremlbgrouphash() could race with
readers and cause a use-after-free.
Reviewed by: sbruno, Johannes Lundberg <johalun0@gmail.com>
Tested by: gallatin
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17031
No functional change intended.
Reviewed by: alc, kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17028
In both scenarios a timeout (EWOULDBLOCK) is considered as a
normal condition and the error should not pop up to upper layers.
PR: 231181
Submitted by: cem
Reported by: lev
Reviewed by: vangyzen, markm, delphij
Approved by: re (kib)
Approved by: secteam (delphij)
Differential Revision: https://reviews.freebsd.org/D17049
For RoCE, when CM requests are received for RC and UD connections,
netdevice of the incoming request is unavailable. Because of that CM
requests are always forwarded to init_net namespace.
Now that we have the GID index available, introduce SGID index in
incoming CM requests and refer to the netdevice of it.
While at it fix some incorrect uses of init_net and make sure
the rdma_create_id() function stores the VNET it is passed.
Based on linux commit:
cee104334c98dd04e9dd4d9a4fa4784f7f6aada9
MFC after: 3 days
Approved by: re (gjb)
Sponsored by: Mellanox Technologies
On the ThunderX the region occupied by the framebuffer is included in
the EFI map, so explicitly add it to the set of regions that aren't
managed by the physical memory allocator.
PR: 231064
Reviewed by: andrew
Approved by: re (gjb)
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17073
These limits are hit on the ThunderX. Also make
arm_physmem_exclude_region() panic rather than fail silently if the
limit on excluded regions is reached.
PR: 231064
Reviewed by: andrew
Approved by: re (kib)
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17073
While executing vm_pqbatch_process_page(m), m->queue may change to
PQ_NONE if the page daemon is concurrently freeing the page. In this
case m's queue state flags must be clear, so vm_pqbatch_process_page()
will be a no-op, but the race could cause spurious assertion failures.
Correct the assertion which assumed that m->queue's value does not
change while the page queue lock is held.
Reviewed by: alc, kib
Reported and tested by: pho
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17027
Reviewed by: bz, Johannes Lundberg <johalun0@gmail.com>
Approved by: re (kib)
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D17065
Also fix the validate_ipv4_net_dev() and validate_ipv6_net_dev() functions
which had source and destination addresses swapped, and didn't set the
scope ID for IPv6 link-local addresses.
This allows applications like krping to work using IPoIB devices.
MFC after: 3 days
Approved by: re (gjb)
Sponsored by: Mellanox Technologies
found that performance was no worse and usually better when running
with TRIM consolidation. Performance improvement was most noticeable
when multiple large files are released in a short period of time.
Thus, TRIM consolidation is being enabled by default. Should
operational problems be found, it can be disabled using the command
`sysctl vfs.ffs.dotrimcons=0'. This variable can also be set as a
tunable if early disabling is necessary.
Approved by: re (gjb)
Sponsored by: Netflix
for the rt and lle cache were added in r191129 (2009).
To my best knowledge they have never been used and route caching
has converted the inp_rt field from that commit to inp_route
rendering this field and these flags obsolete.
Convert the pointer into a spare pointer to not change the size of
the structure anymore (and to have a spare pointer) and mark the
two fields as unused.
Reviewed by: markj, karels
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17062
The stac/clac combo around each byte copy is causing a measurable
slowdown in benchmarks. Do it only before and after all data is
copied. While here reorder the code to avoid a forward branch in
the common case.
Note the copying loop (originating from copyinstr) is avoidably slow
and will be fixed later.
Reviewed by: kib
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17063
other allowed domains if the requested domain is below the minimum paging
threshold. Block in fork only if all domains available to the forking
thread are below the severe threshold rather than any.
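The "all rather than any" change in the fork path can be sketched with bitmasks, one bit per domain. The mask representation and both function names are assumptions for illustration; the kernel uses its own domain set type and threshold bookkeeping.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * allowed: bit i set if the thread may allocate from domain i.
 * severe:  bit i set if domain i is below the severe threshold.
 */

/* Old policy: block if any allowed domain is severely low. */
static bool
must_block_any(uint32_t allowed, uint32_t severe)
{
	return ((allowed & severe) != 0);
}

/* New policy: block only if every allowed domain is severely low. */
static bool
must_block_all(uint32_t allowed, uint32_t severe)
{
	return (allowed != 0 && (allowed & ~severe) == 0);
}
```

With two allowed domains and only one of them depleted, the old policy blocks the fork while the new one lets it proceed using the healthy domain.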
Submitted by: jeff
Reported by: mjg
Reviewed by: alc, kib, markj
Approved by: re (rgrimes)
Differential Revision: https://reviews.freebsd.org/D16191
Update the BOOTSTRAPPING check for libelf to require the fix for
mips64el object files committed in r338478 and re-enable kernel
modules in the MALTA64EL config file.
Reviewed by: emaste
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17054
Remove sysctls:
txq_drain_encapfail - now a duplicate of encap_txd_encap_fail
intr_link - was never incremented
intr_msix - was never incremented
rx_zero_len - was never incremented
The following were not incremented in all code-paths that apply:
m_pullups, mbuf_defrag, rxd_flush, tx_encap, rx_intr_enables, tx_frees,
encap_txd_encap_fail.
Fixes:
Replace the broken collapse_pkthdr() implementation with an MPASS().
fl_refills and fl_refills_large were not incremented when using netmap.
Reviewed by: gallatin
Approved by: re (marius)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16733
To support INTRNG with ACPI we need to set a non-zero cross reference value
for the interrupt controller. The GICv3 driver already had this value set,
however it was missed in the GICv2 driver. Fix this by setting xref to the
correct value.
Approved by: re (gjb)
This patch adds initial support for HTM, which might arrive in FreeBSD
version 12.1. This basic support defines a new kABI now, so that we do not
need to change it later, in the 12.1 time frame, when the full implementation
arrives.
Reviewed by: jhibbits
Approved by: re(marius), jhibbits (mentor)
Differential Revision: https://reviews.freebsd.org/D16889
Testing m->queue != PQ_NONE is not sufficient; see the commit log
message for r338276. As of r332974 vm_page_dequeue() handles
already-dequeued pages, so just replace vm_page_remque() calls with
vm_page_dequeue() calls.
Reviewed by: kib
Tested by: pho
Approved by: re (marius)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17025
adding the missing include files and changing the type of cpuid which
would otherwise cause a false comparison with NETISR_CPUID_NONE.
Reviewed by: rrs
Approved by: re (marius)
Differential Revision: https://reviews.freebsd.org/D16891
Otherwise the "depends_on provider" guard in sctp.d does not work as
intended.
Reported by: mjg
Reviewed by: tuexen
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17057
Somehow this was working even after PTI went in, at least on amd64, and got
broken by something only very recently.
Reviewed by: araujo
Approved by: re (gjb)
The receive side scaling stride parameter is a value which defines the interval
between active receive side queues. The traffic for the inactive queues is
redirected to the nearest active queue by use of modulus. The default value
of this parameter is one, which means all receive side queues are used.
The point of this feature is to redirect more traffic to fewer receive side
queues in order to take more advantage of sorted large receive offload,
sorted LRO. The sorted LRO works better when more packets are accumulated
per service interval.
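A minimal sketch of the redirection described above. It assumes that the
active queues are those whose index is a multiple of the stride and that
"nearest" means rounding down; the function name and that rounding direction
are illustrative assumptions, not taken from the driver:

```python
def active_queue(queue_index: int, stride: int) -> int:
    """Map a receive queue index to the active queue that serves it.

    Assumption: active queues sit at multiples of the stride, and
    inactive queues fold down onto the active queue below them.
    """
    if stride < 1:
        raise ValueError("stride must be at least 1")
    return queue_index - (queue_index % stride)


# With 8 queues and a stride of 4, only queues 0 and 4 stay active.
mapping = [active_queue(q, 4) for q in range(8)]
```

With the default stride of one the mapping is the identity, so every queue
remains in use.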
MFC after: 3 days
Approved by: re (marius)
Sponsored by: Mellanox Technologies
In r337943 ifnet's if_pcp was set to the PCP value in use
instead of IFNET_PCP_NONE.
Current ibcore code assumes that if_pcp is IFNET_PCP_NONE with
VLAN interfaces so it can identify prio-tagged traffic.
Fix that by explicitly verifying that the if_type is IFT_ETHER
and not IFT_L2VLAN.
MFC after: 3 days
Approved by: re (Marius), hselasky (mentor), kib (mentor)
Sponsored by: Mellanox Technologies
According to the PRM, no more than 0x3F data segments (DS) of 16 bytes each
are allowed.
Worst case scenario summary of DS usage:
Header is fixed: 2 DS
Maximum inlining: 98 => (98 - 2) / 16 = 6 DS
Remainder: 0x3F - 2 - 6 = 55 DS (mbuf frags)
Previously a value of 56 DS was used and this would work in the
normal case because not all inline data area was used up.
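The worst-case budget above can be re-derived with a few lines of arithmetic.
The constant names are made up for this sketch, and the two bytes subtracted
from the inline size are taken as given from the summary:

```python
DS_SIZE = 16      # bytes per data segment
MAX_DS = 0x3F     # limit on data segments, per the PRM
HEADER_DS = 2     # fixed header
MAX_INLINE = 98   # maximum inlined bytes

# (98 - 2) / 16 = 6 DS used by inlined data in the worst case
inline_ds = (MAX_INLINE - 2) // DS_SIZE

# Whatever remains is available for mbuf fragments.
frag_ds = MAX_DS - HEADER_DS - inline_ds
```

This leaves 55 segments for mbuf fragments, one fewer than the previous
value of 56.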
MFC after: 3 days
Approved by: re (marius)
Sponsored by: Mellanox Technologies
Also remove some related and unused subroutines. They have long been
replaced by variants that handle multiple coalesced events with a single
call.
No functional change intended.
Reviewed by: cem, kib
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17029
MIPS64 does not store the 'r_info' field of a relocation table entry as
a 64-bit value consisting of a 32-bit symbol index in the high 32 bits
and a 32-bit type in the low 32 bits as on other architectures. Instead,
the 64-bit 'r_info' field is really a 32-bit symbol index followed by four
individual byte type fields. For big-endian MIPS64, treating this as a
64-bit integer happens to be compatible with the layout expected by other
architectures (symbol index in upper 32-bits of resulting "native" 64-bit
integer). However, for little-endian MIPS64 the parsed 64-bit integer
contains the symbol index in the low 32 bits and the 4 individual byte
type fields in the upper 32-bits (but as if the upper 32-bits were
byte-swapped).
To cope, add two helper routines in gelf_getrel.c to translate between the
correct native 'r_info' value and the value obtained after the normal
byte-swap translation. Use these routines in gelf_getrel(), gelf_getrela(),
gelf_update_rel(), and gelf_update_rela(). This fixes 'readelf -r' on
little-endian MIPS64 objects which was previously decoding incorrect
relocations as well as 'objcopy: invalid symbox index' warnings from
objcopy when extracting debug symbols from kernel modules.
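The layout problem can be illustrated outside of libelf. The sketch below is
not the code in gelf_getrel.c; the helper names are hypothetical and the type
byte values are arbitrary. It only shows why the value parsed from a
little-endian file needs its halves exchanged and its type bytes swapped:

```python
import struct


def bswap32(value: int) -> int:
    """Reverse the byte order of a 32-bit value."""
    return int.from_bytes(value.to_bytes(4, "little"), "big")


def mips64el_to_native(raw: int) -> int:
    """Convert the value parsed from a little-endian MIPS64 file into
    the conventional layout: symbol index high, type bytes low."""
    sym = raw & 0xFFFFFFFF
    types = bswap32(raw >> 32)
    return (sym << 32) | types


def native_to_mips64el(native: int) -> int:
    """Inverse of mips64el_to_native()."""
    sym = native >> 32
    types = bswap32(native & 0xFFFFFFFF)
    return (types << 32) | sym


# On disk: a 32-bit symbol index followed by four one-byte type fields.
on_disk = struct.pack("<IBBBB", 7, 0, 0, 0, 18)
raw = struct.unpack("<Q", on_disk)[0]   # what a plain 64-bit read sees
native = mips64el_to_native(raw)
```

Here the plain 64-bit read leaves the symbol index 7 in the low half, and
the conversion moves it to the high half where other architectures expect it.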
Even with this fixed, objcopy was still crashing when trying to extract
debug symbols from little-endian MIPS64 modules. The workaround in
gelf_*rel*() depends on the current ELF object having a valid ELF header
so that the 'e_machine' field can be compared against EM_MIPS. objcopy
was parsing the relocation entries to possibly rewrite the 'r_info' fields
in the update_relocs() function before writing the initial ELF header to
the destination object file. Move the initial write of the ELF header
earlier before copy_contents() so that update_relocs() uses the correct
symbol index values.
Note that this change should really go upstream. The binutils readelf
source has a similar hack for MIPS64EL though I implemented this version
from scratch using the MIPS64 ABI PDF as a reference.
Discussed with: jkoshy
Reviewed by: emaste, imp
Approved by: re (gjb, kib)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D15734
We will have last_block < blocks if the block count is divisible
by BLIST_BMAP_RADIX, but a terminator node is still needed if the
tree isn't balanced. In this case we were overrunning the blist
array by 16 bytes during initialization.
While here, add a check for the invalid blocks == 0 case.
PR: 231116
Reviewed by: alc, kib (previous version), Doug Moore <dougm@rice.edu>
Approved by: re (gjb)
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17020
No functional change intended.
Reviewed by: bz, Johannes Lundberg <johalun0@gmail.com>
Approved by: re (rgrimes)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17030
fast forwarding path, as it already works for IPv6 and for both of them
on old slow path.
PR: 231143
Reviewed by: ae
Approved by: re (gjb)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D17039
This is done by setting SUM (permit Supervisor User Memory access)
bit in sstatus register.
The functions we allow access for are routines in assembly that
explicitly handle crossing the user kernel boundary.
Approved by: re (kib)
Sponsored by: DARPA, AFRL
Also this fixes the eflags.ac leak from copyin_smap() when the copied
data length is a multiple of eight bytes.
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
Non-PTI mode does not switch kcr3, which means that kcr3 is almost
always stale. This is important for the NMI handler, which reloads
%cr3 with PCPU(kcr3) if the value is different from PMAP_NO_CR3.
The end result is that curpmap in NMI handler does not match the page
table loaded into hardware. The manifestation was copyin(9) looping
forever when a usermode access page fault cannot be resolved by
vm_fault() updating a different page table.
Reported by: mmacy
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Approved by: re (gjb)
r337289 had the side effect of reducing the USB frame 0 buffer size down to
the touch report size. That broke some devices, e.g. "Raydium Touch System",
which are capable of generating non-touch frames of greater length.
Fix it by enlarging the frame 0 buffer up to the internal wmt(4) buffer size.
Reported by: Roberto Fernandez Cueto <roberfern@gmail.com>
Tested by: Roberto Fernandez Cueto <roberfern@gmail.com>
Approved by: re (gjb)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D16772
to clear L2 and L3 route caches.
Also mark one function argument as __unused.
Reviewed by: karels, ae
Approved by: re (rgrimes)
Differential Revision: https://reviews.freebsd.org/D17007
macro rather than hand crafted code.
No functional changes.
Reviewed by: karels
Approved by: re (rgrimes)
Differential Revision: https://reviews.freebsd.org/D17006
inp_route6 for IPv6 code after r301217.
This was most likely a c&p error from the legacy IP code, which
did not matter as it is a union and both structures have the same
layout at the beginning.
No functional changes.
Reviewed by: karels, ae
Approved by: re (rgrimes)
Differential Revision: https://reviews.freebsd.org/D17005
use the already existing one. No functional changes.
Reviewed by: karels, ae
Approved by: re (rgrimes)
Differential Revision: https://reviews.freebsd.org/D17004
This was disabled recently due to lack of support in KDB disassembler
and DTrace FBT provider. Support for 'C'-extension to both of these was
added, so we can now enable 'C'-extension.
This reduces the size of the kernel, which is important for low-end embedded
devices, and saves cache footprint on high-performance machines.
Approved by: re (kib)
Sponsored by: DARPA, AFRL
/etc/security/audit_event to provide a list of audit event-number <->
name mappings. However, this occurs too late for anonymous tracing.
With this change, adding 'audit_event_load="YES"' to /boot/loader.conf
will cause the boot loader to preload the file, and then the kernel
audit code will parse it to register an initial set of audit event-number
<-> name mappings. Those mappings can later be updated by auditd(8) if
the configuration file changes.
Reviewed by: gnn, asomers, markj, allanjude
Discussed with: jhb
Approved by: re (kib)
MFC after: 1 week
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D16589
This appeared to be required to have EFI RT support and EFI RTC
enabled by default, because there are too many reports of faulting
calls on many different machines. The knob is added to leave the
exceptions unhandled, which allows debugging the actual bugs.
Reviewed by: kevans
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (rgrimes)
Differential revision: https://reviews.freebsd.org/D16972
handling.
This is split into a separate commit from the main change to make it
easier to handle possible revert after upcoming KBI freeze.
Reviewed by: kevans
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (rgrimes)
Differential revision: https://reviews.freebsd.org/D16972
Print an error message in verbose mode when the CLOCK_SETTIME() clock_if.m
method fails. For the EFIRT RTC clock, add the error code to the
CLOCK_GETTIME() failure report.
Reviewed by: kevans
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (rgrimes)
Differential revision: https://reviews.freebsd.org/D16972
trap_pfault() KPTI violation check.
EFI RT may set curpmap to NULL for the duration of the call for some
machines (PCID but no INVPCID). Since apparently EFI RT code must be
ready for exceptions from the calls, avoid dereferencing curpmap until
we know that this call does not come from usermode.
Reviewed by: kevans
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (rgrimes)
Differential revision: https://reviews.freebsd.org/D16972
that can be coalesced. To be clear, fragmentation of phys_avail[] is not
the cause. This fragmentation of vm_phys_segs[] arises from the "special"
calls to vm_phys_add_seg(), in other words, not those that derive directly
from phys_avail[], but those that we create for the initial kernel page
table pages and now for the kernel and modules loaded at boot time. Since
we sometimes iterate over the physical memory segments, coalescing these
segments at initialization time is a worthwhile change.
Reviewed by: kib, markj
Approved by: re (rgrimes)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D16976
for the "ctld hanging on reload" problem observed in some cases under
high load. I'm not 100% sure it's _the_ fix, as the issue is rather hard
to reproduce, but it was tested as part of a larger patch and the problem
disappeared. It certainly shouldn't break anything.
Now, technically, it shouldn't be needed. Quoting mav@, "After
ct->ct_online == 0 there should be no new sessions attached to the target.
And if you see some problems about it, it may either mean that there are
some races where single cfiscsi_session_terminate(cs) call may be lost,
or as a guess while this thread was sleeping target was reenabled and
redisabled again". Should such a race be discovered and properly fixed
in the future, then this and the followup two commits can be backed out.
PR: 220175
Reported by: Eugene M. Zheganin <emz at norma.perm.ru>
Tested by: Eugene M. Zheganin <emz at norma.perm.ru>
Discussed with: mav
Approved by: re (gjb)
MFC after: 2 weeks
Sponsored by: playkey.net
play the MIDI files through /dev/sequencer device with tools like
playmidi. The audio output will go through an external MIDI device
such as a wavetable synthesis card.
Reviewed by: matk (a long time ago), kib
Approved by: re (kib)
Tested with: Terratec SiXPack 5.1+ + Yamaha DB50XG
MFC after: 4 weeks
This fixes an upstream regression introduced in r331404, causing overly
aggressive reclamation of the ARC when under pressure.
Diagnosed by: Paul <devgs@ukr.net>
Approved by: re (gjb)
MFC after: 3 days
cycle. The i386 build failure appears to be transient, and is
now becoming more difficult to reproduce reliably in order to
identify the cause. I will continue to investigate this, however.
Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation
The PCIOCLISTVPD ioctl on /dev/pci is used to fetch a list of VPD
key-value pairs for a specific PCI function. It is used by
'pciconf -l -V'. The list is stored in a userland-supplied buffer as
an array of variable-length structures where the key and data length
are stored in a fixed-size header followed by the variable-length
value as a byte array. To facilitate walking this array in userland,
<sys/pciio.h> provides a PVE_NEXT() helper macro to return a pointer
to the next array element by reading the length out of the current
header and using it to compute the address of the next header.
To simplify the implementation, the ioctl handler was also using
PVE_NEXT() on the user address of the user buffer to compute the
user address of the next array element. However, the PVE_NEXT() macro
when used with a user address was reading the value's length by
indirecting the user pointer. The value was read after the current
record had been copied out to the user buffer, so it appeared to work
on architectures where user addresses are directly dereferenceable from
the kernel (all but powerpc and i386 after the 4:4 split). The recent
enablement of SMAP on amd64 caught this violation however. To fix,
add a variant of PVE_NEXT() for use in the ioctl handler that takes an
explicit value length.
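A userland-style sketch of walking such an array, and of computing the next
element from an explicitly supplied length rather than by reading it back
through the element pointer. The header layout (2-byte keyword, flags byte,
length byte) and all names are assumptions made for this illustration:

```python
import struct

# Assumed fixed-size header: keyword, flags, data length.
HEADER = struct.Struct("<2sBB")


def pack_element(keyword: bytes, data: bytes, flags: int = 0) -> bytes:
    """Build one header-plus-value record."""
    return HEADER.pack(keyword, flags, len(data)) + data


def next_offset(offset: int, datalen: int) -> int:
    """Offset of the next header, given an explicitly supplied length."""
    return offset + HEADER.size + datalen


def walk(buf: bytes):
    """Yield (keyword, value) pairs from a buffer of packed records."""
    offset = 0
    while offset < len(buf):
        keyword, _flags, datalen = HEADER.unpack_from(buf, offset)
        start = offset + HEADER.size
        yield keyword, buf[start:start + datalen]
        offset = next_offset(offset, datalen)


buf = pack_element(b"PN", b"ABC-123") + pack_element(b"SN", b"0042")
elements = list(walk(buf))
```

Passing the length in means the side that produced the record never has to
read it back out of the destination buffer.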
Reported by: Jeffrey Pieper @ Intel
Reviewed by: kib
Approved by: re (gjb)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D16800
r337776 started hashing the fragments into buckets for faster lookup.
The hashkey is larger than intended. This results in random stack data being
included in the hashed data, which in turn means that fragments of the same
packet might end up in different buckets, causing the reassembly to fail.
Set the correct size for hashkey.
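The effect of hashing too many bytes can be reproduced in isolation. CRC32
stands in for the real hash function here, and the key size and byte values
are hypothetical:

```python
import zlib

KEY_LEN = 8  # bytes that actually identify the packet (hypothetical)


def bucket_hash(key_buf: bytes, hashed_len: int) -> int:
    """Hash the first hashed_len bytes of the key buffer."""
    return zlib.crc32(key_buf[:hashed_len])


key = bytes(range(KEY_LEN))
# Two fragments of the same packet: identical key, different leftover
# bytes after it (standing in for uninitialized stack contents).
frag_a = key + b"\x00\x00\x00\x00"
frag_b = key + b"\xde\xad\xbe\xef"

too_long = (bucket_hash(frag_a, KEY_LEN + 4), bucket_hash(frag_b, KEY_LEN + 4))
correct = (bucket_hash(frag_a, KEY_LEN), bucket_hash(frag_b, KEY_LEN))
```

With the oversized length the two hashes differ even though the keys match;
with the correct length they are equal.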
PR: 231045
Approved by: re (kib)
MFC after: 3 days
Remove the PNP info for the moment from the driver. It's an
experimental driver (as noted in r328150). Its performance is about
1/10th that of aesni. It will often panic when used with GELI (PR
2279820). It's not in our best interest to have such a driver be
autoloaded by default.
Approved by: re@ (rgrimes)
Reviewed By: cem@
Differential Review: https://reviews.freebsd.org/D16959
Same as r333305: with the Linux 4.17 dts, the compatible for the prcm added
'simplebus', which means that the simplebus driver will attach to it
at the BUS_PASS_BUS pass.
Change the pass for the prcm driver to be at BUS_PASS_BUS so we will win
the attach.
This introduces a problem as this driver needs the omap_scm one to be already
attached. omap_scm also attaches at BUS_PASS_BUS but after the prcm one, as it
comes later in the dtb and the simplebus driver simply walks the tree to attach
its children.
Use the bus_new_pass method to defer reading the frequencies until
BUS_PASS_TIMER.
This fixes booting on pandaboard.
Approved by: re (rgrimes)
It is used by a number of applications, notably top(1).
Reported by: netchild
Reviewed by: allanjude
Approved by: re (delphij)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D16943
The MP ring may have txq pointers enqueued. Previously, these were
passed to m_free() when IFC_QFLUSH was set. This patch checks for
the value and doesn't call m_free().
Reviewed by: gallatin
Approved by: re (gjb)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16882
Fix the build of the GENERIC-MMCCAM kernel config after the sdhci_xenon
driver was committed.
While here correct sdhci_fdt and tegra_sdhci; even with MMCCAM they do
need to depend on sdhci(4).
Reported by: Reshetnikov Dmitriy <genserg@hotmail.com>
Approved by: re (kib)
Sponsored by: Rubicon Communications, LLC ("NetGate")
Exposing max_offset and min_offset defines in public headers is
causing clashes with variable names, for example when building QEMU.
Based on the submission by: royger
Reviewed by: alc, markj (previous version)
Sponsored by: The FreeBSD Foundation (kib)
MFC after: 1 week
Approved by: re (marius)
Differential revision: https://reviews.freebsd.org/D16881
properly in a couple of places in the driver.
Submitted by: Krishnamraju Eraparaju @ Chelsio
Approved by: re@ (rgrimes@)
Sponsored by: Chelsio Communications
Previously, x86 used static ranges of IRQ values for different types
of I/O interrupts. Interrupt pins on I/O APICs and 8259A PICs used
IRQ values from 0 to 254. MSI interrupts used a compile-time-defined
range starting at 256, and Xen event channels used a
compile-time-defined range after MSI. Some recent systems have more
than 255 I/O APIC interrupt pins which resulted in those IRQ values
overflowing into the MSI range triggering an assertion failure.
Replace statically assigned ranges with dynamic ranges. Do a single
pass computing the sizes of the IRQ ranges (PICs, MSI, Xen) to
determine the total number of IRQs required. Allocate the interrupt
source and interrupt count arrays dynamically once this pass has
completed. To minimize runtime complexity these arrays are only sized
once during bootup. The PIC range is determined by the PICs present
in the system. The MSI and Xen ranges continue to use a fixed size,
though this does make it possible to turn the MSI range size into a
tunable in the future.
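The sizing pass can be pictured as follows. This is only a sketch of the
idea; the function name, range names, and counts are hypothetical:

```python
def assign_irq_ranges(sizes):
    """Lay out consecutive IRQ ranges from a list of (name, count) pairs.

    Returns a dict of name -> (first_irq, count) plus the total number
    of IRQs, which is what the arrays would be sized by.
    """
    ranges = {}
    next_irq = 0
    for name, count in sizes:
        ranges[name] = (next_irq, count)
        next_irq += count
    return ranges, next_irq


# Hypothetical system with 300 interrupt pins, which would have
# overflowed into a static MSI range starting at 256.
ranges, total = assign_irq_ranges([("pic", 300), ("msi", 512), ("xen", 256)])
intrcnt = [0] * total  # allocated once, after the sizing pass
```

Because the MSI base follows from the PIC count, a larger PIC range simply
pushes the later ranges up instead of colliding with them.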
As a result, various places are updated to use dynamic limits instead
of constants. In addition, the vmstat(8) utility has been taught to
understand that some kernels may treat 'intrcnt' and 'intrnames' as
pointers rather than arrays when extracting interrupt stats from a
crashdump. This is determined by the presence (vs absence) of a
global 'nintrcnt' symbol.
This change reverts r189404 which worked around a buggy BIOS which
enumerated an I/O APIC twice (using the same memory mapped address for
both entries but using an IRQ base of 256 for one entry and a valid
IRQ base for the second entry). Making the "base" of MSI IRQ values
dynamic avoids the panic that r189404 worked around, and there may now
be valid I/O APICs with an IRQ base above 256 which this workaround
would incorrectly skip.
If in the future the issue reported in PR 130483 reoccurs, we will
have to add a pass over the I/O APIC entries in the MADT to detect
duplicates using the memory mapped address and use some strategy to
choose the "correct" one.
While here, reserve room in intrcnts for the Hyper-V counters.
PR: 229429, 130483
Reviewed by: kib, royger, cem
Tested by: royger (Xen), kib (DMAR)
Approved by: re (gjb)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D16861
With GNU ifuncs, multiple FBT probes may correspond to the same
instruction. fbt_invop() assumed that this could not happen and
would return after the first probe found in the global FBT hash
table, which might not be the one that's enabled. Fix the problem
on x86 by linking probes that share a tracepoint and having each
linked probe fire when the tracepoint is hit.
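A toy model of the lookup change. The class and probe names are invented, and
filtering on an enabled flag is a simplification of how firing a disabled
probe becomes a no-op:

```python
from collections import defaultdict


class Probe:
    def __init__(self, name, enabled=False):
        self.name = name
        self.enabled = enabled


class ProbeTable:
    def __init__(self):
        # address -> every probe registered at that tracepoint
        self.by_addr = defaultdict(list)

    def add(self, addr, probe):
        self.by_addr[addr].append(probe)

    def hit(self, addr):
        """Visit all probes linked at addr instead of stopping at the
        first one found; return the names of those that fired."""
        return [p.name for p in self.by_addr.get(addr, []) if p.enabled]


table = ProbeTable()
table.add(0x1000, Probe("resolver_target:entry"))
table.add(0x1000, Probe("ifunc_alias:entry", enabled=True))
fired = table.hit(0x1000)
```

Returning after the first entry would have found only the disabled probe
here and missed the enabled one.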
PR: 230846
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D16921
table allocation.
At the time that mp_bootaddress() is called, phys_avail[] array does
not reflect some memory reservations already done, like kernel
placement. Recent changes to DMAP protection which make kernel text
read-only in DMAP revealed this, where on some machines AP boot page
tables selection appears to intersect with the kernel itself.
Fix this by checking the addresses selected using the same algorithm
as bootaddr_rwx(). Also, try to chomp pages for the page table not
only at the start of the contiguous range, but also at the end. This
should improve robustness when the only suitable range is already
consumed by the kernel.
Reported and tested by: Michael Gmelin <freebsd@grem.de>
Reviewed by: jhb
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
Differential revision: https://reviews.freebsd.org/D16907
The ip/ipv6 header files are included only if the appropriate definition
exists, but the driver was missing similar checks when using the
ip and ip6_hdr structures.
If the kernel was not built with the INET or INET6 option, the driver
was preventing kernel from being built.
To fix that, the missing ifdef checks were added to the driver.
PR: Bug 230886
Submitted by: Michal Krawczyk <mk@semihalf.com>
Reported by: O. Hartmann
Approved by: re (gjb)
Obtained from: Semihalf
MFC after: 1 week
Sponsored by: Amazon, Inc.
This code works for some people, but hasn't been updated in a long
time. Still allow people to use this code for the moment, but put a
big, nasty obsolete message to inform and encourage people to move to
the port.
Approved by: re@ (gjb)
Differential Review: https://reviews.freebsd.org/D16894
Make the building of drm dependent on MK_MODULE_DRM and the building
of module drm2 on MK_MODULE_DRM2. The defaults are unchanged.
Approved by: re@ (gjb)
Differential Review: https://reviews.freebsd.org/D16894
Similar to how the IPv4 code will reject an IPv6 LB group,
we must ignore IPv4 LB groups when looking up an IPv6
listening socket. If this is not done, a port only match
may return an IPv4 socket, which causes problems (like
sending IPv6 packets with a hopcount of 0, making them unrouteable).
Thanks to rrs for all the work to diagnose this.
Approved by: re (rgrimes)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D16899
Previously we have been lucky where the state was already in r0, however
this is not guaranteed. Use the passed in register as the location to
store the upper half of the arm VFP registers rather than relying on it
being r0.
Approved by: re (kib)
Users of arc4random(3) should never call them directly.
All ports tree usage was fixed as part of bug 230756.
Relnotes: yes
Approved by: re (marius), exp-run (bug 230756 by portmgr antoine)
given in random(4).
This includes updating of the relevant man pages, and no-longer-used
harvesting parameters.
Ensure that the pseudo-unit-test still does something useful, now also
with the "other" algorithm instead of Yarrow.
PR: 230870
Reviewed by: cem
Approved by: so(delphij,gtetlow)
Approved by: re(marius)
Differential Revision: https://reviews.freebsd.org/D16898
error in the function hypercall_memfree(), where the wrong arena was being
passed to kmem_free().
Introduce a per-page flag, VPO_KMEM_EXEC, to mark physical pages that are
mapped in kmem with execute permissions. Use this flag to determine which
arena the kmem virtual addresses are returned to.
Eliminate UMA_SLAB_KRWX. The introduction of VPO_KMEM_EXEC makes it
redundant.
Update the nearby comment for UMA_SLAB_KERNEL.
Reviewed by: kib, markj
Discussed with: jeff
Approved by: re (marius)
Differential Revision: https://reviews.freebsd.org/D16845
the foreground and background colours. In bitblt_text functions, compare
values to this cache and don't re-draw the characters if they haven't changed.
When invalidating the display, clear this cache in order to force characters
to be redrawn; also force full redraws between suspend/resume pairs since odd
artifacts can otherwise result.
When scrolling the display (which is where most time is spent within the vt
driver) this yields a significant performance improvement if most lines are
less than the width of the terminal, since this avoids re-drawing blanks on
top of blanks.
(Note that "re-drawing" here includes writing to the VGA text mode buffer; on
virtualized systems this can be extremely slow since it triggers a glyph
being rendered onto a 640x480 screen).
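The caching idea can be sketched independently of vt(4). The class below is
a hypothetical illustration, not the driver's data structure:

```python
class TextCache:
    """Remember the (char, fg, bg) last drawn in each cell."""

    def __init__(self, rows, cols):
        self.rows, self.cols = rows, cols
        self.invalidate()

    def invalidate(self):
        # None never equals a real cell, so everything is redrawn next.
        self.cells = [[None] * self.cols for _ in range(self.rows)]

    def draw(self, row, col, cell, blit):
        """Call blit() only if the cell differs from what is cached."""
        if self.cells[row][col] == cell:
            return False
        self.cells[row][col] = cell
        blit(row, col, cell)
        return True


drawn = []


def blit(row, col, cell):
    drawn.append((row, col, cell))


cache = TextCache(2, 4)
blank = (" ", 7, 0)
for _ in range(2):               # draw the same blank screen twice
    for r in range(2):
        for c in range(4):
            cache.draw(r, c, blank, blit)
after_two_passes = len(drawn)

cache.invalidate()               # e.g. after suspend/resume
redrawn = cache.draw(0, 0, blank, blit)
```

The second pass over an unchanged screen issues no draws at all, which is
where the savings on mostly blank lines come from.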
On a c5.4xlarge EC2 instance (with emulated text mode VGA) this cuts the time
spent in vt(4) during the kernel boot from 1200 ms to 700ms; on my laptop
(with a 3200x1800 display) the corresponding time is reduced from 970 ms down
to 155 ms.
Reviewed by: imp, cem
Approved by: re (gjb)
Relnotes: Significant speedup in vt(4) and the system boot generally.
Differential Revision: https://reviews.freebsd.org/D16723
Curpmap must be already valid when cpu_throw() is called, even for early
AP startup.
Suggested by: alc
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
Approved by: re (marius)
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D16893
Add pmap_activate_boot() for i386, move the invocation on APs from MD
init_secondary() to x86 init_secondary_tail().
Suggested by: alc
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
Approved by: re (marius)
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D16893
ether_set_pcp should not be called from ether_output_frame for VLAN
interfaces -- the vid + pcp will be inserted during vlan_transmit in
that case. r337943 sets the VLAN's ifnet's if_pcp to a proper PCP value
and this led to double encapsulation (once with vid 0 and second time
with vid+pcp).
PR: 230794
Reviewed by: kib@
Approved by: re@ (gjb@)
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D16887
they don't check the result of BUS_READ_IVAR(9) and silently return stack
garbage on failure in case a bus doesn't implement a particular instance
variable for example. With MMC bridges not providing MMCBR_IVAR_RETUNE_REQ,
yet, this in turn can cause mmc(4) to get into a state in which re-tuning
seems to be necessary but is inappropriate, causing mmc_wait_for_request()
to fail. Thus, don't use __BUS_ACCESSOR() for mmcbr_get_retune_req() and
instead provide a version of the latter which returns retune_req_none if
reading MMCBR_IVAR_RETUNE_REQ fails.
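The shape of the fix, reduced to a sketch. The Bus class, the numeric
constants, and the string key are all hypothetical stand-ins for the newbus
interfaces named above:

```python
RETUNE_REQ_NONE = 0    # hypothetical numeric values
RETUNE_REQ_NORMAL = 1


class Bus:
    def __init__(self, ivars):
        self.ivars = ivars

    def read_ivar(self, name):
        """Return (error, value); error is nonzero if unsupported."""
        if name not in self.ivars:
            return 1, None
        return 0, self.ivars[name]


def get_retune_req(bus):
    """Fall back to 'none' when the bus lacks the instance variable,
    instead of returning whatever was left in the result slot."""
    error, value = bus.read_ivar("retune_req")
    if error != 0:
        return RETUNE_REQ_NONE
    return value
```

A bridge that does not implement the variable then reports that no re-tuning
is requested, which is the safe default for modes that never need it.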
A more straightforward solution would have been to change mmc(4) to not
call mmcbr_get_retune_req() if the current transfer mode doesn't require
re-tuning to begin with. However, for modes such as SDR50, it depends on
the controller whether periodic re-tuning is needed. Therefore, knowledge of
whether a particular transfer mode does require re-tuning should be kept
to the bridge drivers.
This change is the generic version of r338271, as intended not requiring
bridge drivers to be touched (unless transfer modes beyond high speed are
to be supported that is).
Approved by: re (gjb)
- sun50i-a64-sid.dtso registers the Security ID node, needed for thermal
- sun50i-a64-ths.dtso registers the thermal node, for which we already have a
driver
- sun50i-a64-timer.dtso registers the timer node, needed as the generic timer
glitches on the A64 SoC.
Approved by: re (gjb)
for security, and the excess just slows things down badly.
PR: 230808
Submitted by: rwmaillists@googlemail.com, but tweaked by me
Reported by: Danilo Egea Gondolfo <danilo@FreeBSD.org>
Reviewed by: cem,delphij
Approved by: re(rgrimes)
Approved by: so(delphij)
MFC after: 1 Month
Differential Revision: https://reviews.freebsd.org/D16873
The fix in r331950 turned out to be incomplete: it fixed only the pool
import case, not pool creation, leaving the prefetcher still blocked for
newly created pools.
Approved by: re (gjb)
MFC after: 1 week
Revert r338177, r338176, r338175, r338174, r338172
After long consultations with re@, core members and mmacy, revert
these changes. Followup changes will be made to mark them as
deprecated and print a message about where to find the up-to-date
driver. Followup commits will be made to make this clear in the
installer. Followup commits to reduce POLA in ways we're still
exploring.
It's anticipated that after the freeze, this will be removed in
13-current (with the residual of the drm2 code copied to
sys/arm/dev/drm2 for the TEGRA port's use w/o the intel or
radeon drivers).
Due to the impending freeze, there was no formal core vote for
this. I've been talking to different core members all day, as well as
Matt Macey and Glen Barber. Nobody is completely happy, all are
grudgingly going along with this. Work is in progress to mitigate
the negative effects as much as possible.
Requested by: re@ (gjb, rgrimes)
Expose these counters under the vm.domain sysctl node. The existing
vm.stats.vm.v_pdpages sysctl is preserved.
Reviewed by: alc (previous version)
Differential Revision: https://reviews.freebsd.org/D14666
Per-page queue state is updated non-atomically, with either the page
lock or the page queue lock held. When vm_page_dequeue() is called
without the page lock, in rare cases a different thread may be
concurrently dequeuing the page with the pagequeue lock held. Because
of the non-atomic update, vm_page_dequeue() might return before queue
state is completely updated, which can lead to race conditions.
Restrict the vm_page_dequeue() interface so that it must be called
either with the page lock held or on a free page, and busy wait when
a different thread is concurrently updating queue state, which must
happen in a critical section.
While here, do some related cleanup: inline vm_page_dequeue_locked()
into its only caller and delete a prototype for the unimplemented
vm_page_requeue_locked(). Replace the volatile qualifier for "queue"
added in r333703 with explicit uses of atomic_load_8() where required.
Reported and tested by: pho
Reviewed by: alc
Differential Revision: https://reviews.freebsd.org/D15980
Per r338251, this ensures that ifunc calls have the same ordinary
function calls.
Reviewed by: emaste (previous version)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D16750
a10_timer is currently used in UP Allwinner SoCs (A10 and A13).
Those don't have the generic ARM timer.
The ARM generic timecounter is broken in the A64 SoC; some attempts have
been made to fix the glitch but users still reported some minor ones.
Since the A64 (and all Allwinner SoCs) still have this timer controller, rework
the driver so we can use it in any SoC.
Since the 64-bit counter is not present on all SoCs, use one of the
generic 32-bit counters as the timecounter source.
PR: 229644
Without this the mmc stack sometimes thinks that we are in a retune
operation, and some commands, such as switching the bus width to 4 bits, fail.
We now switch correctly to 4-bit mode for SD cards.
Reported by: jmg, others in pine64 irc channel
SDHCI_TRNS_ACMD12 is to be set only for multiple-block read/write
commands without data length information, so don't unconditionally
set this bit. The result matches what e.g. Linux does.
- Section 2.2.19 of the SDHCI specification version 4.20 states that
SDHCI_ACMD12_ERR should be only valid if SDHCI_INT_ACMD12ERR is set
and hardware may clear SDHCI_ACMD12_ERR when SDHCI_INT_ACMD12ERR is
cleared (differing silicon behavior is specifically allowed, though).
Thus, read SDHCI_ACMD12_ERR before clearing SDHCI_INT_ACMD12ERR.
While at it, use the 16-bit accessor rather than the 32-bit one for
reading the 16-bit SDHCI_ACMD12_ERR.
- SDHCI_INT_TUNEERR isn't one of the ROC bits in SDHCI_INT_STATUS so
clear it explicitly.
- Add missing prototypes and sort them.
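The read-before-clear ordering for SDHCI_ACMD12_ERR can be illustrated with a simulated controller. This is a minimal sketch, not the driver code: `struct sk_sdhci`, the helper names, and the `SK_INT_ACMD12ERR` bit value are placeholders, and the simulated hardware is assumed to clear the error register as soon as the interrupt bit is acknowledged.

```c
#include <stdint.h>

/* Placeholder bit value; consult the SDHCI specification for the real one. */
#define SK_INT_ACMD12ERR 0x0100u

/* Simulated controller with two registers named after those in the text. */
struct sk_sdhci {
	uint32_t int_status;  /* stands in for SDHCI_INT_STATUS */
	uint16_t acmd12_err;  /* stands in for SDHCI_ACMD12_ERR */
};

/*
 * Write-1-to-clear on the interrupt status register.  This simulated
 * silicon also zeroes the Auto CMD12 error register when the interrupt
 * bit is cleared, which the specification permits.
 */
static void
sk_clear_int(struct sk_sdhci *hc, uint32_t bits)
{
	if ((bits & hc->int_status & SK_INT_ACMD12ERR) != 0)
		hc->acmd12_err = 0;
	hc->int_status &= ~bits;
}

/* Latch the 16-bit error register first, then acknowledge the interrupt. */
static uint16_t
sk_handle_acmd12(struct sk_sdhci *hc)
{
	uint16_t err = 0;

	if ((hc->int_status & SK_INT_ACMD12ERR) != 0) {
		err = hc->acmd12_err;                /* read first */
		sk_clear_int(hc, SK_INT_ACMD12ERR);  /* clear second */
	}
	return (err);
}
```

With the two statements swapped, the handler on this simulated hardware would always report an error value of zero.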
Migrate the udp6_send() v4mapped code to udp6_output(), saving us a re-lock and
further simplifying the address-family handling code by eliminating
AF_INET checks and almost all v4mapped handling right after the start,
as those cases can no longer happen.
Rework output path locking similar to UDP4 allowing for better
parallelism (see r222488, and later versions).
Sponsored by: The FreeBSD Foundation (2012)
Sponsored by: iXsystems (2012)
Differential Revision: https://reviews.freebsd.org/D3721
netfront_backend_changed() is called from the xenwatch_thread(), which means
that the curvnet is not set. We have to set it before we can call things like
arp_ifinit().
PR: 230845
- Most of the boards use U-Boot. U-Boot embeds a DTB that isn't
compiled with -@ (overlay ready), so we cannot use overlays. We want
overlays, overlays are nice.
- A DTS goes into Linux first and is then sometimes imported into
U-Boot, but that depends on the SoC family; U-Boot doesn't batch import
every DTS like we do. So sometimes the U-Boot DTS are very old. Or, when
an interesting patch is committed upstream, it lands in Linux X+2 (roughly 4
months from now) and we then have to wait for U-Boot to catch up, which
means waiting between 4 and 6 months for an update.
- Some boards, like the Marvell ones, have 3 DTS: the one in the
vendor U-Boot made by Marvell themselves, the one in mainline U-Boot,
and the one in Linux. I found that the DTS in the Marvell U-Boot has
some problems with FreeBSD (especially on the macchiatobin, which declares
nodes with the same address but not the same size; that is not something
that the rman code can handle. It could be modified, but I don't know the
code well enough). Also, some compatible strings are used when they
shouldn't be; for example, they declare the gpio as orion-gpio while this
binding requires interrupt support, which the node doesn't have.
- The above situation is mostly the same with RockChip SoCs (possibly
others, those are the only SoCs I work on that have this problem).
Note that importing the DTS doesn't mean that every board will use
them. I don't intend to copy the DTB to the GENERIC memstick image for
the Overdrive 1000/3000, for example; the ones provided by the firmware
work fine.
RPI3 will remain an exception, as we use the DTB provided by the
rpi-firmware package, which comes from the RPi Foundation Linux fork.
use sizeof() or explicit #defines instead. No functional change.
This was suggested by jmg@.
MFC after: 1 month
X-MFC with: r338053
Sponsored by: Netflix, Inc.
This flag is set once the device has been successfully attached. When
set, it inhibits devmatch from trying to match the device. This in
turn allows kldunload to work as expected. Prior to the change, the
driver would immediately reload because devmatch had no notion that
the driver had once been attached, and therefore shouldn't participate
in further matching.
Differential Revision: https://reviews.freebsd.org/D16735
This adds it to devctl and libdevctl, defines the two IOCTLs, and
implements the kernel bits. A freeze causes any new drivers that are added via
kldload to be deferred until a 'thaw' comes in. These do not stack: it
is an error to freeze while frozen, or thaw while thawed.
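The non-stacking freeze/thaw semantics can be sketched as a small state machine. Everything here is hypothetical: the `sk_` names, the counters, and the choice of EBUSY as the error are assumptions made for illustration, not the actual devctl interface or kernel implementation.

```c
#include <errno.h>
#include <stdbool.h>

static bool sk_frozen;    /* hypothetical global freeze flag */
static int  sk_deferred;  /* drivers queued while frozen */
static int  sk_attached;  /* drivers that have been probed and attached */

/* Freezing twice is an error: the state does not stack. */
static int
sk_freeze(void)
{
	if (sk_frozen)
		return (EBUSY);  /* assumed error code */
	sk_frozen = true;
	return (0);
}

/* Thawing while already thawed is also an error. */
static int
sk_thaw(void)
{
	if (!sk_frozen)
		return (EBUSY);  /* assumed error code */
	sk_frozen = false;
	sk_attached += sk_deferred;  /* process everything that was deferred */
	sk_deferred = 0;
	return (0);
}

/* Called when a driver arrives, e.g. via kldload. */
static void
sk_driver_added(void)
{
	if (sk_frozen)
		sk_deferred++;
	else
		sk_attached++;
}
```

A boolean rather than a counter is what makes the calls non-stacking: a second freeze has nothing to increment, so it can only fail.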
Differential Revision: https://reviews.freebsd.org/D16735
No functional change.
When attempting to document the changed argument types in devstat.9, I
discovered the 20-year-old manual page severely mismatched reality even
prior to my simple change. So I took a first-cut pass at cleaning that up to
match reality. I'm sure I've missed some things; the goal was just to leave
it better than when I started.
Sponsored by: Dell EMC Isilon
Due to a hardware limitation, the AMD I2C controller can't trigger a pending
interrupt if the interrupt status has changed after the interrupt status bits
were cleared. The I2C controller then loses the interrupt and the I/O
times out. Implement a workaround that disables the I2C controller interrupt
and re-enables it before exiting the interrupt handler.
Submitted by: rajfbsd@gmail.com
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D16720
Add an option, KASSERT_PANIC_OPTIONAL, that allows runtime KASSERT()
behavior changes. When this option is not enabled, code that allows
KASSERTs to become optional is not enabled, and all violated assertions
cause termination.
The runtime KASSERT behavior was added in r243980.
One important distinction here is that panic has __dead2
("__attribute__((noreturn))"), while kassert_panic does not. Static analyzers
like Coverity understand __dead2. Without it, KASSERTs go misunderstood,
resulting in many false positives that result from violation of program
invariants.
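The effect of a noreturn annotation on assertion macros can be shown in a small userspace sketch. The names `SK_DEAD`, `sk_panic()`, `SK_ASSERT()`, and `sk_first()` are hypothetical stand-ins for __dead2, panic(), and KASSERT(), and the attribute spelling assumes GCC or Clang.

```c
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for __dead2: tells the compiler the function never returns. */
#define SK_DEAD __attribute__((__noreturn__))

static void sk_panic(const char *msg) SK_DEAD;

static void
sk_panic(const char *msg)
{
	fprintf(stderr, "panic: %s\n", msg);
	abort();
}

/* Simplified assertion macro: a violated invariant terminates the program. */
#define SK_ASSERT(exp, msg) do {	\
	if (!(exp))			\
		sk_panic(msg);		\
} while (0)

/*
 * Because sk_panic() is marked noreturn, an analyzer can conclude that
 * p is non-NULL at the dereference below.  Without the annotation it
 * would have to assume execution may continue with p == NULL and would
 * report a possible NULL dereference.
 */
static int
sk_first(const int *p)
{
	SK_ASSERT(p != NULL, "sk_first: NULL pointer");
	return (*p);
}
```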
Reviewed by: jhb, jtl, np, vangyzen
Relnotes: yes
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D16835
SCTP. They are based on what is specified in the Solaris DTrace manual
for Solaris 11.4.
Reviewed by: 0mp, dteske, markj
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D16839
The boot-time ifunc resolver assumes that it only needs to apply
IRELATIVE relocations to PLT entries. With an upcoming optimization,
this assumption no longer holds, so add the support required to handle
PC-relative relocations targeting GNU_IFUNC symbols.
- Provide a custom symbol lookup routine that can be used in early boot.
The default lookup routine uses kobj, which is not functional at that
point.
- Apply all existing relocations during boot rather than filtering
IRELATIVE relocations.
- Ensure that we continue to apply ifunc relocations in a second pass
when loading a kernel module.
Reviewed by: kib
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D16749
2^32 bps or greater to be used. Prior to this, bandwidth parameters
would simply wrap at the 2^32 boundary. The computations in the HFSC
scheduler and token bucket regulator have been modified to operate
correctly up to at least 100 Gbps. No other algorithms have been
examined or modified for correct operation above 2^32 bps (some may
have existing computation resolution or overflow issues at rates below
that threshold). pfctl(8) will now limit non-HFSC bandwidth
parameters to 2^32 - 1 before passing them to the kernel.
The extensions to the pf(4) ioctl interface have been made in a
backwards-compatible way by versioning affected data structures,
supporting all versions in the kernel, and implementing macros that
will cause existing code that consumes that interface to use version 0
without source modifications. If version 0 consumers of the interface
are used against a new kernel that has had bandwidth parameters of
2^32 or greater configured by updated tools, such bandwidth parameters
will be reported as 2^32 - 1 bps by those old consumers.
All in-tree consumers of the pf(4) interface have been updated. To
update out-of-tree consumers to the latest version of the interface,
define PFIOC_USE_LATEST ahead of any includes and use the code of
pfctl(8) as a guide for the ioctls of interest.
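The difference between wrapping at the 2^32 boundary and reporting 2^32 - 1 to version 0 consumers comes down to truncation versus saturation. The sketch below is only arithmetic; the function names are hypothetical and it is not the pf(4) code.

```c
#include <stdint.h>

/* Old behavior: a 64-bit rate silently wraps when stored in 32 bits. */
static uint32_t
sk_bw_wrapped(uint64_t bps)
{
	return ((uint32_t)bps);
}

/* Saturating conversion: rates of 2^32 bps or more report as 2^32 - 1. */
static uint32_t
sk_bw_saturated(uint64_t bps)
{
	return (bps > UINT32_MAX ? UINT32_MAX : (uint32_t)bps);
}
```

For example, a 10 Gbps rate (10000000000 bps) wraps to 1410065408 bps, roughly 1.4 Gbps, whereas the saturating conversion yields 4294967295 bps.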
PR: 211730
Reviewed by: jmallett, kp, loos
MFC after: 2 weeks
Relnotes: yes
Sponsored by: RG Nets
Differential Revision: https://reviews.freebsd.org/D16782
Upcoming Ethernet hardware will support new media types that aren't in the kernel
yet, so they are added here. These mostly include new 25G/50G/100G media types;
and this commit introduces new 200G/400G speeds and media.
Reviewed by: hselasky@, jhb@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D16731
The error handling got lost in r334810, while according to the report an
error may occur there when the dataset is over quota. In that case,
just leave the node in the unlinked list to be freed sometime later.
PR: 229887
Sponsored by: iXsystems, Inc.
r334810 introduced dispatching zfs_unlinked_drain() to a taskqueue on every
deletion of a file with extended attributes. Using system_taskq for that,
with its multiple threads, caused all available CPU threads to spin uselessly
on busy locks when multiple files were deleted, completely blocking
the system.
Using a single dedicated taskqueue is the only easy solution I've found,
though it would be great if we could specify that some task should be
executed only once at a time, never in parallel, while many tasks
could use different threads at the same time.
Sponsored by: iXsystems, Inc.
The original NVMe API used bit-fields to represent fields in data
structures defined by the specification (e.g. the op-code in the command
data structure). The implementation targeted x86_64 processors and
defined the bit fields for little endian dwords (i.e. 32 bits).
This approach does not work as-is for big endian architectures and was
changed to use a combination of bit shifts and masks to support PowerPC.
Unfortunately, this changed the NVMe API and forces #ifdef's based on
the OS revision level in user space code.
This change reverts to something that looks like the original API, but
it uses bytes instead of bit-fields inside the packed command structure.
As a bonus, this works as-is for both big and little endian CPU
architectures.
Bump __FreeBSD_version to 1200081 due to API change
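Why byte-sized members work on both byte orders can be seen in a small sketch. The layout below is a hypothetical simplification for illustration (the `sk_` names, the mask values, and the field placement are assumptions, not the actual nvme.h definitions), and the packed attribute assumes GCC or Clang.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical first dword of a command, built from bytes, not bit-fields. */
struct sk_cmd_dw0 {
	uint8_t  opc;   /* opcode: a whole byte at a fixed offset */
	uint8_t  fuse;  /* assumed: low 2 bits hold the fused-operation field */
	uint16_t cid;   /* multi-byte member: still needs byte-order conversion */
} __attribute__((__packed__));

static_assert(sizeof(struct sk_cmd_dw0) == 4, "dword 0 must be 4 bytes");

#define SK_FUSE_SHIFT 0
#define SK_FUSE_MASK  0x3u

/* Sub-byte fields are set with a mask inside one byte, on any CPU. */
static void
sk_set_fuse(struct sk_cmd_dw0 *c, uint8_t v)
{
	c->fuse = (uint8_t)((c->fuse & ~(SK_FUSE_MASK << SK_FUSE_SHIFT)) |
	    ((v & SK_FUSE_MASK) << SK_FUSE_SHIFT));
}

static uint8_t
sk_get_fuse(const struct sk_cmd_dw0 *c)
{
	return ((uint8_t)((c->fuse >> SK_FUSE_SHIFT) & SK_FUSE_MASK));
}
```

A single byte has no byte order, so `opc` and `fuse` land at offsets 0 and 1 regardless of the CPU, whereas bit-fields laid out inside a 32-bit dword depend on how the compiler and architecture order them.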
Reviewed by: imp, kbowling, smh, mav
Approved by: imp (mentor)
Differential Revision: https://reviews.freebsd.org/D16404