* register handler for ingress address appearing/disappearing;
* add new srcaddr hash table for fast softc lookup by srcaddr;
* when srcaddr disappears, clear IFF_DRV_RUNNING flag from interface,
and set it otherwise;
MFC after: 1 month
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D17214
* register handler for ingress address appearing/disappearing;
* add new srcaddr hash table for fast softc lookup by srcaddr;
* when srcaddr disappears, clear IFF_DRV_RUNNING flag from interface,
and set it otherwise;
* remove the note about ingress address from BUGS section.
MFC after: 1 month
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D17134
appearing and disappearing on the host system.
Such handling is need, because tunneling interfaces must use addresses,
that are configured on the host as ingress addresses for tunnels.
Otherwise the system can send spoofed packets with source address, that
belongs to foreign host.
The KPI uses ifaddr_event_ext event to implement addresses tracking.
Tunneling interfaces register event handlers and then they are
notified by the kernel, when an address disappears or appears.
ifaddr_event_compat() handler from if.c replaced by srcaddr_change_event()
in the ip_encap.c
MFC after: 1 month
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D17134
vlan_lladdr_fn() is called from taskqueue, which means there's no vnet context
set. We can end up trying to send ARP messages (through the iflladdr_event
event), which requires a vnet context.
PR: 227654
MFC after: 3 days
This allows use differen values configured by user for sysctl variable
net.inet.ip.fw.dyn_rst_lifetime.
Obtained from: Yandex LLC
MFC after: 3 weeks
Sponsored by: Yandex LLC
to switch the output method in run-time. Also document some sysctl
variables that can by changed for NAT64 module.
NAT64 had compile time option IPFIREWALL_NAT64_DIRECT_OUTPUT to use
if_output directly from nat64 module. By default is used netisr based
output method. Now both methods can be used, but they require different
handling by rules.
Obtained from: Yandex LLC
MFC after: 3 weeks
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D16647
This allows the memory mapped I/O virtio driver to attach when we boot
with ACPI tables, for example in some cases with QEMU emulating arm64.
MFC after: 1 month
that was added using "new rule format". And then, when the kernel
returns rule with this flag, ipfw(8) can correctly show it.
Reported by: lev
MFC after: 3 weeks
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D17373
supported all the "old" chips it did, so we should have killed it in
4, but 12 will do. It's a bit outside of the normal deprecation
process, but given the extreme age, it's obsolete status for 8 major
releases and the fact that I couldn't find any users who posted dmesgs
with ncr0: in them after 2000 or 3.4. It may be too late for 12 (this
change will be merged, but maybe not the next one to remove it), but
it will be removed in 13 with the first round of other drivers tagged
to be gone in 12.
MFC after: 3 days
handler receives the type of event IFADDR_EVENT_ADD/IFADDR_EVENT_DEL,
and the pointer to ifaddr. Also ifaddr_event now is implemented using
ifaddr_event_ext handler.
MFC after: 3 weeks
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D17100
SADB_ACQUIRE requests are send by kernel, when security policy doesn't
have corresponding security association for outbound packet. IKE daemon
usually registers its handler for such messages and when the kernel asks
for SA it can handle this request. Now such requests will contain
additional fields that can help IKE daemon to create SA. And IKE now
can create SAs using only information from SADB_ACQUIRE request, this
is useful when many if_ipsec(4) interfaces are in use and IKE doesn track
security policies that was installed by kernel.
Obtained from: Yandex LLC
MFC after: 3 weeks
Sponsored by: Yandex LLC
This driver was already 99% identical to the ofw_pcib_pci driver, except for
the attachment. Since ofw_pcib_pci is already a subclass of pcib, this
creates a private declaration of that class, to use for the base class for
this driver.
At some point in the future, ofw_pcib_pci_driver should probably be exported
to a header, so we're not tracking the softc struct contents, but for now,
since there's only this one other driver, it's not a pressing issue.
r279252 inverted the logic in moea64_scan_init, such that instead of
terminating when reaching a dead page, it terminates when reaching a live
page, ostensibly preserving exactly one page of KVA.
In the FDT based probe, check for "arm,armv8-timer" before "arm,armv7-timer".
This gets the description right when the timer node has both entries in
compatible list.
There seems to be a race in CI, such that dtrace_asm.S might be assembled
before the genassym is completed. This causes a build failure when PSL_EE
doesn't exist, and is read as 0. Get around this by explicitly specifying
the bits in the mask instead.
The Signal Processing Engine (SPE) found in Freescale e500 cores (and
others) offloads IEEE-754 compliance (NaN, Inf handling, overflow,
underflow) to software, most likely as a means of simplifying the APU
silicon. Some software, like AbiWord, needs full IEEE-754 compliance,
including NaN handling. Implement the necessary bits to enable it.
Differential Revision: https://reviews.freebsd.org/D17446
The knob allows to select the flushing mode or turn it off/on. The
idea, as well as the list of the ignored syscall errors, were taken
from https://www.openwall.com/lists/kernel-hardening/2018/10/11/10 .
I was not able to measure statistically significant difference between
flush enabled vs disabled using syscall_timing getuid.
Reviewed by: bwidawsk
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D17536
libc/gmon's mcount was ANSIfied in r124180, with libkern following over
a decade later, in r325988, but some minor discrepancies remained.
Update libc/gmon's mexitcount to an ANSI C function definition, and use
(void) for libkern-only functions that take no arguments.
Reported by: bde
Prior to this revision, we allocated sufficient context space for 'level'
but never actually set the compress level parameter, so we would always get
the default '3'.
Reviewed by: markj, vangyzen
MFC after: 12 hours
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D17144
If multiple threads enter fortuna_pre_read contemporaneously, such as via
read(2) or getrandom(2), they could race to check how long it has been since
the last update due to a TOCTOU problem with 'now'.
Here is an example problematic execution:
Thread A: Thread B:
now_A = getsbinuptime();
now_B = getsbinuptime(); // now_B > now_A
RANDOM_RESEED_LOCK();
if (now - fs_lasttime > SBT_1S/10) {
fs_lasttime = now;
... // reseed
}
RANDOM_RESEED_UNLOCK();
RANDOM_RESEED_LOCK();
if (now_A - fs_lasttime > SBT_1S/10) // now_A - fs_lasttime underflows
fs_lasttime = now_A;
... // reseed again, despite less than 100ms elapsing
}
RANDOM_RESEED_UNLOCK();
To resolve the race, simply check the current time after we win the lock
race.
If getsbinuptime is perceived to be expensive, another option might be to
just accept the race and validate that fs_lasttime isn't "in the future."
(It should be within the last ~2^31 seconds out of ~2^32 seconds
representable duration.)
Reviewed by: delphij, markm
Approved by: secteam (delphij)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D16984
The convention for updating hc_destination[] is to index with a
random_entropy_source. Zero happens to match RANDOM_CACHED, which is
correct for this source (early random data). Spell the zero value as the
enum name instead of the magic constant.
No functional change.
Reviewed by: delphij, markm
Approved by: secteam (delphij)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D16983
When modifying an existing managed mapping, we should find a PV entry
for the old mapping. Verify this.
Before r335784 this would have been implicitly tested by the fact that
we always freed the PV entry for the old mapping.
Reviewed by: alc, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D17626
In various places, random represents the set of sources as a 32-bit word
bitmask. It assumes all sources fit within this, i.e., the maximum valid
source number is 31.
There was a comment specifying this limitation, but we can actually refuse
to compile if our assumption is violated instead. We still have a few spare
random source slots, but sooner or later someone may need to convert the
masks used from raw 32-bit words to bitset(9) APIs.
This prevents some kinds of developer foot-shooting when adding new random
sources. No functional change.
Reviewed by: delphij, markm
Approved by: secteam (delphij)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D16982
Currently, the 'thread' command (to switch the debugger to another thread)
only accepts decimal-encoded tids. Use the same parsing logic as 'show
thread <arg>' to accept hex-encoded thread pointers in addition to
decimal-encoded tids.
Document the 'thread' command in ddb.4 and expand the 'show thread'
documentation to cover the tid usage.
Reported by: bwidawsk
Reviewed by: bwidawsk (earlier version), kib (earlier version), markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D16962
Remove unnecessary use of function-local static variable. 32 bytes is
small enough to live on the stack.
Reviewed by: delphij, markm
Approved by: secteam (delphij)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D16937
FS&K GenerateBlocks function asserts C (counter) != 0. This should also
be true in our implementation.
Reviewed by: delphij, markm
Approved by: secteam (delphij)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D16936
When reseeding, only incorporate actual key material. Do not include e.g.
the derived key schedules or other AES context.
I don't think the extra material was harmful here, just not beneficial.
Reviewed by: delphij, markm
Approved by: secteam (delphij)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D16934
Like the companion API devvn_refthread, leave *ref uninitialized when a
reference was not acquired. Initializing to 1 provides a vaguely
correct-looking but bogus value for broken callers to (mistakenly) pass to
dev_relthread() when refthread fails.
Make it even more clear to consumers that dev_relthread is only valid when
dev_refthread succeeds.
Reviewed by: kib, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D16885
We no longer build the drm/drm2 modules by default. See UPDATING for
which package to install instead. drm and drm2 have been completely
unsupported abandonware for a long time now. Please report issues with
the pkg modules to x11@freebsd.org.
Approved by: FreeBSD Graphics Team
Update messaging for which drm module to install. Add guidance on what
hardware is supported (which should be copied into the release
notes). Note: the in tree drivers are abandonware. There has been no
organized support for them for many years, and the plan is to still
remove them for all but arm once the transition to drm-*kmod is
complete. Also note that WITHOUT_MODULE_DRM and WITHOUT_MODULE_DRM2
should generally be added to src.conf for anybody using the drm-*kmod
ports. That will become default in 13 soon, however.
Approved by: FreeBSD Graphics Team
Relnotes: Yes
MFC After: 3 days
Differential Revision: https://reviews.freebsd.org/D17451
It is often useful for developers and administrators to determine a running
thread's stack for debugging purposes. With this feature, using ^T will
print that information
For now, the feature is disabled by default. Enable with sysctl
kern.tty_info_kstacks=1.
Discussed with: markj
Reviewed by: oshogbo
Relnotes: yes
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D17621
the 3WHS is completed, establish the backend connection. The trigger
for "3WHS completed" is the reception of the first ACK. However, we
should not proceed if that ACK also has RST or FIN set.
PR: 197484
Obtained from: OpenBSD
MFC after: 2 weeks
Some best-effort consumers may find trylock behavior for stack(9) symbol
resolution acceptable. Expose that behavior to such consumers.
This API is ugly. If in the future the modules and linker file list locking
is cleaned up such that the linker_files list can be iterated safely without
acquiring a sleepable lock, this API should be removed. However, most of
the time nothing will be holding the linker files lock exclusive and the
acquisition can proceed.
Reviewed by: markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D17620
Pre-defined policies are useful when integrating the domainset(9)
policy machinery into various kernel memory allocators.
The refactoring will make it easier to add NUMA support for other
architectures.
No functional change intended.
Reviewed by: alc, gallatin, jeff, kib
Tested by: pho (part of a larger patch)
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17416
This makes the compiler less likely to reload the content from %gs.
The 'P' modifier drops all synteax prefixes and 'n' constraint treats
input as a known at compilation time immediate integer.
Example reloading victim was spinlock_enter.
Stolen from: OpenBSD
Reported by: jtl
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17615
Apparently AMD machines cannot tolerate this. This was uncovered by
r339386, where cache flush started really flushing the requested range.
Introduce pmap_mapdev_pciecfg(), which simply does not flush cache
comparing with pmap_mapdev(). It assumes that the MCFG region was
never accessed through the cacheable mapping, which is most likely
true for machine to boot at all.
Note that i386 does not need the change, since the architecture
handles access per-page due to the KVA shortage, and page remapping
already does not flush the cache.
Reported and tested by: mjg, Mike Tancsa <mike@sentex.net>
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D17612
returns the section start and stop locations as well as a count if the
caller asks for them.
There was only one out-of-file consumer of count which did not actually
use it and hence was eliminated in r339407.
In r194784 parse_dpcpu(), and in r195699 parse_vnet() (a copy of the
former) started to use the link_elf_lookup_set() interface internally
also asking for the count.
count is computed as the difference of the void **stop - void **start
locations and as such, if the absoulte numbers
(stop - start) % sizeof(void *) != 0
a round-down happens, e.g., **stop 0x1003 - **start 0x1000 => count 0.
To get the section size instead of "count is the number of pointer
elements in the section", the parse_*() functions do a
count *= sizeof(void *).
They use the result to allocate memory and copy the section data
into the "master" and per-instance memory regions with a size of
count.
As a result of count possibly round-down this can miss the last
bytes of the section. The good news is that we do not touch
out of bounds memory during these operations (we may at a later stage
if the last bytes would overflow the master sections).
Given relocation in elf_relocaddr() works based on the absolute
numbers of start and stop, this means that we can possibly try to
access relocated data which was never copied and hence we get
random garbage or at best zeroed memory.
Stop the two (last) consumers of count (the parse_*() functions)
from using count as well, and calculate the section size based on
the absolute numbers of stop and start and use the proper size for
the memory allocation and data copies. This will make the symbols
in the last bytes of the pcpu or vnet sections be presented as
expected.
PR: 232289
Approved by: re (gjb)
MFC after: 2 weeks
code paths. Both are not consistent and the one on the syn cache code
does not conform to the relevant specifications (Page 69 of RFC 793
and Section 4.2 of RFC 5961).
This patch fixes this:
* The sequence numbers checks are fixed as specified on
page Page 69 RFC 793.
* The sysctl variable net.inet.tcp.insecure_rst is now honoured
and the behaviour as specified in Section 4.2 of RFC 5961.
Approved by: re (gjb@)
Reviewed by: bz@, glebius@, rrs@,
Differential Revision: https://reviews.freebsd.org/D17595
Sponsored by: Netflix, Inc.
(e.g. RocketChip, lowRISC and derivatives).
RISC-V page table entries support A (accessed) and D (dirty) bits. The
spec makes hardware support for these bits optional. Implementations that
do not manage these bits in hardware raise page faults for accesses to a
valid page without A set and writes to a writable page without D set.
Check for these types of faults when handling a page fault and fixup the
PTE without calling vm_fault if they occur.
Reviewed by: jhb, markj
Approved by: re (gjb)
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D17424
(e.g. RocketChip, lowRISC and derivatives).
RISC-V page table entries support A (accessed) and D (dirty) bits. The
spec makes hardware support for these bits optional. Implementations that
do not manage these bits in hardware raise page faults for accesses to a
valid page without A set and writes to a writable page without D set.
Check for these types of faults when handling a page fault and fixup the
PTE without calling vm_fault if they occur.
Reviewed by: jhb, markj
Approved by: re (gjb)
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D17424
was really a "socket close" callback.
Update the socket destructor functionality to run when a socket is
destroyed (rather than when it is closed). The original submitter has
confirmed that this change satisfies the intended use case.
Suggested by: rwatson
Submitted by: Michio Honda <micchie at sfc.wide.ad.jp>
Tested by: Michio Honda <micchie at sfc.wide.ad.jp>
Approved by: re (kib)
Differential Revision: https://reviews.freebsd.org/D17590
can see the dmesg buffer (this is the current behavior). When false (the
new default), dmesg will be unavailable to jailed users, whether root or
not.
The security.bsd.unprivileged_read_msgbuf sysctl still works as before,
controlling system-wide whether non-root users can see the buffer.
PR: 211580
Submitted by: bz
Approved by: re@ (kib@)
MFC after: 3 days
Driver enumerates NVDIMMs. Besides, for each found System Physical
Address (SPA) range, spaN geom provider is created, which allows
formatting and mounting the region as the normal volume. Also,
/dev/nvdimm_spaN node is created, which can be read/written/mapped by
userspace, the mapping is zero-copy.
No support for block access methods implemented, labels are not
parsed. No management interfaces are provided.
Tested by: Intel, NetApp
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
MFC after: 2 weeks
Unconditionally reparenting to PID 1 breaks the procctl(2) reaper
functionality.
Add a regression test for this case.
Reviewed by: kib
Approved by: re (gjb)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17589
translator, when using the DWC OTG USB controller driver. Make sure to re-try
getting the complete split packets until a DATA0 packet is received. Larger
isochronous frames may be split into multiple MDATA packets terminated
by a single DATA0 packet.
PR: 230434
MFC after: 3 days
Approved by: re (gjb)
Sponsored by: Mellanox Technologies
The KPI allows to map very large contigous physical memory regions
into KVA, which are not covered by DMAP.
I see both with QEMU and with some real hardware started shipping, the
regions for NVDIMMs might be very far apart from the normal RAM, and
we expect that at least initial users of NVDIMM could install very
large amount of such memory. IMO it is not reasonable to extend DMAP
to cover that far-away regions both because it could overflow existing
4T window for DMAP in KVA, and because it costs in page table pages
allocations, for gap and for possibly unused NV RAM.
Also, KPI provides some special functionality for fast cache flushing
based on the knowledge of the NVRAM mapping use.
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D17070
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
MFC after: 3 days
Differential revision: https://reviews.freebsd.org/D17070
This was missed in r339367 ("Various fixes for TLB management on RISC-V.").
This fixes operation on lowRISC.
Reviewed by: jhb
Approved by: re (gjb)
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D17583
to this change, the code sometimes used a temporary stack variable to hold
details of a TCP segment. r338102 stopped using the variable to hold
segments, but did not actually remove the variable.
Because the variable is no longer used, we can safely remove it.
Approved by: re (gjb)
cycle.
This is expected to be the final ALPHA build of this release
cycle, prior to branching stable/12.
Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation
This fixes two problems, one where epoch calls could occur before all
the readers had exited the epoch section, and one where the epoch calls
could be unnecessarily delayed.
Approved by: re (glebius)
Device removal code uses zio_vdev_child_io() with ZIO_TYPE_NULL parent,
that never happened before. It confused FreeBSD-specific TRIM code,
which does not use VDEV_IO_DONE for logical ZIO_TYPE_FREE ZIOs. As
result of that stage being skipped device removal ZIOs leaked references
and memory that supposed to be freed by VDEV_IO_DONE, making it stuck.
It is a quick patch rather then a nice fix, but hopefully we'll be able
to drop it all together when alternative TRIM implementation finally get
landed.
PR: 228750, 229007
Discussed with: allanjude, avg, smh
Approved by: re (delphij)
MFC after: 5 days
Sponsored by: iXsystems, Inc.
Both ^/sys/compat/freebsd32/syscalls.master and ^/sys/kern/syscalls.master
cited "COMPAT[n] #ifdef" instead of "COMPAT_FREEBSD[n] #ifdef" in places.
Approved by: re (glebius)
- Remove the arm64-specific cpu_*cache* and cpu_tlb_flush* functions.
Instead, add RISC-V specific inline functions in cpufunc.h for the
fence.i and sfence.vma instructions.
- Catch up to changes in the arm64 pmap and remove all the cpu_dcache_*
calls, pmap_is_current, pmap_l3_valid_cacheable, and PTE_NEXT bits from
pmap.
- Remove references to the unimplemented riscv_setttb().
- Remove unused cpu_nullop.
- Add a link to the SBI doc to sbi.h.
- Add support for a 4th argument in SBI calls. It's not documented but
it seems implied for the asid argument to SBI_REMOVE_SFENCE_VMA_ASID.
- Pass the arguments from sbi_remote_sfence*() to the SEE. BBL ignores
them so this is just cosmetic.
- Flush icaches on other CPUs when they resume from kdb in case the
debugger wrote any breakpoints while the CPUs were paused in the IPI_STOP
handler.
- Add SMP vs UP versions of pmap_invalidate_* similar to amd64. The
UP versions just use simple fences. The SMP versions use the
sbi_remove_sfence*() functions to perform TLB shootdowns. Since we
don't have a valid pm_active field in the riscv pmap, just IPI all
CPUs for all invalidations for now.
- Remove an extraneous TLB flush from the end of pmap_bootstrap().
- Don't do a TLB flush when writing new mappings in pmap_enter(), only if
modifying an existing mapping. Note that for COW faults a TLB flush is
only performed after explicitly clearing the old mapping as is done in
other pmaps.
- Sync the i-cache on all harts before updating the PTE for executable
mappings in pmap_enter and pmap_enter_quick. Previously the i-cache was
only sync'd after updating the PTE in pmap_enter.
- Use sbi_remote_fence() instead of smp_rendezvous in pmap_sync_icache().
Reviewed by: markj
Approved by: re (gjb, kib)
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D17414
cpu_switch() always reloads the LDT, so this can only affect the
hypervisor process itself. Fix this by explicitly reloading the host
LDT selector after each #VMEXIT. The stock bhyve process on FreeBSD
never uses a custom LDT, so this change is cosmetic.
Reviewed by: kib
Tested by: Mike Tancsa <mike@sentex.net>
Approved by: re (gjb)
MFC after: 2 weeks
Rename functions and variables from ixlv to iavf to match the
user-facing name change. There shouldn't be any functional changes
with this change, but this may help with browsing the source code
and reducing diffs in the future.
Submitted by: kbowling@
Reviewed by: erj@, sbruno@
Approved by: re (gjb@)
Differential Revision: https://reviews.freebsd.org/D17544
At early boot, PCPU_GET(), that obtains a pointer from SPRG0, was being
used with SPRG0 not yet initialized. If it pointed to an invalid
address, the machine would hang.
Approved by: re(gjb), jhibbits(mentor)
not be used as condition for ternary operator.
Submitted by: Tatsuki Makino <tatsuki_makino at hotmail dot com>
Approved by: re (kib)
MFC after: 1 week
r338927("zfs: depessimize zfs_root with rmlocks") failed to error check
the mount before caching root vnode.
Results in crashes in rrw_enter_read_impl tracing back to zfs_mount.
Reported by: Mike Tancsa
Tested by: allanjude
Approved by: re (kib)
- Fix assert/panic on receive when Jumbo Frames are enabled.
From the commit I made to ixl:
"It turns out that *_isc_rxd_available is supposed to return how many
packets are available to be cleaned on the rx ring. This patch removes
a section of code where if the budget argument is 1, the function would return
one if there was a descriptor available, not necessarily a packet.
This is okay in regular mtu 1500 traffic since the max frame size is less
than the configured receive buffer size (2048), but this doesn't work when
received packets can span more than one descriptor, as is the case when the
mtu is 9000 and the receive buffer size is 4096."
- Fix possible Tx hang because *_isc_txd_credits_update returns incorrect result
From the commit by Krzysztof Galazka to ixl: "Function isc_txd_update_credits
called with clear set to false should return 1 if there are TX descriptors
already handled by HW. It was always returning 0 causing troubles with UDP TX
traffic."
PR: 231659
Reported by: lev@
Approved by: re (gjb@)
Sponsored by: Intel Corporation
Vast majority of syscalls take 6 or less arguments. Move handling of other
cases to a fallback function. Similarly, special casing for _syscall
and __syscall
magic syscalls is moved away.
Return is almost always 0. The change replaces 3 branches with 1 in the common
case. Also the 'frame' variable convinces clang not to reload it on each access.
Reviewed by: kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17542
Reading caps is in the hot path (on each successful fd lookup), but
completely unnecessarily requires a function call.
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
an inp marked FREED after the epoch(9) changes.
Check once we hold the lock and skip the inp if it is the case.
Contrary to IPv6 the locking of the inp is outside the multicast
section and hence a single check seems to suffice.
PR: 232192
Reviewed by: mmacy, markj
Approved by: re (kib)
Differential Revision: https://reviews.freebsd.org/D17540
Finishes the conversion of the 40Gb Intel Ethernet drivers to iflib(9) for
FreeBSD 12.0, and fixes numerous bugs in both ixl(4) and iavf(4).
This commit also re-adds the VF driver to GENERIC since it now compiles and
functions.
The VF driver name was changed from ixlv(4) to iavf(4) because the VF driver is
now intended to be used with future products, not just with Fortville/Fort Park
VFs.
A man page update that documents these drivers is forthcoming in a separate
commit.
Reviewed by: sbruno@, kbowling@
Tested by: jeffrey.e.pieper@intel.com
Approved by: re (gjb@)
Relnotes: yes
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D16429
See r339205 for justification.
Reviewed by: kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17526
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
This is a part of ZoL port of device removal patch:
commit a1d477c24c
Author: Matthew Ahrens <mahrens@delphix.com>
Ported-by: Tim Chase <tim@chase2k.com>
Approved by: re (kib)
MFC after: 1 week
The function tweaks CPU capabilities based on the VM platform and
tunables, which affected selection of the cache flush method before
ifuncs were used, and should affect the cache flush in the same way
after ifunc.
PR: 232081
Reported by: phk
Analyzed by: avg
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
Without this hardware raises an interrupt regardless of any
pending bits set.
This fixes operation on RocketChip and derivatives (e.g. lowRISC).
Approved by: re (kib)
Sponsored by: DARPA, AFRL
Apparently CLFLUSH on mmio can cause VM exit, as reported in the PR.
I do not see that anything useful can be done except emulating page
faults on invalid addresses.
Due to the instruction encoding pecularity, also emulate SFENCE.
PR: 232081
Reported by: phk
Reviewed by: araujo, avg, jhb (all: previous version)
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D17482
The only source of documentation for this device is verilog,
so driver is minimalistic.
Reviewed by: Dr Jonathan Kimmitt <jrrk2@cam.ac.uk>
Approved by: re (kib)
Sponsored by: DARPA, AFRL
Upstream code expects only ZIO_TYPE_READ and some ZIO_TYPE_WRITE
requests to removed (indirect) vdevs, while on FreeBSD there is also
ZIO_TYPE_FREE (TRIM). ZIO_TYPE_FREE requests do not have the data
buffers, so don't need the pointer adjustment.
PR: 228750, 229007
Reviewed by: allanjude, sef
Approved by: re (kib)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D17523
but leaving the variable assignment outside the block, where it is no longer
used. Move both the variable and the assignment one block further in.
This should result in no functional changes. It will however make upcoming
changes slightly easier to apply.
Reviewed by: markj, jtl, tuexen
Approved by: re (kib)
Differential Revision: https://reviews.freebsd.org/D17525
The reasoning is the same as with the memset change, see r339205
Reviewed by: kib (previous version)
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17441
The change is a no-op for architectures which don't ifunc memset,
memcpy nor memmove.
Convert places which need them. Xen bits by royger.
Reviewed by: kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17487
The VT-x VMCS only stores the base address of the GDTR and IDTR. As a
result, VM exits use a fixed limit of 0xffff for the host GDTR and
IDTR losing the smaller limits set in when the initial GDT is loaded
on each CPU during boot. Explicitly save and restore the full GDTR
and IDTR contents around VM entries and exits to restore the correct
limit.
Similarly, explicitly save and restore the LDT selector. VM exits
always clear the host LDTR as if the LDT was loaded with a NULL
selector and a userspace hypervisor is probably using a NULL selector
anyway, but save and restore the LDT explicitly just to be safe.
PR: 230773
Reported by: John Levon <levon@movementarian.org>
Reviewed by: kib
Tested by: araujo
Approved by: re (rgrimes)
MFC after: 1 week
resilver (r334844)
MFV/ZoL: Fix deadlock in IO pipeline
commit a76f3d0437
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date: Fri Mar 16 16:46:06 2018 -0700
Fix deadlock in IO pipeline
In vdev_queue_aggregate() the zio_execute() bypass should not be
called under the vdev queue lock. This can result in a deadlock
as shown in the stack traces below.
Drop the vdev queue lock then walk the parents of the aggregate IO
to determine the list of component IOs to be bypassed. This can
be done safely without holding the io_lock since the new aggregate
IO has not yet been returned and its parents cannot change.
--- THREAD 1 ---
arc_read()
zio_nowait()
zio_vdev_io_start()
vdev_queue_io() <--- mutex_enter(vq->vq_lock)
vdev_queue_io_to_issue()
vdev_queue_aggregate()
zio_execute()
vdev_queue_io_to_issue()
vdev_queue_aggregate()
zio_execute()
zio_vdev_io_assess()
zio_wait_for_children() <- mutex_enter(zio->io_lock)
--- THREAD 2 --- (inverse order)
arc_read()
zio_change_priority() <- mutex_enter(zio->zio_lock)
vdev_queue_change_io_priority() <- mutex_enter(vq->vq_lock)
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reported by: ZFS Leadership Meeting
Reviewed by: mav
Approved by: re (kib)
Obtained from: ZFS-on-Linux
MFC after: 2 weeks
Sponsored by: Klara Systems
Differential Revision: https://reviews.freebsd.org/D17495
This connects new tunables that were added but not exposed in:
r329502 (zpool remove)
r337007 (zpool initialize)
Reviewed by: avg
Approved by: re (kib)
MFC after: 2 weeks
Sponsored by: Klara Systems
Differential Revision: https://reviews.freebsd.org/D17494
This is caused by a deadlock between zil_commit() and zfs_zget()
Add a way for zfs_zget() to break out of the retry loop in the common case
PR: 229614
Reported by: grembo, Andreas Sommer, many others
Tested by: Andreas Sommer, Vicki Pfau
Reviewed by: avg (no objection)
Approved by: re (gjb)
MFC after: 2 months
Sponsored by: Klara Systems
Differential Revision: https://reviews.freebsd.org/D17460
spa_condense_indirect_thread() is no longer a thread function, but just
a callback for new zthr KPI.
Submitted by: allanjude
Approved by: re (gjb)
MFC after: 3 days
Recent changes in Linux updated Marvell Armada 38x
UART compatible string. As a result the FreeBSD driver
(uart_dev_snps) does not probe. This commit fixes the
situation, however not applying any functional modification
to the driver methods.
Approved by: re (kib)
Obtained from: Semihalf
- Update OpenSSL to version 1.1.1.
- Update Kerberos/Heimdal API for OpenSSL 1.1.1 compatibility.
- Bump __FreeBSD_version.
Approved by: re (kib)
Sponsored by: The FreeBSD Foundation
epoch section without exiting that epoch section. This is bad for two
reasons: the epoch section won't exit, and we will leave the epoch tracker
from the stack on the epoch list.
Fix the epoch leak by making sure we exit epoch sections before returning.
Reviewed by: ae, gallatin, mmacy
Approved by: re (gjb, kib)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D17450
The pNFS server would report the total disk space used and free for all
of the DSs, even when certain DSs are assigned to the file system via
the "#<path>" suffix used in the "nfsd -p" option argument.
This patch fixes this case. It only reports usage for the file system
that the argument vnode resides on. This is consistent with the non-pNFS
NFSv4 server. In NFSv4 it is possible to have subtrees on other file
systems, but these are not included in the usage information for NFSv4.
Approved by: re (gjb)
r339213 was cherry-picked back to head from the project branch, which
caused a conflict. This commit properly records the mergeinfo from
head.
r339205 was missed, and r339214 is required for reintegration.
Sponsored by: The FreeBSD Foundation
When acting as a VF it is required to add steering rules for all unicast
addresses. Even if promiscious mode is selected. Else incoming data packets
will be dropped.
MFC after: 3 days
Approved by: re (gjb)
Sponsored by: Mellanox Technologies
These messages are totally redundant with the iflib messages.
They're also not very useful, since they don't include the
interface name.
Discussed with: shurd
Approved by: re (rgrimes)
Sponsored by: Dell EMC Isilon
configuring kernels for i386, amd64, and arm64.
The 'GEOM_PART_GPT' option was added to the DEFAULTS configuration
in r337967.
Approved by: re (kib@)
Reviewed by: ler@
Differential Revision: https://reviews.freebsd.org/D17458
Sponsored by: Netflix, Inc.
locally generated SCTP packets sent over IPv4. This make
the behaviour consistent with IPv6.
Reviewed by: ae@, bz@, jtl@
Approved by: re (kib@)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D17406
Discussing with Benjamin Herrenschmidt, OPAL_INT_GET_XIRR masks the
returned priority, so must be resumed before more interrupts can be
handled at this priority. Since there are only two priorities used in
FreeBSD, we know that the previous priority in an EOI will always be
0xff (lowest priority).
Reviewed by: nwhitehorn
Approved by: re(rgrimes)
Differential Revision: https://reviews.freebsd.org/D17361
Summary:
Discussing with Benjamin Herrenschmidt, MSIs, and edge-triggered
interrupts in general, must not be masked in XICS and XIVE, else
subsequent interrupts may be ignored.
Testing locally on my Talos II (single CPU, 18-core POWER9), NVMe now
works with MSI, improving read throughput by ~70% (900MB/s -> 1.67GB/s,
with 64MB block size) over INTx interrupts, and snd_hda(4) now will
actually play music with MSI. Previously, snd_hda(4) would not receive
interrupts, timing out, and declaring the channels dead.
This has also been tested by Kevin Bowling, and others, with great
success. Kevin reported NVMe unusable on his Talos II prior to this
patch.
Reviewed by: nwhitehorn, kbowling
Approved by: re(rgrimes)
Differential Revision: https://reviews.freebsd.org/D17356
It's not supposed to be legal for two jails to contain the same IP address,
unless both jails contain only that one address. This is the behavior
documented in jail(8), and is there to prevent confusion when multiple
jails are listening on IADDR_ANY.
VIMAGE jails (now the default for GENERIC kernels) test this correctly,
but non-VIMAGE jails have been performing an incomplete test when nested
jails are used.
Approved by: re@ (kib@)
MFC after: 5 days
When using a vlan with igb and the vlanhwcsum option, any mbufs which
already had the TCP, UDP, or SCTP checksum calculated and therefore don't
have the CSUM_[IP|IP6]_[TCP|UDP|SCTP] bits set in the csum_flags field would
have the L4 checksum corrupted by the hardware.
This was caused by the driver setting E1000_TXD_POPTS_TXSM any time a
checksum bit was set OR a vlan tag was present.
The patched driver only sets E1000_TXD_POPTS_TXSM when an offload is
requested.
PR: 231416
Reported by: pi
Approved by: re (gjb)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D17404
rep stos has a high startup time even on modern microarchitectures like
Skylake. Intel optimization manuals discuss how for small sizes it is
beneficial to go for streaming stores. Since those cannot be used without
extra penalty in the kernel I investigated performance impact of just
regular movs.
The patch below implements a very simple scheme: a 32-byte loop followed
by filling in the remainder of at most 31 bytes. It has a 256 breaking
point on which it falls back to rep stos. It provides a significant win
over the current primitive on several machines I tested (both Intel and
AMD). A 64-byte loop did not provide any benefit even for multiple of 64
sizes.
See the review for benchmark data.
Reviewed by: kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17398
When getting the number of bytes to checksum make sure to convert the UDP
length to host byte order when the entire header is not in the first mbuf.
Reviewed by: jtl, tuexen, ae
Approved by: re (gjb), jtl (mentor)
Differential Revision: https://reviews.freebsd.org/D17357
Refactor sample ring buffer ring handling to make it more robust to
long running callchain collection handling
r338112 introduced a (now fixed) regression that exposed a number of race
conditions within the management of the sample buffers. This
simplifies the handling and moves the decision to overwrite a
callchain sample that has taken too long out of the NMI in to the
hardlock handler. With this change the problem no longer shows up as a
ring corruption but as the code spending all of its time in callchain
collection.
- Makes the producer / consumer index incrementing monotonic, making it
easier (for me at least) to reason about.
- Moves the decision to overwrite a sample from NMI context to interrupt
context where we can enforce serialization.
- Puts a time limit on waiting to collect a user callchain - putting a
bound on head-of-line blocking causing samples to be dropped
- Removes the flush routine which was previously needed to purge
dangling references to the pmc from the sample buffers but now is only
a source of a race condition on unload.
Previously one could lock up or crash HEAD by running:
pmcstat -S inst_retired.any_p -T and then hitting ^C
After this change it is no longer possible.
PR: 231793
Reviewed by: markj@
Approved by: re (gjb@)
Differential Revision: https://reviews.freebsd.org/D17011
Change swap_reserve and swap_total to be in units of pages so that
swap reservations can be done using only atomics instead of using a single
global mutex for swap_reserve and a single mutex for all processes running
under the same uid for uid accounting.
Results in mmap speed up and a 70% increase in brk calls / second.
Reviewed by: alc@, markj@, kib@
Approved by: re (delphij@)
Differential Revision: https://reviews.freebsd.org/D16273
With the new route cache feature udp_notify() will modify the inp when it
needs to invalidate the route cache. Ensure that we hold a write lock on
the inp before calling the function to ensure that multiple threads don't
race while trying to invalidate the cache (which previously lead to a page
fault).
Differential Revision: https://reviews.freebsd.org/D17246
Reviewed by: sbruno, bz, karels
Sponsored by: Dell EMC Isilon
Approved by: re (gjb)
This change is a no-op in terms of semantics, but has a side effect
of removing a perfectly useless nop sled for CPUs with ERMS.
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
ioctl(2) commands only have meaning in the context of a file descriptor
so translating them in the syscall layer is incorrect.
The new handler users an accessor to retrieve/construct a pointer from
the last member of the passed structure and relies on type punning to
access the other member which requires no translation.
Reviewed by: kib
Approved by: re (rgrimes, gjb)
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Review: https://reviews.freebsd.org/D17388
shutdown() to wakeup another thread blocked on a stream listen socket.
This code is failing, while it used to work on FreeBSD 10 and still
works on Linux.
It seems reasonable to add another exception to support something users are
actually doing, which used to work on FreeBSD 10, and still works on Linux.
And, it seems like it should be acceptable to POSIX, as we still return
ENOTCONN.
This patch is different to what had been committed to stable/11, since
code around listening sockets is different. Patch in D15019 is written
by jtl@, slightly modified by me.
PR: 227259
Obtained from: jtl
Approved by: re (kib)
Differential Revision: D15019
This caused microcode to be updated only on the BSP if hyperthreading
was disabled, typically resulting in a hang or reset.
Approved by: re (kib)
Sponsored by: The FreeBSD Foundation
ioctl(2) commands only have meaning in the context of a file descriptor
so translating them in the syscall layer is incorrect.
The new handler users an accessor to retrieve/construct a pointer from
the last member of the passed structure and relies on type punning to
access the other members which require no translation.
Reviewed by: kib (prior version), jhb
Approved by: re (rgrimes)
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Review: https://reviews.freebsd.org/D17378
using an application trying to use a v4mapped destination address on a
kernel without INET support or on a v6only socket.
Catch this case and prevent the packet from going anywhere;
else, without the KASSERT() armed, a v4mapped destination
address might go out on the wire or other undefined behaviour
might happen, while with the KASSERT() we panic.
PR: 231728
Reported by: Jeremy Faulkner (gldisater gmail.com)
Approved by: re (kib)
system-call entry and whenever audit arguments or return values are
captured:
1. Expose a single global, audit_syscalls_enabled, which controls
whether the audit framework is entered, rather than exposing
components of the policy -- e.g., if the trail is enabled,
suspended, etc.
2. Introduce a new function audit_syscalls_enabled_update(), which is
called to update audit_syscalls_enabled whenever an aspect of the
policy changes, so that the value can be updated.
3. Remove a check of trail enablement/suspension from audit_new() --
at the point where this function has been entered, we believe that
system-call auditing is already in force, or we wouldn't get here,
so simply proceed to more expensive policy checks.
4. Use an audit-provided global, audit_dtrace_enabled, rather than a
dtaudit-provided global, to provide policy indicating whether
dtaudit would like system calls to be audited.
5. Do some minor cosmetic renaming to clarify what various variables
are for.
These changes collectively arrange it so that traditional audit
(trail, pipes) or the DTrace audit provider can enable system-call
probes without the other configured. Otherwise, dtaudit cannot
capture system-call data without auditd(8) started.
Reviewed by: gnn
Sponsored by: DARPA, AFRL
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17348
In the probe case for SCSI SMR Host Aware or Most Managed drives, be sure
to free allocated memory.
sys/cam/scsi/scsi_da.c:
In dadone_probezone(), free the data pointer before returning.
MFC after: 3 days
Sponsored by: Spectra Logic
Approved by: re (kib)
Otherwise (iter % ds->ds_cnt) is not guaranteed to lie in the range
[0, MAXMEMDOM).
Reported by: pho
Reviewed by: kib
Approved by: re (rgrimes)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17374
Tested with ifunc resolvers in the kernel and module with calls from
kernel to kernel, module to kernel, and module to module.
Reviewed by: kib (previous version)
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17370
Belatedly add a comment to the amd64 pmap explaining why we initialize
the kernel pmap's resident page count.
Reviewed by: alc, kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17377
Such data may later be unmapped. This occurs, for example, when a
loader-provided microcode update file is discarded.
Reviewed by: alc, kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17340
The initial raise in r336519 wasn't enough for using big resolution
(1920 x 1200 for example). Raise it again.
Reported by: bob prohaska <fbsd@www.zefox.net>
Tested by: bob prohaska <fbsd@www.zefox.net>
Approved by: re (gjb@)
The AMD Threadripper 2990WX is basically a slightly crippled Epyc.
Rather than having 4 memory controllers, one per NUMA domain, it has
only 2 memory controllers enabled. This means that only 2 of the
4 NUMA domains can be populated with physical memory, and the
others are empty.
Add support to FreeBSD for empty NUMA domains by:
- creating empty memory domains when parsing the SRAT table,
rather than failing to parse the table
- not running the pageout deamon threads in empty domains
- adding defensive code to UMA to avoid allocating from empty domains
- adding defensive code to cpuset to avoid binding to an empty domain
Thanks to Jeff for suggesting this strategy.
Reviewed by: alc, markj
Approved by: re (gjb@)
Differential Revision: https://reviews.freebsd.org/D1683
This removes two assignments for the flags field being done
twice and adds one, which was missing.
Thanks to Felix Weinrank for reporting the issue he found
by using fuzz testing of the userland stack.
Approved by: re (kib@)
MFC after: 1 week
INP_INFO_UNLOCK_ASSERT() in TCP-related code. For encapsulated traffic
it is possible, that the code is running in net_epoch_preempt section,
and INP_INFO_UNLOCK_ASSERT() is very strict assertion for such case.
PR: 231428
Reviewed by: mmacy, tuexen
Approved by: re (kib)
Differential Revision: https://reviews.freebsd.org/D17335
arguments wrong in r339020.
PR: 231625
Reported by: Yuri Pankov (yuripv yuripv.net)
Reviewed by: cem, Yuri Pankov (yuripv yuripv.net)
Approved by: re (kib)
Pointyhat to: bz (a rather big one for this one)
sctp_process_cmsgs_for_init() and sctp_findassociation_cmsgs()
similar to sctp_find_cmsg() to improve consistency and avoid
the signed/unsigned issues in sctp_process_cmsgs_for_init()
and sctp_findassociation_cmsgs().
Thanks to andrew@ for reporting the problem he found using
syzcaller.
Approved by: re (kib@)
MFC after: 1 week
sending UDP encapsulated SCTP packets.
This is consistent with the behaviour that when such packets are received,
the corresponding UDP stats counter (udps_ipackets) is incremented.
Thanks to Peter Lei for making me aware of this inconsistency.
Approved by: re (kib@)
MFC after: 1 week
is done via using ifconfig, which uses a SIOCSIFMTU ioctl() command, or
doing it using a TUNSIFINFO/TAPSIFINFO ioctl() command.
Without this patch, for IPv6 the new MTU is not used when creating routes.
Especially, when initiating TCP connections after increasing the MTU,
the old MTU is still used to compute the MSS.
Thanks to ae@ and bz@ for helping to improve the patch.
Reviewed by: ae@, bz@
Approved by: re (kib@)
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D17180
This is mostly a cosmetic change except that obsolete system calls are
assigned meaningful names in the names arrays which means that using
tools like kdump or truss against binaries invoking these system calls
will print out the name instead of the number. The script I use to
generate the XML list of syscalls for GDB also ignores UNIMPL but not
OBSOL entries. In general UNIMPL should only be used to reserve
placeholders for system calls that have never been implemented while
system calls that existed at one time in FreeBSD but were removed
should be marked OBSOL instead.
Reviewed by: brooks, kib, imp
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17344
after the file mapping was wired.
if a wired map entry is backed by vnode and the file is truncated,
corresponding pages are invalidated. vm_fault_copy_entry() should be
aware of it and allow for invalid pages past end of file. Also, such
pages should be not mapped into userspace. If userspace accesses the
truncated part of the mapping later, it gets a signal, there is no way
kernel can prevent the page fault.
Reported by: andrew using syzkaller
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D17323
if the dst_object is not of swap type.
It can only happen when entry does not require copy, otherwise
vm_map_protect() already adds the charge. So the assert was right for
the case where swap object was allocated in the vm_fault_copy_entry(),
but not when it was just copied from src_entry and its type is not
swap.
Reported by: andrew using syzkaller
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D17323
allocated dst_object in a single place.
Suggested and reviewed by: alc
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D17323
For PCID case, there is a dependency between pm_gen zeroing and
reading pm_active for IPI target selection, to ensure that the
invalidation is not missed.
Reported and tested by: mjg
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
As with r338962 also export the instruction set attribute register. This
will allow userland to identify optional instructions the hardware
supports, for example in a future ifunc handler to decide which
implementation of a function to return.
Approved by: re (kib)
The pre-7.x compat for both native and 32-bit code was already in
pci_user.c. Use this infrastructure to add implement 32-bit support.
This is more correct as ioctl(2) commands only have meaning in the
context of a file descriptor.
Reviewed by: kib
Approved by: re (gjb)
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential revision: https://reviews.freebsd.org/D17324
The function stopped swapping rdi and rsi, but the error handling
code was not updated with the new register name.
Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation
This reverts part of r333368. The attempt to clear DR6 was occuring
too soon as trapsignal() does not pause to let the debugger notice the
SIGTRAP and query DR6. The signal exchange does not occur until much
later during ast(). As a result, GDB was no longer recognizing
hardware breakpoints and watchpoints on x86.
In addition, any userland programs that want to inspect DR6 in a
SIGTRAP handler don't have a way to do this if we clear DR6 in the
exception handler.
Instead of relying on the kernel to clear DR6, debuggers will have to
explicitly clear it after a trace trap (which they needed to do on
older kernels anyway).
Reviewed by: kib
Approved by: re (delphij)
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D17319
once we have a lock, make sure the inp is not marked freed.
This can happen since the list traversal and locking was
converted to epoch(9). If the inp is marked "freed", skip it.
This prevents a NULL pointer deref panic later on.
Reported by: slavash (Mellanox)
Tested by: slavash (Mellanox)
Reviewed by: markj (no formal review but caught my unlock mistake)
Approved by: re (kib)
- remove a forward branch in the common case
- replace xchg + lodsb/stosb loop with simple movs
A simple test on Intel(R) Core(TM) i7-4600U CPU @ 2.10GH copying
/foo/bar/baz in a loop goes from 295715863 ops/s to 465807408.
Further changes are pending.
Reviewed by: kib
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17281
- move the PSL.AC comment to the fault handler
- stop testing for zero-sized ops. after several minutes of package
building there were no copyin calls with zero bytes and very few
copyout. the semantic of returning 0 in this case is preserved
- shorten exit paths by clearing %eax earlier
- replace xchg with 3 movs. this is what compilers do. a naive
benchmark on EPYC suggests about 1% increase in thoughput thanks to
this change.
- remove the useless movb %cl,%al from copyout. it looks like a
leftover from many years ago
Reviewed by: kib
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17286
Both the in-kernel C variant and libc asm variant have very poor performance.
The former compiles to a single byte comparison loop, which breaks down even
for small sizes. The latter uses rep cmpsq/b which turn out to have very poor
throughput and are slower than a hand-coded 32-byte comparison loop.
Depending on size this is about 3-4 times faster than the current routines.
Reviewed by: kib
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17328
Create a user view of the ID_AA64PFR0_EL1 register with values common
across all CPUs.
Approved by: re (kib)
Sponsored by: ABT Systems Ltd
Differential Revision: https://reviews.freebsd.org/D17301
Building the kernel in Git repositories when git-svn is not available and
the "help.autocorrect" Git parameter is enabled results in Git trying to
replace the "svn" command (it does not know) with "serve". As a result the
output of the "git server" command is appended to the value of the
environmental variable VERINFO, which causes the auto generated vers.c
file to contain invalid C syntax (missing newline escapes):
#define "@(#)FreeBSD 12.0-ALPHA7 r000eversion 2
0015agent=git/2.19.0
000cls-refs
0012fetch=shallow
0012server-option
0000=5e2272613fa(splash-vt)"
#define VERSTR "FreeBSD 12.0-ALPHA7 r000eversion 2
0015agent=git/2.19.0
000cls-refs
0012fetch=shallow
0012server-option
0000=5e2272613fa(splash-vt)\n"
Using `-c help.autocorrect=0` seems to be a good solution as it does not
modify user's environment. I am not sure, however, if we should use
programs (or Git commands), which we are not sure exist (we never check if
git-svn is available on the host), as there may be more unexpected
behaviors like this one.
Reviewed by: eadler, emaste, krion
Approved by: re (gjb), krion (mentor)
Sponsored by: Bally Wulff Games & Entertainment GmbH
Differential Revision: https://reviews.freebsd.org/D17271
undefined instruction exception. Previously we would exit the guest,
however an unprivileged user could execute these.
Found with: syzkaller
Reviewed by: araujo, tychon (previous version)
Approved by: re (kib)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D17192
As part of ZFS Crypto, I started getting a series of panics when I did not
have AESNI loaded. Adding locking fixed it, and I concluded that the
Reinit function altered the AES key schedule. This locking is not as
fine-grained as it could be (AESNI uses per-cpu locking), but
it's minimally invasive.
Sponsored by: iXsystems Inc
Reviewed by: cem, mav
Approved by: re (gjb), mav (mentor)
Differential Revision: https://reviews.freebsd.org/D17307
Remove unused and easy to misuse PNP macro parameter
Inspired by r338025, just remove the element size parameter to the
MODULE_PNP_INFO macro entirely. The 'table' parameter is now required to
have correct pointer (or array) type. Since all invocations of the macro
already had this property and the emitted PNP data continues to include the
element size, there is no functional change.
Mostly done with the coccinelle 'spatch' tool:
$ cat modpnpsize0.cocci
@normaltables@
identifier b,c;
expression a,d,e;
declarer MODULE_PNP_INFO;
@@
MODULE_PNP_INFO(a,b,c,d,
-sizeof(d[0]),
e);
@singletons@
identifier b,c,d;
expression a;
declarer MODULE_PNP_INFO;
@@
MODULE_PNP_INFO(a,b,c,&d,
-sizeof(d),
1);
$ rg -l MODULE_PNP_INFO -- sys | \
xargs spatch --in-place --sp-file modpnpsize0.cocci
(Note that coccinelle invokes diff(1) via a PATH search and expects diff to
tolerate the -B flag, which BSD diff does not. So I had to link gdiff into
PATH as diff to use spatch.)
Tinderbox'd (-DMAKE_JUST_KERNELS).
Approved by: re (glen)
Do not call crypto_newsession() while holding xforms_lock mutex.
Release mutex before invoking crypto_newsession(), and use
ipsec_kmod_enter()/ipsec_kmod_exit() functions to protect from doing
access to unloaded kernel module memory.
Move xform-releated functions into subr_ipsec.c to be able use
ipsec_kmod_* functions. Also unconditionally build ipsec_kmod_*
functions, since now they are always used by IPSec code.
Add xf_cntr field to struct xformsw, it is used by ipsec_kmod_*
functions. Also constify xf_name field, since it is not expected to be
modified.
Approved by: re (kib)
Differential Revision: https://reviews.freebsd.org/D17302
From PCI Spec rev 2.2, 6.2.1. Device Identification:
Vendor ID This field identifies the manufacturer of the device. Valid
vendor identifiers are allocated by the PCI SIG to ensure uniqueness.
0FFFFh is an invalid value for Vendor ID.
MFC after: 3 days
Approved by: re (Glen), hselasky (mentor), kib (mentor)
Sponsored by: Mellanox Technologies
dmaplimit is the first byte after the end of DMAP.
Reported by: "Johnson, Archna" <Archna.Johnson@netapp.com>
Reviewed by: alc, markj
Approved by: re (gjb)
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D17318
For pmap_invalidate_all_pcid(), only reset pm_gen for non-kernel
pmaps, as it was done before the conversion to ifuncs. The reset is
useless but innocent for kernel_pmap. Coverity reported that cpuid is
used uninitialized in this case.
Reported by: cem
Reviewed by: alc, cem, markj
CID: 1395807
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
Differential revision: https://reviews.freebsd.org/D17314
Currently vfs calls the root method on each absolute lookup and when
crossing mount points.
zfs_root ends up looking up the inode internally as if it was not
instantianted which results in significant lock contention on systems
like EPYC.
Store the vnode in the mount point and protect the access with rmlocks.
This is a temporary hack for 12.0.
Sample result:
before:
make -s -j 128 buildkernel 2778.09s user 3319.45s system 8370% cpu 1:12.85 total
after:
make -s -j 128 buildkernel 3199.57s user 1772.78s system 8232% cpu 1:00.40 total
Tested by: pho (zfs mount/unmount tests)
Reviewed by: kib, mav, sef (different parts)
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17233
- Switch to using 32b port/link capabilities in the driver. The 32b
format is used internally by firmwares > 1.16.45.0 and the driver will
now interact with the firmware in its native format, whether it's 16b
or 32b. Note that the 16b format doesn't have room for 50G, 200G, or
400G speeds.
- Add a bit in the pause_settings knobs to allow negotiated PAUSE
settings to override manual settings.
- Ensure that manual link settings persist across an administrative
down/up as well as transceiver unplug/replug.
- Remove unused is_*G_port() functions.
Approved by: re@ (gjb@)
MFC after: 1 month
Sponsored by: Chelsio Communications
The PHB4 host bridge used by the POWER9 uses a 64kB range in 32-bit
space at the address 0xffff0000-0xffffffff. Reserve this range so that
DMA memory cannot be allocated within this range. This fixes seemingly
random crashes on a POWER9 system. Ideally this range will have been
reserved by the firmware, but as of now this is not the case.
Submitted by: git_bdragon.rtk0.net
Reviewed by: nwhitehorn
Approved by: re(kib)
Differential Revision: https://reviews.freebsd.org/D17183
Use these predicates instead of inline references to vm_min_domains.
Also add a global all_domains set, akin to all_cpus.
Reviewed by: alc, jeff, kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17278
In 11.x and earlier these were accessible as direct members of 'struct
kinfo_file'. Existing code already knows about the new location of
these members as well, so wrapper macros did not work for these
fields. Instead, define an anonymous struct containing the fields
from 'struct kinfo_file' in FreeBSD 11 that were not part of the
'kf_un' union. This anonymous struct is then placed in an anonymous
union along with the new 'kf_un' union. This preserves the API of
both structure layouts without requiring any wrapper macros.
PR: 231525
Reviewed by: kib
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17262
This invokes "fence" on the hart performing the write followed by an IPI
to execute "fence.i" on all harts.
This is required to support userland debuggers setting breakpoints in
user processes.
Reviewed by: br (earlier version), markj
Approved by: re (gjb)
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D17139
redundant, because uma_zone_reserve_kva() is performed on both zones and it
sets this same flag on the zone. (Moreover, the implementation of the swap
pager does not itself require these zones to be UMA_ZONE_NOFREE.)
Reviewed by: kib, markj
Approved by: re (gjb)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D17296
Currently stats are collected in a MAXCPU-sized array which is not
aligned and suffers enormous false-sharing. Fix the problem by
utilizing per-cpu allocation.
The counter(9) API is not used here as it is too incomplete and does
not provide a win over per-cpu zone sized for malloc stats struct. In
particular stats are being reported for each cpu separately by just
copying what is supposed to be an array element for given cpu.
This eliminates significant false-sharing during malloc-heavy tests
e.g. on Skylake. See the review for details.
Reviewed by: markj
Approved by: re (kib)
Differential Revision: https://reviews.freebsd.org/D17289
syncache_respond(). There is no functional change. The
parameter became unused in r313330, but wasn't removed.
Approved by: re (kib@)
MFC after: 1 month
Sponsored by: Netflix, Inc.
Split calculation of mask for shootdown IPI and local
invalidation. Reorder IPI before local.
Suggested by: alc
Reviewed by: alc, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
Approved by: re (rgrimes)
Differential revision: https://reviews.freebsd.org/D17277
NFS is the only in-tree filesystem using the feature, but all ops test
for it.
Currently the resulting sigdefer calls have to be jumped over in the
common case.
This is a bandaid, longer term fix will move this feature away.
Approved by: re (kib)
illumos/illumos-gate@82f63c3c2b
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Author: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: re (delphij)
- _fault handlers for both primitives are identical, provide just one
- change the copying scheme to match memcpy (in particular jump
avoidance for the most common case of multiply of 8)
- stop re-reading pcb address on exit, just store it locally (in r9)
Reviewed by: kib
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17265
In non-reproducible mode we have the kernel ident as a side effect of
including the build directory. Explicitly add it to the ident string in
reproducible mode.
Reported by: mjg
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
If the size is 15 bytes or less avoid spinning up rep just to copy the 8
bytes. In my tests on EPYC and old Intel microarchs without ERMS (like
Westmere) it provided a nice win over the current version (e.g. for EPYC
memset with 15 bytes of size goes from 59712651 ops/s to 70600095) all
while almost not pessimizing the other cases.
Data collected during package building shows that < 16 sizes are pretty
common.
Verified with the glibc test suite.
Approved by: re (kib)
Fix a fat-fingered typo with a "funny" side-effect: when doing copyin on a
cpu without ERMS and with size being a multiply of 8 a page fault would be
triggered resulting in EFAULT.
Pointy hat: mjg
Approved by: re (implicit)
It seems igb supports TSO6, but the capability got lost in
the iflib update. Restore this capability.
PR: 231476
Reported by: lev
Reviewed by: erj
Approved by: re (gjb)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D17242
It is currently unused and reserved for future use to keep KBI/KPI.
Also add several spare pointers to be able extend structure if it
will be needed.
Approved by: re (gjb)
Various capabilities were not being handled correctly in the
SIOCSIFCAP handler. Specifically:
IFCAP_RXCSUM and IFCAP_RXCSUM_IPV6 could be set even if not supported
It was impossible to disable IFCAP_RXCSUM and/or IFCAP_RXCSUM_IPV6 via
ifconfig since it does ioctl() per command-line flag rather than combine
them into a single call.
IFCAP_VLAN_HWCSUM could not be modified via the ioctl()
Setting any combination of the three IFCAP_WOL flags would set only
IFCAP_WOL_MCAST | IFCAP_WOL_MAGIC. For example, setting only
IFCAP_WOL_UCAST would result in both IFCAP_WOL_MCAST and IFCAP_WOL_MAGIC
being enabled, but IFCAP_WOL_UCAST would not be enabled.
Because if_vlancap() was called before if_togglecapenable(), vlan flags
were sometimes not applied correctly.
Interfaces were being unnecessarily stopped and restarted for WoL
PR: 231151
Submitted by: Kaho Toshikazu <kaho@elam.kais.kyoto-u.ac.jp>
Reported by: Shirkdog <mshirk@daemon-security.com>
Reviewed by: galladin
Approved by: re (gjb)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D17158
The old code appears to assume that vmem_alloc() would import
size-aligned KVA chunks from the parent kernel_arena, but vmem doesn't
provide this guarantee.
Also remove the unused global RWX arena and add comments explaining why
we have per-domain arenas.
Reported by: alc
Reviewed by: alc, kib (previous version)
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17249
not freed. This can happen since the list traversal and locking
was converted to epoch(9). If the inp is marked "freed", skip it.
This prevents a NULL pointer deref panic in ip6_savecontrol_v4()
trying to access the socket hanging off the inp, which was gone
by the time we got there.
Reported by: andrew
Tested by: andrew
Approved by: re (gjb)
Ensure that pages backing the same virtual large page come from the
same physical domain, as kmem_malloc_domain() does.
PR: 231038
Reviewed by: alc, kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17248
A lot of function have the following check:
cmpq %rax,%rdi /* verify address is valid */
ja fusufault
The label is present earlier in kernel .text, which means this is a jump
backwards. Absent any information in branch predictor, the cpu predicts it
as taken. Since it is almost never taken in practice, this results in a
completely avoidable misprediction.
Move it past all consumers, so that it is predicted as not taken.
Approved by: re (kib)
- Explicitly load an empty initial state into FP registers when taking
the fault on the first FP instruction in a thread. Setting
SSTATE.FS to INITIAL is just a marker to let context switch restore
code know that it can load FP registers with zeroes instead of
memory loads. It does not imply that the hardware will reset all
registers to zero on first access. In addition, set the state to
CLEAN instead of INITIAL after the first FP instruction.
cpu_switch() doesn't do anything for INITIAL and only restores from
the pcb if the state is CLEAN. We could perhaps change cpu_switch
to call fpe_state_clear if the state was INITIAL and leave SSTATE.FS
set to INITIAL instead of CLEAN after the first FP instruction.
However, adding this complexity to cpu_switch() doesn't seem worth
the supposed gain.
- Only save the current FPU registers in fill_fpregs() if the request
is made to save the current thread's registers. Previously if a
debugger requested FP registers via ptrace() it was getting a copy
of the debugger's FP registers rather than the debugee's.
- Zero the entire FP register set structure returned for ptrace() if a
thread hasn't used FP registers rather than leaking garbage in the
fp_fcsr field.
- If a debugger writes FP registers via ptrace(), always mark the pcb
as having valid FP registers and set SSTATUS.FS_MASK to CLEAN so
that the registers will be restored when the debugged thread
resumes.
- Be more explicit about clearing the SSTATUS.FS field before setting
it to CLEAN on the first FP instruction trap.
Submitted by: br, markj
Approved by: re (rgrimes)
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D17141
Zero the entire FP register set structure returned for ptrace() if a
thread hasn't used FP registers rather than leaking garbage in the
fp_sr and fp_cr fields.
Reviewed by: emaste, andrew
Approved by: re (rgrimes)
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D17140