Even if a kif doesn't have an ifp or if_group pointer we still can't delete it
if it's referenced by a rule. In other words: we must check rulerefs as well.
While we're here also teach pfi_kif_unref() not to remove kifs with flags.
Reported-by: syzbot+b31d1d7e12c5d4d42f28@syzkaller.appspotmail.com
MFC after: 2 weeks
When trapping on a wrote access to a buffer the kernel has mapped as write
only we should only pass the VM_PROT_WRITE flag. Previously the call to
vm_fault_trap as the VM_PROT_READ flag was unexpected.
Reported by: manu
Sponsored by: Innovate UK
Add support to the _STANDALONE environment enough bits of the kernel
that we can compile it. We still have a small zstd_shim.c since there
were 3 items that were a bit hard to nail down and may be cleaned up
in the future. These go hand in hand with a number of commits to
sys/sys in the past weeks, should this need be MFCd.
Discussed with: mmacy (in review and on IRC/Slack)
Reviewed by: freqlabs (on openzfs repo)
Differential Revision: https://reviews.freebsd.org/D26218
Both s_len and s_size are ssize_t, so their differece is also more
properly a ssize_t not a size_t. Also, assert that len is <= size when
we enter. This should always be the case. Ensure that we have that one
byte that we write to the end of the buffer before we do so, though
the error should already be set on the buffer if not, and the only
times we supply 'partial' buffers they should be plenty large.
Reviewed by: cem, jhb (prior version, I did cem's suggestion)
Differential Revsion: https://reviews.freebsd.org/D26752
hints work on per-channel basis as documented, rather than chip-wide. Also,
when configured via hints, return BUS_PROBE_NOWILDCARD on successful hints
match, so that the hints don't bogusly match other types of i2c chips.
If userspace tries to set flags (e.g. 'set skip on <ifspec>') and <ifspec>
doesn't exist we should create a kif so that we apply the flags when the
<ifspec> does turn up.
Otherwise we'd end up in surprising situations where the rules say the
interface should be skipped, but it's not until the rules get re-applied.
Reviewed by: Lutz Donnerhacke <lutz_donnerhacke.de>
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D26742
There's a number of types we forward declare for the kernel. We need
struct ucred for the ZSTD ZFS integration, so go ahead and forward
declare it here too.
This patch has the driver for 10Gigabit Ethernet controller in AMD
SoC. This driver is written compatible to the Iflib framework. The
existing driver is for the old version of hardware. The submitted
driver here is for the recent versions of the hardware where the Ethernet
controller is PCI-E based.
Submitted by: Rajesh Kumar <rajesh1.kumar@amd.com>
MFC after: 1 month
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D25793
It seems that in r354857 I got more than one thing wrong.
Convert the SYSCTL_OPAQUE to a SYSCTL_PROC to properly export the these
days allocated and not longer static per-vnet viftable array.
This fixes a problem with netstat -g which would show bogus information
for the IPv4 Virtual Interface Table.
PR: 246626
Reported by: Ozkan KIRIK (ozkan.kirik gmail.com)
MFC after: 3 days
Push the root seed version to userspace through the VDSO page, if
the RANDOM_FENESTRASX algorithm is enabled. Otherwise, there is no
functional change. The mechanism can be disabled with
debug.fxrng_vdso_enable=0.
arc4random(3) obtains a pointer to the root seed version published by
the kernel in the shared page at allocation time. Like arc4random(9),
it maintains its own per-process copy of the seed version corresponding
to the root seed version at the time it last rekeyed. On read requests,
the process seed version is compared with the version published in the
shared page; if they do not match, arc4random(3) reseeds from the
kernel before providing generated output.
This change does not implement the FenestrasX concept of PCPU userspace
generators seeded from a per-process base generator. That change is
left for future discussion/work.
Reviewed by: kib (previous version)
Approved by: csprng (me -- only touching FXRNG here)
Differential Revision: https://reviews.freebsd.org/D22839
There is no functional change for the existing Fortuna random(4)
implementation, which remains the default in GENERIC.
In the FenestrasX model, when the root CSPRNG is reseeded from pools due to
an (infrequent) timer, child CSPRNGs can cheaply detect this condition and
reseed. To do so, they just need to track an additional 64-bit value in the
associated state, and compare it against the root seed version (generation)
on random reads.
This revision integrates arc4random(9) into that model without substantially
changing the design or implementation of arc4random(9). The motivation is
that arc4random(9) is immediately reseeded when the backing random(4)
implementation has additional entropy. This is arguably most important
during boot, when fenestrasX is reseeding at 1, 3, 9, 27, etc., second
intervals. Today, arc4random(9) has a hardcoded 300 second reseed window.
Without this mechanism, if arc4random(9) gets weak entropy during initial
seed (and arc4random(9) is used early in boot, so this is quite possible),
it may continue to emit poorly seeded output for 5 minutes. The FenestrasX
push-reseed scheme corrects consumers, like arc4random(9), as soon as
possible.
Reviewed by: markm
Approved by: csprng (markm)
Differential Revision: https://reviews.freebsd.org/D22838
Fortuna remains the default; no functional change to GENERIC.
Big picture:
- Scalable entropy generation with per-CPU, buffered local generators.
- "Push" system for reseeding child generators when root PRNG is
reseeded. (Design can be extended to arc4random(9) and userspace
generators.)
- Similar entropy pooling system to Fortuna, but starts with a single
pool to quickly bootstrap as much entropy as possible early on.
- Reseeding from pooled entropy based on time schedule. The time
interval starts small and grows exponentially until reaching a cap.
Again, the goal is to have the RNG state depend on as much entropy as
possible quickly, but still periodically incorporate new entropy for
the same reasons as Fortuna.
Notable design choices in this implementation that differ from those
specified in the whitepaper:
- Blake2B instead of SHA-2 512 for entropy pooling
- Chacha20 instead of AES-CTR DRBG
- Initial seeding. We support more platforms and not all of them use
loader(8). So we have to grab the initial entropy sources in kernel
mode instead, as much as possible. Fortuna didn't have any mechanism
for this aside from the special case of loader-provided previous-boot
entropy, so most of these sources remain TODO after this commit.
Reviewed by: markm
Approved by: csprng (markm)
Differential Revision: https://reviews.freebsd.org/D22837
DTS must be synced with the kernel, add a freebsd,dts-version string in
the root node of each DTS that we compile so we can later in the kernel
check that it contain a correct value.
Reviewed by: imp, mmel
Differential Revision: https://reviews.freebsd.org/D26724
r366344 corrected the optimized amd64 skein assembly implementation, so
we can now enable it again.
Also add a dependency on this Makefile for the skein_block object, so
that it will be rebuit (similar to r366362).
PR: 248221
Sponsored by: The FreeBSD Foundation
r365732 was the first attempt to get an accurate count but it was
writing to some read-only registers to clear them and that obviously
didn't work. Instead, note the counter's value when it is supposed to
be cleared and subtract it from future readings.
dev.<port>.stats.rx_fcs_error should not be serviced from the MPS
register for T6.
The stats.* sysctls should all use T5_PORT_REG for T5 and above. This
must have been missed in the initial T5 support years ago. Fix it while
here.
MFC after: 3 days
Sponsored by: Chelsio Communications
Truncating requires an exclusive lock, but it was not taken if the
filesystem indicates support for shared writes. This only concerns
ZFS.
In particular fixes cp of files which have trailing holes.
Reported by: bdrewery
The module handler code invokes a MOD_UNLOAD event immediately if
MOD_LOAD fails. The result was that if seminit() failed, semunload()
was invoked twice. semunload() is not idempotent however and would
try to remove it's process_exit eventhandler twice resulting in a
panic.
Reviewed by: kib, markj
Obtained from: CheriBSD
MFC after: 1 month
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D26696
Use of dead_vnodeops would result in a panic instead of returning the intended
EOPNOTSUPP error.
While here make sure to abort, not just try to return a partial result.
The former allows the regular lookup to restart from scratch, while the latter
makes it stuck with an unusable vnode.
Reported by: kevans
Create the RISC-V NOTES and LINT files. As of r366559, LINT configs are
no longer generated but checked in to the tree.
Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D26502
Allow the DSCP codepoint also to be configurable
for the traffic in the direction from the initiator
to the target, such that writes and any requests
are also treated in the appropriate QoS class.
Reviewed by: mav
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D26714
Consider the currently in-use TCP options when
calculating the amount of new data to be injected during
SACK loss recovery. That addresses the effect that very small
(new) segments could be injected on partial ACKs while
still performing a SACK loss recovery.
Reported by: Liang Tian
Reviewed by: tuexen, chengc_netapp.com
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D26446
This adds a new IP_PROTO / IPV6_PROTO setsockopt (getsockopt)
option IP(V6)_VLAN_PCP, which can be set to -1 (interface
default), or explicitly to any priority between 0 and 7.
Note that for untagged traffic, explicitly adding a
priority will insert a special 801.1Q vlan header with
vlan ID = 0 to carry the priority setting
Reviewed by: gallatin, rrs
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D26409
Extend netstat to display TCP stack and detailed congestion state
Adding the "-c" option used to show detailed per-connection
congestion control state for TCP sessions.
This is one summary patch, which adds the relevant variables into
xtcpcb. As previous "spare" space is used, these changes are ABI
compatible.
Reviewed by: tuexen
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D26518
makeLINT.mk isn't needed or used anymore, remove it and all the files
it uses.
Reviewed by: kevans
Differential Revision: https://reviews.freebsd.org/D26540
Now that config(8) has supported include for 19 years, transition to
including the NOTES files. include support didn't exist at the time,
nor did the envvar stuff recently added. Now that it does, eliminate
the building of LINT files by just including everything you need.
Note: This may cause conflicts with updating in some cases.
find sys -name LINT\* -rm
is suggested across this commit to remove the generated LINT
files.
Reviewed by: kevans
Differential Revision: https://reviews.freebsd.org/D26540
Without this patch, when vn_generic_copy_file_range() is
doing a large copy, it will remain in the function for a
considerable amount of time, delaying handling of any
outstanding signals until the copy completes.
This patch adds checks for signals that need to be
processed after each successful data copy cycle.
When sig_intr() returns non-zero, vn_generic_copy_file_range()
will return.
The check "if (len < savlen)" ensures that some data
has been copied, so that progress will be made.
Note that, since copy_file_range(2) is allowed to
return fewer bytes copied than requested, it
will never return EINTR/ERESTART when sig_intr()
returns non-zero.
Reviewed by: kib, asomers
Differential Revision: https://reviews.freebsd.org/D26620
The precedence of the '&' operator is less than of '+'. Added braces
do change the order of evaluation into the natural one, in my opinion.
On the other hand, the value of the expression should not change since
all elements should have page-aligned values.
This fixes a gcc warning reported.
Reported by: adrian
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Normally when a buffer with B_BARRIER is written, the flag is cleared
by g_vfs_strategy() when creating bio. But in some cases FFS buffer
might not reach g_vfs_strategy(), for instance when copy-on-write
reports an error like ENOSPC. In this case buffer is returned to
dirty queue and might be written later by other means. Among then
bdwrite() reasonably asserts that B_BARRIER is not set.
In fact, the only current use of B_BARRIER is for lazy inode block
initialization, where write of the new inode block is fenced against
cylinder group write to mark inode as used. The situation could be
seen that we break dependency by updating cg without written out
inode. Practically since CoW was not able to find space for a copy of
inode block, for the same reason cg group block write should fail.
Reported by: pho
Discussed with: chs, imp, mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D26511
Check td_flags for relevant AST requests lock-less. This opens the
race slightly wider where sig_intr() returns false negative, but might
be it is worth it.
Requested by: mjg
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Specifically, if lookup() returned any error and the topping directory
was not latched, which means that (non-existent) path did not returned
to the topping location, give ENOTCAPABLE a priority over the lookup()
error.
PR: 249960
Reviewed by: emaste, ngie
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D26695
Add support for sysctl 'machdep.uprintf_signal' that prints debugging
information on trap signal.
Reviewed by: jhibbits, luporl, bdragon
Sponsored by: Eldorado Research Institute (eldorado.org.br)
Differential Revision: https://reviews.freebsd.org/D26004
APM BIOS was relevant only to early laptops (approximately P166 or
P200 and slower). These have not been relevant for a long time, and
this code has been untested for a long time (as far as I can
tell). The APM compat code in ACPI and the apm(8) command is not being
retired. Both of these items are still in use (apm(8) is more
scriptable than the replacement acpiconf, for the most part). This has
been commented out of i386 GENERIC since 2002. This code is not
relevant to any other port.
Discussed on: arch@
The correct way to identify the end of the metadata is two adjacent
entries set to zero/MODINFO_END. I made a typo and this was checking the
first entry twice.
Reported by: rpokala
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
The boot metadata (also referred to as modinfo, or preload metadata)
provides information about the size and location of the kernel,
pre-loaded modules, and other metadata (e.g. the EFI framebuffer) to be
consumed during by the kernel during early boot. It is encoded as a
series of type-length-value entries and is usually constructed by
loader(8) and passed to the kernel. It is also faked on some
architectures when booted by other means.
Although much of the module information is available via kldstat(8),
there is no easy way to debug the metadata in its entirety. Add some
routines to parse this data and allow it to be printed to the console
during early boot or output via a sysctl.
Since the output can be lengthly, printing to the console is gated
behind the debug.dump_modinfo_at_boot kenv variable as well as the
BOOTVERBOSE flag. The sysctl to print the metadata is named
debug.dump_modinfo.
Reviewed by: tsoome
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D26687
These kind of drops come for free in the sense that they do not use the
filter TCAM or any other resource that wouldn't normally be used during
rx. Frames dropped by the hardware get counted in the MAC's rx stats
but are not delivered to the driver.
hw.cxgbe.attack_filter
Set to 1 to enable the "attack filter". Default is 0. The attack
filter will drop an incoming frame if any of these conditions is true:
src ip/ip6 == dst ip/ip6; tcp and src/dst ip is not unicast; src/dst ip
is loopback (127.x.y.z); src ip6 is not unicast; src/dst ip6 is loopback
(::1/128) or unspecified (::/128); tcp and src/dst ip6 is mcast
(ff00::/8).
hw.cxgbe.drop_ip_fragments
Set to 1 to drop all incoming IP fragments. Default is 0. Note that
this drops valid frames.
hw.cxgbe.drop_pkts_with_l2_errors
Set to 1 to drop incoming frames with Layer 2 length or checksum errors.
Default is 1.
hw.cxgbe.drop_pkts_with_l3_errors
Set to 1 to drop incoming frames with IP version, length, or checksum
errors. Default is 0.
hw.cxgbe.drop_pkts_with_l4_errors
Set to 1 to drop incoming frames with Layer 4 length, checksum, or other
errors. Default is 0.
MFC after: 2 weeks
Sponsored by: Chelsio Communications
It is possible for elf_reloc_local() to fail in the unlikely case of
an unsupported relocation type. If this occurs, do not continue to
process the file.
Reviewed by: kib, markj (earlier version)
MFC after: 1 week
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D26701
This code was iteratively implemented during the work on various WiFi
drivers -- from individual functions to a macro-created implementations
for the various bit sized needed (and then extended to more for
comepleteness). Some of the bit combinations do not seem to make sense
so are left out.
The __bf_shf(x) was obtained from D26681 [1].
Requested by: manu [1]
Reviewed by: hselasky, manu
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26708
Sort a few VHT160 and 80+80 lines, update some comments, and remove
a superfluous ','.
No functional changes intended.
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
It is unlikely, but possible, that an unrecognized or unsupported
relocation type is encountered while trying to load a kernel module. If
this occurs we should offer the symbol index as a hint to the user.
While here, fix some small style issues.
Reviewed by: markj, kib (amd64 part, in D26701)
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Cleanup all host resources, SYSCTLs, MSIX vectors and memory used
by the host and only leave the device allocated memory behind, if any,
because it may still be in use, when the PCI remove function is called.
Else future probe calls may fail due to SYSCTLs already existing.
MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking
The kernel globals for kenv are confined to 2 files that need them and
a few that likely shouldn't (but as written the code does). Move them
from sys/systm.h to sys/kenv.h. This removed a XXX from systm.h and
cleans it up a little bit...
Sometimes, this drive will be present in the system such that the the
firmware identification string doesn't start with ATA, such as when
it's behind a SATA-to-SAS interposer. Add another quirk for that.
Submitted by: github user mr44er
Github PR: 423
The prototype was added with the creation of kern_shutdown.c in r17658,
but it appears to have never been implemented. Remove it now.
Reviewed by: cem, kib
Differential Revision: https://reviews.freebsd.org/D26702
We're not allowed to hold NET_EPOCH while sleeping, so when we call ioctl()
handlers for member interfaces we cannot be in NET_EPOCH. We still need some
protection of our CK_LISTs, so hold BRIDGE_LOCK instead.
That requires changing BRIDGE_LOCK into a sleepable lock, and separating the
BRIDGE_RT_LOCK, to protect bridge_rtnode lists. That lock is taken in the data
path (while in NET_EPOCH), so it cannot be a sleepable lock.
While here document the locking strategy.
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D26418
- Just use sw->octx != NULL to handle the HMAC case when finalizing
the MAC.
- Explicitly zero the on-stack auth context.
Reviewed by: markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26688
if_capabilities is a read-only mask of supported capabilities.
if_capenable is a mask under administrative control via ifconfig(8).
Reviewed by: gallatin
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26690
Both cxgbe(4) and mlx5(4) wrapped the existing send tag header with
their own identical headers that stored the type that the
type-specific tag structures inherited from, so in practice it seems
drivers need this in the tag anyway. This permits removing these
extra header indirections (struct cxgbe_snd_tag and struct
mlx5e_snd_tag).
In addition, this permits driver-independent code to query the type of
a tag, e.g. to know what type of tag is being queried via
if_snd_query.
Reviewed by: gallatin, hselasky, np, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26689
Since r366355 and r366284 we panic on access faults rather than treating
them like page faults so this condition is never true.
Reviewed by: jhb (mentor), markj, mhorne
Approved by: jhb (mentor), markj, mhorne
Differential Revision: https://reviews.freebsd.org/D26686
We should never take instruction page faults when in the kernel, but by
using the standard page fault code we should get a more-informative
message about faulting on a NOFAULT page rather than branching to the
default case here and printing an "Unknown kernel exception ..."
message.
Reviewed by: jhb (mentor), markj
Approved by: jhb (mentor), markj
Differential Revision: https://reviews.freebsd.org/D26685
These names were inherited from the arm64 port and should be changed to
the RISC-V terminology.
Reviewed by: jhb (mentor), kp, markj
Approved by: jhb (mentor), kp, markj
Differential Revision: https://reviews.freebsd.org/D26671
Add release_pages needed by drm which simply calls put_page for
all the pages provided
Reviewed by: bz
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26680
Add power_supply_is_system_supplied which is needed by drm.
Reviewed by: bz
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26679
Only add prefetchw as it is the only function used by drm.
Simply use the __builtin_prefetch which is available in all
compiler for a long time.
Reviewed by: bz
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26677
This compute the common greater divider
Taken from OpenBSD
Reviewed by: bz, imp
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26674
Add an "nextnoskip" sysctl that allows for listing of sysctls intended to be
normally skipped for cost reasons.
This makes it so the names/descriptions of those sysctls can be discovered with
sysctl -aN/sysctl -ad/sysctl -at.
It also makes it so children are visited when a node flagged with CTLFLAG_SKIP
is explicitly requested.
The intended use case is to mark the root "kstat" node with CTLFLAG_SKIP so that
the extensive and expensive stats are skipped by default but may still be easily
obtained without having to know them all (which may not even be possible) and
request each one-by-one.
Reviewed by: jhb
MFC after: 2 weeks
Relnotes: yes
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D26560
This is described in RealTek's driver as a "RTL8168 Series add-on card."
PR: 250037
Submitted by: Hiroshi HASEGAWA <hhase1973@gmail.com>
MFC after: 1 week
Since the code exits smr section prior to calling pwd_hold, the used
pwd can be freed and a new one allocated with the same address, making
the comparison erroneously true.
Note it is very unlikely anyone ran into it.
switch ufs_stat() to use the same value for st_dev as was used by
the previous ufs_getattr() stat path.
Submitted by: gallatin
Reviewed by: mjg, imp, kib, mckusick
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D26596
Dynamically created OIDs automatically get this flag set.
Reviewed by: jhb
MFC after: 1 week
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D26561
It gives the answer would the thread sleep according to current state
of signals and suspensions. Of course the answer is racy and allows
for false-negatives (no sleep when signal is delivered after process
lock is dropped). Also the answer might change due to signal
rescheduling among threads in multi-threaded process.
Still it is the best approximation I can provide, to answering the
question was the thread interrupted.
Reviewed by: markj
Tested by: pho, rmacklem
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D26628
- Extract suspension check into sig_ast_checksusp() helper.
- Extract signal check and calculation of the interruption errno into
sig_ast_needsigchk() helper.
The helpers are moved to kern_sig.c which is the proper place for
signal-related code.
Improve control flow in sleepq_catch_signals(), to handle ret == 0
(can sleep) and ret != 0 (interrupted) only once, by separating
checking code into sleepq_check_ast_sq_locked(), which return value is
interpreted at single location.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D26628
Nexthop lookup was not consireding rt_flags when doing
structure comparison, which lead to an original nexthop
selection when changing flags. Fix the case by adding
rt_flags field into comparison and rearranging nhop_priv
fields to allow for efficient matching.
Fix `route change X/Y flags` case - recent changes
disallowed specifying RTF_GATEWAY flag without actual gateway.
It turns out, route(8) fills in RTF_GATEWAY by default, unless
-interface flag is specified. Fix regression by clearing
RTF_GATEWAY flag instead of failing.
Fix route flag reporting in RTM_CHANGE messages by explicitly
updating rtm_flags after operation competion.
Add IPv4/IPv6 tests for flag-only route changes.
If current process is 64bit, use rex-prefixed version of XSAVE
(XSAVE64). If current process is 32bit and CPU supports saving
segment registers cs/ds in the FPU save area, use non-prefixed variant
of XSAVE.
Reported and tested by: Michał Górny <mgorny@mgorny@moritz.systems>
PR: 250043
Reviewed by: emaste, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D26643
After r354889 stack got struct nmi_pcpu at top, which makes IST top
not page-aligned. Since pmap_pti_add_kva() truncates/rounds up
addresses, it erronously entered a page mapped before double fault
stack into the pti page table.
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
detour - it writes ktrace entries to the filesystem - so the overhead
of ast() won't make any difference.
Reviewed by: kib
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D26404
This change is based on the nexthop objects landed in D24232.
The change introduces the concept of nexthop groups.
Each group contains the collection of nexthops with their
relative weights and a dataplane-optimized structure to enable
efficient nexthop selection.
Simular to the nexthops, nexthop groups are immutable. Dataplane part
gets compiled during group creation and is basically an array of
nexthop pointers, compiled w.r.t their weights.
With this change, `rt_nhop` field of `struct rtentry` contains either
nexthop or nexthop group. They are distinguished by the presense of
NHF_MULTIPATH flag.
All dataplane lookup functions returns pointer to the nexthop object,
leaving nexhop groups details inside routing subsystem.
User-visible changes:
The change is intended to be backward-compatible: all non-mpath operations
should work as before with ROUTE_MPATH and net.route.multipath=1.
All routes now comes with weight, default weight is 1, maximum is 2^24-1.
Current maximum multipath group width is statically set to 64.
This will become sysctl-tunable in the followup changes.
Using functionality:
* Recompile kernel with ROUTE_MPATH
* set net.route.multipath to 1
route add -6 2001:db8::/32 2001:db8::2 -weight 10
route add -6 2001:db8::/32 2001:db8::3 -weight 20
netstat -6On
Nexthop groups data
Internet6:
GrpIdx NhIdx Weight Slots Gateway Netif Refcnt
1 ------- ------- ------- --------------------------------------- --------- 1
13 10 1 2001:db8::2 vlan2
14 20 2 2001:db8::3 vlan2
Next steps:
* Land outbound hashing for locally-originated routes ( D26523 ).
* Fix net/bird multipath (net/frr seems to work fine)
* Add ROUTE_MPATH to GENERIC
* Set net.route.multipath=1 by default
Tested by: olivier
Reviewed by: glebius
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D26449
With helper page daemon threads, enabled by default in r364786, we
divide the inactive target by the number of threads, rounding down, and
sum the total number of pages freed by the threads. This sum is
compared with the original target, but by rounding down we might lose
pages, causing the page daemon control loop to conclude that inactive
queue scanning isn't keeping up with demand for free pages. Typically
this results in excessive swapping.
Fix the problem by accounting for the error in the main pagedaemon
thread's target. Note that by default the problem will manifest only in
systems with >16 CPUs in a NUMA domain.
Reviewed by: cem
Discussed with: dougm
Reported and tested by: dhw, glebius
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26610
uma_zalloc_domain() allocates from the requested domain instead of
following a first-touch policy (the default for most zones). Currently
it is only used by malloc_domainset(), and consumers free returned items
with free(9) since r363834.
Previously uma_zalloc_domain() worked by always going to the keg for an
item. As a result, the use of UMA zone caches was unbalanced: we free
items to the caches, but always allocate from the keg, skipping the
caches.
Make some effort to allocate from the UMA caches when performing a
cross-domain allocation. This avoids blowing up the caches when
something is performing many transient allocations with
malloc_domainset().
Reported and tested by: dhw, glebius
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26427
When SMR was introduced, zone_put_bucket() was changed to always place
full buckets at the end of the queue. However, it is generally
preferable to use recently used buckets since their items are more
likely to be resident in cache. So, for buckets that have no constraint
on item reuse, use a last-in-first-out ordering as we did before.
Reviewed by: rlibby
Tested by: dhw, glebius
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26426
dmi function are used to get smbios values.
The DRM subsystem and drivers use it to enabled (or not) quirks.
Reviewed by: hselasky
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26046
Add backlight function to linuxkpi.
Graphics drivers expose the backlight of the panel directly so allow them to use the backlight subsystem so
user can use backlight(8) to configure them.
Reviewed by: hselasky
Relnotes: yes
Differential Revision: The FreeBSD Foundation
Driver for pwm-backlight compatible device.
Relnotes: yes
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26252
This is a simple subsystem that allow drivers to register as a backlight.
Each backlight creates a device node under /dev/backlight/backlightX and
an alias based on the name provided.
Relnotes: yes
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26250
Currently we allocate and map zero-filled anonymous pages when dumping
core. This can result in lots of needless disk I/O and page
allocations. This change tries to make the core dumper more clever and
represent unbacked ranges of virtual memory by holes in the core dump
file.
Add a new page fault type, VM_FAULT_NOFILL, which causes vm_fault() to
clean up and return an error when it would otherwise map a zero-filled
page. Then, in the core dumper code, prefault all user pages and handle
errors by simply extending the size of the core file. This also fixes a
bug related to the fact that vn_io_fault1() does not attempt partial I/O
in the face of errors from vm_fault_quick_hold_pages(): if a truncated
file is mapped into a user process, an attempt to dump beyond the end of
the file results in an error, but this means that valid pages
immediately preceding the end of the file might not have been dumped
either.
The change reduces the core dump size of trivial programs by a factor of
ten simply by excluding unaccessed libc.so pages.
PR: 249067
Reviewed by: kib
Tested by: pho
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26590
OBJT_DEFAULT, _SWAP, _VNODE and _PHYS is exactly the set of
non-fictitious object types, so just check for OBJ_FICTITIOUS. The
check no longer excludes dead objects, but such objects have to be
handled regardless.
No functional change intended.
Reviewed by: alc, dougm, kib
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26589
Access faults in user mode are treated like TLB misses, which leads to an
endless loop of faults. It's less serious than the same fault in kernel mode,
because we can just terminate the process, but that's not ideal.
Treat user mode access faults as a bus error.
Suggested by: jrtc27
Reviewed by: br, jhb
Sponsored by: Axiado
Differential Revision: https://reviews.freebsd.org/D26621
These tunables can only be set to a valid cluster size (2K, 4K, 9K, or
16K) as documented in the man page. Anything else could lead to a
panic on interface up.
Reported by: mav@
MFC after: 1 week
Sponsored by: Chelsio Communications
- Annotate FreeBSD sysctls with CTLFLAG_MPSAFE
- Reduce stack usage of Lua
- Don't save user FPU context in kernel threads
- Add support for procfs_list
- Code cleanup in zio_crypt
- Add DB_RF_NOPREFETCH to dbuf_read()s in dnode.c
- Drop references when skipping dmu_send due to EXDEV
- Eliminate gratuitous bzeroing in dbuf_stats_hash_table_data
- Fix legacy compat for platform IOCs
The assembly implementation incorrectly used logical AND instead of
bitwise AND. Fix, and re-enable in libmd.
Submitted by: Yang Zhong <yzhong@freebsdfoundation.org>
Reviewed by: cem (earlier)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D26614
Per the Intel manuals, CPUID is supposed to unconditionally zero the
upper 32 bits of the involved (rax/rbx/rcx/rdx) registers.
Previously, the emulation would cast pointers to the 64-bit register
values down to `uint32_t`, which while properly manipulating the lower
bits, would leave any garbage in the upper bits uncleared. While no
existing guest OSes seem to stumble over this in practice, the bhyve
emulation should match x86 expectations.
This was discovered through alignment warnings emitted by gcc9, while
testing it against SmartOS/bhyve.
SmartOS bug: https://smartos.org/bugview/OS-8168
Submitted by: Patrick Mooney
Reviewed by: rgrimes
Differential Revision: https://reviews.freebsd.org/D24727
Repeating the default WARNS here makes it slightly more difficult to
experiment with default WARNS changes, e.g. if we did something absolutely
bananas and introduced a WARNS=7 and wanted to try lifting the default to
that.
Drop most of them; there is one in the blake2 kernel module, but I suspect
it should be dropped -- the default WARNS in the rest of the build doesn't
currently apply to kernel modules, and I haven't put too much thought into
whether it makes sense to make it so.
successful RPC.
Without this patch, the NFSv4.2 VOP_COPY_FILE_RANGE() client call would
loop until the copy "len" was completed. The problem with doing this is
that it might take a considerable time to complete for a large "len".
By returning after a single successful Copy RPC that copied some of the
data, the application that did the copy_file_range(2) syscall will be
more responsive to signal delivery for large "len" copies.
hole size boundary.
By clipping the len argument of vn_generic_copy_file_range() to end at
an exact multiple of hole size, holes are more likely to be maintained
during the copy.
A hole can still straddle the boundary at the end of the
copy range, resulting in a block being allocated in the
output file as it is being grown in size, but this will reduce the
likelyhood of this happening.
While here, also modify setting of blksize to better handle the
case where _PC_MIN_HOLE_SIZE is returned as 1.
Reviewed by: asomers
Differential Revision: https://reviews.freebsd.org/D26570
A user pointer is not a suitable value for bio_data and the next block
of code always overwrites bio_data anyway. Just use cb->aio_buf
directly in the call to vm_fault_quick_hold_pages().
Reviewed by: kib
Obtained from: CheriBSD
MFC after: 1 month
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D26595
Some DSDT entries have multiple interrupts for one device.
Add support for it.
This fixes ahci on NXP LS2160 and genet on RPi4
Submitted by: Greg V <greg@unrelenting.technology>
Reviewed by: jhb
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D25145
In r351368, we introduced this XML- and GDB-encoded data. The protocol
'offset' should reflex the logical XML data offset, but unfortunately we
counted the GDB escapes as well.
In fact, we cannot safely do GDB character escaping at this layer at
all, because we don't know what will be flushed in a packet. It is
bogus to send only the first character of a two-character escape
sequence.
This patch "corrects" the problem by squashing these characters in the
transmitted XML document. It would be nice to transmit the characters
faithfully, but that is a more complicated change. Thread names are a
nice convenience feature for the GDB client, but one can always inspect
td_name or p_comm directly to find the true name.
Reported by: Ka Ho Ng <khng300 AT gmail.com>
Tested by: Ka Ho Ng
Reviewed by: emaste, markj, rlibby
Differential Revision: https://reviews.freebsd.org/D26599
Load/store/fetch access exceptions always indicate a violation of a PMP
rule. We can't treat those as page faults, because updating the page
table and trying again will only result in exactly the same access
exception recurring. This leaves us in an endless exception loop.
We cannot recover from these exceptions, so panic instead.
Reviewed by: jhb
Sponsored by: Axiado
Differential Revision: https://reviews.freebsd.org/D26544
Every other architecture defines this and this is required for
interrupts to work when using QEMU's PCI VirtIO devices (which all
report an interrupt line of 0) for two reasons.
Firstly, interrupt line 0 is wrong; they use one of 0x20-0x23 with the
lines being cycled across devices like normal. Moreover, RISC-V uses
INTRNG, whose IRQs are virtual as indices into its irq_map, so even if
we have the right interrupt line we still need to try and route the
interrupt in order to ultimately call into intr_map_irq and get back a
unique index into the map for the given line, otherwise we will use
whatever happens to be in irq_map[line] (which for QEMU where the line
is initialised to 0 results in using the first allocated interrupt,
namely the RTC on IRQ 11 at time of commit).
Note that pci_assign_interrupt will still do the wrong thing for INTRNG
when using a tunable, as it will bypass INTRNG entirely and use the
tunable's value as the index into irq_map, when it should instead
(indirectly) call intr_map_irq to allocate a new entry for the given
IRQ and treat the tunable as stating the physical line in use, which is
what one would expect. This, however, is a problem shared by all INTRNG
architectures, and not exclusive to RISC-V.
Reviewed by: kib
Approved by: kib
Differential Revision: https://reviews.freebsd.org/D26564
Without this patch, if a call to copy_file_range(2) specifies an input file
offset + len that would wrap around, EINVAL is returned.
I thought that was the Linux behaviour, but recent testing showed that
Linux accepts this case and does the copy_file_range() to EOF.
This patch changes the FreeBSD code to exhibit the same behaviour as
Linux for this case.
Reviewed by: asomers, kib
Differential Revision: https://reviews.freebsd.org/D26569
This appears to be a typo. The AdvSIMD field encodes support for
half-precision floating point SIMD instructions, which corresponds to
HWCAP_ASIMDHP, not HWCAP_ASIMDDP.
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
ccr(4) uses software to handle GCM and CCM requests not supported by
the crypto engine (e.g. with only AAD and no payload). This change
adds a fallback for a few more requests such as those with more SGL
entries than can fit in a work request (this can happen for GCM when
decrypting a TLS record split across 15 or more packets).
Reported by: Chelsio QA
Reviewed by: np
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D26582
Rather than placing the epoch around the entire receive loop which
might call into rtwn_rx_frame() and USB and sleep, split the loop
into two[1] and leave us with one unlock/lock cycle as well.
PR: 249925
Reported by: thj, (rkoberman gmail.com)
Tested by: thj
Suggested by: adrian [1]
Reviewed by: adrian
MFC after: 3 days
Sponsored by: The FreeBSD Foundation (initially, paniced my iwl lab host)
Differential Revision: https://reviews.freebsd.org/D26554
This is mostly needed for a common arm64/amd64 iommu code.
Reviewed by: kib
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D26587
This function isn't ACPI dependent and we may use it on FDT systems
as well.
o Don't repeat the function declaration, include iommu.h instead.
Reviewed by: andrew, kib
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D26584
This was introduced when I merged r361287 to OpenZFS and has been fixed
there already, commit 3f6bb6e43fd68e.
Reported by: swills
Reviewed by: allanjude, freqlabs, mmacy
LLVM 11 changed the meaning of '-O' from '-O2' to '-O1', which resulted
in debug kernels (with 'makeoptions DEBUG=-g') being built with inlining
disabled, causing severe performance hit.
The -O2 was already being used for building amd64, powerpc, and powerpcspe.
Discussed with: jrtc27, arichardson, bdragon, jhibbits
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D26471
This avoids setting the association in an inconsistent
state, which could result in a use-after-free situation.
This can be triggered by a malicious peer, if the peer
can modify the cookie without the local endpoint recognizing
it.
Thanks to Ned Williamson for reporting the issue.
MFC after: 3 days
Bind the netmap tx queues to a special '0xff' scheduling class which
makes the firmware skip some processing related to rate limiting on the
outgoing traffic. Future firmwares will do this automatically.
MFC after: 1 week
Sponsored by: Chelsio Communications
- Only active netmap receive queues should be in the RSS lookup table.
- The RSS table should be restored for NIC operation when the last
active netmap queue is switched off, not the first one.
- Support repeated netmap ON/OFF on a subset of the queues. This works
whether the the queues being enabled and disabled are the only ones
active or not. Some kring indexes have to be reset in the driver for
the second case.
MFC after: 1 week
Sponsored by: Chelsio Communications
For interfaces that do not support SIOCGIFMEDIA (for which there are
quite a few) the only fallback is to query the interface for
if_data->ifi_link_state. While it's possible to get at if_data for an
interface via getifaddrs(3) or sysctl, both are heavy weight mechanisms.
SIOCGIFDATA is a simple ioctl to retrieve this fast with very little
resource use in comparison. This implementation mirrors that of other
similar ioctls in FreeBSD.
Submitted by: Roy Marples <roy@marples.name>
Reviewed by: markj
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D26538
Until we can do proper /etc/rc output on both consoles in multicons
boot (or all of them if we ever generalize), report when we are
booting multicons. Also report the primary console. This will be a big
hint why output stops after this line (though some slow USB discovery
still happens after mountroot / init starts).
Reviewed by: scottl@, tsoome@
Differential Revision: https://reviews.freebsd.org/D26574
Use address of the pointer passed to kernel to determine whether the pointer
is a FDT block (physical address) or a module pointer (virtual kernel address).
This fragment was supposed to be committed before r366196, but I accidentally
skipped it in a patch series.
Reported by: bz
from 32 to 28) by shrinking some entries and reordering them.
Reviewed by: kib
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D26508
messages using DATA chunks. Don't use fsn_included when not being
sure that it is set to an appropriate value. If the default is
used, which is -1, this can result in SCTP associaitons not
making any user visible progress.
Thanks to Yutaka Takeda for reporting this issue for the the
userland stack in https://github.com/pion/sctp/issues/138.
MFC after: 3 days
We have a few shortcuts in the arm trap code to speed up obvious "must fail"
cases. In these situations, make sure that we fill in the "sig" and "code"
fields of the generated signal.
MFC after: 3 weeks
It was removed in r355289 but forgot to return it back when new u-boot booti
support was committed. Although booti is not the preferred method of
booting the kernel, it is very useful for the initial phase of porting
FreeBSD to a new platform or booting the kernel on various embedded boards
in an industrial environment.
Don't map same physical memory multiple times with different cache attributes.
This is explicitly stated as architectural undefined behavior, leading to
coherency issues sooner or later.
using an open_to_lock_owner4 when that lock_owner4 has already
been created by a previous open_to_lock_owner4. This caused the NFS server
to reply NFSERR_INVAL.
For NFSv4.0, this is an error, although the updated NFSv4.0 RFC7530 notes
that the correct error reply is NFSERR_BADSEQID (RFC3530 did not specify
what error to return).
For NFSv4.1, it is not obvious whether or not this is allowed by RFC5661,
but the NFSv4.1 server can handle this case without error.
This patch changes the NFSv4.1 (and NFSv4.2) server to handle multiple
uses of the same lock_owner in open_to_lock_owner so that it now correctly
interoperates with the Linux NFS client.
It also changes the error returned for NFSv4.0 to be NFSERR_BADSEQID.
Thanks go to Bjorn for diagnosing this and testing the patch.
He also provided a program that I could use to reproduce the problem.
Tested by: bj@cebitec.uni-bielefeld.de (Bjorn Fischer)
PR: 249567
Reported by: bj@cebitec.uni-bielefeld.de (Bjorn Fischer)
MFC after: 3 days
There may be additional 64-bit ABIs supported, so use a positive check rather
than a negative check.
Suggested by: imp
MFC after: 1 week
Sponsored by: Juniper Networks, Inc
I had failed to notice that sgsendccb() was using cam_periph_mapmem()
and thus was not passing down user pointers directly to drivers. In
practice this broke requests submitted from userland.
PR: 249395
Reported by: Trenton Schulz <trueos@norwegianrockcat.com>
Reviewed by: scottl
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D26550
Allow the necessary parts of systm.h to be visible in the _STANDALONE
environnment. Limit the reset to only being visible for _KERNEL
builds. Map KASSERT, etc to printf on failure in the bootloader until
we have more confidence things won't break and leave systems
unbootable. Eventually, this should map to a full panic in the
bootloader, but that also needs some enhancement to be more useful.
Reviewed by: tsoome, jhb
Differential Revision: https://reviews.freebsd.org/D26543
Original patch was against FreeBSD 12, and a test compile wasn't run against
head. md_tls_tcb_offset field was moved from mdthread to mdproc in the
meantime.
MFC after: 1 week
Sponsored by: Juniper Networks, Inc.
_KERNEL and _STANDALONE are different things. They cannot both be true
at the same time. If things that are normally visible only to _KERNEL
are needed for the _STANDALONE environment, you need to also make them
visible to _STANDALONE. Often times, this will be just a subset of the
required things for _KERNEL (eg global variables are but one example).
sys/cdefs.h is included by pretty much everything in both the loader
and the kernel, so is the ideal choke point.
A received control packet may cause the transmit queue to be flushed, in
which case ng_l2tp_seq_recv_nr() cancels the transmit timeout handler.
The handler checks to see if it was cancelled before doing anything, but
did so before acquiring the node lock, so a small race window could
cause ng_l2tp_seq_rack_timeout() to attempt to flush an empty queue,
ultimately causing a null pointer dereference.
PR: 241133
Reviewed by: bz, glebius, Lutz Donnerhacke
MFC after: 3 days
Sponsored by: Rubicon Communications, LLC (Netgate)
Differential Revision: https://reviews.freebsd.org/D26548
Summary:
Two bugs:
* Elf32_Auxinfo is broken, using pointers in the union, which are 64-bits not
32.
* freebsd32_sysarch() doesn't update the 'user local' register when handling
MIPS_SET_TLS, leading to a NULL pointer dereference in the 32-bit
application.
Reviewed by: #mips, brooks
MFC after: 1 week
Sponsored by: Juniper Networks, Inc
Differential Revision: https://reviews.freebsd.org/D26556
In some cases, the syscon driver may be used by consumer requiring better
control about locking (ie. it may be used as registe file provider for clock
driver which needs locked access to multiple registers).
Add fine locking protocol methods together with bunch of helper functions
in syscon driver and implement this functionality in syscon_generic driver.
MFC after: 4 weeks
Syscon can also have child nodes that share a registration file with it.
To do this correctly, follow these steps:
- subclass syscon from simplebus and expose it if the node is also
"simple-bus" compatible.
- block simplebus probe for this compatible string, so it's priority
(bus pass) doesn't colide with syscon driver.
While I'm in, also block "syscon", "simple-mfd" for the same reason.
MFC after: 4 weeks
The fastpath in tcp_output tries to send out
full segments, and avoid sending partial segments by
comparing against the static t_maxseg variable.
That value does not consider tcp options like timestamps,
while the initial window calculation is using
the correct dynamic tcp_maxseg() function.
Due to this interaction, the last, full size segment
is considered too short and not sent out immediately.
Reviewed by: tuexen
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D26478
Adjust ssthresh in after_idle to the maximum of
the prior ssthresh, or 3/4 of the prior cwnd. See
RFC2861 section 2 for an in depth explanation for
the rationale around this.
As newreno is the default "fall-through" reaction,
most tcp variants will benefit from this.
Reviewed by: tuexen
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D22438
designated initializers. This makes it easier to modify 'struct sysent'
layout.
Reviewed by: kevans
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D26530
The hardware supports periods as long as 196 seconds[*] when using the
maximal prescaling of 72000 and maximum cycle count of 2^16.
But the code becomes incorrect when the period length approaches 1 second.
That's because of things like NS_PER_SEC / period.
[*] At the same time I must note that the KPI provides for maximum
period of about 4 seconds (2^32 nanoseconds).
MFC after: 2 weeks
If a FUSE server returns FOPEN_DIRECT_IO in response to FUSE_OPEN, that
instructs the kernel to bypass the page cache for that file. This feature
is also known by libfuse's name: "direct_io".
However, when accessing a file via mmap, there is no possible way to bypass
the cache completely. This change fixes a deadlock that would happen when
an mmap'd write tried to invalidate a portion of the cache, wrongly assuming
that a write couldn't possibly come from cache if direct_io were set.
Arguably, we could instead disable mmap for files with FOPEN_DIRECT_IO set.
But allowing it is less likely to cause user complaints, and is more in
keeping with the spirit of open(2), where O_DIRECT instructs the kernel to
"reduce", not "eliminate" cache effects.
PR: 247276
Reported by: trapexit@spawn.link
Reviewed by: cem
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D26485
We have (two versions) of MS() and SM() macros which we use throughout
the wireless code. Change all but three places (ath_hal, rtwn, and rsu)
to the newly provided _IEEE80211_MASKSHIFT() and _IEEE80211_SHIFTMASK()
macros. Also change one internal case using both _S and _M instead of
just _S away from _M (one of the reasons rtwn and rsu were not changed).
This was done semi-mechanically. No functional changes intended.
Requested by: gnn (D26091)
Reviewed by: adrian (pre line wrap)
MFC after: 2 weeks
Sponsored by: Rubicon Communications, LLC (d/b/a "Netgate")
Differential Revision: https://reviews.freebsd.org/D26539
- We can exit the loop as soon as the filter check passes.
- The alignment check has already passed so there is no need to also run
it here.
Sponsored by: Innovate UK
We need to use a bounce buffer when the memory we are operating on is not
aligned to a cacheline, and not aligned to the maps alignment.
The former is to stop other threads from dirtying the cacheline while we
are performing DMA operations with it. The latter is to check memory
passed in by a driver is correctly aligned for the device.
Reviewed by: mmel
Sponsored by: Innovate UK
Differential Revision: https://reviews.freebsd.org/D26496
This will ensure nothing modifies the cacheline while DMA is in progress
so we won't need to bounce the data.
Reviewed by: mmel
Sponsored by: Innovate UK
Differential Revision: https://reviews.freebsd.org/D26495
_STANDALONE is only for the bootloader, not kernel modules. Remove it
from the build. This was harmless before, but sys/malloc.h now does
different things for the standalone environment, triggering the issue.
Use it to decide if we can skip cache management.
While here remove the DMAMAP_COULD_BOUNCE flag as it's unneeded.
Reviewed by: mmel
Sponsored by: Innovate UK
Differential Revision: https://reviews.freebsd.org/D26494
Add helper functions to the arm64 busdma for common cases of checking if
we may need to bounce, and if we must bounce for a given address.
These will be expanded later as we handle cache-misaligned memory.
Reported by: mmel
Sponsored by: Innovate UK
Differential Revision: https://reviews.freebsd.org/D26493
The ZSTD support for the boot loader will need to include files that
use the kernel's malloc interface. Create a standalone stub version
that's functional enough to allow this to work. There's some
limitations in this interface, and it's not quite a perfect
match. Specifically, M_WAITOK allocations can fail because there's
nothing that can be done we no memory is available.
It is only ever called for negative entries and for those it is
just a wrapper around cache_zap_negative_locked_vnode_kl which
always succeeds.
This also fixes a bug where cache_lookup_fallback should have been
calling cache_zap_locked_bucket instead. Note that in order to trigger
the bug NOCACHE must not be set, which currently only happens when
creating a new coredump (and then the coredump-to-be has to have a
negative entry).
The NOTES files have a bunch of hint lines that are removed when
generating LINT. However, we can achieve the same effect by prepending
each of the lines with 'envvar' so the NOTES files become standard
config(8) files. No functional changes as the sed script to generate
the LINT files filters these either way.
Suggested by: kevans
and current address space is already destroyed, so kern_execve()
terminates the process.
While there, clean up some internals of post_execve() inlined in init_main.
Reported by: Peter <pmc@citylink.dinoex.sub.org>
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D26525
We left these in the clean rule to avoid having stale files remain in
working trees, but enough time has now passed that it's no longer
relevant.
Discussed with: imp
The entire cache scan was a leftover from the old implementation.
It is incredibly wasteful in presence of several mount points and does not
win much even for single ones.
Similar to OPAL calls, switch to big endian to do calls to RTAS.
(Missed this one when I was doing the bulk commit of PowerPC64LE support.)
Sponsored by: Tag1 Consulting, Inc.
Due to enter_idle_powerx fabricating a MSR from scratch, it is necessary
for it to care about the endianness, so we don't accidentally switch
endian the first time we idle a thread.
Took about five seconds to spot after seeing an unmangled backtrace.
The hard bit was needing to temporarily set up a mutex to sort out the
logjam that happens when every thread simultaneously wakes up in the wrong
endian due to the panic IPI and panics, leaving what I can best describe as
"alphabet soup" on the console.
Luckily, I already had a patch sitting around to do that.
This brings POWER8 up to equivilence with POWER9 on PPC64LE.
Sponsored by: Tag1 Consulting, Inc.
OPAL unconditionally enters secondary CPUs with only HV and SF set.
I tried writing a secondary entry point instead, but OPAL rejected it
and I am unsure why, so I resorted to making the system reset interrupt
endian-flexible.
This means we take a slight performance hit on wakeup on LE, but it is
a good stopgap until we can figure out a reliable way to make OPAL enter
where we want it to.
It probably makes sense to have it around anyway, because I can imagine
scenarios where the cpu resets itself to BE and does a software reset.
Sponsored by: Tag1 Consulting, Inc.
More endian conversion.
* Install TCEs correctly (i.e. in big endian)
* Convert to big endian and back when setting up queue pages and IRQs.
Sponsored by: Tag1 Consulting, Inc.
Since OPAL runs in big endian, any data being passed back and forth
via memory instead of registers needs to be byteswapped.
From my notes during development:
"A good way to find candidates is to look for vtophys() in opal_call()
parameters. The memory being passed will be written into in BE."
Sponsored by: Tag1 Consulting, Inc.
It's much easier to implement this in an endian-independent way when we
don't also have to worry about masking half of the dword off.
Given that this code ran on a machine that ran a poudriere bulk with no
kernel oddities, I am relatively certain it is correctly implemented. ;)
This should be a minor performance boost on BE as well.
Sponsored by: Tag1 Consulting, Inc.
For a body of code that had its endian conversion bits written blind without
the ability to test, moea64 was VERY close to being correct.
There were only four instances where the existing code was getting it wrong.
Sponsored by: Tag1 Consulting, Inc.
This is slightly stripped down from GENERIC64, as PowerMac G5 machines
are incapable of running in LE mode (so we can skip the Mac drivers.)
While technically POWER6 and POWER7 have the hardware capability of running
in LE mode, they have a tendency to trap excessively when a load/store is
misaligned. (an extremely common occurrence in LE code, and one of the main
reasons I consider BE to be superior, as it turns potential security issues
into immediately obvious mangled numbers.)
Additionally, there was no mechanism to control what endian interrupts
are delivered in, so supporting LE operation on POWER6 and POWER7 involves
some really dirty tricks in the interrupt vectors that I would rather
avoid.
IBM drew the line in the sand at POWER8 some time around 2013, embracing
full support for LE in the platform, and making a push across the board
for LE code to target POWER8 as a minimum requirement. As such, usage of
LE kernels on POWER6 and POWER7 is practically nil, despite it being
technically possible to do.
The so-called "TRUELE" feature bit which is the baseline requirement for
needed for PowerPC64LE was introduced in POWER8.
Sponsored by: Tag1 Consulting, Inc.
When running without a hypervisor, we need to set the ILE bit in the LPCR
ourselves.
For the boot processor, handle it in powernv_attach() like we do for other
LPCR bits.
No change for the APs, as they will use the lpcr global to set up their own
LPCR when they do their own cpudep_ap_early_bootstrap() and pick up this
automatically.
Sponsored by: Tag1 Consulting, Inc.
OPAL runs in big endian, so we need to rfid into it to switch endian
atomically when branching to it, and we need to do the
RETURN_TO_NATIVE_ENDIAN dance when it returns to us.
Sponsored by: Tag1 Consulting, Inc.
Unlike virtio, which in legacy mode is guest endian, the hypervisor vscsi
interface operates in big endian, so we must convert back and forth in several
places.
These changes are enough to attach a rootdisk.
Sponsored by: Tag1 Consulting, Inc.