Summary:
kexec-lite cannot currently handle multiple PT_LOAD segments. In some
cases the compiler generates multiple PT_LOAD segments for an unknown
reason, causing boot to fail from kexec-lite.
Submitted by: Brandon Bergren (older version)
Differential Revision: https://reviews.freebsd.org/D19574
Summary:
With a sufficiently large TOC, it's possible to index out of range, as
the immediate load instructions only permit 16-bit indices, allowing up
to 64kB range (signed) from the base pointer. Allow +/- 2GB range, with
the medium code model TOC accesses in asm.
Patch originally by Brandon Bergren. The issue appears to impact ELFv2
more than ELFv1.
Reviewed by: luporl
Differential Revision: https://reviews.freebsd.org/D19708
fuse(4) was heavily instrumented with debug printf statements that could
only be enabled with compile-time flags. They fell into three basic groups:
1. Totally redundant with dtrace FBT probes. These I deleted.
2. Print textual information, usually error messages. These I converted to
SDT probes of the form fuse:fuse:FILE:trace. They work just like the old
printf statements except they can be enabled at runtime with dtrace. They
can be filtered by FILE and/or by priority.
3. More complicated probes that print detailed information. These I
converted into ad-hoc SDT probes.
Also, de-inline fuse_internal_cache_attrs. It's big enough to be a regular
function, and this way it gets a dtrace FBT probe.
This commit is a merge of r345304, r344914, r344703, and r344664 from
projects/fuse2.
Reviewed by: cem
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19667
From Jake:
iflib_if_transmit returns ENOBUFS when the device is down, or when the
link isn't active.
This was changed in r308792 from return (0), so that the function
correctly reports an error that it was unable to transmit.
However, using ENOBUFS can cause some network applications to produce
the following or similar errors:
"ping: sendto: No buffer space available"
This is a bit confusing as the real cause of the issue is that the
network device is down.
Replace the ENOBUFS return with ENETDOWN to indicate more clearly that
the reason for the failure to send is due to the network device is
offline.
This will cause the error message to be reported as
"ping: sendto: Network is down"
Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: shurd@, sbruno@, bz@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D19652
From Jake:
The iflib_device_register function takes the CTX lock before calling
IFDI_ATTACH_PRE, and releases it upon finishing the registration.
Mirror this process in iflib_pseudo_register, so that we always hold the
CTX lock during the attach process when registering a pseudo interface
or a regular interface.
This was caught by code inspection while attempting to analyze where the
CTX lock was held.
Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: shurd@, erj@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D19604
CAM IOCTL interfaces traditionally mapped user-space data buffers to KVA.
It was nice originally, but now it takes too much to handle respective
TLB shootdowns, while small kernel memory allocations up to 64KB backed
by UMA and accompanied by copyin()/copyout() can be much cheaper.
For large buffers mapping still may have sense, and unmapped I/O would
be even better, but the last unfortunately is more tricky, since unmapped
I/O API is too specific to struct bio now.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
lagg_bcast_start appeared to have a bug in that was using the last
lagg port structure after exiting the epoch that was keeping that
structure alive. However, upon further inspection, the epoch was
already entered by the caller (lagg_transmit), so the epoch enter/exit
in lagg_bcast_start was actually unnecessary.
This commit generally removes uses of the net epoch via LAGG_RLOCK to
protect the list of ports when the list of ports was already protected
by an existing LAGG_RLOCK in a caller, or the LAGG_XLOCK.
It also adds a missing epoch enter/exit in lagg_snd_tag_alloc while
accessing the lagg port structures. An ifp is still accessed via an
unsafe reference after the epoch is exited, but that is true in the
current code and will be fixed in a future change.
Reviewed by: gallatin
MFC after: 1 month
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19718
Consider a bridge0 with em0 and em1 members. Traffic rx'd by em0 and
transmitted by bridge0 through em1 gets accounted for in IPACKETS/IBYTES
and bridge0 bpf -- assuming it's not unicast traffic destined for em1.
Unicast traffic destined for em1 traffic is not accounted for by any
mechanism, and isn't pushed through bridge0's bpf machinery as any other
packets that pass over the bridge do.
Fix this and simplify GRAB_OUR_PACKETS by bailing out early if it was rx'd
by the interface that it was addressed for. Everything else there is
relevant for any traffic that came in from one member that's being directed
at another member of the bridge.
Reviewed by: kp
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D19614
r345620 by kib@ fixed the rtld issue that caused a crash at startup
during resolution of libc's ifuncs with BIND_NOW.
PR: 233333
Sponsored by: The FreeBSD Foundation
It looks like some DIMMs claim to have a TSOD, but actually don't. Some
claim they weren't able to change the SPD page, but they did. Neither of
those should be fatal errors.
PR: 235944
Submitted by: Greg V <greg@unrelenting.technology>
Reported by: Greg V <greg@unrelenting.technology>
Reviewed by: cem
MFC after: 1 weeks
Sponsored by: Panasas
Differential Revision: https://reviews.freebsd.org/D19681
We were doing so as a workaround for the problem addressed by r345593, so
it's no longer necessary.
Reviewed by: jhb
Discussed with: emaste
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19705
RISC-V timer has no dedicated DTS node and we have to get timer
frequency from cpus node.
Tested on Government Furnished Equipment (GFE) cores synthesized
on Xilinx VCU118.
Reviewed by: markj
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D19727
Remove redundant npxsave_core definition while here.
Suggested by: Anton Rang
Reviewed by: kib, Anton Rang <rang AT acm.org>
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D19665
In most cases kernel.bootfile is populated from the information
provided by loader(8). There are certain scenarios when loader
is not available, for instance when kernel is loaded by u-boot
or some other BootROM directly. In this case the default value
"/kernel" points to invalid location and breaks some functinality,
like using installkernel on self-hosted system or dtrace's CTF
lookup. This can be fixed by setting the value manually but the
default that reflects correct location is better than default that
points to invalid one.
Current default was set around FreeBSD 1, when "/kernel" was the
actual path. Transition to /boot/kernel/kernel happened circa FreeBSD 3.
PR: 221550
Reviewed by: ian, imp
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D18902
CAM_RESRC_UNAVAIL instead of CAM_REQUEUE_REQ. This makes CAM delay a bit
before retrying, so that the retries actually get a chance to succeed.
Reviewed by: sbruno
MFC after: 2 weeks
Sponsored by: Klara Inc.
Differential Revision: https://reviews.freebsd.org/D19696
It makes the code slightly easier to follow, and might make
it easier to fix the resouce accounting to also account for
the interpreter.
The PROC_UNLOCK() is moved earlier - I don't see anything
it should protect; the lim_max() is a wrapper around lim_rlimit(),
and that, differently from lim_rlimit_proc(), doesn't require
the proc lock to be held.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D19689
was unlocked and yet the bucket-unlock flag was not
changed to false. This can cause a panic if INVARIANTS
is on and we go through the right path (though rare).
Reported by: syzbot+179a1ad49f3c4c215fa2@syzkaller.appspotmail.com
Reviewed by: tuexen@
MFC after: 1 week
* Cache moea64_need_lock in a local variable; gcc generates slightly better
code this way, it doesn't need to reload the value from memory each read.
* VPN cropping is only needed on PowerPC ISA 2.02 and older cores, a subset
of those that need serialization, so move this under the need_lock check,
so those that don't need the lock don't even need to check this.
This allows for directives such as
makeoptions DTS+=/out/of/tree/myboard.dts
# in tree! Same rules applied as if this were in a dtb/ module
makeoptions DTS+=otherboard.dts
to be specified in config(5) and have these built/installed alongside th
kernel. The assumption that overlays live in an overlays/ directory is only
made for in-tree DTSO, but we still make the assumption that out-of-tree
arm64 DTS will be in vendored directories (for now).
This lowers the cost to hacking on an overlay or dts by being able to
quickly throw it in a custom config, especially if it doesn't fit one of the
current dtb/modules quite appropriately or it's not intended for commit
there.
The build/install targets were split out of dtb.mk to centralize the build
logic and leave out the all/realinstall/CLEANFILES additions... it was
believed that we didn't want to pollute the kernel build with these.
The build rules were converted to suffix rules at the suggestion of Ian to
clean things up a little bit in a world where we can have mixed
in-tree/out-of-tree DTS/DTSO specified.
Reviewed by: ian
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D19351
A sysid of 0 denotes the local system, and some handlers for remote
locking commands do not attempt to deal with local locks. Note that
F_SETLK_REMOTE is only available to privileged users as it is intended
to be used as a testing interface.
Reviewed by: kib
Reported by: syzbot+9c457a6ae014a3281eb8@syzkaller.appspotmail.com
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19702
The current kernel C-type macros might obscurely hide the fact that
the input argument might be used multiple times.
This breaks code like:
isalpha(*ptr++)
Use static inline functions instead of macros to fix this.
Reviewed by: kib @
Differential Revision: https://reviews.freebsd.org/D19694
MFC after: 1 week
Sponsored by: Mellanox Technologies
AR724X_GPIO_PINS used for this family is defined as 18
The datasheet for the AR7241 describes 20 pins, allow all to be used.
Submitted by: Hiroki Mori <yamori813@yahoo.co.jp>
Reviewed by: mizhka
Differential Revision: https://reviews.freebsd.org/D17580
TMPFS_PAGES_MINRESERVED controls how much memory is reserved for the system
and not used by tmpfs.
On very small memory systems, the default value may be too high and this
prevents these small memory systems from using reroot, which is required
for them to install firmware updates.
Submitted by: Hiroki Mori <yamori813@yahoo.co.jp>
Reviewed by: mizhka
Differential Revision: https://reviews.freebsd.org/D13583
While geom_flashmap has always supported label names for its slices, it does
so by appending "s.labelname" to the provider device name, meaning you still
have to know the name and unit of the hardware device to use the labels.
These changes add support for device-independent geom_flashmap labels, using
the standard geom_label infrastructure. geom_flashmap now creates a softc
struct attached to its geom, and as it creates slices it stores the label
into an array in the softc. The new geom_label_flashmap uses those labels
when tasting a geom_flashmap provider.
Differential Revision: https://reviews.freebsd.org/D19535
are successfully returned by the card (usually due to an abort being issued
as part of timeout recovery). Remove what amounts to an insufficient
KASSERT, and don't overwrite the state value. State should probably be
re-designed, and that will be done with a future commit.
Reported by: phk, bei.io
Reviewed by: imp, mav
Differential Revision: D19677
There are only 19 bytes available for the name of an interrupt plus the
name(s) of handlers/drivers using it. There is a mechanism from the days of
shared interrupts that replaces some of the handler names with '+' when they
don't all fit into 19 bytes.
In modern times there is typically only one device on an interrupt, but long
device names are the norm, especially with embedded systems. Also, in systems
with multiple interrupt controllers, the names of the interrupts themselves
can be long. For example, 'gic0,s54: imx6_anatop0' doesn't fit, and
replacing the device driver name with a '+' provides no useful info at all.
When there is only one handler but its name was too long to fit, this
change truncates enough leading chars of the handler name (replacing them
with a '-' char to indicate that some chars are missing) to use all 19
bytes, preserving the unit number typically on the end of the name. Using
the prior example, this results in: 'gic0,s54:-6_anatop0' which provides
plenty of info to figure out which device is involved.
PR: 211946
Reviewed by: gonzo@ (prior version without the '-' char)
Differential Revision: https://reviews.freebsd.org/D19675
For 32-bit Linuxulator, ipc() syscall was historically
the entry point for the IPC API. Starting in Linux 4.18, direct
syscalls are provided for the IPC. Enable it.
MFC after: 1 month
AMD64_SET_**BASE expects a pointer to a pointer, we just passing in the pointer value itself.
Set PCB_FULL_IRET for doreti to restore %fs, %gs and its correspondig base.
PR: 225105
Reported by: trasz@
MFC after: 1 month
SCTP_SENDALL flag. Allow also only one operation per SCTP endpoint.
This fixes an issue found by running syzkaller and is joint work with rrs@.
MFC after: 1 week
r343532 noted the difference between "hw.realmem" and "hw.physmem", which I
was previously unaware of. I discovered that neither sysctl had a
description visible via `sysctl -d', so I found where they were defined and
added suitable descriptions. While in the file, I went ahead and added
descriptions for all the others which lacked them. I also updated sysctl.3
accordingly
Reviewed by: kib, bcr
MFC after: 1 weeks
Sponsored by: Panasas
Differential Revision: https://reviews.freebsd.org/D19007
Otherwise resulting address from vm_map_find() migh not satisfy the
upper limit. For instance, it could affect MAP_32BIT flag from 64bit
processes.
Found by: Doug Moore <dougm@rice.edu>
Reviewed by: alc, Doug Moore <dougm@rice.edu>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D19688
TPM has a built-in RNG, with its own entropy source.
The driver was extended to harvest 16 random bytes from TPM every 10 seconds.
A new build option "TPM_HARVEST" was introduced - for now, however, it
is not enabled by default in the GENERIC config.
Submitted by: Kornel Duleba <mindal@semihalf.com>
Reviewed by: markm, delphij
Approved by: secteam
Obtained from: Semihalf
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D19620
Attempting to build www/firefox on POWER9 resulted in a HMI exception being
thrown, a fatal trap currently. This is typically caused by timer facility
errors, but examination of the Hypervisor Maintenance Exception Register
(HMER) yielded only that an exception had recovered, with no information of
the actual exception cause.
When an HMI occurs, OPAL_HANDLE_HMI or OPAL_HANDLE_HMI2 must be called to
handle the exception at the firmware level. If the exception is handled, we
can continue.
This adds only the preliminary handler, enough to prevent package building
from panicking. An enhancement in the future is to use the flags returned
by OPAL_HANDLE_HMI2 to print more useful error messages, and log maintenance
events.
Reviewed by: luporl
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D19634
After latest binding update, this patch enables usage of
the switch on Armada 3720 EspressoBin, so compile it
by default with arm64 GENERIC.
A patch was extracted from https://reviews.freebsd.org/D19036
Submitted by: Bert JW Regeer <xistence@0x58.com>
Reviewed by: manu
In the latest Linux kernel revisions the DSA (Distributed
Switch Architecture) device tree binding was changed.
Instead of the top level dsa@ node, the switch and its
ports is represented as a child node of the mdio bus.
With that other modifications were added, such as
relation with the ethernet port of the SoC. Adjust
e6000sw etherswitch and mvneta drivers to that.
Tested on Armada 3720 EspressoBin and Armada 388 Clearfog Pro boards.
Submitted by: Bert JW Regeer <xistence@0x58.com>
Reviewed by: manu
Differential Revision: https://reviews.freebsd.org/D19036
r345402 fixed the bug that led to the split of the ISA 3.0 HPT handling from
the existing manager. The cause of the bug was gcc moving the register
holding VPN to a different register (not r0), which triggered bizarre
behaviors. With the fix, things work, so they can be re-merged. No
performance lost with the merge.
I overlooked the fact that that VOP_FSYNC() call is not a FreeBSD VFS
call, but a macro that provides an illumos-compatible wrapper for the
FreeBSD operation.
PR: 236475
Reported by: lwhsu
Pointyhat to: avg
PIIX4_SMBHSTSTAT_ERR can be set for several reasons that, unfortunately,
cannot be distinguished, but the most typical case is a missing or hung
slave (SMB_ENOACK).
PIIX4_SMBHSTSTAT_FAIL means failed or killed / aborted transaction, so
it's previous mapping to SMB_ENOACK was not ideal.
After this change an smb(4) access to a missing slave results in ENXIO
rather than EIO. To me, that seems to be more appropriate.
MFC after: 3 weeks
This module provides support for the Amazon Elastic Network Adapter; it
was previously only built on x86 architectures, but Amazon EC2 now also
has ARM64 instances with this hardware.
Submitted by: Greg V
This value was being used uninitialized, resulting in predictable issues
on systems with memory-mapped UART registers.
A case could be made that memmap_bus should be declared in a header
rather than being declared in each .c file which needs to refer to it,
but that's a broader style question.
This commit unbreaks hw.uart.console="mm:..." on ARM64.
Submitted by: Greg V
The "access width" value was hard-coded as 2, indicating 32-bit accesses;
instead, use the value specified in the SPCR table.
This unbreaks the console on EC2 "A1" family instances.
Submitted by: Greg V
By happenstance gcc4 puts 'vpn' into r0 in all uses of TLBIE(), but modern
gcc does not. Also, the single-argument form of tlbie zeros all unused
arguments, making the modern tlbie instruction use r0 as the RS field
(LPID).
The vpn argument has the bottom 12 bits cleared (the input having been
left-shifted by 12 bits), which just so happens, on the POWER9 and previous
incarnations, to be the number of LPID bits supported. With those bits
being zero, the instruction:
tlbie r0, r0
will invalidate the VPN in r0, in LPAR 0 (ignoring the upper bits of r0 for
the RS field). One build with gcc8 yields:
tlbie r9, r0
with r0 having arbitrary contents, not equal to r9. This leads to strange
crashes, behaviors, and panics, due to the requested TLB entry not actually
being invalidated.
As the moea64_native must work on both old and new, we explicitly zero out
r0 so that it can work with only the single argument, built with base gcc
and modern gcc. isa3_hashtb takes a different approach, encoding the
two-argument form, soas not to explicitly clobber r0, and instead let the
compiler decide.
Reported by: Brandon Bergren
Tested by: Brandon Bergren
MFC after: 1 week
There are some unusual cases where a process may cause an mlock()ed
range of memory to be unmapped. If the application subsequently
faults on that region, the handler may attempt to create a superpage
mapping backed by the resident, wired pages. However, the pmap code
responsible for creating such a mapping (pmap_enter_pde() on i386
and amd64) does not ensure that a leaf page table page is available
if the superpage is later demoted; the demotion operation must therefore
perform a non-blocking page allocation and must unmap the entire
superpage if the allocation fails. The pmap layer ensures that this
can never happen for wired mappings, and so the case described above
breaks that invariant.
For now, simply ensure that the MI fault handler never attempts to
create a wired superpage except via promotion.
Reviewed by: kib
Reported by: syzbot+292d3b0416c27c131505@syzkaller.appspotmail.com
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19670
Now enabling ipfw(4) with sysctls controls only linkage of hooks to default
heads. When module is loaded fetch sysctls as tunables, to make it possible
to boot with ipfw(4) in kernel, but not linked to any pfil(9) hooks.
If vflush() did not completely flushed the mount vnodes queue, either
retry for forced unmounts, or give up for non-forced. This situation
can occur when new vnodes are instantiated while vflush() worked.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
The nexus module was missing method for releasing bus resources. As a
result, it couldn't be released and the bus_release_resource() call would
return ENXIO.
Next call to bus_alloc_resource() for the same resource was returning
error, because it wasn't released previously and it was still busy.
The implementation of the nexus_release_resource() is the same as for
arm architecture.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Reported-by: Greg V <greg@unrelenting.technology>
Tested-by: cperciva, Greg V <greg@unrelenting.technology>
Obtained from: Semihalf
MFC after: 2 weeks
Sponsored by: Amazon, Inc.
Differential revision: https://reviews.freebsd.org/D19641
No functional changes. Replace whitespace by tabs, indent with 4 spaces,
coalesce multi-line shorter than 80 characters,
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
The resource is already being activated in the bus_alloc_resource(),
because the flag RF_ACTIVE is being passed.
Double activation on arm64 is causing kernel panic.
Version of the driver was upgraded to 0.8.4.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Reported-by: Greg V <greg@unrelenting.technology>
Tested-by: cperciva, Greg V <greg@unrelenting.technology>
Obtained from: Semihalf
MFC after: 2 weeks
Sponsored by: Amazon, Inc.
Differential revision: https://reviews.freebsd.org/D19655
as an NS8250 UART.
This is the same as the UART found in EC2 "bare metal" instances,
except that the card vendor shows up as 0x0000 rather than 0x1d0f.
This seems like a bug in the EC2 firmware; but we might as well support
it anyway.
Reported by: Greg V
States in pf(4) let ICMP and ICMP6 packets pass if they have a
packet in their payload that matches an exiting connection. It was
not checked whether the outer ICMP packet has the same destination
IP as the source IP of the inner protocol packet. Enforce that
these addresses match, to prevent ICMP packets that do not make
sense.
Reported by: Nicolas Collignon, Corentin Bayet, Eloi Vanderbeken, Luca Moro at Synacktiv
Obtained from: OpenBSD
Security: CVE-2019-5598
It simply doesn't work in general since VCPUs may migrate between
physical cores. The approach used to measure skew also doesn't
make much sense in a VM.
PR: 218452
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
This bug was introduced with the change to use softdep_bp_to_mp()
in January 2018 changes -r327723 and -r327821. The softdep_bp_to_mp()
function failed to include VSOCK as one of the valid cases.
Although local-domain sockets do not allocate blocks in the filesystem,
they will allocate blocks if they use extended attributes (such as
ACLs). Thus, softdep_bp_to_mp() needs to return a non-NULL mount
pointer when presented with a socket vnode so that the soft updates
write complete will properly process the soft updates structures
associated with the extended attribute blocks. It was the failure
to process these soft updates structures, thus leaving them hanging
off the buffer, which lead to the "panic: softdep_deallocate_dependencies:
dangling deps" when trying to clean up the buffer after it was written.
PR: 230962
Reported by: 2t8mr7kx9f@protonmail.com
Reviewed by: kib
Tested by: Peter Holm
MFC after: 1 week
Sponsored by: Netflix
This ensures files like genassym.o and awk/mfiles are generated before
descending into the modules build. It may also allow some module builds
to not recreate files that are already present in the KERNBUILDDIR.
This fixes a rare build race where genassym.o is missing and assym.inc
is empty.
More work is planned around this to reduce some redundant dependency
generation in modules.
PR: 233339
MFC after: 2 weeks
Reported by: markj
This makes it more consistent with other filesystems, which all end in "fs",
and more consistent with its mount helper, which is already named
"mount_fusefs".
Reviewed by: cem, rgrimes
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19649
The kernel build uses symlinks to make MD #includes like <machine/pcpu.h>
work. Debug info ends up referencing these symlinks in a relative path,
so debuggers generally don't know how to find the corresponding headers.
Address this by using -fdebug-prefix-map to map relative paths through
the symlinks to their absolute paths in the source tree. This is
consistent with how regular source file paths are defined in the
kernel's debug info.
Also map the current directory to an absolute path to the object
directory. This gives debuggers a chance to find auto-generated files
like vnode_if.c if the object directory is available.
Reviewed by: emaste, jhb (previous version)
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19633
Recent firmwares prefer to use a different format for viid internally
and this change allows them to do so.
MFC after: 1 week
Sponsored by: Chelsio Communications
Either msync(MS_INVALIDATE) or the object unlock during vnode
truncation can expose invalid pages backing wired entries. Accept
them, but do not install them into destrination pmap. We must create
copied pages in the copy case, because e.g. vm_object_unwire() expects
that the entry is fully backed.
Reported by: syzkaller, via emaste
Reported by: syzbot+514d40ce757a3f8b15bc@syzkaller.appspotmail.com
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D19615
From Jake:
The iflib core never modifies the isc_driver_version string. Allow
drivers to safely assign pointers to constant buffers by marking this
parameter const.
Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: erj@, gallatin@, jhb@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D19577
From Krzysztof:
The driver built as KLD cannot be unloaded, if this flag is not set.
Submitted by: Krzysztof Galazka <krzysztof.galazka@intel.com>
Reviewed by: shurd@, erj@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D19402
From Jake:
iflib_fl_setup calculates a suitable buffer size for the Rx mbufs based
on the isc_max_frame_size value that drivers setup. This calculation is
repeated by drivers when programming their hardware with the size of
each Rx buffer.
This can lead to a mismatch where the iflib mbuf size is different from
the expected size of the buffer as programmed by the hardware. This can
lead to unexpected results.
If iflib ever wants to support mbuf sizes larger than one page, every
driver must be updated to account for the new possible buffer sizes.
Fix this by calculating the mbuf size prior to calling IFDI_INIT, and
adding the iflib_get_rx_mbuf_sz function which will expose this value to
drivers, so that they do not repeat the same calculation.
Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: shurd@, erj@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D19489
I missed these in r344664. They're basically useless because they can only
be controlled at compile-time. Also, de-inline fuse_internal_cache_attrs.
It's big enough to be a regular function, and this way it gets a dtrace FBT
probe.
Sponsored by: The FreeBSD Foundation
From Jake:
iflib_encap calls bus_dmamap_load_mbuf_sg. Upon it returning EFBIG, an
m_collapse and an m_defrag are attempted to shrink the mbuf cluster to
fit within the DMA segment limitations.
However, if we call m_defrag, and then bus_dmamap_load_mbuf_sg returns
EFBIG on the now defragmented mbuf, we will continuously re-call
bus_dmamap_load_mbuf_sg over and over.
This happens because m_head isn't NULL, and remap is >1, so we don't try
to m_collapse or m_defrag again. The only way we exit the loop is if
m_head is NULL. However, m_head can't be modified by the call to
bus_dmamap_load_mbuf_sg, because we don't pass it as a double pointer.
I believe this will be an incredibly rare occurrence, because it is
unlikely that bus_dmamap_load_mbuf_sg will actually fail on the second
defragment with an EFBIG error. However, it still seems like
a possibility that we should account for.
Fix the exit check to ensure that if remap is >1, we will also exit,
even if m_head is not NULL.
Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: shurd@, gallatin@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D19468
Minimalistic PSCI implementation in U-Boot doesn't implement get_version()
method for some SoC. In this case, use PSCI version declared by 'psci' node
in DT as fallback.
MFC after: 2 weeks
- older DT can use 'cpu0-supply' property for power supply binding.
- don't expect that actual CPU frequency is contained in CPU
operational point table, but read current CPU voltage directly from
reguator. Typically, u-boot can set starting CPU frequency to any
value.
MFC after: 2 weeks
In current code, the delay argument in FDT_PLATFORM_DEF(2) improperly
initialize refs field from kobj_class structure instead of delay_count
field.
This causes not working DELAY() function (due to never initialized
delay_count) in earlier boot stages, until the first timer was attached.
MFC after: 2 weeks
Update NAT64LSN implementation:
o most of data structures and relations were modified to be able support
large number of translation states. Now each supported protocol can
use full ports range. Ports groups now are belongs to IPv4 alias
addresses, not hosts. Each ports group can keep several states chunks.
This is controlled with new `states_chunks` config option. States
chunks allow to have several translation states for single alias address
and port, but for different destination addresses.
o by default all hash tables now use jenkins hash.
o ConcurrencyKit and epoch(9) is used to make NAT64LSN lockless on fast path.
o one NAT64LSN instance now can be used to handle several IPv6 prefixes,
special prefix "::" value should be used for this purpose when instance
is created.
o due to modified internal data structures relations, the socket opcode
that does states listing was changed.
Obtained from: Yandex LLC
MFC after: 1 month
Sponsored by: Yandex LLC
and remove possible panic condition.
It is already allowed to sleep in bpfattach[2], since BPF_LOCK was
converted to SX lock in r332388. Also move KASSERT() to the top of
function and make full initialization before bpf_if will be linked
to BPF's list of interfaces.
MFC after: 2 weeks
It may happen on some machines, that even if SGX is disabled
in firmware, the driver would still attach despite EPC base and
size equal zero. Such behaviour causes a kernel panic when the
module is unloaded. Add a simple check to make sure we
only attach when these values are correctly set.
Submitted by: Kornel Duleba <mindal@semihalf.com>
Reviewed by: br
Obtained from: Semihalf
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D19595
OpenBSD and NetBSD provide macros to directly reference the underlying
struct timespec's tv_nsec member. While FreeBSD has such macros for
tv_sec, the others are missing. Add the following macros:
st->st_atimensec
st->st_mtimensec
st->st_ctimensec
st->st_birthtimensec
Adding these fields will provide programs which reference them better
portability to FreeBSD. An example of such a program is makefs(8),
which has unused support for subseconds that it has inherited from
NetBSD.
Submitted by: Mitchell Horne <mhorne063@gmail.com>
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D19626
o most of data structures and relations were modified to be able support
large number of translation states. Now each supported protocol can
use full ports range. Ports groups now are belongs to IPv4 alias
addresses, not hosts. Each ports group can keep several states chunks.
This is controlled with new `states_chunks` config option. States
chunks allow to have several translation states for single alias address
and port, but for different destination addresses.
o by default all hash tables now use jenkins hash.
o ConcurrencyKit and epoch(9) is used to make NAT64LSN lockless on fast path.
o one NAT64LSN instance now can be used to handle several IPv6 prefixes,
special prefix "::" value should be used for this purpose when instance
is created.
o due to modified internal data structures relations, the socket opcode
that does states listing was changed.
Obtained from: Yandex LLC
MFC after: 1 month
Sponsored by: Yandex LLC
The line was misedited to change tt to st instead of
changing ut to st.
The use of st as the denominator in mul64_by_fraction() will lead
to an integer divide fault in the intr proc (the process holding
ithreads) where st will be 0. This divide by 0 happens after
the total runtime for all ithreads exceeds 76 hours.
Submitted by: bde
Some applications forward from/to host rings most or all the
traffic received or sent on a physical interface. In this
cases it is desirable to have more than a pair of RX/TX host
rings, and use multiple threads to speed up forwarding.
This change adds support for multiple host rings. On registering
a netmap port, the user can specify the number of desired receive
and transmit host rings in the nr_host_tx_rings and nr_host_rx_rings
fields of the nmreq_register structure.
MFC after: 2 weeks
CLAT is customer-side translator that algorithmically translates 1:1
private IPv4 addresses to global IPv6 addresses, and vice versa.
It is implemented as part of ipfw_nat64 kernel module. When module
is loaded or compiled into the kernel, it registers "nat64clat" external
action. External action named instance can be created using `create`
command and then used in ipfw rules. The create command accepts two
IPv6 prefixes `plat_prefix` and `clat_prefix`. If plat_prefix is ommitted,
IPv6 NAT64 Well-Known prefix 64:ff9b::/96 will be used.
# ipfw nat64clat CLAT create clat_prefix SRC_PFX plat_prefix DST_PFX
# ipfw add nat64clat CLAT ip4 from IPv4_PFX to any out
# ipfw add nat64clat CLAT ip6 from DST_PFX to SRC_PFX in
Obtained from: Yandex LLC
Submitted by: Boris N. Lytochkin
MFC after: 1 month
Relnotes: yes
Sponsored by: Yandex LLC
Add second IPv6 prefix to generic config structure and rename another
fields to conform to RFC6877. Now it contains two prefixes and length:
PLAT is provider-side translator that translates N:1 global IPv6 addresses
to global IPv4 addresses. CLAT is customer-side translator (XLAT) that
algorithmically translates 1:1 IPv4 addresses to global IPv6 addresses.
Use PLAT prefix in stateless (nat64stl) and stateful (nat64lsn)
translators.
Modify nat64_extract_ip4() and nat64_embed_ip4() functions to accept
prefix length and use plat_plen to specify prefix length.
Retire net.inet.ip.fw.nat64_allow_private sysctl variable.
Add NAT64_ALLOW_PRIVATE flag and use "allow_private" config option to
configure this ability separately for each NAT64 instance.
Obtained from: Yandex LLC
MFC after: 1 month
Sponsored by: Yandex LLC
Update node flags when driver supports SMPS, not when it is disabled or
in dynamic mode ((iv_htcaps & HTCAP_SMPS) != 0).
Checked with RTL8188EE (1T1R), STA mode - 'smps' word should disappear
from 'ifconfig wlan0' output.
MFC after: 2 weeks
The old implementation, at the VFS layer, would map the entire range of
logical blocks between the starting offset and the first data block
following that offset. With large sparse files this is very
inefficient. The VFS currently doesn't provide an interface to improve
upon the current implementation in a generic way.
Add ufs_bmap_seekdata(), which uses the obvious algorithm of scanning
indirect blocks to look for data blocks. Use it instead of
vn_bmap_seekhole() to implement SEEK_DATA.
Reviewed by: kib, mckusick
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19598
Without this dependency relationship, the linker doesn't find the
flash_register_slicer() function, so kldload fails to load fdt_slicer.ko.
Discussed with: ian@
* Set MK_OPENMP to yes by default only on amd64, for now.
* Bump __FreeBSD_version to signal this addition.
* Ensure gcc's conflicting omp.h is not installed if MK_OPENMP is yes.
* Update OptionalObsoleteFiles.inc to cope with the conflicting omp.h.
* Regenerate src.conf(5) with new WITH/WITHOUT fragments.
Relnotes: yes
PR: 236062
MFC after: 1 month
X-MFC-With: r344779
Add the infrastructure to allow MD procctl(2) commands, and use it to
introduce amd64 PTI control and reporting. PTI mode cannot be
modified for existing pmap, the knob controls PTI of the new vmspace
created on exec.
Requested by: jhb
Reviewed by: jhb, markj (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D19514
PTI mode for the process pmap on exec is activated iff P_MD_PTI is set.
On exec, the existing vmspace can be reused only if pti mode of the
pmap matches the P_MD_PTI flag of the process. Add MD
cpu_exec_vmspace_reuse() callback for exec_new_vmspace() which can
vetoed reuse of the existing vmspace.
MFC note: md_flags change struct proc KBI.
Reviewed by: jhb, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D19514
When the pmap with pti disabled (i.e. pm_ucr3 == PMAP_NO_CR3) is
activated, tss.rsp0 was not updated. Any interrupt that happen before
next context switch would use pti trampoline stack for hardware frame
but fault and interrupt handlers are not prepared to this. Correctly
update tss.rsp0 for both PMAP_NO_CR3 and pti pmaps.
Note that this case, pti = 1 but pmap->pm_ucr3 == PMAP_NO_CR3 is not
used at the moment.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D19514
Previously the main pfsync lock and the bucket locks shared the same name.
This lead to spurious warnings from WITNESS like this:
acquiring duplicate lock of same type: "pfsync"
1st pfsync @ /usr/src/sys/netpfil/pf/if_pfsync.c:1402
2nd pfsync @ /usr/src/sys/netpfil/pf/if_pfsync.c:1429
It's perfectly okay to grab both the main pfsync lock and a bucket lock at the
same time.
We don't need different names for each bucket lock, because we should always
only acquire a single one of those at a time.
MFC after: 1 week
`arc_reclaim_thread()` calls `arc_adjust()` after calling
`arc_kmem_reap_now()`; `arc_adjust()` signals `arc_get_data_buf()` to
indicate that we may no longer be `arc_is_overflowing()`.
The problem is, `arc_kmem_reap_now()` can take several seconds to
complete, has no impact on `arc_is_overflowing()`, but due to how the
code is structured, can impact how long the ARC will remain in the
`arc_is_overflowing()` state.
The fix is to use seperate threads to:
1. keep `arc_size` under `arc_c`, by calling `arc_adjust()`, which
improves `arc_is_overflowing()`
2. keep enough free memory in the system, by calling
`arc_kmem_reap_now()` plus `arc_shrink()`, which improves
`arc_available_memory()`.
illumos/illumos-gate@de753e34f9
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan McDonald <danmcd@joyent.com>
Reviewed by: Tim Kordas <tim.kordas@joyent.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Brad Lewis <brad.lewis@delphix.com>
use them without redefining the value names. New clang no longer
allows to redefine a enum value name to the same value.
Bump __FreeBSD_version, since ports depend on that.
Discussed with: jhb
At this point, all routes should've already been dropped by removing all
members from the bridge. This condition is in-fact KASSERT'd in the line
immediately above where this nop flush was added.
At this point, all routes should've already been dropped by removing all
members from the bridge. This condition is in-fact KASSERT'd in the line
immediately above where this nop flush was added.
and for %ecx after RDTSCP.
Initialize TSC_AUX MSR with CPUID. It allows for userspace to cheaply
identify CPU it was executed on some time ago, which is sometimes useful.
Note: The values returned might be changed in future.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
After r345180 we need to have the appropriate vnet context set to delete an
rtnode in bridge_rtnode_destroy().
That's usually the case, but not when it's called by the STP code (through
bstp_notify_rtage()).
We have to set the vnet context in bridge_rtable_expire() just as we do in the
other STP callback bridge_state_change().
Reviewed by: kevans
bridge_rtnode_zone still has outstanding allocations at the time of
destruction in the current model because all of the interface teardown
happens in a VNET_SYSUNINIT, -after- the MOD_UNLOAD has already been
processed. The SYSUNINIT triggers destruction of the interfaces, which then
attempts to free the memory from the zone that's already been destroyed, and
we hit a panic.
Solve this by virtualizing the uma_zone we allocate the rtnodes from to fix
the ordering. bridge_rtable_fini should also take care to flush any
remaining routes that weren't taken care of when dynamic routes were flushed
in bridge_stop.
Reviewed by: kp
Differential Revision: https://reviews.freebsd.org/D19578
The ext2_nodealloccg() function unlocks the mount point
in case of successful node allocation.
The additional unlocks are not required and should be removed.
PR: 236452
Reported by: pho
MFC after: 3 days
If the spanning tree root interface is removed from the bridge we panic
on the next 'ifconfig'.
While the STP code is notified whenever a bridge member interface is
removed from the bridge it does not clear the bs_root_port. This means
bs_root_port can still point at an bridge_iflist which has been free()d.
The next access to it will panic.
Explicitly check if the interface we're removing in bstp_destroy() is
the root, and if so re-assign the roles, which clears bs_root_port.
Reviewed by: philip
MFC after: 2 weeks
The counters of pf tables are updated outside the rule lock. That means state
updates might overwrite each other. Furthermore allocation and
freeing of counters happens outside the lock as well.
Use counter(9) for the counters, and always allocate the counter table
element, so that the race condition cannot happen any more.
PR: 230619
Submitted by: Kajetan Staszkiewicz <vegeta@tuxpowered.net>
Reviewed by: glebius
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D19558
With new pfil(9) KPI it is possible to pass a void pointer with length
instead of mbuf pointer to a packet filter. Until this commit no filters
supported that, so pfil run through a shim function pfil_fake_mbuf().
Now the ipfw(4) hook named "default-link", that is instantiated when
net.link.ether.ipfw sysctl is on, supports processing pointer/length
packets natively.
- ip_fw_args now has union for either mbuf or void *, and if flags have
non-zero length, then we use the void *.
- through ipfw_chk() we handle mem/mbuf cases differently.
- ether_header goes away from args. It is ipfw_chk() responsibility
to do parsing of Ethernet header.
- ipfw_log() now uses different bpf APIs to log packets.
Although ipfw_chk() is now capable to process pointer/length packets,
this commit adds support for the link level hook only, see
ipfw_check_frame(). Potentially the IP processing hook ipfw_check_packet()
can be improved too, but that requires more changes since the hook
supports more complex actions: NAT, divert, etc.
Reviewed by: ae
Differential Revision: https://reviews.freebsd.org/D19357
IPFW_ARGS_OUT are utilized. They are intented to substitute the "dir"
parameter that is often passes together with args.
- Rename ip_fw_args.oif to ifp and now it is set to either input or
output interface, depending on IPFW_ARGS_IN/OUT bit set.
This has the advantage of being obvious to sniff out the designated prefix
by eye and it has all the right bits set. Comment stolen from ffec.
I've removed bryanv@'s pending question of using the FreeBSD OUI range --
no one has followed up on this with a definitive action, and there's no
particular reason to shoot for it and the administrative overhead that comes
with deciding exactly how to use it.
We currently have two places with identical fake hwaddr generation --
if_vxlan and if_bridge. Lift it into if_ethersubr for reuse in other
interfaces that may also need a fake addr.
Reviewed by: bryanv, kp, philip
Differential Revision: https://reviews.freebsd.org/D19573
The drivers were removed in r344299 so there is no need to keep the
firmware files in the src tree.
Reviewed by: imp, jhibbits, johalun
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19583
The only reference to p1 after a dead store was in a comment so update
the comment to refer to td1.
Submitted by: sbruno
Differential Revision: https://reviews.freebsd.org/D16226
Fix some style while at it.
Submitted by: Johannes Lundberg <johalun0@gmail.com>
MFC after: 1 week
Sponsored by: Limelight Networks
Sponsored by: Mellanox Technologies
The forthcoming microcode update will fix a TSX bug by clobbering PMC3
when TSX instructions are executed (even speculatively). There is an
alternate mode where CPU executes all TSX instructions by aborting
them, in which case PMC3 is still available to OS. Any code that
correctly uses TSX must be ready to handle abort anyway.
Since it is believed that FreeBSD population of hwpmc(4) users is
significantly larger than the population of TSX users, switch the
microcode into TSX abort mode whenever a pmc is allocated, and back to
bug avoidance mode when the last pmc is deallocated.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
extended attributes, the kernel can panic with either "ffs_truncate3"
or with "softdep_deallocate_dependencies: dangling deps".
The problem arises because the flushbuflist() function which is
called to clear out buffers is passed either the V_NORMAL flag to
indicate that it should flush buffer associated with the contents
of the file or the V_ALT flag to indicate that it should flush the
buffers associated with the extended attribute data. The buffers
containing the extended attribute data are identified by having
their BX_ALTDATA flag set in the buffer's b_xflags field. The
BX_ALTDATA flag is set on the buffer when the extended attribute
block is first allocated or when its contents are read in from the
disk.
On a busy system, a buffer may be reused for another purpose, but
the contents of the block that it contained continues to be held
in the main page cache. Each physical page is identified as holding
the contents of a logical block within a specified file (identified
by a vnode). When a request is made to read a file, the kernel first
looks for the block in the existing buffers. If it is not found
there, it checks the page cache to see if it is still there. If
it is found in the page cache, then it is remapped into a new
buffer thus avoiding the need to read it in from the disk.
The bug is that when a buffer request made for an extended attribute
is fulfilled by reconstituting a buffer from the page cache rather
than reading it in from disk, the BX_ALTDATA flag was not being
set. Thus the flushbuflist() function would never clear it out and
the "ffs_truncate3" panic would occur because the vnode being cleared
still had buffers on its clean-buffer list. If the extended attribute
was being updated, it is first read, then updated, and finally
written. If the read is fulfilled by reconstituting the buffer
from the page cache the BX_ALTDATA flag was not set and thus the
dirty buffer would never be flushed by flushbuflist(). Eventually
the buffer would be recycled. Since it was never written it would
have an unfinished dependency which would trigger the
"softdep_deallocate_dependencies: dangling deps" panic.
The fix is to ensure that the BX_ALTDATA flag is set when a buffer
has been reconstituted from the page cache.
PR: 230962
Reported by: 2t8mr7kx9f@protonmail.com
Reviewed by: kib
Tested by: Peter Holm
MFC after: 1 week
Sponsored by: Netflix
isci(4) uses deferred loading. Typically on amd64 and i386 non-PAE
the tag does not create any restrictions, but on i386 PAE-tables but
non-PAE configs callbacks might be used.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks