PFIL_MEMPTR flag are intentionally providing a memory address that
isn't aligned to pointer alignment. This is done to align an IPv4
or IPv6 header that is expected to follow Ethernet header.
When we return PFIL_REALLOCED we store a pointer to allocated mbuf
at this address. With this change the KPI changes to store the pointer
at aligned address, which usually yields in +2 bytes.
Provide two inlines:
pfil_packet_align() to get aligned pfil_packet_t for a misaligned one
pfil_mem2mbuf() to read out mbuf pointer from misaligned pfil_packet_t
Provide function pfil_realloc(), not used yet, that would convert a
memory pfil_packet_t to an mbuf one.
Reported by: hps
Reviewed by: hps, gallatin
Registers visible from 'show reg' don't always match the registers from the
offending trap frame. Knowing the frame address lets one examine the
registers manually.
MFC after: 1 week
In r324227 the comment moved into socketvar.h originally from
sockstate.h r180948. Try to improve English and as a consequence rewrap
the comment.
No functional changes.
Reviewed by: jhb (a wording suggestion)
Differential Revision: https://reviews.freebsd.org/D13865
retries.
When resetting the controller, we abort I/O. Prior to this fix, we
printed a ton of abort messages for I/O that we're going to
retry. This imparts no useful information. Stop printing them unless
our retry count is exhausted. Clarify code for when we don't retry,
and remove useless arg to a routine that's always called with it
as 'true'. All the other debug is still printed (including multiple
reset messages if we have multiple timeouts before the taskqueue
runs the actual reset) so that we know when we reset.
Reviewed by: jimharris@, chuck@
Differential Revision: https://reviews.freebsd.org/D19431
r344504 added an extra ARP_LOG() call in case of an if_output() failure.
It turns out IPv4 can be noisy. In order to not spam the console by default:
(a) add a counter for these events so people can keep better track of how
often it happens, and
(b) add a sysctl to select the default ARP_LOG log level and set it to
INFO avoiding the one (the new) DEBUG level by default.
Claim a spare (1st one after 10 years since the stats were added) in order
to not break netstat from FreeBSD 12->13 updates in the future.
Reviewed by: karels
Differential Revision: https://reviews.freebsd.org/D19490
The LBA weighting makes sense on rotational media where the outer tracks
have twice the bandwidth of the inner tracks. However, it is detrimental
on nonrotational media such as solid state disks, where the only effect
is to ensure that metaslabs enter the best-fit allocation behavior
sooner, which is detrimental to performance. It also makes no sense on
files where the underlying filesystem can arrange things however it
wants.
Author: Richard Yao <ryao@gentoo.org>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3712zfsonlinux/zfs@fb40095f5f
To reduce code divergence this merge replaces equivalent but different
FreeBSD code detecting non-rotating medium vdevs.
MFC after: 1 month
Before sequential scrub patches ZFS never aggregated I/Os above 128KB.
Sequential scrub bumped that to 1MB, which motivation I understand for
spinning disks, since it should reduce number of head seeks. But for
SSDs it makes much less sense to me, especially on FreeBSD, where due
to MAXPHYS limitation device will likely still see bunch of 128KB I/Os
instead of one large. Having more strict aggregation limit allows to
avoid allocation of large memory buffer and memcpy to/from it, that is
a serious problem when bandwidth reaches few GB/s.
MFC after: 1 month
Sponsored by: iXsystems, Inc.
Update the bounds checking for zfs_vdev_aggregation_limit so that
it has a floor of zero and a maximum value of the supported block
size for the pool.
Additionally add an early return when zfs_vdev_aggregation_limit
equals zero to disable aggregation. For very fast solid state or
memory devices it may be more expensive to perform the aggregation
than to issue the IO immediately.
Author: Brian Behlendorf <behlendorf1@llnl.gov>
zfsonlinux/zfs@a58df6f536
MFV/ZoL: Cap maximum aggregate IO size
Commit 8542ef8 allowed optional IOs to be aggregated beyond
the specified aggregation limit. Since the aggregation limit
was also used to enforce the maximum block size, setting
`zfs_vdev_aggregation_limit=16777216` could result in an
attempt to allocate an ABD larger than 16M.
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#6259Closes#6270zfsonlinux/zfs@2d678f779a
r343295 broke DIOCGETSRCNODES by failing to reset 'nr' after counting the
number of source tracking nodes.
This meant that we never copied the information to userspace, leading to '? ->
?' output from pfctl.
PR: 236368
MFC after: 1 week
mlx4_en_stop_port() calls mlx4_en_put_qp() which can refer the link level
address of the network interface, which in turn will be freed by the
network interface detach function. Make sure the port is stopped
before detaching the network interface.
MFC after: 1 week
Sponsored by: Mellanox Technologies
It can happen during shutdown that the lock will recurse when the mlx4en(4)
instance is part of a lagg interface. Call ether_ifdetach() unlocked.
Backtrace:
panic(): _sx_xlock_hard: recursed on non-recursive sx &mdev->state_lock
_sx_xlock_hard()
_sx_xlock()
mlx4_en_ioctl()
if_setlladdr()
lagg_port_destroy()
lagg_port_ifdetach()
if_detach()
mlx4_en_destroy_netdev()
mlx4_en_remove()
mlx4_remove_device()
mlx4_unregister_device()
mlx4_unload_one()
mlx4_shutdown()
linux_pci_shutdown()
bus_generic_shutdown()
MFC after: 1 week
Sponsored by: Mellanox Technologies
The check for early exit should be checking the SLB entry itself. As
currently written it was checking the address of the SLB, which is always
non-zero, so would go through the kernel SR restore loop regardless.
Submitted by: mmacy
MFC after: 2 weeks
The second statements on the lines are not guarded by the `if' condition.
This triggers a warning with newer gcc. It's relatively harmless given the
usage, but incorrect. Instead, wrap the statements so they're properly
guarded.
Reported by: powerpc64-gcc xtoolchain
MFC after: 1 week
Chacha20 with a 256 bit key and 128 bit counter size is a good match for an
AES256-ICM replacement.
In userspace, Chacha20 is typically marginally slower than AES-ICM on
machines with AESNI intrinsics, but typically much faster than AES on
machines without special intrinsics. ChaCha20 does well on typical modern
architectures with SIMD instructions, which includes most types of machines
FreeBSD runs on.
In the kernel, we can't (or don't) make use of AESNI intrinsics for
random(4) anyway. So even on amd64, using Chacha provides a modest
performance improvement in random device throughput today.
This change makes the stream cipher used by random(4) configurable at boot
time with the 'kern.random.use_chacha20_cipher' tunable.
Very rough, non-scientific measurements at the /dev/random device, on a
GENERIC-NODEBUG amd64 VM with 'pv', show a factor of 2.2x higher throughput
for Chacha20 over the existing AES-ICM mode.
Reviewed by: delphij, markm
Approved by: secteam (delphij)
Differential Revision: https://reviews.freebsd.org/D19475
When we roam between networks and our link-state goes down, automatically remove
the IPv6-Only flag from the interface. Otherwise we might switch from an
IPv6-only to and IPv4-only network and the flag would stay and we would prevent
IPv4 from working.
While the actual function call to clear the flag is under EXPERIMENTAL,
the eventhandler is not as we might want to re-use it for other
functionality on link-down event (such was re-calculate default routers
for example if there is more than one).
Reviewed by: hrs
Differential Revision: https://reviews.freebsd.org/D19487
I just found that at least on Skylake CPUs cpu_ticks() never returns odd
values, only even, and possibly has even bigger step (176/2?), that makes
its lower bits very bad entropy source, leaving half of taskqueues unused.
Switch to sbinuptime(), closer to upstreams, mitigates the problem by the
rate conversion working as kind of hash function. In case that is somehow
not enough (timer rate is too low or too divisible) mix in curcpu.
MFC after: 1 week
dyn_install_state() uses `rule` pointer when it creates state.
For O_LIMIT states this pointer actually is not struct ip_fw,
it is pointer to O_LIMIT_PARENT state, that keeps actual pointer
to ip_fw parent rule. Thus we need to cache rule id and number
before calling dyn_get_parent_state(), so we can use them later
when the `rule` pointer is overrided.
PR: 236292
MFC after: 3 days
On GENERIC kernels with empty loader.conf, there is no functional change.
DFLTPHYS and MAXBSIZE are both 64kB at the moment. This change allows
larger bufcache block sizes to be used when either MAXBSIZE (custom kernel)
or the loader.conf tunable vfs.maxbcachebuf (GENERIC) is adjusted higher
than the default.
Suggested by: ken@
All changes are hidden behind the EXPERIMENTAL option and are not compiled
in by default.
Add ND6_IFF_IPV6_ONLY_MANUAL to be able to set the interface into no-IPv4-mode
manually without router advertisement options. This will allow developers to
test software for the appropriate behaviour even on dual-stack networks or
IPv6-Only networks without the option being set in RA messages.
Update ifconfig to allow setting and displaying the flag.
Update the checks for the filters to check for either the automatic or the manual
flag to be set. Add REVARP to the list of filtered IPv4-related protocols and add
an input filter similar to the output filter.
Add a check, when receiving the IPv6-Only RA flag to see if the receiving
interface has any IPv4 configured. If it does, ignore the IPv6-Only flag.
Add a per-VNET global sysctl, which is on by default, to not process the automatic
RA IPv6-Only flag. This way an administrator (if this is compiled in) has control
over the behaviour in case the node still relies on IPv4.
When open(2) was invoked against a FUSE filesystem with an unexpected flags
value (no O_RDONLY / O_RDWR / O_WRONLY), an assertion fired, causing panic.
For now, prevent the panic by rejecting such VOP_OPENs with EINVAL.
This is not considered the correct long term fix, but does prevent an
unprivileged denial-of-service.
PR: 236329
Reported by: asomers
Reviewed by: asomers
Sponsored by: Dell EMC Isilon
* The ani function bitmap was being badly used when determining if a command
could be used. In hostap modes only a couple of the ANI control parameters
are enabled.
* The ani function bitmap was not being reset to HAL_ANI_ALL if transitioning
from AP -> STA.
* Change mrcCckOff to mrcCck - 1 == on, rather than 1 == off. This matches
the API used to set the value from userland via the diagnostic API.
* Handle OFDM/CCK noise immunity level commands in ar9300_ani_control().
These will only come from userland and it will go and program the rest of
the ANI control parameters with the values in the ANI table.
* Ensure all of the ANI parameters can be tweaked at runtime, even if they're
disabled.
Tested:
* carambola2 (AR9331), STA/AP modes
This mirrors the arm64 implementation and is for use in the minidump
code.
Submitted by: Mitchell Horne <mhorne063@gmail.com>
Differential Revision: https://reviews.freebsd.org/D18321
Note that only entries wired by userspace are shown as such. In
particular, entries transiently wired by sysctl_wire_old_buffer() are
not flagged as wired in procstat -v output.
Reviewed by: kib (previous version)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D19461
From Jake:
"The iflib_fl_setup() function tries to pick various buffer sizes based
on the max_frame_size value defined by the parent driver. However, this
code was wrapped under CONTIGMALLOC_WORKS, which was never actually
defined anywhere.
This same code pattern was used in if_em.c, likely trying to match
what iflib uses.
Since CONTIGMALLOC_WORKS is not defined, remove this dead code from
iflib_fl_setup and if_em.c
Given that various iflib drivers appear to be using a similar
calculation, it might be worth making this buffer size a value that the
driver can peek at in the future."
Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: shurd@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D19199
The if_tun cloner is not virtualised, but if_clone_attach() does use a
virtualised list of cloners.
The result is that we can't find the if_tun cloner when we try to remove
a renamed tun interface. Virtualise the cloner, and move the final
cleanup into a sysuninit so that we're sure this happens after all of
the vnet_sysuninits
Note that we need unit numbers to be system-unique (rather than unique
per vnet, as is done by if_clone_simple()). The unit number is used to
create the corresponding /dev/tunX device node, and this node must match
with the interface.
Switch to if_clone_advanced() so that we have control over the unit
numbers.
Reproduction scenario:
jail -c -n foo persist vnet
jexec test ifconfig tun create
jexec test ifconfig tun0 name wg0
jexec test ifconfig wg0 destroy
PR: 235704
Reviewed by: bz, hrs, hselasky
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D19248
In revision 254095, gpt_entries is not set to match the on-disk
hdr_entries, but rather is computed based on available space.
There are 2 problems with this:
1. The GPT backend respects hdr_entries and only reads and writes
that number of partition entries. On top of that, CRC32 is
computed over the table that has hdr_entries elements. When
the common code works on what is possibly a larger number, the
behaviour becomes inconsistent and problematic. In particular,
it would be possible to add a new partition that on a reboot
isn't there anymore.
2. The calculation of gpt_entries is based on flawed assumptions.
The GPT specification does not dictate that sectors are layed
out in a particular way that the available space can be
determined by looking at LBAs. In practice, implementations
do the same thing, because there's no reason to do it any
other way. Still, GPT allows certain freedoms that can be
exploited in some form or shape if the need arises.
PR: 229977
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D19438
Mask off the bits we don't care about when checking that capabilities
of the member interfaces have been disabled as intended.
Submitted by: Ryan Moeller <ryan@ixsystems.com>
Reviewed by: kristof, mav
MFC after: 1 week
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D18924
of it being explicitly passed as an argument. No functional changes.
The big picture here is that I want to get rid of the 'td' argument
being passed everywhere, and this is the first piece that affects
the NFS server.
Reviewed by: rmacklem
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D19417
Add more on-disk superblock consistency checks to ext2_compute_sb_data() function.
It should decrease the probability of mounting filesystems with corrupted superblock data.
Reviewed by: pfg
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D19322
I'm trying to debug why reception upstairs here is so terrible and it
turns out ANI is buggy. (Which is no surprise, ANI is always buggy.)
Tested:
* Carambola2 (AR9331), STA/AP modes
- Fix data frames transmission via POWER_STATUS register setup -
it seems to be set by MACID_CONFIG firmware command, which was broken*
in r290439 and later disabled in r307529.
We can re-enable it later if / when firmware rate adaptation will be
ready; however, this step will be required anyway - for firmware-less
builds.
- Force RTS / CTS protection frame rate to CCK1 (this rate works fine
without any additional setup; no better workaround is known yet).
The problem was not observed on the channel 1 or with CCK1 rate enforced
('ifconfig wlan0 ucastrate 1' for 11 b/g; not possible for 11n networks
due to ifconfig(8) bug).
* I'm not sure if it works before r290439 because - AFAIR - I never seen
firmware rate adaptation working for 10-STABLE urtwn(4)
(It needs EN_BCN bit set and RSSI updates at least).
Tested with RTL8188CUS in STA mode
(in regular mode and with disabled MRR - DARFRC*8 is set to 0)
PR: 233949
MFC after: 2 weeks
Since in most configurations CTL serves as network service, we found
that this change improves local system interactivity under heavy load.
Priority of main threads is set slightly higher then worker taskqueues
to make them quickly sort incoming requests not creating bottlenecks,
while plenty of worker taskqueues should be less sensitive to latency.
MFC after: 1 week
Sponsored by: iXsystems, Inc.
to call soabort() on a newborn socket created by sonewconn() in case
if further setup of PCB failed. Code in sofree() handles such socket
correctly.
Submitted by: jtl, rrs
MFC after: 3 weeks
FDT data. The sector size must be a multiple of the device's page size.
If not configured, use the historical default of the device page size.
Setting the disk sector size to 512 or 4096 allows a variety of standard
filesystems to be used on the device. Of course you wouldn't want to be
writing frequently to a SPI flash chip like it was a disk drive, but for
data that gets written once (or rarely) and read often, using a standard
filesystem is a nice convenient thing.
some #define'd names to be more descriptive. When reporting a post-write
compare failure, report the page number, not the byte address of the page.
The latter is the only functional change, it makes the number match the
words of the error message.
This is especially important for writes. SPI is inherently a bidirectional
bus; you receive data (even if it's garbage) while writing. We should not
receive that data into the same buffer we're writing to the device.
When reading it doesn't matter what we send to the device, but using the
dummy buffer for that as well is pleasingly symmetrical.
On very large powerpc64 systems (2x22x4 power9) it's very easy to run out of
available IRQs and crash the system at boot. Scale the count by mp_ncpus,
similar to x86, so this doesn't happen. Further work can be done in the future
to scale the I/O IRQs as well, but that's left for the future.
Submitted by: mmacy
MFC after: 3 weeks
When a vCPU is HLTed, interrupts with a priority below the processor
priority (PPR) should not resume the vCPU while interrupts at or above
the PPR should. With posted interrupts, bhyve maintains a bitmap of
pending interrupts in PIR descriptor along with a single 'pending'
bit. This bit is checked by a CPU running in guest mode at various
places to determine if it should be checked. In addition, another CPU
can force a CPU in guest mode to check for pending interrupts by
sending an IPI to a special IDT vector reserved for this purpose.
bhyve had a bug in that it would only notify a guest vCPU of an
interrupt (e.g. by sending the special IPI or by resuming it if it was
idle due to HLT) if an interrupt arrived that was higher priority than
PPR and no interrupts were currently pending. This assumed that if
the 'pending' bit was set, any needed notification was already in
progress. However, if the first interrupt sent to a HLTed vCPU was
lower priority than PPR and the second was higher than PPR, the first
interrupt would set 'pending' but not notify the vCPU, and the second
interrupt would not notify the vCPU because 'pending' was already set.
To fix this, track the priority of pending interrupts in a separate
per-vCPU bitmask and notify a vCPU anytime an interrupt arrives that
is above PPR and higher than any previously-received interrupt.
This was found and debugged in the bhyve port to SmartOS maintained by
Joyent. Relevant SmartOS bugs with more background:
https://smartos.org/bugview/OS-6829https://smartos.org/bugview/OS-6930https://smartos.org/bugview/OS-7354
Submitted by: Patrick Mooney <pmooney@pfmooney.com>
Reviewed by: tychon, rgrimes
Obtained from: SmartOS / Joyent
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D19299
As a step towards adding other potential streaming ciphers. As well as just
pushing the loop down into the rijndael APIs (basically 128-bit wide AES-ICM
mode) to eliminate some excess explicit_bzero().
No functional change intended.
Reviewed by: delphij, markm
Approved by: secteam (delphij)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D19411
level socket option SCTP_GET_LOCAL_ADDRESSES in a getsockopt() call.
Thanks to Thomas Barabosch for reporting the issue which was found by
running syzkaller.
MFC after: 3 days
In all of the architectures we have today, we always use PAGE_SIZE.
While in theory one could define different things, none of the
current architectures do, even the ones that have transitioned from
32-bit to 64-bit like i386 and arm. Some ancient mips binaries on
other systems used 8k instead of 4k, but we don't support running
those and likely never will due to their age and obscurity.
Reviewed by: imp (who also contributed the commit message)
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D19280
When porting code once written for Linux we find not only uints but also ushort and ulong.
Provide central typedefs as part of the linuxkpi for those as well.
Reviewed by: hselasky, emaste
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19405
The plls frequency are now correctly calculated in fractional mode
and integer mode.
While here add some debug printfs (disabled by default)
Tested with powerd on the little cluster on a RockPro64.
MFC after: 1 week
We mistakenly used the extoff value from the last packet to patch the
next_header field. If a malicious host sends a chain of fragmented packets
where the first packet and the final packet have different lengths or number of
extension headers we'd patch the next_header at the wrong offset.
This can potentially lead to panics or rule bypasses.
Security: CVE-2019-5597
Obtained from: OpenBSD
Reported by: Corentin Bayet, Nicolas Collignon, Luca Moro at Synacktiv
Firmware needed by petitboot, for example, GPU firmware, can be installed to
a partition in the flash filesystem. This driver exposes the full flash
given by the device tree, letting the user manage firmware, etc, from
FreeBSD.
To use the partitions provided by the flash module, the fdt_slicer module is
needed, but the module isn't needed for raw access, so there's no direct
dependency link in here.
MFC after: 2 weeks
For some reason this seems to be required on aarch64, but I can build armv7
from clean without needing this in the list. (The file does get included,
so the mystery is why armv7 works.)
The OPAL firmware only supports a finite number of in-flight asynchronous
operations. Rather than have each subsystem try to manage its own, use a
central management service to hand out tokens.
More work can be done to improve asynchronous behavior, such as funneling
things through a future OPAL heartbeat handler, but capabilities will be
added as needed.
Augment the existing consumers (i2c and sensors) to use this new API.
MFC after: 4 weeks