a vdev that has the same name as the one stored in metadata and that has
all VDEV labels in place. If it cannot find a GEOM provider with the given
name and all VDEV labels it will scan all GEOM providers for the best match
(the most VDEV labels available), but here the name is ignored.
In case the ZFS pool is created, eg. using GPT partition label:
# zpool create tank /dev/gpt/tank
everything works, and on every import ZFS will pick /dev/gpt/tank and
not /dev/da0p4.
The problem occurs when da0p4 is extended and ZFS is unable to find all
VDEV labels in /dev/gpt/tank anymore (the VDEV labels stored at the end
of the partition are now somewhere else). In this case it will scan all
GEOM providers and will pick the first one with the best match, ie. da0p4.
Fix this problem by checking the VDEV/provider name even if we get the same
match. If the name is the same as the one we have in pool's metadata, prefer
this GEOM provider.
Reported by: oshogbo, Michal Mroz <m.mroz@fudosecurity.com>
Tested by: Michal Mroz <m.mroz@fudosecurity.com>
Obtained from: Fudo Security
Right now it's possible to invoke the REP escape sequence with a maximum
of tens of millions of iterations. In practice, there is never any need
to do this. Calling it more frequently than the number of cells in the
terminal hardly makes any sense. By placing a limit on it, we can
prevent users from exhausting resources in inside the terminal emulator.
As support for this escape sequence is not present in any of the stable
branches, there is no need to MFC.
Reported by: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=11255
The DIOCGETZONE ioctl can be used to fetch the zone list of an SMR
drive, and the caller specifies the number of entries it wants to fetch.
Clamp the caller's request to a sane limit so that a user cannot attempt
large allocations. Callers already need to invoke the ioctl multiple
times to fetch the full list in general, so there's no harm in limiting
the number of entries returned.
Fix style while here.
admbug: 807
Reported by: Ilja Van Sprundel <ivansprundel@ioactive.com>
Reviewed by: asomers, ken
Tested by: ken
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19249
Otherwise a privileged user can trigger a memory allocation of
unbounded size, or an integer overflow in the subsequent
geom_alloc_copyin() call, leading to out-of-bounds accesses.
Hard-code a large limit to circumvent this problem.
admbug: 854
Reported by: Anonymous of the Shellphish Grill Team
Reviewed by: ae
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19251
When dropping a fragment queue, account for the number of fragments in the
queue. This improves accounting between the number of fragments received and
the number of fragments dropped.
Reviewed by: jtl, bz, transport
Approved by: jtl (mentor), bz (mentor)
Differential Revision: https://review.freebsd.org/D17521
Per discussions on arch@ and elsewhere, the maintenance of this code
has moved to the drm-kmod and drm-legacy-kmod ports. Remove the i915
and radeon drivers from the tree.
Approved by: graphics team
Reviewed by: manu@, mmel@
Differential Revision: https://reviews.freebsd.org/D19196
Remove support for compiling drm2 as a module. This has transitioned
to the drm-kmod or drm-legacy-kmodw ports.
Approved by: graphics team
Reviewed by: manu@, mmel@
Differential Revision: https://reviews.freebsd.org/D19196
Retire the drm modules / drivers. These are now handled by the
drm-legacy-kmod port and/or the drm-kmod port. All future
development and maintanace will be handled there.
Approved by: graphics team
Reviewed by: manu@, mmel@
Differential Revision: https://reviews.freebsd.org/D19196
EVFILT_WRITE knotes for pipes live on the knlist for the other end of the
pipe. Since they do not hold a reference on the corresponding file
structure, they may be removed from the knlist by pipeclose() while still
remaining active. In this case, there is no knlist lock acquired before
filt_pipewrite() is called, so the assertion fails.
Fix the problem by first checking whether that end of the pipe has been
closed. These checks are memory safe since the knote holds a reference
on one end of the pipe, and the pipe structure is not freed until both
ends are closed. The checks are not racy since PIPE_EOF is never cleared
after being set, and pipe_present is never set back to PIPE_ACTIVE after
pipeclose() has been called.
PR: 235640
Reported and tested by: pho
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19224
that can happen when rerooting into NFSv4 rootfs with kernel
built with INVARIANTS.
I've talked to rmacklem@ (back in 2017), and while the root cause
is still unknown, the case guarded by assertion (nfscl_doclose()
being called from VOP_INACTIVE) is believed to be safe, and the
whole thing seems to run just fine.
Obtained from: CheriBSD
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
The old value resulted in bad performance, with high kernel
and gssd(8) load, with more than ~64 clients; it also triggered
crashes, which are to be fixed by a different patch.
PR: 235582
Discussed with: rmacklem@
MFC after: 2 weeks
it easily. This can lower the load on gssd(8) on large NFS servers.
Submitted by: Per Andersson <pa at chalmers dot se>
Reviewed by: rmacklem@
MFC after: 2 weeks
Sponsored by: Chalmers University of Technology
The pmap_works variable is always true for amd64. Remove it, the
branch in the initialization taken when false, and corresponding
sysctl.
Remove pat_table[] local array, work on pat_index[] directly.
Collapse whole initialization to not override already assigned values.
Add comment explaining the choice for PAT4 and PAT7.
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
MFC note: Leave the sysctl around
Differential revision: https://reviews.freebsd.org/D19225
This change adds a counter (kqueue_users) to keep track of how many
kqueue users are referencing a given struct nm_selinfo.
In this way, nm_os_selwakeup() can schedule the kevent notification
task only when kqueue is actually being used.
This is important to avoid wasting CPU in the common case where
kqueue is not used.
Reviewed by: Aleksandr Fedorov <aleksandr.fedorov@itglobal.com>
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D19177
Ensure __bss_start is associated with the next section
in case orphan sections are placed directly after .sdata,
as has been seen to happen with LLD.
Submitted by: "J.R.T. Clarke" <jrtc4@cam.ac.uk>
Differential Revision: https://reviews.freebsd.org/D18429
In r343986 we introduced a double free. The structure was already
freed fixed in the r302966. This problem was introduced
because the GitHub version was out of sync with the FreeBSD one.
Submitted by: Mindaugas Rasiukevicius <rmind@netbsd.org>
MFC with: r343986
Some argument validation error paths would return without releasing the
file reference obtained at the beginning of the function.
While here, fix some style bugs and remove trivial debug prints.
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19214
Use the object pointer itself to determine whether the object is locked.
No functional change intended.
Reviewed by: kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19215
The charging current can be set using steps
from 0: 200mA to 13: 2800mA (200mA/step).
While there, fix battery charging current related
sensor descriptions.
Reviewed by: manu
Differential Revision: https://reviews.freebsd.org/D19212
back to the lever before r343030. For 64-bit machines reduce it slightly,
too. Together with r343030 I bumped the limit up to the value we use at
Netflix to serve 100 Gbit/s of sendfile traffic, and it probably isn't a
good default.
Provide a loader tunable to change vnode pager pbufs count. Document it.
The cached fvdat->filesize is indepedent of the (mostly unused)
cached_attrs, and we failed to update it when a cached (but perhaps
inactive) vnode was found during VOP_LOOKUP to have a different size than
cached.
As noted in the code comment, this can occur in distributed filesystems or
with other kinds of irregular file behavior (anything is possible in FUSE).
We do something similar in fuse_vnop_getattr already.
PR: 230258 (as reported in description; other issues explored in
comments are not all resolved)
Reported by: MooseFS FreeBSD Team <freebsd AT moosefs.com>
Submitted by: Jakub Kruszona-Zawadzki <acid AT moosefs.com> (earlier version)
At least prior to 7.23 (which adds FUSE_WRITEBACK_CACHE), the FUSE protocol
specifies only clean data to be cached.
Prior to this change, we implement and default to writeback caching. This
is ok enough for local only filesystems without hardlinks, but violates the
general design contract with FUSE and breaks distributed filesystems or
concurrent access to hardlinks of the same inode.
In this change, add cache mode as an extension of cache enable/disable. The
new modes are UC (was: cache disabled), WT (default), and WB (was: cache
enabled).
For now, WT caching is implemented as write-around, which meets the goal of
only caching clean data. WT can be better than WA for workloads that
frequently read data that was recently written, but WA is trivial to
implement. Note that this has no effect on O_WRONLY-opened files, which
were already coerced to write-around.
Refs:
* https://sourceforge.net/p/fuse/mailman/message/8902254/
* https://github.com/vgough/encfs/issues/315
PR: 230258 (inspired by)
Most users of fuse_vnode_setsize() set the cached fvdat->filesize and update
the buf cache bounds as a result of either a read from the underlying FUSE
filesystem, or as part of a write-through type operation (like truncate =>
VOP_SETATTR). In these cases, do not set the FN_SIZECHANGE flag, which
indicates that an inode's data is dirty (in particular, that the local buf
cache and fvdat->filesize have dirty extended data).
PR: 230258 (related)
The FUSE protocol demands that kernel implementations cache user filesystem
path components (lookup/cnp data) for a maximum period of time in the range
of [0, ULONG_MAX] seconds. In practice, typical requests are for 0, 1, or
10 seconds; or "a long time" to represent indefinite caching.
Historically, FreeBSD FUSE has ignored this client directive entirely. This
works fine for local-only filesystems, but causes consistency issues with
multi-writer network filesystems.
For now, respect 0 second cache TTLs and do not cache such metadata.
Non-zero metadata caching TTLs in the range [0.000000001, ULONG_MAX] seconds
are still cached indefinitely, because it is unclear how a userspace
filesystem could do anything sensible with those semantics even if
implemented.
Pass fuse_entry_out to fuse_vnode_get when available and only cache lookup
if the user filesystem did not set a zero second TTL.
PR: 230258 (inspired by; does not fix)
The FUSE protocol demands that kernel implementations cache user filesystem
file attributes (vattr data) for a maximum period of time in the range of
[0, ULONG_MAX] seconds. In practice, typical requests are for 0, 1, or 10
seconds; or "a long time" to represent indefinite caching.
Historically, FreeBSD FUSE has ignored this client directive entirely. This
works fine for local-only filesystems, but causes consistency issues with
multi-writer network filesystems.
For now, respect 0 second cache TTLs and do not cache such metadata.
Non-zero metadata caching TTLs in the range [0.000000001, ULONG_MAX] seconds
are still cached indefinitely, because it is unclear how a userspace
filesystem could do anything sensible with those semantics even if
implemented.
In the future, as an optimization, we should implement notify_inval_entry,
etc, which provide userspace filesystems a way of evicting the kernel cache.
One potentially bogus access to invalid cached attribute data was left in
fuse_io_strategy. It is restricted behind the undocumented and non-default
"vfs.fuse.fix_broken_io" sysctl or "brokenio" mount option; maybe these are
deadcode and can be eliminated?
Some minor APIs changed to facilitate this:
1. Attribute cache validity is tracked in FUSE inodes ("fuse_vnode_data").
2. cache_attrs() respects the provided TTL and only caches in the FUSE
inode if TTL > 0. It also grows an "out" argument, which, if non-NULL,
stores the translated fuse_attr (even if not suitable for caching).
3. FUSE VTOVA(vp) returns NULL if the vnode's cache is invalid, to help
avoid programming mistakes.
4. A VOP_LINK check for potential nlink overflow prior to invoking the FUSE
link op was weakened (only performed when we have a valid attr cache). The
check is racy in a multi-writer network filesystem anyway -- classic TOCTOU.
We have to trust any userspace filesystem that rejects local caching to
account for it correctly.
PR: 230258 (inspired by; does not fix)
In out of order mode Rx buffer are accesses by req_id.
Accessing and validating mbuf using ntc is causing false error.
Increase driver revision after latest RX OOO completion fixes.
Submitted by: Rafal Kozik <rk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
MFC after: 1 week
Requested ID should be validated when the packet is received and not
when the driver is repopulating the mbufs.
Submitted by: Michal Krawczyk <mk@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
MFC after: 1 week
This commit essentially has three parts:
* Add the AES-CCM encryption hooks. This is in and of itself fairly small,
as there is only a small difference between CCM and the other ICM-based
algorithms.
* Hook the code into the OpenCrypto framework. This is the bulk of the
changes, as the algorithm type has to be checked for, and the differences
between it and GCM dealt with.
* Update the cryptocheck tool to be aware of it. This is invaluable for
confirming that the code works.
This is a software-only implementation, meaning that the performance is very
low.
Sponsored by: iXsystems Inc.
Differential Revision: https://reviews.freebsd.org/D19090
This adds the CBC-MAC code to the kernel, but does not hook it up to
anything (that comes in the next commit).
https://tools.ietf.org/html/rfc3610 describes the algorithm.
Note that this is a software-only implementation, which means it is
fairly slow.
Sponsored by: iXsystems Inc
Differential Revision: https://reviews.freebsd.org/D18592
The previous fix was unnecessarily very slow up to 105 hours where the
simple formula used previously worked, and unnecessarily slow by a factor
of about 5/3 up to 388 days, and didn't work above 388 days. 388 days is
not a long time, since it is a reasonable uptime, and for processes the
times being calculated are aggregated over all threads, so with N CPUs
running the same thread a runtime of 388 days is reachable after only
388 / N physical days.
The PRs document overflow at 388 days, but don't try to fix it.
Use the simple formula up to 76 hours. Then use a complicated general
method that reduces to the simple formula up to a bit less than 105
hours, then reduces to the previous method without its extra work up
to almost 388 days, then does more complicated reductions, usually
many bits at a time so that this is not slow. This works up to half
of maximum representable time (292271 years), with accumulated rounding
errors of at most 32 usec.
amd64 can do all this with no avoidable rounding errors in an inline
asm with 2 instructions, but this is too special to use. __uint128_t
can do the same with 100's of instructions on 64-bit arches. Long
doubles with at least 64 bits of precision are the easiest method to
use on i386 userland, but are hard to use in the kernel.
PR: 76972 and duplicates
Reviewed by: kib
Don't use a struct if_irq for IFLIB_INTR_IOV type interrupts since that results
in get_core_offset() being called on them, and get_core_offset() doesn't
handle IFLIB_INTR_IOV type interrupts, which results in an assert() being triggered
in iflib_irq_set_affinity().
PR: 235730
Reported by: Jeffrey Pieper <jeffrey.e.pieper@intel.com>
MFC after: 1 day
Sponsored by: Intel Corporation
Make the clustering enabling knob more fine-grained by providing a
setting where the allocation with hint is not clustered. This is aimed
to be somewhat more compatible with e.g. go 1.4 which expects that
hinted mmap without MAP_FIXED does not change the allocation address.
Now the vm.cluster_anon can be set to 1 to only cluster when no hints,
and to 2 to always cluster. Default value is 1.
Requested by: peter
Reviewed by: emaste, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D19194
When sigreturn() restored a thread's context, SRR1 was being restored
to its previous value, but pcb_flags was not being touched.
This could cause a mismatch between the thread's MSR and its pcb_flags.
For instance, when the thread used the FPU for the first time inside
the signal handler, sigreturn() would clear SRR1, but not pcb_flags.
Then, the thread would return with the FPU bit cleared in MSR and,
the next time it tried to use the FPU, it would fail on a KASSERT
that checked if the FPU was disabled.
This change clears the FPU bit in both pcb_flags and frame->srr1,
as the code that restores the context expects to use the FPU trap
to re-enable it.
PR: 234539
Reported by: sbruno
Reviewed by: jhibbits, sbruno
Differential Revision: https://reviews.freebsd.org/D19166
Some older compilers, when generating PIC code, cannot handle inline
asm that clobbers %ebx (because %ebx is used as the GOT offset
register). Userspace versions avoid clobbering %ebx by saving it to
stack before executing the CPUID instruction.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
This reduces the overhead of TLB invalidations by ensuring that we
only interrupt CPUs which are using the given pmap. Tracking is
performed in pmap_activate(), which gets called during context switches:
from cpu_throw(), if a thread is exiting or an AP is starting, or
cpu_switch() for a regular context switch.
For now, pmap_sync_icache() still must interrupt all CPUs.
Reviewed by: kib (earlier version), jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18874
This includes support for pmap_enter(..., psind=1) as described in the
commit log message for r321378.
The changes are largely modelled after amd64. arm64 has more stringent
requirements around superpage creation to avoid the possibility of TLB
conflict aborts, and these requirements do not apply to RISC-V, which
like amd64 permits simultaneous caching of 4KB and 2MB translations for
a given page. RISC-V's PTE format includes only two software bits, and
as these are already consumed we do not have an analogue for amd64's
PG_PROMOTED. Instead, pmap_remove_l2() always invalidates the entire
2MB address range.
pmap_ts_referenced() is modified to clear PTE_A, now that we support
both hardware- and software-managed reference and dirty bits. Also
fix pmap_fault_fixup() so that it does not set PTE_A or PTE_D on kernel
mappings.
Reviewed by: kib (earlier version)
Discussed with: jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18863
Differential Revision: https://reviews.freebsd.org/D18864
Differential Revision: https://reviews.freebsd.org/D18865
Differential Revision: https://reviews.freebsd.org/D18866
Differential Revision: https://reviews.freebsd.org/D18867
Differential Revision: https://reviews.freebsd.org/D18868
But ipsec_delete_pcbpolicy() uses some VNET-virtualized variables,
and thus it needs VNET context, that is missing during gtaskqueue
executing. Use inp_vnet context to set curvnet in in_pcbfree_deferred().
PR: 235684
MFC after: 1 week
ratelimiting code. The two modules (lagg and vlan) did have
allocation routines, and even though they are indirect (and
vector down to the underlying interfaces) they both need to
have a free routine (that also vectors down to the actual interface).
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D19032
Newer cores have the 'tlbilx' instruction, which doesn't broadcast over
CoreNet. This is significantly faster than walking the TLB to invalidate
the PID mappings. tlbilx with the arguments given takes 131 clock cycles to
complete, as opposed to 512 iterations through the loop plus tlbre/tlbwe at
each iteration.
MFC after: 3 weeks
The panic message lead people to believe some userland CAM request had
caused a problem when in reallity it was for a kernel request (eg the
USER bit was cleared). Reword message. Also, improve a couple of
comments to reflect that the periph shouldn't be completely torn down
before we get here (so the path and sim pointers should be valid, but
aren't and the code is designed to be robust enough in the face of
that to give a specific panic message).
So far, intr_{g,s}etaffinity(9) take a single int for identifying
a device interrupt. This approach doesn't work on all architectures
supported, as a single int isn't sufficient to globally specify a
device interrupt. In particular, with multiple interrupt controllers
in one system as found on e. g. arm and arm64 machines, an interrupt
number as returned by rman_get_start(9) may be only unique relative
to the bus and, thus, interrupt controller, a certain device hangs
off from.
In turn, this makes taskqgroup_attach{,_cpu}(9) and - internal to
the gtaskqueue implementation - taskqgroup_attach_deferred{,_cpu}()
not work across architectures. Yet in turn, iflib(4) as gtaskqueue
consumer so far doesn't fit architectures where interrupt numbers
aren't globally unique.
However, at least for intr_setaffinity(..., CPU_WHICH_IRQ, ...) as
employed by the gtaskqueue implementation to bind an interrupt to a
particular CPU, using bus_bind_intr(9) instead is equivalent from
a functional point of view, with bus_bind_intr(9) taking the device
and interrupt resource arguments required for uniquely specifying a
device interrupt.
Thus, change the gtaskqueue implementation to employ bus_bind_intr(9)
instead and intr_{g,s}etaffinity(9) to take the device and interrupt
resource arguments required respectively. This change also moves
struct grouptask from <sys/_task.h> to <sys/gtaskqueue.h> and wraps
struct gtask along with the gtask_fn_t typedef into #ifdef _KERNEL
as userland likes to include <sys/_task.h> or indirectly drags it
in - for better or worse also with _KERNEL defined -, which with
device_t and struct resource dependencies otherwise is no longer
as easily possible now.
The userland inclusion problem probably can be improved a bit by
introducing a _WANT_TASK (as well as a _WANT_MOUNT) akin to the
existing _WANT_PRISON etc., which is orthogonal to this change,
though, and likely needs an exp-run.
While at it:
- Change the gt_cpu member in the grouptask structure to be of type
int as used elswhere for specifying CPUs (an int16_t may be too
narrow sooner or later),
- move the gtaskqueue_enqueue_fn typedef from <sys/gtaskqueue.h> to
the gtaskqueue implementation as it's only used and needed there,
- change the GTASK_INIT macro to use "gtask" rather than "task" as
argument given that it actually operates on a struct gtask rather
than a struct task, and
- let subr_gtaskqueue.c consistently use __func__ to print functions
names.
Reported by: mmel
Reviewed by: mmel
Differential Revision: https://reviews.freebsd.org/D19139
Gratuitous ARP packets are sent from a timer, which means we don't have a vnet
context set. As a result we panic trying to send the packet.
Set the vnet context based on the interface associated with the interface
address.
To reproduce:
sysctl net.link.ether.inet.garp_rexmit_count=2
ifconfig vtnet1 10.0.0.1/24 up
PR: 235699
Reviewed by: vangyzen@
MFC after: 1 week
o Correct the obvious bugs in the netmap(4) parts:
- No longer check for the existence of DMA maps as bus_dma(9)
is used unconditionally in iflib(4) since r341095.
- Supply the correct DMA tag and map pairs to bus_dma(9)
functions (see also the commit message of r343753).
- In iflib_netmap_timer_adjust(), add synchronization of the
TX descriptors before calling the ift_txd_credits_update
method as the latter evaluates the TX descriptors possibly
updated by the MAC.
- In _task_fn_tx(), wrap the netmap(4)-specific bits in
#ifdef DEV_NETMAP just as done in _task_fn_admin() and
_task_fn_rx() respectively.
o In iflib_fast_intr_rxtx(), synchronize the TX rather than
the RX descriptors before calling the ift_txd_credits_update
method (see also above).
o There's no need to synchronize an RX buffer that is going to
be recycled in iflib_rxd_pkt_get(), yet; it's sufficient to
do that as late as passing RX buffers to the MAC via the
ift_rxd_refill method. Hence, combine that synchronization
with the synchronization of new buffers into a common spot
in _iflib_fl_refill().
o There's no need to synchronize the RX descriptors of a free
list in preparation of the MAC updating their statuses with
every invocation of rxd_frag_to_sd(); it's enough to do this
once before handing control over to the MAC, i. e. before
calling ift_rxd_flush method in _iflib_fl_refill(), which
already performs the necessary synchronization.
o Given that the ift_rxd_available method evaluates the RX
descriptors which possibly have been altered by the MAC,
synchronize as appropriate beforehand. Most notably this
is now done in iflib_rxd_avail(), which in turn means that
we don't need to issue the same synchronization yet again
before calling the ift_rxd_pkt_get method in iflib_rxeof().
o In iflib_txd_db_check(), synchronize the TX descriptors
before handing them over to the MAC for transmission via
the ift_txd_flush method.
o In iflib_encap(), move the TX buffer synchronization after
the invocation of the ift_txd_encap() method. If the MAC
driver fails to encapsulate the packet and we retry with
a defragmented mbuf chain or finally fail, the cycles for
TX buffer synchronization have been wasted. Synchronizing
afterwards matches what non-iflib(4) drivers typically do
and is sufficient as the MAC will not actually start with
the transmission before - in this case - the ift_txd_flush
method is called.
Moreover, for the latter reason the synchronization of the
TX descriptors in iflib_encap() can go as it's enough to
synchronize them before passing control over to the MAC by
issuing the ift_txd_flush() method (see above).
o In iflib_txq_can_drain(), only synchronize TX descriptors
if the ift_txd_credits_update method accessing these is
actually called.
Differential Revision: https://reviews.freebsd.org/D19081
At moea64_sync_icache(), when the 'va' argument has page size
alignment, round_page() will return the same value as 'va'.
This would cause 'len' to be 0 and thus an infinite loop.
With this change, 'lim' will always point to the next page boundary.
This issue occurred especially during debugging sessions, when a breakpoint
was placed on an exact page-aligned offset, for instance.
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D19149
option.
This issue was found by running syzkaller on OpenBSD.
Greg Steuck made me aware that the problem might also exist on FreeBSD.
Reported by: Greg Steuck
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D18834
As a followup to r343673, unsign some variables related to allocation
since the hashsize cannot be negative. This gives a bit more space to
handle bigger allocations and avoid some implicit casting.
While here also unsign uh_hashmask, it makes little sense to keep that
signed.
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D19148
This will allow upstream consumers, e.g., capsicum-test and third-party
packages (via ports(7)), to test for a specific `__FreeBSD_version__` and
expect `renameat(2)` to be functional.
PR: 222258
Approved by: emaste (mentor)
Reviewed by: emaste
MFC with: r343891
Differential Revision: https://reviews.freebsd.org/D19154
In `probedone()`, for the `PROBE_REPORT_LUNS` case, all paths that
fall to the bottom of the case set `lp` to `NULL`, so the test for a
non-NULL value of `lp` and call to `free()` if true is dead code as
the test can never be true. Fix by eliminating the whole if
statement. To guard against a possible future change that accidentally
violates this assumption, use a `KASSERT()` to catch if `lp` is
non-NULL.
Reviewed by: cem
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D19109
When pci_realloc_bars was first added, the intention was to eventually
enable it by default, but it was left disabled to preserve existing
behavior. The setting is pretty conservative in that it does not
attempt to allocate resources for BARs that the BIOS/firmware leaves
disabled. It only attempts to reallocate resources for a BAR that the
firmware programmed during boot but that conflicts with another
resource during the kernel's device scan.
PR 221350 is an example of a machine that this knob fixes.
Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D18965
Initially it was introduced because parent rule pointer could be freed,
and rule's information could become inaccessible. In r341471 this was
changed. And now we don't need this information, and also it can become
stale. E.g. rule can be moved from one set to another. This can lead
to parent's set and state's set will not match. In this case it is
possible that static rule will be freed, but dynamic state will not.
This can happen when `ipfw delete set N` command is used to delete
rules, that were moved to another set.
To fix the problem we will use the set number from parent rule.
Obtained from: Yandex LLC
MFC after: 1 week
Sponsored by: Yandex LLC
Without this fix, the usage of kernel coverage would lockup the system.
Thanks to Andrew for suggesting the final form of the fix.
PR: 235611
Reviewed by: andrew@, emaste@
Differential Revision: https://reviews.freebsd.org/D19135
battery charging, charge state, voltage, charging current, discharging current,
battery capacity etc. can be obtained via sysctl.
Reviewed by: manu
Differential Revision: https://reviews.freebsd.org/D19145
The hardcoded ident is exactly 20 bytes long but sprintf adds terminating zero,
so there is one byte written out of array bounds.As a fix use strncpy it
appends \0 only if space allows and its behavior matches virtio spec:
When VIRTIO_BLK_T_GET_ID is issued, the device identifier, up to 20 bytes, is
written to the buffer. The identifier should be interpreted as an ascii string.
It is terminated with \0, unless it is exactly 20 bytes long.
PR: 202298
Reviewed by: br
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D18852
In general, the time savings come from separating the active and
inactive queues lists into separate interface and non-interface queue
lists, and changing the rule and queue tag management from list-based
to hash-bashed.
In HFSC, a linear scan of the class table during each queue destroy
was also eliminated.
There are now two new tunables to control the hash size used for each
tag set (default for each is 128):
net.pf.queue_tag_hashsize
net.pf.rule_tag_hashsize
Reviewed by: kp
MFC after: 1 week
Sponsored by: RG Nets
Differential Revision: https://reviews.freebsd.org/D19131
nvpair_create_stringv: free the temporary string; this fix affects
nvlist_add_stringf() and nvlist_add_stringv().
nvpair_remove_nvlist_array (NV_TYPE_NVLIST_ARRAY case): free the chain
of nvpairs (as resetting it prevents nvlist_destroy() from freeing it).
Note: freeing the chain in nvlist_destroy() is not sufficient, because
it would still leak through nvlist_take_nvlist_array(). This affects
all nvlist_*_nvlist_array() use
Submitted by: Mindaugas Rasiukevicius <rmind@netbsd.org>
Reported by: clang/gcc ASAN
MFC after: 2 weeks
PR: 76972 and duplicates
Reported by: Dr. Christopher Landauer <cal AT aero.org>,
Steinar Haug <sthaug AT nethelp.no>
Submitted by: Andrey Zonov <andrey AT zonov.org> (earlier version)
MFC after: 2 weeks
SoCs with e500v2 chips only have at most 2 cores, and there are no plans to
release any more e500v2-based SoCs. Clamping MAXCPU down to 2 saves 5MB of
data, and 1.5MB bss.
- Distribute RX load across multiple cores, if present. This reverts
r217212, which is no longer relevant (I think because of the newer
SDK).
- Use newer APIs for pinning taskqueue entries to specific cores.
- Deepen RX buffers.
This more than doubles NAT forwarding throughput on my EdgeRouter Lite from,
with typical packet mixture, 90 Mbps to over 200 Mbps. The result matches
forwarding throughput in Linux without the UBNT hardware offload on the same
hardware, and thus likely reflects hardware limits.
Reviewed by: jhibbits
i386 is the only architecture where uint64_t does not specify 8-bytes
alignment, which makes struct xswdev layout not compatible between
64bit and i386.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
With this change, randomization can be enabled for all non-fixed
mappings. It means that the base address for the mapping is selected
with a guaranteed amount of entropy (bits). If the mapping was
requested to be superpage aligned, the randomization honours the
superpage attributes.
Although the value of ASLR is diminshing over time as exploit authors
work out simple ASLR bypass techniques, it elimintates the trivial
exploitation of certain vulnerabilities, at least in theory. This
implementation is relatively small and happens at the correct
architectural level. Also, it is not expected to introduce
regressions in existing cases when turned off (default for now), or
cause any significant maintaince burden.
The randomization is done on a best-effort basis - that is, the
allocator falls back to a first fit strategy if fragmentation prevents
entropy injection. It is trivial to implement a strong mode where
failure to guarantee the requested amount of entropy results in
mapping request failure, but I do not consider that to be usable.
I have not fine-tuned the amount of entropy injected right now. It is
only a quantitive change that will not change the implementation. The
current amount is controlled by aslr_pages_rnd.
To not spoil coalescing optimizations, to reduce the page table
fragmentation inherent to ASLR, and to keep the transient superpage
promotion for the malloced memory, locality clustering is implemented
for anonymous private mappings, which are automatically grouped until
fragmentation kicks in. The initial location for the anon group range
is, of course, randomized. This is controlled by vm.cluster_anon,
enabled by default.
The default mode keeps the sbrk area unpopulated by other mappings,
but this can be turned off, which gives much more breathing bits on
architectures with small address space, such as i386. This is tied
with the question of following an application's hint about the mmap(2)
base address. Testing shows that ignoring the hint does not affect the
function of common applications, but I would expect more demanding
code could break. By default sbrk is preserved and mmap hints are
satisfied, which can be changed by using the
kern.elf{32,64}.aslr.honor_sbrk sysctl.
ASLR is enabled on per-ABI basis, and currently it is only allowed on
FreeBSD native i386 and amd64 (including compat 32bit) ABIs. Support
for additional architectures will be added after further testing.
Both per-process and per-image controls are implemented:
- procctl(2) adds PROC_ASLR_CTL/PROC_ASLR_STATUS;
- NT_FREEBSD_FCTL_ASLR_DISABLE feature control note bit makes it possible
to force ASLR off for the given binary. (A tool to edit the feature
control note is in development.)
Global controls are:
- kern.elf{32,64}.aslr.enable - for non-fixed mappings done by mmap(2);
- kern.elf{32,64}.aslr.pie_enable - for PIE image activation mappings;
- kern.elf{32,64}.aslr.honor_sbrk - allow to use sbrk area for mmap(2);
- vm.cluster_anon - enables anon mapping clustering.
PR: 208580 (exp runs)
Exp-runs done by: antoine
Reviewed by: markj (previous version)
Discussed with: emaste
Tested by: pho
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D5603
- for now, alignments bigger that page size is allowed only for buffers
allocated by bus_dmamem_alloc(), cover this fact by KASSERT.
- never bounce buffers allocated by bus_dmamem_alloc(), these always comply
with the required rules (alignment, boundary, address range).
MFC after: 1 week
Reviewed by: jah
PR: 235542
reading some events from the interrupt status registers. These events
are reported to devd via system "PMU" and subsystem "Battery", "AC"
and "USB" such as plugged/unplugged, absent, charged and charging.
Reviewed by: manu
Differential Revision: https://reviews.freebsd.org/D19116
Make every rockchip file depend on the multiple soc_rockchip options
While here make rk_i2c and rk_gpio depend on their device options.
Reported by: sbruno
The COVERAGE option breaks xtoolchain-gcc GENERIC kernel early boot
extremely badly and hasn't been fixed for the ~week since it was committed.
Please enable for GENERIC only when it doesn't do that.
Related fallout reported by: lwhsu, tuexen (pr 235611)
This can aid with debugging when a thread is running and has no backtrace.
State can be estimated based on the pcb, and refined from there, for
example, to get a rough idea of the stack pointer.
r241119 that's performed globally by device_attach(9).
- As for the EM-class of devices, em(4) supports multiple queues
and MSI-X respectively only with 82574 devices. However, since
the conversion to iflib(4), em(4) relies on the interrupt type
fallback mechanism, i. e. MSI-X -> MSI -> INTx, of iflib(4) to
figure out the interrupt type to use for the EM-class (as well
as the IGB-class) of MACs. Moreover, despite the datasheet for
82583V not mentioning any support of MSI-X, there actually are
82583V devices out there that report a varying number of MSI-X
messages as supported. The interrupt type fallback of iflib(4)
is causing two failure modes depending on the actual number of
MSI-X messages supported for such instances of 82583V:
1) With only one MSI-X message supported, none is left for the
RX/TX queues as that one message gets assigned to the admin
interrupt. Worse, later on - which will be addressed with a
separate fix - iflib(4) interprets that one messages as MSI
or INTx to be set up, but fails to actually do so as it has
previously called pci_alloc_msix(9). [1, 2]
2) With more message supported, their distribution is okay but
then em_if_msix_intr_assign() doesn't work for 82583V, with
the interface being left in a non-working state, too. [3]
Thus, let em_if_attach_pre() indicate to iflib(4) to try MSI-X
with 82574 only, and at most MSI for the remainder of EM-class
devices.
While at it, remove "try_second_bar" as it's polarity inverted
and not actually needed.
- Remove code from em_if_timer() that effectively is a NOP since
the conversion to iflib(4) ("trigger" is no longer read).
While at it, let the comment for em_if_timer() reflect reality
after said conversion.
- Implement an ifdi_watchdog_reset method which only updates the
em(4) "watchdog_events" counter but doesn't perform any reset,
so that the em(4) "watchdog_timeouts" SYSCTL (iflib(4) doesn't
provide a counterpart) reflects reality and these timeouts add
to IFCOUNTER_OERRORS again after the iflib(4) conversion.
- Remove the "mbuf_defrag_fail" and "tx_dma_fail" SYSCTLS; since
the iflib(4) conversion, associated counters are disconnected,
but iflib(4) provides "mbuf_defrag_failed" and "tx_map_failed"
respectively as equivalents.
- Move the description preceding lem_smartspeed() to the correct
spot before em_reset() and bring back appropriate comments for
{igb,em}_initialize_rss_mapping() and lem_smartspeed() lost in
the iflib(4) conversion.
- Adapt some other function descriptions and INIT_DEBUGOUT() use
to match reality after the iflib(4) conversion.
- Put the debugging message of em_enable_vectors_82574() (missed
in r343578) under bootverbose, too.
PR: 219428 [1], 235246 [2], 235147 [3]
Reviewed by: erj (previous version)
Differential Revision: https://reviews.freebsd.org/D19108
It is currently re-declared in sys/sysent.h which is a wrong place for
MD variable. Which causes redeclaration error with gcc when
sys/sysent.h and machine/md_var.h are included both.
Remove it from sys/sysent.h and instead include machine/md_var.h when
needed, under #ifdef for both i386 and amd64.
Reported and tested by: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
sysctl variable net.inet.tcp.cc.cdg.smoothing_factor to 0, the smoothing
is disabled. Without this patch, a division by zero orrurs.
PR: 193762
Reviewed by: lstewart@, rrs@
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D19071
When configured with more tx queues than rx queues,
em_if_msix_intr_assign() was incorrectly routing the tx event
interrupts.
Reviewed by: erj, marius
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D19070
The vp vnode is unlocked during the execution of the VOP method and
can be reclaimed, zeroing vp->v_data. Caching allows to use the
correct mount point.
Reported and tested by: pho
PR: 235549
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
When renameat(2) is used with:
- absolute path for to;
- tofd not set to AT_FDCWD;
- the target exists
kern_renameat() requires CAP_UNLINK capability on tofd, but
corresponding namei ni_filecap is not initialized at all because the
lookup is absolute. As result, the check was done against empty filecap
and syscall fails erronously.
Fix it by creating a return flags namei member and reporting if the
lookup was absolute, then do not touch to.ni_filecaps at all.
PR: 222258
Reviewed by: jilles, ngie
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
X-MFC-note: KBI breakage
Differential revision: https://reviews.freebsd.org/D19096
Code after exec_fail_dealloc label expects that the image vnode is
locked if present. When copyout() of the strings or auxv vectors fails,
goto to the error handling did not relocked the vnode as required.
The copyout() can be made failing e.g. by creating an ELF image with
PT_GNU_STACK segment disabling the write.
Reported by: Jonathan Stuart <n0t.jcs@gmail.com> (found by fuzzing)
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
When moving from an invalid to a valid entry we don't need to invalidate
the tlb, however we do need to ensure the store is ordered before later
memory accesses. This is because this later access may be to a virtual
address within the newly mapped region.
Add the needed barriers to places where we don't later invalidate the
tlb. When we do invalidate the tlb there will be a barrier to correctly
order this.
This fixes a panic on boot on ThunderX2 when INVARIANTS is turned off:
panic: vm_fault_hold: fault on nofault entry, addr: 0xffff000040c11000
Reported by: jchandra
Tested by: jchandra
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D19097
We need to ensure the page table store has happened before the tlbi.
Reported by: jchandra
Tested by: jchandra
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D19097
For direct mapped kernel addresses, ppc64 function was not
performing the dmap to physical conversion, before jumping
to the code that fetched the value from physical memory.
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D19086
Use the information from IORT parsing to translate the PCI RID to
GIC ITS device ID. And similarly, use the information to find the
PIC XREF identifier to be used for PCI devices.
Reviewed by: andrew
Differential Revision: https://reviews.freebsd.org/D18004
acpi_iort.c has added support to query GIC proximity and MSI XREF
ID for GIC ITS blocks. Use this when GIC ITS blocks are initialized
from ACPI.
Reviewed by: andrew
Differential Revision: https://reviews.freebsd.org/D18003
Add new file arm64/acpica/acpi_iort.c to support the "IO Remapping
Table" (IORT). The table is specified in ARM document "ARM DEN 0049D"
titled "IO Remapping Table Platform Design Document". The IORT table
has information on the associations between PCI root complexes, SMMU
blocks and GIC ITS blocks in the system.
The changes are to parse and save the information in the IORT table.
The API to use this information is added to sys/dev/acpica/acpivar.h.
The acpi_iort.c also has code to check the GIC ITS nodes seen in the
IORT table with corresponding entries in MADT table (for validity)
and with entries in SRAT table (for proximity information).
Reviewed by: andrew
Differential Revision: https://reviews.freebsd.org/D18002
Make it more comprehensive on i386, by not setting nx bit for any
mapping, not just adding PF_X to all kernel-loaded ELF segments. This
is needed for the compatibility with older i386 programs that assume
that read access implies exec, e.g. old X servers with hand-rolled
module loader.
Reported and tested by: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
It was broken before PAE/no-PAE merge, but since now PAE is the
default, resume is apparently becomes for all machines.
The corrected issues:
- the trampoline page is not mapped executable, so machine faults when
paging is on;
- MSR.EFER and %cr4 both should be loaded before paging is enabled,
otherwise paging structures are invalid (cr4.PAE and EFER.NX).
- MSR.EFER and %cr4 should be only loaded if present. I attempt to handle
this by not touching the registers if the value is zero.
There are some more bits still not quite correct, e.g. unconditional
access to %cr4 in resumectx.
Reported and debugging help by: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
There's no need to worry about potential backwards compatibility issues
in a brand-new architecture, so avoid stack PROT_EXEC as with arm64.
Discussed with: br
update of devicetree to 4.19 in r340337.
Our build system doesn't provide dependencies for included DTS files, so
nobody noticed this issue for long time.
PR: 235362
MFC after: 1 week
The QorIQ SoCs don't actually support multicast interrupts, and the
references state explicitly that multicast is undefined behavior. Avoid the
undefined behavior by binding to only a single CPU, using a quirk to
determine if this is necessary.
MFC after: 3 weeks
There are few places in interrupt handler where the driver
lock is dropped; ensure that device is still running before
processing remaining ring entries.
PR: 192641
MFC after: 5 days
Certain versions of Sandisk x400 firmware can hang under extremely
heavly load of large I/Os for prolonged periods of time. Newer /
current versions work fine, and should be used where possible. Where
not possible, this quirk ensures that I/O requests are limited to 128k
to avoids the bug, even under extreme load. Since MAXPHYS is 128k,
only users with custom kernels are at risk on the older firmware.
Once all known users of the older firmware have upgraded, this quirk
will be removed.
Sponsored by: Netflix, Inc.
Initialize the static kenv in pmap_cold() and fetch user opinion on
vm.pmap.pae_mode tunable if hardware is capable. Note that the static
environment is reinitilized in init386() later when paging is enabled.
Reviewed by: bde
Discussed with: kevans
Sponsored by: The FreeBSD Foundation
MFC after: 2 months
When running several builders in parallel, on QEMU, with 8GB of
memory, a fatal kernel trap (0x300 (data storage interrupt))
caused by llan driver is sometimes observed, when the system
starts to run out of swap space.
This happens because, at llan_intr(), a phyp call to add a
logical LAN buffer is always made when llan_add_rxbuf() fails,
even if it fails to allocate a new buffer.
PR: 235489
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D19084
%r8, %r10, and on non-KPTI configuration %r9 were not restored on fast
return from a syscall.
Reviewed by: markj
Approved by: so
Security: CVE-2019-5595
Sponsored by: The FreeBSD Foundation
MFC after: 0 minutes
when TEKEN_CONS25 is configured. Fix this by adding a function to
set the flag that enables the fix and always calling this function
for syscons.
Expand the man page for teken_set_cons25(). This function is not
very useful since it can only set but not clear 1 flag. In practice,
it is only used when TEKEN_CONS25 is configured and all that does is
choose the the default emulation for syscons at compile time.
are terminated by 2 NULs, but only 1 NUL was zapped. Zapping only 1
NUL just splits the first string into an empty string and a corrupted
string. All other strings in static hints and env remained live early
in the boot when they were supposed to be disabled.
Support calling init_static_kenv() very early in the boot, so as to
use the env very early in the boot. Then the pointer to the loader
env may change after the first call due to enabling paging or otherwise
remapping the pointer. Another call is needed to register the change.
Don't use the previous pointer in this (or any) later call.
Reviewed by: kib
Changelist:
- Replace ND, D and RD macros with nm_prdis, nm_prinf, nm_prerr
and nm_prlim, to avoid possible naming conflicts.
- Add netmap_krings_mode_commit() helper function and use that
to reduce code duplication.
- Refactor pipes control code to export some functions that
can be reused by the veth driver (on Linux) and epair(4).
- Add check to reject API requests with version less than 11.
- Small code refactoring for the null adapter.
MFC after: 1 week
Bump up MAX_HWCNT and MAX_EXCNT to 32 when ACPI is enabled. These are
the sizes of the hwregions and exregions arrays respectively. ACPI
firmware typically has more memory regions and the current value of
16 is not sufficient for some platforms.
This commit fixes a failure seen with AMI firmware on Cavium's Sabre
ThunderX2 reference platform. This platform needs 21 physical memory
regions and 18 excluded regions to boot correctly with the current
firmware release.
Reviewed by: andrew
Differential Revision: https://reviews.freebsd.org/D19073
It appears idling via 'wait' on e5500 causes strange behaviors, such as
top(1) simply hanging sporadically, until input. Until this can possibly be
sorted out (interrupt issue?), just don't idle on this hardware. The SoCs
are low power already, and the wait state doesn't save much anyway.
List is a 'read'-type operation that does not modify shared state; it's safe
for multiple thread to proceed concurrently. This is reflected in the vnode
operation LISTEXTATTR locking protocol specification, which only requires a
shared lock.
(Similar to previous r248933.)
Reported by: Case van Rij <case.vanrij AT isilon.com>
Reviewed by: kib, mjg
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D19082
Use recent best practices for Copyright form at the top of
the license:
1. Remove all the All Rights Reserved clauses on our stuff. Where we
piggybacked others, use a separate line to make things clear.
2. Use "Netflix, Inc." everywhere.
3. Use a single line for the copyright for grep friendliness.
4. Use date ranges in all places for our stuff.
Approved by: Netflix Legal (who gave me the form), adrian@ (pmc files)
controller datasheet revision 3.3, in the context of Ethernet
MACs the control data describing the packet buffers typically
are named "descriptors". Each of these descriptors references
one buffer, multiple of which a packet can be composed of.
By contrast, in comments, messages and the names of structure
members, iflib(4) refers to DMA resources employed for RX and
TX buffers (rather than control data) as "desc(riptors)".
This odd naming convention of iflib(4) made reviewing r343085
and identifying wrong and missing bus_dmamap_sync(9) calls in
particular way harder than it already is. This convention may
also explain why the netmap(4) part of iflib(4) pairs the DMA
tags for control data with DMA maps of buffers and vice versa
in calls to bus_dma(9) functions.
Therefore, change iflib(4) to refer to buf(fers) when buffers
and not the usual understanding of descriptors is meant. This
change does not include corrections to the DMA resources used
in the netmap(4) parts. However, it revises error messages to
state which kind of allocation/creation failed. Specifically,
the "Unable to allocate tx_buffer (map) memory" copy & pasted
inappropriately on several occasions was replaced with proper
messages.
o Enhance some other error messages to indicate which half - RX
or TX - they apply to instead of using identical text in both
cases and generally canonicalize them.
o Correct the descriptions of iflib_{r,t}xsd_alloc() to reflect
reality; current code doesn't use {r,t}x_buffer structures.
o In iflib_queues_alloc():
- Remove redundant BUS_DMA_NOWAIT of iflib_dma_alloc() calls,
- change the M_WAITOK from malloc(9) calls into M_NOWAIT. The
return values are already checked, deferred DMA allocations
not being an option at this point, BUS_DMA_NOWAIT has to be
used anyway and prior malloc(9) calls in this function also
specify M_NOWAIT.
Reviewed by: shurd
Differential Revision: https://reviews.freebsd.org/D19067
Compiling a GENERIC kernel for i386 with clang 8.0 results in the
following warning:
/usr/src/sys/i386/i386/sys_machdep.c:542:40: error: 'sizeof ((ldt))' will return the size of the pointer, not the array itself [-Werror,-Wsizeof-pointer-div]
nldt = pldt != NULL ? pldt->ldt_len : nitems(ldt);
^~~~~~~~~~~
/usr/src/sys/sys/param.h:299:32: note: expanded from macro 'nitems'
#define nitems(x) (sizeof((x)) / sizeof((x)[0]))
~~~~~~~~~~~ ^
Indeed, 'ldt' is declared as 'union descriptor *', so nitems() is not
the right way to determine the number of LDTs. Instead, the NLDT define
from sys/x86/include/segments.h should be used.
Reviewed by: kib
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D19074
Currently, the trap code switches to the the temporary stack in the dbtrap
section. It works in most cases, but in the beginning of the execution, the
temp stack is being used, as starting in the powerpc_init() code.
In this current scenario, the stack is being overwritten, which causes the
return of breakpoint() to take abnormal execution.
This current patchset create a small stack to use by the dbtrap: codepath
avoiding the corruption of the temporary stack.
PR: 224872
Submitted by: breno.leitao_gmail.com
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D14484
Otherwise the lock might recurse in faultin() if the process is
swapped out.
Reported by: zeising
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
On CPUs supporting cmpxchg8b, fetch is performed by cmpxchg8b on
corresponding CPU slot, which unconditionally write to the slot. If
for that slot, the owner CPU increments it, then both CPUs might run
the cmpxchg8b instruction concurrently and this might race and
override the incremental write. So the counter update would be lost.
Fix it by implementing fetch as IPI and accumulation of result. It is
acceptable for rare counter64 fetch operation to be more expensive.
Diagnosed and tested by: Andreas Longwitz <longwitz@incore.de>
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
This is a step towards being able to free pages without the page
lock held. The approach is simply to add an implementation of
vm_page_dequeue_deferred() which does not assert that the page
lock is held. Formally, the page lock is required to set
PGA_DEQUEUE, but in the case of vm_page_free_prep() we get the
same mutual exclusion for free by virtue of the fact that no
other references to the page may exist.
No functional change intended.
Reviewed by: kib (previous version)
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19065
To detect the case where the page is already marked for a deferred
dequeue, we must read the "queue" and "aflags" fields in a
precise order. Otherwise, a race with a concurrent
vm_page_dequeue_complete() could leave the page with PGA_DEQUEUE
set despite it already having been dequeued. Fix the problem by
using vm_page_queue() to check the queue state, which correctly
handles the race.
Reviewed by: kib
Tested by: pho
MFC after: 3 days
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19039
This allows userspace to trace the kernel using the coverage sanitizer
found in clang. It will also allow other coverage tools to be built as
modules and attach into the same framework.
Sponsored by: DARPA, AFRL
never get here, however a test for SOLARIS, as redundant as this test is,
serves to document that this is the illumos definition. This should help
those who come after me to follow the code more easily.
MFC after: 1 month
Remove #ifdefs for ancient and irrelevant operating systems from
ipfilter.
When ipfilter was written the UNIX and UNIX-like systems in use
were diverse and plentiful. IRIX, Tru64 (OSF/1) don't exist any
more. OpenBSD removed ipfilter shortly after the first time the
ipfilter license terms changed in the early 2000's. ipfilter on AIX,
HP/UX, and Linux never really caught on. Removal of code for operating
systems that ipfilter will never run on again will simplify the code
making it easier to fix bugs, complete partially implemented features,
and extend ipfilter.
Unsupported previous version FreeBSD code and some older NetBSD code
has also been removed.
What remains is supported FreeBSD, NetBSD, and illumos. FreeBSD and
NetBSD have collaborated exchanging patches, while illumos has expressed
willingness to have their ipfilter updated to 5.1.2, provided their
zone-specific updates to their ipfilter are merged (which are of interest
to FreeBSD to allow control of ipfilters in jails from the global zone).
Reviewed by: glebius@
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D19006
That should shorten 'ifconfig <wlan> list txparam' output since
unsupported modes will not be shown.
Checked with RTL8188EE, STA mode.
MFC after: 2 weeks
Do not try to clear 'basic rate' bit from roamRate; it cannot be here and,
actually, this operation clears 'MCS rate' bit instead, breaking comparison
for 11n / 11ac modes.
Tested with RTL8188CUS, HOSTAP mode + RTL8821AU, STA mode.
MFC after: 3 days
ifconfig(8) prints per-mode parameters if they are non-zero; since
we have 13 possible modes with 3...5 typically supported this change
should greatly reduce amount of information for 'ifconfig <wlan> list roam'
command.
While here ensure that sta_roam_check() will not use roaming parameters
for unsupported modes (it should not).
This change effectively reverts r188776.
MFC after: 2 weeks
Add SYNC_KLOOP_MODE option, and add support for direct mode, where application
executes the TXSYNC and RXSYNC in the context of the ioeventfd wake up callback.
MFC after: 5 days
When in MSI mode, the device was only being configured with one
interrupt index, but it needs two - one for the actual interrupt and
one to park the tx queue at.
Also clarified comments relating to interrupt index assignment.
Reported by: Yuri Pankov <yuripv@yuripv.net>
MFC after: 1 day
The XIVE (External Interrupt Virtualization Engine) is a new interrupt
controller present in IBM's POWER9 processor. It's a very powerful,
very complex device using queues and shared memory to improve interrupt
dispatch performance in a virtualized environment.
This yields a ~10% performance improvment over the XICS emulation mode,
measured in both buildworld, and 'dd' from nvme to /dev/null.
Currently, this only supports native access.
MFC after: 1 month
512GB of ZFS ABD ARC means abd_chunk zone of 128M 4KB items. To manage
them UMA tries to allocate 2GB hash table, which size does not fit into
the int variable, causing later allocation failure, which makes ARC shrink
back below the 512GB, not letting it to use more RAM. With this change I
easily reached >700GB ARC size on 768GB RAM machine.
MFC after: 1 week
Sponsored by: iXsystems, Inc.
With the current 24G memory limit for GENERIC, the boot time test
causes quite visible delay, amplified by the default
debug.late_console = 0.
The comment text is copied from the same setting explanation for
amd64.
Suggested by: bde
Discussed with: emaste
Sponsored by: The FreeBSD Foundation
MFC after: 2 months
CPU and buses can manage up to the limit reported by cpu_maxphyaddr,
so set mem_rman to the value returned by cpu_getmaxphyaddr(). For the
PAE mode, it was missed both when rman_res_t was increased to
uintmax_t, and from the PAE merge commit.
When importing smaps or dump_avail chunks into memory rman, do not
blindly ignore resources which ends above the limit, chomp them
instead if start is below the limit. The same change was already done
to i386 add_physmap_entry().
Based on the submission by: bde
MFC after: 2 months
"slow" interrupt handler:
- Expand the list of INT_CAUSE registers known to the driver.
- Add decode information for many more bits but decouple it from the
rest of intr_info so that it is entirely optional.
- Call t4_fatal_err exactly once, and from the top level PL intr handler.
t4_fatal_err:
- Use t4_shutdown_adapter from the common code to stop the adapter.
- Stop servicing slow interrupts after the first fatal one.
Driver/firmware interaction:
- CH_DUMP_MBOX: note whether the mailbox being dumped is a command or a
reply or something else.
- Log the raw value of pcie_fw for some errors.
- Use correct log levels (debug vs. error).
Sponsored by: Chelsio Communications
kbd(4) (but only documented in atkbd(4)) maintains a table of strings
for 96 function keys. Using teken broke this 9+ years ago for the
most usable first 12 function keys and for 10 cursor keys, by supplying
its own non-programmable strings so that the keyboard driver's strings
are not used.
Fix this by supplying NULL in the teken layer for syscons in cons25 mode
so that the the strings are found in the kbd(4) layer.
vt needs more changes to use kbd(4)'s tables. Teken's cons25 table is
still needed to supply nonempty strings for vt in cons25 mode.
Keep using teken's xterm tables for both syscons and vt in xterm mode.
Function keys should at least default to xterm values in xterm mode,
and kbd(4) doesn't support this.
teken_set_cons25() sets a sticky flag to ask for the fix, and space is
reserved for another new flag. vt should set this flag when it uses
kbd(4)'s tables.
PR: 226553 (for vt)
consistently.
This inconsistency was observed when working on the bug reported in
PR 235256, although it does not fix the reported issue. The fix for
the PR will be a separate commit.
PR: 235256
Reviewed by: rrs@, Richard Scheffenegger
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D19033
The KPI have been reviewed and cleansed of features that were planned
back 20 years ago and never implemented. The pfil(9) internals have
been made opaque to protocols with only returned types and function
declarations exposed. The KPI is made more strict, but at the same time
more extensible, as kernel uses same command structures that userland
ioctl uses.
In nutshell [KA]PI is about declaring filtering points, declaring
filters and linking and unlinking them together.
New [KA]PI makes it possible to reconfigure pfil(9) configuration:
change order of hooks, rehook filter from one filtering point to a
different one, disconnect a hook on output leaving it on input only,
prepend/append a filter to existing list of filters.
Now it possible for a single packet filter to provide multiple rulesets
that may be linked to different points. Think of per-interface ACLs in
Cisco or Juniper. None of existing packet filters yet support that,
however limited usage is already possible, e.g. default ruleset can
be moved to single interface, as soon as interface would pride their
filtering points.
Another future feature is possiblity to create pfil heads, that provide
not an mbuf pointer but just a memory pointer with length. That would
allow filtering at very early stages of a packet lifecycle, e.g. when
packet has just been received by a NIC and no mbuf was yet allocated.
Differential Revision: https://reviews.freebsd.org/D18951
Not all child devices of the NVDIMM root device represent DIMM devices
which are present in the system. The spec says (ACPI 6.2, sec 9.20.2):
For each NVDIMM present or intended to be supported by platform,
platform firmware also exposes an NVDIMM device ... under the
NVDIMM root device.
Present NVDIMM devices are found by walking all of the NFIT table's
SPA ranges, then walking the NVDIMM regions mentioned by those SPA
ranges.
A set of NFIT walking helper functions are introduced to avoid the
need to splat the enumeration logic across several disparate
callbacks.
Submitted by: D Scott Phillips <d.scott.phillips@intel.com>
Sponsored by: Intel Corporation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D18439
Move the enumeration of NVDIMM SPA ranges from the spa GEOM class
initializer into the NVDIMM root device. This will be necessary for a
later change where NVDIMM namespaces require NVDIMM device enumeration
to be reliably ordered before SPA enumeration.
Submitted by: D Scott Phillips <d.scott.phillips@intel.com>
Sponsored by: Intel Corporation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D18734
anything except several assertions. This type is going to be used for
temporary on stack mbufs, that point into data in receive ring of a NIC,
that shall not be freed. Such mbuf can not be stored or reallocated, its
life time is current context.
Parts of the kobj(9) KPI assume a non-sleepable context for the purpose
of internal memory allocations, but currently have no way to signal an
allocation failure to the caller, so they just panic in this case. This
can occur even when kobj_create() is called with M_WAITOK. Fix some
instances of the problem by plumbing wait flags from kobj_create() through
internal subroutines. Change kobj_class_compile() to assume a sleepable
context when called externally, since all existing callers use it in a
sleepable context.
To fix the problem fully the kobj_init() KPI must be changed.
Reported and tested by: pho
Reviewed by: kib (previous version)
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19023
This patch and commit message are based on r340256 created by Jacob Keller:
The iflib stack does not disable TSO automatically when TXCSUM is
disabled, instead assuming that the driver will correctly handle TSOs
even when CSUM_IP is not set.
This results in iflib calling ixgbe_isc_txd_encap with packets which have
CSUM_IP_TSO, but do not have CSUM_IP or CSUM_IP_TCP set. Because of
this, ixgbe_tx_ctx_setup will not setup the IPv4 checksum offloading.
This results in bad TSO packets being sent if a user disables TXCSUM
without disabling TSO.
Fix this by updating the ixgbe_tx_ctx_setup function to check both
CSUM_IP and CSUM_IP_TSO when deciding whether to enable checksums.
Once this is corrected, another issue for TSO packets is revealed. The
driver sets IFLIB_NEED_ZERO_CSUM in order to enable a work around that
causes the ip->sum field to be zero'd. This is necessary for ix
hardware to correctly perform TSOs.
However, if TXCSUM is disabled, then the work around is not enabled, as
CSUM_IP will not be set when the iflib stack checks to see if it should
clear the sum field.
Fix this by adding IFLIB_TSO_INIT_IP to the iflib flags for the ix and
ixv interface files.
Once both of these changes are made, the ix and ixv drivers should
correctly offload TSO packets when TSO offload is enabled, regardless
of whether TXCSUM is enabled or disabled.
Submitted by: Piotr Pietruszewski <piotr.pietruszewski@intel.com>
Reviewed by: IntelNetworking
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D18470
From Piotr:
This patch introduces adapter->task_requests register responsible for
recording requests for mod_task, msf_task, mbx_task, fdir_task and
phy_task calls. Instead of enqueueing these tasks with
GROUPTASK_ENQUEUE, handlers will be called directly from
ixgbe_if_update_admin_status() while holding ctx lock.
SIOCGIFXMEDIA ioctl() call reads adapter->media list. The list is
deleted and rewritten in ixgbe_handle_msf() task without holding ctx
lock. This change is needed to maintain data coherency when sharing
adapter info via ioctl() calls.
Patch co-authored by Krzysztof Galazka <krzysztof.galazka@intel.com>.
PR: 221317
Submitted by: Piotr Pietruszewski <piotr.pietruszewski@intel.com>
Reviewed by: sbruno@, IntelNetworking
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D18468
lagg_capabilities() will set the capability once interfaces supporting
the feature are added to the lagg. Setting it on a lagg without any
interfaces is pointless as the if_snd_tag_alloc call will always fail
in that case.
Reviewed by: hselasky, gallatin
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19040
The pfil(9) system is about to be converted to epoch(9) synchronization, so
we need [temporarily] go back with ipfw internal locking.
Discussed with: ae
iflib is already a module, but it is unconditionally compiled into the
kernel. There are drivers which do not need iflib(4), and there are
situations where somebody might not want iflib in kernel because of
using the corresponding driver as module.
Reviewed by: marius
Discussed with: erj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D19041
Then bucket_alloc() also selects bucket size based on uz_count. However,
since zone lock is dropped, uz_count may reduce. In this case max may
be greater than ub_entries and that would yield into writing beyond end
of the allocation.
Reported by: pho
image as not compatible with ASLR.
Requested by: emaste
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Differential revision: https://reviews.freebsd.org/D5603
length of the struct in memmove() rather than an unintialized variable.
This fixes the first of two kernel page faults when ipfs is invoked.
PR: 235110
Reported by: David.Boyd49@twc.com
MFC after: 2 weeks
SIFTR does not allow any kind of filtering, but captures every packet
processed by the TCP stack.
Often, only a specific session or service is of interest, and doing the
filtering in post-processing of the log adds to the overhead of SIFTR.
This adds a new sysctl net.inet.siftr.port_filter. When set to zero, all
packets get captured as previously. If set to any other value, only
packets where either the source or the destination ports match, are
captured in the log file.
Submitted by: Richard Scheffenegger
Reviewed by: Cheng Cui
Differential Revision: https://reviews.freebsd.org/D18897
In all cases where ZFS sends BIO_FLUSH, it first waits for all related
writes to complete, so its BIO_FLUSH does not care about strict ordering.
Removal of one makes life much easier at least for NVMe driver, which
hardware has no concept of request ordering, relying completely on software.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Other types, such as BIO_FLUSH or BIO_ZONE, or especially new/unknown ones,
may imply some degree of ordering even if strict ordering is not requested
explicitly.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
copyright.
When all member nations of the Buenos Aires Convention adopted the Berne
Convention, the phrase "All rights reserved" became unnecessary to assert
copyright. Remove it from files under my or Panasas's copyright. The files
related to jedec_dimm(4) also bear avg@'s copyright; he has approved this
change.
Approved by: avg
Sponsored by: Panasas
r212160 tightened this from always using MSG_SIMPLE_Q_TAG to always
MSG_ORDERED_Q_TAG. Since it also marked all BIO_FLUSH requests with
BIO_ORDERED, this commit changes nothing immediately, but it returns
BIO_FLUSH callers ability to actually specify ordering they really
need, alike to other request types.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
When using poll(), select() or kevent() on netmap file descriptors,
netmap executes the equivalent of NIOCTXSYNC and NIOCRXSYNC commands,
before collecting the events that are ready. In other words, the
poll/kevent callback has side effects. This is done to avoid the
overhead of two system call per iteration (e.g., poll() + ioctl(NIOC*XSYNC)).
When the kqueue subsystem invokes the kqueue(9) f_event callback
(netmap_knrw), it holds the lock of the struct knlist object associated
to the netmap port (the lock is provided at initialization, by calling
knlist_init_mtx).
However, netmap_knrw() may need to wake up another netmap port (or even
the same one), which means that it may need to call knote().
Since knote() needs the lock of the struct knlist object associated to
the to-be-wake-up netmap port, it is possible to have a lock order reversal
problem (AB/BA deadlock).
This change prevents the deadlock by executing the knote() call in a
per-selinfo taskqueue, where it is possible to hold a mutex.
Reviewed by: aleksandr.fedorov_itglobal.com
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D18956
bus_teardown_intr(9) before pci_release_msi(9).
- Ensure that iflib(4) and associated drivers pass correct RIDs to
bus_release_resource(9) by obtaining the RIDs via rman_get_rid(9)
on the corresponding resources instead of using the RIDs initially
passed to bus_alloc_resource_any(9) as the latter function may
change those RIDs. Solely em(4) for the ioport resource (but not
others) and bnxt(4) were using the correct RIDs by caching the ones
returned by bus_alloc_resource_any(9).
- Change the logic of iflib_msix_init() around to only map the MSI-X
BAR if MSI-X is actually supported, i. e. pci_msix_count(9) returns
> 0. Otherwise the "Unable to map MSIX table " message triggers for
devices that simply don't support MSI-X and the user may think that
something is wrong while in fact everything works as expected.
- Put some (mostly redundant) debug messages emitted by iflib(4)
and em(4) during attachment under bootverbose. The non-verbose
output of em(4) seen during attachment now is close to the one
prior to the conversion to iflib(4).
- Replace various variants of spelling "MSI-X" (several in messages)
with "MSI-X" as used in the PCI specifications.
- Remove some trailing whitespace from messages emitted by iflib(4)
and change them to consistently start with uppercase.
- Remove some obsolete comments about releasing interrupts from
drivers and correct a few others.
Reviewed by: erj, Jacob Keller, shurd
Differential Revision: https://reviews.freebsd.org/D18980
The main differences with the currently implemented method are:
- Requires a local APIC EOI, since it doesn't bypass the local APIC
as the previous method used to do.
- Can be set to use different IDT vectors on each vCPU. Note that
FreeBSD doesn't make use of this feature since the event channel
IDT vector is reserved system wide.
Note that the old method of setting the event channel upcall is
not removed, and will be used as a fallback if this newly introduced
method is not available.
MFC after: 1 month
Sponsored by: Citrix Systems R&D
Effectively all i386 kernels now have two pmaps compiled in: one
managing PAE pagetables, and another non-PAE. The implementation is
selected at cold time depending on the CPU features. The vm_paddr_t is
always 64bit now. As result, nx bit can be used on all capable CPUs.
Option PAE only affects the bus_addr_t: it is still 32bit for non-PAE
configs, for drivers compatibility. Kernel layout, esp. max kernel
address, low memory PDEs and max user address (same as trampoline
start) are now same for PAE and for non-PAE regardless of the type of
page tables used.
Non-PAE kernel (when using PAE pagetables) can handle physical memory
up to 24G now, larger memory requires re-tuning the KVA consumers and
instead the code caps the maximum at 24G. Unfortunately, a lot of
drivers do not use busdma(9) properly so by default even 4G barrier is
not easy. There are two tunables added: hw.above4g_allow and
hw.above24g_allow, the first one is kept enabled for now to evaluate
the status on HEAD, second is only for dev use.
i386 now creates three freelists if there is any memory above 4G, to
allow proper bounce pages allocation. Also, VM_KMEM_SIZE_SCALE changed
from 3 to 1.
The PAE_TABLES kernel config option is retired.
In collaboarion with: pho
Discussed with: emaste
Reviewed by: markj
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D18894
This fixes BIO_ORDERED semantics while also improving performance by:
- sleeping also before BIO_ORDERED bio, as defined, not only after;
- not queueing BIO_ORDERED bio to taskqueue if no other bios running;
- waking up sleeping taskqueue explicitly rather then rely on polling.
On Samsung SSD 970 PRO this shows sync write latency, measured with
`diskinfo -wS`, reduction from ~2ms to ~1.1ms by not sleeping without
reason till next HZ tick.
On the same device ZFS pool with 8 ZVOLs synchronously writing 4KB blocks
shows ~950 IOPS instead of ~750 IOPS before. I suspect ZFS does not need
BIO_ORDERED on BIO_FLUSH at all, but that will be next question.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Because of a typo, the code was mistakenly resetting the
vtnrx_vq pointer rather than vtntx_tq.
Reviewed by: bryanv
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D19015
handling for protocols without ports numbers.
Since port numbers were uninitialized for protocols like ICMP/ICMPv6,
ipfw_chk() used some non-zero values to create dynamic states, and due
this it failed to match replies with created states.
Reported by: Oliver Hartmann, Boris Lytochkin
Obtained from: Yandex LLC
X-MFC after: r342908
This will allow multiple consumers of the coverage data to be compiled
into the kernel together. The only requirement is only one can be
registered at a given point in time, however it is expected they will
only register when the coverage data is needed.
A new kernel conflig option COVERAGE is added. This will allow kcov to
become a module that can be loaded as needed, or compiled into the
kernel.
While here clean up the #include style a little.
Reviewed by: kib
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D18955
On sync-kloop stop, send a wake-up signal to the kloop, so that
waiting for the timeout is not needed.
Also, improve logging in netmap_freebsd.c.
MFC after: 3 days
January 2018 changes -r327723 and -r327821. The softdep_bp_to_mp()
function failed to include VFIFO as one of the valid cases.
Although fifo's do not allocate blocks in the filesystem, they will
allocate blocks if they use extended attributes (such as ACLs). Thus,
softdep_bp_to_mp() needs to return a non-NULL mount pointer when
presented with a fifo vnode so that the soft updates write complete
will properly process the soft updates structures associated with the
extended attribute blocks. It was the failure to process these soft
updates structures, thus leaving them hanging off the buffer, which
lead to the "panic: softdep_deallocate_dependencies: dangling deps"
when trying to clean up the buffer after it was written.
PR: 230962
Reported by: 2t8mr7kx9f@protonmail.com
Reviewed by: kib
Tested by: Peter Holm
MFC after: 1 week
Sponsored by: Netflix
Re-evaluating the ALTQ kernel configuration can be expensive,
particularly when there are a large number (hundreds or thousands) of
queues, and is wholly unnecessary in response to events on interfaces
that do not support ALTQ as such interfaces cannot be part of an ALTQ
configuration.
Reviewed by: kp
MFC after: 1 week
Sponsored by: RG Nets
Differential Revision: https://reviews.freebsd.org/D18918
The existence of a PV entry for a mapping guarantees that the mapping
exists, so we should not need to test for that.
Reviewed by: kib
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18866
RFC 3168 defines an ECN-setup SYN-ACK packet as on with the ECE flags
set and the CWR flags not set. The code was only checking if ECE flag
is set. This patch adds the check to verify that the CWR flags is not
set.
Submitted by: Richard Scheffenegger
Reviewed by: tuexen@
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D18996
Enforce net80211 rates for control / management / multicast / EAPOL frames
and allow to override rate for unicast frames via ifconfig(8) 'ucastrate'
option; by default it still uses f/w rate adaptation for unicast frames.
MFC after: 1 week
The comment states that function always return a top of allocated mbuf;
however, the function actually return the overall mbuf chain top pointer.
Since there are already existing users of it (via m_getm(4) macro),
rephrase the comment and leave behavior unchanged.
PR: 134335
MFC after: 12 days
This includes the bump for cdevsw d_version. Otherwise, the impact on
the ABI (not KBI) is surprisingly low. The most important affected
interface is devname(3) and ttyname(3) which already correctly handle
long names (and ttyname(3) should not be affected at all).
Still, due to the d_version bump, I argue that the change is not MFC-able.
Requested by: mmacy
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D18932
corresponding bitmap before adding an mbuf has actually succeeded.
Previously, m_gethdr(M_NOWAIT, ...) failing caused a "hole" in the
RX ring but not in its bitmap. One implication of such a hole was
that in a subsequent call to _iflib_fl_refill() with the RX buffer
accounting still indicating another reclaimable buffer, bit_ffc(3)
nevertheless returned -1 in frag_idx which in turn caused havoc
when used as an index. Thus, additionally assert that frag_idx is
0 or greater.
Another possible consequence of a hole in the RX ring was a NULL-
dereference when trying to use the unallocated mbuf, for example
in iflib_rxd_pkt_get().
While at it, make the variable declarations in _iflib_fl_refill()
conform to style(9) and remove redundant checks already performed
by bit_ffc{,_at}(3).
- In iflib_queues_alloc(), don't pass redundant M_ZERO to bit_alloc(3).
Reported and tested by: pho
* There's no reason to have a while() loop here, because:
- if msleep returns 0, that means we were woken up by the interrupt handler,
and we are going to exit immediately as sc_fw_chunk_done will now be 1
(there is nothing else that sleeps on sc_fw.)
- if msleep doesn't return 0 (i.e. it returned ETIMEDOUT) then we will
exit immediately because of the if-test.
So, just use a single msleep() and then check sc_fw_chunk_done as before.
* The comment said we were sleeping for 5 seconds, but the msleep was only
for 1. Before r314065, this was 1 second and so was the comment,
and in that commit the comment was changed and the function call wasn't.
Possibly fixes failures to initialize uCode on certain devices.
Submitted by: Augustin Cavalier (waddlesplash gmail.com)
Obtained from: Haiku 132990ecdcb072f2ce597b5d497ff3e5b1f09c20
MFC after: 10 days
Wrap ieee80211_add_channel_list_2ghz into another function
which supplies default (1-14) channel list to it and drop
its copies from drivers.
Checked with RTL8188EE, country US / JP / KR / UA.
MFC after: 2 weeks
There is no reason for this variable to be tunable.
This variable is used as a barrier in few places.
Discussed with: pjd
MFC after: 2 weeks
Sponsored by: Fudo Security
set of known soft dependency data structures now includes: sd_worklist,
sd_inodedep, sd_allocdirect, sd_allocindir, and sd_mkdir. DDB can
also print lists of sd_allinodedeps, sd_mkdir_list, and sd_workhead.
The sd_workhead script is useful for listing all the dependencies
associated with a buffer, e.g. bp->b_dep.
Prefix the soft dependency show names with sd_ so that they sort
together when listed by DDB's "show help" and to distinguish them
from other data structures printable by DDB.
Sponsored by: Netflix
Maliciously formed, or badly corrupted, filesystems can cause kernel
panics. In general, such acts of foot-shooting can only be accomplished
by root, but in a world with VM images that is moving towards automated
mounts it is important to have some form of prevention.
Reported by: Christopher Krah, Thomas Barabosch, and Jan-Niclas Hilgert
of Fraunhofer FKIE.
Incidentaly this should also fix a memory corruption issue reported by
Dr Silvio Cesare of InfoSect.
Huge thanks to all reseachers for making us aware of the issue.
admbug: 872, 891
Reviewed by: fsu
Obtained from: NetBSD (with minor changes)
MFC after: 3 days
- Drain offload transmit queues when RATELIMIT is enabled but
TCP_OFFLOAD is not.
- Expose the per-VI nofldtxq and first_ofld_txq sysctls when
RATELIMIT is enabled but TCP_OFFLOAD is not.
- Clear offload transmit queue stats as part of a 'cxgbetool clearstats'
request when RATELIMIT is enabled but TCP_OFFLOAD is not.
Reviewed by: np
MFC after: 2 weeks
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D18966
Fix dublicate value in what is apparent copypaste mistake. The last value
in mask is supposed to be for counter 7, not counter 3.
PR: 229790
Submitted by: David Binderman <dcb314@hotmail.com>
MFC after: 1 week
cpuid and local cpu variable are unsigned so checking if value is less than zero
always yields false.
PR: 211088
Submitted by: David Binderman <dcb314@hotmail.com>
MFC after: 1 week
This allows the part of the rewrite of TCP reassembly in this
files to be MFCed to stable/11 with manual change.
MFC after: 3 days
Sponsored by: Netflix, Inc.
The new loop to sync and unload descriptors was indexed
by "i", rather than "j". The panic was caused by "i"
being advanced rather than "j", and eventually becoming
out of bounds.
Reviewed by: kib
MFC after: 3 days
Sponsored by: Netflix
When implementing support for IW10, an update in the computation
of the restart window used after an idle phase was missed. To
minimize code duplication, implement the logic in tcp_compute_initwnd()
and call it. This fixes a bug in NewReno, which was not aware of
IW10.
Submitted by: Richard Scheffenegger
Reviewed by: tuexen@
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D18940
When cleaning up a vnet we free the counters in V_pf_default_rule and
V_pf_status from shutdown_pf(), but we can still use them later, for example
through pf_purge_expired_src_nodes().
Free them as the very last operation, as they rely on nothing else themselves.
PR: 235097
MFC after: 1 week
Replace in-place implementation with system-wide one; since it
guarantees non-zero result drop all less-than-one checks from
drivers and net80211.
MFC after: 2 weeks
connectivity.
Looking at past changes in this area like r337866, some refcounting
bugs have been introduced, one by one. For example like calling
in6m_disconnect() and in6m_rele_locked() in mld_v1_process_group_timer()
where previously no disconnect nor refcount decrement was done.
Calling in6m_disconnect() when it shouldn't causes IPv6 solitation to no
longer work, because all the multicast addresses receiving the solitation
messages are now deleted from the network interface.
This patch reverts some recent changes while improving the MLD
refcounting and concurrency model after the MLD code was converted
to using EPOCH(9).
List changes:
- All CK_STAILQ_FOREACH() macros are now properly enclosed into
EPOCH(9) sections. This simplifies assertion of locking inside
in6m_ifmultiaddr_get_inm().
- Corrected bad use of in6m_disconnect() leading to loss of IPv6
connectivity for MLD v1.
- Factored out checks for valid inm structure into
in6m_ifmultiaddr_get_inm().
PR: 233535
Differential Revision: https://reviews.freebsd.org/D18887
Reviewed by: bz (net)
Tested by: ae
MFC after: 1 week
Sponsored by: Mellanox Technologies
because the destructor will access the if_ioctl() callback in the ifnet
pointer which is about to be freed. This prevents use-after-free.
PR: 233535
Differential Revision: https://reviews.freebsd.org/D18887
Reviewed by: bz (net)
Tested by: ae
MFC after: 1 week
Sponsored by: Mellanox Technologies