Make the clustering enabling knob more fine-grained by providing a
setting where the allocation with hint is not clustered. This is aimed
to be somewhat more compatible with e.g. go 1.4 which expects that
hinted mmap without MAP_FIXED does not change the allocation address.
Now the vm.cluster_anon can be set to 1 to only cluster when no hints,
and to 2 to always cluster. Default value is 1.
Requested by: peter
Reviewed by: emaste, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D19194
When sigreturn() restored a thread's context, SRR1 was being restored
to its previous value, but pcb_flags was not being touched.
This could cause a mismatch between the thread's MSR and its pcb_flags.
For instance, when the thread used the FPU for the first time inside
the signal handler, sigreturn() would clear SRR1, but not pcb_flags.
Then, the thread would return with the FPU bit cleared in MSR and,
the next time it tried to use the FPU, it would fail on a KASSERT
that checked if the FPU was disabled.
This change clears the FPU bit in both pcb_flags and frame->srr1,
as the code that restores the context expects to use the FPU trap
to re-enable it.
PR: 234539
Reported by: sbruno
Reviewed by: jhibbits, sbruno
Differential Revision: https://reviews.freebsd.org/D19166
Some older compilers, when generating PIC code, cannot handle inline
asm that clobbers %ebx (because %ebx is used as the GOT offset
register). Userspace versions avoid clobbering %ebx by saving it to
stack before executing the CPUID instruction.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
This reduces the overhead of TLB invalidations by ensuring that we
only interrupt CPUs which are using the given pmap. Tracking is
performed in pmap_activate(), which gets called during context switches:
from cpu_throw(), if a thread is exiting or an AP is starting, or
cpu_switch() for a regular context switch.
For now, pmap_sync_icache() still must interrupt all CPUs.
Reviewed by: kib (earlier version), jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18874
This includes support for pmap_enter(..., psind=1) as described in the
commit log message for r321378.
The changes are largely modelled after amd64. arm64 has more stringent
requirements around superpage creation to avoid the possibility of TLB
conflict aborts, and these requirements do not apply to RISC-V, which
like amd64 permits simultaneous caching of 4KB and 2MB translations for
a given page. RISC-V's PTE format includes only two software bits, and
as these are already consumed we do not have an analogue for amd64's
PG_PROMOTED. Instead, pmap_remove_l2() always invalidates the entire
2MB address range.
pmap_ts_referenced() is modified to clear PTE_A, now that we support
both hardware- and software-managed reference and dirty bits. Also
fix pmap_fault_fixup() so that it does not set PTE_A or PTE_D on kernel
mappings.
Reviewed by: kib (earlier version)
Discussed with: jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18863
Differential Revision: https://reviews.freebsd.org/D18864
Differential Revision: https://reviews.freebsd.org/D18865
Differential Revision: https://reviews.freebsd.org/D18866
Differential Revision: https://reviews.freebsd.org/D18867
Differential Revision: https://reviews.freebsd.org/D18868
But ipsec_delete_pcbpolicy() uses some VNET-virtualized variables,
and thus it needs VNET context, that is missing during gtaskqueue
executing. Use inp_vnet context to set curvnet in in_pcbfree_deferred().
PR: 235684
MFC after: 1 week
ratelimiting code. The two modules (lagg and vlan) did have
allocation routines, and even though they are indirect (and
vector down to the underlying interfaces) they both need to
have a free routine (that also vectors down to the actual interface).
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D19032
Newer cores have the 'tlbilx' instruction, which doesn't broadcast over
CoreNet. This is significantly faster than walking the TLB to invalidate
the PID mappings. tlbilx with the arguments given takes 131 clock cycles to
complete, as opposed to 512 iterations through the loop plus tlbre/tlbwe at
each iteration.
MFC after: 3 weeks
The panic message lead people to believe some userland CAM request had
caused a problem when in reallity it was for a kernel request (eg the
USER bit was cleared). Reword message. Also, improve a couple of
comments to reflect that the periph shouldn't be completely torn down
before we get here (so the path and sim pointers should be valid, but
aren't and the code is designed to be robust enough in the face of
that to give a specific panic message).
So far, intr_{g,s}etaffinity(9) take a single int for identifying
a device interrupt. This approach doesn't work on all architectures
supported, as a single int isn't sufficient to globally specify a
device interrupt. In particular, with multiple interrupt controllers
in one system as found on e. g. arm and arm64 machines, an interrupt
number as returned by rman_get_start(9) may be only unique relative
to the bus and, thus, interrupt controller, a certain device hangs
off from.
In turn, this makes taskqgroup_attach{,_cpu}(9) and - internal to
the gtaskqueue implementation - taskqgroup_attach_deferred{,_cpu}()
not work across architectures. Yet in turn, iflib(4) as gtaskqueue
consumer so far doesn't fit architectures where interrupt numbers
aren't globally unique.
However, at least for intr_setaffinity(..., CPU_WHICH_IRQ, ...) as
employed by the gtaskqueue implementation to bind an interrupt to a
particular CPU, using bus_bind_intr(9) instead is equivalent from
a functional point of view, with bus_bind_intr(9) taking the device
and interrupt resource arguments required for uniquely specifying a
device interrupt.
Thus, change the gtaskqueue implementation to employ bus_bind_intr(9)
instead and intr_{g,s}etaffinity(9) to take the device and interrupt
resource arguments required respectively. This change also moves
struct grouptask from <sys/_task.h> to <sys/gtaskqueue.h> and wraps
struct gtask along with the gtask_fn_t typedef into #ifdef _KERNEL
as userland likes to include <sys/_task.h> or indirectly drags it
in - for better or worse also with _KERNEL defined -, which with
device_t and struct resource dependencies otherwise is no longer
as easily possible now.
The userland inclusion problem probably can be improved a bit by
introducing a _WANT_TASK (as well as a _WANT_MOUNT) akin to the
existing _WANT_PRISON etc., which is orthogonal to this change,
though, and likely needs an exp-run.
While at it:
- Change the gt_cpu member in the grouptask structure to be of type
int as used elswhere for specifying CPUs (an int16_t may be too
narrow sooner or later),
- move the gtaskqueue_enqueue_fn typedef from <sys/gtaskqueue.h> to
the gtaskqueue implementation as it's only used and needed there,
- change the GTASK_INIT macro to use "gtask" rather than "task" as
argument given that it actually operates on a struct gtask rather
than a struct task, and
- let subr_gtaskqueue.c consistently use __func__ to print functions
names.
Reported by: mmel
Reviewed by: mmel
Differential Revision: https://reviews.freebsd.org/D19139
Gratuitous ARP packets are sent from a timer, which means we don't have a vnet
context set. As a result we panic trying to send the packet.
Set the vnet context based on the interface associated with the interface
address.
To reproduce:
sysctl net.link.ether.inet.garp_rexmit_count=2
ifconfig vtnet1 10.0.0.1/24 up
PR: 235699
Reviewed by: vangyzen@
MFC after: 1 week
o Correct the obvious bugs in the netmap(4) parts:
- No longer check for the existence of DMA maps as bus_dma(9)
is used unconditionally in iflib(4) since r341095.
- Supply the correct DMA tag and map pairs to bus_dma(9)
functions (see also the commit message of r343753).
- In iflib_netmap_timer_adjust(), add synchronization of the
TX descriptors before calling the ift_txd_credits_update
method as the latter evaluates the TX descriptors possibly
updated by the MAC.
- In _task_fn_tx(), wrap the netmap(4)-specific bits in
#ifdef DEV_NETMAP just as done in _task_fn_admin() and
_task_fn_rx() respectively.
o In iflib_fast_intr_rxtx(), synchronize the TX rather than
the RX descriptors before calling the ift_txd_credits_update
method (see also above).
o There's no need to synchronize an RX buffer that is going to
be recycled in iflib_rxd_pkt_get(), yet; it's sufficient to
do that as late as passing RX buffers to the MAC via the
ift_rxd_refill method. Hence, combine that synchronization
with the synchronization of new buffers into a common spot
in _iflib_fl_refill().
o There's no need to synchronize the RX descriptors of a free
list in preparation of the MAC updating their statuses with
every invocation of rxd_frag_to_sd(); it's enough to do this
once before handing control over to the MAC, i. e. before
calling ift_rxd_flush method in _iflib_fl_refill(), which
already performs the necessary synchronization.
o Given that the ift_rxd_available method evaluates the RX
descriptors which possibly have been altered by the MAC,
synchronize as appropriate beforehand. Most notably this
is now done in iflib_rxd_avail(), which in turn means that
we don't need to issue the same synchronization yet again
before calling the ift_rxd_pkt_get method in iflib_rxeof().
o In iflib_txd_db_check(), synchronize the TX descriptors
before handing them over to the MAC for transmission via
the ift_txd_flush method.
o In iflib_encap(), move the TX buffer synchronization after
the invocation of the ift_txd_encap() method. If the MAC
driver fails to encapsulate the packet and we retry with
a defragmented mbuf chain or finally fail, the cycles for
TX buffer synchronization have been wasted. Synchronizing
afterwards matches what non-iflib(4) drivers typically do
and is sufficient as the MAC will not actually start with
the transmission before - in this case - the ift_txd_flush
method is called.
Moreover, for the latter reason the synchronization of the
TX descriptors in iflib_encap() can go as it's enough to
synchronize them before passing control over to the MAC by
issuing the ift_txd_flush() method (see above).
o In iflib_txq_can_drain(), only synchronize TX descriptors
if the ift_txd_credits_update method accessing these is
actually called.
Differential Revision: https://reviews.freebsd.org/D19081
At moea64_sync_icache(), when the 'va' argument has page size
alignment, round_page() will return the same value as 'va'.
This would cause 'len' to be 0 and thus an infinite loop.
With this change, 'lim' will always point to the next page boundary.
This issue occurred especially during debugging sessions, when a breakpoint
was placed on an exact page-aligned offset, for instance.
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D19149
option.
This issue was found by running syzkaller on OpenBSD.
Greg Steuck made me aware that the problem might also exist on FreeBSD.
Reported by: Greg Steuck
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D18834
As a followup to r343673, unsign some variables related to allocation
since the hashsize cannot be negative. This gives a bit more space to
handle bigger allocations and avoid some implicit casting.
While here also unsign uh_hashmask, it makes little sense to keep that
signed.
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D19148
This will allow upstream consumers, e.g., capsicum-test and third-party
packages (via ports(7)), to test for a specific `__FreeBSD_version__` and
expect `renameat(2)` to be functional.
PR: 222258
Approved by: emaste (mentor)
Reviewed by: emaste
MFC with: r343891
Differential Revision: https://reviews.freebsd.org/D19154
In `probedone()`, for the `PROBE_REPORT_LUNS` case, all paths that
fall to the bottom of the case set `lp` to `NULL`, so the test for a
non-NULL value of `lp` and call to `free()` if true is dead code as
the test can never be true. Fix by eliminating the whole if
statement. To guard against a possible future change that accidentally
violates this assumption, use a `KASSERT()` to catch if `lp` is
non-NULL.
Reviewed by: cem
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D19109
When pci_realloc_bars was first added, the intention was to eventually
enable it by default, but it was left disabled to preserve existing
behavior. The setting is pretty conservative in that it does not
attempt to allocate resources for BARs that the BIOS/firmware leaves
disabled. It only attempts to reallocate resources for a BAR that the
firmware programmed during boot but that conflicts with another
resource during the kernel's device scan.
PR 221350 is an example of a machine that this knob fixes.
Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D18965
Initially it was introduced because parent rule pointer could be freed,
and rule's information could become inaccessible. In r341471 this was
changed. And now we don't need this information, and also it can become
stale. E.g. rule can be moved from one set to another. This can lead
to parent's set and state's set will not match. In this case it is
possible that static rule will be freed, but dynamic state will not.
This can happen when `ipfw delete set N` command is used to delete
rules, that were moved to another set.
To fix the problem we will use the set number from parent rule.
Obtained from: Yandex LLC
MFC after: 1 week
Sponsored by: Yandex LLC
Without this fix, the usage of kernel coverage would lockup the system.
Thanks to Andrew for suggesting the final form of the fix.
PR: 235611
Reviewed by: andrew@, emaste@
Differential Revision: https://reviews.freebsd.org/D19135
battery charging, charge state, voltage, charging current, discharging current,
battery capacity etc. can be obtained via sysctl.
Reviewed by: manu
Differential Revision: https://reviews.freebsd.org/D19145
The hardcoded ident is exactly 20 bytes long but sprintf adds terminating zero,
so there is one byte written out of array bounds.As a fix use strncpy it
appends \0 only if space allows and its behavior matches virtio spec:
When VIRTIO_BLK_T_GET_ID is issued, the device identifier, up to 20 bytes, is
written to the buffer. The identifier should be interpreted as an ascii string.
It is terminated with \0, unless it is exactly 20 bytes long.
PR: 202298
Reviewed by: br
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D18852
In general, the time savings come from separating the active and
inactive queues lists into separate interface and non-interface queue
lists, and changing the rule and queue tag management from list-based
to hash-bashed.
In HFSC, a linear scan of the class table during each queue destroy
was also eliminated.
There are now two new tunables to control the hash size used for each
tag set (default for each is 128):
net.pf.queue_tag_hashsize
net.pf.rule_tag_hashsize
Reviewed by: kp
MFC after: 1 week
Sponsored by: RG Nets
Differential Revision: https://reviews.freebsd.org/D19131
nvpair_create_stringv: free the temporary string; this fix affects
nvlist_add_stringf() and nvlist_add_stringv().
nvpair_remove_nvlist_array (NV_TYPE_NVLIST_ARRAY case): free the chain
of nvpairs (as resetting it prevents nvlist_destroy() from freeing it).
Note: freeing the chain in nvlist_destroy() is not sufficient, because
it would still leak through nvlist_take_nvlist_array(). This affects
all nvlist_*_nvlist_array() use
Submitted by: Mindaugas Rasiukevicius <rmind@netbsd.org>
Reported by: clang/gcc ASAN
MFC after: 2 weeks
PR: 76972 and duplicates
Reported by: Dr. Christopher Landauer <cal AT aero.org>,
Steinar Haug <sthaug AT nethelp.no>
Submitted by: Andrey Zonov <andrey AT zonov.org> (earlier version)
MFC after: 2 weeks
SoCs with e500v2 chips only have at most 2 cores, and there are no plans to
release any more e500v2-based SoCs. Clamping MAXCPU down to 2 saves 5MB of
data, and 1.5MB bss.
- Distribute RX load across multiple cores, if present. This reverts
r217212, which is no longer relevant (I think because of the newer
SDK).
- Use newer APIs for pinning taskqueue entries to specific cores.
- Deepen RX buffers.
This more than doubles NAT forwarding throughput on my EdgeRouter Lite from,
with typical packet mixture, 90 Mbps to over 200 Mbps. The result matches
forwarding throughput in Linux without the UBNT hardware offload on the same
hardware, and thus likely reflects hardware limits.
Reviewed by: jhibbits
i386 is the only architecture where uint64_t does not specify 8-bytes
alignment, which makes struct xswdev layout not compatible between
64bit and i386.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
With this change, randomization can be enabled for all non-fixed
mappings. It means that the base address for the mapping is selected
with a guaranteed amount of entropy (bits). If the mapping was
requested to be superpage aligned, the randomization honours the
superpage attributes.
Although the value of ASLR is diminshing over time as exploit authors
work out simple ASLR bypass techniques, it elimintates the trivial
exploitation of certain vulnerabilities, at least in theory. This
implementation is relatively small and happens at the correct
architectural level. Also, it is not expected to introduce
regressions in existing cases when turned off (default for now), or
cause any significant maintaince burden.
The randomization is done on a best-effort basis - that is, the
allocator falls back to a first fit strategy if fragmentation prevents
entropy injection. It is trivial to implement a strong mode where
failure to guarantee the requested amount of entropy results in
mapping request failure, but I do not consider that to be usable.
I have not fine-tuned the amount of entropy injected right now. It is
only a quantitive change that will not change the implementation. The
current amount is controlled by aslr_pages_rnd.
To not spoil coalescing optimizations, to reduce the page table
fragmentation inherent to ASLR, and to keep the transient superpage
promotion for the malloced memory, locality clustering is implemented
for anonymous private mappings, which are automatically grouped until
fragmentation kicks in. The initial location for the anon group range
is, of course, randomized. This is controlled by vm.cluster_anon,
enabled by default.
The default mode keeps the sbrk area unpopulated by other mappings,
but this can be turned off, which gives much more breathing bits on
architectures with small address space, such as i386. This is tied
with the question of following an application's hint about the mmap(2)
base address. Testing shows that ignoring the hint does not affect the
function of common applications, but I would expect more demanding
code could break. By default sbrk is preserved and mmap hints are
satisfied, which can be changed by using the
kern.elf{32,64}.aslr.honor_sbrk sysctl.
ASLR is enabled on per-ABI basis, and currently it is only allowed on
FreeBSD native i386 and amd64 (including compat 32bit) ABIs. Support
for additional architectures will be added after further testing.
Both per-process and per-image controls are implemented:
- procctl(2) adds PROC_ASLR_CTL/PROC_ASLR_STATUS;
- NT_FREEBSD_FCTL_ASLR_DISABLE feature control note bit makes it possible
to force ASLR off for the given binary. (A tool to edit the feature
control note is in development.)
Global controls are:
- kern.elf{32,64}.aslr.enable - for non-fixed mappings done by mmap(2);
- kern.elf{32,64}.aslr.pie_enable - for PIE image activation mappings;
- kern.elf{32,64}.aslr.honor_sbrk - allow to use sbrk area for mmap(2);
- vm.cluster_anon - enables anon mapping clustering.
PR: 208580 (exp runs)
Exp-runs done by: antoine
Reviewed by: markj (previous version)
Discussed with: emaste
Tested by: pho
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D5603
- for now, alignments bigger that page size is allowed only for buffers
allocated by bus_dmamem_alloc(), cover this fact by KASSERT.
- never bounce buffers allocated by bus_dmamem_alloc(), these always comply
with the required rules (alignment, boundary, address range).
MFC after: 1 week
Reviewed by: jah
PR: 235542
reading some events from the interrupt status registers. These events
are reported to devd via system "PMU" and subsystem "Battery", "AC"
and "USB" such as plugged/unplugged, absent, charged and charging.
Reviewed by: manu
Differential Revision: https://reviews.freebsd.org/D19116
Make every rockchip file depend on the multiple soc_rockchip options
While here make rk_i2c and rk_gpio depend on their device options.
Reported by: sbruno