In general, the time savings come from separating the active and
inactive queues lists into separate interface and non-interface queue
lists, and changing the rule and queue tag management from list-based
to hash-bashed.
In HFSC, a linear scan of the class table during each queue destroy
was also eliminated.
There are now two new tunables to control the hash size used for each
tag set (default for each is 128):
net.pf.queue_tag_hashsize
net.pf.rule_tag_hashsize
Reviewed by: kp
MFC after: 1 week
Sponsored by: RG Nets
Differential Revision: https://reviews.freebsd.org/D19131
nvpair_create_stringv: free the temporary string; this fix affects
nvlist_add_stringf() and nvlist_add_stringv().
nvpair_remove_nvlist_array (NV_TYPE_NVLIST_ARRAY case): free the chain
of nvpairs (as resetting it prevents nvlist_destroy() from freeing it).
Note: freeing the chain in nvlist_destroy() is not sufficient, because
it would still leak through nvlist_take_nvlist_array(). This affects
all nvlist_*_nvlist_array() use
Submitted by: Mindaugas Rasiukevicius <rmind@netbsd.org>
Reported by: clang/gcc ASAN
MFC after: 2 weeks
PR: 76972 and duplicates
Reported by: Dr. Christopher Landauer <cal AT aero.org>,
Steinar Haug <sthaug AT nethelp.no>
Submitted by: Andrey Zonov <andrey AT zonov.org> (earlier version)
MFC after: 2 weeks
SoCs with e500v2 chips only have at most 2 cores, and there are no plans to
release any more e500v2-based SoCs. Clamping MAXCPU down to 2 saves 5MB of
data, and 1.5MB bss.
- Distribute RX load across multiple cores, if present. This reverts
r217212, which is no longer relevant (I think because of the newer
SDK).
- Use newer APIs for pinning taskqueue entries to specific cores.
- Deepen RX buffers.
This more than doubles NAT forwarding throughput on my EdgeRouter Lite from,
with typical packet mixture, 90 Mbps to over 200 Mbps. The result matches
forwarding throughput in Linux without the UBNT hardware offload on the same
hardware, and thus likely reflects hardware limits.
Reviewed by: jhibbits
i386 is the only architecture where uint64_t does not specify 8-bytes
alignment, which makes struct xswdev layout not compatible between
64bit and i386.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
With this change, randomization can be enabled for all non-fixed
mappings. It means that the base address for the mapping is selected
with a guaranteed amount of entropy (bits). If the mapping was
requested to be superpage aligned, the randomization honours the
superpage attributes.
Although the value of ASLR is diminshing over time as exploit authors
work out simple ASLR bypass techniques, it elimintates the trivial
exploitation of certain vulnerabilities, at least in theory. This
implementation is relatively small and happens at the correct
architectural level. Also, it is not expected to introduce
regressions in existing cases when turned off (default for now), or
cause any significant maintaince burden.
The randomization is done on a best-effort basis - that is, the
allocator falls back to a first fit strategy if fragmentation prevents
entropy injection. It is trivial to implement a strong mode where
failure to guarantee the requested amount of entropy results in
mapping request failure, but I do not consider that to be usable.
I have not fine-tuned the amount of entropy injected right now. It is
only a quantitive change that will not change the implementation. The
current amount is controlled by aslr_pages_rnd.
To not spoil coalescing optimizations, to reduce the page table
fragmentation inherent to ASLR, and to keep the transient superpage
promotion for the malloced memory, locality clustering is implemented
for anonymous private mappings, which are automatically grouped until
fragmentation kicks in. The initial location for the anon group range
is, of course, randomized. This is controlled by vm.cluster_anon,
enabled by default.
The default mode keeps the sbrk area unpopulated by other mappings,
but this can be turned off, which gives much more breathing bits on
architectures with small address space, such as i386. This is tied
with the question of following an application's hint about the mmap(2)
base address. Testing shows that ignoring the hint does not affect the
function of common applications, but I would expect more demanding
code could break. By default sbrk is preserved and mmap hints are
satisfied, which can be changed by using the
kern.elf{32,64}.aslr.honor_sbrk sysctl.
ASLR is enabled on per-ABI basis, and currently it is only allowed on
FreeBSD native i386 and amd64 (including compat 32bit) ABIs. Support
for additional architectures will be added after further testing.
Both per-process and per-image controls are implemented:
- procctl(2) adds PROC_ASLR_CTL/PROC_ASLR_STATUS;
- NT_FREEBSD_FCTL_ASLR_DISABLE feature control note bit makes it possible
to force ASLR off for the given binary. (A tool to edit the feature
control note is in development.)
Global controls are:
- kern.elf{32,64}.aslr.enable - for non-fixed mappings done by mmap(2);
- kern.elf{32,64}.aslr.pie_enable - for PIE image activation mappings;
- kern.elf{32,64}.aslr.honor_sbrk - allow to use sbrk area for mmap(2);
- vm.cluster_anon - enables anon mapping clustering.
PR: 208580 (exp runs)
Exp-runs done by: antoine
Reviewed by: markj (previous version)
Discussed with: emaste
Tested by: pho
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D5603
- for now, alignments bigger that page size is allowed only for buffers
allocated by bus_dmamem_alloc(), cover this fact by KASSERT.
- never bounce buffers allocated by bus_dmamem_alloc(), these always comply
with the required rules (alignment, boundary, address range).
MFC after: 1 week
Reviewed by: jah
PR: 235542
reading some events from the interrupt status registers. These events
are reported to devd via system "PMU" and subsystem "Battery", "AC"
and "USB" such as plugged/unplugged, absent, charged and charging.
Reviewed by: manu
Differential Revision: https://reviews.freebsd.org/D19116
Make every rockchip file depend on the multiple soc_rockchip options
While here make rk_i2c and rk_gpio depend on their device options.
Reported by: sbruno
The COVERAGE option breaks xtoolchain-gcc GENERIC kernel early boot
extremely badly and hasn't been fixed for the ~week since it was committed.
Please enable for GENERIC only when it doesn't do that.
Related fallout reported by: lwhsu, tuexen (pr 235611)
This can aid with debugging when a thread is running and has no backtrace.
State can be estimated based on the pcb, and refined from there, for
example, to get a rough idea of the stack pointer.
r241119 that's performed globally by device_attach(9).
- As for the EM-class of devices, em(4) supports multiple queues
and MSI-X respectively only with 82574 devices. However, since
the conversion to iflib(4), em(4) relies on the interrupt type
fallback mechanism, i. e. MSI-X -> MSI -> INTx, of iflib(4) to
figure out the interrupt type to use for the EM-class (as well
as the IGB-class) of MACs. Moreover, despite the datasheet for
82583V not mentioning any support of MSI-X, there actually are
82583V devices out there that report a varying number of MSI-X
messages as supported. The interrupt type fallback of iflib(4)
is causing two failure modes depending on the actual number of
MSI-X messages supported for such instances of 82583V:
1) With only one MSI-X message supported, none is left for the
RX/TX queues as that one message gets assigned to the admin
interrupt. Worse, later on - which will be addressed with a
separate fix - iflib(4) interprets that one messages as MSI
or INTx to be set up, but fails to actually do so as it has
previously called pci_alloc_msix(9). [1, 2]
2) With more message supported, their distribution is okay but
then em_if_msix_intr_assign() doesn't work for 82583V, with
the interface being left in a non-working state, too. [3]
Thus, let em_if_attach_pre() indicate to iflib(4) to try MSI-X
with 82574 only, and at most MSI for the remainder of EM-class
devices.
While at it, remove "try_second_bar" as it's polarity inverted
and not actually needed.
- Remove code from em_if_timer() that effectively is a NOP since
the conversion to iflib(4) ("trigger" is no longer read).
While at it, let the comment for em_if_timer() reflect reality
after said conversion.
- Implement an ifdi_watchdog_reset method which only updates the
em(4) "watchdog_events" counter but doesn't perform any reset,
so that the em(4) "watchdog_timeouts" SYSCTL (iflib(4) doesn't
provide a counterpart) reflects reality and these timeouts add
to IFCOUNTER_OERRORS again after the iflib(4) conversion.
- Remove the "mbuf_defrag_fail" and "tx_dma_fail" SYSCTLS; since
the iflib(4) conversion, associated counters are disconnected,
but iflib(4) provides "mbuf_defrag_failed" and "tx_map_failed"
respectively as equivalents.
- Move the description preceding lem_smartspeed() to the correct
spot before em_reset() and bring back appropriate comments for
{igb,em}_initialize_rss_mapping() and lem_smartspeed() lost in
the iflib(4) conversion.
- Adapt some other function descriptions and INIT_DEBUGOUT() use
to match reality after the iflib(4) conversion.
- Put the debugging message of em_enable_vectors_82574() (missed
in r343578) under bootverbose, too.
PR: 219428 [1], 235246 [2], 235147 [3]
Reviewed by: erj (previous version)
Differential Revision: https://reviews.freebsd.org/D19108
It is currently re-declared in sys/sysent.h which is a wrong place for
MD variable. Which causes redeclaration error with gcc when
sys/sysent.h and machine/md_var.h are included both.
Remove it from sys/sysent.h and instead include machine/md_var.h when
needed, under #ifdef for both i386 and amd64.
Reported and tested by: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
sysctl variable net.inet.tcp.cc.cdg.smoothing_factor to 0, the smoothing
is disabled. Without this patch, a division by zero orrurs.
PR: 193762
Reviewed by: lstewart@, rrs@
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D19071
When configured with more tx queues than rx queues,
em_if_msix_intr_assign() was incorrectly routing the tx event
interrupts.
Reviewed by: erj, marius
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D19070
The vp vnode is unlocked during the execution of the VOP method and
can be reclaimed, zeroing vp->v_data. Caching allows to use the
correct mount point.
Reported and tested by: pho
PR: 235549
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
When renameat(2) is used with:
- absolute path for to;
- tofd not set to AT_FDCWD;
- the target exists
kern_renameat() requires CAP_UNLINK capability on tofd, but
corresponding namei ni_filecap is not initialized at all because the
lookup is absolute. As result, the check was done against empty filecap
and syscall fails erronously.
Fix it by creating a return flags namei member and reporting if the
lookup was absolute, then do not touch to.ni_filecaps at all.
PR: 222258
Reviewed by: jilles, ngie
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
X-MFC-note: KBI breakage
Differential revision: https://reviews.freebsd.org/D19096
Code after exec_fail_dealloc label expects that the image vnode is
locked if present. When copyout() of the strings or auxv vectors fails,
goto to the error handling did not relocked the vnode as required.
The copyout() can be made failing e.g. by creating an ELF image with
PT_GNU_STACK segment disabling the write.
Reported by: Jonathan Stuart <n0t.jcs@gmail.com> (found by fuzzing)
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
When moving from an invalid to a valid entry we don't need to invalidate
the tlb, however we do need to ensure the store is ordered before later
memory accesses. This is because this later access may be to a virtual
address within the newly mapped region.
Add the needed barriers to places where we don't later invalidate the
tlb. When we do invalidate the tlb there will be a barrier to correctly
order this.
This fixes a panic on boot on ThunderX2 when INVARIANTS is turned off:
panic: vm_fault_hold: fault on nofault entry, addr: 0xffff000040c11000
Reported by: jchandra
Tested by: jchandra
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D19097
We need to ensure the page table store has happened before the tlbi.
Reported by: jchandra
Tested by: jchandra
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D19097
For direct mapped kernel addresses, ppc64 function was not
performing the dmap to physical conversion, before jumping
to the code that fetched the value from physical memory.
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D19086
Use the information from IORT parsing to translate the PCI RID to
GIC ITS device ID. And similarly, use the information to find the
PIC XREF identifier to be used for PCI devices.
Reviewed by: andrew
Differential Revision: https://reviews.freebsd.org/D18004