When calling file_findfile with only a type it returns
the first file matching the type. But in fdt_apply_overlays we
then iterate on the next files and try loading them as dtb overlays.
Fix this by checking the type one more time.
Sponsored by: Diablotin Systems
Reported by: Mark Millard <marklmi@yahoo.com>
Instead serialize against these operations with a dedicated lock.
Prior to the change, When pushing 17 mln pps of traffic, calling
DIOCRGETTSTATS in a loop would restrict throughput to about 7 mln. With
the change there is no slowdown.
Reviewed by: kp (previous version)
Sponsored by: Rubicon Communications, LLC ("Netgate")
Creating tables and zeroing their counters induces excessive IPIs (14
per table), which in turns kills single- and multi-threaded performance.
Work around the problem by extending per-CPU counters with a general
counter populated on "zeroing" requests -- it stores the currently found
sum. Then requests to report the current value are the sum of per-CPU
counters subtracted by the saved value.
Sample timings when loading a config with 100k tables on a 104-way box:
stock:
pfctl -f tables100000.conf 0.39s user 69.37s system 99% cpu 1:09.76 total
pfctl -f tables100000.conf 0.40s user 68.14s system 99% cpu 1:08.54 total
patched:
pfctl -f tables100000.conf 0.35s user 6.41s system 99% cpu 6.771 total
pfctl -f tables100000.conf 0.48s user 6.47s system 99% cpu 6.949 total
Reviewed by: kp (previous version)
Sponsored by: Rubicon Communications, LLC ("Netgate")
as a thin wrapper around native version found in sys/seqc.h.
This replaces out-of-base GPLv2-licensed code used by drm-kmod.
Reviewed by: hselasky
Differential revision: https://reviews.freebsd.org/D31006
strscpy copies the src string, or as much of it as fits, into the dst
buffer. The dst buffer is always NUL terminated, unless it's zero-sized.
strscpy returns the number of characters copied (not including the
trailing NUL) or -E2BIG if len is 0 or src was truncated.
Currently drm-kmod replaces strscpy with strncpy that is not quite
correct as strncpy does not NUL-terminate truncated strings and returns
different values on exit.
Reviewed by: hselasky, imp, manu
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D31005
This allows to remove unimplemented attrs parameter which type differs
between Linux kernel versions and to compile both drm-kmod and ofed
callers unmodified.
Also convert it to 'unsigned long' type to match modern Linuxes.
Reviewed by: hselasky
Differential revision: https://reviews.freebsd.org/D30932
Linux docs explicitly state that this is not required [1]:
"Important note: The rcu_barrier() function is not, repeat, not,
obligated to wait for a grace period. It is instead only required to
wait for RCU callbacks that have already been posted. Therefore, if
there are no RCU callbacks posted anywhere in the system, rcu_barrier()
is within its rights to return immediately. Even if there are
callbacks posted, rcu_barrier() does not necessarily need to wait for
a grace period."
[1] https://www.kernel.org/doc/Documentation/RCU/Design/Requirements/Requirements.html
Reviewed by: emaste, hselasky, manu
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D30809
so this list-traversal primitive may safely run concurrently with the
_rcu list-mutation primitives such as list_add_rcu() as long as the
traversal is guarded by rcu_read_lock().
Do it by reusing the "list_for_each_entry_rcu" macro which does the same.
On Linux it implements some additional lockdep stuff which we skip.
Also move the macro to linux/rculist.h where it resides on Linux.
Reviewed by: hselasky
Differential revision: https://reviews.freebsd.org/D30795
as it is required by i915kms driver from Linux kernel v 5.5.
This is done with asynchronous freeing of requested memory areas from
taskqueue thread. As memory to be freed is reused to store linked list
entry, backing UMA zone item size is rounded up to pointer size.
While here, make struct linux_kmem_cache private to LKPI to reduce amount
of BSD headers included by linux/slab.h and switch RCU code to usage of
LKPI's linux_irq_work_tq taskqueue to avoid injection of current into
system-wide taskqueue_fast thread context.
Submitted by: nc (initial version for drm-kmod)
Reviewed by: manu, nc
Differential revision: https://reviews.freebsd.org/D30760
The kernel part of ipfw(8) does initialize LibAlias uncondistionally
with an zeroized port range (allowed ports from 0 to 0). During
restucturing of libalias, port ranges are used everytime and are
therefor initialized with different values than zero. The secondary
initialization from ipfw (and probably others) overrides the new
default values and leave the instance in an unfunctional state. The
obvious solution is to detect such reinitializations and use the new
default value instead.
MFC after: 3 days
Only ESRT and PROP tables are handled at the moment.
Submitted by: Pavel Balaev <pavel.balaev@3mdeb.com>
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D30104
Make it work, but change the interface to be safe for non-root users. In
particular, right now interface only works for the tables which can be
minimally parsed by kernel to determine the table size. Then, userspace can
query the table size, after that it provides a buffer of needed size
and kernel copies out just table to userspace.
Main advantage is that user no longer need to be able to read /dev/mem,
the disadvantage is the need to have minimal parsers aware of the table
types. Right now the parsers are implemented for ESRT and PROP tables.
Future extension of the present interface might be a return of only
the table physical address, in case kernel does not have suitable
parser yet. Then, a privileged user could read the table from /dev/mem.
This extension, which logically equivalent to the old (non-worked)
EFIIOC_GET_TABLE variant, is not implemented until needed.
Submitted by: Pavel Balaev <pavel.balaev@3mdeb.com>
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D30104
This makes prctl(2) support PR_SET_NO_NEW_PRIVS, by mapping it
to the native PROC_NO_NEW_PRIVS_CTL procctl(2).
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D30973
Refine mistakes from adaptaton of NetBSD's hardclock man page to
FreeBSD:
o clarify what usermode means
o clarify how often hardclock is called
o remove Xr callout(9) since that's done elsewhere
Reviewed by: mav@
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D30982
The expiration time of direct address mappings is explicitly
uninitialized. Expire times are always compared during housekeeping.
Despite the uninitialized value does not harm, it's simpler to just
set it to a reasonable default. This was detected during valgrinding
the test suite.
MFC after: 3 days
Revert commit 22b615a96593 from llvm git (by Daniel Kiss):
[libunwind] Support for leaf function unwinding.
Unwinding leaf function is useful in cases when the backtrace finds a
leaf function for example when it caused a signal.
This patch also add the support for the DW_CFA_undefined because it marks
the end of the frames.
Ryan Prichard provided code for the tests.
Reviewed By: #libunwind, mstorsjo
Differential Revision: https://reviews.llvm.org/D83573
Reland with limit the test to the x86_64-linux target.
Bisection has shown that this particular upstream commit causes programs
using backtrace(3) on aarch64 to segfault. This affects the lang/rust
port, for instance. Until we can upstream to fix this problem, revert
the commit for now.
Reported by: mikael
PR: 256864
Comparing elements in a tree requires transitiviy. If a < b and b < c
then a must be smaller than c. This way the tree elements are always
pairwise comparable.
Tristate comparsion functions returning values lower, equal, or
greater than zero, are usually implemented by a simple subtraction of
the operands. If the size of the operands are equal to the size of
the result, integer modular arithmetics kick in and violates the
transitivity.
Example:
Working on byte with 0, 120, and 240. Now computing the differences:
120 - 0 = 120
240 - 120 = 120
240 - 0 = -16
MFC after: 3 days
Add an option to dumpfs, `-s`, that only prints the super block information.
Reviewed by: chs, imp
Differential Revision: https://reviews.freebsd.org/D30881
Coherently read the phase bit of the status completion record. We loop
over the completion record array, looking for all the transactions in
the same phase that have been completed. In doing that, we have to be
careful to read the status field first, and if it indicates a complete
record, we need to read and process that record. Otherwise, the host
might be overtaken by device when reading this completion record,
leading to a mistaken belief that the record is in phase. This leads to
the code using old values and looking at an already completed entry, which
has no current tracker.
To work around this problem, we read the status and make sure it is in
phase, we then re-read the entire completion record guaranteeing it's
complete, valid, and consistent . In addition we resync the dmatag to
reflect changes since the prior loop for the bouncing dma case.
Reviewed by: jrtc27@, chuck@
Found by: jrtc27 (this fix is based in part on her D30995 fix)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D31002
Remove __packed from nvme_command, nvme_completion and
nvme_dsm_trim. Add super-alignment to nvme_completion since it's always
at least that aligned in hardware (and in our existing uses of it
embedded in structures). It generates better code in
nvme_qpair_process_completions on riscv64 because otherwise the ABI
assumes a 4-byte alignment, and the same on all other platforms.
Reviewed by: jrtc27@, mav@, chuck@
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D31001
Put the { on the same line as the struct nvme_foo when we define these
structures. It's FreeBSD standard and these were inconsistent.
Sponsored by: Netflix
Currently, there are several places in zvol_id where the program logic
returns particular errno values, or even particular ioctl return values,
as the program exit status, rather than a straightforward system of
explicit zero on success and explicit nonzero value(s) on failure.
This is problematic for multiple reasons. One particularly interesting
problem that can arise, is that if any of these values happens to have
all 8 least significant bits unset (i.e., it is a positive or negative
multiple of 256), then although the C program sees a nonzero int value
(presumed to be a failure exit status), the actual exit status as seen
by the system is only the bottom 8 bits of that integer: zero.
This can happen in practice, and I have encountered it myself. In a
particularly weird situation, the zvol_open code in the zfs kernel
module was behaving in such a manner that it caused the open() syscall
to fail and for errno to be set to a kernel-private value (ERESTARTSYS,
which happens to be defined as 512). It turns out that 512 is evenly
divisible by 256; or, in other words, its least significant 8 bits are
all-zero. So even though zvol_id believed it was returning a nonzero
(failure) exit status of 512, the system modulo'd that value by 256,
resulting in the actual exit status visible by other programs being 0!
This actually-zero (non-failure) exit status caused problems: udev
believed that the program was operating successfully, when in fact it
was attempting to indicate failure via a nonzero exit status integer.
Combined with another problem, this led to the creation of nonsense
symlinks for zvol dev nodes by udev.
Let's get rid of all this problematic logic, and simply return
EXIT_SUCCESS (0) is everything went fine, and EXIT_FAILURE (1) if
anything went wrong.
Additionally, let's clarify some of the variable names (error is similar
to errno, etc) and clean up the overall program flow a bit.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Justin Gottula <justin@jgottula.com>
Closes#12302
The zvol_id program is invoked by udev, via a PROGRAM key in the
60-zvol.rules.in rule file, to determine the "pretty" /dev/zvol/*
symlink paths paths that should be generated for each opaquely named
/dev/zd* dev node.
The udev rule uses the PROGRAM key, followed by a SYMLINK+= assignment
containing the %c substitution, to collect the program's stdout and then
"paste" it directly into the name of the symlink(s) to be created.
Unfortunately, as currently written, zvol_id outputs both its intended
output (a single string representing the symlink path that should be
created to refer to the name of the dataset whose /dev/zd* path is
given) AND its error messages (if any) to stdout.
When processing PROGRAM keys (and others, such as IMPORT{program}), udev
uses only the data written to stdout for functional purposes. Any data
written to stderr is used solely for the purposes of logging (if udev's
log_level is set to debug).
The unintended consequence of this is as follows: if zvol_id encounters
an error condition; and then udev fails to halt processing of the
current rule (either because zvol_id didn't return a nonzero exit
status, or because the PROGRAM key in the rule wasn't written properly
to result in a "non-match" condition that would stop the current rule on
a nonzero exit); then udev will create a space-delimited list of symlink
names derived directly from the words of the error message string!
I've observed this exact behavior on my own system, in a situation where
the open() syscall on /dev/zd* dev nodes was failing sporadically (for
reasons that aren't especially relevant here). Because the open() call
failed, zvol_id printed "Unable to open device file: /dev/zd736\n" to
stdout and then exited.
The udev rule finished with SYMLINK+="zvol/%c %c". Assuming a volume
name like pool/foo/bar, this would ordinarily expand to
SYMLINK+="zvol/pool/foo/bar pool/foo/bar"
and would cause symlinks to be created like this:
/dev/zvol/pool/foo/bar -> /dev/zd736
/dev/pool/foo/bar -> /dev/zd736
But because of the combination of error messages being printed to
stdout, and the udev syntax freely accepting a space-delimited sequence
of names in this context, the error message string
"Unable to open device file: /dev/zd736\n"
in reality expanded to
SYMLINK+="zvol/Unable to open device file: /dev/zd736"
which caused the following symlinks to actually be created:
/dev/zvol/Unable -> /dev/zd736
/dev/to -> /dev/zd736
/dev/open -> /dev/zd736
/dev/device -> /dev/zd736
/dev/file: -> /dev/zd736
/dev//dev/zd736 -> /dev/zd736
(And, because multiple zvols had open() syscall errors, multiple zvols
attempted to claim several of those symlink names, resulting in numerous
udev errors and timeouts and general chaos.)
This commit rectifies all this silliness by simply printing error
messages to stderr, as Dennis Ritchie originally intended.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Justin Gottula <justin@jgottula.com>
Closes#12302
Assignment syntax (=) can be used for the PROGRAM key. But the PROGRAM
key is really a match key, not an assign key. The internal logic used by
udev to decide whether a PROGRAM key "matched" or not (which determines
whether the remainder of the rule is evaluated) depends on whether the
operator was OP_MATCH (==) or OP_NOMATCH (!=). [1]
The man page claims that '"=", ":=", and "+=" have the same effect as
"=="' for PROGRAM keys. And, after a brief perusal, the udev source code
does seem to confirm that operators other than OP_MATCH (==) or
OP_NOMATCH (!=) are implicitly converted to OP_MATCH (==). [2] But it's
not entirely clear that this is definitely the case: anecdotal testing
seems to indicate that when OP_ASSIGN (=) is used, the program's exit
status is disregarded and the remainder of the rule is processed
regardless of whether it was, in fact, a successful exit.
The bottom line here is that, if zvol_id hits some snag and returns a
nonzero exit status, then we almost certainly do NOT want to continue on
with the rule and use whatever the stdout contents may have been to
mindlessly create /dev/zvol/* symlinks. Therefore, let's be extra-sure
and use the match (==) operator explicitly, to eliminate any possibility
that udev might do the wrong thing, and ensure that a nonzero exit
status will definitely short-circuit the rest of the rule, bypassing the
SYMLINK+= assignments.
[1]
udev,
file src/udev/udev-rules.c,
func udev_rule_apply_token_to_event,
switch case TK_M_PROGRAM if r != 0 (nonzero exit status):
return token->op == OP_NOMATCH;
switch case TK_M_PROGRAM if r == 0 (zero exit status):
return token->op == OP_MATCH;
func retval 0 => key is considered to have matched
func retval 1 => key is considered to have NOT matched
[2]
udev,
file src/udev/udev-rules.c,
func parse_token,
at func start:
bool is_match = IN_SET(op, OP_MATCH, OP_NOMATCH);
in else-if case streq(key, "PROGRAM"):
if (!is_match) op = OP_MATCH;
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Justin Gottula <justin@jgottula.com>
Closes#12302
The $tempnode substitution is so old that it's not even mentioned in the
man page anymore. It is still technically supported by udev, but with
plenty of "deprecated" comments surrounding it.
The preferred modern equivalent of $tempnode is $devnode (or
alternatively, %N).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Justin Gottula <justin@jgottula.com>
Closes#12302
This file is old as dirt. It's entirely possible that commas were
optional in udev back at that time. But they're definitely supposed to
be there nowadays.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Justin Gottula <justin@jgottula.com>
Closes#12302
We must remember to free the nvlist we create from the kernel's response
to DIOCGETSTATESNV, on every iteration.
Reviewed by: donner
MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D30957
This call is particularly slow due to the large amount of data it
returns. Remove all fields pfctl does not use. There is no functional
impact to pfctl, but it somewhat speeds up the call.
It might affect other (i.e. non-FreeBSD) code that uses the new
interface, but this call is very new, so there's unlikely to be any. No
releases contained the previous version, so we choose to live with the
ABI modification.
Reviewed by: donner
MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D30944
Create and retrieve 20.000 states. There have been issues with nvlists
causing very slow state retrieval. We don't impose a specific limit on
the time required to retrieve the states, but do log it. In excessive
cases the Kyua timeout will fail this test.
Reviewed by: donner
MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D30943