In this function we always work with mbufs that we previously
created ourselves in unp_internalize(). They must be valid.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35319
Now that we call sbcreatecontrol() with M_WAITOK, we are expected to
pass a valid size. Return same error code, we are returning for an
oversized control from sockargs().
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35317
Specification doesn't list an explicit error code for the control
size specified by msg_control being too large. But it does list
EMSGSIZE as error code for "message is too large to be sent all at
once (as the socket requires)". It also lists EINVAL as code for
the "The sum of the iov_len values overflows an ssize_t." Given
how generic and uninformative EINVAL is, the EMSGSIZE is more
appropriate.
https://pubs.opengroup.org/onlinepubs/9699919799/functions/sendmsg.html
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35316
Suppose a periodic kevent timer fires close to its deadline, so that
now - kc->next is small. Then delta ends up being 1, and the next timer
deadline is set to (delta + 1) * kc->to, where kc->to is the timer
period. This means that the timer fires at half of the requested rate,
and the value returned in kn_data is similarly inaccurate.
PR: 264131
Fixes: 7cb40543e9 ("filt_timerexpire: do not iterate over the interval")
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35313
Rename the "copyin" and "copyout" fields of struct cpuset_copy_cb to
something less generic, since sanitizers define interceptors for
copyin() and copyout() using #define.
Reported by: syzbot+2db5d644097fc698fb6f@syzkaller.appspotmail.com
Fixes: 47a57144af ("cpuset: Byte swap cpuset for compat32 on big endian architectures")
Sponsored by: The FreeBSD Foundation
Summary:
BITSET uses long as its basic underlying type, which is dependent on the
compile type, meaning on 32-bit builds the basic type is 32 bits, but on
64-bit builds it's 64 bits. On little endian architectures this doesn't
matter, because the LSB is always at the low bit, so the words get
effectively concatenated moving between 32-bit and 64-bit, but on
big-endian architectures it throws a wrench in, as setting bit 0 in
32-bit mode is equivalent to setting bit 32 in 64-bit mode. To
demonstrate:
32-bit mode:
BIT_SET(foo, 0): 0x00000001
64-bit sees: 0x0000000100000000
cpuset is the only system interface that uses bitsets, so solve this
by swapping the integer sub-components at the copyin/copyout points.
Reviewed by: kib
MFC after: 3 days
Sponsored by: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D35225
When locking a priority inherit mutex we perform a compare and swap
operation to try and acquire the mutex. This may fail even when the
compare succeeds.
Check and handle this case.
PR: 263825
Reviewed by: kib, markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35150
Kthread worker is a single thread workqueue which can be used in cases
where specific kthread association is necessary, for example, when it
should have RT priority or be assigned to certain cgroup.
This change implements Linux v4.9 interface which mostly hides kthread
internals from users thus allowing to use ordinary taskqueue(9) KPI.
As kthread worker prohibits enqueueing of already pending or canceling
tasks some minimal changes to taskqueue(9) were done.
taskqueue_enqueue_flags() was added to taskqueue KPI which accepts extra
flags parameter. It contains one or more of the following flags:
TASKQUEUE_FAIL_IF_PENDING - taskqueue_enqueue_flags() fails if the task
is already scheduled to execution. EEXIST is returned and the
ta_pending counter value remains unchanged.
TASKQUEUE_FAIL_IF_CANCELING - taskqueue_enqueue_flags() fails if the
task is in the canceling state and ECANCELED is returned.
Required by: drm-kmod 5.10
MFC after: 1 week
Reviewed by: hselasky, Pau Amma (docs)
Differential Revision: https://reviews.freebsd.org/D35051
Without this patch, the MSG_TLSAPPDATA flag would cause
soreceive_generic() to return ENXIO for any non-application
data record in a TLS receive stream.
This works ok for TLS1.2, since Alert records appear to be
the only non-application data records received.
However, for TLS1.3, there can be post-handshake handshake
records, such as NewSessionKey sent to the client from the
server. These handshake records cannot be handled by the
upcall which does an SSL_read() with length == 0.
It appears that the client can simply throw away these
NewSessionKey records, but to do so, it needs to receive
them within the kernel.
This patch modifies the semantics of MSG_TLSAPPDATA slightly,
so that it only applies to Alert records and not Handshake
records. It is needed to allow the krpc to work with KTLS1.3.
Reviewed by: hselasky
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D35170
Per jamie@ rpr can be NULL if the jail is created with sysvsem=disable.
But at least it doesn't appear to be fatal, since rpr is never dereferenced
but is only compared to other prison pointers.
Reviewed by: jamie
Differential revision: https://reviews.freebsd.org/D35198
MFC after: 2 weeks
It is unused, especially now that the underlying d_dumper methods do not
accept the argument.
Reviewed by: markj
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D35174
The physical address argument is essentially ignored by every dumper
method. In addition, the dump routines don't actually pass a real
address; every call to dump_append() passes a value of zero for
physical.
Reviewed by: markj
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D35173
It appears to be unused. These days struct disk has a d_dump member,
which is what gets passed to the kernel dump framework.
Reviewed by: markj
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D35172
This change is also a preparation for further optimization to
allow locked return on success.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35182
Assert that sockets are of the same type. unp_connectat() already did
this check. Add the check to uipc_connect2().
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35181
Since c67f3b8b78 the sockbuf mutexes belong to the containing socket,
and socket buffers just point to it. In 74a68313b5 macros that access
this mutex directly were added. Go over the core socket code and
eliminate code that reaches the mutex by dereferencing the sockbuf
compatibility pointer.
This change requires a KPI change, as some functions were given the
sockbuf pointer only without any hint if it is a receive or send buffer.
This change doesn't cover the whole kernel, many protocols still use
compatibility pointers internally. However, it allows operation of a
protocol that doesn't use them.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35152
First paragraph refers to old past "we used to" and is no longer
important today. Second paragraph has just a wrong statement that
socket buffer is destroyed before pru_detach.
Linux has more tolerant checks of the user supplied cpuset_t's.
Minimum cpuset_t size that the Linux kernel permits in case of
getaffinity() is the maximum CPU id, present in the system / NBBY,
the maximum size is not limited.
For setaffinity(), Linux does not limit the size of the user-provided
cpuset_t, internally using only the meaningful part of the set, where
the upper bound is the maximum CPU id, present in the system, no larger
than the size of the kernel cpuset_t.
Unlike FreeBSD, Linux ignores high bits if set in the setaffinity(),
so clear it in the sched_setaffinity() and Linuxulator itself.
Reviewed by: Pau Amma (man pages)
In collaboration with: jhb
Differential revision: https://reviews.freebsd.org/D34849
MFC after: 2 weeks
The hack can be tracked down to 4.4BSD, where copy was performed
under splimp() and then after splx() dom_dispose was called.
Stevens has a chapter on this function, but he doesn't answer why
this trick is necessary. Why can't we call into dom_dispose under
splimp()? Anyway, with multithreaded kernel the hack seems to be
necessary to avoid LORs between socket buffer lock and different
filesystem locks, especially network file systems.
The new socket buffers KPI sbcut() from 1d2df300e9 allow us to get
rid of the hack.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35125
sorflush() already did the right thing, so only sofree() needed
a fix. Turn check into assertion in our only dom_dispose method.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35124
Today, kern.msgbuf_show_timestamp=1 will give 1 second granularity
timestamps on dmesg lines. When kern.msgbuf_show_timestamp=2, we'll
produce microsecond level graunlarity.
For example:
old (== 1):
[13] Dual Console: Video Primary, Serial Secondary
[14] lo0: link state changed to UP
[15] bxe0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
[15] bxe0: link state changed to UP
new (== 2):
[13.807015] Dual Console: Video Primary, Serial Secondary
[14.544150] lo0: link state changed to UP
[15.272044] bxe0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
[15.272052] bxe0: link state changed to UP
Sponsored by: Netflix
The old fixed-point arithmetic used for calculating load averages had an
overflow at 1024. So on systems with extremely high load, the observed
load average would actually fall back to 0 and shoot up again, creating
a kind of sawtooth graph.
Fix this by using 64-bit math internally, while still reporting the load
average to userspace as a 32-bit number.
Sponsored by: Axcient
Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D35134
For future use in the Linux emulation layer for the semtimedop syscall
split the sys_semop syscall into two counterparts and add
struct timespec *timeout argument to the last one.
Reviewed by: jhb, kib
Differential revision: https://reviews.freebsd.org/D35121
MFC after: 2 weeks
When we remove an interface it is first removed from the interface list
V_ifnet (by if_unlink_ifnet()) and marked as IFF_DYING. We then wait for
any possible references to stop being used (i.e.
epoch_wait/epoch_drain_callbacks) before we tear it fully down.
However, the index in ifindex_table is not removed, so m_rcvif_restore()
can still find the (now dying) interface.
This results in panics, for example when dummynet restores the rcvif
pointer and passes a packet to ip6_input() we can panic because the
AF_INET6 domain has already been removed (so we end up dereferencing a
NULL pointer there).
Check that the interface is not dying before we restore it, which is
equivalent to checking its presence in V_ifnet, and thus ensures that
future accesses (while in NET_EPOCH) are safe.
Reviewed by: glebius
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D34076
(cherry picked from commit 703e533da5)
Supplement ifindex table with generation count and use it to
serialize & restore an ifnet pointer.
Reviewed by: kp
Differential revision: https://reviews.freebsd.org/D33266
Fun note: git show e6abef0918
(cherry picked from commit e1882428dc)
This reverts commit 703e533da5.
Revert "ifnet/mbuf: provide KPI to serialize/restore m->m_pkthdr.rcvif"
This reverts commit e1882428dc.
Obtained from: github.com/glebius/FreeBSD/commits/backout-ifindex
where we might not yet see a new child when signalling a process.
Ensure that this cannot happen by stopping all reapping subtree,
which ensures that the child is not inside a syscall, in particular
fork(2).
Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D35014
by repeating iteration over the subtree until there are no new processes
to signal.
Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D35014
stop_allo_proc_block() must be taken before proctree_lock.
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D35014
It allows to have more than one consumer of thread_signle(SIGNLE_ALLPROC) by
serializing them.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D35014
p_reapsibling linkage is protected by proctree_lock, and it is modified
there.
Suggested and reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D35014
Our kern_sigtimedwait() calculates absolute sleep timo value as 'uptime+timeout'.
So, when the user specifies a big timeout value (LONG_MAX), the calculated
timo can be less the the current uptime value.
In that case kern_sigtimedwait() returns EAGAIN instead of EINTR, if
unblocked signal was caught.
While here switch to a high-precision sleep method.
Reviewed by: mav, kib
In collaboration with: mav
Differential revision: https://reviews.freebsd.org/D34981
MFC after: 2 weeks
Instead, create a switch structure private to ktls_ocf.c and store a
pointer to the switch in the ocf_session. This will permit adding an
additional function pointer needed for NIC TLS RX without further
bloating ktls_session.
Reviewed by: hselasky
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D35011
Deduplicate code to iterate over the bpages list in a bus_dmamap_t
freeing bounce pages during bus_dmamap_unload.
Reviewed by: imp
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D34967
When pages are freed to a bounce zone, only maps waiting for pages for
that zone can make forward progress. If a map for a different bounce
zone is at the head of the global list, then requests that could
otherwise make forward progress will be stalled waiting on the other
bounce zone. If bounce zones shared bounce pages then a global list
would still make sense to prevent "later" requests from starving an
earlier request but that is not a concern with per-zone bounce page
pools.
Reviewed by: imp
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D34966
Rather than using a software interrupt with a single handler, just
create a dedicated kernel process woken up with a simple wakeup().
Reviewed by: imp
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D34965
Add a new PI_SOFTCLOCK for use by softclock threads. Currently this
maps to PI_AV which is the second-highest ithread priority.
Reviewed by: mav, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D33693
15b1eb142c changed the callout code to store the CALLOUT_SHAREDLOCK flag
in c_iflags (where it used to be c_flags), but failed to update the
check in softclock_call_cc(). This resulted in the callout code always
taking the write lock, even if a read lock had been requested (with
the CALLOUT_SHAREDLOCK flag in callout_init_rm()).
Reviewed by: markj
MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D34959
This permits a driver module structure that doesn't want to store a
pointer to the new driver's devclass.
Reviewed by: imp
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D34962
Rather than using taskqueue_swi_giant which holds Giant for all
deferred destroy_dev calls, create a separate queue for destroyed
devices with D_NEEDGIANT set in the corresponding cdevsw. The task
for this queue holds Giant whild destroying deferred devices while the
task for the default queue does not hold Giant.
In addition, switch to taskqueue_thread for destroy_dev_sched.
Deferred destroy_dev requests don't need to run at an SWI priority.
Reviewed by: imp, markj
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D34915
Now, since O_PATH-opened file descriptors use use references instead
of the hold references, vrefact() chahges from that revision can be
reverted.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D34906
posix_fadvise operates only on a provided fd. Noted by
Mathieu <sigsys@gmail.com> in review D34761.
No new CAP_ rights are added for posix_fadvise(), as 'advice' in
general only influences when I/O happens; the fd must have existing
CAP_ rights for actual data access.
Reviewed by: markj
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34903
Problem is that open(O_PATH) on nullfs -o nocache is broken then,
because there is no reference on the vnode after the open syscall exits.
Reported and tested by: ambrisko
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
We may legitimately have tib == NULL if we're at the very end of the
queue.
PR: 215373
Reported by: pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
To be more compatible to IEEE Std 1003.1-2008 (“POSIX.1”).
Reviewed by: mjg, Pau Amma (doc)
Differential revision: https://reviews.freebsd.org/D34680
MFC after: 2 weeks
reporting the shapshot of the active advisory locks.
A new VFS ops method vfs_report_lockf if provided in the mount point
op table. If it is NULL, as it is currently for all existing
filesystems, vfs_report_lockf() function is used, which gathers
information from the standard implementation inside kern/kern_lockf.c.
Filesystems implementing its own locking (NFSv4 as example) can provide
a custom implementation.
Reviewed by: markj, rmacklem
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D34756
The UFS-specific struct inode cannot be used in generic advisory lock
code. It was probably used as a shortcut for the debugging, as the
remnants of the code around it indicates.
Use somewhat more verbose and less concentrated, but universal,
VOP_PRINT(), where needed.
Reviewed by: markj, rmacklem
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D34756
In some configurations the firmware may pass memory regions that are
not page sized or aligned, e.g. when using 16k pages on arm64. If this
is the case we will calculate many small regions because the alignment
is applied before being inserted. As we round the start up and end down
this will leave a 1 page hole between what should have been a single
region.
Fix by keeping the original alignment until we are just about to insert
the region into the avail array.
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34694
These give us some confidience we haven't broken anything in early
boot code that may be running before the console.
Reviewed by: emaste
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34691
Add three hooks to the livedump process: before, after, and for each
block of dumped data. This allows, for example, quiescing the system
before the dump begins or protecting data of interest to ensure its
consistency in the final output.
Reviewed by: markj, kib (previous version)
Reviewed by: debdrup (manpages)
Reviewed by: Pau Amma <pauamma@gundo.com> (manpages)
MFC after: 3 weeks
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D34067
This dumper can instantiate and write the dump's contents to a
file-backed vnode.
Unlike existing disk or network dumpers, the vnode dumper should not be
invoked during a system panic, and therefore is not added to the global
dumper_configs list. Instead, the vnode dumper is constructed ad-hoc
when a live dump is requested using the new ioctl on /dev/mem. This is
similar in spirit to a kgdb session against the live system via
/dev/mem.
As described briefly in the mem(4) man page, live dumps are not
guaranteed to result in a usuable output file, but offer some debugging
value where forcefully panicing a system to dump its memory is not
desirable/feasible.
A future change to savecore(8) will add an option to save a live dump.
Reviewed by: markj, Pau Amma <pauamma@gundo.com> (manpages)
Discussed with: kib
MFC after: 3 weeks
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D33813
Add a new function, dumper_create(), to allocate a dumper.
dumper_insert() will call this function and retains the existing
behaviour.
This is desirable for performing live dumps of the system. Here, there
is a need to allocate and configure a dumper structure that is invoked
outside of the typical debugger context. Therefore, it should be
excluded from the list of panic-time dumpers.
free_single_dumper() is made public and renamed to dumper_destroy().
Reviewed by: kib, markj
MFC after: 1 week
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D34068
In some cases vn_open_cred overwrites cn_flags, effectively nullifying
initialisation done in NDINIT. This will have to be fixed.
In the meantime make sure the flag is passed.
Reported by: jenkins
Noted by: Mathieu <sigsys@gmail.com>
This more clearly differentiates system call arguments from integer
registers and return values. On current architectures it has no effect,
but on architectures where pointers are not integers (CHERI) and may
not even share registers (CHERI-MIPS) it is necessiary to differentiate
between system call arguments (syscallarg_t) and integer register values
(register_t).
Obtained from: CheriBSD
Reviewed by: imp, kib
Differential Revision: https://reviews.freebsd.org/D33780
Make sure both sides of a comparison are unsigned. As the values being
compared are size_t make the the value in the for loop size_t too.
Sponsored by: The FreeBSD Foundation
This completes the patch which was originally meant to go in.
Spotted by: mhorne
Fixes: c35ec1efdc ("vfs: [1/2] fix stalls in vnode reclaim by not
requeieing from vnlru")
It expects exactly one of those flags. A future commit will assert this.
Reviewed by: rstone
MFC after: 1 month
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D34451
Turns out execve looks at it to store binary name, but in order to
trigger the problem one has to be trying to exec '/'. As is the value
would be left uninitialized (or rather set to -1 on debug kernels).
Fixes: 56244d3574 ("vfs: hoist degenerate path lookups out of the
loop")
We use the p_itcallout callout, interlocked by the proc lock, to
schedule timeouts for the setitimer(2) system call. When a process
exits, the callout must be stopped before the process struct is
recycled.
Currently we attempt to stop the callout in exit1() with the call
_callout_stop_safe(&p->p_itcallout, CS_EXECUTING). If this call returns
0, then we sleep in order to drain the callout. However, this happens
only if the callout is not scheduled at all. If the callout thread is
blocked on the proc lock, then exit1() will not block and the callout
may execute after the process has fully exited, typically resulting in a
panic.
I cannot see a reason to use the CS_EXECUTING flag here. Instead, use
the regular callout_stop()/callout_drain() dance to halt the callout.
Reported by: ler
Tested by: ler, pho
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34625
umtxq_requeue() moves the queue to a different hash chain and different
lock, so we can't rely on msleep_sbt() reacquiring the same old lock.
We have to use PDROP and update the queue chain and so lock pointer.
PR: 262587
MFC after: 2 weeks
cpu ticks) has some imprecision and, worse, huge timestep (about
20 minutes on 4GHz CPU) near 53.4 days of elapsed time.
kern_time.c/cputick2timespec() (it is used for clock_gettime() for
querying process or thread consumed cpu time) Uses cputick2usec()
and then needlessly converting usec to nsec, obviously losing
precision even with fixed cputick2usec().
kern_time.c/kern_clock_getres() uses some weird (anyway wrong)
formula for getting cputick resolution.
PR: 262215
Reviewed by: gnn
Differential Revision: https://reviews.freebsd.org/D34558
This is used to calculate sizes that are then stored in unsigned long
fields. Make this unsigned long so the calculations use this type and
not an int that can lead to an integer overflow with a large PAGE_SIZE.
This allows building this on arm64 with PAGE_SIZE of 16k. Further work
will be needed if a 32-bit architecture tries to use a similar sized
page.
Sponsored by: The FreeBSD Foundation
The timestamp logs are quite large (often much larger than all the
other sysctls combined) so it's unlikely anyone will want to have
them displayed by `sysctl -a`.
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D34616
Can be used when the fs at hand can synchronize insmntque with other
means than the vnode lock.
Reviewed by: markj
Tested by: pho (previous version)
Differential Revision: https://reviews.freebsd.org/D34466
Use a spinlock section instead of a critical section to synchronize with
statclock(). Otherwise the CLOCK_THREAD_CPUTIME_ID clock can appear to
go backwards.
PR: 262273
Reviewed by: markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D34568
Some loops access the fd table of a different process, and drop the
filedesc lock while iterating, so they check the table's refcount.
However, we access the table before the first iteration, in order to get
the number of table entries, and this access can be a use-after-free.
Fix the problem by checking the refcount before we start iterating.
Reported by: pho
Reviewed by: mjg
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34575
In particular, use a generic wrapper around struct regset rather than
requiring per-regset helpers. This helper replaces the MI
__elfN(note_prstatus) and __elfN(note_fpregset) helpers. It also
removes the need to explicitly dump NT_ARM_ADDR_MASK in the arm64
__elfN(dump_thread).
Reviewed by: markj, emaste
Sponsored by: University of Cambridge, Google, Inc.
Differential Revision: https://reviews.freebsd.org/D34446
In order to support various types of data stored in device
tree properties or ACPI _DSD packages, create a new enum so
the caller can specify the expected type of a property they
want to read, according to the binding. The bus logic will use
that information to process the underlying data.
For example in DT all integer properties are stored in BE format.
In order to get constant results across different platforms we
need to convert its endianness to match the host.
Another example are ACPI_TYPE_INTEGER properties stored
as uint64_t. Before this patch the ACPI logic would refuse
to read them if the provided buffer was smaller than 8 bytes.
Now this can be handled by using DEVICE_PROP_UINT32 type.
Modify the existing consumers of this API to reflect the changes
and update the man pages accordingly.
Reviewed by: mw
Obtained from: Semihalf
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D33457
There are multiple buses that pretend to be ofw compatible,
e.g ofw_pci, mii_fdt. We now need to provide an implementation
of BUS_GET_PROPERTY for every one of them. Instead of modifying
them one by one it's better to just provide a default
implementation that simply traverses up the device tree.
Remove the now unneeded BUS_GET_PROPERTY implementation in mii_fdt.
Reviewed by: andrew, bz
Obtained from: Semihalf
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D34031
It appears to have introduced a regression on arm64, possibly due to the
fact that the pcpu pointer is reloaded outside of the critical section
in _rm_rlock(). Until this is resolved one way or another, let's
revert.
Reported by: Ronald Klop <ronald-lists@klop.ws>
Sponsored by: The FreeBSD Foundation
Use sys/ctf.h to provide various definitions required to parse the CTF
header. No functional change intended.
Reviewed by: Domagoj Stolfa, emaste
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34359
Despite the buffer taken from cache or free list, it still can be
locked, due to 'lockless lookup' in getblkx() potentially operating on
the freed buffers. The lock is transient, but prevents the use of
LK_NOWAIT there for the goal of neutralizing WITNESS.
Just use LK_NOWITNESS.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
This code was not touched when all other user-space sleep functions were
switched to sbintime_t and decoupled from hardclock. When it is possible,
convert supplied times into sbinuptime to supply directly to msleep_sbt()
with C_ABSOLUTE. This provides the timeout resolution of few microseconds
instead of 2 milliseconds, plus avoids few clock reads and conversions.
Reviewed by: vangyzen
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D34163
This reduces the size of diffs required to support different values of
vnsz2log. In CheriBSD, kernels for CHERI architectures have vnodes
larger than 512 bytes and require a value of 9.
Reviewed by: mjg
Obtained from: CheriBSD
Sponsored by: University of Cambridge, Google, Inc.
Differential Revision: https://reviews.freebsd.org/D34418
exit1() sets P_WEXIT before waiting for holding threads to finish,
rather than after, so this assertion is racy.
Fixes: 12fb39ec3e ("proc: Relax proc_rwmem()'s assertion on the process hold count")
Reported by: Jenkins
This reference ensures that the process and its associated vmspace will
not be destroyed while proc_rwmem() is executing. If, however, the
calling thread belongs to the target process, then it is unnecessary to
hold the process. In particular, fasttrap - a module which enables
userspace dtrace - may frequently call proc_rwmem(), and we'd prefer to
avoid the overhead of locking and bumping the hold count when possible.
Thus, make the assertion conditional on "p != curproc". Also assert
that the process is not already exiting. No functional change intended.
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
A simple cache to cache differnet locators to the same device.
Sponsored by: Netflix
Changes Suggested by: jhb
Differential Revision: https://reviews.freebsd.org/D32783
Add support for printing ACPI paths. This is a bit of a degenerate case
for this interface since it's always just the device handle if the
device has one. But it is illustrtive of how to do this for a few nodes
in the tree.
Sponsored by: Netflix
Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D32748
DEV_GET_PATH will get the path to a device based on different locators.
Sponsored by: Netflix
Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D32745
This returns the full path of a the child device requested. Since
there's different ways to recon the entire path, include a 'locator'
method. The default 'FreeBSD' method uses a filesystem-like path name
with each device to the root node separated by /. Other locators will be
UEFI, ACPI and fdt, though others are possible in the future. Make the
locator a string to allow maximum flexibility.
Sponsored by: Netflix
Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D32744
We make sure that we check for device privs (usually meaning root or
better) for everything. To allow other functions that don't require
this, default to 644 protection.
Sponsored by: Netflix
Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D32863
Use get_pcpu() instead of an open-coded pcpu_find(td->td_oncpu). This
eliminates some memory accesses and results in a shorter instruction
sequence. Note that get_pcpu() didn't exist when rmlocks were added.
Reviewed by: jah, mjg
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34377
Due to misplaced braces, an error from vfs_uninit() in the VFCF_SBDRY
case was ignored.
Reported by: Anton Rang <rang@acm.org>
Reviewed by: Anton Rang <rang@acm.org>, markj
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D34375
The changeset 65572cade3 uncovered the fact that top layer of sendto(2)
would clear a transient error code if some data was copied out of uio.
The clearing of the error makes sense for non-atomic protocols, since
they have sent some data. The atomic protocols send all or nothing.
The current implementation of unix/dgram uses sosend_generic(), which
would always copyout and only then it may fail to deliver a message.
The sosend_dgram(), currently used by UDP only, also has same behavior.
Reported by: pho
Reviewed by: pho, markj
Differential revision: https://reviews.freebsd.org/D34309
This gets rid of the error prone naming where fget_unlocked returns with
a ref held, while fget_locked requires a lock but provides nothing in
terms of making sure the file lives past unlock.
No functional changes.
watch'ing a tty triggers a refcount wraparound panic, take a reference
on fp after fget_cap_locked() to fix.
Reported by: Michael Jung <mikej_at_paymentallianceintl.com>
Reviewed by: hselasky, mjg
Fixes: f40dd6c803 ("tty: switch ttyhook_register to use fget_cap_locked")
Differential Revision: https://reviews.freebsd.org/D34335
Add trace events for execution of SYSINITs (both static and dynamically
loaded), and to the various steps in the shutdown/panic/reboot paths.
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
X-NetApp-PR: #23
Differential Revision: https://reviews.freebsd.org/D30187
This is preferred by style(9). Do this ahead of adding another include.
Reviewed by: imp, kevans, allanjude
MFC after: 3 days
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D30186
This is preferred by style(9). Do this ahead of adding another include.
Reviewed by: imp, kevans
MFC after: 3 days
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D30185
Boottrace is a facility for capturing trace events during boot and
shutdown. This includes kernel initialization, as well as rc. It has
been used by NetApp internally for several years, for catching and
diagnosing slow devices or subsystems. It is driven from userspace by
sysctl interface, and the output is a human-readable log of events
(kern.boottrace.log).
This commit adds the core boottrace functionality implementing these
interfaces. Adding the trace annotations themselves to kernel and
userland will happen in follow-up commits. A future commit will also add
a boottrace(4) man page.
For now, boottrace is unconditionally compiled into the kernel but
disabled by default. It can be enabled by setting the
kern.boottrace.enabled tunable to 1 in loader.conf(5).
There is an existing boot-time event tracing facility, which can be
compiled into the kernel with 'options TSLOG'. While there is some
functional overlap between this and boottrace, they are distinct. TSLOG
is suitable for generating detailed timing information and flamegraphs,
and has been used to great success by cperciva@ to diagnose and reduce
the overall system boot time. Boottrace aims to more quickly provide an
overview of timing and resource usage of the boot (and shutdown) process
to a sysadmin who requires this knowledge.
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
X-NetApp-PR: #23
Differential Revision: https://reviews.freebsd.org/D30184
This is behavior what some programs expect and what Linux does. For
example nginx expects EAGAIN when sending messages to /var/run/log,
which it connects to with O_NONBLOCK. Particularly with nginx the
problem is magnified by the fact that a ENOBUFS on send(2) is also
logged, so situation creates a log-bomb - a failed log message
triggers another log message.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D34187
After commit 74cf7cae4d ("softclock: Use dedicated ithreads for
running callouts."), there is a lock order reversal between the per-CPU
callout lock and the scheduler lock. softclock_thread() locks callout
lock then the scheduler lock, when preparing to switch off-CPU, and
sleepq_remove_thread() stops the timed sleep callout while potentially
holding a scheduler lock. In the latter case, it's the thread itself
that's locked, and if the thread is sleeping then its lock will be a
sleepqueue lock, but if it's still in the process of going to sleep
it'll be a scheduler lock.
We could perhaps change softclock_thread() to try to acquire locks in
the opposite order, but that'd require dropping and re-acquiring the
callout lock, which seems expensive for an operation that will happen
quite frequently. We can instead perhaps avoid stopping the
td_slpcallout callout if the thread is still going to sleep, which is
what this patch does. This will result in a spurious call to
sleepq_timeout(), but some counters suggest that this is very rare.
PR: 261198
Fixes: 74cf7cae4d ("softclock: Use dedicated ithreads for running callouts.")
Reported and tested by: thj
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34204
The TCP rate pacing code relies on being able to read this pointer
safely while holding an INP lock. The initial TLS session pointer is
set while holding the write lock already.
Reviewed by: gallatin, hselasky
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D34086
It is still wrong-ish as fget* funcs don't expect to operate on abitrary
file descriptor tables, but this at least moves it out of the way of an
upcoming change while being bug-compatible.
Since TD_IS_RUNNING() and TS_ON_RUNQ() are defined as logical
expressions involving '==', clang 14 warns about them being checked with
a bitwise operator instead of a logical one:
```
sys/kern/tty_info.c:124:9: error: use of bitwise '|' with boolean operands [-Werror,-Wbitwise-instead-of-logical]
runa = TD_IS_RUNNING(td) | TD_ON_RUNQ(td);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
||
sys/sys/proc.h:562:27: note: expanded from macro 'TD_IS_RUNNING'
^
sys/kern/tty_info.c:124:9: note: cast one or both operands to int to silence this warning
sys/sys/proc.h:562:27: note: expanded from macro 'TD_IS_RUNNING'
^
sys/kern/tty_info.c:129:9: error: use of bitwise '|' with boolean operands [-Werror,-Wbitwise-instead-of-logical]
runb = TD_IS_RUNNING(td2) | TD_ON_RUNQ(td2);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
||
sys/sys/proc.h:562:27: note: expanded from macro 'TD_IS_RUNNING'
^
sys/kern/tty_info.c:129:9: note: cast one or both operands to int to silence this warning
sys/sys/proc.h:562:27: note: expanded from macro 'TD_IS_RUNNING'
^
```
Fix this by using logical operators instead. No functional change
intended.
Reviewed by: cem, emaste, kevans, markj
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D34186
There was nothing preventing one from sending an empty fragment on an
arbitrary KTLS TX-enabled socket, but ktls_frame() asserts that this
could not happen. Though the transmit path handles this case for TLS
1.0 with AES-CBC, we should be strict and allow empty fragments only in
modes where it is explicitly allowed.
Modify sosend_generic() to reject writes to a KTLS-enabled socket if the
number of data bytes is zero, so that userspace cannot trigger the
aforementioned assertion.
Add regression tests to exercise this case.
Reported by: syzkaller
Reviewed by: gallatin, jhb
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34195
Most fget*() functions initialize the output parameter to NULL. Make
the externally visible interface behave consistently, and make
fget_unlocked_seq() private to kern_descrip.c.
This fixes at least one bug in a consumer, _filemon_wrapper_openat(),
which assumes that getvnode() sets the output file pointer to NULL upon
an error.
Reported by: syzbot+01c0459408f896a5933a@syzkaller.appspotmail.com
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34190
The ntp_init() function did set a couple of global objects to zero. These
objects are in the .bss section and already initialized to zero during kernel
or module loading.
Since 59f256ec35 dmesg(8) will always skip first line of the message
buffer, cause it might be incomplete. The problem is that in most cases
it is complete, valid and contains the "---<<BOOT>>---" marker. This
skip can be disabled with '-a', but that would also unhide all non-kernel
messages. Move this functionality from dmesg(8) to kernel, since kernel
actually knows if wrap has happened or not.
The main motivation for the change is not actually the value of the
"---<<BOOT>>---" marker. The problem breaks unit tests, that clear
message buffer, perform a test and then check the message buffer for
a result. Example of such test is sys/kern/sonewconn_overflow.
74cf7cae4d ("softclock: Use dedicated ithreads for running callouts.")
switched callouts away from the swi infrastructure. It turns out that
this was a major source of entropy in early boot, which we've now lost.
As a result, first boot on hardware without a 'fast' entropy source
would block waiting for fortuna to be seeded with little hope of
progressing without manual intervention.
Let's resolve it by explicitly harvesting entropy in callout_process()
if we've handled any callouts. cc/curthread/now seem to be reasonable
sources of entropy, so use those.
Discussed with: jhb (also proposed initial patch)
Reported by: many
Reviewed by: cem, markm (both csprng)
Differential Revision: https://reviews.freebsd.org/D34150
It prevents WITNESS from recording the lock order for the buffer lock
acquired by getblkx().
Reviewed by: mckusick
Discussed with: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D34073
At the moment this is mostly a no-op but in the future there will be
in-flight encrypted data which requires software decryption. This
same setup is also needed for NIC TLS RX.
Note that this does break TOE TLS RX for AES-CBC ciphers since there
is no software fallback for AES-CBC receive. This will be resolved
one way or another before 14.0 is released.
Reviewed by: hselasky
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D34082
Also remove once-used functions to clean up after failed insmntque1(),
which were destructor callbacks in previous life.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D34071
The lock is unneccessary since the mount point is busied, which prevents
unmount and syncer vnode deallocation. Having the vnode locked causes
innocent LoRs and complicates debugging.
Also stop starting write accounting around it. Any caller of
VOP_FSYNC() must do it already, and sync_vnode() does.
Reported and tested by: pho
Reviewed by: markj, mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D34072
The buffer must not be accessed by any other thread, it is freshly
allocated. As such, LK_NOWAIT should be nop but also it prevents
recording the order between the buffer lock and any other locks we might
own in the call to getnewbuf(). In particular, if we own FFS snap lock,
it should avoid triggering false positive warning.
Reviewed by: markj, mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D34072
Also remove a pointer to array variable, use array address directly.
Reviewed by: markj, mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D34072
When we remove an interface it is first removed from the interface list
V_ifnet (by if_unlink_ifnet()) and marked as IFF_DYING. We then wait for
any possible references to stop being used (i.e.
epoch_wait/epoch_drain_callbacks) before we tear it fully down.
However, the index in ifindex_table is not removed, so m_rcvif_restore()
can still find the (now dying) interface.
This results in panics, for example when dummynet restores the rcvif
pointer and passes a packet to ip6_input() we can panic because the
AF_INET6 domain has already been removed (so we end up dereferencing a
NULL pointer there).
Check that the interface is not dying before we restore it, which is
equivalent to checking its presence in V_ifnet, and thus ensures that
future accesses (while in NET_EPOCH) are safe.
Reviewed by: glebius
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D34076
I was somehow convinced that insmntque calls insmntque1 with a NULL
destructor. Unfortunately this worked well enough to not immediately
blow up in simple testing.
Keep not using the destructor in previously patched filesystems though
as it avoids unnecessary casts.
Noted by: kib
Reported by: pho
This adds the PT_GETREGSET and PT_SETREGSET ptrace types. These can be
used to access all the registers from a specified core dump note type.
The NT_PRSTATUS and NT_FPREGSET notes are initially supported. Other
machine-dependant types are expected to be added in the future.
The ptrace addr points to a struct iovec pointing at memory to hold the
registers along with its length. On success the length in the iovec is
updated to tell userspace the actual length the kernel wrote or, if the
base address is NULL, the length the kernel would have written.
Because the data field is an int the arguments are backwards when
compared to the Linux PTRACE_GETREGSET call.
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19831
Supplement ifindex table with generation count and use it to
serialize & restore an ifnet pointer.
Reviewed by: kp
Differential revision: https://reviews.freebsd.org/D33266
Fun note: git show e6abef0918
The manpage has contained the following verbiage on the matter for just
under 31 years:
"At least one argument must be present in the array"
Previous to this version, it had been prefaced with the weakening phrase
"By convention."
Carry through and document it the rest of the way. Allowing argc == 0
has been a source of security issues in the past, and it's hard to
imagine a valid use-case for allowing it. Toss back EINVAL if we ended
up not copying in any args for *execve().
The manpage change can be considered "Obtained from: OpenBSD"
Reviewed by: emaste, kib, markj (all previous version)
Differential Revision: https://reviews.freebsd.org/D34045
This function will be used by coming TLS hardware receive offload support.
Differential Revision: https://reviews.freebsd.org/D32356
Discussed with: jhb@
MFC after: 1 week
Sponsored by: NVIDIA Networking
This matches user_getsockname.
Reviewed by: brooks, kib
Sponsored by: The University of Cambridge, Google Inc.
Differential Revision: https://reviews.freebsd.org/D33987
that disables any access to ptrace(2) for all processes.
Reviewed by: emaste
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D33986
Privilege checks in both functions should allow the current process to
infer information about itself, as well as use the interfaces that are
proclaimed 'debugging', for instance, procctl(2).
Note that in p_cansee() case, explicit comparision of curproc and p
avoids a race where the process might change credentials and cause
thread to compare its cached stale credentials against updated process
creds, effectively disallowing the process to observe itself.
Reviewed by: emaste
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D33986
Otherwise we end up copying one uninitialized byte into the socket
buffer.
Reported by: KMSAN
Reviewed by: jhb
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33953
tc_counter_mask is an unsigned int and in the TSC timecounter is equal
to UINT_MAX, so the addition tc->tc_counter_mask + 1 can overflow to 0,
resulting in a hang during boot.
Fixes: c2705ceaeb ("x86: Speed up clock calibration")
Reviewed by: cperciva
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33956
Before this change bufdaemon and bufspacedaemon threads used
kthread_shutdown() to stop activity on system shutdown. The problem is
that kthread_shutdown() has no idea about the wait channel and lock used
by specific thread to wake them up reliably. As result, up to 9 threads
could consume up to 9 seconds to shutdown for no good reason.
This change introduces specific shutdown functions, knowing how to
properly wake up specific threads, reducing wait for those threads on
shutdown/reboot from average 4 seconds to effectively zero.
MFC after: 2 weeks
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D33936
This provides information about fixed regions of the target process'
user memory map.
Reviewed by: kib
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33708
The approach taken by the stack gap implementation was to insert a
random gap between the top of the fixed stack mapping and the true top
of the main process stack. This approach was chosen so as to avoid
randomizing the previously fixed address of certain process metadata
stored at the top of the stack, but had some shortcomings. In
particular, mlockall(2) calls would wire the gap, bloating the process'
memory usage, and RLIMIT_STACK included the size of the gap so small
(< several MB) limits could not be used.
There is little value in storing each process' ps_strings at a fixed
location, as only very old programs hard-code this address; consumers
were converted decades ago to use a sysctl-based interface for this
purpose. Thus, this change re-implements stack address randomization by
simply breaking the convention of storing ps_strings at a fixed
location, and randomizing the location of the entire stack mapping.
This implementation is simpler and avoids the problems mentioned above,
while being unlikely to break compatibility anywhere the default ASLR
settings are used.
The kern.elfN.aslr.stack_gap sysctl is renamed to kern.elfN.aslr.stack,
and is re-enabled by default.
PR: 260303
Reviewed by: kib
Discussed with: emaste, mw
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33704
ASLR stack randomization will reappear in a forthcoming commit. Rather
than inserting a random gap into the stack mapping, the entire stack
mapping itself will be randomized in the same way that other mappings
are when ASLR is enabled.
No functional change intended, as the stack gap implementation is
currently disabled by default.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33704
Rather than fetching the ps_strings address directly from a process'
sysentvec, use this macro. With stack address randomization the
ps_strings address is no longer fixed.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33704
This will not be required with a forthcoming reimplementation of ASLR
stack randomization. Moreover, this change was not sufficient to enable
the use of a stack size limit smaller than the stack gap itself.
PR: 260303
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33704
The size of the ps_strings structure varies between ABIs, so this is
useful for computing the address of the ps_strings structure relative to
the top of the stack when stack address randomization is enabled.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33704
The current ASLR stack gap feature will be removed, and with that the
need for the kern.stacktop sysctl is gone. All consumers have been
removed.
This reverts commit a97d697122.
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33704
Before commit 3cec5c77d6 buf_daemon() went to longer 1s sleep if
numdirtybuffers <= lodirtybuffers. After that commit new condition
!BIT_EMPTY(BUF_DOMAINS, &bdlodirty) got opposite -- true when one
or more more domains is above lodirtybuffers. As result, on freshly
booted system with no dirty buffers buf_daemon() wakes up 10 times
per second and probably only 1 time per second when there is actual
work to do.
MFC after: 1 week
Reviewed by: kib, markj
Tested by: pho
Differential revision: https://reviews.freebsd.org/D33890
VM's have little control over CPU speed, don't make matters worse
by constantly spaming console.
Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D33902
Missed issues in truss on at least armv7 and powerpcspe need to be
resolved before recommit.
This reverts commit 3889fb8af0.
This reverts commit 1544e0f5d1.
This more clearly differentiates system call arguments from integer
registers and return values. On current architectures it has no effect,
but on architectures where pointers are not integers (CHERI) and may
not even share registers (CHERI-MIPS) it is necessiary to differentiate
between system call arguments (syscallarg_t) and integer register values
(register_t).
Obtained from: CheriBSD
Reviewed by: imp, kib
Differential Revision: https://reviews.freebsd.org/D33780
Prior to this commit, the TSC and local APIC frequencies were calibrated
at boot time by measuring the clocks before and after a one-second sleep.
This was simple and effective, but had the disadvantage of *requiring a
one-second sleep*.
Rather than making two clock measurements (before and after sleeping) we
now perform many measurements; and rather than simply subtracting the
starting count from the ending count, we calculate a best-fit regression
between the target clock and the reference clock (for which the current
best available timecounter is used). While we do this, we keep track
of an estimate of the uncertainty in the regression slope (aka. the ratio
of clock speeds), and stop measuring when we believe the uncertainty is
less than 1 PPM.
In order to avoid the risk of aliasing resulting from the data-gathering
loop synchronizing with (a multiple of) the frequency of the reference
clock, we add some additional spinning depending upon the iteration number.
For numerical stability and simplicity of implementation, we make use of
floating-point arithmetic for the statistical calculations.
On the author's Dell laptop, this reduces the time spent in calibration
from 2000 ms to 29 ms; on an EC2 c5.xlarge instance, it is reduced from
2000 ms to 2.5 ms.
Reviewed by: bde (previous version), kib
MFC after: 1 month
Sponsored by: https://www.patreon.com/cperciva
Differential Revision: https://reviews.freebsd.org/D33802
a helper to remount filesystem from rw to ro.
Tested by: pho
Reviewed by: markj, mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D33721
- Move busdma_lock_mutex to subr_bus_dma.c.
- Move _busdma_lock_dflt to subr_bus_dma.c. This function was named a
couple of different things previously. It is not a public API but
an internal helper used in place of a NULL pointer. The prototype
is in <sys/bus_dma.h> as not all backends include
<sys/bus_dma_internal.h>.
Reviewed by: kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D33694
Move mostly duplicated code in various MD bus_dma backends to support
bounce pages into sys/kern/subr_busdma_bounce.c. This file is
currently #include'd into the backends rather than compiled standalone
since it requires access to internal members of opaque bus_dma
structures such as bus_dmamap_t and bus_dma_tag_t.
Reviewed by: kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D33684
Now that each module handles its global and VNET initialization
itself, there is no VNET related stuff left to do in domain_init().
Differential revision: https://reviews.freebsd.org/D33541
There left only three modules that used dom_init(). And netipsec
was the last one to use dom_destroy().
Differential revision: https://reviews.freebsd.org/D33540
The function now modifies pr_usrreqs only, which are always
global. Rename it to pr_usrreqs_init().
Differential revision: https://reviews.freebsd.org/D33538
The historical BSD network stack loop that rolls over domains and
over protocols has no advantages over more modern SYSINIT(9).
While doing the sweep, split global and per-VNET initializers.
Getting rid of pr_init allows to achieve several things:
o Get rid of ifdef's that protect against double foo_init() when
both INET and INET6 are compiled in.
o Isolate initializers statically to the module they init.
o Makes code easier to understand and maintain.
Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D33537
Currently intr_pic_add_handler either returns the PIC you gave it (which
is useless and risks causing confusion about whether it's creating
another PIC) or, on error, NULL. Instead, convert it to return an int
error code as one would expect.
Note that the only consumer of this API, arm64's gicv3_its, does not use
the return value, so no uses need updating to work with the revised API.
Reviewed by: markj, mmel
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D33341