running thread.
It is currently implemented only on amd64 and i386; on these
architectures, it is implemented by raising an NMI on the CPU on which
the target thread is currently running. Unlike stack_save_td(), it may
fail, for example if the thread is running in user mode.
This change also modifies the kern.proc.kstack sysctl to use this function,
so that stacks of running threads are shown in the output of "procstat -kk".
This is handy for debugging threads that are stuck in a busy loop.
Reviewed by: bdrewery, jhb, kib
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D3256
Auto-tuning threshold discussions aside, it turns out that if you want
to lower this on say, rather memory-packed machines, you either set maxusers
or kern.maxfiles, or you set it in sysctl. The former is a non-exact
way to tune this; the latter doesn't actually affect anything in the
startup scripts.
This first occured because I wondered why the hell screen would take upwards
of 10 seconds to spawn a new screen. I then found python doing the same
thing during fork/exec of child processes - it calls close() on each FD
up to the current openfiles limit. On a 1TB machine this is like, 26 million
FDs per process. Ugh.
So:
* This allows it to be set early in /boot/loader.conf;
* It can be used to work around the ridiculous situation of
screen, python, etc doing a close() on potentially millions of FDs
even though you only have four open.
Tested:
* 4GB, 32GB, 64GB, 128GB, 384GB, 1TB systems with autotune, ensuring
screen and python forking doesn't result in some pretty hilariously
bad behaviour.
TODO:
* Note that the default login.conf sets openfiles-cur to unlimited,
effectively obeying kern.maxfilesperproc. Perhaps we should fix
this.
* .. and even if we do, we need to also ensure that daemons get
a soft limit of something reasonable and capped - they can request
more FDs themselves.
MFC after: 1 week
Sponsored by: Norse Corp, Inc.
named node, open(2) cannot create directories. But do allow the flag
combination to succeed if the directory already exists.
Declare the open("name", O_DIRECTORY | O_CREAT | O_EXCL) always
invalid for the same reason, since open(2) cannot create directory.
Note that there is an argument that O_DIRECTORY | O_CREAT should be
invalid always, regardless of the target directory existence or
O_EXCL. The current fix is conservative and allows the call to
succeed in the situation where it succeeded before the patch.
Reported by: Tom Ridge <freebsd@tom-ridge.com>
Reviewed by: rwatson
PR: 202892
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
The filedesc lock is only needed if ioctls caps are present, which is a
rare situation. This is a step towards reducing the scope of the filedesc
lock.
the size of the name cache hash table (mapping file names to vnodes)
and the vnode hash table (mapping mount point and inode number to vnode).
An appropriate locking strategy is the key to changing hash table sizes
while they are in active use.
Reviewed by: kib
Tested by: Peter Holm
Differential Revision: https://reviews.freebsd.org/D2265
MFC after: 2 weeks
Coredump notes depend on being able to invoke dump routines twice; once
in a dry-run mode to get the size of the note, and another to actually
emit the note to the corefile.
When a note helper emits a different length section the second time
around than the length it requested the first time, the kernel produces
a corrupt coredump.
NT_PROCSTAT_FILES output length, when packing kinfo structs, is tied to
the length of filenames corresponding to vnodes in the process' fd table
via vn_fullpath. As vnodes may move around during dump, this is racy.
So:
- Detect badly behaved notes in putnote() and pad underfilled notes.
- Add a fail point, debug.fail_point.fill_kinfo_vnode__random_path to
exercise the NT_PROCSTAT_FILES corruption. It simply picks random
lengths to expand or truncate paths to in fo_fill_kinfo_vnode().
- Add a sysctl, kern.coredump_pack_fileinfo, to allow users to
disable kinfo packing for PROCSTAT_FILES notes. This should avoid
both FILES note corruption and truncation, even if filenames change,
at the cost of about 1 kiB in padding bloat per open fd. Document
the new sysctl in core.5.
- Fix note_procstat_files to self-limit in the 2nd pass. Since
sometimes this will result in a short write, pad up to our advertised
size. This addresses note corruption, at the risk of sometimes
truncating the last several fd info entries.
- Fix NT_PROCSTAT_FILES consumers libutil and libprocstat to grok the
zero padding.
With suggestions from: bjk, jhb, kib, wblock
Approved by: markj (mentor)
Relnotes: yes
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D3548
Originally it was added in order to prevent trashing of objects with
INVARIANTS enabled. The same effect is now provided with mere UMA_ZONE_NOFREE.
This reverts r286921.
Discussed with: kib
and exit events. procfs stop events for system call tracing report these
values (argument count for system call entry and code for system call exit),
but ptrace() does not provide this information. (Note that while the system
call code can be determined in an ABI-specific manner during system call
entry, it is not generally available during system call exit.)
The values are exported via new fields at the end of struct ptrace_lwpinfo
available via PT_LWPINFO.
Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D3536
particular, this invalidates the knote kn_link linkage, making the
SLIST_FOREACH() loop accessing undefined values (e.g. trashed by
QUEUE_MACRO_DEBUG). If the knote is freed by other thread when kq
lock is released or when influx is cleared, e.g. by knote_scan() for
kqueue owning the knote, the iteration step would access freed memory.
Use SLIST_FOREACH_SAFE() to fix iteration.
Diagnosed by: avg
Tested by: avg, lstewart, pawel
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Explain why it is fine to not check for M_NOWAIT failures in
kqueue_register(). Remove unneeded check for NULL result from
waitable allocation in kqueue_scan(). uma_free(9) handles NULL
argument correctly, remove checks for NULL. Remove useless cast and
adjust style in knote_alloc().
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
The typo was introduced in r278469 / 344ecf88af.
As a result of the bug there was a timing window where callout_reset()
would fail to cancel a concurrent execution of a callout that is about
to start and would schedule the callout again.
The callout would fire more times than it is scheduled.
That would happen even if the callout is initialized with a lock.
For example, the bug triggered the "Stray timeout" assertion in
taskqueue_timeout_func().
MFC after: 5 days
scheduler types. It was intended to be used there, compare with the
min value, and with the test for correctness in ksched_setscheduler().
Note that P1B_PRIO_MAX and RTP_PRIO_MAX do have the same numerical
values, the change is cosmetical.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
it helps only the TCP timers callout(9) usage. As the benefit for
others callout(9) usages did not reach a consensus the historical
usage should prevail.
Differential Revision: https://reviews.freebsd.org/D3078
To make it easier to understand how Capsicum interacts with linkat() and
renameat(), rename the rights to CAP_{LINK,RENAME}AT_{SOURCE,TARGET}.
This also addresses a shortcoming in Capsicum, where it isn't possible
to disable linking to files stored in a directory. Creating hardlinks
essentially makes it possible to access files with additional rights.
Reviewed by: rwatson, wblock
Differential Revision: https://reviews.freebsd.org/D3411
being serviced return 0 (fail) but it is applicable only
mpsafe callouts. Thanks to hselasky for finding this.
Differential Revision: https://reviews.freebsd.org/D3078 (Updated)
Submitted by: hselasky
Reviewed by: jch
was invalid. Don't trigger a mount failure (which by default means
a panic), but instead just move on to the next directive in the
configuration. This typically has us ask for the root mount.
PR: 163245
reason this didn't result in an unclean shutdown is that devfs ignores
MNT_FORCE flag.
Reviewed by: kib@
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D3467
Add a check to preload_search_info to make sure mod is set. Most of the
callers of preload_search_info don't check that the mod parameter is
set, which can cause page faults. While at it, remove some now unnecessary
checks before calling preload_search_info.
Sponsored by: Citrix Systems R&D
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D3440
having some children, the children' reaper is not reset to the parent.
This allows for the situation where reaper has children but not
descendands and the too strict asserts in the reap_status() fire.
Remove the wrong asserts, add some clarification for the situation to
the procctl(2) REAP_STATUS.
Reported and tested by: feld
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
zero. The file_zone if no-free, but r284861 added trashing of the
freed memory. Most visible manifestation of the issue were 'memory
modified after free' panics for the file zone, triggered from
falloc_noinstall().
Add UMA_ZONE_ZINIT flag to turn off trashing. Mjg noted that it makes
sense to not trash freed memory for any non-free zone, which will be
done later.
Reported and tested by: pho
Discussed with: mjg
Sponsored by: The FreeBSD Foundation
being serviced and indeed unstoppable.
A scenario to reproduce this case is:
- the callout is being serviced and at same time,
- callout_reset() is called on this callout that sets
the CALLOUT_PENDING flag and at same time,
- callout_stop() is called on this callout and returns 1 (success)
even if the callout is indeed currently running and unstoppable.
This issue was caught up while making r284245 (D2763) workaround, and
was discussed at BSDCan 2015. Once applied the r284245 workaround
is not needed anymore and will be reverted.
Differential Revision: https://reviews.freebsd.org/D3078
Reviewed by: jhb
Sponsored by: Verisign, Inc.
with higher quality registers (presumably in a module that has just been
loaded), do not undo the user's choice by switching to the new timecounter.
Document that behavior, and also the fact that there is no way to unregister
a timecounter (and thus no way to unload a module containing one).
- Document the kern_kevent_anonymous() function.
- Add assertions to ensure that we don't silently leave the kqueue
linked from a file descriptor table.
Reviewed by: jmg
Differential Revision: https://reviews.freebsd.org/D3364
We already properly return ENOTDIR when calling *at() on a non-directory
vnode, but it turns out that if you call it on a socket, we see EINVAL.
Patch up namei to properly translate this to ENOTDIR.
As CloudABI processes cannot adjust their signal handlers, we need to
make sure that we start up CloudABI processes with consistent signal
masks. Though the POSIx standard signal behavior is all right, we do
need to make sure that we ignore SIGPIPE, as it would otherwise be
hard to interact with pipes and sockets.
Extend execsigs() to iterate over ps_sigignore and call sigdflt() for
each of the ignored signals.
Reviewed by: kib
Obtained from: https://github.com/NuxiNL/freebsd
Differential Revision: https://reviews.freebsd.org/D3365
CloudABI's polling system calls merge the concept of one-shot polling
(poll, select) and stateful polling (kqueue). They share the same data
structures.
Extend FreeBSD's kqueue to provide support for waiting for events on an
anonymous kqueue. Unlike stateful polling, there is no need to support
timeouts, as an additional timer event could be used instead.
Furthermore, it makes no sense to use a different number of input and
output kevents. Merge this into a single argument.
Obtained from: https://github.com/NuxiNL/freebsd
Differential Revision: https://reviews.freebsd.org/D3307
The existing sys_cap_rights_limit() expects that a cap_rights_t object
lives in userspace. It is therefore hard to call into it from
kernelspace.
Move the interesting bits of sys_cap_rights_limit() into
kern_cap_rights_limit(), so that we can call into it from the CloudABI
compatibility layer.
Obtained from: https://github.com/NuxiNL/freebsd
Differential Revision: https://reviews.freebsd.org/D3314
initial thread stack is not adjusted by the tunable, the stack is
allocated too early to get access to the kernel environment. See
TD0_KSTACK_PAGES for the thread0 stack sizing on i386.
The tunable was tested on x86 only. From the visual inspection, it
seems that it might work on arm and powerpc. The arm
USPACE_SVC_STACK_TOP and powerpc USPACE macros seems to be already
incorrect for the threads with non-default kstack size. I only
changed the macros to use variable instead of constant, since I cannot
test.
On arm64, mips and sparc64, some static data structures are sized by
KSTACK_PAGES, so the tunable is disabled.
Sponsored by: The FreeBSD Foundation
MFC after: 2 week
This makes the PPS API behave correctly, but isn't ideal -- we still end
up capturing PPS data for non-enabled edges, we just don't process the
data into an event that becomes visible outside of kern_tc. That's because
the event type isn't passed to pps_capture(), so it can't do the filtering.
Any solution for capture filtering is going to require touching every driver.
On CloudABI we want to create file descriptors with just the minimal set
of Capsicum rights in place. The reason for this is that it makes it
easier to obtain uniform behaviour across different operating systems.
By explicitly whitelisting the operations, we can return consistent
error codes, but also prevent applications from depending OS-specific
behaviour.
Extend kern_kqueue() to take an additional struct filecaps that is
passed on to falloc_caps(). Update the existing consumers to pass in
NULL.
Differential Revision: https://reviews.freebsd.org/D3259
It looks like EVFILT_READ and EVFILT_WRITE trigger under the same
conditions as poll()'s POLLRDNORM and POLLWRNORM as described by POSIX.
The only difference is that POLLRDNORM has to be triggered on regular
files unconditionally, whereas EVFILT_READ only triggers when not EOF.
Introduce a new flag, NOTE_FILE_POLL, that can be used to make
EVFILT_READ and EVFILT_WRITE behave identically to poll(). This flag
will be used by cloudlibc's poll() function.
Reviewed by: jmg
Differential Revision: https://reviews.freebsd.org/D3303
of the timehands, from the kern_tc.c implementation to vdso. Add
comments giving hints where to look for the algorithm explanation.
To compensate the removal of rmb() in userspace binuptime(), add
explicit lfence instruction before rdtsc. On i386, add usual
complications to detect SSE2 presence; assume that old CPUs which do
not implement SSE2 also execute rdtsc almost in order.
Reviewed by: alc, bde (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks