This enables locking consumers to pass their own structures around as const and
be able to assert locks embedded into those structures.
Reviewed by: ed, kib, jhb
curthread-accessing part of mtx_{,un}lock(9) when using a r210623-style
curthread implementation on sparc64, crashing the kernel in its early
cycles as PCPU isn't set up, yet (and can't be set up as OFW is one of the
things we need for that, which leads to a chicken-and-egg problem). What
happens is that due to the fact that the idea of r210623 actually is to
allow the compiler to cache invocations of curthread, it factors out
obtaining curthread needed for both mtx_lock(9) and mtx_unlock(9) to
before the branch based on kobj_mutex_inited when compiling the kernel
without the debugging options. So change kobj_class_compile_static(9)
to just never acquire kobj_mtx, effectively restricting it to its
documented use, and add a kobj_init_static(9) for initializing objects
using a class compiled with the former and that also avoids using mutex(9)
(and malloc(9)). Also assert in both of these functions that they are
used in their intended way only.
While at it, inline kobj_register_method() and kobj_unregister_method()
as there wasn't much point for factoring them out in the first place
and so that a reader of the code has to figure out the locking for
fewer functions missing a KOBJ_ASSERT.
Tested on powerpc{,64} by andreast.
Reviewed by: nwhitehorn (earlier version), jhb
MFC after: 3 days
it can be used by in-kernel consumers.
- Make kern_posix_fallocate() public.
- Use kern_posix_fadvise() and kern_posix_fallocate() to implement the
freebsd32 wrappers for the two system calls.
iteration over the fdsets, kern_select() limits the length of the
fdsets copied in by the last valid file descriptor index. If any bit
is set in a mask above the limit, current implementation ignores the
filedescriptor, instead of returning EBADF.
Fix the issue by scanning the tails of fdset before entering the
select loop and returning EBADF if any bit above last valid
filedescriptor index is set. The performance impact of the additional
check is only imposed on the (somewhat) buggy applications that pass
bad file descriptors to select(2) or pselect(2).
PR: kern/155606, kern/162379
Discussed with: cognet, glebius
Tested by: andreast (powerpc, all 64/32bit ABI combinations, big-endian),
marius (sparc64, big-endian)
MFC after: 2 weeks
Just place the default kobj_method inside the kobjop_desc structure.
There's no need to give these kobj_methods their own symbol. This shaves
off 10 KB of a GENERIC kernel binary.
These structures hold no information that is modified during runtime. By
marking this constant, we see approximately 600 symbols become
read-only (amd64 GENERIC). While there, also mark the kobj_method
structures generated by makeobjops.awk static. They are only referenced
by the kobjop_desc structures within the same file.
Before:
$ ls -l kernel
-rwxr-xr-x 1 ed wheel 15937309 Nov 8 16:29 kernel*
$ size kernel
text data bss dec hex filename
12260854 1358468 2848832 16468154 fb48ba kernel
$ nm kernel | fgrep -c ' r '
8240
After:
$ ls -l kernel
-rwxr-xr-x 1 ed wheel 15922469 Nov 8 16:25 kernel*
$ size kernel
text data bss dec hex filename
12302869 1302660 2848704 16454233 fb1259 kernel
$ nm kernel | fgrep -c ' r '
8838
CTF data from a module. On subsequent attempts to retrieve CTF data for
a module, return an error if there no CTF data.
This fixes a panic if you try to enable fbt probes on a module with CTF
data twice.
Submitted by: Paul Ambrose (ambrosehua AT gmail DOT com)
MFC after: 3 days
all the architectures.
The option allows to mount non-MPSAFE filesystem. Without it, the
kernel will refuse to mount a non-MPSAFE filesytem.
This patch is part of the effort of killing non-MPSAFE filesystems
from the tree.
No MFC is expected for this patch.
Tested by: gianni
Reviewed by: kib
The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.
madvise(2) except that it operates on a file descriptor instead of a
memory region. It is currently only supported on regular files.
Just as with madvise(2), the advice given to posix_fadvise(2) can be
divided into two types. The first type provide hints about data access
patterns and are used in the file read and write routines to modify the
I/O flags passed down to VOP_READ() and VOP_WRITE(). These modes are
thus filesystem independent. Note that to ease implementation (and
since this API is only advisory anyway), only a single non-normal
range is allowed per file descriptor.
The second type of hints are used to hint to the OS that data will or
will not be used. These hints are implemented via a new VOP_ADVISE().
A default implementation is provided which does nothing for the WILLNEED
request and attempts to move any clean pages to the cache page queue for
the DONTNEED request. This latter case required two other changes.
First, a new V_CLEANONLY flag was added to vinvalbuf(). This requests
vinvalbuf() to only flush clean buffers for the vnode from the buffer
cache and to not remove any backing pages from the vnode. This is
used to ensure clean pages are not wired into the buffer cache before
attempting to move them to the cache page queue. The second change adds
a new vm_object_page_cache() method. This method is somewhat similar to
vm_object_page_remove() except that instead of freeing each page in the
specified range, it attempts to move clean pages to the cache queue if
possible.
To preserve the ABI of struct file, the f_cdevpriv pointer is now reused
in a union to point to the currently active advice region if one is
present for regular files.
Reviewed by: jilles, kib, arch@
Approved by: re (kib)
MFC after: 1 month
file descriptor drops to zero out of _fdrop() and into devfs_close_f()
as it is only relevant for devfs file descriptors.
Reviewed by: kib
MFC after: 1 week
UP/!SMP case.
The callbacks may be relying on this feature and having 2 different
ways to deal with them is not correct.
Reported by: rstone
Reviewed by: jhb
MFC after: 2 weeks
more general VM system interfaces. So, their implementation can now
reside in kern_malloc.c alongside the other functions that are declared
in malloc.h.
This restores the previous behaviour. While here, match '?' and '.'
inputs exactly and improve the error message.
Requested by: avg@
Derived from a patch by: Arnaud Lacombe <lacombar@gmail.com>
only logged instances where an operation on a file descriptor required
capabilities which the file descriptor did not have. By adding a type enum
to struct ktr_cap_fail, we can catch other types of capability failures as
well, such as disallowed system calls or attempts to wrap a file descriptor
with more capabilities than it had to begin with.
supporting procstat -f: properly provide capability rights information to
userspace. The bug resulted from a merge-o during upstreaming (or rather,
a failure to properly merge FreeBSD-side changed downstream).
Spotted by: des, kibab
MFC after: 3 days
This has been irking me for a while. This causes significant
CPU use on bottlenecked CPUs (eg my older EEEPC w/ an earlier
Celeron CPU and my MIPS24k boards) when they're passing
a lot of traffic.
Since the file/line values are only used for printing, this
should only affect display. It should have no operational
change on the code, besides reducing CPU use.
so that if no vnodes in the filesystem are actively in use the unmount
will succeed rather than failing with EBUSY.
Reported by: Garrett Cooper
Reviewed by: Attilio Rao and Kostik Belousov
Tested by: Garrett Cooper
PR: kern/161016
MFC after: 3 weeks
the unlikely event that sysctl_kmem_map_free() was performed on an
empty kmem map, it would incorrectly report the free space as zero.
Discussed with: avg
MFC after: 1 week
As noted in kern/159780, printf() is not very jail-friendly, since it can't be easily monitored by jail management tools. This patch reports an error via log() instead, which, if nobody is watching the log file, still prints to the console.
Approved by: mentor (rwatson)
Submitted by: Eugene Grosbein <eugen@eg.sd.rdtc.ru>
MFC after: 5 days
itself, which sparc64 hardware doesn't support. One way to solve this
would be to directly call sched_preempt() instead of issuing a self-IPI.
However, quoting jhb@:
"On the other hand, you can probably just skip the IPI entirely if we are
going to send it to the current CPU. Presumably, once this routine
finishes, the current CPU will exit softlock (or will do so "soon") and
will then pick the next thread to run based on the adjustments made in
this routine, so there's no need to IPI the CPU running this routine
anyway. I think this is the better solution. Right now what is probably
happening on other platforms is as soon as this routine finishes the CPU
processes its self-IPI and causes mi_switch() which will just switch back
to the softclock thread it is already running."
- With r226054 and the the above change in place, sparc64 now no longer is
incompatible with ULE and vice versa. However, powerpc/E500 still is.
Submitted by: jhb
Reviewed by: jeff
valid - we don't allow for setting it on a file, for example - but it's
not something we should assert on.
For STABLE kernel, it changes nothing, because it's not compiled with
INVARIANTS. If it was, it would fix crashes. It also fixes an assert
in libc encountered with NFSv4 without nfsuserd(8) running.
Submitted by: Yuri Pankov (earlier version)
MFC after: 1 month
POSIX/SUSvN. The sigwait(2) syscall does return EINTR, and libc.so.7
contains the wrapper sigwait(3) which hides EINTR from callers. The
EINTR return is used by libthr to handle required cancellation point
in the sigwait(3).
To help the binaries linked against pre-libc.so.7, i.e. RELENG_6 and
earlier, to have right ABI for sigwait(3), transform EINTR return from
sigwait(2) into ERESTART.
Discussed with: davidxu
MFC after: 1 week
wdog_kern_pat() acquires eventhandler mutex, thus it cannot work in
kernel context (from where kdb_trap() runs).
The right way to fix this is both offering the
cpu-stop-on-panic-and-skip-locking logic and also a context for KDB
to officially run. We can re-enable this (or a similar) improvement
when these 2 patches hit the tree.
Sponsored by: Sandvine Incorporated
Discussed with: emaste, rstone
MFC after: immediately
syscall exit path. Otherwise, if SIGTRAP is ignored, that tdsendsignal()
do not want to deliver the signal, and debugger never get a notification
of exec.
Found and tested by: Anton Yuzhaninov <citrin citrin ru>
Discussed with: jhb
MFC after: 2 weeks
we were accounting the newly created process to its parent instead
of the child itself. This caused problems later, when the child
changed its credentials - the per-uid, per-jail etc counters were
not properly updated, because the maxproc counter in the child
process was 0.
Approved by: re (kib)
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.
Reviewed by: rwatson
Approved by: re (bz)
If it overflows before the taskqueue can run, the task will be
re-added to the taskqueue and cause a loop in the task list.
Reported by: Arnaud Lacombe <lacombar@gmail.com>
Submitted by: Ryan Stone <rysto32@gmail.com>
Reviewed by: jhb
Approved by: re (kib)
MFC after: 1 day
on vfc_name to set vfc_typenum, so that vfc_typenum doesn't
change when file systems are loaded in different orders. This
keeps NFS file handles from changing, for file systems that
use vfc_typenum in their fsid. This change is controlled via
a loader.conf variable called vfs.typenumhash, since vfc_typenum
will change once when this is enabled. It defaults to 1 for
9.0, but will default to 0 when MFC'd to stable/8.
Tested by: hrs
Reviewed by: jhb, pjd (earlier version)
Approved by: re (kib)
MFC after: 1 month
of the device boundry.
While this is generally ok, the problem is that all the consumers
handle similar cases (and expect to catch) ENOSPC for this (for a
reference look at minidumpsys() and dumpsys() constructions). That
ends up in consumers not recognizing the issue and amd64 failing to
retry if the number of pages grows up during minidump.
Fix this by returning ENOSPC in dump_write() and while here add some
more diagnostic on involved values.
Sponsored by: Sandvine Incorporated
In collabouration with: emaste
Approved by: re (kib)
MFC after: 10 days
In revision 223722 we introduced support for driver ioctls on init/lock
state devices. Unfortunately the call to ttydevsw_cioctl() clobbers the
value of the error variable, meaning that in many cases ioctl() will now
return ENOTTY, even though the ioctl() was processed properly.
Reported by: Boris Samorodov <bsam ipt ru>
Patch by: jilles@
Approved by: re@ (kib@)
- Axe out the SHOW_BUSYBUFS option and uses a tunable for selectively
enable/disable it, which is defaulted for not printing anything (0
value) but can be changed for printing (1 value) and be verbose (2
value)
- Improves the informations outputed: right now, there is no track of
the actual struct buf object or vnode which are referenced by the
shutdown process, but it is printed the related struct bufobj object
which is not really helpful
- Add more verbosity about the state of the struct buf lock and the
vnode informations, with the latter to be activated separately by the
sysctl
Sponsored by: Sandvine Incorporated
Reviewed by: emaste, kib
Approved by: re (ksmith)
MFC after: 10 days
such as j:name:maxproc:sigkill=100. Proper fix - deferring psignal
to a taskqueue - is somewhat complicated and thus will happen
after 9.0.
Approved by: re (kib)
While this is generally good, it brings along a serie of problems,
like clocks going off sync and in presence of SW_WATCHDOG, watchdogs
firing without a good reason (missed hardclock wdog ticks update).
Fix the latter by kicking the watchdog just before to re-enable the interrupts.
Also, while here, not rely on users to stop the watchdog manually when
entering DDB but do that when entering KDB context.
Sponsored by: Sandvine Incorporated
Reviewed by: emaste, rstone
Approved by: re (kib)
MFC after: 1 week
and the new setmode and setowner fileops in FreeBSD 9.0:
- Add new MAC Framework entry point mac_posixshm_check_create() to allow
MAC policies to authorise shared memory use. Provide a stub policy and
test policy templates.
- Add missing Biba and MLS implementations of mac_posixshm_check_setmode()
and mac_posixshm_check_setowner().
- Add 'accmode' argument to mac_posixshm_check_open() -- unlike the
mac_posixsem_check_open() entry point it was modeled on, the access mode
is required as shared memory access can be read-only as well as writable;
this isn't true of POSIX semaphores.
- Implement full range of POSIX shared memory entry points for Biba and MLS.
Sponsored by: Google Inc.
Obtained from: TrustedBSD Project
Approved by: re (kib)
accessible:
(1) Always compile in support for breaking into the debugger if options
KDB is present in the kernel.
(2) Disable both by default, but allow them to be enabled via tunables
and sysctls debug.kdb.break_to_debugger and
debug.kdb.alt_break_to_debugger.
(3) options BREAK_TO_DEBUGGER and options ALT_BREAK_TO_DEBUGGER continue
to behave as before -- only now instead of compiling in
break-to-debugger support, they change the default values of the
above sysctls to enable those features by default. Current kernel
configurations should, therefore, continue to behave as expected.
(4) Migrate alternative break-to-debugger state machine logic out of
individual device drivers into centralised KDB code. This has a
number of upsides, but also one downside: it's now tricky to release
sio spin locks when entering the debugger, so we don't. However,
similar logic does not exist in other device drivers, including uart.
(5) dcons requires some special handling; unlike other console types, it
allows overriding KDB's own debugger selection, so we need a new
interface to KDB to allow that to work.
GENERIC kernels in -CURRENT will now support break-to-debugger as long as
appropriate boot/run-time options are set, which should improve the
debuggability of BETA kernels significantly.
MFC after: 3 weeks
Reviewed by: kib, nwhitehorn
Approved by: re (bz)
but not removed; decrement it instead when the child jail actually
goes away. This avoids letting the counter go below zero in the case
where dying (pr_uref==0) jails are "resurrected", and an associated
KASSERT panic.
Submitted by: Steven Hartland
Approved by: re (bz)
MFC after: 1 week
If a selinfo object is recorded (via selrecord()) and then it is
quickly destroyed, with the waiters missing the opportunity to awake,
at the next iteration they will find the selinfo object destroyed,
causing a PF#.
That happens because the selinfo interface has no way to drain the
waiters before to destroy the registered selinfo object. Also this
race is quite rare to get in practice, because it would require a
selrecord(), a poll request by another thread and a quick destruction
of the selrecord()'ed selinfo object.
Fix this by adding the seldrain() routine which should be called
before to destroy the selinfo objects (in order to avoid such case),
and fix the present cases where it might have already been called.
Sometimes, the context is safe enough to prevent this type of race,
like it happens in device drivers which installs selinfo objects on
poll callbacks. There, the destruction of the selinfo object happens
at driver detach time, when all the filedescriptors should be already
closed, thus there cannot be a race.
For this case, mfi(4) device driver can be set as an example, as it
implements a full correct logic for preventing this from happening.
Sponsored by: Sandvine Incorporated
Reported by: rstone
Tested by: pluknet
Reviewed by: jhb, kib
Approved by: re (bz)
MFC after: 3 weeks
and the maximum TCP send and receive buffer limits from 256kB
to 2MB.
For sb_max_adj we need to add the cast as already used in the sysctl
handler to not overflow the type doing the maths.
Note that this is just the defaults. They will allow more memory
to be consumed per socket/connection if needed but not change the
default "idle" memory consumption. All values are still tunable
by sysctls.
Suggested by: gnn
Discussed on: arch (Mar and Aug 2011)
MFC after: 3 weeks
Approved by: re (kib)
Remove mapped pages for all dataset vnodes in zfs_rezget() using
new vn_pages_remove() to fix mmapped files changed by
zfs rollback or zfs receive -F.
PR: kern/160035, kern/156933
Reviewed by: kib, pjd
Approved by: re (kib)
MFC after: 1 week
callout cpu lock (and after having dropped it).
If the newly scheduled thread wants to acquire the old queue it will
just spin forever.
Fix this by disabling preemption and interrupts entirely (because fast
interrupt handlers may incur in the same problem too) while switching
locks.
Reported by: hrs, Mike Tancsa <mike AT sentex DOT net>,
Chip Camden <sterling AT camdensoftware DOT com>
Tested by: hrs, Mike Tancsa <mike AT sentex DOT net>,
Chip Camden <sterling AT camdensoftware DOT com>,
Nicholas Esborn <nick AT desert DOT net>
Approved by: re (kib)
MFC after: 10 days
effectively negative. Often seen as upstream fastcgi connection timeouts
in nginx when using sendfile over unix domain sockets for communication.
Sendfile(2) may send more bytes then currently allowed by the
hiwatermark of the socket, e.g. because the so_snd sockbuf lock is
dropped after sbspace() call in the kern_sendfile() loop. In this case,
recalculated hiwatermark will overflow. Since lowatermark is renewed
as half of the hiwatermark by sendfile code, and both are unsigned,
the send buffer never reaches the free space requested by lowatermark,
causing indefinite wait in sendfile.
Reviewed by: rwatson
Approved by: re (bz)
MFC after: 2 weeks
buffer is greater than 1. This triggered panics in at least one spot in
the kernel (the MAC Framework) which passes non-negative, rather than >1
buffer sizes based on the size of a user buffer passed into a system
call. While 0-size buffers aren't particularly useful, they also aren't
strictly incorrect, so loosen the assertion.
Discussed with: phk (fears I might be EDOOFUS but willing to go along)
Spotted by: pho + stress2
Approved by: re (kib)
A "process descriptor" file descriptor is used to manage processes
without using the PID namespace. This is required for Capsicum's
Capability Mode, where the PID namespace is unavailable.
New system calls pdfork(2) and pdkill(2) offer the functional equivalents
of fork(2) and kill(2). pdgetpid(2) allows querying the PID of the remote
process for debugging purposes. The currently-unimplemented pdwait(2) will,
in the future, allow querying rusage/exit status. In the interim, poll(2)
may be used to check (and wait for) process termination.
When a process is referenced by a process descriptor, it does not issue
SIGCHLD to the parent, making it suitable for use in libraries---a common
scenario when using library compartmentalisation from within large
applications (such as web browsers). Some observers may note a similarity
to Mach task ports; process descriptors provide a subset of this behaviour,
but in a UNIX style.
This feature is enabled by "options PROCDESC", but as with several other
Capsicum kernel features, is not enabled by default in GENERIC 9.0.
Reviewed by: jhb, kib
Approved by: re (kib), mentor (rwatson)
Sponsored by: Google Inc
query the needed size for a sysctl result by passing in a NULL old
pointer and a valid oldsize. The kern.proc.args sysctl handler broke
this assumption by not calling SYSCTL_OUT() if the old pointer was
NULL.
Approved by: re (kib)
MFC after: 3 days
to implement fchown(2) and fchmod(2) support for several file types
that previously lacked it. Add MAC entries for chown/chmod done on
posix shared memory and (old) in-kernel posix semaphores.
Based on the submission by: glebius
Reviewed by: rwatson
Approved by: re (bz)
When calling poll(2) on a capability, unwrap first and then poll the
underlying object.
Approved by: re (kib), mentor (rwatson)
Sponsored by: Google Inc
a bug was introduced in kern_openat() such that the error from the vnode
open operation was overwritten before it was passed as an argument to
dupfdopen(). This broke operations on /dev/{stdin,stdout,stderr}. Fix
by preserving the original error number across finstall() so that it is
still available.
Approved by: re (kib)
Reported by: cognet
A no-op for non-Capsicum kernels; for Capsicum kernels, completes the
enabling of fooat(2) system calls using capabilities. With this change,
and subject to bug fixes, Capsicum capability support is now complete for
9.0.
Approved by: re (kib)
Submitted by: jonathan
Sponsored by: Google Inc
namei() and lookup() can now perform "strictly relative" lookups.
Such lookups, performed when in capability mode or when looking up
relative to a directory capability, enforce two policies:
- absolute paths are disallowed (including symlinks to absolute paths)
- paths containing '..' components are disallowed
These constraints make it safe to enable openat() and friends.
These system calls are instrumental in supporting Capsicum
components such as the capability-mode-aware runtime linker.
Finally, adjust comments in capabilities.conf to reflect the actual state
of the world (e.g. shm_open(2) already has the appropriate constraints,
getdents(2) already requires CAP_SEEK).
Approved by: re (bz), mentor (rwatson)
Sponsored by: Google Inc.