provide struct namecache_ts which is the old struct namecache. Only
allocate struct namecache_ts if non-null struct timespec *tsp was
passed to cache_enter_time, otherwise use struct namecache.
Change struct namecache allocation and deallocation macros into static
functions, since logic becomes somewhat twisty. Provide accessor for
the nc_name member of struct namecache to hide difference between
struct namecache and namecache_ts.
The aim of the change is to not waste 20 bytes per small namecache
entry.
Reviewed by: jhb
MFC after: 2 weeks
X-MFC-note: after r230394
entries on one client when a directory was renamed on another client. The
root cause for the stale entry being trusted is that each per-vnode nfsnode
structure has a single 'n_ctime' timestamp used to validate positive name
cache entries. However, if there are multiple entries for a single vnode,
they all share a single timestamp. To fix this, extend the name cache
to allow filesystems to optionally store a timestamp value in each name
cache entry. The NFS clients now fetch the timestamp associated with
each name cache entry and use that to validate cache hits instead of the
timestamps previously stored in the nfsnode. Another part of the fix is
that the NFS clients now use timestamps from the post-op attributes of
RPCs when adding name cache entries rather than pulling the timestamps out
of the file's attribute cache. The latter is subject to races with other
lookups updating the attribute cache concurrently. Some more details:
- Add a variant of nfsm_postop_attr() to the old NFS client that can return
a vattr structure with a copy of the post-op attributes.
- Handle lookups of "." as a special case in the NFS clients since the name
cache does not store name cache entries for ".", so we cannot get a
useful timestamp. It didn't really make much sense to recheck the
attributes on the the directory to validate the namecache hit for "."
anyway.
- ABI compat shims for the name cache routines are present in this commit
so that it is safe to MFC.
MFC after: 2 weeks
to an OBJT_VNODE-specific field of the vm object. The same
information can be just as easily obtained from the struct vattr that
is in struct image_params if the latter is passed to
elf*_load_section(). Moreover, by replacing the vmspace and vm
object parameters to elf*_load_section() with a struct image_params
parameter, we actually reduce the size of the object code.
In collaboration with: kib
to read strings completely to know the actual size.
As a side effect it fixes the issue with kern.proc.args and kern.proc.env
sysctls, which didn't return the size of available data when calling
sysctl(3) with the NULL argument for oldp.
Note, in get_ps_strings(), which does actual work for proc_getargv() and
proc_getenvv(), we still have a safety limit on the size of data read in
case of a corrupted procces stack.
Suggested by: kib
MFC after: 3 days
This function updates path string to vnode's full global path and checks
the size of the new path string against the pathlen argument.
In vfs_domount(), sys_unmount() and kern_jail_set() this new function
is used to update the supplied path argument to the respective global path.
Unbreaks jailed zfs(8) with enforce_statfs set to 1.
Reviewed by: kib
MFC after: 1 month
On amd64, link_elf_obj.c must specify KERNBASE rather than
VM_MIN_KERNEL_ADDRESS to vm_map_find() because kernel loadable
modules must be mapped for execution in the same upper region
of the kernel map as the kernel code and data segments.
For MIPS32 KERNBASE lies below KVA area (it's less than
VM_MIN_KERNEL_ADDRESS) so basically vm_map_find got whole
KVA to look through. On MIPS64 it's not the case because
KERNBASE is set to the very end of XKSEG, well out of KVA
bounds, so vm_map_find always fails. We should use
VM_MIN_KERNEL_ADDRESS as a base for vm_map_find.
Details obtained from: alc@
The vfs_busy() is after covered vnode lock in the global lock order, but
since quotaon() does recursive VFS call to open quota file, we usually
end up locking covered vnode after mp is busied in sys_quotactl().
Change the interface of VFS_QUOTACTL(), requiring that mp was unbusied
by fs code, and do not try to pick up vfs_busy() reference in ufs quotaon,
esp. if vfs_busy cannot succeed due to unmount being performed.
Reported and tested by: pho
MFC after: 1 week
operation on POSIX shared memory objects and tmpfs. Previously, neither of
these modules correctly handled the case in which the new size of the object
or file was not a multiple of the page size. Specifically, they did not
handle partial page truncation of data stored on swap. As a result, stale
data might later be returned to an application.
Interestingly, a data inconsistency was less likely to occur under tmpfs
than POSIX shared memory objects. The reason being that a different mistake
by the tmpfs truncation operation helped avoid a data inconsistency. If the
data was still resident in memory in a PG_CACHED page, then the tmpfs
truncation operation would reactivate that page, zero the truncated portion,
and leave the page pinned in memory. More precisely, the benevolent error
was that the truncation operation didn't add the reactivated page to any of
the paging queues, effectively pinning the page. This page would remain
pinned until the file was destroyed or the page was read or written. With
this change, the page is now added to the inactive queue.
Discussed with: jhb
Reviewed by: kib (an earlier version)
MFC after: 3 weeks
The wrong structure happened to work since the only argument used was
the vnode which is in the same place in both VOP_SETATTR() and the two
extattr VOPs.
MFC after: 3 days
- Only initialize the per-cpu switchticks and switchtime in sched_throw()
for the very first context switch on APs during boot. This avoids a
small gap between the middle of thread_exit() and sched_throw() where
time is not accounted to any thread.
- In thread_exit(), update the timestamp bookkeeping to track the changes
to mi_switch() introduced by td_rux so that the code once again matches
the comment claiming it is mimicing mi_switch(). Specifically, only
update the per-thread stats directly and depend on ruxagg() to update
p_rux rather than adjusting p_rux directly. While here, move the
timestamp bookkeeping as late in the function as possible.
Reviewed by: bde, kib
MFC after: 1 week
It seems strchr() and strrchr() are used more often than index() and
rindex(). Therefore, simply migrate all kernel code to use it.
For the XFS code, remove an empty line to make the code identical to
the code in the Linux kernel.
at SCHED_PRI_RANGE to prevent overflows in the priority value. This can
happen due to irregularities with clock interrupts under certain
virtualization environments.
Tested by: Larry Rosenman ler lerctr org
MFC after: 2 weeks
sysclock_getsnapshot() function allows the caller to obtain a snapshot of all
the system clock and timecounter state required to create time stamps at a later
point. The sysclock_snap2bintime() function converts a previously obtained
snapshot into a bintime time stamp according to the specified flags e.g. which
system clock, uptime vs absolute time, etc.
These KPIs enable useful functionality, including direct comparison of the
feedback and feed-forward system clocks and generation of multiple time stamps
with different formats from a single timecounter read.
Committed on behalf of Julien Ridoux and Darryl Veitch from the University of
Melbourne, Australia, as part of the FreeBSD Foundation funded "Feed-Forward
Clock Synchronization Algorithms" project.
For more information, see http://www.synclab.org/radclock/
In collaboration with: Julien Ridoux (jridoux at unimelb edu au)
these to trigger a NOTE_ATTRIB EVFILT_VNODE kevent when the extended
attributes of a vnode are changed.
Note that OS X already implements this behavior.
Reviewed by: rwatson
MFC after: 2 weeks
With the previous code, if the range of priorities for timeshare batch
threads was greater than RQ_NQS, then the threads with low priorities in
the part of the range above RQ_NQS would be scheduled to the run-queues
as if they had high priorities at the beginning of the range.
In other words, threads with a nice level of +N could be scheduled as
if they had a nice level of -M.
Reported by: George Mitchell <george@m5p.com>
Reviewed by: jhb
Tested by: George Mitchell <george@m5p.com> (earlier version)
MFC after: 1 week
locate a process calling pfind() and do some additional checks like
p_candebug(). To reduce this code duplication a new function pget() is
introduced and used.
As the function may be useful not only in kern_proc.c it is in the
kernel name space.
Suggested by: kib
Reviewed by: kib
MFC after: 2 weeks
This is intended as a replacement for libkern's gets and mostly borrows
its implementation. It uses cngrab/cnungrab to delimit kernel's access
to console input.
Note: libkern's gets obviously doesn't share any bits of implementation
iwth libc's gets. They also have different APIs and the former doesn't
have the overflow problems of the latter.
Inspired by: bde
MFC after: 2 months
At the moment grab and ungrab methods of all console drivers are no-ops.
Current intended meaning of the calls is that the kernel takes control of
console input. In the future the semantics may be extended to mean that
the calling thread takes full ownership of the console (e.g. console
output from other threads could be suspended).
Inspired by: bde
MFC after: 2 months
case where a kevent would not fire on a regular file if an application read
to EOF and then seeked backwards into the file.
Reviewed by: kib
MFC after: 2 weeks
The reverse direction of a pipe is lazily allocated on the first write in
that direction (because pipes are usually used in one direction only). A
special case is needed to ensure the pipe appears writable before the first
write because there are 0 bytes of pending data in 0 bytes of buffer space
at that point, leaving 0 bytes of data that can be written with the normal
code.
Note that the first write returns [ENOMEM] if kern.ipc.maxpipekva is
exceeded and does not block or return [EAGAIN], so selecting true for write
is correct even in that case.
PR: kern/93685
Submitted by: gianni
MFC after: 2 weeks
objects created by shm_open(2) into the kernel's address space. This
provides a convenient way for creating shared memory buffers between
userland and the kernel without requiring custom character devices.
Historical behavior of letting other CPUs merily go on is a default for
time being. The new behavior can be switched on via
kern.stop_scheduler_on_panic tunable and sysctl.
Stopping of the CPUs has (at least) the following benefits:
- more of the system state at panic time is preserved intact
- threads and interrupts do not interfere with dumping of the system
state
Only one thread runs uninterrupted after panic if stop_scheduler_on_panic
is set. That thread might call code that is also used in normal context
and that code might use locks to prevent concurrent execution of certain
parts. Those locks might be held by the stopped threads and would never
be released. To work around this issue, it was decided that instead of
explicit checks for panic context, we would rather put those checks
inside the locking primitives.
This change has substantial portions written and re-written by attilio
and kib at various times. Other changes are heavily based on the ideas
and patches submitted by jhb and mdf. bde has provided many insights
into the details and history of the current code.
The new behavior may cause problems for systems that use a USB keyboard
for interfacing with system console. This is because of some unusual
locking patterns in the ukbd code which have to be used because on one
hand ukbd is below syscons, but on the other hand it has to interface
with other usb code that uses regular mutexes/Giant for its concurrency
protection. Dumping to USB-connected disks may also be affected.
PR: amd64/139614 (at least)
In cooperation with: attilio, jhb, kib, mdf
Discussed with: arch@, bde
Tested by: Eugene Grosbein <eugen@grosbein.net>,
gnn,
Steven Hartland <killing@multiplay.co.uk>,
glebius,
Andrew Boyer <aboyer@averesystems.com>
(various versions of the patch)
MFC after: 3 months (or never)
thread_free(newtd). This to avoid a possible page fault in
cpu_thread_clean() as seen on amd64 with syscall fuzzing.
Reviewed by: kib
MFC after: 1 week
of vm_kmem_size that may occur if the system administrator has specified a
vm.vm_kmem_size tunable value that exceeds the hard cap.
PR: 162741
Submitted by: Adam McDougall
Reviewed by: bde@
MFC after: 3 weeks
Optimize for the case, by lazily allocating the pipe inode number at the
fstat(2) time. If alloc_unr(9) returns failure, do not fail fstat(2), since
uses of inode numbers are even rare then fstat(2), but provide zero inode
forever. Note that alloc_unr() failure is unlikely due to total number
of pipes in the system limited by the number of file descriptors.
Based on the submission by: gianni
MFC after: 2 weeks
Citing jilles:
If we are ever going to do ASLR, the AUXV information tells an attacker
where the stack, executable and RTLD are located, which defeats much of
the point of randomizing the addresses in the first place.
Given that the AUXV information seems to be used by debuggers only anyway,
I think it would be good to move it to p_candebug() now.
The full virtual memory maps (KERN_PROC_VMMAP, procstat -v) are already
under p_candebug().
Suggested by: jilles
Discussed with: rwatson
MFC after: 1 week
having dereferenced it. We either should generally check the device_t's
supplied to bus functions before using them (which we seem to virtually
never do) or just assume that they are not NULL.
While at it make this code fit 78 columns.
Found with: Coverity Prevent(tm)
CID: 4230
when actually setting a driver as especially ENOMEM is fatal in these
cases.
- Annotate other calls to device_set_devclass(9) and device_set_driver(9)
without the return value being checked and that are okay to fail.
Reviewed by: yongari (slightly earlier version)
in addition to the user priority for threads whose current real priority
is equal to the previous user priority or if the new priority is a
real-time priority. This allows priority changes of other threads to
have an immediate effect.
MFC after: 2 weeks
-1. But, because ino_t is unsigned, this case was not covered by the
test ino > 0 in pipeclose(), leading to the free_unr(-1). Fix it by
explicitely comparing with 0 and -1. [1]
Do no access freed memory, the inode number was cached to prevent access
to cpipe after it possibly was freed, but I failed to commit the right
patch.
Noted by: gianni [1]
Pointy hat to: kib
MFC after: 3 days
introduced when feed-forward clock support is enabled in the kernel:
- Rename the "choice" variable to "available".
- Streamline the implementation of the "active" variable's sysctl handler
function.
- Create a kern.sysclock sysctl node for general sysclock related configuration
options. Place the "available" and "active" variables under this node.
- Create a kern.sysclock.ffclock sysctl node for feed-forward clock specific
configuration options. Place the "version" and "ffcounter_bypass" variables
under this node.
- Tweak some of the description strings.
Discussed with: Julien Ridoux (jridoux at unimelb edu au)
Committed on behalf of Julien Ridoux and Darryl Veitch from the University of
Melbourne, Australia, as part of the FreeBSD Foundation funded "Feed-Forward
Clock Synchronization Algorithms" project.
For more information, see http://www.synclab.org/radclock/
Discussed with: Julien Ridoux (jridoux at unimelb edu au)
Submitted by: Julien Ridoux (jridoux at unimelb edu au)
reimplementing the [get]{bin,nano,micro}[up]time() wrapper functions in terms of
the new "fromclock" API instead.
Committed on behalf of Julien Ridoux and Darryl Veitch from the University of
Melbourne, Australia, as part of the FreeBSD Foundation funded "Feed-Forward
Clock Synchronization Algorithms" project.
For more information, see http://www.synclab.org/radclock/
Discussed with: Julien Ridoux (jridoux at unimelb edu au)
Submitted by: Julien Ridoux (jridoux at unimelb edu au)
that new APIs with some performance sensitivity can be built on top of them.
These functions should not be called directly except in special circumstances.
Committed on behalf of Julien Ridoux and Darryl Veitch from the University of
Melbourne, Australia, as part of the FreeBSD Foundation funded "Feed-Forward
Clock Synchronization Algorithms" project.
For more information, see http://www.synclab.org/radclock/
Discussed with: Julien Ridoux (jridoux at unimelb edu au)
Submitted by: Julien Ridoux (jridoux at unimelb edu au)
fbclock_{nanouptime|microuptime|bintime|nanotime|microtime}() functions to avoid
indirecting through a sysclock_ops wrapper function.
Committed on behalf of Julien Ridoux and Darryl Veitch from the University of
Melbourne, Australia, as part of the FreeBSD Foundation funded "Feed-Forward
Clock Synchronization Algorithms" project.
For more information, see http://www.synclab.org/radclock/
Submitted by: Julien Ridoux (jridoux at unimelb edu au)
environment strings and ELF auxiliary vectors from a process stack.
Make sysctl_kern_proc_args to read not cached arguments from the
process stack.
Export proc_getargv() and proc_getenvv() so they can be reused by
procfs and linprocfs.
Suggested by: kib
Reviewed by: kib
Discussed with: kib, rwatson, jilles
Tested by: pho
MFC after: 2 weeks
ffclock time in seconds.
- Add IOCTL to retrieve ffclock timestamps from userland.
Committed on behalf of Julien Ridoux and Darryl Veitch from the University of
Melbourne, Australia, as part of the FreeBSD Foundation funded "Feed-Forward
Clock Synchronization Algorithms" project.
For more information, see http://www.synclab.org/radclock/
Submitted by: Julien Ridoux (jridoux at unimelb edu au)
system calls to provide feed-forward clock management capabilities to
userspace processes. ffclock_getcounter() returns the current value of the
kernel's feed-forward clock counter. ffclock_getestimate() returns the current
feed-forward clock parameter estimates and ffclock_setestimate() updates the
feed-forward clock parameter estimates.
- Document the syscalls in the ffclock.2 man page.
- Regenerate the script-derived syscall related files.
Committed on behalf of Julien Ridoux and Darryl Veitch from the University of
Melbourne, Australia, as part of the FreeBSD Foundation funded "Feed-Forward
Clock Synchronization Algorithms" project.
For more information, see http://www.synclab.org/radclock/
Submitted by: Julien Ridoux (jridoux at unimelb edu au)
defined and will allow consumers, willing to provide options, file and
line to locking requests, to not worry about options redefining the
interfaces.
This is typically useful when there is the need to build another
locking interface on top of the mutex one.
The introduced functions that consumers can use are:
- mtx_lock_flags_
- mtx_unlock_flags_
- mtx_lock_spin_flags_
- mtx_unlock_spin_flags_
- mtx_assert_
- thread_lock_flags_
Spare notes:
- Likely we can get rid of all the 'INVARIANTS' specification in the
ppbus code by using the same macro as done in this patch (but this is
left to the ppbus maintainer)
- all the other locking interfaces may require a similar cleanup, where
the most notable case is sx which will allow a further cleanup of
vm_map locking facilities
- The patch should be fully compatible with older branches, thus a MFC
is previewed (infact it uses all the underlying mechanisms already
present).
Comments review by: eadler, Ben Kaduk
Discussed with: kib, jhb
MFC after: 1 month
Simplify the description of pause() and shorten the KASSERT message in pause.
Also add a clamp for the timo argument in the non-KASSERT case.
Suggested by: Bruce Evans
MFC after: 1 week
- Wrap [get]{bin,nano,micro}[up]time() functions of sys/time.h to allow
requesting time from either the feedback or the feed-forward clock. If a
feedback (e.g. ntpd) and feed-forward (e.g. radclock) daemon are both running
on the system, both kernel clocks are updated but only one serves time.
- Add similar wrappers for the feed-forward difference clock.
Committed on behalf of Julien Ridoux and Darryl Veitch from the University of
Melbourne, Australia, as part of the FreeBSD Foundation funded "Feed-Forward
Clock Synchronization Algorithms" project.
For more information, see http://www.synclab.org/radclock/
Submitted by: Julien Ridoux (jridoux at unimelb edu au)
clocks. Each routine can output an upper bound on the absolute time or time
interval requested. Different flavours of absolute time can be requested, for
example with or without leap seconds, monotonic or not, etc.
Committed on behalf of Julien Ridoux and Darryl Veitch from the University of
Melbourne, Australia, as part of the FreeBSD Foundation funded "Feed-Forward
Clock Synchronization Algorithms" project.
For more information, see http://www.synclab.org/radclock/
Submitted by: Julien Ridoux (jridoux at unimelb edu au)
Implement ffcounter, a monotonically increasing cumulative counter on top of the
active timecounter. Provide low-level functions to read the ffcounter and
convert it to absolute time or a time interval in seconds using the current
ffclock estimates, which track the drift of the oscillator. Add a ring of
fftimehands to track passing of time on each kernel tick and pick up updates of
ffclock estimates.
Committed on behalf of Julien Ridoux and Darryl Veitch from the University of
Melbourne, Australia, as part of the FreeBSD Foundation funded "Feed-Forward
Clock Synchronization Algorithms" project.
For more information, see http://www.synclab.org/radclock/
Submitted by: Julien Ridoux (jridoux at unimelb edu au)
to the kernel's pause() function. The pause() function can now be used
when cold != 0. Also assert that the timeout in system ticks must be
positive.
Suggested by: Bruce Evans
MFC after: 1 week
to kern/subr_bus.c. Simplify this function so that it no longer
depends on malloc() to execute. Identify a few other places where
it makes sense to use device_delete_all_children().
MFC after: 1 week
nullfs. The problem is that resulting vnode is only required to be
held on return from the successfull call to vop, instead of being
referenced.
Nullfs VOP_INACTIVE() method reclaims the vnode, which in combination
with the VOP_VPTOCNP() interface means that the directory vnode
returned from VOP_VPTOCNP() is reclaimed in advance, causing
vn_fullpath() to error with EBADF or like.
Change the interface for VOP_VPTOCNP(), now the dvp must be
referenced. Convert all in-tree implementations of VOP_VPTOCNP(),
which is trivial, because vhold(9) and vref(9) are similar in the
locking prerequisites. Out-of-tree fs implementation of VOP_VPTOCNP(),
if any, should have no trouble with the fix.
Tested by: pho
Reviewed by: mckusick
MFC after: 3 weeks (subject of re approval)
The current code mixes the use of `flags' and `mode'. This is a bit
confusing, since the faccessat() function as a `flag' parameter to store
the AT_ flag.
Make this less confusing by using the same name as used in the POSIX
specification -- `amode'.
firmware image in the module is registered. Instead, do it when the
other image is itself referenced.
This allows a module with multiple firmware images to be automatically
unloaded when none of the firmware images are in use.
Discussed with: jhb@ (on -hackers)
p->p_boundary_count. Race could cause the execve(2) from the threaded
process to hung since thread boundary counter was incorrect and
single-threading never finished.
Reported by: pluknet, pho
Tested by: pho
MFC after: 1 week
This enables locking consumers to pass their own structures around as const and
be able to assert locks embedded into those structures.
Reviewed by: ed, kib, jhb
curthread-accessing part of mtx_{,un}lock(9) when using a r210623-style
curthread implementation on sparc64, crashing the kernel in its early
cycles as PCPU isn't set up, yet (and can't be set up as OFW is one of the
things we need for that, which leads to a chicken-and-egg problem). What
happens is that due to the fact that the idea of r210623 actually is to
allow the compiler to cache invocations of curthread, it factors out
obtaining curthread needed for both mtx_lock(9) and mtx_unlock(9) to
before the branch based on kobj_mutex_inited when compiling the kernel
without the debugging options. So change kobj_class_compile_static(9)
to just never acquire kobj_mtx, effectively restricting it to its
documented use, and add a kobj_init_static(9) for initializing objects
using a class compiled with the former and that also avoids using mutex(9)
(and malloc(9)). Also assert in both of these functions that they are
used in their intended way only.
While at it, inline kobj_register_method() and kobj_unregister_method()
as there wasn't much point for factoring them out in the first place
and so that a reader of the code has to figure out the locking for
fewer functions missing a KOBJ_ASSERT.
Tested on powerpc{,64} by andreast.
Reviewed by: nwhitehorn (earlier version), jhb
MFC after: 3 days
it can be used by in-kernel consumers.
- Make kern_posix_fallocate() public.
- Use kern_posix_fadvise() and kern_posix_fallocate() to implement the
freebsd32 wrappers for the two system calls.
iteration over the fdsets, kern_select() limits the length of the
fdsets copied in by the last valid file descriptor index. If any bit
is set in a mask above the limit, current implementation ignores the
filedescriptor, instead of returning EBADF.
Fix the issue by scanning the tails of fdset before entering the
select loop and returning EBADF if any bit above last valid
filedescriptor index is set. The performance impact of the additional
check is only imposed on the (somewhat) buggy applications that pass
bad file descriptors to select(2) or pselect(2).
PR: kern/155606, kern/162379
Discussed with: cognet, glebius
Tested by: andreast (powerpc, all 64/32bit ABI combinations, big-endian),
marius (sparc64, big-endian)
MFC after: 2 weeks
Just place the default kobj_method inside the kobjop_desc structure.
There's no need to give these kobj_methods their own symbol. This shaves
off 10 KB of a GENERIC kernel binary.
These structures hold no information that is modified during runtime. By
marking this constant, we see approximately 600 symbols become
read-only (amd64 GENERIC). While there, also mark the kobj_method
structures generated by makeobjops.awk static. They are only referenced
by the kobjop_desc structures within the same file.
Before:
$ ls -l kernel
-rwxr-xr-x 1 ed wheel 15937309 Nov 8 16:29 kernel*
$ size kernel
text data bss dec hex filename
12260854 1358468 2848832 16468154 fb48ba kernel
$ nm kernel | fgrep -c ' r '
8240
After:
$ ls -l kernel
-rwxr-xr-x 1 ed wheel 15922469 Nov 8 16:25 kernel*
$ size kernel
text data bss dec hex filename
12302869 1302660 2848704 16454233 fb1259 kernel
$ nm kernel | fgrep -c ' r '
8838
CTF data from a module. On subsequent attempts to retrieve CTF data for
a module, return an error if there no CTF data.
This fixes a panic if you try to enable fbt probes on a module with CTF
data twice.
Submitted by: Paul Ambrose (ambrosehua AT gmail DOT com)
MFC after: 3 days
all the architectures.
The option allows to mount non-MPSAFE filesystem. Without it, the
kernel will refuse to mount a non-MPSAFE filesytem.
This patch is part of the effort of killing non-MPSAFE filesystems
from the tree.
No MFC is expected for this patch.
Tested by: gianni
Reviewed by: kib
The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.
madvise(2) except that it operates on a file descriptor instead of a
memory region. It is currently only supported on regular files.
Just as with madvise(2), the advice given to posix_fadvise(2) can be
divided into two types. The first type provide hints about data access
patterns and are used in the file read and write routines to modify the
I/O flags passed down to VOP_READ() and VOP_WRITE(). These modes are
thus filesystem independent. Note that to ease implementation (and
since this API is only advisory anyway), only a single non-normal
range is allowed per file descriptor.
The second type of hints are used to hint to the OS that data will or
will not be used. These hints are implemented via a new VOP_ADVISE().
A default implementation is provided which does nothing for the WILLNEED
request and attempts to move any clean pages to the cache page queue for
the DONTNEED request. This latter case required two other changes.
First, a new V_CLEANONLY flag was added to vinvalbuf(). This requests
vinvalbuf() to only flush clean buffers for the vnode from the buffer
cache and to not remove any backing pages from the vnode. This is
used to ensure clean pages are not wired into the buffer cache before
attempting to move them to the cache page queue. The second change adds
a new vm_object_page_cache() method. This method is somewhat similar to
vm_object_page_remove() except that instead of freeing each page in the
specified range, it attempts to move clean pages to the cache queue if
possible.
To preserve the ABI of struct file, the f_cdevpriv pointer is now reused
in a union to point to the currently active advice region if one is
present for regular files.
Reviewed by: jilles, kib, arch@
Approved by: re (kib)
MFC after: 1 month
file descriptor drops to zero out of _fdrop() and into devfs_close_f()
as it is only relevant for devfs file descriptors.
Reviewed by: kib
MFC after: 1 week
UP/!SMP case.
The callbacks may be relying on this feature and having 2 different
ways to deal with them is not correct.
Reported by: rstone
Reviewed by: jhb
MFC after: 2 weeks
more general VM system interfaces. So, their implementation can now
reside in kern_malloc.c alongside the other functions that are declared
in malloc.h.
This restores the previous behaviour. While here, match '?' and '.'
inputs exactly and improve the error message.
Requested by: avg@
Derived from a patch by: Arnaud Lacombe <lacombar@gmail.com>
only logged instances where an operation on a file descriptor required
capabilities which the file descriptor did not have. By adding a type enum
to struct ktr_cap_fail, we can catch other types of capability failures as
well, such as disallowed system calls or attempts to wrap a file descriptor
with more capabilities than it had to begin with.
supporting procstat -f: properly provide capability rights information to
userspace. The bug resulted from a merge-o during upstreaming (or rather,
a failure to properly merge FreeBSD-side changed downstream).
Spotted by: des, kibab
MFC after: 3 days
This has been irking me for a while. This causes significant
CPU use on bottlenecked CPUs (eg my older EEEPC w/ an earlier
Celeron CPU and my MIPS24k boards) when they're passing
a lot of traffic.
Since the file/line values are only used for printing, this
should only affect display. It should have no operational
change on the code, besides reducing CPU use.
so that if no vnodes in the filesystem are actively in use the unmount
will succeed rather than failing with EBUSY.
Reported by: Garrett Cooper
Reviewed by: Attilio Rao and Kostik Belousov
Tested by: Garrett Cooper
PR: kern/161016
MFC after: 3 weeks
the unlikely event that sysctl_kmem_map_free() was performed on an
empty kmem map, it would incorrectly report the free space as zero.
Discussed with: avg
MFC after: 1 week
As noted in kern/159780, printf() is not very jail-friendly, since it can't be easily monitored by jail management tools. This patch reports an error via log() instead, which, if nobody is watching the log file, still prints to the console.
Approved by: mentor (rwatson)
Submitted by: Eugene Grosbein <eugen@eg.sd.rdtc.ru>
MFC after: 5 days
itself, which sparc64 hardware doesn't support. One way to solve this
would be to directly call sched_preempt() instead of issuing a self-IPI.
However, quoting jhb@:
"On the other hand, you can probably just skip the IPI entirely if we are
going to send it to the current CPU. Presumably, once this routine
finishes, the current CPU will exit softlock (or will do so "soon") and
will then pick the next thread to run based on the adjustments made in
this routine, so there's no need to IPI the CPU running this routine
anyway. I think this is the better solution. Right now what is probably
happening on other platforms is as soon as this routine finishes the CPU
processes its self-IPI and causes mi_switch() which will just switch back
to the softclock thread it is already running."
- With r226054 and the the above change in place, sparc64 now no longer is
incompatible with ULE and vice versa. However, powerpc/E500 still is.
Submitted by: jhb
Reviewed by: jeff
valid - we don't allow for setting it on a file, for example - but it's
not something we should assert on.
For STABLE kernel, it changes nothing, because it's not compiled with
INVARIANTS. If it was, it would fix crashes. It also fixes an assert
in libc encountered with NFSv4 without nfsuserd(8) running.
Submitted by: Yuri Pankov (earlier version)
MFC after: 1 month
POSIX/SUSvN. The sigwait(2) syscall does return EINTR, and libc.so.7
contains the wrapper sigwait(3) which hides EINTR from callers. The
EINTR return is used by libthr to handle required cancellation point
in the sigwait(3).
To help the binaries linked against pre-libc.so.7, i.e. RELENG_6 and
earlier, to have right ABI for sigwait(3), transform EINTR return from
sigwait(2) into ERESTART.
Discussed with: davidxu
MFC after: 1 week
wdog_kern_pat() acquires eventhandler mutex, thus it cannot work in
kernel context (from where kdb_trap() runs).
The right way to fix this is both offering the
cpu-stop-on-panic-and-skip-locking logic and also a context for KDB
to officially run. We can re-enable this (or a similar) improvement
when these 2 patches hit the tree.
Sponsored by: Sandvine Incorporated
Discussed with: emaste, rstone
MFC after: immediately
syscall exit path. Otherwise, if SIGTRAP is ignored, that tdsendsignal()
do not want to deliver the signal, and debugger never get a notification
of exec.
Found and tested by: Anton Yuzhaninov <citrin citrin ru>
Discussed with: jhb
MFC after: 2 weeks
we were accounting the newly created process to its parent instead
of the child itself. This caused problems later, when the child
changed its credentials - the per-uid, per-jail etc counters were
not properly updated, because the maxproc counter in the child
process was 0.
Approved by: re (kib)
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.
Reviewed by: rwatson
Approved by: re (bz)
If it overflows before the taskqueue can run, the task will be
re-added to the taskqueue and cause a loop in the task list.
Reported by: Arnaud Lacombe <lacombar@gmail.com>
Submitted by: Ryan Stone <rysto32@gmail.com>
Reviewed by: jhb
Approved by: re (kib)
MFC after: 1 day
on vfc_name to set vfc_typenum, so that vfc_typenum doesn't
change when file systems are loaded in different orders. This
keeps NFS file handles from changing, for file systems that
use vfc_typenum in their fsid. This change is controlled via
a loader.conf variable called vfs.typenumhash, since vfc_typenum
will change once when this is enabled. It defaults to 1 for
9.0, but will default to 0 when MFC'd to stable/8.
Tested by: hrs
Reviewed by: jhb, pjd (earlier version)
Approved by: re (kib)
MFC after: 1 month
of the device boundry.
While this is generally ok, the problem is that all the consumers
handle similar cases (and expect to catch) ENOSPC for this (for a
reference look at minidumpsys() and dumpsys() constructions). That
ends up in consumers not recognizing the issue and amd64 failing to
retry if the number of pages grows up during minidump.
Fix this by returning ENOSPC in dump_write() and while here add some
more diagnostic on involved values.
Sponsored by: Sandvine Incorporated
In collabouration with: emaste
Approved by: re (kib)
MFC after: 10 days
In revision 223722 we introduced support for driver ioctls on init/lock
state devices. Unfortunately the call to ttydevsw_cioctl() clobbers the
value of the error variable, meaning that in many cases ioctl() will now
return ENOTTY, even though the ioctl() was processed properly.
Reported by: Boris Samorodov <bsam ipt ru>
Patch by: jilles@
Approved by: re@ (kib@)
- Axe out the SHOW_BUSYBUFS option and uses a tunable for selectively
enable/disable it, which is defaulted for not printing anything (0
value) but can be changed for printing (1 value) and be verbose (2
value)
- Improves the informations outputed: right now, there is no track of
the actual struct buf object or vnode which are referenced by the
shutdown process, but it is printed the related struct bufobj object
which is not really helpful
- Add more verbosity about the state of the struct buf lock and the
vnode informations, with the latter to be activated separately by the
sysctl
Sponsored by: Sandvine Incorporated
Reviewed by: emaste, kib
Approved by: re (ksmith)
MFC after: 10 days
such as j:name:maxproc:sigkill=100. Proper fix - deferring psignal
to a taskqueue - is somewhat complicated and thus will happen
after 9.0.
Approved by: re (kib)
While this is generally good, it brings along a serie of problems,
like clocks going off sync and in presence of SW_WATCHDOG, watchdogs
firing without a good reason (missed hardclock wdog ticks update).
Fix the latter by kicking the watchdog just before to re-enable the interrupts.
Also, while here, not rely on users to stop the watchdog manually when
entering DDB but do that when entering KDB context.
Sponsored by: Sandvine Incorporated
Reviewed by: emaste, rstone
Approved by: re (kib)
MFC after: 1 week
and the new setmode and setowner fileops in FreeBSD 9.0:
- Add new MAC Framework entry point mac_posixshm_check_create() to allow
MAC policies to authorise shared memory use. Provide a stub policy and
test policy templates.
- Add missing Biba and MLS implementations of mac_posixshm_check_setmode()
and mac_posixshm_check_setowner().
- Add 'accmode' argument to mac_posixshm_check_open() -- unlike the
mac_posixsem_check_open() entry point it was modeled on, the access mode
is required as shared memory access can be read-only as well as writable;
this isn't true of POSIX semaphores.
- Implement full range of POSIX shared memory entry points for Biba and MLS.
Sponsored by: Google Inc.
Obtained from: TrustedBSD Project
Approved by: re (kib)
accessible:
(1) Always compile in support for breaking into the debugger if options
KDB is present in the kernel.
(2) Disable both by default, but allow them to be enabled via tunables
and sysctls debug.kdb.break_to_debugger and
debug.kdb.alt_break_to_debugger.
(3) options BREAK_TO_DEBUGGER and options ALT_BREAK_TO_DEBUGGER continue
to behave as before -- only now instead of compiling in
break-to-debugger support, they change the default values of the
above sysctls to enable those features by default. Current kernel
configurations should, therefore, continue to behave as expected.
(4) Migrate alternative break-to-debugger state machine logic out of
individual device drivers into centralised KDB code. This has a
number of upsides, but also one downside: it's now tricky to release
sio spin locks when entering the debugger, so we don't. However,
similar logic does not exist in other device drivers, including uart.
(5) dcons requires some special handling; unlike other console types, it
allows overriding KDB's own debugger selection, so we need a new
interface to KDB to allow that to work.
GENERIC kernels in -CURRENT will now support break-to-debugger as long as
appropriate boot/run-time options are set, which should improve the
debuggability of BETA kernels significantly.
MFC after: 3 weeks
Reviewed by: kib, nwhitehorn
Approved by: re (bz)
but not removed; decrement it instead when the child jail actually
goes away. This avoids letting the counter go below zero in the case
where dying (pr_uref==0) jails are "resurrected", and an associated
KASSERT panic.
Submitted by: Steven Hartland
Approved by: re (bz)
MFC after: 1 week
If a selinfo object is recorded (via selrecord()) and then it is
quickly destroyed, with the waiters missing the opportunity to awake,
at the next iteration they will find the selinfo object destroyed,
causing a PF#.
That happens because the selinfo interface has no way to drain the
waiters before to destroy the registered selinfo object. Also this
race is quite rare to get in practice, because it would require a
selrecord(), a poll request by another thread and a quick destruction
of the selrecord()'ed selinfo object.
Fix this by adding the seldrain() routine which should be called
before to destroy the selinfo objects (in order to avoid such case),
and fix the present cases where it might have already been called.
Sometimes, the context is safe enough to prevent this type of race,
like it happens in device drivers which installs selinfo objects on
poll callbacks. There, the destruction of the selinfo object happens
at driver detach time, when all the filedescriptors should be already
closed, thus there cannot be a race.
For this case, mfi(4) device driver can be set as an example, as it
implements a full correct logic for preventing this from happening.
Sponsored by: Sandvine Incorporated
Reported by: rstone
Tested by: pluknet
Reviewed by: jhb, kib
Approved by: re (bz)
MFC after: 3 weeks
and the maximum TCP send and receive buffer limits from 256kB
to 2MB.
For sb_max_adj we need to add the cast as already used in the sysctl
handler to not overflow the type doing the maths.
Note that this is just the defaults. They will allow more memory
to be consumed per socket/connection if needed but not change the
default "idle" memory consumption. All values are still tunable
by sysctls.
Suggested by: gnn
Discussed on: arch (Mar and Aug 2011)
MFC after: 3 weeks
Approved by: re (kib)
Remove mapped pages for all dataset vnodes in zfs_rezget() using
new vn_pages_remove() to fix mmapped files changed by
zfs rollback or zfs receive -F.
PR: kern/160035, kern/156933
Reviewed by: kib, pjd
Approved by: re (kib)
MFC after: 1 week
callout cpu lock (and after having dropped it).
If the newly scheduled thread wants to acquire the old queue it will
just spin forever.
Fix this by disabling preemption and interrupts entirely (because fast
interrupt handlers may incur in the same problem too) while switching
locks.
Reported by: hrs, Mike Tancsa <mike AT sentex DOT net>,
Chip Camden <sterling AT camdensoftware DOT com>
Tested by: hrs, Mike Tancsa <mike AT sentex DOT net>,
Chip Camden <sterling AT camdensoftware DOT com>,
Nicholas Esborn <nick AT desert DOT net>
Approved by: re (kib)
MFC after: 10 days
effectively negative. Often seen as upstream fastcgi connection timeouts
in nginx when using sendfile over unix domain sockets for communication.
Sendfile(2) may send more bytes then currently allowed by the
hiwatermark of the socket, e.g. because the so_snd sockbuf lock is
dropped after sbspace() call in the kern_sendfile() loop. In this case,
recalculated hiwatermark will overflow. Since lowatermark is renewed
as half of the hiwatermark by sendfile code, and both are unsigned,
the send buffer never reaches the free space requested by lowatermark,
causing indefinite wait in sendfile.
Reviewed by: rwatson
Approved by: re (bz)
MFC after: 2 weeks
buffer is greater than 1. This triggered panics in at least one spot in
the kernel (the MAC Framework) which passes non-negative, rather than >1
buffer sizes based on the size of a user buffer passed into a system
call. While 0-size buffers aren't particularly useful, they also aren't
strictly incorrect, so loosen the assertion.
Discussed with: phk (fears I might be EDOOFUS but willing to go along)
Spotted by: pho + stress2
Approved by: re (kib)
A "process descriptor" file descriptor is used to manage processes
without using the PID namespace. This is required for Capsicum's
Capability Mode, where the PID namespace is unavailable.
New system calls pdfork(2) and pdkill(2) offer the functional equivalents
of fork(2) and kill(2). pdgetpid(2) allows querying the PID of the remote
process for debugging purposes. The currently-unimplemented pdwait(2) will,
in the future, allow querying rusage/exit status. In the interim, poll(2)
may be used to check (and wait for) process termination.
When a process is referenced by a process descriptor, it does not issue
SIGCHLD to the parent, making it suitable for use in libraries---a common
scenario when using library compartmentalisation from within large
applications (such as web browsers). Some observers may note a similarity
to Mach task ports; process descriptors provide a subset of this behaviour,
but in a UNIX style.
This feature is enabled by "options PROCDESC", but as with several other
Capsicum kernel features, is not enabled by default in GENERIC 9.0.
Reviewed by: jhb, kib
Approved by: re (kib), mentor (rwatson)
Sponsored by: Google Inc
query the needed size for a sysctl result by passing in a NULL old
pointer and a valid oldsize. The kern.proc.args sysctl handler broke
this assumption by not calling SYSCTL_OUT() if the old pointer was
NULL.
Approved by: re (kib)
MFC after: 3 days
to implement fchown(2) and fchmod(2) support for several file types
that previously lacked it. Add MAC entries for chown/chmod done on
posix shared memory and (old) in-kernel posix semaphores.
Based on the submission by: glebius
Reviewed by: rwatson
Approved by: re (bz)
When calling poll(2) on a capability, unwrap first and then poll the
underlying object.
Approved by: re (kib), mentor (rwatson)
Sponsored by: Google Inc
a bug was introduced in kern_openat() such that the error from the vnode
open operation was overwritten before it was passed as an argument to
dupfdopen(). This broke operations on /dev/{stdin,stdout,stderr}. Fix
by preserving the original error number across finstall() so that it is
still available.
Approved by: re (kib)
Reported by: cognet
A no-op for non-Capsicum kernels; for Capsicum kernels, completes the
enabling of fooat(2) system calls using capabilities. With this change,
and subject to bug fixes, Capsicum capability support is now complete for
9.0.
Approved by: re (kib)
Submitted by: jonathan
Sponsored by: Google Inc
namei() and lookup() can now perform "strictly relative" lookups.
Such lookups, performed when in capability mode or when looking up
relative to a directory capability, enforce two policies:
- absolute paths are disallowed (including symlinks to absolute paths)
- paths containing '..' components are disallowed
These constraints make it safe to enable openat() and friends.
These system calls are instrumental in supporting Capsicum
components such as the capability-mode-aware runtime linker.
Finally, adjust comments in capabilities.conf to reflect the actual state
of the world (e.g. shm_open(2) already has the appropriate constraints,
getdents(2) already requires CAP_SEEK).
Approved by: re (bz), mentor (rwatson)
Sponsored by: Google Inc.
access to file system subtrees to sandboxed processes.
- Use of absolute paths and '..' are limited in capability mode.
- Use of absolute paths and '..' are limited when looking up relative
to a capability.
- When a name lookup is performed, identify what operation is to be
performed (such as CAP_MKDIR) as well as check for CAP_LOOKUP.
With these constraints, openat() and friends are now safe in capability
mode, and can then be used by code such as the capability-mode runtime
linker.
Approved by: re (bz), mentor (rwatson)
Sponsored by: Google Inc
Since kern_openat() now uses falloc_noinstall() and finstall() separately,
there are cases where we could get to cleanup code without ever creating
a file descriptor. In those cases, we should not call fdclose() on FD -1.
Approved by: re (kib), mentor (rwatson)
Sponsored by: Google Inc
kernel for FreeBSD 9.0:
Add a new capability mask argument to fget(9) and friends, allowing system
call code to declare what capabilities are required when an integer file
descriptor is converted into an in-kernel struct file *. With options
CAPABILITIES compiled into the kernel, this enforces capability
protection; without, this change is effectively a no-op.
Some cases require special handling, such as mmap(2), which must preserve
information about the maximum rights at the time of mapping in the memory
map so that they can later be enforced in mprotect(2) -- this is done by
narrowing the rights in the existing max_protection field used for similar
purposes with file permissions.
In namei(9), we assert that the code is not reached from within capability
mode, as we're not yet ready to enforce namespace capabilities there.
This will follow in a later commit.
Update two capability names: CAP_EVENT and CAP_KEVENT become
CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they
represent.
Approved by: re (bz)
Submitted by: jonathan
Sponsored by: Google Inc
doesn't operate on locked vnode. This could cause a panic.
Fix by unlocking vnode, re-locking afterwards and verifying that it wasn't
renamed or deleted. To improve readability and reduce code size, move code
to a new static function vfs_verify_global_path().
In addition, fix missing giant unlock in unmount().
Reported by: David Wolfskill <david@catwhisker.org>
Reviewed by: kib
Approved by: re (bz)
MFC after: 2 weeks
using vn_fullpath_global(). This fixes f_mntonname if mounting
inside chroot, jail or with relative path as argument.
For unmount in jail, use vn_fullpath_global() to discover
global path from supplied path argument. This fixes unmount in jail.
Reviewed by: pjd, kib
Approved by: re (kib)
MFC after: 2 weeks
This is a followup to r222032 and a reimplementation of it.
While that revision fixed the race for the smp_rv_waiters[2] exit
sentinel, it still left a possibility for a target CPU to access
stale or wrong smp_rv_func_arg in smp_rv_teardown_func.
To fix this race the slave CPUs signal when they are really fully
done with the rendezvous and the master CPU waits until all slaves
are done.
Diagnosed by: kib
Reviewed by: jhb, mlaier, neel
Approved by: re (kib)
MFC after: 2 weeks
This is done per request/suggestion from John Baldwin
who introduced the option. Trying to resume normal
system operation after a panic is very unpredictable
and dangerous. It will become even more dangerous
when we allow a thread in panic(9) to penetrate all
lock contexts.
I understand that the only purpose of this option was
for testing scenarios potentially resulting in panic.
Suggested by: jhb
Reviewed by: attilio, jhb
X-MFC-After: never
Approved by: re (kib)
so that it is visible to userland programs. This change enables
the `mount' command with no arguments to be able to show if a
filesystem is mounted using journaled soft updates as opposed
to just normal soft updates.
Approved by: re (bz)
32 bits to 64 bits and eliminates the unused mnt_xflag field. The
existing mnt_flag field is completely out of bits, so this update
gives us room to expand. Note that the f_flags field in the statfs
structure is already 64 bits, so the expanded mnt_flag field can
be exported without having to make any changes in the statfs structure.
Approved by: re (bz)
Now that the code is in place to audit capability method rights, start
using it to audit the 'rights' argument to cap_new(2).
Approved by: re (kib), mentor (rwatson)
Sponsored by: Google Inc
When reporting on a capability, flag the fact that it is a capability,
but also unwrap to report all of the usual information about the
underlying file.
Approved by: re (kib), mentor (rwatson)
Sponsored by: Google Inc
pollution. That is a step further in the direction of building correct
policies for userland and modules on how to deal with the number of
maxcpus at runtime.
Reported by: jhb
Reviewed and tested by: pluknet
Approved by: re (kib)
pc_name is only included when KTR option is and it does introduce a
subdle KBI breakage that totally breaks vmstat when world and kernel are
not in sync.
Besides, it is not used somewhere.
In collabouration with: pluknet
Reviewed by: jhb
Approved by: re (kib)
ki_rusage member when KERN_PROC_INC_THREAD is passed to one of the
process sysctls.
- Correctly account for the current thread's cputime in the thread when
doing the runtime fixup in calcru().
- Use TIDs as the key to lookup the previous thread to compute IO stat
deltas in IO mode in top when thread display is enabled.
Reviewed by: kib
Approved by: re (kib)
sintrcnt/sintrnames which are symbols containing the size of the 2
tables.
- For amd64/i386 remove the storage of intr* stuff from assembly files.
This area can be widely improved by applying the same to other
architectures and likely finding an unified approach among them and
move the whole code to be MI. More work in this area is expected to
happen fairly soon.
No MFC is previewed for this patch.
Tested by: pluknet
Reviewed by: jhb
Approved by: re (kib)
may be jointly referenced via the mask CTLFLAG_CAPRW. Sysctls with these
flags are available in Capsicum's capability mode; other sysctl nodes are
not.
Flag several useful sysctls as available in capability mode, such as memory
layout sysctls required by the run-time linker and malloc(3). Also expose
access to randomness and available kernel features.
A few sysctls are enabled to support name->MIB conversion; these may leak
information to capability mode by virtue of providing resolution on names
not flagged for access in capability mode. This is, generally, not a huge
problem, but might be something to resolve in the future. Flag these cases
with XXX comments.
Submitted by: jonathan
Sponsored by: Google, Inc.
sampling mode PMC is allocated, hwpmc calls linker_hwpmc_list_objects()
while already holding an exclusive lock on pmc-sx lock. list_objects()
tries to acquire an exclusive lock on the kld_sx lock. When a KLD module
is loaded or unloaded successfully, kern_kld(un)load calls into the pmc
hook while already holding an exclusive lock on the kld_sx lock. Calling
the pmc hook requires acquiring a shared lock on the pmc-sx lock.
Fix this by only acquiring a shared lock on the kld_sx lock in
linker_hwpmc_list_objects(), and also downgrading to a shared lock on the
kld_sx lock in kern_kld(un)load before calling into the pmc hook. In
kern_kldload this required moving some modifications of the linker_file_t
to happen before calling into the pmc hook.
This fixes the deadlock by ensuring that the hwpmc -> list_objects() case
is always able to proceed. Without this patch, I was able to deadlock a
multicore system within minutes by constantly loading and unloading an KLD
module while I simultaneously started a sampling mode PMC in a loop.
MFC after: 1 month
Implement two previously-reserved Capsicum system calls:
- cap_new() creates a capability to wrap an existing file descriptor
- cap_getrights() queries the rights mask of a capability.
Approved by: mentor (rwatson), re (Capsicum blanket)
Sponsored by: Google Inc
Code to actually implement Capsicum capabilities, including fileops and
kern_capwrap(), which creates a capability to wrap an existing file
descriptor.
We also modify kern_close() and closef() to handle capabilities.
Finally, remove cap_filelist from struct capability, since we don't
actually need it.
Approved by: mentor (rwatson), re (Capsicum blanket)
Sponsored by: Google Inc
option that is highly recommended to be adjusted in too much
documentation while doing nothing in FreeBSD since r2729 (rev 1.1).
ipcs(1) needs to be recompiled as it is accessing _KERNEL private
variables.
Reviewed by: jhb (before comment change on linux code)
Sponsored by: Sandvine Incorporated
delivered to parent when the child exists.
Submitted by: Petr Salinger <Petr.Salinger seznam cz> (Debian/kFreeBSD)
MFC after: 1 week
X-MFC-note: bump __FreeBSD_version
uiomove generates EFAULT if any accessed address is not mapped, as
opposed to handling the fault.
Sponsored by: The FreeBSD Foundation
Reviewed by: alc (previous version)
Rather than checking to see if a descriptor is a kqueue, check to see if
its fileops flags include DFLAG_PASSABLE.
At the moment, these two tests are equivalent, but this will change with
the addition of capabilities that wrap kqueues but are themselves of type
DTYPE_CAPABILITY. We already have the DFLAG_PASSABLE abstraction, so let's
use it.
This change has been tested with [the newly improved] tools/regression/kqueue.
Approved by: mentor (rwatson), re (Capsicum blanket)
Sponsored by: Google Inc
o Move the non-blocking socket test below the SBS_CANTRCVMORE so that EOF
is correctly returned on a remote connection close.
o In the non-blocking socket test compare SS_NBIO against the so->so_state
field instead of the incorrect sb->sb_state field.
o Simplify the ENOTCONN test by removing cases that can't occur.
Submitted by: trociny (with some further tweaks by committer)
Tested by: trociny
won't happen before 9.0. This commit adds "#ifdef RACCT" around all the
"PROC_LOCK(p); racct_whatever(p, ...); PROC_UNLOCK(p)" instances, in order
to avoid useless locking/unlocking in kernels built without "options RACCT".
This new version of _fget() requires new parameters:
- cap_rights_t needrights
the rights that we expect the capability's rights mask to include
(e.g. CAP_READ if we are going to read from the file)
- cap_rights_t *haverights
used to return the capability's rights mask (ignored if NULL)
- u_char *maxprotp
the maximum mmap() rights (e.g. VM_PROT_READ) that can be permitted
(only used if we are going to mmap the file; ignored if NULL)
- int fget_flags
FGET_GETCAP if we want to return the capability itself, rather than
the underlying object which it wraps
Approved by: mentor (rwatson), re (Capsicum blanket)
Sponsored by: Google Inc
cap_funwrap() and cap_funwrap_mmap() unwrap capabilities, exposing the
underlying object. Attempting to unwrap a capability with an inadequate
rights mask (e.g. calling cap_funwrap(fp, CAP_WRITE | CAP_MMAP, &result)
on a capability whose rights mask is CAP_READ | CAP_MMAP) will result in
ENOTCAPABLE.
Unwrapping a non-capability is effectively a no-op.
These functions will be used by Capsicum-aware versions of _fget(), etc.
Approved by: mentor (rwatson), re (Capsicum blanket)
Sponsored by: Google Inc
to be assigned to a non-default FIB instance.
You may need to recompile world or ports due to the change of struct ifnet.
Submitted by: cjsp
Submitted by: Alexander V. Chernikov (melifaro ipfw.ru)
(original versions)
Reviewed by: julian
Reviewed by: Alexander V. Chernikov (melifaro ipfw.ru)
MFC after: 2 weeks
X-MFC: use spare in struct ifnet
The cioctl() hook can be used by drivers to add ioctls to the *.init and
*.lock devices. This commit breaks the ttydevsw ABI, since this
structure didn't provide any padding. To prevent ABI breakage in the
future, add a tsw_spare.
Submitted by: Peter Jeremy <peter jeremy alcatel lucent com>
Obtained from: kern/152254 (slightly modified)
descriptors, we will want to allocate a new descriptor without installing
it in the FD array.
Split falloc() into falloc_noinstall() and finstall(), and rewrite
falloc() to call them with appropriate atomicity.
Approved by: mentor (rwatson), re (bz)
more explicit comments about what's going on and what future maintainers
need to do when e.g. adding a new operation to a sys_machdep.c.
Approved by: mentor(rwatson), re(bz)
option to vm_object_page_remove() asserts that the specified range of pages
is not mapped, or more precisely that none of these pages have any managed
mappings. Thus, vm_object_page_remove() need not call pmap_remove_all() on
the pages.
This change not only saves time by eliminating pointless calls to
pmap_remove_all(), but it also eliminates an inconsistency in the use of
pmap_remove_all() versus related functions, like pmap_remove_write(). It
eliminates harmless but pointless calls to pmap_remove_all() that were being
performed on PG_UNMANAGED pages.
Update all of the existing assertions on pmap_remove_all() to reflect this
change.
Reviewed by: kib
to do with global namespaces) and CAPABILITIES (which has to do with
constraining file descriptors). Just in case, and because it's a better
name anyway, let's move CAPABILITIES out of the way.
Also, change opt_capabilities.h to opt_capsicum.h; for now, this will
only hold CAPABILITY_MODE, but it will probably also hold the new
CAPABILITIES (implying constrained file descriptors) in the future.
Approved by: rwatson
Sponsored by: Google UK Ltd
... and thus retire debug.kdb.stop_cpus tunable/sysctl.
The knob was to work around CPU stopping issues, which since have been
either fixed or greatly reduced. kdb should really operate in a special
environment with scheduler stopped and interrupts disabled to provide
deterministic debugging.
Discussed with: attilio, rwatson
X-MFC after: 2 months or never
... and also increase the timeout.
It's better to try to proceed somehow despite stuck CPUs than to hang
indefinitely. Especially so during shutdown and when entering kdb or panic.
Timeout value is still an aribitrary value.
Timeout diagnostic is just a printf; the work on something more
debuggable is planned by attilio. Need to be careful here as
stop_cpus_hard is called very early while enetering kdb and soon(-ish)
it may become called very early when entering panic.
Reviewed by: attilio
MFC after: 2 months
processors unless the invariant TSC bit of CPUID is set. Intel processors
may stop incrementing TSC when DPSLP# pin is asserted, according to Intel
processor manuals, i. e., TSC timecounter is useless if the processor can
enter deep sleep state (C3/C4). This problem was accidentally uncovered by
r222869, which increased timecounter quality of P-state invariant TSC, e.g.,
for Core2 Duo T5870 (Family 6, Model f) and Atom N270 (Family 6, Model 1c).
Reported by: Fabian Keil (freebsd-listen at fabiankeil dot de)
Ian FREISLICH (ianf at clue dot co dot za)
Tested by: Fabian Keil (freebsd-listen at fabiankeil dot de)
- Core2 Duo T5870 (C3 state available/enabled)
jkim - Xeon X5150 (C3 state unavailable)
Modify the "alternate break sequence" detecting state
machine so that only a contiguous invocation of the
break sequence is accepted. The old implementation
did not reset the state machine when detecting an
unexpected character.
While here, use an enum for the states of the machine
instead of magic numbers.bmitted by:
Sponsored by: Spectra Logic Corporation
sys/kern/kern_conf.c:
Add make_dev_physpath_alias(). This interface takes
the parent cdev of the alias, an old alias cdev (if any)
to replace with the newly created alias, and the physical
path string. The alias is visiable as a symlink to the
parent, with the same name as the parent, rooted at
physpath in devfs.
Note: make_dev_physpath_alias() has hard coded knowledge of the
Solaris style prefix convention for physical path data,
"id1,". In the future, I expect the convention to change
to allow "physical path quality" to be reported in the
prefix. For example, a physical path based on NewBus
topology would be of "lower quality" than a physical path
reported by a device enclosure.
Sponsored by: Spectra Logic Corporation
device node has been created, pass MAKEDEV_CHECKNAME in so that the devfs
code will do the check.
Use a regular static variable as before, that's good enough to keep us from
calling into devfs most of the time.
Suggested by: kib
MFC after: 1 week
Sponsored by: Spectra Logic Corporation
In devstat_new_entry(), there is no need to initialize the queue
and the mutex in this function. There are ways to do static
initialization on both, so use STAILQ_HEAD_INITIALIZER and
MTX_SYSINIT to initialize the queue and the mutex.
In devstat_alloc(), use an atomic test and set routine to guard
making our entry in /dev. Using just a plain static variable
creates a race condition on multiprocessor machines. If you
attempt to create a second entry in devfs, the kernel will panic.
Submitted by: kdm
Reviewed by: gibbs
Sponsored by: Spectra Logic Corporation
MFC after: 1 week.
interleaving.
Signal dumping to happen only for the first panic which should be the
most important.
Sponsored by: Sandvine Incorporated
Submitted by: Nima Misaghian (nmisaghian AT sandvine DOT com)
MFC after: 2 weeks
Otherwise, p_bufr is set to garbage on the stack, and if that garbage
happens to be non-NULL, and the TOLOG or TOCONS flag is set, putbuf()
will get called and attempt to fill the non-existent buffer.
This is really only relevant for tprintf() (and only when the priority is
not -1), but set it in uprintf() and ttyprintf() for completeness.
The next step, to avoid log buffer scrambling, would be to add the
PRINTF_BUFR_SIZE code to tprintf(), but this should prevent panics.
Submitted by: rmacklem
Found by: pho
for it. Do not not expect a developer to call doadump(). Calling
doadump does not necessarily work when it's declared static. Nor
does it necessarily do what was intended in the context of text
dumps. The dump command always creates a core dump.
Move printing of error messages from doadump to the dump command,
now that we don't have to worry about being called from DDB.
In msgbuf_reinit() and msgbuf_init(), we weren't initializing the mutex.
Depending on the contents of memory, the LO_INITIALIZED flag might be
set on the mutex (either due to a warm reboot, and the message buffer
remaining in place, or due to garbage in memory) and in that case, with
INVARIANTS turned on, we would trigger an assertion that the mutex had
already been initialized.
Fix this by bzeroing the message buffer mutex for the _init() and _reinit()
paths.
Reported by: mdf
This introduce all the underlying support for making this possible (via
the function cpusetobj_strscan() and keeps ktr_cpumask exported. sparc64
implements its own assembly primitives for tracing events and needs to
properly check it. Anyway the sparc64 logic is not implemented yet due
to lack of knowledge (by me) and time (by marius), but it is just a
matter of using ktr_cpumask when possible.
Tested and fixed by: pluknet
Reviewed by: marius
While we have had a fix in place (options PRINTF_BUFR_SIZE=128) to fix
scrambled console output, the message buffer and syslog were still getting
log messages one character at a time. While all of the characters still
made it into the log (courtesy of atomic operations), they were often
interleaved when there were multiple threads writing to the buffer at the
same time.
This fixes message buffer accesses to use buffering logic as well, so that
strings that are less than PRINTF_BUFR_SIZE will be put into the message
buffer atomically. So now dmesg output should look the same as console
output.
subr_msgbuf.c: Convert most message buffer calls to use a new spin
lock instead of atomic variables in some places.
Add a new routine, msgbuf_addstr(), that adds a
NUL-terminated string to a message buffer. This
takes a priority argument, which allows us to
eliminate some races (at least in the the string
at a time case) that are present in the
implementation of msglogchar(). (dangling and
lastpri are static variables, and are subject to
races when multiple callers are present.)
msgbuf_addstr() also allows the caller to request
that carriage returns be stripped out of the
string. This matches the behavior of msglogchar(),
but in testing so far it doesn't appear that any
newlines are being stripped out. So the carriage
return removal functionality may be a candidate
for removal later on if further analysis shows
that it isn't necessary.
subr_prf.c: Add a new msglogstr() routine that calls
msgbuf_logstr().
Rename putcons() to putbuf(). This now handles
buffered output to the message log as well as
the console. Also, remove the logic in putcons()
(now putbuf()) that added a carriage return before
a newline. The console path was the only path that
needed it, and cnputc() (called by cnputs())
already adds a carriage return. So this
duplication resulted in kernel-generated console
output lines ending in '\r''\r''\n'.
Refactor putchar() to handle the new buffering
scheme.
Add buffering to log().
Change log_console() to use msglogstr() instead of
msglogchar(). Don't add extra newlines by default
in log_console(). Hide that behavior behind a
tunable/sysctl (kern.log_console_add_linefeed) for
those who would like the old behavior. The old
behavior led to the insertion of extra newlines
for log output for programs that print out a
string, and then a trailing newline on a separate
write. (This is visible with dmesg -a.)
msgbuf.h: Add a prototype for msgbuf_addstr().
Add three new fields to struct msgbuf, msg_needsnl,
msg_lastpri and msg_lock. The first two are needed
for log message functionality previously handled
by msglogchar(). (Which is still active if
buffering isn't enabled.)
Include sys/lock.h and sys/mutex.h for the new
mutex.
Reviewed by: gibbs
be brought up in the order they are enumerated in the device tree (in
particular, that thread 0 on each core be brought up first). The SLIST
through which we loop to start the CPUs has all of its entries added with
SLIST_INSERT_HEAD(), which means it is in reverse order of enumeration
and so AP startup would always fail in such situations (causing a machine
check or RTAS failure). Fix this by changing the SLIST into an STAILQ,
and inserting new CPUs at the end.
Reviewed by: jhb
larger than the receive buffer, we have to receive in sections.
When notifying the protocol that some data has been drained the
lock is released for a moment. Returning we block waiting for the
rest of data. There is a race, when data could arrive while the
lock was released and then the connection stalls in sbwait.
Fix this by checking for data before blocking and skip blocking
if there are some.
PR: kern/154504
Reported by: Andrey Simonenko <simon@comsys.ntu-kpi.kiev.ua>
Tested by: Andrey Simonenko <simon@comsys.ntu-kpi.kiev.ua>
Reviewed by: rwatson
Approved by: kib (co-mentor)
MFC after: 2 weeks