locks: a global list/counter/generation counter protected by a new
mutex unp_list_lock, and a global linkage rwlock, unp_global_rwlock,
which protects the connections between UNIX domain sockets.
This eliminates conditional lock acquisition that was previously a
property of the global lock being held over sonewconn() leading to a
call to uipc_attach(), which also required the global lock, but
couldn't rely on it as other paths existed to uipc_attach() that
didn't hold it: now uipc_attach() uses only the list lock, which
follows the linkage lock in the lock order. It may also reduce
contention on the global lock for some workloads.
Add global UNIX domain socket locks to hard-coded witness lock
order.
MFC after: 1 week
Discussed with: kris
directory of a vnode to find a dirent with a matching file number. The
name from that dirent is then used to provide the component name.
Note: if the initial vnode argument is not a directory itself, then
the default VOP_VPTOCNP(9) implementation still returns ENOENT.
Reviewed by: kib
Approved by: kib
Tested by: pho
extended attribute get/set; in the case of get an uninitialized user
buffer was passed before the EA was retrieved, making it of relatively
little use; the latter was simply unused by any policies.
Obtained from: TrustedBSD Project
Sponsored by: Google, Inc.
naming by renaming certain "proc" entry points to "cred" entry points,
reflecting their manipulation of credentials. For some entry points,
the process was passed into the framework but not into policies; in
these cases, stop passing in the process since we don't need it.
mac_proc_check_setaudit -> mac_cred_check_setaudit
mac_proc_check_setaudit_addr -> mac_cred_check_setaudit_addr
mac_proc_check_setauid -> mac_cred_check_setauid
mac_proc_check_setegid -> mac_cred_check_setegid
mac_proc_check_seteuid -> mac_cred_check_seteuid
mac_proc_check_setgid -> mac_cred_check_setgid
mac_proc_check_setgroups -> mac_cred_ceck_setgroups
mac_proc_check_setregid -> mac_cred_check_setregid
mac_proc_check_setresgid -> mac_cred_check_setresgid
mac_proc_check_setresuid -> mac_cred_check_setresuid
mac_proc_check_setreuid -> mac_cred_check_setreuid
mac_proc_check_setuid -> mac_cred_check_setuid
Obtained from: TrustedBSD Project
Sponsored by: Google, Inc.
poll_no_poll().
Return a poll_no_poll() result from devfs_poll_f() when
filedescriptor does not reference the live cdev, instead of ENXIO.
Noted and tested by: hps
MFC after: 1 week
Do not overload the local variable size in kern_shmat() due to vm_size_t
change.
Fix style bug by adding explicit comparision with 0.
Discussed with: bde
MFC after: 1 week
wrapper macros that allow trace points and arguments to be declared
using a single macro rather than several. This means a lot less
repetition and vertical space for each trace point.
Use these macros when defining privilege and MAC Framework trace points.
Reviewed by: jb
MFC after: 1 week
vfsopt and the vfs_buildopts function public, and add some new fields
to struct vfsopt (pos and seen), and new functions vfs_getopt_pos and
vfs_opterror.
Further extend the interface to allow reading options from the kernel
in addition to sending them to the kernel, with vfs_setopt and related
functions.
While this allows the "name=value" option interface to be used for more
than just FS mounts (planned use is for jails), it retains the current
"vfsopt" name and <sys/mount.h> requirement.
Approved by: bz (mentor)
operation is known and to retry or fail accordingly to that
outcome. This fixes the problem with namespace traversing
programs failing with random ENOENT errors if someone just
happened to try to unmount that same filesystem at the same
time.
Reported by: dhw
Reviewed by: kib, attilio
Sponsored by: Juniper Networks, Inc.
memory from int to size_t. Implement a workaround for current ABI not
allowing to properly save size for and report more then 2Gb sized segment
of shared memory.
This makes it possible to use > 2 Gb shared memory segments on 64bit
architectures. Please note the new BUGS section in shmctl(2) and
UPDATING note for limitations of this temporal solution.
Reviewed by: csjp
Tested by: Nikolay Dzham <i levsha org ua>
MFC after: 2 weeks
It's better to just use internal language constructs, because it is
likely the compiler has a better opinion on whether to perform inlining,
which is very likely to happen to struct winsize.
Submitted by: Christoph Mallon <christoph mallon gmx de>
net/route.h.
Remove the hidden include of opt_route.h and net/route.h from net/vnet.h.
We need to make sure that both opt_route.h and net/route.h are included
before net/vnet.h because of the way MRT figures out the number of FIBs
from the kernel option. If we do not, we end up with the default number
of 1 when including net/vnet.h and array sizes are wrong.
This does not change the list of files which depend on opt_route.h
but we can identify them now more easily.
printf() and vprintf() are exactly the same, except the way arguments
are passed. Just like we see in other pieces of code (i.e. libc's
printf()), implement printf() using vprintf().
Submitted by: Christoph Mallon <christoph mallon gmx de>
As mentioned by bz and bde, the change I made wasn't the proper way to
fix. Inspired by bde's patch, perform some small cleanups to uprintf().
Reviewed by: bz
Inside do_execve(), we have a pointer `ndp', which always points to
`&nd'. I can imagine a primitive (non-optimizing) compiler to really
reserve space for such a pointer, so just remove the variable and use
`&nd' directly.
kern_time.c:
- Unused variable `p'.
kern_thr.c:
- Variable `error' is always caught immediately, so no reason to
initialize it. There is no way that error != 0 at the end of
create_thread().
kern_sig.c:
- Unused variable `code'.
kern_synch.c:
- `rval' is always assigned in all different cases.
kern_rwlock.c:
- `v' is always overwritten with RW_UNLOCKED further on.
kern_malloc.c:
- `size' is always initialized with the proper value before being used.
kern_exit.c:
- `error' is always caught and returned immediately. abort2() never
returns a non-zero value.
kern_exec.c:
- `len' is always assigned inside the if-statement right below it.
tty_info.c:
- `td' is always overwritten by FOREACH_THREAD_IN_PROC().
Found by: LLVM's scan-build
`p' is already initialized with `td->td_proc'. Because td is always
curthread, it is safe to initialize it without any locks.
Found by: LLVM's scan-build
priv:kernel:priv_check:priv_ok fires for granted privileges
priv:kernel:priv_check:priv_errr fires for denied privileges
The first argument is the requested privilege number. The naming
convention is a little different from the OpenSolaris equivilent
because we can't have '-' in probefunc names, and our privilege
namespace is different.
MFC after: 1 week
predefined set of methods, which are set in osd_register() and called
via osd_call(). Currently, no methods are defined, though prison
objects will have some in the future.
Expand the locking from a single per-type mutex to three different kinds
of locks (four if you include the requirement that the container
(e.g. prison) be locked when getting/setting data). This clears up one
existing issue, as well as others added by the method support.
Approved by: bz (mentor)
The existing code calls kern_open() to resolve the vnode of a pathname
right after a stat(). This is not correct, because it causes random
character devices to be opened in /dev. This means ls'ing a tape
streamer will cause it to rewind, for example. Changes I have made:
- Add kern_statat_vnhook() to allow binary emulators to `post-process'
struct stat, using the proper vnode.
- Remove unneeded printf's from stat() and statfs().
- Make the Linuxolator use kern_statat_vnhook(), replacing
translate_path_major_minor_at().
- Let translate_fd_major_minor() use vp->v_rdev instead of
vp->v_un.vu_cdev.
Result:
crw-rw-rw- 1 root root 0, 14 Feb 20 13:54 /dev/ptmx
crw--w---- 1 root adm 136, 0 Feb 20 14:03 /dev/pts/0
crw--w---- 1 root adm 136, 1 Feb 20 14:02 /dev/pts/1
crw--w---- 1 ed tty 136, 2 Feb 20 14:03 /dev/pts/2
Before this commit, ptmx also had a major number of 136, because it
silently allocated and deallocated a pseudo-terminal. Device nodes that
cannot be opened now have proper major/minor-numbers.
Reviewed by: kib, netchild, rdivacky (thanks!)
stale entries, we save a copy of the directory's modification time when
the first negative cache entry was added in the directory's NFS node.
When a negative cache entry is hit during a pathname lookup, the parent
directory's modification time is checked. If it has changed, all of the
negative cache entries for that parent are purged and the lookup falls
back to using the RPC. This required adding a new cache_purge_negative()
method to the name cache to purge only negative cache entries for a given
directory.
Submitted by: mohans, Rick Macklem, Ricardo Labiaga @ NetApp
Reviewed by: mohans
- Don't return a negative errno when using an unknown ioctl() on a
pseudo-terminal master device. Be sure to convert ENOIOCTL to ENOTTY,
just like the TTY layer does.
- Even though we should return st_rdev of the master device node when
emulating pty(4) devices, FIODGNAME should still return the name of
the slave device. Otherwise ptsname(3) and ttyname(3) return an
invalid device name.
members for a kinfo entry on a process-wide system.
- Use the newly introduced function in order to fix cases like
KERN_PROC_PROC where aggregating stats are broken because they just
consider the first thread in the pool for each process.
(Note, additively, that KERN_PROC_PROC is rather inaccurate on
thread-wide informations like the 'state' of the process. Such
informations should maybe be invalidated and being forceably discarded
by the consumers?).
- Simplify the logic of sysctl_out_proc() and adjust the
fill_kinfo_thread() accordingly.
- Remove checks on the FIRST_THREAD_IN_PROC() being NULL but add
assertives.
This patch should fix aggregate statistics for KERN_PROC_PROC.
This is one of the reasons why top doesn't use this option and now it
can be use it safely.
ps, when launched in order to display just processes, now should report
correct cpu utilization percentages and times (as opposed by the old
code).
Reviewed by: jhb, emaste
Sponsored by: Sandvine Incorporated
If the file system backing a process' cwd is removed, and procstat -f PID
is called, then these messages would have been printed. The extra verbosity is
not required in this situation.
Requested by: kib
Approved by: kib
revealed that a process' current working directory can be VBAD if the
directory is removed. This can trigger a panic when procstat -f PID is
run.
Tested by: pho
Discovered by: phobot
Reviewed by: kib
Approved by: kib
the correct behaviour (sorting by distance from the current head position
in the scan direction) and bioq_insert_head() and bioq_insert_tail()
have a well defined (and useful) behaviour, especially when intermixed
with calls to bioq_disksort().
In particular:
- fix a bug in the existing bioq_disksort() that did not use the
current head position correctly;
- redefine semantics of bioq_insert_head() and bioq_insert_tail().
bioq_insert_tail() can now be used as a barrier
between previous and subsequent calls to bioq_disksort().
The code is heavily documented in the source code so please refer
to that for the details.
Much of this code comes from Fabio Checconi. Also thanks to Kirk
for feedback on the (re)definition of bioq_insert_tail().
NOTE: in the current tree there is only a handful of files which
intermix calls to bioq_disksort() with bioq_insert_head() and
bioq_insert_tail(). The ordering of the queue in these situation
was not specified (nor easy to figure out) before, so I doubt any
of that code could be affected by the specification of the API.
Also note that the current implementation is significantly simpler
than the previous one (also used in ata_sort_queue()).
It would be useful to reimplement ata_sort_queue() using
the same code used in bioq_disksort().
MFC after: 1 week
Just like the old TTY layer, the current MPSAFE TTY layer does not make
any attempt to serialize calls of write(). Data is copied into the
kernel in 256 (TTY_STACKBUF) byte chunks. If a write() call occurs at
the same time, the data may interleave. This is especially likely when
the TTY starts blocking, because the output queue reaches the high
watermark.
I've implemented this by adding a new flag, TTY_BUSY_OUT, which is used
to mark a TTY as having a thread stuck in write(). Because I don't want
non-blocking processes to be possibly blocked by a sleeping thread, I'm
still allowing it to bypass the protection. According to this message,
the Linux kernel returns EAGAIN in such cases, but I think that's a
little too restrictive:
http://kerneltrap.org/index.php?q=mailarchive/linux-kernel/2007/5/2/85418/thread
PR: kern/118287
from the parent to the child process if they have an operation vector
of &badfileops. This narrows a set of races involving system calls that
allocate a new file descriptor, potentially block for some extended
period, and then return the file descriptor, when invoked by a threaded
program that concurrently invokes fork(2). Similar approches are used
in both Solaris and Linux, and the wideness of this race was introduced
in FreeBSD when we moved to a more optimistic implementation of
accept(2) in order to simplify locking.
A small race necessarily remains because the fork(2) might occur after
the finit() in accept(2) but before the system call has returned, but
that appears unavoidable using current APIs. However, this race is
vastly narrower.
The fix can be validated using the newfileops_on_fork regression test.
PR: kern/130348
Reported by: Ivan Shcheklein <shcheklein at gmail dot com>
Reviewed by: jhb, kib
MFC after: 1 week
Attach call without devclass set crashes the system.
On resume AHCI driver sometimes tries to create duplicate adX device.
It is surely his own problem, but IMHO it is not a reason to crash here.
Other reasons are also possible.
result in errors for a format loading but subsequent correct recognizing
for another format.
File format loading functions should avoid printing any additional
informations but just returning appropriate (and different between each
other) error condition, characterizing different informations.
Additively, the linker should handle appropriately different format
loading errors.
While a general mechanism is desired, fix a simple and common case on
amd64: file type is not recognized for link elf and confuses the linker.
Printout an error if all the registered linker classes can't recognize
and load the module.
Reviewed by: jhb
Sponsored by: Sandvine Incorporated
- Align the fifo output in fifo_print() with other vn_printf() output.
- Remove the leading space from lockmgr_printinfo() so its output lines up
in vn_printf().
- lockmgr_printinfo() now ends with a newline, so remove an extra newline
from vn_printf().
Back in 1.1 of kern_sysctl.c the sysctl() routine wired the "old" userland
buffer for most sysctls (everything except kern.vnode.*). I think to prevent
issues with wiring too much memory it used a 'memlock' to serialize all
sysctl(2) invocations, meaning that only one user buffer could be wired at
a time. In 5.0 the 'memlock' was converted to an sx lock and renamed to
'sysctl lock'. However, it still only served the purpose of serializing
sysctls to avoid wiring too much memory and didn't actually protect the
sysctl tree as its name suggested. These changes expand the lock to actually
protect the tree.
Later on in 5.0, sysctl was changed to not wire buffers for requests by
default (sysctl_handle_opaque() will still wire buffers larger than a single
page, however). As a result, user buffers are no longer wired as often.
However, many sysctl handlers still wire user buffers, so it is still
desirable to serialize userland sysctl requests. Kernel sysctl requests
are allowed to run in parallel, however.
- Expose sysctl_lock()/sysctl_unlock() routines to exclusively lock the
sysctl tree for a few places outside of kern_sysctl.c that manipulate
the sysctl tree directly including the kernel linker and vfs_register().
- sysctl_register() and sysctl_unregister() require the caller to lock
the sysctl lock using sysctl_lock() and sysctl_unlock(). The rest of
the public sysctl API manage the locking internally.
- Add a locked variant of sysctl_remove_oid() for internal use so that
external uses of the API do not need to be aware of locking requirements.
- The kernel linker no longer needs Giant when manipulating the sysctl
tree.
- Add a missing break to the loop in vfs_register() so that we stop looking
at the sysctl MIB once we have changed it.
MFC after: 1 month
sysctls during a linker file unload. We drop the lock when doing similar
operations during a linker file load. To close races, clear the LINKED
flag before dropping the lock so that the linker file is no longer visible
to userland.
MFC after: 1 week
When we leave the console TTY constantly open, we never reset the
termios attributes. This causes output processing, echoing, etc. not to
be reset to the proper values when going into single user mode after the
system has booted. It also causes nl-to-crnl-conversion not to take
place during shutdown, which causes a `staircase effect'.
This patch adds a new TTY flag, TF_OPENED_CONS, which is set when the
TTY is opened through /dev/console. Because the flags are only used by
the kernel and the pstat(8) utility, I've decided to renumber the TTY
flags. This shouldn't be an issue, because the TTY layer is not yet part
of a stable release.
Reported by: Mark Atkinson <atkin901 yahoo com>
Tested by: sepotvin
jail doesn't support. This involves a new function prison_check_af,
like prison_check_ip[46] but that checks only the family.
With this change, most of the errors generated by jailed sockets
shouldn't ever occur, at least until jails are changeable.
Approved by: bz (mentor)
return zero on success and an error code otherwise. The possible errors
are EADDRNOTAVAIL if an address being checked for doesn't match the
prison, and EAFNOSUPPORT if the prison doesn't have any addresses in
that address family. For most callers of these functions, use the
returned error code instead of e.g. a hard-coded EADDRNOTAVAIL or
EINVAL.
Always include a jailed() check in these functions, where a non-jailed
cred always returns success (and makes no changes). Remove the explicit
jailed() checks that preceded many of the function calls.
Approved by: bz (mentor)
called without calling vfs_busy() first. This made umount(8) hang waiting
for mnt_lockref to become zero, which would never happen.
Reviewed by: kib
Approved by: rwatson (mentor)
Reported by: pho
Found with: stress2
Sponsored by: FreeBSD Foundation
Right now we only have a very small amount of drivers that use clists,
but we still allocate 50 cblocks as slush space, which allows drivers to
temporarily overcommit their storage. Most of the drivers don't allow
this anyway.
I've performed the following changes:
- We don't allocate any cblocks on startup.
- I've removed the DDB command, because it has nothing useful to print
now. You can obtain the amount of allocated blocks by running `vmstat
-m | grep clist'.
- I've removed cfreecount, which is now unused.
- The old code first tries to allocate using M_NOWAIT, followed by
M_WAITOK. This doesn't make any sense, so just remove this logic. It
seems the drivers allow us to sleep anyway.
We can even remove ccmax from clist_alloc_cblocks and c_cbmax from
struct clist, but this breaks binary compatibility.
This reduces the amount of allocated cblocks on my system from 54 to 4.
The TTY buffers used the standard <sys/queue.h> lists. Unfortunately
they have a big shortcoming. If you want to have a double linked list,
but no tail pointer, it's still not possible to obtain the previous
element in the list. Inside the buffers we don't need them. This is why
I switched to custom linked list macros. The macros will also keep track
of the amount of items in the list. Because it doesn't use a sentinel,
we can just initialize the queues with zero.
In its simplest form (the output queue), we will only keep two
references to blocks in the queue, namely the head of the list and the
last block in use. All free blocks are stored behind the last block in
use.
I noticed there was a very subtle bug in the previous code: in a very
uncommon corner case, it would uma_zfree() a block in the queue before
calling memcpy() to extract the data from the block.
After running a `make buildkernel', I noticed most of the Giant locks in
sysctl are only caused by a very small amount of sysctl's:
- sysctl.name2oid. This one is locked by SYSCTL_LOCK, just like
sysctl.oidfmt.
- kern.ident, kern.osrelease, kern.version, etc. These are just constant
strings.
- kern.arandom, used by the stack protector. It is already protected by
arc4_mtx.
I also saw the following sysctl's show up. Not as often as the ones
above, but still quite often:
- security.jail.jailed. Also mark security.jail.list as MPSAFE. They
don't need locking or already use allprison_lock.
- kern.devname, used by devname(3), ttyname(3), etc.
This seems to reduce Giant locking inside sysctl by ~75% in my primitive
test setup.
mutex to a reader/writer lock. Lookup operations first grab a read lock and
perform the lookup. If the operation results in a need to modify the cache,
then it tries to do an upgrade. If that fails, it drops the read lock,
obtains a write lock, and redoes the lookup.
an interpreter definition in its program header), set the auxiliary
ELF argument AT_BASE to 0 rather than to the address that we would
have mapped the interpreter at if there had been one.
The ELF ABI specifications appear to be ambiguous as to the desired
behavior in this situation, as they define AT_BASE as the base address
of the interpreter, but do not mention what to do if there is none.
On Solaris, AT_BASE will be set to the base address of the static
binary if there is no interpreter, and on Linux, AT_BASE is set to 0.
We go with the Linux semantics as they are of more immediate utility
and allow the early runtime environment to know that the kernel has
not mapped an interpreter, but because AT_PHDR points at the ELF
header for the running binary, it is still possible to retrieve all
required mapping information when the process starts should it be
required. Either approach would be preferable to our current behavior
of passing a pointer to an unmapped region of user memory as AT_BASE.
MFC after: 3 weeks
backend kegs so it may source compatible memory from multiple backends.
This is useful for cases such as NUMA or different layouts for the same
memory type.
- Provide a new api for adding new backend kegs to secondary zones.
- Provide a new flag for adjusting the layout of zones to stagger
allocations better across cache lines.
Sponsored by: Nokia
sizeof("MAXCPU") being used to calculate a string length rather than
something more reasonable such as sizeof("32"). This shouldn't have
caused any ill effect until we run on machines with 1000000 or more
cpus.
- Restructure selscan() and selrescan() to avoid producing extra selfps
when we have a fd in multiple sets. As described below multiple selfps
may still exist for other reasons.
- Make selrescan() tolerate multiple selfds for a given descriptor
set since sockets use two selinfos per fd. If an event on each selinfo
fires selrescan() will see the descriptor twice. This could result in
select() returning 2x the number of fds actually existing in fd sets.
Reported by: mgleason@ncftp.com
pointers to the callout handler just before and just after the callout
it invoked. I attempted to do this in a manner congruent to tracing in
Solaris's callout mechanism, but couldn't quite use the same names due
to convention and syntax differences.
Example DTrace script to generate a distribution graph of callout
execution times:
callout_execute:::callout_start
{
self->cstart = timestamp;
}
callout_execute:::callout_end
{
@length = quantize(timestamp - self->cstart);
}
Reviewed by: jb
MFC after: 3 days
inside the SYSCTL() macros and thus does not need to be done for
all of the nodes scattered across the source tree.
- Mark the name-cache related sysctl's (including debug.hashstat.*) MPSAFE.
- Mark vm.loadavg MPSAFE.
- Remove GIANT_REQUIRED from vmtotal() (everything in this routine already
has sufficient locking) and mark vm.vmtotal MPSAFE.
- Mark the vm.stats.(sys|vm).* sysctls MPSAFE.
around calls to vlrureclaim() on non-MPSAFE filesystems. Specifically,
vnlru no longer needs Giant for the common case of waking up and deciding
there is nothing for it to do.
MFC after: 2 weeks
the helper function. It is supposed to be useful for any filesystem
that has to unlock dvp to walk to the ".." entry in lookup routine.
Requested by: jhb
Tested by: pho
MFC after: 1 month
VOP_MARKATIME() since unlike the rest of VOP_SETATTR(), VA_MARKATIME
can be performed while holding a shared vnode lock (the same functionality
is done internally by VOP_READ which can run with a shared vnode lock).
Add missing locking of the vnode interlock to the ufs implementation and
remove a special note and test from the NFS client about not supporting the
feature.
Inspired by: ups
Tested by: pho
section of code, this uses WITNESS_NORELEASE() and WITNESS_RELEASEOK() to mark
the boundaries. Both functions require the lock to be held when calling.
This is intended for scenarios like a bus asserting that the bus lock is not
dropped during a driver call. There doesn't appear to be a man page to
document this in.
Reviewed by: jhb
vnode, from -1 down. When vinvalbuf(vp, V_ALT) is done for the vnode, it
incorrectly does vm_object_page_remove(0, 0), removing all pages from
the underlying vm object, not only the pages that back the extended
attributes data.
Change vinvalbuf() to not remove any pages from the object when
V_NORMAL or V_ALT are specified. Instead, the only in-tree caller
in ffs_inode.c:ffs_truncate() that specifies V_ALT explicitely
removes the corresponding page range. The V_NORMAL caller
does vnode_pager_setsize(vp, 0) immediately after the call to
vinvalbuf(V_NORMAL) already.
Reported by: csjp
Reviewed by: ups
MFC after: 3 weeks
In normal operation, the number of cache entries is roughly equal to the
number of active vnodes. However, when most of the recently accessed
vnodes have many hard links, the number of cache entries can be 32000
times as large, exhausting kernel memory and provoking a panic in
kmem_malloc().
MFC after: 2 weeks
with src/tools/sched/schedgraph.py. This allows developers to quickly
create a graphical view of ktr data for any resource in the system.
- Add sched_tdname() and the pcpu field 'name' for quickly and uniformly
identifying records associated with a thread or cpu.
- Reimplement the KTR_SCHED traces using the new generic facility.
Obtained from: attilio
Discussed with: jhb
Sponsored by: Nokia
guarantee atomicity of the operation for other semaphore consumers.
In particular, this should guard against access to the semaphore with
not done or partially done MAC label assignment.
Reviewed by: rwatson
MFC after: 1 month
on SysV semaphores.
The squeeze of the semaphore array in the kern_semctl() modifies
sem_base for the semaphores with sem_base greater then sem_base of
the removed semaphore, as well as the values of the semaphores,
without locking their mutex. This can lead to (killable) hangs or
unexpected behaviour of the processes performing any sem operations
while other process does IPC_RMID.
The semexit_myhook() eventhandler unlocks SEMUNDO_LOCK() while
accessing *suptr. This allows for IPC_RMID for the sem id to be
performed in parallel with undo hook referenced by the current undo
structure. This leads to the panic("semexit - semid not allocated") [1].
The semaphore creation is protected by Giant, while IPC_RMID is done
while only semaphore mutex is held. This seems to result in invalid
values for semtot, causing random ENOSPC error returns [2].
Redo the locking of the semaphores lifetime cycle. Delegate the
sem_mtx to the sole purpose of protecting semget() and
semctl(IPC_RMID). Introduce new sem_undo_mtx to protect SEM_UNDO
handling. Remove the Giant remnants from the code.
Note that mac_sysvsem_check_semget() and mac_sysvsem_create() are
now called while sem_mtx is held, as well as mac_sysvsem_cleanup() [3].
When semaphore is removed, acquire semaphore locks for all semaphores
with sem_base that is going to be changed by squeeze of the sema
array. The lock order is not important there, because the region is
protected by sem_mtx.
Organize both used and free sem_undo structures into the lists,
protected by sem_undo_mtx. In semexit_myhook(), remove sem_undo
structure that is being processed, from used list, without putting it
onto the free to prevent modifications by other threads. This allows
for sem_undo_lock to be dropped to acquire individial semaphore locks
without violating lock order. Since IPC_RMID may no longer find this
sem_undo, do tolerate references to unallocated semaphores in undo
structure, and check sequential number to not undo unrelated semaphore
with the same id.
While there, convert functions definitions to ANSI C and fix small
style(9) glitches.
Reported by: Omer Faruk Sen <omerfsen gmail com> [1], pho [2]
Reviewed by: rwatson [3]
Tested by: pho
MFC after: 1 month
locked. Lookup could attempt to recursively lock that vnode.
Do not call vn_start_write(V_WAIT) while vnode is locked, this may
result in a deadlock with suspension.
vfs_busy() the mountpoint before dropping vnode lock for vnode
that was used to look up the mountpoint, to prevent unmount in
between.
Reported and tested by: pho
Reviewed by: rwatson
MFC after: 3 weeks
During boot, the domain list is locked with Giant. It is not possible to
register any protocols after the system has booted, so the lock is only
used to protect insertion of entries.
There is already a mutex in uipc_domain.c called dom_mtx. Use this mutex
to lock the list, instead of using Giant. It won't matter anything with
respect to performance, but we'll never get rid of Giant if we don't
remove from places where we don't need it.
Approved by: rwatson
MFC after: 3 weeks
of the counter, that may happen when too many sendfile(2) calls are
being executed with this vnode [1].
To keep the size of the struct vm_page and offsets of the fields
accessed by out-of-tree modules, swap the types and locations
of the wire_count and cow fields. Add safety checks to detect cow
overflow and force fallback to the normal copy code for zero-copy
sockets. [2]
Reported by: Anton Yuzhaninov <citrin citrin ru> [1]
Suggested by: alc [2]
Reviewed by: alc
MFC after: 2 weeks
During startup some of the syscons TTY's are used to set attributes like
the screensaver and mouse options. These actions cause /dev/console to
be rendered unusable.
Fix the issue by leaving the TTY opened when it is used as the console
device.
Reported by: imp
the field in the mbuf constructors, since otherwise we have no way to
tell if they are valid. In the future, Kip has plans to add a flag
specifically to indicate validity, which is the preferred model.
When sysctl() is being called with a buffer that is too small, it will
return ENOMEM. Unfortunately the changes I made the other day sets the
error number to 0, because it just returns the error number of the
copyout(). Revert this part of the change.
did not compared nc_dvp with supplied parent directory vnode pointer.
Add the check and note that now branches for vp != NULL and vp == NULL
are the same, thus can be merged.
Reported and reviewed by: kan
Tested by: pho
MFC after: 2 weeks
It seems I forgot to remove `int error' from a single piece of code. I'm
also moving ogetkerninfo() to kern_xxx.c, because it belongs to the
class of compat system information system calls, not the generic sysctl
code.
In the existing code we didn't really enforce that callers hold Giant
before calling userland_sysctl(), even though there is no guarantee it
is safe. Fix this by just placing Giant locks around the call to the oid
handler. This also means we only pick up Giant for a very short period
of time. Maybe we should add MPSAFE flags to sysctl or phase it out all
together.
I've also added SYSCTL_LOCK_ASSERT(). We have to make sure sysctl_root()
and name2oid() are called with the sysctl lock held.
Reviewed by: Jille Timmermans <jille quis cx>
compare map->timestamp with saved timestamp after map read lock is
reacquired, not with saved timestamp + 1. The only consequence of the +1
was unconditional lookup of the next map entry, though.
Tested by: pho
Approved by: des
MFC after: 2 weeks
arrays under #ifndef XEN to make XEN config compile again.
In case of Xen vm_guest is hard coded.
Move the list for the vm_guest sysctl out of the restictive
bounds as the sysctl is there in either case.
invariants and approach for protocol switch methods in protsw_init(),
and also some KASSERT's for non-domain init entries in protocol
switch tables: pru_abort and pru_send must both be implemented.
For now, leave those assertions #if 0'd, since there are a few
protocols that violate them in non-harmful ways. Whether or not we
should enforce pru_abort being implemented for non-stream protocols
is an interesting question: currently abort is only invoked on stream
sockets in situations where un-accepted sockets must be abruptly
closed (i.e., close() on a listen socket with pending connections),
but in principle it is useful for datagram sockets and most datagram
socket types implement it.
MFC after: 3 weeks
looked up would have v_dd set to a non-NULL value. This fixes a panic
seen when running installworld on a diskless system with a separate /usr
file system.
Submitted by: cracauer
Approved by: kib
long commands into multiple requests. [08:12]
Avoid calling uninitialized function pointers in protocol switch
code. [08:13]
Merry Christmas everybody...
Approved by: so (cperciva)
Approved by: re (kensmith)
Security: FreeBSD-SA-08:12.ftpd, FreeBSD-SA-08:13.protosw
In r185891 I removed the newlines from messages written to /dev/console,
because it made startup messages from rc-scripts harder to read. This,
unfortunately, causes the kernel message that is printed after a
non-terminated log message to be concatenated.
This could be fixed, but on short term it's better to just revert the
change.
Reported by: Jaakko Heinonen <jh saunalahti fi>
Inside ptsdrv_{in,out}wakeup() we call KNOTE_LOCKED() to wake up any
kevent(2) users. Because the kqueue handlers are executed synchronously,
we must set PTS_FINISHED before calling ptsdrv_{in,out}wakeup().
Discovered by: nork
Right now the wchan strings "ttyinp" and "ttybgw" only differ one
character from the strings we used prior to MPSAFE TTY. Just rename them
back to their pre-MPSAFE TTY counterparts.
Also rename "ttylck" to "ttymtx", which should make it more clear that a
process is blocked on the TTY mutex, not some other form of locking.
of OFW access semantics, in order to allow future support for real-mode
OF access and flattened device frees. OF client interface modules are
implemented using KOBJ, in a similar way to the PPC PMAP modules.
Because we need Open Firmware to be available before mutexes can be used on
sparc64, changes are also included to allow KOBJ to be used very early in
the boot process by only using the mutex once we know it has been initialized.
Reviewed by: marius, grehan
NULL pointer to struct mount if the looked up vnode is reclaimed. Also,
these syscalls only mnt_ref() the mp, still allowing it to be unmounted;
only struct mount memory is kept from being reused.
Lock the vnode when doing name lookup, then reference its mount point,
unlock the vnode and vfs_busy the mountpoint. This sequence shall take
care of both races.
Reported and tested by: pho
Discussed with: attilio
MFC after: 1 month
it running under a virtual environment. This also introduces a globally
accessible variable vm_guest that can be used where appropriate in the
kernel to inspect this environment.
To make it easier for the long run, an enum VM_GUEST is also introduced,
which could possibly be factored out in a header somewhere (but the
question is where - vm/vm_param.h? sys/param.h?) so it eventually becomes
a part of the standard KPI. In any case, it's a start.
The purpose of all this isn't to absolutely detect that the OS is running
under a virtual environment (cf. "redpill") but to allow the parts of the
kernel and the userland that care about this particular aspect and can do
something useful depending on it to have a standardised interface. Reducing
kern.hz is one example but there are other things that could be done like
avoiding context switches, not using CPU instructions that are known to be
slow in emulation, possibly different strategies in VM (memory) allocation,
CPU scheduling, etc.
It isn't clear if the JAILS/VIMAGE functionality should also be exposed
by this particular mechanism (probably not since they're not "full"
virtual hardware environments). Sometime in the future another sysctl and
a variable could be introduced to reflect if the kernel supports any kind
of virtual hosting (e.g. VMWare VMI, Xen dom0).
Reviewed by: silence from src-commiters@, virtualization@, kmacy@
Approved by: gnn (mentor)
Security: Obscurity doesn't help.
- threadA runs vfs_rel(mp1)
- threadB does unmount the mp1 fs, sets MNTK_UNMOUNT and drop MNT_ILOCK()
- threadA runs vfs_busy(mp1) and, as long as, MNTK_UNMOUNT is set, sleeps
waiting for threadB to complete the unmount
- threadB, in vfs_mount_destroy(), finds mnt_lock > 0 and sleeps waiting
for the refcount to expire.
Fix the deadlock by adding a flag called MNTK_REFEXPIRE which signals the
unmounter is waiting for mnt_ref to expire.
The vfs_busy contenders got awake, fails, and if they retry the
MNTK_REFEXPIRE won't allow them to sleep again.
2) Simplify significantly the code of vfs_mount_destroy() trimming
unnecessary codes:
- as long as any reference exited, it is no-more possible to have
write-op (primarty and secondary) in progress.
- it is no needed to drop and reacquire the mount lock.
- filling the structures with dummy values is unuseful as long as
it is going to be freed.
Tested by: pho, Andrea Barberio <insomniac at slackware dot it>
Discussed with: kib
On HyperThreading CPUs logical cores have same frequency, so setting it
on any core will change the other's one. In most cases first request
to the second core will be the "set" request, done after setting frequency
of the first core. In such case second CPU will obtain throttled frequency
of the first core as it's max_mhz making cpufreq broken due to different
frequency sets.
time it is marked for user space callchain capture in the NMI
handler and the time the callchain capture callback runs.
- Improve code and control flow clarity by invoking hwpmc(4)'s user
space callchain capture callback directly from low-level code.
Reviewed by: jhb (kern/subr_trap.c)
Testing (various patch revisions): gnn,
Fabien Thomas <fabien dot thomas at netasq dot com>,
Artem Belevich <artemb at gmail dot com>
All ioctl()'s that aren't implemented by pts(4) are forwarded to the TTY
itself. Unfortunately this is not correct for FIONREAD, because it will
give the wrong amount of bytes that are available to read.
Tested by: keramida
Reminded by: keramida
error is not EAGAIN. Several sysctls that inspect another process use
p_candebug() for checking access right for the curproc. p_candebug()
returns EAGAIN for some reasons, in particular, for the process doing
exec() now. If execing process tries to lock Giant, we get a livelock,
because sysctl handlers are covered by Giant, and often do not sleep.
Break the livelock by dropping Giant and allowing other threads to
execute in the EAGAIN loop.
Also, do not return EAGAIN from p_candebug() when process is executing,
use more appropriate EBUSY error [1].
Reported and tested by: pho
Suggested by: rwatson [1]
Reviewed by: rwatson, des
MFC after: 1 week
on a best-effort basis. Teach vn_fullpath to use this new VOP if a
regular VFS cache lookup fails. This VOP is designed to supplement the
VFS cache to provide a better chance that a vnode-to-name lookup will
succeed.
Currently, an implementation for devfs is being committed. The default
implementation is to return ENOENT.
A big thanks to kib for the mentorship on this, and to pho for running it
through his stress test suite.
Reviewed by: arch
Approved by: kib
One thing I didn't expect many applications to use, was kqueue() on
pseudo-terminal master devices. There are applications that use kqueue()
on the TTY itself (rtorrent, etc). That doesn't mean we shouldn't
implement this. Libraries like libevent use kqueue() by default, which
means they wouldn't be able to use kqueue().
The old TTY layer implements a very broken version of kqueue() by
performing the actual polling on the TTY device.
Discussed with: peter
outside the prison_states array.
When checking if there is a name configured for the prison, check the
first character to not be '\0' instead of checking if the char array
is present, which it always is. Note, that this is different for the
*jailname in the syscall.
Found with: Coverity Prevent(tm)
CID: 4156, 4155
MFC after: 4 weeks (just that I get the mail)
container structures, depending on VIMAGE_GLOBALS compile time option.
Make VIMAGE_GLOBALS a new compile-time option, which by default will not
be defined, resulting in instatiations of global variables selected for
V_irtualization (enclosed in #ifdef VIMAGE_GLOBALS blocks) to be
effectively compiled out. Instantiate new global container structures
to hold V_irtualized variables: vnet_net_0, vnet_inet_0, vnet_inet6_0,
vnet_ipsec_0, vnet_netgraph_0, and vnet_gif_0.
Update the VSYM() macro so that depending on VIMAGE_GLOBALS the V_
macros resolve either to the original globals, or to fields inside
container structures, i.e. effectively
#ifdef VIMAGE_GLOBALS
#define V_rt_tables rt_tables
#else
#define V_rt_tables vnet_net_0._rt_tables
#endif
Update SYSCTL_V_*() macros to operate either on globals or on fields
inside container structs.
Extend the internal kldsym() lookups with the ability to resolve
selected fields inside the virtualization container structs. This
applies only to the fields which are explicitly registered for kldsym()
visibility via VNET_MOD_DECLARE() and vnet_mod_register(), currently
this is done only in sys/net/if.c.
Fix a few broken instances of MODULE_GLOBAL() macro use in SCTP code,
and modify the MODULE_GLOBAL() macro to resolve to V_ macros, which in
turn result in proper code being generated depending on VIMAGE_GLOBALS.
De-virtualize local static variables in sys/contrib/pf/net/pf_subr.c
which were prematurely V_irtualized by automated V_ prepending scripts
during earlier merging steps. PF virtualization will be done
separately, most probably after next PF import.
Convert a few variable initializations at instantiation to
initialization in init functions, most notably in ipfw. Also convert
TUNABLE_INT() initializers for V_ variables to TUNABLE_FETCH_INT() in
initializer functions.
Discussed at: devsummit Strassburg
Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation
by running the tunable_mbinit() SYSINIT at SI_ORDER_MIDDLE
before the init_maxsockets() SYSINT at SI_ORDER_ANY.
Reviewed by: rwatson, zec
Sponsored by: The FreeBSD Foundation
MFC after: 4 weeks
The /dev/console device node logs all strings that are written to it.
When the string does not contain a trailing newline, it appends one. I
can imagine this was useful a long time ago, but with our current
rc-scripts, it generates a whole bunch of messages that look like:
| Configuring syscons:
| blanktime
| .
By not appending the newlines, the output of `dmesg -a' is now (almost?)
exactly the same as what the user will see on the console device
(syscons, uart).
aio code and are registered via the recently added SYSCALL32_*() helpers.
- Since the aio code likes to invoke fuword and suword a lot down in the
"bowels" of system calls, add a structure holding a set of operations for
things like storing errors, copying in the aiocb structure, storing
status, etc. The 32-bit system calls use a separate operations vector to
handle fuword32 vs fuword, etc. Also, the oldsigevent handling is now
done by having seperate operation vectors with different aiocb copyin
routines.
- Split out kern_foo() functions for the various AIO system calls so the
32-bit front ends can manage things like copying in and converting
timespec structures, etc.
- For both the native and 32-bit aio_suspend() and lio_listio() calls,
just use copyin() to read the array of aiocb pointers instead of using
a for loop that iterated over fuword/fuword32. The error handling in
the old case was incomplete (lio_listio() just ignored any aiocb's that
it got an EFAULT trying to read rather than reporting an error), and
possibly slower.
MFC after: 1 month
parent threads sleep on the parent' struct proc until corresponding
child releases the vmspace. Each sleep is interlocked with proc mutex of
the child, that triggers assertion in the sleepq_add(). The assertion
requires that at any time, all simultaneous sleepers for the channel use
the same interlock.
Silent the assertion by using conditional variable allocated in the
child. Broadcast the variable event on exec() and exit().
Since struct proc * sleep wait channel is overloaded for several
unrelated events, I was unable to remove wakeups from the places where
cv_broadcast() is added, except exec().
Reported and tested by: ganbold
Suggested and reviewed by: jhb
MFC after: 2 week
move that module to the head of the associated linker file's list of modules.
The end result is that once all the modules are loaded, they are sorted in
the reverse of their load order. This causes the kernel linker to invoke
the MOD_QUIESCE and MOD_UNLOAD events in the reverse of the order that
MOD_LOAD was invoked. This means that the ordering of MOD_LOAD events that
is set by the SI_* paramters to DECLARE_MODULE() are now honored in the same
order they would be for SYSUNINIT() for the MOD_QUIESCE and MOD_UNLOAD
events.
MFC after: 1 month
unloading any modules. As a result, if any module veto's an unload
request via MOD_QUIESCE, the entire set of modules for that linker
file will remain loaded and active now rather than leaving the kld
in a weird state where some modules are loaded and some are unloaded.
- This also moves the logic for handling the "forced" unload flag out of
kern_module.c and into kern_linker.c which is a bit cleaner.
- Add a module_name() routine that returns the name of a module and use that
instead of printing pointer values in debug messages when a module fails
MOD_QUIESCE or MOD_UNLOAD.
MFC after: 1 month
Close subtle but relatively unlikely race conditions when
propagating the vnode write error to other active sessions
tracing to the same vnode, without holding a reference on
the vnode anymore. [2]
PR: kern/126368 [1]
Submitted by: rwatson [2]
Reviewed by: kib, rwatson
MFC after: 4 weeks
directly include only the header files needed. This reduces the
unneeded spamming of various headers into lots of files.
For now, this leaves us with very few modules including vnet.h
and thus needing to depend on opt_route.h.
Reviewed by: brooks, gnn, des, zec, imp
Sponsored by: The FreeBSD Foundation
entries for one name. Then, creating inode with that name would remove
one entry, leaving others dormant. Reclaiming the vnode would uncover
negative entries, causing false return of ENOENT from the calls like
stat, that do not create inode.
Prevent creation of the duplicated negative entries.
Reported and debugged with: pho
Reviewed by: jhb
X-MFC: after shared lookup changes
This changes struct kinfo_filedesc and kinfo_vmentry such that they are
same on both 32 and 64 bit platforms like i386/amd64 and won't require
sysctl wrapping.
Two new OIDs are assigned. The old ones are available under
COMPAT_FREEBSD7 - but it isn't that simple. The superceded interface
was never actually released on 7.x.
The other main change is to pack the data passed to userland via the
sysctl. kf_structsize and kve_structsize are reduced for the copyout.
If you have a process with 100,000+ sockets open, the unpacked records
require a 132MB+ copyout. With packing, it is "only" ~35MB. (Still
seriously unpleasant, but not quite as devastating). A similar problem
exists for the vmentry structure - have lots and lots of shared libraries
and small mmaps and its copyout gets expensive too.
My immediate problem is valgrind. It traditionally achieves this
functionality by parsing procfs output, in a packed format. Secondly, when
tracing 32 bit binaries on amd64 under valgrind, it uses a cross compiled
32 bit binary which ran directly into the differing data structures in 32
vs 64 bit mode. (valgrind uses this to track file descriptor operations
and this therefore affected every single 32 bit binary)
I've added two utility functions to libutil to unpack the structures into
a fixed record length and to make it a little more convenient to use.
exact multiple of system page size should still be allowed to be mapped
in their entirety to match the regular vnode backed file behavior.
Reported by: ed
Reviewed by: jhb
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.
Bump __FreeBSD_version for the afore mentioned and in kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
to the fs, but before a vnode on the fs is locked, unmount may free fs
structures, causing access to destroyed data and freed memory.
Introduce a vfs_busymp() function that looks up and busies found
fs while mountlist_mtx is held. Use it in nfsrv_fhtovp() and in the
implementation of the handle syscalls.
Two other uses of the vfs_getvfs() in the vfs_subr.c, namely in
sysctl_vfs_ctl and vfs_getnewfsid seems to be ok. In particular,
sysctl_vfs_ctl is protected by Giant by being a non-sleeping sysctl
handler, that prevents Giant-locked unmount code to interfere with it.
Noted by: tegge
Reviewed by: dfr
Tested by: pho
MFC after: 1 month
- Print flags in hex.
- Note that flags can be fine and panic can be due unexpected error condition.
- Remove redundant new line character.
Eventhough panic message excess 80 characters keep it in one line so it is
easier to grep.
The mqfs_search() routine uses strncmp() to match message queue objects
by name. This is because it can be called from environments where the
file name is not null terminated (the VFS for example).
Unfortunately it doesn't compare the lengths of the message queue names,
which means if a system has "Queue12345", the name "Queue" will also
match.
I noticed this when a student of mine handed in an exercise using
message queues with names "Queue2" and "Queue".
Reviewed by: rink
whitespace) macros from p4/vimage branch.
Do a better job at enclosing all instantiations of globals
scheduled for virtualization in #ifdef VIMAGE_GLOBALS blocks.
De-virtualize and mark as const saorder_state_alive and
saorder_state_any arrays from ipsec code, given that they are never
updated at runtime, so virtualizing them would be pointless.
Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation
advance of teaching vn_fullpath1() how to query file systems for
vnode-to-name mappings when cache lookups fail.
Thanks to kib for guidance and patience on this process.
Reviewed by: kib
Approved by: kib
configured, change the message to let people know this is a
possibility. I've slightly changed the message from the one
submitted by Pekka to keep the printf on one line.
Submitted by: Pekka Savola <pekkas@netcore.fi>
of the ABI of the currently executing image. Change some places to test
the flags instead of explicit comparing with address of known sysentvec
structures to determine ABI features.
Discussed with: dchagin, imp, jhb, peter
and ifnet functions
- add memory barriers to <machine/atomic.h>
- update drivers to only conditionally define their own
- add lockless producer / consumer ring buffer
- remove ring buffer implementation from cxgb and update its callers
- add if_transmit(struct ifnet *ifp, struct mbuf *m) to ifnet to
allow drivers to efficiently manage multiple hardware queues
(i.e. not serialize all packets through one ifq)
- expose if_qflush to allow drivers to flush any driver managed queues
This work was supported by Bitgravity Inc. and Chelsio Inc.
- An "at" hint now reserves a device name.
- A new BUS_HINT_DEVICE_UNIT method is added to the bus interface. When
determining the unit number of a device, this method is invoked to
let the bus driver specify the unit of a device given a specific
devclass. This is the only way a device can be given a name reserved
via an "at" hint.
- Implement BUS_HINT_DEVICE_UNIT() for the acpi(4) and isa(4) bus drivers.
Both of these busses implement this by comparing the resources for a
given hint device with the resources enumerated by ACPI/PnPBIOS and
wire a unit if the hint resources are a subset of the "real" resources.
- Use bus_hinted_children() for adding hinted devices on isa(4) busses
now instead of doing it by hand.
- Remove the unit kludging from sio(4) as it is no longer necessary.
Prodding from: peter, imp
OK'd by: marcel
MFC after: 1 month
This bring huge amount of changes, I'll enumerate only user-visible changes:
- Delegated Administration
Allows regular users to perform ZFS operations, like file system
creation, snapshot creation, etc.
- L2ARC
Level 2 cache for ZFS - allows to use additional disks for cache.
Huge performance improvements mostly for random read of mostly
static content.
- slog
Allow to use additional disks for ZFS Intent Log to speed up
operations like fsync(2).
- vfs.zfs.super_owner
Allows regular users to perform privileged operations on files stored
on ZFS file systems owned by him. Very careful with this one.
- chflags(2)
Not all the flags are supported. This still needs work.
- ZFSBoot
Support to boot off of ZFS pool. Not finished, AFAIK.
Submitted by: dfr
- Snapshot properties
- New failure modes
Before if write requested failed, system paniced. Now one
can select from one of three failure modes:
- panic - panic on write error
- wait - wait for disk to reappear
- continue - serve read requests if possible, block write requests
- Refquota, refreservation properties
Just quota and reservation properties, but don't count space consumed
by children file systems, clones and snapshots.
- Sparse volumes
ZVOLs that don't reserve space in the pool.
- External attributes
Compatible with extattr(2).
- NFSv4-ACLs
Not sure about the status, might not be complete yet.
Submitted by: trasz
- Creation-time properties
- Regression tests for zpool(8) command.
Obtained from: OpenSolaris
kern_unlinkat(), that expects that vn_start_write() actually fills the mp
even when the call failed.
As Tor noted, that pattern relies on the the type stability of the mount
points, as well as that suspended mount points are never freed and
V_XSLEEP is always passed to vn_start_write() when called on a freed
mount point.
Reported by: stass
Reviewed by: tegge
PR: 123768
Peter Holm just discovered this funny bug inside the TTY code: if
uiomove() in ttydisc_write() returns an error, we forget to relock the
TTY before jumping out of ttydisc_write(). Fix it by placing
tty_unlock() and tty_lock() around uiomove().
Submitted by: pho
- Use `fildes[2]' instead of `*fildes' to make more clear that pipe(2)
fills an array with two descriptors.
- Remove EFAULT from the manual page. Because of the current calling
convention, pipe(2) raises a segmentation fault when an invalid
address is passed.
- Introduce kern_pipe() to make it easier for binary emulations to
implement pipe(2).
- Make Linux binary emulation use kern_pipe(), which means we don't have
to recover td_retval after calling the FreeBSD system call.
Approved by: rdivacky
Discussed on: arch
This prevents a panic which occurs when a driver attempts to load
firmware at boot via firmware_get() when the firmware module has not
been preloaded. firmware_get() will enqueue a task using a struct
taskqueue allocated on the stack, and the machine will crash much
later in the firmware taskq thread when taskqs are started and the
struct taskqueue is garbage.
Not objected to by: sam
Looking at our source code history, it seems the uname(),
getdomainname() and setdomainname() system calls got deprecated
somewhere after FreeBSD 1.1, but they have never been phased out
properly. Because we don't have a COMPAT_FREEBSD1, just use
COMPAT_FREEBSD4.
Also fix the Linuxolator to build without the setdomainname() routine by
just making it call userland_sysctl on kern.domainname. Also replace the
setdomainname()'s implementation to use this approach, because we're
duplicating code with sysctl_domainname().
I wasn't able to keep these three routines working in our
COMPAT_FREEBSD32, because that would require yet another keyword for
syscalls.master (COMPAT4+NOPROTO). Because this routine is probably
unused already, this won't be a problem in practice. If it turns out to
be a problem, we'll just restore this functionality.
Reviewed by: rdivacky, kib
On RELENG_6 (and probably RELENG_7) we see our syscons windows and
pseudo-terminals have the following buffer sizes:
| LINE RAW CAN OUT IHIWT ILOWT OHWT LWT COL STATE SESS PGID DISC
| ttyv0 0 0 0 7680 6720 2052 256 7 OCcl 1146 1146 term
| ttyp0 0 0 0 7680 6720 1296 256 0 OCc 82033 82033 term
These buffer sizes make no sense, because we often have much more output
than input, but I guess having higher input buffer sizes improves
guarantees of the system.
On MPSAFE TTY I just sent both the input and output buffer sizes to 7
KB, which is pretty big on a standard FreeBSD install with 8 syscons
windows and some PTY's. Reduce the baud rate to 9600 baud, which means
we now have the following buffer sizes:
| LINE INQ CAN LIN LOW OUTQ USE LOW COL SESS PGID STATE
| ttyv0 1920 0 0 192 1984 0 199 7 2401 2401 Oil
| pts/0 1920 0 0 192 1984 0 199 5631 1305 2526 Oi
This is a lot smaller, but for pseudo-devices this should be good
enough. You need to do a lot of punching to fill up a 7.5 KB input
buffer. If it turns out things don't work out this way, we'll just
switch to 19200 baud.
vnode in question does not need to be held. All the data structures used
during the name lookup are protected by the global name cache lock.
Instead, the caller merely needs to ensure a reference is held on the
vnode (such as vhold()) to keep it from being freed.
In the case of procfs' <pid>/file entry, grab the process lock while we
gain a new reference (via vhold()) on p_textvp to fully close races with
execve(2).
For the kern.proc.vmmap sysctl handler, use a shared vnode lock around
the call to VOP_GETATTR() rather than an exclusive lock.
MFC after: 1 month
chdir(), chroot(), eaccess(), fpathconf(), fstat(), fstatfs(), lseek()
(when figuring out the current size of the file in the SEEK_END case),
pathconf(), readlink(), and statfs() system calls.
Submitted by: ups (mostly)
Tested by: pho
MFC after: 1 month
Really, the concept of holdcnt in the struct mount is rappresented by
the mnt_ref (which prevents the type-stable structure from being
"recycled) handled through vfs_ref() and vfs_rel().
On this optic, switch the holdcnt acquisition into an emulated vfs_ref()
(and subsequent release into vfs_rel()).
Discussed with: kib
Tested by: pho
and server. This replaces the RPC implementation of the NFS client and
server with the newer RPC implementation originally developed
(actually ported from the userland sunrpc code) to support the NFS
Lock Manager. I have tested this code extensively and I believe it is
stable and that performance is at least equal to the legacy RPC
implementation.
The NFS code currently contains support for both the new RPC
implementation and the older legacy implementation inherited from the
original NFS codebase. The default is to use the new implementation -
add the NFS_LEGACYRPC option to fall back to the old code. When I
merge this support back to RELENG_7, I will probably change this so
that users have to 'opt in' to get the new code.
To use RPCSEC_GSS on either client or server, you must build a kernel
which includes the KGSSAPI option and the crypto device. On the
userland side, you must build at least a new libc, mountd, mount_nfs
and gssd. You must install new versions of /etc/rc.d/gssd and
/etc/rc.d/nfsd and add 'gssd_enable=YES' to /etc/rc.conf.
As long as gssd is running, you should be able to mount an NFS
filesystem from a server that requires RPCSEC_GSS authentication. The
mount itself can happen without any kerberos credentials but all
access to the filesystem will be denied unless the accessing user has
a valid ticket file in the standard place (/tmp/krb5cc_<uid>). There
is currently no support for situations where the ticket file is in a
different place, such as when the user logged in via SSH and has
delegated credentials from that login. This restriction is also
present in Solaris and Linux. In theory, we could improve this in
future, possibly using Brooks Davis' implementation of variant
symlinks.
Supporting RPCSEC_GSS on a server is nearly as simple. You must create
service creds for the server in the form 'nfs/<fqdn>@<REALM>' and
install them in /etc/krb5.keytab. The standard heimdal utility ktutil
makes this fairly easy. After the service creds have been created, you
can add a '-sec=krb5' option to /etc/exports and restart both mountd
and nfsd.
The only other difference an administrator should notice is that nfsd
doesn't fork to create service threads any more. In normal operation,
there will be two nfsd processes, one in userland waiting for TCP
connections and one in the kernel handling requests. The latter
process will create as many kthreads as required - these should be
visible via 'top -H'. The code has some support for varying the number
of service threads according to load but initially at least, nfsd uses
a fixed number of threads according to the value supplied to its '-n'
option.
Sponsored by: Isilon Systems
MFC after: 1 month
usable for newer CPUs. The new value allows 2 x quad core configuration
dumps to fit within the initial buffer without reallocations.
Approved by: gnn (mentor) (older version)
Pointed out by: rdivacky
- Implement real draining for vfs consumers by not relying on the
mnt_lock and using instead a refcount in order to keep track of lock
requesters.
- Due to the change above, remove the mnt_lock lockmgr because it is now
useless.
- Due to the change above, vfs_busy() is no more linked to a lockmgr.
Change so its KPI by removing the interlock argument and defining 2 new
flags for it: MBF_NOWAIT which basically replaces the LK_NOWAIT of the
old version (which was unlinked from the lockmgr alredy) and
MBF_MNTLSTLOCK which provides the ability to drop the mountlist_mtx
once the mnt interlock is held (ability still desired by most consumers).
- The stub used into vfs_mount_destroy(), that allows to override the
mnt_ref if running for more than 3 seconds, make it totally useless.
Remove it as it was thought to work into older versions.
If a problem of "refcount held never going away" should appear, we will
need to fix properly instead than trust on such hackish solution.
- Fix a bug where returning (with an error) from dounmount() was still
leaving the MNTK_MWAIT flag on even if it the waiters were actually
woken up. Just a place in vfs_mount_destroy() is left because it is
going to recycle the structure in any case, so it doesn't matter.
- Remove the markercnt refcount as it is useless.
This patch modifies VFS ABI and breaks KPI for vfs_busy() so manpages and
__FreeBSD_version will be modified accordingly.
Discussed with: kib
Tested by: pho
We often run into these very high column numbers when we run curses
applications, because they don't print any newlines. This messes up the
table output of `pstat -t'. If these numbers get really high, they
aren't of any use to the reader anyway. Convert them to `99999' when
they run out of bounds.
One of the pieces of code that I had left alone during the development
of the MPSAFE TTY layer, was tty_cons.c. This file actually has two
different functions:
- It contains low-level console input/output routines (cnputc(), etc).
- It creates /dev/console and wraps all its cdevsw calls to the
appropriate TTY.
This commit reimplements the second set of functions by moving it
directly into the TTY layer. /dev/console is now a character device node
that's basically a regular TTY, but does a lookup of `si_drv1' each time
you open it. d_write has also been changed to call log_console().
d_close() is not present, because we must make sure we don't revoke the
TTY after writing a log message to it.
Even though I'm not convinced this is in line with the future directions
of our console code, it is a good move for now. It removes recursive
locking from the top half of the TTY layer. The previous implementation
called into the TTY layer with Giant held.
I'm renaming tty_cons.c to kern_cons.c now. The code hardly contains any
TTY related bits, so we'd better give it a less misleading name.
Tested by: Andrzej Tobola <ato iem pw edu pl>,
Carlos A.M. dos Santos <unixmania gmail com>,
Eygene Ryabinkin <rea-fbsd codelabs ru>
within an object that a mapping refers to. fileid and fsid are inode/dev
for vnodes. (Linux procfs has these and valgrind is really unhappy
without them.) I believe I didn't change the size of the struct.
dump of detected ULE CPU topology. This dump can be used to check the
topology detection and for general system information.
An example of CPU topology dump is:
kern.sched.topology_spec: <groups>
<group level="1" cache-level="0">
<cpu count="8" mask="0xff">0, 1, 2, 3, 4, 5, 6, 7</cpu>
<flags></flags>
<children>
<group level="2" cache-level="0">
<cpu count="4" mask="0xf">0, 1, 2, 3</cpu>
<flags></flags>
</group>
<group level="2" cache-level="0">
<cpu count="4" mask="0xf0">4, 5, 6, 7</cpu>
<flags></flags>
</group>
</children>
</group>
</groups>
Reviewed by: jeff
Approved by: gnn (mentor)
to add more V* constants, and the variables changed by this patch were often
being assigned to mode_t variables, which is 16 bit.
Approved by: rwatson (mentor)
vpollinfo with vnode interlock. Fully initialize vpollinfo before putting
pointer to it into vp->v_pollinfo.
Discussed with: dwhite
Tested by: pho
MFC after: 1 week
that they operate directly on credentials: mac_proc_create_swapper(),
mac_proc_create_init(), and mac_proc_associate_nfsd(). Update policies.
Obtained from: TrustedBSD Project
"ticks" goes negative. This breaks the signed comparison in softclock.
This causes sleep() to never wake up, tcp to stop, etc etc. This is
bad(TM). Use the SEQ_LT() method from tcp's sequence number comparisons.
Due to the nature of the beast it causes lot of unproductive overhead. This
is especially bad when running SMP kernel on VMWare with several virtual
processors - idle FreeBSD guest with SMP kernel takes 150% host CPU time on my
dual-core MacBook Pro when I am enabling two virtual CPUs, making even host
not very usable. Detect when we are running in the sandbox and reduce HZ
to 10 (can be adjusted via VM_HZ in the kernel config) in such cases. This
brings host CPU usage of idle FreeBSD/SMP on two virtual processors down
to 10%.
Detect most popular VM platforms out there - VMWare, Parallels, VirtualBox
and VirtualPC.
MFC after: 2 weeks
when thread is in kernel mode, it can cause dead loop, now unlock
process lock after acquired sleep queue lock and thread lock to
avoid the problem. This means TDF_NEEDSIGCHK and TDF_NEEDSUSPCHK must
be set with process lock and thread lock being hold at same time.
unnecessary, the normal process lock and thread lock are enough. The
spin lock is still needed for process and thread exiting to mimic
single sched_lock.
rest in kern_getdirentries(). Use kern_getdirentries() to implement
freebsd32_getdirentries(). This fixes a bug where calls to getdirentries()
in 32-bit binaries would trash the 4 bytes after the 'long base' in
userland.
Submitted by: ups
MFC after: 1 week
- If there aren't spinlocks held, but there are problems with old
sleeplocks, they are not reported.
- If the spinlock found is not the only one, problems are not reported.
Fix these 2 problems.
Reported by: tegge
and ffs_lock. This cannot catch situations where holdcnt is incremented
not by curthread, but I think it is useful.
Reviewed by: tegge, attilio
Tested by: pho
MFC after: 2 weeks
MNTK_UNMOUNT before, and mnt_mtx is used as interlock. vfs_busy() always
tries to obtain a shared lock on mnt_lock, the other user is unmount who
tries to drain it, setting MNTK_UNMOUNT before.
Reviewed by: tegge, attilio
Tested by: pho
MFC after: 2 weeks
realtimer_expire() to not rearm the timer, otherwise there is a chance
that a callout will be left there and be tiggered in future unexpectly.
Bug reported by: tegge@