acquire shared locks on intermediate directories.
- For the LASTCN, we may have to LK_UPGRADE the parent directory before
we lookup the last component.
- Acquire VFS_ROOT and dp locks based on the cn_lkflag.
Sponsored by: Isilon Systems, Inc.
vhold()s us.
- Avoid an extra mutex acquire and release in the common case of vgonel()
by checking for OWEINACT at the start of the function.
- Fix the case where we set OWEINACT in vput(). LK_EXCLUPGRADE drops our
shared lock if it fails.
Sponsored by: Isilon Systems, Inc.
- Assert that REMOVE, CREATE, and RENAME callers have WANTPARENT
or LOCKPARENT set. You can't complete any of these operations without
at least a reference to the parent. Many filesystems check for this case
even though it isn't possible in the current system.
calling VOP_LOOKUP(). Rather than having each filesystem check the
LOCKPARENT flag, we simply check it once here and unlock as required.
The only unusual case is ISDOTDOT, where we require an unlocked vnode
on return. Relocking this vnode with the child locked is allowed since
the child is actually its parent.
- Add a few asserts for some unusual conditions that I do not believe can
happen. These will later go away and turn into implementations for these
conditions.
Sponsored by: Isilon Systems, Inc.
now always allocates a new vnode.
- Define a new function, vnlru_free, which frees vnodes from the free list.
It takes as a parameter the number of vnodes to free, which is
wantfreevnodes - freevnodes when called from vnlru_proc or 1 when
called from getnewvnode(). For now, getnewvnode() still tries to reclaim
a free vnode before creating a new one when we are near the limit.
- Define a function, vdestroy, which handles the actual release of memory
and teardown of locks, etc. This could become a uma_dtor() routine.
- Get rid of minvnodes. Now wantfreevnodes is 1/4th the max vnodes. This
keeps more unreferenced vnodes around so that files which have only
been stat'd are less likely to be kicked out of the system before we
have a chance to read them, etc. These vnodes may still be freed via
the normal vnlru_proc() routines which may some day become a real lru.
actual root file system is mounted, the first entry on the mountlist
is not the root file system and the timestamp for that entry is
typically 0. Passing that to inittodr() caused annoying errors on
alpha and ia64.
So, call inittodr() for all file systems on mountlist, but only when
the timestamp (mnt_time) is non-zero.
lockmgr locks that this thread owns. This is complicated due to
LK_KERNPROC and because lockmgr tolerates unlocking an unlocked lock.
Sponsored by: Isilon Systes, Inc.
before it can call VOP_INACTIVE(). This must use the EXCLUPGRADE path
because we may violate some lock order with another locked vnode if
we drop and reacquire the lock. If EXCLUPGRADE fails, we mark the
vnode with VI_OWEINACT. This case should be very rare.
- Clear VI_OWEINACT in vinactive() and vbusy().
- If VI_OWEINACT is set in vgone() do the VOP_INACTIVE call here as well.
Sponsored by: Isilon Systems, Inc.
necessary since we disable the shared locks in vfs_cache, but it is
prefered that the option not leak out into filesystems when it is
disabled.
Sponsored by: Isilon Systems, Inc.
config option have now been fixed. All filesystems are properly locked
and checked via DEBUG_VFS_LOCKS. Remove the workaround code.
Sponsored by: Isilon Systems, Inc.
last in the list rather than first.
This makes the resouces print in the 4.x order rather than the 5.x order
(eg fdc0 at 0x3f0-0x3f5,0x3f7 is 4.x, but 0x3f7,0x3f0-0x3f5 is 5.x). This
also means that the pci code will once again print the resources in BAR
ascending order.
w/o problems than I was before... This simply brings back the knote_delete
as knlist_delete which will also drop the knote's, instead of just clearing
the list and seeing _ONESHOT...
Fix a race where if a note was _INFLUX and _DETACHED, it could end up being
modified... whoopse..
MFC after: 1 week
Prodded by: ambrisko and dwhite
alignment restrictive, and help performance on some ethernet cards which
currently copy the entire packet a couple bytes to get the packet aligned
properly...
Wordsmithing by: dwhite
Obtained from: NetBSD (code only)
I'll clean it up later: rwatson
in the window between the beginning of panic() and entering the debugger,
it's possible to receive interrupts. If we receive an interrupt, don't
preempt if panicstr != NULL, as the system is in the process of failing, and
the preempting thread is likely to stumble over the failure. The typical
scenario is during the printf() in panic() prior to entering the debugger,
but when running with a slower console type such as serial console.
It could be that the panic string should be passed to the debugger to print,
so that it can run from the debugger's environment rather than a regular
kernel printf.
Glanced at by: jhb
session in tprintf(). SESSRELE() needs to properly dispose of the
sessions mutex.
Add sessrele() which does the proper cleanup and have SESSRELE() call it.
Use SESSRELE also in pgdelete().
Found by: Coverity (ID:526)
it to get better hashing in vfs_hash.
In case of an insert collision in vfs_hash_insert(), put the loosing vnode
on a special list so that vfs_hash_remove() can just assume that it is on
a list.
Drop the VI_HASHED flag.
instead of failing.
When looking for a region to allocate, we used to check to see if the
start address was < end. In the case where A..B is allocated already,
and one wants to allocate A..C (B < C), then this test would
improperly fail (which means we'd examine that region as a possible
one), and we'd return the region B+1..C+(B-A+1) rather than NULL.
Since C+(B-A+1) is necessarily larger than C (end argument), this is
incorrect behavior for rman_reserve_resource_bound().
The fix is to exclude those regions where r->r_start + count - 1 > end
rather than r->r_start > end. This bug has been in this code for a
very long time. I believe that all other tests against end are
correctly done.
This is why sio0 generated a message about interrupts not being
enabled properly for the device. When fdc had a bug that allocated
from 0x3f7 to 0x3fb, sio0 was then given 0x3fc-0x404 rather than the
0x3f8-0x3ff that it wanted. Now when fdc has the same bug, sio0 fails
to allocate its ports, which is the proper behavior. Since the probe
failed, we never saw the messed up resources reported.
I suspect that there are other places in the tree that have weird
looping or other odd work arounds to try to cope with the observed
weirdness this bug can introduce. These workarounds should be located
and eliminated.
Minor debug write fix to match the above test done as well.
'nice' by: mdodd
Sponsored by: timing solutions (http://www.timing.com/)
- Move VSHOULDBUSY, VSHOULDFREE, and VTRYRECYCLE into vfs_subr.c so
no one else attempts to grow a dependency on them.
- Now that objects with pages hold the vnode we don't have to do unlocked
checks for the page count in the vm object in VSHOULDFREE. These three
macros could simply check for holdcnt state transitions to determine
whether the vnode is on the free list already, but the extra safety
the flag affords us is probably worth the minimal cost.
- The leafonly sysctl and code have been dead for several years now,
remove the sysctl and the code that employed it from vtryrecycle().
- vtryrecycle() also no longer has to check the object's page count as
the object holds the vnode until it reaches 0.
Sponsored by: Isilon Systems, Inc.
to get from (mount + inode) to vnode. These tables are mostly
copy&pasted from UFS, sized based on desiredvnodes and therefore
quite large (128K-512K). Several filesystems are buggy enough that
they allocate the hash table even before they know if they will
ever be used or not.
Add "vfs_hash", a system wide hash table, which will replace all
the per-filesystem hash-tables.
The fields we add to struct vnode will more or less be saved in
the respective filesystems inodes.
Having one central implementation will save code and will allow us
to justify the complexity of code to dynamically (re)size the hash
at a later point.
to use only the holdcnt to determine whether a vnode may be recycled,
simplifying the V* macros as well as vtryrecycle(), etc.
Sponsored by: Isilon Systems, Inc.
vtryrecycle(). All obj refs also ref the vnode.
- Consistently use v_incr_usecount() to increment the usecount. This will
be more important later.
Sponsored by: Isilon Systems, Inc.
on an unlinked file. We can't know if this is the case until after we
have the lock.
- Lock the vnode in vn_close, many filesystems had code which was unsafe
without the lock held, and holding it greatly simplifies vgone().
- Adjust vn_lock() to check for the VI_DOOMED flag where appropriate.
Sponsored by: Isilon Systems, Inc.
- Add a vn_start_write/vn_finished_write around vlrureclaim so we don't do
writing ops without suspending. This could suspend the vlruproc which
should not be a problem under normal circumstances.
- Manually implement VMIGHTFREE in vlrureclaim as this was the only instance
where it was used.
- Acquire a lock before calling vgone() as it now requires it.
- Move the acquisition of the vnode interlock from vtryrecycle() to
getnewvnode() so that if it fails we don't drop and reacquire the
vnode_free_list_mtx.
- Check for a usecount or holdcount at the end of vtryrecycle() in case
someone grabbed a ref while we were recycling. Abort the recycle, and
on the final ref drop this vnode will be placed on the head of the free
list.
- Move the redundant VOP_INACTIVE protection code into the local
vinactive() routine to avoid code bloat.
- Keep the vnode lock held across calls to vgone() in several places.
- vgonel() no longer uses XLOCK, instead callers must hold an exclusive
vnode lock. The VI_DOOMED flag is set to allow other threads to detect
a vnode which is no longer valid. This flag is set until the last
reference is gone, and there are no chances for a new ref. vgonel()
holds this lock across the entire function, which greatly simplifies
logic.
_ Only vfree() in one place in vgone() not three.
- Adjust vget() to check the VI_DOOMED flag prior to waiting on the lock
in the LK_NOWAIT case. In other cases, check after we have slept and
acquired an exlusive lock. This will simulate the old vx_wait()
behavior.
Sponsored by: Isilon Systems, Inc.
so that the socket lock is held over the test-and-set removal of the
accept filter option during connect, and the two socket mutex regions
(transition to connected, perform accept filter) are combined.
from uipc_socket.c to uipc_accf.c in do_getopt_accept_filter(), so that it
now matches do_setopt_accept_filter(). Slightly reformulate the logic to
match the optimistic allocation of storage for the argument in advance,
and slightly expand the coverage of the socket lock.
socket lock around knlist_init(), so don't.
Hard code the setting of the socket reference count to 1 rather than
using soref() to avoid asserting the socket lock, since we've not yet
exposed the socket to other threads.
This removes two mutex operations from each socket allocation.
so that the socket does not generate SIGPIPE, only EPIPE, when a write
is attempted after socket shutdown. When the option was introduced in
2002, this required the logic for determining whether SIGPIPE was
generated to be pushed down from dofilewrite() to the socket layer so
that the socket options could be considered. However, the change in
2002 omitted modification to soo_write() required to add that logic,
resulting in SIGPIPE not being generated even without SO_NOSIGPIPE when
the socket was written to using write() or related generic system calls.
This change adds the EPIPE logic to soo_write(), generating a SIGPIPE
signal to the process associated with the passed uio in the event that
the SO_NOSIGPIPE option is not set.
Notes:
- The are upsides and downsides to placing this logic in the socket
layer as opposed to the file descriptor layer. This is really fd
layer logic, but because we need so_options, we have a choice of
layering violations and pick this one.
- SIGPIPE possibly should be delivered to the thread performing the
write, not the process performing the write.
- uio->uio_td and the td argument to soo_write() might potentially
differ; we use the thread in the uio argument.
- The "sigpipe" regression test in src/tools/regression/sockets/sigpipe
tests for the bug.
Submitted by: Mikko Tyolajarvi <mbsd at pacbell dot net>
Talked with: glebius, alfred
PR: 78478
MFC after: 1 week
SIGPIPE signal for the duration of the sento-family syscalls. Use it to
replace previously added hack in Linux layer based on temporarily setting
SO_NOSIGPIPE flag.
Suggested by: alfred
Add support for passing in a mutex. If NULL is passed a global
subr_unit mutex is used.
Add alloc_unrl() which expects the mutex to be held.
Allocating a unit will never sleep as it does not need to allocate
memory.
Cut possible range in half so we can use -1 to mean "out of number".
Collapse first and last runs into the head by means of counters.
This saves memory in the common case(s).
at some point result in a status event being triggered (it should
be a link down event: the Microsoft driver design guide says you
should generate one when the NIC is initialized). Some drivers
generate the event during MiniportInitialize(), such that by the
time MiniportInitialize() completes, the NIC is ready to go. But
some drivers, in particular the ones for Atheros wireless NICs,
don't generate the event until after a device interrupt occurs
at some point after MiniportInitialize() has completed.
The gotcha is that you have to wait until the link status event
occurs one way or the other before you try to fiddle with any
settings (ssid, channel, etc...). For the drivers that set the
event sycnhronously this isn't a problem, but for the others
we have to pause after calling ndis_init_nic() and wait for the event
to arrive before continuing. Failing to wait can cause big trouble:
on my SMP system, calling ndis_setstate_80211() after ndis_init_nic()
completes, but _before_ the link event arrives, will lock up or
reset the system.
What we do now is check to see if a link event arrived while
ndis_init_nic() was running, and if it didn't we msleep() until
it does.
Along the way, I discovered a few other problems:
- Defered procedure calls run at PASSIVE_LEVEL, not DISPATCH_LEVEL.
ntoskrnl_run_dpc() has been fixed accordingly. (I read the documentation
wrong.)
- Similarly, the NDIS interrupt handler, which is essentially a
DPC, also doesn't need to run at DISPATCH_LEVEL. ndis_intrtask()
has been fixed accordingly.
- MiniportQueryInformation() and MiniportSetInformation() run at
DISPATCH_LEVEL, and each request must complete before another
can be submitted. ndis_get_info() and ndis_set_info() have been
fixed accordingly.
- Turned the sleep lock that guards the NDIS thread job list into
a spin lock. We never do anything with this lock held except manage
the job list (no other locks are held), so it's safe to do this,
and it's possible that ndis_sched() and ndis_unsched() can be
called from DISPATCH_LEVEL, so using a sleep lock here is
semantically incorrect. Also updated subr_witness.c to add the
lock to the order list.
asynchronously by different threads. Thus, declare as volatile the
reference count that is accessed through m_ext's pointer, ref_cnt.
Revert the previous change, revision 1.144, that casts as volatile a
single dereference of ref_cnt.
Reviewed by: bmilekic, dwhite
Problem reported by: kris
MFC after: 3 days
for a signal, because kernel stack is swappable, this causes page fault
in kernel under heavy swapping case. Fix this bug by eliminating unneeded
code.
create kernel threads and call rfork(2) with RFTHREAD flag set in this case,
which puts parent and child into the same threading group. As a result
all threads that belong to the same program end up in the same threading
group.
This is similar to what linuxthreads port does, though in this case we don't
have a luxury of having access to the source code and there is no definite
way to differentiate linux_clone() called for threading purposes from other
uses, so that we have to resort to heuristics.
Allow SIGTHR to be delivered between all processes in the same threading
group previously it has been blocked for s[ug]id processes.
This also should improve locking of the same file descriptor from different
threads in programs running under linux compat layer.
PR: kern/72922
Reported by: Andriy Gapon <avg@icyb.net.ua>
Idea suggested by: rwatson
place.
This moves the dependency on GCC's and other compiler's features into
the central sys/cdefs.h file, while the individual source files can
then refer to #ifdef __COMPILER_FEATURE_FOO where they by now used to
refer to #if __GNUC__ > 3.1415 && __BARC__ <= 42.
By now, GCC and ICC (the Intel compiler) have been actively tested on
IA32 platforms by netchild. Extension to other compilers is supposed
to be possible, of course.
Submitted by: netchild
Reviewed by: various developers on arch@, some time ago
sleeping, so in do_tdsignal, we no longer need to test td_waitset.
now td_waitset is only used to give a thread higher priority when
delivering signal to multithreads process.
This also fixes a bug:
when a thread in sigwait states was suspended and later resumed
by SIGCONT, it can no longer receive signals belong to waitset.
debug.cpufreq.lowest tunable and sysctl. Some systems seem to have problems
with the lowest frequencies so setting this prevents them from being
available or used.
owned by a process when it forks, and creates a matching set of references
for the child process, as prescribed by POSIX.
In order to avoid races with other threads in the parent process during
fork(), it is necessary to allocate a temporary reference list while
holding the sem_lock, then transfer those references to the new process
once the sem_lock is released. The implementation is inefficient but
appears functional; in order to improve the efficiency, it will be
necessary to modify the existing structures and logic, which generally
rely on O(n) operations over the global set of semaphores.
PAGE_SIZE.
Unlike originator of the PR suggests retain MAXSHELLCMDLEN definition
(he has been proposing to replace it with PAGE_SIZE everywhere), not only
this reduced the diff significantly, but prevents code obfuscation and also
allows to increase/decrease this parameter easily if needed.
PR: kern/64196
Submitted by: Magnus Bäckström <b@etek.chalmers.se>
doesn't change functionality, but makes code more logical.
Obtained from: DrafonFlyBSD
o Use VOP_GETATTR() to obtain actual size of file and parse no more than that.
Previously, we parsed MAXSHELLCMDLEN characters regardless of the actual file
size. This makes the following working:
$ printf '#!/bin/echo' > /tmp/test.sh
$ chmod 755 /tmp/test.sh
$ /tmp/test.sh
Previously, attempts to execve() that shell script has been failing with bogus
ENAMETOOLONG.
PR: kern/64196
Submitted by: Magnus B.ckstr.m <b@etek.chalmers.se>
script. Otherwise it's possible to panic kernel by constructing a shell
script with first line not ending in '\n'.
Also, treat '\0' as line terminating character, which may me useful in
some situations.
Submitted by: gad
when trim'ing space off the back of a chain; this is indirect
solution to a potential null ptr deref
Noticed by: Coverity Prevent analysis tool (null ptr deref)
Reviewed by: dg, rwatson
vn_extattr_rm. This is meant to catch conditions where IO_NODELOCKED
has been specified without the vnode being locked.
Discussed with: rwatson
MFC after: 1 week
object on to the zone allocator. It should be noted that uma_zalloc(9)
uses bzero to zero out the object so there probably wont be any
real performance benefit. If UMA grows the ability to supply
zeroed zones more efficiently in the future, we will not have to
modify all the existing consumers.
Discussed with: rwatson,julian
MFC after: 1 week
has been set. Assert that this is the case so that we catch filesystems
who are using naked VOP_LOCKs in illegal cases.
Sponsored by: Isilon Systems, Inc.
considered to be as good as an exclusive lock, although there is still a
possibility of someone acquiring a VOP LOCK while xlock is held.
Sponsored by: Isilon Systems, Inc.
List devfs_dirents rather than vnodes off their shared struct cdev, this
saves a pointer field in the vnode at the expense of a field in the
devfs_dirent. There are often 100 times more vnodes so this is bargain.
In addition it makes it harder for people to try to do stypid things like
"finding the vnode from cdev".
Since DEVFS handles all VCHR nodes now, we can do the vnode related
cleanup in devfs_reclaim() instead of in dev_rel() and vgonel().
Similarly, we can do the struct cdev related cleanup in dev_rel()
instead of devfs_reclaim().
rename idestroy_dev() to destroy_devl() for consistency.
Add LIST_ENTRY de_alias to struct devfs_dirent.
Remove v_specnext from struct vnode.
Change si_hlist to si_alist in struct cdev.
String new devfs vnodes' devfs_dirent on si_alist when
we create them and take them off in devfs_reclaim().
Fix devfs_revoke() accordingly. Also don't clear fields
devfs_reclaim() will clear when called from vgone();
Let devfs_reclaim() call dev_rel() instead of vgonel().
Move the usecount tracking from dev_rel() to devfs_reclaim(),
and let dev_rel() take a struct cdev argument instead of vnode.
Destroy SI_CHEAPCLONE devices in dev_rel() (instead of
devfs_reclaim()) when they are no longer used. (This
should maybe happen in devfs_close() instead.)
a socket from a regular socket to a listening socket able to accept new
connections. As part of this state transition, solisten() calls into the
protocol to update protocol-layer state. There were several bugs in this
implementation that could result in a race wherein a TCP SYN received
in the interval between the protocol state transition and the shortly
following socket layer transition would result in a panic in the TCP code,
as the socket would be in the TCPS_LISTEN state, but the socket would not
have the SO_ACCEPTCONN flag set.
This change does the following:
- Pushes the socket state transition from the socket layer solisten() to
to socket "library" routines called from the protocol. This permits
the socket routines to be called while holding the protocol mutexes,
preventing a race exposing the incomplete socket state transition to TCP
after the TCP state transition has completed. The check for a socket
layer state transition is performed by solisten_proto_check(), and the
actual transition is performed by solisten_proto().
- Holds the socket lock for the duration of the socket state test and set,
and over the protocol layer state transition, which is now possible as
the socket lock is acquired by the protocol layer, rather than vice
versa. This prevents additional state related races in the socket
layer.
This permits the dual transition of socket layer and protocol layer state
to occur while holding locks for both layers, making the two changes
atomic with respect to one another. Similar changes are likely require
elsewhere in the socket/protocol code.
Reported by: Peter Holm <peter@holm.cc>
Review and fixes from: emax, Antoine Brodin <antoine.brodin@laposte.net>
Philosophical head nod: gnn
only call the protocol's pru_rcvd() if the protocol has the flag
PR_WANTRCVD set. This brings that instance of pru_rcvd() into line with
the rest, which do check the flag.
MFC after: 3 days
of the global UNIX domain socket mutex: no protection is needed that
early in the setup of the UNIX domain socket and socket structures.
MFC after: 3 days
patch from kan@).
Pull bufobj_invalbuf() out of vinvalbuf() and make g_vfs call it on
close. This is not yet a generally safe function, but for this very
specific use it is safe. This solves the problem with buffers not
being flushed by unmount or after failed mount attempts.
so->so_options when solisten() will succeed, rather than setting it
conditionally based on there not being queued sockets in the completed
socket queue. Otherwise, if the protocol exposes new sockets via the
completed queue before solisten() completes, the listen() system call
will succeed, but the socket and protocol state will be out of sync.
For TCP, this didn't happen in practice, as the TCP code will panic if
a new connection comes in after the tcpcb has been transitioned to a
listening state but the socket doesn't have SO_ACCEPTCONN set.
This is historical behavior resulting from bitrot since 4.3BSD, in which
that line of code was associated with the conditional NULL'ing of the
connection queue pointers (one-time initialization to be performed
during the transition to a listening socket), which are now initialized
separately.
Discussed with: fenner, gnn
MFC after: 3 days
driver. This used to be handled by cpufreq_drv_settings() but it's
useful to get the type/flags separately from getting the settings.
(For example, you don't have to pass an array of cf_setting just to find
the driver type.)
Use this new method in our in-tree drivers to detect reliably if acpi_perf
is present and owns the hardware. This simplifies logic in drivers as well
as fixing a bug introduced in my last commit where too many drivers attached.
soref() to also covering the update of so_state. While no other user
threads can update the socket state here as it's not yet hooked up to
the file descriptor array yet, the protocol could also frob the
socket state here, leading to a lost update to the so_state field.
No reported instances of this bug (as yet).
MFC after: 3 days
connection status before inserting the new socket into the listen
socket's accept queue, or there might be a race in which another thread
wakes up when the accept lock is released, and sees the socket before its
state is set correctly. The wakeup still occurs after the accept lock is
released. There have been no diagnoses of this bug in real-world systems
(as yet).
MFC after: 3 days
statement from some files, so re-add it for the moment, until the
related legalese is sorted out. This change affects:
sys/kern/kern_mbuf.c
sys/vm/memguard.c
sys/vm/memguard.h
sys/vm/uma.h
sys/vm/uma_core.c
sys/vm/uma_dbg.c
sys/vm/uma_dbg.h
sys/vm/uma_int.h
the rate for the 100% state once. Afterwards, use that value for deriving
states. This should fix the problem where the calibrated frequency was
different once a switch was done, giving a different set of levels each
time. Also, properly search for the right cpufreqX device when detaching.
override the current freq level temporarily and restore it when the
higher priority condition is past. Note that only the first overridden
value is saved. Callers pass NULL to CPUFREQ_SET to restore the saved
level. Priorities are not yet used so this commit should have no effect.
are not added to the list(s) of available settings. However, other drivers
can call the CPUFREQ_DRV_SETTINGS() method on those devices directly to
get info about available settings.
Update the acpi_perf(4) driver to use this flag in the presence of
"functional fixed hardware." Thus, future drivers like Powernow can
query acpi_perf for platform info but perform frequency transitions
themselves.
on dev.cpu.0 will affect all of the CPUs together. In the future,
independent control will be supported but this is good enough for now.
Check that the timecounter isn't TSC before switching (from Colin Percival.)
former is callable from user space and the latter from the kernel one. Make
kernel version take additional argument which tells if the respective call
should check for additional restrictions for sending signals to suid/sugid
applications or not.
Make all emulation layers using non-checked version, since signal numbers in
emulation layers can have different meaning that in native mode and such
protection can cause misbehaviour.
As a result remove LIBTHR from the signals allowed to be delivered to a
suid/sugid application.
Requested (sorta) by: rwatson
MFC after: 2 weeks
This information will be very useful for people who are tuning applications
which have a dependence on IPC mechanisms.
The following OIDs were documented:
Message queues:
kern.ipc.msgmax
kern.ipc.msgmni
kern.ipc.msgmnb
kern.ipc.msgtlq
kern.ipc.msgssz
kern.ipc.msgseg
Semaphores:
kern.ipc.semmap
kern.ipc.semmni
kern.ipc.semmns
kern.ipc.semmnu
kern.ipc.semmsl
kern.ipc.semopm
kern.ipc.semume
kern.ipc.semusz
kern.ipc.semvmx
kern.ipc.semaem
Shared memory:
kern.ipc.shmmax
kern.ipc.shmmin
kern.ipc.shmmni
kern.ipc.shmseg
kern.ipc.shmall
kern.ipc.shm_use_phys
kern.ipc.shm_allow_removed
kern.ipc.shmsegs
These new descriptions can be viewed using sysctl -d
PR: kern/65219
Submitted by: Dan Nelson <dnelson at allantgroup dot com> (modified)
No objections: developers@
Descriptions reviewed by: gnn
MFC after: 1 week
suid application. The problem is that Linux applications using old Linux
threads (pre-NPTL) use signal 32 (linux SIGRTMIN) for communication between
thread-processes. If such an linux application is installed suid or sgid
and security.bsd.conservative_signals=1 (default), then permission will be
denied to send such a signal and the application will freeze.
I believe the same will be true for native applications that use libthr,
since libthr uses SIGTHR for implementing conditional variables.
PR: 72922
Submitted by: Andriy Gapon <avg@icyb.net.ua>
MFC after: 2 weeks
list, set `curr_callout' to NULL. This ensures that we won't attempt
to cancel the current callout if the original callout structure
gets recycled while we wait to acquire Giant.
This is reported to fix an intermittent syscons problem that was
introduced by revision 1.96.
do not need to perform an extra memory fetch in the Packet (Mbuf+Cluster)
constructor to initialize the reference counter anymore. The reference
counts are located in a separate memory region (in the slab header,
because this zone is UMA_ZONE_REFCNT), so the memory fetch resulted very
often in a cache miss. Additionally, and perhaps more significantly,
optimize the free mbuf+cluster (packet) case, which is very common, to
no longer require an atomic operation on free (to verify the reference
counter) if the reference on the cluster has never been increased (also
very common). Reduces an atomic on mbuf free on average.
Original patch submitted by: Gerrit Nagelhout <gnagelhout@sandvine.com>
behaviour of chflags within a jail. If set to 0 (the default), then a
jailed root user is treated as an unprivileged user; if set to 1, then
a jailed root user is treated the same as an unjailed root user.
This is necessary to allow "make installworld" to work inside a jail,
since it attempts to manipulate the system immutable flag on certain
files.
Discussed with: csjp, rwatson
MFC after: 2 weeks
Give FFS vnodes a specific bufwrite method which contains all the
background write stuff and then calls into the default bufwrite()
for the rest of the job.
Remove all the background write related stuff from the normal bufwrite.
This drags the softdep_move_dependencies() back into FFS.
Long term, it is worth looking at simply copying the data into
allocated memory and issuing the bio directly and not create the
"shadow buf" in the first place (just like copy-on-write is done
in snapshots for instance). I don't think we really gain anything
but complexity from doing this with a buf.
structure in the struct pointed to by the 3rd argument for IPC_STAT and
get rid of the 4th argument. The old way returned a pointer into the
kernel array that the calling function would then access afterwards
without holding the appropriate locks and doing non-lock-safe things like
copyout() with the data anyways. This change removes that unsafeness and
resulting race conditions as well as simplifying the interface.
- Implement kern_foo wrappers for stat(), lstat(), fstat(), statfs(),
fstatfs(), and fhstatfs(). Use these wrappers to cut out a lot of
code duplication for freebsd4 and netbsd compatability system calls.
- Add a new lookup function kern_alternate_path() that looks up a filename
under an alternate prefix and determines which filename should be used.
This is basically a more general version of linux_emul_convpath() that
can be shared by all the ABIs thus allowing for further reduction of
code duplication.
callout is first initialised, using a new function callout_init_mtx().
The callout system will acquire this mutex before calling the callout
function and release it on return.
In addition, the callout system uses the mutex to avoid most of the
complications and race conditions inherent in asynchronous timer
facilities, so mutex-protected callouts have much simpler semantics.
As long as the mutex is held when invoking callout_stop() or
callout_reset(), then these functions will guarantee that the callout
will be stopped, even if softclock() had already begun to process
the callout.
Existing Giant-locked callouts will automatically pick up the new
race-free semantics. This should close a number of race conditions
in the USB code and probably other areas of the kernel too.
There should be no change in behaviour for "MP-safe" callouts; these
still need to use the techniques mentioned in timeout(9) to avoid
race conditions.
frequency as a percentage of the base rate and do not change the base
rate directly. The cpufreq framework combines these with absolute drivers
to produce synthesized levels made of one or more settings.
select the CPU frequency level (say for cooling). The driver interface
allows hardware drivers to announce themselves as capable of adjusting
an individual frequency setting.
- Add buffer size limitations (overflow will not be possible anymore).
- Add 'visible' option, which will allow for passphrase reading in the
future.
- Remove special treatment of '@' and '#', those two are only confusing.
Discussed with: rwatson
MFC after: 2 weeks
tond and not fromnd. This could lead us to leak Giant, or unlock it
twice, depending on the filesystems involved. renames within a single
filesystem would not have caused any problems.
Sponsored by: Isilon Systems, Inc.
all reserved, as the lisence makes clear), and strike the third clause
(now this is a 2-clause liberal BSDL as are the rest of files I hold
copyright over).
copies arguments into the kernel space and one that operates
completely in the kernel space;
o use kernel-only version of execve(2) to kill another stackgap in
linuxlator/i386.
Obtained from: DragonFlyBSD (partially)
MFC after: 2 weeks
Add minor2unit() in addition to dev2unit() and unit2minor().
If it wasn't such a hazzle we should redefine minor numbers in
the kernel without the gap for the major number, but it's not worth
the bother (yet).
a process return to userspace if it had pending GEOM events.
We need to have the same check in the exit pass to catch the case
where a GEOM related filedescriptor is not explicitly closed by
the process.
Bumped into by: people using dd(1) to build releases, nanobsd etc.
from the userland and pushes results back and the second which does
actual processing. Use the latter to eliminate stackgap in the linux wrapper
of that syscall.
MFC after: 2 weeks