- struct plimit includes a mutex to protect a reference count. The plimit
structure is treated similarly to struct ucred in that is is always copy
on write, so having a reference to a structure is sufficient to read from
it without needing a further lock.
- The proc lock protects the p_limit pointer and must be held while reading
limits from a process to keep the limit structure from changing out from
under you while reading from it.
- Various global limits that are ints are not protected by a lock since
int writes are atomic on all the archs we support and thus a lock
wouldn't buy us anything.
- All accesses to individual resource limits from a process are abstracted
behind a simple lim_rlimit(), lim_max(), and lim_cur() API that return
either an rlimit, or the current or max individual limit of the specified
resource from a process.
- dosetrlimit() was renamed to kern_setrlimit() to match existing style of
other similar syscall helper functions.
- The alpha OSF/1 compat layer no longer calls getrlimit() and setrlimit()
(it didn't used the stackgap when it should have) but uses lim_rlimit()
and kern_setrlimit() instead.
- The svr4 compat no longer uses the stackgap for resource limits calls,
but uses lim_rlimit() and kern_setrlimit() instead.
- The ibcs2 compat no longer uses the stackgap for resource limits. It
also no longer uses the stackgap for accessing sysctl's for the
ibcs2_sysconf() syscall but uses kernel_sysctl() instead. As a result,
ibcs2_sysconf() no longer needs Giant.
- The p_rlimit macro no longer exists.
Submitted by: mtm (mostly, I only did a few cleanups and catchups)
Tested on: i386
Compiled on: alpha, amd64
- Rename temporary variable names ("tmp", "tmp2") to more informative
names ("load", "pctcpu", "rss", ...)
- Unclutter indentation and return paths: rather than lots of nested
ifs, simply return earlier if it's not going to work out. Simplify
general structure and avoid "deep" code.
- Comment on the thread/process selection and locking.
- Correct handling of "running"/"runnable" states, avoid "unknown"
that people were seeing for running processes. This was due to
a misunderstanding of the more complex state machine / inhibitors
behavior of KSE.
- Do perform ttyinfo() printing on KSE (P_SA) processes, it seems
generally to work.
While I initially attempted to formulate this as two commits (one
layout, the other content), I concluded that the layout changes were
really structural changes.
Many elements submitted by: bde
instead, just dec/inc in the ctor/dtor. For now, increment/decrement
in two's, since we're now performing the operation once per pair,
not once per pipe. Not really any measurable performance change
in my micro-benchmarks, but doing less work is good, especially when
it comes to atomic operations.
Suggested by: alc
changes to jointly allocated pipe pairs. Replace these checks
with pipe_present checks. This avoids a NULL pointer dereference
when a pipe is half-closed.
Submitted by: Peter Edwards <peter.edwards@openet-telecom.com>
1. Root from inside a jail was able to unmount any file system
(except /).
2. Unprivileged root was able to unmount file systems mounted by
privileged root (execpt /).
3. User from inside a jail was able to mount file system when
sysctl vfs.usermount was set to 1.
4. User was able to mount file system when vfs.usermount was set to 1
(that's ok) and unmount it even if vfs.usermount was equal to 0
(that's not correct).
Possibility from point 1 was reported by: Dariusz Kowalski <darek@76.pl>
Only a part of this fix will be MFC'ed (if approved).
PR: kern/60149
Reviewed by: rwatson
Approved by: scottl (mentor)
MFC after: 3 days
packet along with data, instead of in their own packet. When serving files
of size (packetsize - headersize) or smaller, this will result in one less
packet crossing the network. Quick testing with thttpd and http_load has
shown a noticeable performance improvement in this case (350 vs 330 fetches
per second.)
Included in this commit are two support routines, iov_to_uio, and m_uiotombuf;
these routines are used by sendfile to construct the header mbuf chain that
will be linked to the rest of the data in the socket buffer.
sense with sched_4bsd as it does with sched_ule.
- Use P_NOLOAD instead of the absence of td->td_ithd to determine whether or
not a thread should be accounted for in sched_tdcnt.
would allocate two 'struct pipe's from the pipe zone, and malloc a
mutex.
- Create a new "struct pipepair" object holding the two 'struct
pipe' instances, struct mutex, and struct label reference. Pipe
structures now have a back-pointer to the pipe pair, and a
'pipe_present' flag to indicate whether the half has been
closed.
- Perform mutex init/destroy in zone init/destroy, avoiding
reallocating the mutex for each pipe. Perform most pipe structure
setup in zone constructor.
- VM memory mappings for pageable buffers are still done outside of
the UMA zone.
- Change MAC API to speak 'struct pipepair' instead of 'struct pipe',
update many policies. MAC labels are also handled outside of the
UMA zone for now. Label-only policy modules don't have to be
recompiled, but if a module is recompiled, its pipe entry points
will need to be updated. If a module actually reached into the
pipe structures (unlikely), that would also need to be modified.
These changes substantially simplify failure handling in the pipe
code as there are many fewer possible failure modes.
On half-close, pipes no longer free the 'struct pipe' for the closed
half until a full-close takes place. However, VM mapped buffers
are still released on half-close.
Some code refactoring is now possible to clean up some of the back
references, etc; this patch attempts not to change the structure
of most of the pipe implementation, only allocation/free code
paths, so as to avoid introducing bugs (hopefully).
This cuts about 8%-9% off the cost of sequential pipe allocation
and free in system call tests on UP and SMP in my micro-benchmarks.
May or may not make a difference in macro-benchmarks, but doing
less work is good.
Reviewed by: juli, tjr
Testing help: dwhite, fenestro, scottl, et al
track the load for the sched_load() function. In the SMP case this member
is not defined because it would be redundant with the ksg_load member
which already tracks the non ithd load.
- For sched_load() in the UP case simply return ksq_sysload. In the SMP
case traverse the list of kseq groups and sum up their ksg_load fields.
of sched_load(). This variable tracks the number of running and runnable
non ithd threads. This removes the need to traverse the proc table and
discover how many threads are runnable.
at packet arrival.
For benchmarking purposes SO_BINTIME is preferable to SO_TIMEVAL
since it has higher resolution and lower overhead. Simultaneous
use of the two options is possible and they will return consistent
timestamps.
This introduces an extra test and a function call for SO_TIMEVAL, but I have
not been able to measure that.
profiling buffers and hash table. This makes it a lot easier to
do multiple profiling runs without rebooting or performing
gratuitous arithmetic. Sysctl is named debug.mutex.prof.reset.
Reviewed by: jake
- witness_lock() is split into two pieces: witness_checkorder() and
witness_lock(). Witness_checkorder() determines if acquiring a specified
lock at the time it is called would result in a lock order. It
optionally adds a new lock order relationship as well. witness_lock()
updates witness's data structures to assume that a lock has been acquired
by stick a new lock instance in the appropriate lock instance list.
- The mutex and sx lock functions now call checkorder() prior to trying to
acquire a lock and continue to call witness_lock() after the acquire is
completed. This will let witness catch a deadlock before it happens
rather than trying to do so after the threads have deadlocked (i.e. never
actually report it).
- A new function witness_defineorder() has been added that adds a lock
order between two locks at runtime without having to acquire the locks.
If the lock order cannot be added it will return an error. This function
is available to programmers via the WITNESS_DEFINEORDER() macro which
accepts either two mutexes or two sx locks as its arguments.
- A few simple wrapper macros were added to allow developers to call
witness_checkorder() anywhere as a way of enforcing locking assertions
in code that might acquire a certain lock in some situations. The
macros are: witness_check_{mutex,shared_sx,exclusive_sx} and take an
appropriate lock as the sole argument.
- The code to remove a lock instance from a lock list in witness_unlock()
was unnested by using a goto to vastly improve the readability of this
function.
assure backward compatibility (conditional on !BURN_BRIDGES), look it up
by its old name first, and log a warning (but accept the setting) if it
was found. If both the old and new name are defined, the new name takes
precedence.
Also export vm.kmem_size as a read-only sysctl variable; I find it hard to
tune a parameter when I don't know its default value, especially when that
default value is computed at boot time.
SW_INVOL. Assert that one of these is set in mi_switch() and propery
adjust the rusage statistics. This is to simplify the large number of
users of this interface which were previously all required to adjust the
proper counter prior to calling mi_switch(). This also facilitates more
switch and locking optimizations.
- Change all callers of mi_switch() to pass the appropriate paramter and
remove direct references to the process statistics.
mutex profiling code. As with existing mutex profiling, measurement
is done with respect to mtx_lock() instances in the code, as opposed
to specific mutexes. In particular, measure two things:
(1) Lock contention. How often did this mtx_lock() call get made and
have to sleep (or almost sleep) waiting for the lock. This helps
identify the "victims" of contention.
(2) Hold contention. How often, while the lock was held by a thread
as a result of this mtx_lock(), did another thread try to acquire
the same mutex. This helps identify the causes of contention.
I'm currently exploring adding measurement of "time waited for the
lock", but the current implementation has proven useful to me so far
so I figured I'd commit it so others could try it out. Note that this
increases the size of mutexes when MUTEX_PROFILING is enabled, so you
might find you need to further bump UMA_BOOT_PAGES. Fixes welcome.
The once over: des, others
one which runs the actual update. This fixes a bug where there were
a delay in applying the frequency adjustment. In extreme cases this
could result in marginal stability of the kernel-pll.
The uidinfo code appears to be MPSAFE, and is referenced without Giant
elsewhere. While this grab of Giant was only made in fairly rare
circumstances (actually GC'ing on refcount==0), grabbing Giant here
potentially introduces lock order issues with any locks held by the
caller. So this probably won't help performance much unless you change
credentials a lot in an application, and leave a lot of file descriptors
and cached credentials around. However, it simplifies locking down
consumers of the credential interfaces.
Bumped into by: sam
Appeased: tjr
to a new prison_complete() task run by a task queue. This removes
a requirement for grabbing Giant in crfree(). Embed the 'struct task'
in 'struct prison' so that we don't have to allocate memory from
prison_free() (which means we also defer the FREE()).
With this change, I believe grabbing Giant from crfree() can now be
removed, but need to check the uidinfo code paths.
To avoid header pollution, move the definition of 'struct task'
to _task.h, and recursively include from taskqueue.h and jail.h; much
preferably to all files including jail.h picking up a requirement to
include taskqueue.h.
Bumped into by: sam
Reviewed by: bde, tjr
In case no real/physical IEEE 802 address is available, both the expired
"draft-leach-uuids-guids-01" (section "4. Node IDs when no IEEE 802
network card is available") and RFC 2518 (section "6.4.1 Node Field
Generation Without the IEEE 802 Address") recommend (quoted from RFC
2518):
"The ideal solution is to obtain a 47 bit cryptographic quality random
number, and use it as the low 47 bits of the node ID, with the _most_
significant bit of the first octet of the node ID set to 1. This bit
is the unicast/multicast bit, which will never be set in IEEE 802
addresses obtained from network cards; hence, there can never be a
conflict between UUIDs generated by machines with and without network
cards."
Unfortunately, this incorrectly explains how to implement this and
the FreeBSD UUID generator code inherited this generation bug from
the broken reference code in the standards draft. They should instead
specify the "_least_ significant bit of the first octet of the node ID"
as the multicast bit in a memory and hexadecimal string representation
of a 48-bit IEEE 802 MAC address.
This standards bug arised from a false interpretation, as the multicast
bit is actually the _most_ significant bit in IEEE 802.3 (Ethernet)
_transmission order_ of an IEEE 802 MAC address. The standards authors
forgot that the bitwise order of an _octet_ from a MAC address _memory_
and hexadecimal string representation is still always from left (MSB,
bit 7) to right (LSB, bit 0).
Fortunately, this UUID generation bug could have occurred on systems
without any Ethernet NICs only.
Presumably, at some point, you had to include jail.h if you included
proc.h, but that is no longer required.
Result of: self injury involving adding something to struct prison