activation (i.e., applications are using libpthread). This is because
SCHED_ULE sometimes puts P_SA processes into ksq_next unnecessarily.
Which doesn't give fair amount of CPU time to processes which are
using scheduler-activation-based threads when other (semi-)CPU-intensive,
non-P_SA processes are running.
Further work will no doubt be done by jeffr at a later date.
Submitted by: Taku YAMAMOTO <taku@cent.saitama-u.ac.jp>
Reviewed by: rwatson, freebsd-current@
entry size and the ELF version. Also, avoid a potential integer
overflow when determining whether the ELF header fits entirely
within the first page.
Reviewed by: jdp
A panic when attempting to execute an ELF binary with a bogus program
header table entry size was
Reported by: Christer Öberg <christer.oberg@texonet.com>
use sf_buf_free() instead of sf_buf_mext() to consolidate all actions
that require the page queues lock in one critical section. While I'm
here remove unnecessary splvm() and splx() calls.
options, status pointer and rusage pointer as arguments. It is up to
the caller to copyout the status and rusage to userland if needed. This
lets us axe the 'compat' argument and hide all that functionality in
owait(), by the way. This also cleans up some locking in kern_wait()
since it no longer has to drop locks around copyout() since all the
copyout()'s are deferred.
- Convert owait(), wait4(), and the various ABI compat wait() syscalls to
use kern_wait() rather than wait1() or wait4(). This removes a bit
more stackgap usage.
Tested on: i386
Compiled on: i386, alpha, amd64
Without this fix it is possible to cheat policies like:
- sysctl security.bsd.see_other_[gu]ids=0,
- mac_seeotheruids(4),
- jail(2)
and get full processes list with their arguments.
This problem exists from revision 1.62 of kern_proc.c when it was
introduced.
Reviewed by: nectar, rwatson.
dependent function by the same name and a machine-independent function,
sf_buf_mext(). Aside from the virtue of making more of the code machine-
independent, this change also makes the interface more logical. Before,
sf_buf_free() did more than simply undo an sf_buf_alloc(); it also
unwired and if necessary freed the page. That is now the purpose of
sf_buf_mext(). Thus, sf_buf_alloc() and sf_buf_free() can now be used
as a general-purpose emphemeral map cache.
might be enqueued on a sleep queue but not be asleep when the timeout fires
if it is blocked on a lock trying to check for pending signals before going
to sleep. In the case of fixing up the TDF_TIMEOUT race, however, the
thread must be marked asleep.
Reported by: kan (the bogus one)
Use kern_open() to implement creat() rather than taking the long route
through open(). Mark creat as MPSAFE.
While I'm at it, mark nosys() (syscall 0) as MPSAFE, for all the
difference it will make.
set it to avoid the need for a bunch of code that tests whether or
not the lock member is set to REQ_WIRED in order to determine which
length member should be used.
Fix another bug in the oldlen return value code.
Fix a potential wired memory leak if a sysctl handler uses
sysctl_wire_old_buffer() and returns an EAGAIN error to trigger
a retry.
If vslock() returns ENOMEM, sysctl_wire_old_buffer() should set
wiredlen to zero and return zero (success) so that the handler will
operate according to sysctl(3):
The size of the buffer is given by the location specified by
oldlenp before the call, and that location gives the amount
of data copied after a successful call and after a call that
returns with the error code ENOMEM.
The handler will return an ENOMEM error because the zero length
buffer will overflow.
ptrace_set_pc(), and cpu_ptrace() so that those functions are free to
acquire Giant, sleep, etc. We already do a PHOLD/PRELE around them so
that it is safe to sleep inside of these routines if necessary. This
allows ptrace() to be marked MP safe again as it no longer triggers lock
order reversals on Alpha.
Tested by: wilko
in the two consumers that need it.. processes using AIO and netncp.
Update docs. Say that process_exec is called with Giant, but not to
depend on it. All our consumers can handle it without Giant.
- no longer serialize on Giant for thread_single*() and family in fork,
exit and exec
- thread_wait() is mpsafe, assert no Giant
- reduce scope of Giant in exit to not cover thread_wait and just do
vm_waitproc().
- assert that thread_single() family are not called with Giant
- remove the DROP/PICKUP_GIANT macros from thread_single() family
- assert that thread_suspend_check() s not called with Giant
- remove manual drop_giant hack in thread_suspend_check since we know it
isn't held.
- remove the DROP/PICKUP_GIANT macros from thread_suspend_check() family
- mark kse_create() mpsafe
we always grab Giant, even if we're actually only polling objects that
don't require giant. Once socket locking is merged, there will be
strong motivation to fix this.
the last chunk are misaligned relative to a MAXBSIZE byte boundary.
vn_rdwr_inchunks() is used mainly for elf core dumps, and elf sections
are usually perfectly misaligned relative to MAXBSIZE, and chunking
prevents the file system from doing much realigning.
This gives a surprisingly large speedup for core dumps -- from 50 to
13 seconds for a 512MB core dump here. The pessimization was mostly
from an interaction of the misalignment with IO_DIRECT. It increased
the number of i/o's for each chunk by a factor of 5 (3 writes and 2
read-before-writes instead of 1 write).
to build the kernel. It doesn't affect the operation if gcc.
Most of the changes are just adding __INTEL_COMPILER to #ifdef's, as
icc v8 may define __GNUC__ some parts may look strange but are
necessary.
Additional changes:
- in_cksum.[ch]:
* use a generic C version instead of the assembly version in the !gcc
case (ASM code breaks with the optimizations icc does)
-> no bad checksums with an icc compiled kernel
Help from: andre, grehan, das
Stolen from: alpha version via ppc version
The entire checksum code should IMHO be replaced with the DragonFly
version (because it isn't guaranteed future revisions of gcc will
include similar optimizations) as in:
---snip---
Revision Changes Path
1.12 +1 -0 src/sys/conf/files.i386
1.4 +142 -558 src/sys/i386/i386/in_cksum.c
1.5 +33 -69 src/sys/i386/include/in_cksum.h
1.5 +2 -0 src/sys/netinet/igmp.c
1.6 +0 -1 src/sys/netinet/in.h
1.6 +2 -0 src/sys/netinet/ip_icmp.c
1.4 +3 -4 src/contrib/ipfilter/ip_compat.h
1.3 +1 -2 src/sbin/natd/icmp.c
1.4 +0 -1 src/sbin/natd/natd.c
1.48 +1 -0 src/sys/conf/files
1.2 +0 -1 src/sys/conf/files.amd64
1.13 +0 -1 src/sys/conf/files.i386
1.5 +0 -1 src/sys/conf/files.pc98
1.7 +1 -1 src/sys/contrib/ipfilter/netinet/fil.c
1.10 +2 -3 src/sys/contrib/ipfilter/netinet/ip_compat.h
1.10 +1 -1 src/sys/contrib/ipfilter/netinet/ip_fil.c
1.7 +1 -1 src/sys/dev/netif/txp/if_txp.c
1.7 +1 -1 src/sys/net/ip_mroute/ip_mroute.c
1.7 +1 -2 src/sys/net/ipfw/ip_fw2.c
1.6 +1 -2 src/sys/netinet/igmp.c
1.4 +158 -116 src/sys/netinet/in_cksum.c
1.6 +1 -1 src/sys/netinet/ip_gre.c
1.7 +1 -2 src/sys/netinet/ip_icmp.c
1.10 +1 -1 src/sys/netinet/ip_input.c
1.10 +1 -2 src/sys/netinet/ip_output.c
1.13 +1 -2 src/sys/netinet/tcp_input.c
1.9 +1 -2 src/sys/netinet/tcp_output.c
1.10 +1 -1 src/sys/netinet/tcp_subr.c
1.10 +1 -1 src/sys/netinet/tcp_syncache.c
1.9 +1 -2 src/sys/netinet/udp_usrreq.c
1.5 +1 -2 src/sys/netinet6/ipsec.c
1.5 +1 -2 src/sys/netproto/ipsec/ipsec.c
1.5 +1 -1 src/sys/netproto/ipsec/ipsec_input.c
1.4 +1 -2 src/sys/netproto/ipsec/ipsec_output.c
and finally remove
sys/i386/i386 in_cksum.c
sys/i386/include in_cksum.h
---snip---
- endian.h:
* DTRT in C++ mode
- quad.h:
* we don't use gcc v1 anymore, remove support for it
Suggested by: bde (long ago)
- assym.h:
* avoid zero-length arrays (remove dependency on a gcc specific
feature)
This change changes the contents of the object file, but as it's
only used to generate some values for a header, and the generator
knows how to handle this, there's no impact in the gcc case.
Explained by: bde
Submitted by: Marius Strobl <marius@alchemy.franken.de>
- aicasm.c:
* minor change to teach it about the way icc spells "-nostdinc"
Not approved by: gibbs (no reply to my mail)
- bump __FreeBSD_version (lang/icc needs to know about the changes)
Incarnations of this patch survive gcc compiles since a loooong time,
I use it on my desktop. An icc compiled kernel works since Nov. 2003
(exceptions: snd_* if used as modules), it survives a build of the
entire ports collection with icc.
Parts of this commit contains suggestions or submissions from
Marius Strobl <marius@alchemy.franken.de>.
Reviewed by: -arch
Submitted by: netchild
Conforming POSIX application should do by disallowing the argv
argument to be NULL.
PR: kern/33738
Submitted by: Marc Olzheim, Serge van den Boom
OK'ed by: nectar
silences an annoying warning in getblk() when VMIO'ing on a directory
vnode, which can happen when vfs.vmiodirenable is 1.
Bring the warning message in line with reality at the same time.
Submitted by: hmp
were a rather overwhelming task. I soon learned that if you don't know
where you're going to store something, at least try to pile it next to
something slightly related in the hope that a pattern emerges.
Apply the same principle to the ffs/snapshot/softupdates code which have
leaked into specfs: Add yet a buf-quasi-method and call it from the
only two places I can see it can make a difference and implement the
magic in ffs_softdep.c where it belongs.
It's not pretty, but at least it's one less layer violated.
- security.bsd.hardlink_check_uid, when set, means, that unprivileged
users are not permitted to create hard links to files not
owned by them,
- security.bsd.hardlink_check_gid, when set, means, that unprivileged
users are not permitted to create hard links to files owned
by group they don't belong to.
OK'ed by: rwatson
that we (p1) are currently running, we hold a reference on p_textvp which
means the vnode cannot go away. p2 cannot run yet (and hence cannot exit)
so this should be safe to do at this point. As a bonus, it removes a
block of under-Giant code that was there to support the vref.
- Push Giant down a bit in coredump() and call coredump() with the proc
lock already held rather than unlocking it only to turn around and
relock it.
Requested by: peter
process group and session dereferences. Also, check that p_pgrp and
p_sesssion are NULL before dereferencing them.
- Push down Giant in fork1().
Requested by: peter
introduction of kern_mlock() and kern_munlock() in
src/sys/kern/kern_sysctl.c 1.150
src/sys/vm/vm_extern.h 1.69
src/sys/vm/vm_glue.c 1.190
src/sys/vm/vm_mmap.c 1.179
because different resource limits are appropriate for transient and
"permanent" page wiring requests.
Retain the kern_mlock() and kern_munlock() API in the revived
vslock() and vsunlock() functions.
Combine the best parts of each of the original sets of implementations
with further code cleanup. Make the mclock() and vslock()
implementations as similar as possible.
Retain the RLIMIT_MEMLOCK check in mlock(). Move the most strigent
test, which can return EAGAIN, last so that requests that have no
hope of ever being satisfied will not be retried unnecessarily.
Disable the test that can return EAGAIN in the vslock() implementation
because it will cause the sysctl code to wedge.
Tested by: Cy Schubert <Cy.Schubert AT komquats.com>
DIAGNOSTIC instead of INVARIANTS. INVARIANTS is intended for tests
that don't substantially change code flow or behavior (passive), but
this test required locking both the proc lock and scheduler lock
in order to execute. It also appears to be a very advisory diagnostic
as opposed to an invariant violation.
Following discussion with: bde
race in between sleepq_add() and sleepq_catch_signals() in that setting
td_wchan and TDF_SINTR is not atomic to sched_lock but only to the sleepq
lock. This band-aid will stop assertion failures, but there is perhaps a
larger problem with the sleepq_add/sleepq_catch_signals race that I am not
sure how to solve. For the signals case the race is harmless because we
always call cursig() after setting TDF_SINTR. However, KSE doesn't do
anything in sleepq_catch_signals() to check that this race was lost, so I
am unsure if this race is harmful for this specific abort.
functions in kern_socket.c.
Rename the "canwait" field to "mflags" and pass M_WAITOK and M_NOWAIT
in from the caller context rather than "1" or "0".
Correct mflags pass into mac_init_socket() from previous commit to not
include M_ZERO.
Submitted by: sam
than a "waitok" argument. Callers now passing M_WAITOK or M_NOWAIT
rather than 0 or 1. This simplifies the soalloc() logic, and also
makes the waiting behavior of soalloc() more clear in the calling
context.
Submitted by: sam
generic watchdoc(9) interface.
Make watchdogd(8) perform as watchdog(8) as well, and make it
possible to specify a check command to run, timeout and sleep
periods.
Update watchdog(4) to talk about the generic interface and add
new watchdog(8) page.
sleep queue interface:
- Sleep queues attempt to merge some of the benefits of both sleep queues
and condition variables. Having sleep qeueus in a hash table avoids
having to allocate a queue head for each wait channel. Thus, struct cv
has shrunk down to just a single char * pointer now. However, the
hash table does not hold threads directly, but queue heads. This means
that once you have located a queue in the hash bucket, you no longer have
to walk the rest of the hash chain looking for threads. Instead, you have
a list of all the threads sleeping on that wait channel.
- Outside of the sleepq code and the sleep/cv code the kernel no longer
differentiates between cv's and sleep/wakeup. For example, calls to
abortsleep() and cv_abort() are replaced with a call to sleepq_abort().
Thus, the TDF_CVWAITQ flag is removed. Also, calls to unsleep() and
cv_waitq_remove() have been replaced with calls to sleepq_remove().
- The sched_sleep() function no longer accepts a priority argument as
sleep's no longer inherently bump the priority. Instead, this is soley
a propery of msleep() which explicitly calls sched_prio() before
blocking.
- The TDF_ONSLEEPQ flag has been dropped as it was never used. The
associated TDF_SET_ONSLEEPQ and TDF_CLR_ON_SLEEPQ macros have also been
dropped and replaced with a single explicit clearing of td_wchan.
TD_SET_ONSLEEPQ() would really have only made sense if it had taken
the wait channel and message as arguments anyway. Now that that only
happens in one place, a macro would be overkill.
the process state to zombie when a process exits to avoid a lock order
reversal with the sleepqueue locks. This appears to be the only place
that we call wakeup() with sched_lock held.
to queue threads sleeping on a wait channel similar to how turnstiles are
used to queue threads waiting for a lock. This subsystem will be used as
the backend for sleep/wakeup and condition variables initially. Eventually
it will also be used to replace the ithread-specific iwait thread
inhibitor.
Sleep queues are also not locked by sched_lock, so this splits sched_lock
up a bit further increasing concurrency within the scheduler. Sleep queues
also natively support timeouts on sleeps and interruptible sleeps allowing
for the reduction of a lot of duplicated code between the sleep/wakeup and
condition variable implementations. For more details on the sleep queue
implementation, check the comments in sys/sleepqueue.h and
kern/subr_sleepqueue.c.
the syscall arguments and does the suser() permission check, and
kern_mlock(), which does the resource limit checking and calls
vm_map_wire(). Split munlock() in a similar way.
Enable the RLIMIT_MEMLOCK checking code in kern_mlock().
Replace calls to vslock() and vsunlock() in the sysctl code with
calls to kern_mlock() and kern_munlock() so that the sysctl code
will obey the wired memory limits.
Nuke the vslock() and vsunlock() implementations, which are no
longer used.
Add a member to struct sysctl_req to track the amount of memory
that is wired to handle the request.
Modify sysctl_wire_old_buffer() to return an error if its call to
kern_mlock() fails. Only wire the minimum of the length specified
in the sysctl request and the length specified in its argument list.
It is recommended that sysctl handlers that use sysctl_wire_old_buffer()
should specify reasonable estimates for the amount of data they
want to return so that only the minimum amount of memory is wired
no matter what length has been specified by the request.
Modify the callers of sysctl_wire_old_buffer() to look for the
error return.
Modify sysctl_old_user to obey the wired buffer length and clean up
its implementation.
Reviewed by: bms
of all, PIPE_EOF is not checked pervasively after everything that can drop
the pipe mutex and msleep(), so fix. Additionally, though it might not
harm anything, pipelock() and pipeunlock() are not used consistently.
Third, the kqueue support functions do not use the pipe mutex correctly.
Last, but absolutely not least, is a race: if pipe_busy is not set on
the closing side of the pipe, the other side that is trying to write to
that will crash BECAUSE PIPE_EOF IS NOT SET! Unconditionally set
PIPE_EOF, and get rid of all the lockups/crashes I have seen trying
to build ports.
Now I believe it is done in the right way.
Removed some XXMAC cases, we now assume 'high' integrity level for all
sysctls, except those with CTLFLAG_ANYBODY flag set. No more magic.
Reviewed by: rwatson
Approved by: rwatson, scottl (mentor)
Tested with: LINT (compilation), mac_biba(4) (functionality)
to use the "year1-year3" format, as opposed to "year1, year2, year3".
This seems to make lawyers more happy, but also prevents the
lines from getting excessively long as the years start to add up.
Suggested by: imp
This is what we came here for: Hang dev_t's from their cdevsw,
refcount cdevsw and dev_t and generally keep track of things a lot
better than we used to:
Hold a cdevsw reference around all entrances into the device driver,
this will be necessary to safely determine when we can unload driver
code.
Hold a dev_t reference while the device is open.
KASSERT that we do not enter the driver on a non-referenced dev_t.
Remove old D_NAG code, anonymous dev_t's are not a problem now.
When destroy_dev() is called on a referenced dev_t, move it to
dead_cdevsw's list. When the refcount drops, free it.
Check that cdevsw->d_version is correct. If not, set all methods
to the dead_*() methods to prevent entrance into driver. Print
warning on console to this effect. The device driver may still
explode if it is also incompatible with newbus, but in that case
we probably didn't get this far in the first place.
Remove the unused second argument from udev2dev().
Convert all remaining users of makedev() to use udev2dev(). The
semantic difference is that udev2dev() will only locate a pre-existing
dev_t, it will not line makedev() create a new one.
Apart from the tiny well controlled windown in D_PSEUDO drivers,
there should no longer be any "anonymous" dev_t's in the system
now, only dev_t's created with make_dev() and make_dev_alias()
Introduce d_version field in struct cdevsw, this must always be
initialized to D_VERSION.
Flip sense of D_NOGIANT flag to D_NEEDGIANT, this involves removing
four D_NOGIANT flags and adding 145 D_NEEDGIANT flags.
Add missing D_TTY flags to various drivers.
Complete asserts that dev_t's passed to ttyread(), ttywrite(),
ttypoll() and ttykqwrite() have (d_flags & D_TTY) and a struct tty
pointer.
Make ttyread(), ttywrite(), ttypoll() and ttykqwrite() the default
cdevsw methods for D_TTY drivers and remove the explicit initializations
in various drivers cdevsw structures.
This commit adds a couple of functions for pseudodrivers to use for
implementing cloning in a manner we will be able to lock down (shortly).
Basically what happens is that pseudo drivers get a way to ask for
"give me the dev_t with this unit number" or alternatively "give
me a dev_t with the lowest guaranteed free unit number" (there is
unfortunately a lot of non-POLA in the exact numeric value of this
number, just live with it for now)
Managing the unit number space this way removes the need to use
rman(9) to do so in the drivers this greatly simplifies the code in
the drivers because even using rman(9) they still needed to manage
their dev_t's anyway.
I have taken the if_tun, if_tap, snp and nmdm drivers through the
mill, partly because they (ab)used makedev(), but mostly because
together they represent three different problems for device-cloning:
if_tun and snp is the plain case: just give me a device.
if_tap has two kinds of devices, with a flag for device type.
nmdm has paired devices (ala pty) can you can clone either of them.
Instead of creating a mutex that we msleep on but don't actually lock when
doing the corresponding wakeup(), in the kthread, lock the mutex associated
with our taskqueue and msleep while the queue is empty. Assert that the
queue is locked when the callback function is called to wake the kthread.
It returns 1 is process is inside of jail and 0 if it is not.
Information if we are in jail or not is not a secret, there is plenty of
ways to discover it. Many people are using own hack to check this and
this will be a legal way from now on.
It will be great if our starting scripts will take advantage of this sysctl
to allow clean "boot" inside jail.
Approved by: rwatson, scottl (mentor)
kernel. I'm not happy with it yet - refinements are to come.
This hack allows the kern.ps_strings and kern.usrstack sysctls to respond
to a 32 bit request, such as those coming from emulated i386 binaries.
missing parentheses). Use default handling (trap to debugger) for
udev2dev(x, 1) since it is an error and doesn't happen anywhere in
the sys tree except in bogusly commented out code in coda.
and given a value, but never used. This has no effect on the
resulting binaries, since gcc optimizes the variable away anyway.
PR: kern/62684
Approved by: rwatson (mentor)
data for the file system on which the jail's root vnode is located.
Previous behavior (show data for all mountpoints) can be restored
by setting security.jail.getfsstatroot_only to 0. Note: this also
has the effect of hiding other mounts inside a jail, such as /dev,
/tmp, and /proc, but errs on the side of leaking less information.
called until DEVFS had a chance to initialize. Since DEVFS is mandatory
and things over in that department coincidentally works from without
any initialization now, this is safe.
set to SIGCHLD. This avoids the creation of orphaned Linux-threaded
zombies that init is unable to reap. This can occur when the parent
process sets its SIGCHLD to SIG_IGN. Fix a similar situation in the
PT_DETACH code.
Tested by: "Steven Hartland" <killing AT multiplay.co.uk>
mindful of blocking on disk I/O and instead return EBUSY when such
blocking would occur.
Results from the DeBox project indicate that blocking on disk I/O
can slow the performance of a kqueue/poll based webserver. Using
a flag such as SF_NODISKIO and throwing connections that would block
to helper processes/threads helped increase performance.
Currently, only the Flash webserver uses this flag, although it could
probably be applied to thttpd with relative ease.
Idea by: Yaoping Ruan & Vivek Pai
RLIM_INFINITY case for ogetrlimit().
- Use %jd and intmax_t to output negative time in usec in calcru().
- Rework getrusage() to make a copy of the rusage struct into a local
variable while holding Giant and then do the copyout from the local
variable to avoid having to have the original process rusage struct
locked while doing the copyout (which would not be safe). This also
includes a few style fixes from Bruce to getrusage().
Submitted by: bde (1, parts of 3)
Suggested by: bde (2)
failed, the reference count for the virtual memory object referenced
by the specified shared memory segment would have been erroneously
incremented.
Reported by: Joost Pol <joost@pine.nl>
- struct plimit includes a mutex to protect a reference count. The plimit
structure is treated similarly to struct ucred in that is is always copy
on write, so having a reference to a structure is sufficient to read from
it without needing a further lock.
- The proc lock protects the p_limit pointer and must be held while reading
limits from a process to keep the limit structure from changing out from
under you while reading from it.
- Various global limits that are ints are not protected by a lock since
int writes are atomic on all the archs we support and thus a lock
wouldn't buy us anything.
- All accesses to individual resource limits from a process are abstracted
behind a simple lim_rlimit(), lim_max(), and lim_cur() API that return
either an rlimit, or the current or max individual limit of the specified
resource from a process.
- dosetrlimit() was renamed to kern_setrlimit() to match existing style of
other similar syscall helper functions.
- The alpha OSF/1 compat layer no longer calls getrlimit() and setrlimit()
(it didn't used the stackgap when it should have) but uses lim_rlimit()
and kern_setrlimit() instead.
- The svr4 compat no longer uses the stackgap for resource limits calls,
but uses lim_rlimit() and kern_setrlimit() instead.
- The ibcs2 compat no longer uses the stackgap for resource limits. It
also no longer uses the stackgap for accessing sysctl's for the
ibcs2_sysconf() syscall but uses kernel_sysctl() instead. As a result,
ibcs2_sysconf() no longer needs Giant.
- The p_rlimit macro no longer exists.
Submitted by: mtm (mostly, I only did a few cleanups and catchups)
Tested on: i386
Compiled on: alpha, amd64
- Rename temporary variable names ("tmp", "tmp2") to more informative
names ("load", "pctcpu", "rss", ...)
- Unclutter indentation and return paths: rather than lots of nested
ifs, simply return earlier if it's not going to work out. Simplify
general structure and avoid "deep" code.
- Comment on the thread/process selection and locking.
- Correct handling of "running"/"runnable" states, avoid "unknown"
that people were seeing for running processes. This was due to
a misunderstanding of the more complex state machine / inhibitors
behavior of KSE.
- Do perform ttyinfo() printing on KSE (P_SA) processes, it seems
generally to work.
While I initially attempted to formulate this as two commits (one
layout, the other content), I concluded that the layout changes were
really structural changes.
Many elements submitted by: bde
instead, just dec/inc in the ctor/dtor. For now, increment/decrement
in two's, since we're now performing the operation once per pair,
not once per pipe. Not really any measurable performance change
in my micro-benchmarks, but doing less work is good, especially when
it comes to atomic operations.
Suggested by: alc
changes to jointly allocated pipe pairs. Replace these checks
with pipe_present checks. This avoids a NULL pointer dereference
when a pipe is half-closed.
Submitted by: Peter Edwards <peter.edwards@openet-telecom.com>
1. Root from inside a jail was able to unmount any file system
(except /).
2. Unprivileged root was able to unmount file systems mounted by
privileged root (execpt /).
3. User from inside a jail was able to mount file system when
sysctl vfs.usermount was set to 1.
4. User was able to mount file system when vfs.usermount was set to 1
(that's ok) and unmount it even if vfs.usermount was equal to 0
(that's not correct).
Possibility from point 1 was reported by: Dariusz Kowalski <darek@76.pl>
Only a part of this fix will be MFC'ed (if approved).
PR: kern/60149
Reviewed by: rwatson
Approved by: scottl (mentor)
MFC after: 3 days
packet along with data, instead of in their own packet. When serving files
of size (packetsize - headersize) or smaller, this will result in one less
packet crossing the network. Quick testing with thttpd and http_load has
shown a noticeable performance improvement in this case (350 vs 330 fetches
per second.)
Included in this commit are two support routines, iov_to_uio, and m_uiotombuf;
these routines are used by sendfile to construct the header mbuf chain that
will be linked to the rest of the data in the socket buffer.
sense with sched_4bsd as it does with sched_ule.
- Use P_NOLOAD instead of the absence of td->td_ithd to determine whether or
not a thread should be accounted for in sched_tdcnt.
would allocate two 'struct pipe's from the pipe zone, and malloc a
mutex.
- Create a new "struct pipepair" object holding the two 'struct
pipe' instances, struct mutex, and struct label reference. Pipe
structures now have a back-pointer to the pipe pair, and a
'pipe_present' flag to indicate whether the half has been
closed.
- Perform mutex init/destroy in zone init/destroy, avoiding
reallocating the mutex for each pipe. Perform most pipe structure
setup in zone constructor.
- VM memory mappings for pageable buffers are still done outside of
the UMA zone.
- Change MAC API to speak 'struct pipepair' instead of 'struct pipe',
update many policies. MAC labels are also handled outside of the
UMA zone for now. Label-only policy modules don't have to be
recompiled, but if a module is recompiled, its pipe entry points
will need to be updated. If a module actually reached into the
pipe structures (unlikely), that would also need to be modified.
These changes substantially simplify failure handling in the pipe
code as there are many fewer possible failure modes.
On half-close, pipes no longer free the 'struct pipe' for the closed
half until a full-close takes place. However, VM mapped buffers
are still released on half-close.
Some code refactoring is now possible to clean up some of the back
references, etc; this patch attempts not to change the structure
of most of the pipe implementation, only allocation/free code
paths, so as to avoid introducing bugs (hopefully).
This cuts about 8%-9% off the cost of sequential pipe allocation
and free in system call tests on UP and SMP in my micro-benchmarks.
May or may not make a difference in macro-benchmarks, but doing
less work is good.
Reviewed by: juli, tjr
Testing help: dwhite, fenestro, scottl, et al
track the load for the sched_load() function. In the SMP case this member
is not defined because it would be redundant with the ksg_load member
which already tracks the non ithd load.
- For sched_load() in the UP case simply return ksq_sysload. In the SMP
case traverse the list of kseq groups and sum up their ksg_load fields.
of sched_load(). This variable tracks the number of running and runnable
non ithd threads. This removes the need to traverse the proc table and
discover how many threads are runnable.
at packet arrival.
For benchmarking purposes SO_BINTIME is preferable to SO_TIMEVAL
since it has higher resolution and lower overhead. Simultaneous
use of the two options is possible and they will return consistent
timestamps.
This introduces an extra test and a function call for SO_TIMEVAL, but I have
not been able to measure that.
profiling buffers and hash table. This makes it a lot easier to
do multiple profiling runs without rebooting or performing
gratuitous arithmetic. Sysctl is named debug.mutex.prof.reset.
Reviewed by: jake
- witness_lock() is split into two pieces: witness_checkorder() and
witness_lock(). Witness_checkorder() determines if acquiring a specified
lock at the time it is called would result in a lock order. It
optionally adds a new lock order relationship as well. witness_lock()
updates witness's data structures to assume that a lock has been acquired
by stick a new lock instance in the appropriate lock instance list.
- The mutex and sx lock functions now call checkorder() prior to trying to
acquire a lock and continue to call witness_lock() after the acquire is
completed. This will let witness catch a deadlock before it happens
rather than trying to do so after the threads have deadlocked (i.e. never
actually report it).
- A new function witness_defineorder() has been added that adds a lock
order between two locks at runtime without having to acquire the locks.
If the lock order cannot be added it will return an error. This function
is available to programmers via the WITNESS_DEFINEORDER() macro which
accepts either two mutexes or two sx locks as its arguments.
- A few simple wrapper macros were added to allow developers to call
witness_checkorder() anywhere as a way of enforcing locking assertions
in code that might acquire a certain lock in some situations. The
macros are: witness_check_{mutex,shared_sx,exclusive_sx} and take an
appropriate lock as the sole argument.
- The code to remove a lock instance from a lock list in witness_unlock()
was unnested by using a goto to vastly improve the readability of this
function.
assure backward compatibility (conditional on !BURN_BRIDGES), look it up
by its old name first, and log a warning (but accept the setting) if it
was found. If both the old and new name are defined, the new name takes
precedence.
Also export vm.kmem_size as a read-only sysctl variable; I find it hard to
tune a parameter when I don't know its default value, especially when that
default value is computed at boot time.
SW_INVOL. Assert that one of these is set in mi_switch() and propery
adjust the rusage statistics. This is to simplify the large number of
users of this interface which were previously all required to adjust the
proper counter prior to calling mi_switch(). This also facilitates more
switch and locking optimizations.
- Change all callers of mi_switch() to pass the appropriate paramter and
remove direct references to the process statistics.
mutex profiling code. As with existing mutex profiling, measurement
is done with respect to mtx_lock() instances in the code, as opposed
to specific mutexes. In particular, measure two things:
(1) Lock contention. How often did this mtx_lock() call get made and
have to sleep (or almost sleep) waiting for the lock. This helps
identify the "victims" of contention.
(2) Hold contention. How often, while the lock was held by a thread
as a result of this mtx_lock(), did another thread try to acquire
the same mutex. This helps identify the causes of contention.
I'm currently exploring adding measurement of "time waited for the
lock", but the current implementation has proven useful to me so far
so I figured I'd commit it so others could try it out. Note that this
increases the size of mutexes when MUTEX_PROFILING is enabled, so you
might find you need to further bump UMA_BOOT_PAGES. Fixes welcome.
The once over: des, others
one which runs the actual update. This fixes a bug where there were
a delay in applying the frequency adjustment. In extreme cases this
could result in marginal stability of the kernel-pll.
The uidinfo code appears to be MPSAFE, and is referenced without Giant
elsewhere. While this grab of Giant was only made in fairly rare
circumstances (actually GC'ing on refcount==0), grabbing Giant here
potentially introduces lock order issues with any locks held by the
caller. So this probably won't help performance much unless you change
credentials a lot in an application, and leave a lot of file descriptors
and cached credentials around. However, it simplifies locking down
consumers of the credential interfaces.
Bumped into by: sam
Appeased: tjr
to a new prison_complete() task run by a task queue. This removes
a requirement for grabbing Giant in crfree(). Embed the 'struct task'
in 'struct prison' so that we don't have to allocate memory from
prison_free() (which means we also defer the FREE()).
With this change, I believe grabbing Giant from crfree() can now be
removed, but need to check the uidinfo code paths.
To avoid header pollution, move the definition of 'struct task'
to _task.h, and recursively include from taskqueue.h and jail.h; much
preferably to all files including jail.h picking up a requirement to
include taskqueue.h.
Bumped into by: sam
Reviewed by: bde, tjr
In case no real/physical IEEE 802 address is available, both the expired
"draft-leach-uuids-guids-01" (section "4. Node IDs when no IEEE 802
network card is available") and RFC 2518 (section "6.4.1 Node Field
Generation Without the IEEE 802 Address") recommend (quoted from RFC
2518):
"The ideal solution is to obtain a 47 bit cryptographic quality random
number, and use it as the low 47 bits of the node ID, with the _most_
significant bit of the first octet of the node ID set to 1. This bit
is the unicast/multicast bit, which will never be set in IEEE 802
addresses obtained from network cards; hence, there can never be a
conflict between UUIDs generated by machines with and without network
cards."
Unfortunately, this incorrectly explains how to implement this and
the FreeBSD UUID generator code inherited this generation bug from
the broken reference code in the standards draft. They should instead
specify the "_least_ significant bit of the first octet of the node ID"
as the multicast bit in a memory and hexadecimal string representation
of a 48-bit IEEE 802 MAC address.
This standards bug arised from a false interpretation, as the multicast
bit is actually the _most_ significant bit in IEEE 802.3 (Ethernet)
_transmission order_ of an IEEE 802 MAC address. The standards authors
forgot that the bitwise order of an _octet_ from a MAC address _memory_
and hexadecimal string representation is still always from left (MSB,
bit 7) to right (LSB, bit 0).
Fortunately, this UUID generation bug could have occurred on systems
without any Ethernet NICs only.
Presumably, at some point, you had to include jail.h if you included
proc.h, but that is no longer required.
Result of: self injury involving adding something to struct prison
in OpenBSD by Niels Provos. The patch introduces a bitmap of allocated
file descriptors which is used to locate available descriptors when a new
one is needed. It also moves the task of growing the file descriptor table
out of fdalloc(), reducing complexity in both fdalloc() and do_dup().
Debts of gratitude are owed to tjr@ (who provided the original patch on
which this work is based), grog@ (for the gdb(4) man page) and rwatson@
(for assistance with pxeboot(8)).
ithread_remove_handler() may fail to remove the interrupt handler if
it decides to let the ithread do the removal. The problem is that during
boot "cold" is set, which causes msleep() to return immediately. This
will cause ithread_remove_handler() to fail to wait for the ithread
to do the removal from the handler TAILQ before freeing the handler
back to the heap. Bad things will happen when some other user of the
TAILQ, such as ithread_add_handler() or the actual ithread attempts to use
the freed handler. Fix the problem by forcing ithread_remove_handler()
to do the actual removal itself if the "cold" flag is set.
Reviewed by: jhb