11874 Commits

Author SHA1 Message Date
mdf
b46510221b Fix brain fart when converting an if statement into a KASSERT. 2010-09-03 16:12:39 +00:00
mdf
edfefb2af9 Use math rather than iteration when the desired sbuf size is larger than
SBUF_MAXEXTENDSIZE.
2010-09-03 16:09:17 +00:00
gibbs
6833acab2d Correct bioq_disksort so that bioq_insert_tail() offers barrier semantic.
Add the BIO_ORDERED flag for struct bio and update bio clients to use it.

The barrier semantics of bioq_insert_tail() were broken in two ways:

 o In bioq_disksort(), an added bio could be inserted at the head of
   the queue, even when a barrier was present, if the sort key for
   the new entry was less than that of the last queued barrier bio.

 o The last_offset used to generate the sort key for newly queued bios
   did not stay at the position of the barrier until either the
   barrier was de-queued, or a new barrier (which updates last_offset)
   was queued.  When a barrier is in effect, we know that the disk
   will pass through the barrier position just before the
   "blocked bios" are released, so using the barrier's offset for
   last_offset is the optimal choice.

sys/geom/sched/subr_disk.c:
sys/kern/subr_disk.c:
	o Update last_offset in bioq_insert_tail().

	o Only update last_offset in bioq_remove() if the removed bio is
	  at the head of the queue (typically due to a call via
	  bioq_takefirst()) and no barrier is active.

	o In bioq_disksort(), if we have a barrier (insert_point is non-NULL),
	  set prev to the barrier and cur to it's next element.  Now that
	  last_offset is kept at the barrier position, this change isn't
	  strictly necessary, but since we have to take a decision branch
	  anyway, it does avoid one, no-op, loop iteration in the while
	  loop that immediately follows.

	o In bioq_disksort(), bypass the normal sort for bios with the
	  BIO_ORDERED attribute and instead insert them into the queue
	  with bioq_insert_tail().  bioq_insert_tail() not only gives
	  the desired command order during insertion, but also provides
	  barrier semantics so that commands disksorted in the future
	  cannot pass the just enqueued transaction.

sys/sys/bio.h:
	Add BIO_ORDERED as bit 4 of the bio_flags field in struct bio.

sys/cam/ata/ata_da.c:
sys/cam/scsi/scsi_da.c
	Use an ordered command for SCSI/ATA-NCQ commands issued in
	response to bios with the BIO_ORDERED flag set.

sys/cam/scsi/scsi_da.c
	Use an ordered tag when issuing a synchronize cache command.

	Wrap some lines to 80 columns.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
sys/geom/geom_io.c
	Mark bios with the BIO_FLUSH command as BIO_ORDERED.

Sponsored by:	Spectra Logic Corporation
MFC after:	1 month
2010-09-02 19:40:28 +00:00
mdf
200dc21dcc Fix UP build.
MFC after:	2 weeks
2010-09-02 16:23:05 +00:00
mdf
bbc3957715 Fix a bug with sched_affinity() where it checks td_pinned of another
thread in a racy manner, which can lead to attempting to migrate a
thread that is pinned to a CPU.  Instead, have sched_switch() determine
which CPU a thread should run on if the current one is not allowed.

KASSERT in sched_bind() that the thread is not yet pinned to a CPU.

KASSERT in sched_switch() that only migratable threads or those moving
due to a sched_bind() are changing CPUs.

sched_affinity code came from jhb@.

MFC after:	2 weeks
2010-09-01 20:32:47 +00:00
mlaier
2ef34e9086 rmlock(9) two additions and one change/fix:
- add rm_try_rlock().
 - add RM_SLEEPABLE to use sx(9) as the back-end lock in order to sleep while
   holding the write lock.
 - change rm_noreadtoken to a cpu bitmask to indicate which CPUs need to go
   through the lock/unlock in order to synchronize.  As a side effect, this
   also avoids IPI to CPUs without any readers during rm_wlock.

Discussed with:		ups@, rwatson@ on arch@
Sponsored by:		Isilon Systems, Inc.
2010-09-01 19:50:03 +00:00
emaste
5216dea167 As long as we are going to panic anyway, there's no need to hide additional
information behind DIAGNOSTIC.
2010-09-01 13:47:11 +00:00
davidxu
be8bfd2384 rescure comments from RELENG_4. 2010-09-01 01:26:07 +00:00
mdf
42170bf6d6 The realloc case for memguard(9) will copy too many bytes when
reallocating to a smaller-sized allocation.  Fix this issue.

Noticed by:     alc
Reviewed by:    alc
Approved by:    zml (mentor)
MFC after:      3 weeks
2010-08-31 16:57:58 +00:00
davidxu
ae90c2fd09 If a process is being debugged, skips job control caused by SIGSTOP/SIGCONT
signals, because it is managed by debugger, however a normal signal sent to
a interruptibly sleeping thread wakes up the thread so it will handle the
signal when the process leaves the stopped state.

PR:	150138
MFC after:	1 week
2010-08-31 07:15:50 +00:00
jh
c2cb836190 execve(2) has a special check for file permissions: a file must have at
least one execute bit set, otherwise execve(2) will return EACCES even
for an user with PRIV_VFS_EXEC privilege.

Add the check also to vaccess(9), vaccess_acl_nfs4(9) and
vaccess_acl_posix1e(9). This makes access(2) to better agree with
execve(2). Because ZFS doesn't use vaccess(9) for VEXEC, add the check
to zfs_freebsd_access() too. There may be other file systems which are
not using vaccess*() functions and need to be handled separately.

PR:		kern/125009
Reviewed by:	bde, trasz
Approved by:	pjd (ZFS part)
2010-08-30 16:30:18 +00:00
kib
8505815b26 Regen 2010-08-30 14:26:02 +00:00
kib
0a6b8011f4 Make the syscalls reserved for AFS usable by OpenAFS port.
Submitted by:	Benjamin Kaduk <kaduk mit edu>
MFC after:	2 weeks
2010-08-30 14:24:44 +00:00
kib
9d21e17f07 For some file types, select code registers two selfd structures. E.g.,
for socket, when specified POLLIN|POLLOUT in events, you would have one
selfd registered for receiving socket buffer, and one for sending. Now,
if both events are not ready to fire at the time of the initial scan,
but are simultaneously ready after the sleep, pollrescan() would iterate
over the pollfd struct twice. Since both times revents is not zero,
returned value would be off by one.

Fix this by recalculating the return value in pollout().

PR:	kern/143029
MFC after:	2 weeks
2010-08-28 17:42:08 +00:00
pjd
bc73fabf27 There is a bug in vfs_allocate_syncvnode() failure handling in mount code.
Actually it is hard to properly handle such a failure, especially in MNT_UPDATE
case. The only reason for the vfs_allocate_syncvnode() function to fail is
getnewvnode() failure. Fortunately it is impossible for current implementation
of getnewvnode() to fail, so we can assert this and make
vfs_allocate_syncvnode() void. This in turn free us from handling its failures
in the mount code.

Reviewed by:	kib
MFC after:	1 month
2010-08-28 08:57:15 +00:00
pjd
43af1b0877 Run all tasks from a proper context, with proper priority, etc.
Reviewed by:	jhb
MFC after:	1 month
2010-08-28 08:38:03 +00:00
kib
65295a82b8 Fix typo.
Submitted by:	Ben Kaduk <minimarmot gmail com>
2010-08-26 11:20:57 +00:00
brian
e098f7b033 If we read zero bytes from the directory, early out with ENOENT
rather than forging ahead and interpreting garbage buffer content
and dirent structures.

This change backs out r211684 which was essentially a no-op.

MFC after:	1 week
2010-08-25 18:09:51 +00:00
davidxu
86cb0861ef If a thread is removed from umtxq while sleeping, reset error code
to zero, this gives userland a better indication that a thread needn't
to be cancelled.
2010-08-25 03:14:32 +00:00
davidxu
22bc7d14ad Optimize thr_suspend, if timeout is zero, don't call msleep, just
return immediately.
2010-08-24 07:29:55 +00:00
davidxu
6616b254f2 - According to specification, SI_USER code should only be generated by
standard kill(). On other systems, SI_LWP is generated by lwp_kill().
  This will allow conforming applications to differentiate between
  signals generated by standard events and those generated by other
  implementation events in a manner compatible with existing practice.
- Bump __FreeBSD_version
2010-08-24 07:22:24 +00:00
imp
50d4a3193c This should really be MACHINE not MACHINE_ARCH, and is this Makefile even used? 2010-08-23 06:22:35 +00:00
brian
89b2d8bbb4 uio_resid isn't updated by VOP_READDIR for nfs filesystems. Use
the uio_offset adjustment instead to calculate a correct *len.

Without this change, we run off the end of the directory data
we're reading and panic horribly for nfs filesystems.

MFC after:	1 week
2010-08-23 05:33:31 +00:00
rpaulo
dda24289cb Call the systrace_probe_func() when the error value.
Sponsored by:	The FreeBSD Foundation
2010-08-22 11:30:49 +00:00
rpaulo
ea11ba6788 Add an extra comment to the SDT probes definition. This allows us to get
use '-' in probe names, matching the probe names in Solaris.[1]

Add userland SDT probes definitions to sys/sdt.h.

Sponsored by:	The FreeBSD Foundation
Discussed with:	rwaston [1]
2010-08-22 11:18:57 +00:00
rpaulo
6f62630bc2 Bump KDTRACE_THREAD_ZERO and use M_ZERO as a malloc flag instead of
calling bzero.

Sponsored by:	The FreeBSD Foundation
2010-08-22 11:09:53 +00:00
rpaulo
a34abf7c98 Fix style issues.
Sponsored by:	The FreeBSD Foundation
2010-08-22 11:08:18 +00:00
davidxu
84d25462c9 make sure thread lock is locked. 2010-08-20 23:51:34 +00:00
jhb
d4890c88b0 Add dedicated routines to toggle lockmgr flags such as LK_NOSHARE and
LK_CANRECURSE after a lock is created.  Use them to implement macros that
otherwise manipulated the flags directly.  Assert that the associated
lockmgr lock is exclusively locked by the current thread when manipulating
these flags to ensure the flag updates are safe.  This last change required
some minor shuffling in a few filesystems to exclusively lock a brand new
vnode slightly earlier.

Reviewed by:	kib
MFC after:	3 days
2010-08-20 19:46:50 +00:00
davidxu
89f466d2b2 If thread set a TDP_WAKEUP for itself, clears the flag and returns EINTR
immediately, this is used for implementing reliable pthread cancellation.
2010-08-20 04:28:30 +00:00
jhb
d02cab2556 Remove unused KTRACE includes. 2010-08-19 16:41:27 +00:00
jhb
2f662f7a9c There isn't really a need to hold the ktrace mutex just to read the value
of p_traceflag that is stored in the kinfo_proc structure.  It is still
racey even with the lock and the code will read a consistent snapshot of
the flag without the lock.
2010-08-19 16:40:30 +00:00
jhb
faa167a723 Fix a whitespace nit and remove a questioning comment. STAILQ_CONCAT()
does require the STAILQ the existing list is being added to to already
be initialized (it is CONCAT() vs MOVE()).
2010-08-19 16:38:58 +00:00
jhb
d64a4df941 Keep the process locked when calling ktrops() or ktrsetchildren() instead
of dropping the lock only to immediately reacquire it.
2010-08-17 21:34:19 +00:00
kib
d9f088a03e Supply some useful information to the started image using ELF aux vectors.
In particular, provide pagesize and pagesizes array, the canary value
for SSP use, number of host CPUs and osreldate.

Tested by:	marius (sparc64)
MFC after:	1 month
2010-08-17 08:55:45 +00:00
pjd
120209c66c Simplify taskqueue_drain() by using proved macros. 2010-08-13 19:20:35 +00:00
gibbs
f5039e4d7d Allow interrupt driven config hooks to be registered from config hook callbacks.
Interrupt driven configuration hooks serve two purposes: they are a
mechanism for registering for a callback that is invoked once interrupt
services are available, and they hold off root device selection so long
as any configuration hooks are still active.  Before this change, it was
not possible to safely register additional hooks from the context of a
configuration hook callback.  The need for this feature arises when
interrupts are required to discover new devices (e.g. access to the XenStore
to find para-virtualized devices) which in turn also require the ability
to hold off root device selection until some lengthy, interrupt driven,
configuration task has completed (e.g. Xen front/back device driver
negotiation).

More specifically, the mutex protecting the list of active configuration
hooks is never held during a callback, and static information is used
to ensure proper ordering and only a single callback to each hook even
when faced with registration or removal of a hook during an active run.

Sponsored by:	Spectra Logic Corporation
MFC after:      1 week.
2010-08-12 19:50:40 +00:00
gibbs
6b6ab892f9 Properly indent a continue statement. No functional changes. 2010-08-12 19:26:27 +00:00
jkim
f9341f06d7 Add the half of time-of-day clock resolution when we adjust system time from
time-of-day clock or vice versa.  For x86 systems, RTC resolution is one
second and we used to lose up to one second whenever we initialize system
time from RTC or write system time back to RTC.  With this change, margin
of error per conversion is roughly between -0.5 and +0.5 second rather
than between -1 and 0 second.  Note that it does not take care of errors
from getnanotime(9) (which is up to 1/hz second) or CLOCK_GETTIME() latency.
These are just too expensive to correct and it is not worthy of the cost.
2010-08-12 17:17:05 +00:00
jkim
1974c6514b Provide description for 'machdep.disable_rtc_set' sysctl. Clean up style(9)
nits.  Remove a redundant return statement and an unnecessary variable.
2010-08-12 16:13:24 +00:00
kib
ade28bdd40 The buffers b_vflags field is not always properly protected by
bufobj lock. If b_bufobj is not NULL, then bufobj lock should be
held when manipulating the flags. Not doing this sometimes leaves
BV_BKGRDINPROG to be erronously set, causing softdep' getdirtybuf() to
stuck indefinitely in "getbuf" sleep, waiting for background write to
finish which is not actually performed.

Add BO_LOCK() in the cases where it was missed.

In collaboration with:	pho
Tested by:	bz
Reviewed by:	jeff
MFC after:	1 month
2010-08-12 08:36:23 +00:00
mdf
0737955344 Rework memguard(9) to reserve significantly more KVA to detect
use-after-free over a longer time.  Also release the backing pages of
a guarded allocation at free(9) time to reduce the overhead of using
memguard(9).  Allow setting and varying the malloc type at run-time.
Add knobs to allow:

 - randomly guarding memory
 - adding un-backed KVA guard pages to detect underflow and overflow
 - a lower limit on the size of allocations that are guarded

Reviewed by:    alc
Reviewed by:    brueffer, Ulrich Spörlein <uqs spoerlein net> (man page)
Silence from:   -arch
Approved by:    zml (mentor)
MFC after:      1 month
2010-08-11 22:10:37 +00:00
ivoras
9cecaf1c60 Fix (hopefully) the spelling of "queuing."
Submitted by:	bf1783 at gmail com
2010-08-09 23:32:37 +00:00
ivoras
191f678b27 Bumping the read-ahead count once more, to value equivalent to 512 KiB on
most system, based on benchmark results on a low-end fibre channel SAN
under VMWare:

vfs.read_max		read performance
8  (historical default)	83 MB/s
16 (recent bump)	131 MB/s
32 (this version)	152 MB/s
64			157 MB/s

(results are +/- 3 MB/s)

As read-ahead is heuristic, based on past IO requests, it shouldn't be
problematic. The new default is still smaller then in other OSes.
2010-08-09 22:56:10 +00:00
ivoras
fa067e3c30 Elaborate on how hirunningspace was chosen. 2010-08-09 22:22:46 +00:00
gavin
dbc7cd5ae9 Add descriptions to a handful of sysctl nodes.
PR:		kern/148580
Submitted by:	Galimov Albert <wtfcrap mail.ru>
MFC after:	1 week
2010-08-09 14:48:31 +00:00
attilio
307b2c04a2 The r208165 fixed a bug related to unsigned integer overflowing for the
number of CPUs detection.
However, that was not mention at all, the problem was not reported, the
patch has not been MFCed and the fix is mostly improper.

Fix the original overflow (caused when 32 CPUs must be detected) by
just using a different mathematical computation (it also makes more
explicit the size of operands involved, which is good in the moment
waiting for a more complete support for a large number of CPUs).

PR:		kern/148698
Submitted by:	Joe Landers <jlanders at vmware dot com>
Tested by:	gianni
MFC after:	10 days
2010-08-09 00:23:57 +00:00
jamie
4e0690ba81 Back out r210974. Any convenience of not typing "persist" is outweighed
by the possibility of unintended partially-formed jails.
2010-08-08 23:22:55 +00:00
ivoras
252207bbfb To help with sequential read UFS performance on modern systems, increase
the vfs.read_max default. For most systems this means going from 128 KiB
to 256 KiB, which is still very conservative and lower than what most
other operating systems use, but as a sane default should not
interfere much with existing systems.

For systems with RAID volumes and/or virtualization envirnments, where
read performance is very important, increasing this sysctl tunable to 32
or even more will demonstratively yield additional performance benefits.

If MAXPHYS ever gets bumped up, it will probably be a good idea to slave
read_max to it.
2010-08-07 18:30:10 +00:00
tuexen
542f657a7f Fix a bug where MSG_TRUNC was not returned in all necessary cases for
SOCK_DGRAM socket. MSG_TRUNC was only returned when some mbufs could
not be copied to the application. If some data was left in the last
mbuf, it was correctly discarded, but MSG_TRUNC was not set.

Reviewed by: bz
MFC after: 3 weeks
2010-08-07 17:57:58 +00:00