Commit Graph

12408 Commits

Author SHA1 Message Date
Alexander Motin
556a5850fa Fix interrupt counters dumping on SW_WATCHDOG fire. 2011-09-27 09:30:20 +00:00
Edward Tomasz Napierala
1dbf9dcc20 Fix error handling bug that would prevent MAC structures from getting
freed properly if resource limit got exceeded.

Approved by:	re (kib)
2011-09-17 20:48:49 +00:00
Edward Tomasz Napierala
b38520f09c Fix long-standing thinko regarding maxproc accounting. Basically,
we were accounting the newly created process to its parent instead
of the child itself.  This caused problems later, when the child
changed its credentials - the per-uid, per-jail etc counters were
not properly updated, because the maxproc counter in the child
process was 0.

Approved by:	re (kib)
2011-09-17 19:55:32 +00:00
Kip Macy
9eca9361f9 Auto-generated code from sys_ prefixing makesyscalls.sh change
Approved by:	re(bz)
2011-09-16 14:04:14 +00:00
Kip Macy
8451d0dd78 In order to maximize the re-usability of kernel code in user space this
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.

Reviewed by:	rwatson
Approved by:	re (bz)
2011-09-16 13:58:51 +00:00
Adrian Chadd
d2849f27bc Ensure that ta_pending doesn't overflow u_short by capping its value at USHRT_MAX.
If it overflows before the taskqueue can run, the task will be
re-added to the taskqueue and cause a loop in the task list.

Reported by:	Arnaud Lacombe <lacombar@gmail.com>
Submitted by:	Ryan Stone <rysto32@gmail.com>
Reviewed by:	jhb
Approved by:	re (kib)
MFC after:	1 day
2011-09-15 08:42:06 +00:00
Rick Macklem
4d30adc494 Modify vfs_register() to use a hash calculation
on vfc_name to set vfc_typenum, so that vfc_typenum doesn't
change when file systems are loaded in different orders. This
keeps NFS file handles from changing, for file systems that
use vfc_typenum in their fsid. This change is controlled via
a loader.conf variable called vfs.typenumhash, since vfc_typenum
will change once when this is enabled. It defaults to 1 for
9.0, but will default to 0 when MFC'd to stable/8.

Tested by:	hrs
Reviewed by:	jhb, pjd (earlier version)
Approved by:	re (kib)
MFC after:	1 month
2011-09-13 21:01:26 +00:00
Attilio Rao
58379067a3 dump_write() returns ENXIO if the dump is trying to be written outside
of the device boundry.
While this is generally ok, the problem is that all the consumers
handle similar cases (and expect to catch) ENOSPC for this (for a
reference look at minidumpsys() and dumpsys() constructions). That
ends up in consumers not recognizing the issue and amd64 failing to
retry if the number of pages grows up during minidump.
Fix this by returning ENOSPC in dump_write() and while here add some
more diagnostic on involved values.

Sponsored by:	Sandvine Incorporated
In collabouration with:	emaste
Approved by:	re (kib)
MFC after:	10 days
2011-09-12 20:39:31 +00:00
Ed Schouten
ca0856d3d1 Fix error return codes for ioctls on init/lock state devices.
In revision 223722 we introduced support for driver ioctls on init/lock
state devices. Unfortunately the call to ttydevsw_cioctl() clobbers the
value of the error variable, meaning that in many cases ioctl() will now
return ENOTTY, even though the ioctl() was processed properly.

Reported by:	Boris Samorodov <bsam ipt ru>
Patch by:	jilles@
Approved by:	re@ (kib@)
2011-09-12 10:07:21 +00:00
Konstantin Belousov
26ccf4f10f Inline the syscallenter() and syscallret(). This reduces the time measured
by the syscall entry speed microbenchmarks by ~10% on amd64.

Submitted by:	jhb
Approved by:	re (bz)
MFC after:	2 weeks
2011-09-11 16:05:09 +00:00
Attilio Rao
fa2b39a18d Improve the informations reported in case of busy buffers during the shutdown:
- Axe out the SHOW_BUSYBUFS option and uses a tunable for selectively
enable/disable it, which is defaulted for not printing anything (0
value) but can be changed for printing (1 value) and be verbose (2
value)
- Improves the informations outputed: right now, there is no track of
the actual struct buf object or vnode which are referenced by the
shutdown process, but it is printed the related struct bufobj object
which is not really helpful
- Add more verbosity about the state of the struct buf lock and the
vnode informations, with the latter to be activated separately by the
sysctl

Sponsored by:	Sandvine Incorporated
Reviewed by:	emaste, kib
Approved by:	re (ksmith)
MFC after:	10 days
2011-09-08 12:56:26 +00:00
Edward Tomasz Napierala
ba1b206990 Fix whitespace.
Submitted by:	amdmi3
Approved by:	re (rwatson)
2011-09-07 07:52:45 +00:00
Edward Tomasz Napierala
3044751e35 Work around a kernel panic triggered by forkbomb with an rctl rule
such as j:name:maxproc:sigkill=100.  Proper fix - deferring psignal
to a taskqueue - is somewhat complicated and thus will happen
after 9.0.

Approved by:	re (kib)
2011-09-06 17:22:40 +00:00
Attilio Rao
9f39e22e6d Interrupts are disabled/enabled when entering and exiting the KDB context.
While this is generally good, it brings along a serie of problems,
like clocks going off sync and in presence of SW_WATCHDOG, watchdogs
firing without a good reason (missed hardclock wdog ticks update).

Fix the latter by kicking the watchdog just before to re-enable the interrupts.
Also, while here, not rely on users to stop the watchdog manually when
entering DDB but do that when entering KDB context.

Sponsored by: Sandvine Incorporated
Reviewed by:	emaste, rstone
Approved by:	re (kib)
MFC after:	1 week
2011-09-04 13:07:02 +00:00
Edward Tomasz Napierala
cff08ec0f4 Since r224036 the cputime and wallclock are supposed to be in seconds,
not microseconds.  Make it so.

Approved by:	re (kib)
2011-09-04 05:04:34 +00:00
Edward Tomasz Napierala
2419d7f93b Fix panic that happens when fork(2) fails due to a limit other than
the rctl one - for example, it happens when someone reaches maximum
number of processes in the system.

Approved by:	re (kib)
2011-09-03 08:08:24 +00:00
Robert Watson
9b6dd12e5d Correct several issues in the integration of POSIX shared memory objects
and the new setmode and setowner fileops in FreeBSD 9.0:

- Add new MAC Framework entry point mac_posixshm_check_create() to allow
  MAC policies to authorise shared memory use.  Provide a stub policy and
  test policy templates.

- Add missing Biba and MLS implementations of mac_posixshm_check_setmode()
  and mac_posixshm_check_setowner().

- Add 'accmode' argument to mac_posixshm_check_open() -- unlike the
  mac_posixsem_check_open() entry point it was modeled on, the access mode
  is required as shared memory access can be read-only as well as writable;
  this isn't true of POSIX semaphores.

- Implement full range of POSIX shared memory entry points for Biba and MLS.

Sponsored by:   Google Inc.
Obtained from:	TrustedBSD Project
Approved by:    re (kib)
2011-09-02 17:40:39 +00:00
Robert Watson
4cf7545589 Attempt to make break-to-debugger and alternative break-to-debugger more
accessible:

(1) Always compile in support for breaking into the debugger if options
    KDB is present in the kernel.

(2) Disable both by default, but allow them to be enabled via tunables
    and sysctls debug.kdb.break_to_debugger and
    debug.kdb.alt_break_to_debugger.

(3) options BREAK_TO_DEBUGGER and options ALT_BREAK_TO_DEBUGGER continue
    to behave as before -- only now instead of compiling in
    break-to-debugger support, they change the default values of the
    above sysctls to enable those features by default.  Current kernel
    configurations should, therefore, continue to behave as expected.

(4) Migrate alternative break-to-debugger state machine logic out of
    individual device drivers into centralised KDB code.  This has a
    number of upsides, but also one downside: it's now tricky to release
    sio spin locks when entering the debugger, so we don't.  However,
    similar logic does not exist in other device drivers, including uart.

(5) dcons requires some special handling; unlike other console types, it
    allows overriding KDB's own debugger selection, so we need a new
    interface to KDB to allow that to work.

GENERIC kernels in -CURRENT will now support break-to-debugger as long as
appropriate boot/run-time options are set, which should improve the
debuggability of BETA kernels significantly.

MFC after:	3 weeks
Reviewed by:	kib, nwhitehorn
Approved by:	re (bz)
2011-08-26 21:46:36 +00:00
Xin LI
cd39bb098e Fix format strings for KTR_STATE in 4BSD ad ULE schedulers.
Submitted by:	Ivan Klymenko <fidaj@ukr.net>
PR:		kern/159904, kern/159905
MFC after:	2 weeks
Approved by:	re (kib)
2011-08-26 18:00:07 +00:00
Jamie Gritton
e6d5cb63fa Delay the recursive decrement of pr_uref when jails are made invisible
but not removed; decrement it instead when the child jail actually
goes away. This avoids letting the counter go below zero in the case
where dying (pr_uref==0) jails are "resurrected", and an associated
KASSERT panic.

Submitted by:	Steven Hartland
Approved by:	re (bz)
MFC after:	1 week
2011-08-26 16:03:34 +00:00
Attilio Rao
6aba400a70 Fix a deficiency in the selinfo interface:
If a selinfo object is recorded (via selrecord()) and then it is
quickly destroyed, with the waiters missing the opportunity to awake,
at the next iteration they will find the selinfo object destroyed,
causing a PF#.

That happens because the selinfo interface has no way to drain the
waiters before to destroy the registered selinfo object. Also this
race is quite rare to get in practice, because it would require a
selrecord(), a poll request by another thread and a quick destruction
of the selrecord()'ed selinfo object.

Fix this by adding the seldrain() routine which should be called
before to destroy the selinfo objects (in order to avoid such case),
and fix the present cases where it might have already been called.
Sometimes, the context is safe enough to prevent this type of race,
like it happens in device drivers which installs selinfo objects on
poll callbacks. There, the destruction of the selinfo object happens
at driver detach time, when all the filedescriptors should be already
closed, thus there cannot be a race.
For this case, mfi(4) device driver can be set as an example, as it
implements a full correct logic for preventing this from happening.

Sponsored by:	Sandvine Incorporated
Reported by:	rstone
Tested by:	pluknet
Reviewed by:	jhb, kib
Approved by:	re (bz)
MFC after:	3 weeks
2011-08-25 15:51:54 +00:00
Bjoern A. Zeeb
b233773bb9 Increase the defaults for the maximum socket buffer limit,
and the maximum TCP send and receive buffer limits from 256kB
to 2MB.

For sb_max_adj we need to add the cast as already used in the sysctl
handler to not overflow the type doing the maths.

Note that this is just the defaults.  They will allow more memory
to be consumed per socket/connection if needed but not change the
default "idle" memory consumption.   All values are still tunable
by sysctls.

Suggested by:	gnn
Discussed on:	arch (Mar and Aug 2011)
MFC after:	3 weeks
Approved by:	re (kib)
2011-08-25 09:20:13 +00:00
Martin Matuska
82378711f9 Generalize ffs_pages_remove() into vn_pages_remove().
Remove mapped pages for all dataset vnodes in zfs_rezget() using
new vn_pages_remove() to fix mmapped files changed by
zfs rollback or zfs receive -F.

PR:		kern/160035, kern/156933
Reviewed by:	kib, pjd
Approved by:	re (kib)
MFC after:	1 week
2011-08-25 08:17:39 +00:00
Attilio Rao
e75baa2802 callout_cpu_switch() allows preemption when dropping the outcoming
callout cpu lock (and after having dropped it).
If the newly scheduled thread wants to acquire the old queue it will
just spin forever.

Fix this by disabling preemption and interrupts entirely (because fast
interrupt handlers may incur in the same problem too) while switching
locks.

Reported by:	hrs, Mike Tancsa <mike AT sentex DOT net>,
		Chip Camden <sterling AT camdensoftware DOT com>
Tested by:	hrs, Mike Tancsa <mike AT sentex DOT net>,
		Chip Camden <sterling AT camdensoftware DOT com>,
		Nicholas Esborn <nick AT desert DOT net>
Approved by:	re (kib)
MFC after:	10 days
2011-08-21 10:52:50 +00:00
Konstantin Belousov
aab4f50170 Prevent the hiwatermark for the unix domain socket from becoming
effectively negative. Often seen as upstream fastcgi connection timeouts
in nginx when using sendfile over unix domain sockets for communication.

Sendfile(2) may send more bytes then currently allowed by the
hiwatermark of the socket, e.g. because the so_snd sockbuf lock is
dropped after sbspace() call in the kern_sendfile() loop. In this case,
recalculated hiwatermark will overflow. Since lowatermark is renewed
as half of the hiwatermark by sendfile code, and both are unsigned,
the send buffer never reaches the free space requested by lowatermark,
causing indefinite wait in sendfile.

Reviewed by:	rwatson
Approved by:	re (bz)
MFC after:	2 weeks
2011-08-20 16:12:29 +00:00
Robert Watson
311fa10b52 r222015 introduced a new assertion that the size of a fixed-length sbuf
buffer is greater than 1.  This triggered panics in at least one spot in
the kernel (the MAC Framework) which passes non-negative, rather than >1
buffer sizes based on the size of a user buffer passed into a system
call.  While 0-size buffers aren't particularly useful, they also aren't
strictly incorrect, so loosen the assertion.

Discussed with:	phk (fears I might be EDOOFUS but willing to go along)
Spotted by:	pho + stress2
Approved by:	re (kib)
2011-08-19 08:29:10 +00:00
Jonathan Anderson
f8ca0a757a Auto-generated system call code based on r224987.
Approved by:	re (implicit)
2011-08-18 23:08:52 +00:00
Jonathan Anderson
cfb5f76865 Add experimental support for process descriptors
A "process descriptor" file descriptor is used to manage processes
without using the PID namespace. This is required for Capsicum's
Capability Mode, where the PID namespace is unavailable.

New system calls pdfork(2) and pdkill(2) offer the functional equivalents
of fork(2) and kill(2). pdgetpid(2) allows querying the PID of the remote
process for debugging purposes. The currently-unimplemented pdwait(2) will,
in the future, allow querying rusage/exit status. In the interim, poll(2)
may be used to check (and wait for) process termination.

When a process is referenced by a process descriptor, it does not issue
SIGCHLD to the parent, making it suitable for use in libraries---a common
scenario when using library compartmentalisation from within large
applications (such as web browsers). Some observers may note a similarity
to Mach task ports; process descriptors provide a subset of this behaviour,
but in a UNIX style.

This feature is enabled by "options PROCDESC", but as with several other
Capsicum kernel features, is not enabled by default in GENERIC 9.0.

Reviewed by: jhb, kib
Approved by: re (kib), mentor (rwatson)
Sponsored by: Google Inc
2011-08-18 22:51:30 +00:00
John Baldwin
f55d3fbe84 One of the general principles of the sysctl(3) API is that a user can
query the needed size for a sysctl result by passing in a NULL old
pointer and a valid oldsize.  The kern.proc.args sysctl handler broke
this assumption by not calling SYSCTL_OUT() if the old pointer was
NULL.

Approved by:	re (kib)
MFC after:	3 days
2011-08-18 22:20:45 +00:00
Konstantin Belousov
68889ed699 Fix build breakage. Initialize error variables explicitely for !MAC case.
Pointy hat to:	kib
Approved by:	re (bz)
2011-08-17 12:37:14 +00:00
Konstantin Belousov
9c00bb9190 Add the fo_chown and fo_chmod methods to struct fileops and use them
to implement fchown(2) and fchmod(2) support for several file types
that previously lacked it. Add MAC entries for chown/chmod done on
posix shared memory and (old) in-kernel posix semaphores.

Based on the submission by:	glebius
Reviewed by:	rwatson
Approved by:	re (bz)
2011-08-16 20:07:47 +00:00
Jonathan Anderson
d6f7248983 poll(2) implementation for capabilities.
When calling poll(2) on a capability, unwrap first and then poll the
underlying object.

Approved by: re (kib), mentor (rwatson)
Sponsored by: Google Inc
2011-08-16 14:14:56 +00:00
Robert Watson
359b396113 Trim some warnings and notes from capabilities.conf -- these are left over
from Capsicum development, and no longer apply.

Approved by:	re (kib)
Sponsored by:	Google Inc
2011-08-13 17:22:16 +00:00
Robert Watson
fd9a5f73f6 When falloc() was broken into separate falloc_noinstall() and finstall(),
a bug was introduced in kern_openat() such that the error from the vnode
open operation was overwritten before it was passed as an argument to
dupfdopen().  This broke operations on /dev/{stdin,stdout,stderr}.  Fix
by preserving the original error number across finstall() so that it is
still available.

Approved by:	re (kib)
Reported by:	cognet
2011-08-13 16:03:40 +00:00
Robert Watson
854d7b9fc8 Update use of the FEATURE() macro in sys_capability.c to reflect the move
to two different kernel options for capability mode vs. capabilities.

Approved by:	re (bz)
2011-08-13 13:34:01 +00:00
Robert Watson
73516dbd27 Now that capability support has been committed, update and expand the
comment at the type of sys_capability.c to describe its new contents.

Approved by:  re (xxx)
2011-08-13 13:26:40 +00:00
Robert Watson
74536eddbe Regenerate system call files following r224812 changes to capabilities.conf.
A no-op for non-Capsicum kernels; for Capsicum kernels, completes the
enabling of fooat(2) system calls using capabilities.  With this change,
and subject to bug fixes, Capsicum capability support is now complete for
9.0.

Approved by:    re (kib)
Submitted by:   jonathan
Sponsored by:   Google Inc
2011-08-13 12:14:40 +00:00
Jonathan Anderson
bc69c09054 Allow openat(2), fstatat(2), etc. in capability mode.
namei() and lookup() can now perform "strictly relative" lookups.
Such lookups, performed when in capability mode or when looking up
relative to a directory capability, enforce two policies:
 - absolute paths are disallowed (including symlinks to absolute paths)
 - paths containing '..' components are disallowed

These constraints make it safe to enable openat() and friends.
These system calls are instrumental in supporting Capsicum
components such as the capability-mode-aware runtime linker.

Finally, adjust comments in capabilities.conf to reflect the actual state
of the world (e.g. shm_open(2) already has the appropriate constraints,
getdents(2) already requires CAP_SEEK).

Approved by: re (bz), mentor (rwatson)
Sponsored by: Google Inc.
2011-08-13 10:43:21 +00:00
Jonathan Anderson
69d377fe1b Allow Capsicum capabilities to delegate constrained
access to file system subtrees to sandboxed processes.

- Use of absolute paths and '..' are limited in capability mode.
- Use of absolute paths and '..' are limited when looking up relative
  to a capability.
- When a name lookup is performed, identify what operation is to be
  performed (such as CAP_MKDIR) as well as check for CAP_LOOKUP.

With these constraints, openat() and friends are now safe in capability
mode, and can then be used by code such as the capability-mode runtime
linker.

Approved by: re (bz), mentor (rwatson)
Sponsored by: Google Inc
2011-08-13 09:21:16 +00:00
Jonathan Anderson
d1b6899e83 Rename CAP_*_KEVENT to CAP_*_EVENT.
Change the names of a couple of capability rights to be less
FreeBSD-specific.

Approved by: re (kib), mentor (rwatson)
Sponsored by: Google Inc
2011-08-12 14:26:47 +00:00
Jonathan Anderson
d6d2cfa24b Only call fdclose() on successfully-opened FDs.
Since kern_openat() now uses falloc_noinstall() and finstall() separately,
there are cases where we could get to cleanup code without ever creating
a file descriptor. In those cases, we should not call fdclose() on FD -1.

Approved by: re (kib), mentor (rwatson)
Sponsored by: Google Inc
2011-08-11 13:29:59 +00:00
Robert Watson
a9d2f8d84f Second-to-last commit implementing Capsicum capabilities in the FreeBSD
kernel for FreeBSD 9.0:

Add a new capability mask argument to fget(9) and friends, allowing system
call code to declare what capabilities are required when an integer file
descriptor is converted into an in-kernel struct file *.  With options
CAPABILITIES compiled into the kernel, this enforces capability
protection; without, this change is effectively a no-op.

Some cases require special handling, such as mmap(2), which must preserve
information about the maximum rights at the time of mapping in the memory
map so that they can later be enforced in mprotect(2) -- this is done by
narrowing the rights in the existing max_protection field used for similar
purposes with file permissions.

In namei(9), we assert that the code is not reached from within capability
mode, as we're not yet ready to enforce namespace capabilities there.
This will follow in a later commit.

Update two capability names: CAP_EVENT and CAP_KEVENT become
CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they
represent.

Approved by:	re (bz)
Submitted by:	jonathan
Sponsored by:	Google Inc
2011-08-11 12:30:23 +00:00
Martin Matuska
d16eac5c57 Revert r224655 and r224614 because vn_fullpath* does not always work
on nullfs mounts.

Change shall be reconsidered after 9.0 is released.

Requested by:	re (kib)
Approved by:	re (kib)
2011-08-08 14:02:08 +00:00
Martin Matuska
5388625fff The change in r224615 didn't take into account that vn_fullpath_global()
doesn't operate on locked vnode. This could cause a panic.

Fix by unlocking vnode, re-locking afterwards and verifying that it wasn't
renamed or deleted. To improve readability and reduce code size, move code
to a new static function vfs_verify_global_path().

In addition, fix missing giant unlock in unmount().

Reported by:	David Wolfskill <david@catwhisker.org>
Reviewed by:	kib
Approved by:	re (bz)
MFC after:	2 weeks
2011-08-05 11:12:50 +00:00
Martin Matuska
435d46675d Always disable mount and unmount for jails with enforce_statfs==2.
A working statfs(2) is required for umount(8) in jail.

Reviewed by:	pjd, kib
Approved by:	re (kib)
MFC after:	2 weeks
2011-08-02 19:44:40 +00:00
Martin Matuska
f6c1d63e47 For mount, discover f_mntonname from supplied path argument
using vn_fullpath_global(). This fixes f_mntonname if mounting
inside chroot, jail or with relative path as argument.

For unmount in jail, use vn_fullpath_global() to discover
global path from supplied path argument. This fixes unmount in jail.

Reviewed by:	pjd, kib
Approved by:	re (kib)
MFC after:	2 weeks
2011-08-02 19:13:56 +00:00
Konstantin Belousov
d0a724c5ea Fix the LK_NOSHARE lockmgr flag interaction with LK_UPGRADE and
LK_DOWNGRADE lock ops. Namely, the ops should be NOP since LK_NOSHARE
locks are always exclusive.

Reported by:	rmacklem
Reviewed by:	attilio
Tested by:	pho
Approved by:	re (kensmith)
MFC after:	1 week
2011-08-01 19:07:03 +00:00
Gleb Smirnoff
062393fe00 Don't leak kld_sx lock in kldunloadf().
Approved by:	re (kib)
2011-07-31 13:49:15 +00:00
Andriy Gapon
35edc49853 smp_rendezvous: master cpu should wait until all slaves are fully done
This is a followup to r222032 and a reimplementation of it.
While that revision fixed the race for the smp_rv_waiters[2] exit
sentinel, it still left a possibility for a target CPU to access
stale or wrong smp_rv_func_arg in smp_rv_teardown_func.
To fix this race the slave CPUs signal when they are really fully
done with the rendezvous and the master CPU waits until all slaves
are done.

Diagnosed by:	kib
Reviewed by:	jhb, mlaier, neel
Approved by:	re (kib)
MFC after:	2 weeks
2011-07-30 20:29:39 +00:00
Konstantin Belousov
889dffba25 Fix the devmtx lock leak from make_dev(9) when the old device cloning
failed due to invalid or duplicated path being generated.

Reviewed by:	jh
Approved by:	re (kensmith)
MFC after:	1 week
2011-07-30 14:12:37 +00:00
Andriy Gapon
7a0b13ed28 remove RESTARTABLE_PANICS option
This is done per request/suggestion from John Baldwin
who introduced the option.  Trying to resume normal
system operation after a panic is very unpredictable
and dangerous.  It will become even more dangerous
when we allow a thread in panic(9) to penetrate all
lock contexts.
I understand that the only purpose of this option was
for testing scenarios potentially resulting in panic.

Suggested by:	jhb
Reviewed by:	attilio, jhb
X-MFC-After:	never
Approved by:	re (kib)
2011-07-25 09:12:48 +00:00
Kirk McKusick
d716efa9f7 Move the MNTK_SUJ flag in mnt_kern_flag to MNT_SUJ in mnt_flag
so that it is visible to userland programs. This change enables
the `mount' command with no arguments to be able to show if a
filesystem is mounted using journaled soft updates as opposed
to just normal soft updates.

Approved by: re (bz)
2011-07-24 18:27:09 +00:00
Kirk McKusick
6beb3bb4eb This update changes the mnt_flag field in the mount structure from
32 bits to 64 bits and eliminates the unused mnt_xflag field.  The
existing mnt_flag field is completely out of bits, so this update
gives us room to expand. Note that the f_flags field in the statfs
structure is already 64 bits, so the expanded mnt_flag field can
be exported without having to make any changes in the statfs structure.

Approved by: re (bz)
2011-07-24 17:43:09 +00:00
Jonathan Anderson
7a270867e7 Turn on AUDIT_ARG_RIGHTS() for cap_new(2).
Now that the code is in place to audit capability method rights, start
using it to audit the 'rights' argument to cap_new(2).

Approved by: re (kib), mentor (rwatson)
Sponsored by: Google Inc
2011-07-22 12:50:21 +00:00
Jonathan Anderson
c30b9b5169 Export capability information via sysctls.
When reporting on a capability, flag the fact that it is a capability,
but also unwrap to report all of the usual information about the
underlying file.

Approved by: re (kib), mentor (rwatson)
Sponsored by: Google Inc
2011-07-20 09:53:35 +00:00
Attilio Rao
6338c57958 Remove explicit MAXCPU usage from sys/pcpu.h avoiding a namespace
pollution.  That is a step further in the direction of building correct
policies for userland and modules on how to deal with the number of
maxcpus at runtime.

Reported by:	jhb
Reviewed and tested by:	pluknet
Approved by:	re (kib)
2011-07-19 16:50:55 +00:00
Attilio Rao
edf26ab83e Remove pc_name member of struct pcpu.
pc_name is only included when KTR option is and it does introduce a
subdle KBI breakage that totally breaks vmstat when world and kernel are
not in sync.
Besides, it is not used somewhere.

In collabouration with:	pluknet
Reviewed by:	jhb
Approved by:	re (kib)
2011-07-19 14:57:59 +00:00
Bjoern A. Zeeb
925af54487 Rename ki_ocomm to ki_tdname and OCOMMLEN to TDNAMLEN.
Provide backward compatibility defines under BURN_BRIDGES.

Suggested by:	jhb
Reviewed by:	emaste
Sponsored by:	Sandvine Incorporated
Approved by:	re (kib)
2011-07-18 20:06:15 +00:00
John Baldwin
2417d97ebb - Export each thread's individual resource usage in in struct kinfo_proc's
ki_rusage member when KERN_PROC_INC_THREAD is passed to one of the
  process sysctls.
- Correctly account for the current thread's cputime in the thread when
  doing the runtime fixup in calcru().
- Use TIDs as the key to lookup the previous thread to compute IO stat
  deltas in IO mode in top when thread display is enabled.

Reviewed by:	kib
Approved by:	re (kib)
2011-07-18 17:33:08 +00:00
Attilio Rao
521ea19d1c - Remove the eintrcnt/eintrnames usage and introduce the concept of
sintrcnt/sintrnames which are symbols containing the size of the 2
  tables.
- For amd64/i386 remove the storage of intr* stuff from assembly files.
  This area can be widely improved by applying the same to other
  architectures and likely finding an unified approach among them and
  move the whole code to be MI. More work in this area is expected to
  happen fairly soon.

No MFC is previewed for this patch.

Tested by:	pluknet
Reviewed by:	jhb
Approved by:	re (kib)
2011-07-18 15:19:40 +00:00
Robert Watson
ff66f6a404 Define two new sysctl node flags: CTLFLAG_CAPRD and CTLFLAG_CAPRW, which
may be jointly referenced via the mask CTLFLAG_CAPRW.  Sysctls with these
flags are available in Capsicum's capability mode; other sysctl nodes are
not.

Flag several useful sysctls as available in capability mode, such as memory
layout sysctls required by the run-time linker and malloc(3).  Also expose
access to randomness and available kernel features.

A few sysctls are enabled to support name->MIB conversion; these may leak
information to capability mode by virtue of providing resolution on names
not flagged for access in capability mode.  This is, generally, not a huge
problem, but might be something to resolve in the future.  Flag these cases
with XXX comments.

Submitted by:	jonathan
Sponsored by:	Google, Inc.
2011-07-17 23:05:24 +00:00
Ryan Stone
05a4755ef6 Fix a LOR between hwpmc and the kernel linker. When a system-wide
sampling mode PMC is allocated, hwpmc calls linker_hwpmc_list_objects()
while already holding an exclusive lock on pmc-sx lock.  list_objects()
tries to acquire an exclusive lock on the kld_sx lock.  When a KLD module
is loaded or unloaded successfully, kern_kld(un)load calls into the pmc
hook while already holding an exclusive lock on the kld_sx lock.  Calling
the pmc hook requires acquiring a shared lock on the pmc-sx lock.

Fix this by only acquiring a shared lock on the kld_sx lock in
linker_hwpmc_list_objects(), and also downgrading to a shared lock on the
kld_sx lock in kern_kld(un)load before calling into the pmc hook.  In
kern_kldload this required moving some modifications of the linker_file_t
to happen before calling into the pmc hook.

This fixes the deadlock by ensuring that the hwpmc -> list_objects() case
is always able to proceed.  Without this patch, I was able to deadlock a
multicore system within minutes by constantly loading and unloading an KLD
module while I simultaneously started a sampling mode PMC in a loop.

MFC after:	1 month
2011-07-17 21:53:42 +00:00
Jonathan Anderson
63ba8b5ede Auto-generated system call code with cap_new(), cap_getrights().
Approved by: mentor (rwatson), re (Capsicum blanket)
Sponsored by: Google Inc
2011-07-15 18:33:12 +00:00
Jonathan Anderson
cfb9df5541 Add cap_new() and cap_getrights() system calls.
Implement two previously-reserved Capsicum system calls:
- cap_new() creates a capability to wrap an existing file descriptor
- cap_getrights() queries the rights mask of a capability.

Approved by: mentor (rwatson), re (Capsicum blanket)
Sponsored by: Google Inc
2011-07-15 18:26:19 +00:00
Jonathan Anderson
745bae379d Add implementation for capabilities.
Code to actually implement Capsicum capabilities, including fileops and
kern_capwrap(), which creates a capability to wrap an existing file
descriptor.

We also modify kern_close() and closef() to handle capabilities.

Finally, remove cap_filelist from struct capability, since we don't
actually need it.

Approved by: mentor (rwatson), re (Capsicum blanket)
Sponsored by: Google Inc
2011-07-15 09:37:14 +00:00
Jung-uk Kim
08e1b4f4a9 If TSC stops ticking in C3, disable deep sleep when the user forcefully
select TSC as timecounter hardware.

Tested by:	Fabian Keil (freebsd-listen at fabiankeil dot de)
2011-07-14 21:00:26 +00:00
Edward Tomasz Napierala
85a2f1b4f2 Rename resource names to match these in login.conf. 2011-07-14 19:18:17 +00:00
Bjoern A. Zeeb
1080a2c85d Remove semaphore map entry count "semmap" field and its tuning
option that is highly recommended to be adjusted in too much
documentation while doing nothing in FreeBSD since r2729 (rev 1.1).

ipcs(1) needs to be recompiled as it is accessing _KERNEL private
variables.

Reviewed by:	jhb (before comment change on linux code)
Sponsored by:	Sandvine Incorporated
2011-07-14 14:18:14 +00:00
Konstantin Belousov
f49d820256 Implement an RFTSIGZMB flag to rfork(2) to specify a signal that is
delivered to parent when the child exists.

Submitted by:	Petr Salinger <Petr.Salinger seznam cz> (Debian/kFreeBSD)
MFC after:	1 week
X-MFC-note:	bump __FreeBSD_version
2011-07-12 20:37:18 +00:00
Andrey V. Elsukov
fef7c585de Include sys/sbuf.h directly. 2011-07-11 05:17:46 +00:00
Kirk McKusick
b115b0e28f Update tags build script 2011-07-10 00:53:04 +00:00
Konstantin Belousov
2801687d56 Add a facility to disable processing page faults. When activated,
uiomove generates EFAULT if any accessed address is not mapped, as
opposed to handling the fault.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	alc (previous version)
2011-07-09 15:21:10 +00:00
Matthew D Fleming
b5bd50ae40 Add an option to have a fail point term only execute when run by a
specified pid.  This is helpful for automated testing involving a global
knob that would otherwise be executed by many other threads.

MFC after: 1 week
2011-07-08 20:41:12 +00:00
Matthew D Fleming
0894a9bca6 style(9) and cleanup fixes.
MFC after: 1 week
2011-07-08 20:41:07 +00:00
Jonathan Anderson
5604e481b1 Fix the "passability" test in fdcopy().
Rather than checking to see if a descriptor is a kqueue, check to see if
its fileops flags include DFLAG_PASSABLE.

At the moment, these two tests are equivalent, but this will change with
the addition of capabilities that wrap kqueues but are themselves of type
DTYPE_CAPABILITY. We already have the DFLAG_PASSABLE abstraction, so let's
use it.

This change has been tested with [the newly improved] tools/regression/kqueue.

Approved by: mentor (rwatson), re (Capsicum blanket)
Sponsored by: Google Inc
2011-07-08 12:19:25 +00:00
Andre Oppermann
695da99eba In the experimental soreceive_stream():
o Move the non-blocking socket test below the SBS_CANTRCVMORE so that EOF
   is correctly returned on a remote connection close.
 o In the non-blocking socket test compare SS_NBIO against the so->so_state
   field instead of the incorrect sb->sb_state field.
 o Simplify the ENOTCONN test by removing cases that can't occur.

Submitted by:	trociny (with some further tweaks by committer)
Tested by:	trociny
2011-07-08 10:50:13 +00:00
Edward Tomasz Napierala
4fe8477539 Style fix - macros are supposed to be uppercase. 2011-07-07 17:44:42 +00:00
Andre Oppermann
1c6e7fa7f1 Remove the TCP_SORECEIVE_STREAM compile time option. The use of
soreceive_stream() for TCP still has to be enabled with the loader
tuneable net.inet.tcp.soreceive_stream.

Suggested by:	trociny and others
2011-07-07 10:37:14 +00:00
Edward Tomasz Napierala
afcc55f318 All the racct_*() calls need to happen with the proc locked. Fixing this
won't happen before 9.0.  This commit adds "#ifdef RACCT" around all the
"PROC_LOCK(p); racct_whatever(p, ...); PROC_UNLOCK(p)" instances, in order
to avoid useless locking/unlocking in kernels built without "options RACCT".
2011-07-06 20:06:44 +00:00
Marius Strobl
2e569926f8 Call pmap_qremove() before freeing or unwiring the pages, otherwise
there's a window during which a page can be re-used before its previous
mapping is removed.

Reviewed by:	alc
MFC after:	1 week
2011-07-05 18:40:37 +00:00
Jonathan Anderson
9acdfe6549 Rework _fget to accept capability parameters.
This new version of _fget() requires new parameters:
- cap_rights_t needrights
    the rights that we expect the capability's rights mask to include
    (e.g. CAP_READ if we are going to read from the file)

- cap_rights_t *haverights
    used to return the capability's rights mask (ignored if NULL)

- u_char *maxprotp
    the maximum mmap() rights (e.g. VM_PROT_READ) that can be permitted
    (only used if we are going to mmap the file; ignored if NULL)

- int fget_flags
    FGET_GETCAP if we want to return the capability itself, rather than
    the underlying object which it wraps

Approved by: mentor (rwatson), re (Capsicum blanket)
Sponsored by: Google Inc
2011-07-05 13:45:10 +00:00
Jonathan Anderson
af098ed8e7 Add kernel functions to unwrap capabilities.
cap_funwrap() and cap_funwrap_mmap() unwrap capabilities, exposing the
underlying object. Attempting to unwrap a capability with an inadequate
rights mask (e.g. calling cap_funwrap(fp, CAP_WRITE | CAP_MMAP, &result)
on a capability whose rights mask is CAP_READ | CAP_MMAP) will result in
ENOTCAPABLE.

Unwrapping a non-capability is effectively a no-op.

These functions will be used by Capsicum-aware versions of _fget(), etc.

Approved by: mentor (rwatson), re (Capsicum blanket)
Sponsored by: Google Inc
2011-07-04 14:40:32 +00:00
Attilio Rao
470107b2f1 MFC 2011-07-04 11:13:00 +00:00
Attilio Rao
a2f4e284b0 Completely remove now unused pc_other_cpus, pc_cpumask.
Tested by:	pluknet
2011-07-04 10:45:54 +00:00
Bjoern A. Zeeb
35fd7bc020 Add infrastructure to allow all frames/packets received on an interface
to be assigned to a non-default FIB instance.

You may need to recompile world or ports due to the change of struct ifnet.

Submitted by:	cjsp
Submitted by:	Alexander V. Chernikov (melifaro ipfw.ru)
		(original versions)
Reviewed by:	julian
Reviewed by:	Alexander V. Chernikov (melifaro ipfw.ru)
MFC after:	2 weeks
X-MFC:		use spare in struct ifnet
2011-07-03 12:22:02 +00:00
Ed Schouten
18f5477167 Reintroduce the cioctl() hook in the TTY layer for digi(4).
The cioctl() hook can be used by drivers to add ioctls to the *.init and
*.lock devices. This commit breaks the ttydevsw ABI, since this
structure didn't provide any padding. To prevent ABI breakage in the
future, add a tsw_spare.

Submitted by:	Peter Jeremy <peter jeremy alcatel lucent com>
Obtained from:	kern/152254 (slightly modified)
2011-07-02 13:54:20 +00:00
Jonathan Anderson
c0467b5e6e When Capsicum starts creating capabilities to wrap existing file
descriptors, we will want to allocate a new descriptor without installing
it in the FD array.

Split falloc() into falloc_noinstall() and finstall(), and rewrite
falloc() to call them with appropriate atomicity.

Approved by: mentor (rwatson), re (bz)
2011-06-30 15:22:49 +00:00
Jonathan Anderson
12bc222e57 Add some checks to ensure that Capsicum is behaving correctly, and add some
more explicit comments about what's going on and what future maintainers
need to do when e.g. adding a new operation to a sys_machdep.c.

Approved by: mentor(rwatson), re(bz)
2011-06-30 10:56:02 +00:00
Attilio Rao
7b744f6b01 MFC 2011-06-30 10:19:43 +00:00
Alan Cox
6bbee8e28a Add a new option, OBJPR_NOTMAPPED, to vm_object_page_remove(). Passing this
option to vm_object_page_remove() asserts that the specified range of pages
is not mapped, or more precisely that none of these pages have any managed
mappings.  Thus, vm_object_page_remove() need not call pmap_remove_all() on
the pages.

This change not only saves time by eliminating pointless calls to
pmap_remove_all(), but it also eliminates an inconsistency in the use of
pmap_remove_all() versus related functions, like pmap_remove_write().  It
eliminates harmless but pointless calls to pmap_remove_all() that were being
performed on PG_UNMANAGED pages.

Update all of the existing assertions on pmap_remove_all() to reflect this
change.

Reviewed by:	kib
2011-06-29 16:40:41 +00:00
Jonathan Anderson
24c1c3bf71 We may split today's CAPABILITIES into CAPABILITY_MODE (which has
to do with global namespaces) and CAPABILITIES (which has to do with
constraining file descriptors). Just in case, and because it's a better
name anyway, let's move CAPABILITIES out of the way.

Also, change opt_capabilities.h to opt_capsicum.h; for now, this will
only hold CAPABILITY_MODE, but it will probably also hold the new
CAPABILITIES (implying constrained file descriptors) in the future.

Approved by: rwatson
Sponsored by: Google UK Ltd
2011-06-29 13:03:05 +00:00
Attilio Rao
40a034576b MFC 2011-06-28 14:40:17 +00:00
Attilio Rao
c0757daf1f Fix a mismerge. 2011-06-27 13:02:23 +00:00
Ed Schouten
7c9669276e Fix whitespace inconsistencies in the TTY layer and its drivers owned by me. 2011-06-26 18:26:20 +00:00
Attilio Rao
cfdfd32d34 MFC 2011-06-26 17:30:46 +00:00
Jonathan Anderson
54350dfa33 Remove redundant Capsicum sysctl.
Since we're now declaring FEATURE(security_capabilities), there's no need for an explicit SYSCTL_NODE.

Approved by: rwatson
2011-06-25 12:37:06 +00:00
Andriy Gapon
31c5a6e2b8 unconditionally stop other cpus when entering kdb in smp system
... and thus retire debug.kdb.stop_cpus tunable/sysctl.
The knob was to work around CPU stopping issues, which since have been
either fixed or greatly reduced.  kdb should really operate in a special
environment with scheduler stopped and interrupts disabled to provide
deterministic debugging.

Discussed with:	attilio, rwatson
X-MFC after:	2 months or never
2011-06-25 10:28:16 +00:00
Andriy Gapon
1aac6ac94a generic_stop_cpus: pull timeout logic from under DIAGNOSTIC
... and also increase the timeout.
It's better to try to proceed somehow despite stuck CPUs than to hang
indefinitely.  Especially so during shutdown and when entering kdb or panic.

Timeout value is still an aribitrary value.
Timeout diagnostic is just a printf; the work on something more
debuggable is planned by attilio.  Need to be careful here as
stop_cpus_hard is called very early while enetering kdb and soon(-ish)
it may become called very early when entering panic.

Reviewed by:	attilio
MFC after:	2 months
2011-06-25 10:01:43 +00:00
Attilio Rao
de138ec703 MFC 2011-06-24 16:35:40 +00:00
Jonathan Anderson
5e26234ff1 Tidy up a capabilities-related comment.
This comment refers to an #ifdef that hasn't been merged [yet?]; remove it.

Approved by: rwatson
2011-06-24 14:40:22 +00:00
Attilio Rao
9b571ec6b3 MFC 2011-06-22 19:42:32 +00:00
Jung-uk Kim
a49399a903 Set negative quality to TSC timecounter when C3 state is enabled for Intel
processors unless the invariant TSC bit of CPUID is set.  Intel processors
may stop incrementing TSC when DPSLP# pin is asserted, according to Intel
processor manuals, i. e., TSC timecounter is useless if the processor can
enter deep sleep state (C3/C4).  This problem was accidentally uncovered by
r222869, which increased timecounter quality of P-state invariant TSC, e.g.,
for Core2 Duo T5870 (Family 6, Model f) and Atom N270 (Family 6, Model 1c).

Reported by:	Fabian Keil (freebsd-listen at fabiankeil dot de)
		Ian FREISLICH (ianf at clue dot co dot za)
Tested by:	Fabian Keil (freebsd-listen at fabiankeil dot de)
		- Core2 Duo T5870 (C3 state available/enabled)
		jkim - Xeon X5150 (C3 state unavailable)
2011-06-22 16:40:45 +00:00
Attilio Rao
5519971c21 MFC 2011-06-19 14:22:35 +00:00
David E. O'Brien
52e95a64ea Add comment from CSRG rev 7.27 (1992/06/23 19:56:55; author: mckusick) 2011-06-17 21:44:13 +00:00
Konstantin Belousov
508462ed1b Do not trash the argv[0] pointer for an a.out process on amd64.
Found with the binary provided by joerg.
2011-06-16 22:00:59 +00:00
Konstantin Belousov
5a888c066f Fix silly typo that resulted in the a.out process stack to end at
~200MB instead of 3GB on amd64.
2011-06-16 21:59:16 +00:00
Marcel Moolenaar
8a71031712 Even if the loaded module has no symbols, we still need to notify
MD code about it and update the link map for GDB's use.
2011-06-16 17:41:21 +00:00
Attilio Rao
49f5aeaf41 MFC 2011-06-15 07:20:22 +00:00
Justin T. Gibbs
583bef3863 sys/kern/subr_kdb.c:
Modify the "alternate break sequence" detecting state
	machine so that only a contiguous invocation of the
	break sequence is accepted.  The old implementation
	did not reset the state machine when detecting an
	unexpected character.

	While here, use an enum for the states of the machine
	instead of magic numbers.bmitted by:

Sponsored by:	Spectra Logic Corporation
2011-06-14 21:37:25 +00:00
David E. O'Brien
f528c3fdbc We should not return ECHILD when debugging a child and the parent does a
"wait4(-1, ..., WNOHANG, ...)".  Instead wait(2) should behave as if the
child does not wish to report status at this time.

Reviewed by:	jhb
2011-06-14 17:09:30 +00:00
Justin T. Gibbs
aa76615dd1 sys/sys/conf.h:
sys/kern/kern_conf.c:
	Add make_dev_physpath_alias().  This interface takes
	the parent cdev of the alias, an old alias cdev (if any)
	to replace with the newly created alias, and the physical
	path string.  The alias is visiable as a symlink to the
	parent, with the same name as the parent, rooted at
	physpath in devfs.

	Note: make_dev_physpath_alias() has hard coded knowledge of the
	      Solaris style prefix convention for physical path data,
	      "id1,".  In the future, I expect the convention to change
	      to allow "physical path quality" to be reported in the
	      prefix.  For example, a physical path based on NewBus
	      topology would be of "lower quality" than a physical path
	      reported by a device enclosure.

Sponsored by:	Spectra Logic Corporation
2011-06-14 16:29:43 +00:00
Kenneth D. Merry
094efe753d Instead of using an atomic operation to determine whether the devstat(9)
device node has been created, pass MAKEDEV_CHECKNAME in so that the devfs
code will do the check.

Use a regular static variable as before, that's good enough to keep us from
calling into devfs most of the time.

Suggested by:	kib
MFC after:	1 week
Sponsored by:	Spectra Logic Corporation
2011-06-13 22:08:24 +00:00
Justin T. Gibbs
27c959cf05 Fix a couple of race conditions in devstat(9) initialization.
In devstat_new_entry(), there is no need to initialize the queue
and the mutex in this function.  There are ways to do static
initialization on both, so use STAILQ_HEAD_INITIALIZER and
MTX_SYSINIT to initialize the queue and the mutex.

In devstat_alloc(), use an atomic test and set routine to guard
making our entry in /dev.  Using just a plain static variable
creates a race condition on multiprocessor machines.  If you
attempt to create a second entry in devfs, the kernel will panic.

Submitted by:	kdm
Reviewed by:	gibbs
Sponsored by:	Spectra Logic Corporation
MFC after:	1 week.
2011-06-13 21:21:02 +00:00
Attilio Rao
a38f1f263b Remove pc_cpumask and pc_other_cpus usage from MI code.
Tested by:	pluknet
2011-06-13 13:28:31 +00:00
Jeff Roberson
6f59b2bd33 - When printing bufs with show buf the lblkno is often more useful than
the blkno.  Print them both.
2011-06-10 22:15:36 +00:00
Attilio Rao
e3adb68519 In the current code, a double panic condition may lead to dumps
interleaving.
Signal dumping to happen only for the first panic which should be the
most important.

Sponsored by:	Sandvine Incorporated
Submitted by:	Nima Misaghian (nmisaghian AT sandvine DOT com)
MFC after:	2 weeks
2011-06-08 19:28:59 +00:00
John Baldwin
c721b93449 Log the socket address passed as the destination to sendto() and sendmsg()
via ktrace.

MFC after:	1 week
2011-06-07 17:40:33 +00:00
Attilio Rao
5e9857e76b MFC 2011-06-07 08:24:29 +00:00
Kenneth D. Merry
5e319c480c Set pca.p_bufr to NULL when we haven't allocated a buffer.
Otherwise, p_bufr is set to garbage on the stack, and if that garbage
happens to be non-NULL, and the TOLOG or TOCONS flag is set, putbuf()
will get called and attempt to fill the non-existent buffer.

This is really only relevant for tprintf() (and only when the priority is
not -1), but set it in uprintf() and ttyprintf() for completeness.

The next step, to avoid log buffer scrambling, would be to add the
PRINTF_BUFR_SIZE code to tprintf(), but this should prevent panics.

Submitted by:	rmacklem
Found by:	pho
2011-06-07 05:04:37 +00:00
David Xu
a231144921 Use p4prio_to_tsprio to calculate TS priority instead of using
p4prio_to_rtpprio which is for RT priority.

PR:	kern/157657
Submitted by:	krivenok.dmitry at gmail dot com
MFC after:	3 days
2011-06-07 02:50:14 +00:00
Marcel Moolenaar
299cceef03 Fix making kernel dumps from the debugger by creating a command
for it. Do not not expect a developer to call doadump(). Calling
doadump does not necessarily work when it's declared static. Nor
does it necessarily do what was intended in the context of text
dumps. The dump command always creates a core dump.

Move printing of error messages from doadump to the dump command,
now that we don't have to worry about being called from DDB.
2011-06-07 01:28:12 +00:00
Attilio Rao
81c02539f1 MFC 2011-06-06 21:38:39 +00:00
John Baldwin
69b63a9dc7 Clear the device_t pointer in 'struct resource' when releasing a device
as otherwise the sysctl to export rman info can dereference a stale
pointer.

PR:		kern/115371
Submitted by:	Arthur Hartwig
MFC after:	1 week
2011-06-06 13:12:56 +00:00
Attilio Rao
bc6339618e MFC 2011-06-01 16:54:33 +00:00
Kenneth D. Merry
534917efef Fix a bug introduced in revision 222537.
In msgbuf_reinit() and msgbuf_init(), we weren't initializing the mutex.
Depending on the contents of memory, the LO_INITIALIZED flag might be
set on the mutex (either due to a warm reboot, and the message buffer
remaining in place, or due to garbage in memory) and in that case, with
INVARIANTS turned on, we would trigger an assertion that the mutex had
already been initialized.

Fix this by bzeroing the message buffer mutex for the _init() and _reinit()
paths.

Reported by:	mdf
2011-05-31 22:39:32 +00:00
Attilio Rao
61b926921f MFC 2011-05-31 21:22:44 +00:00
Attilio Rao
e370959707 Fix KTR_CPUMASK in order to accept a string representing a cpuset_t.
This introduce all the underlying support for making this possible (via
the function cpusetobj_strscan() and keeps ktr_cpumask exported.  sparc64
implements its own assembly primitives for tracing events and needs to
properly check it.  Anyway the sparc64 logic is not implemented yet due
to lack of knowledge (by me) and time (by marius), but it is just a
matter of using ktr_cpumask when possible.

Tested and fixed by:	pluknet
Reviewed by:		marius
2011-05-31 20:48:58 +00:00
Attilio Rao
d0984adc98 Revert a change that crept in during MFC. 2011-05-31 20:23:33 +00:00
Kenneth D. Merry
d42a4eb507 Fix apparent garbage in the message buffer.
While we have had a fix in place (options PRINTF_BUFR_SIZE=128) to fix
scrambled console output, the message buffer and syslog were still getting
log messages one character at a time.  While all of the characters still
made it into the log (courtesy of atomic operations), they were often
interleaved when there were multiple threads writing to the buffer at the
same time.

This fixes message buffer accesses to use buffering logic as well, so that
strings that are less than PRINTF_BUFR_SIZE will be put into the message
buffer atomically.  So now dmesg output should look the same as console
output.

subr_msgbuf.c:		Convert most message buffer calls to use a new spin
			lock instead of atomic variables in some places.

			Add a new routine, msgbuf_addstr(), that adds a
			NUL-terminated string to a message buffer.  This
			takes a priority argument, which allows us to
			eliminate some races (at least in the the string
			at a time case) that are present in the
			implementation of msglogchar().  (dangling and
			lastpri are static variables, and are subject to
			races when multiple callers are present.)

			msgbuf_addstr() also allows the caller to request
			that carriage returns be stripped out of the
			string.  This matches the behavior of msglogchar(),
			but in testing so far it doesn't appear that any
			newlines are being stripped out.  So the carriage
			return removal functionality may be a candidate
			for removal later on if further analysis shows
			that it isn't necessary.

subr_prf.c:		Add a new msglogstr() routine that calls
			msgbuf_logstr().

			Rename putcons() to putbuf().  This now handles
			buffered output to the message log as well as
			the console.  Also, remove the logic in putcons()
			(now putbuf()) that added a carriage return before
			a newline.  The console path was the only path that
			needed it, and cnputc() (called by cnputs())
			already adds a carriage return.  So this
			duplication resulted in kernel-generated console
			output lines ending in '\r''\r''\n'.

			Refactor putchar() to handle the new buffering
			scheme.

			Add buffering to log().

			Change log_console() to use msglogstr() instead of
			msglogchar().  Don't add extra newlines by default
			in log_console().  Hide that behavior behind a
			tunable/sysctl (kern.log_console_add_linefeed) for
			those who would like the old behavior.  The old
			behavior led to the insertion of extra newlines
			for log output for programs that print out a
			string, and then a trailing newline on a separate
			write.  (This is visible with dmesg -a.)

msgbuf.h:		Add a prototype for msgbuf_addstr().

			Add three new fields to struct msgbuf, msg_needsnl,
			msg_lastpri and msg_lock.  The first two are needed
			for log message functionality previously handled
			by msglogchar().  (Which is still active if
			buffering isn't enabled.)

			Include sys/lock.h and sys/mutex.h for the new
			mutex.

Reviewed by:	gibbs
2011-05-31 17:29:58 +00:00
Nathan Whitehorn
d098f93019 On multi-core, multi-threaded PPC systems, it is important that the threads
be brought up in the order they are enumerated in the device tree (in
particular, that thread 0 on each core be brought up first). The SLIST
through which we loop to start the CPUs has all of its entries added with
SLIST_INSERT_HEAD(), which means it is in reverse order of enumeration
and so AP startup would always fail in such situations (causing a machine
check or RTAS failure). Fix this by changing the SLIST into an STAILQ,
and inserting new CPUs at the end.

Reviewed by:	jhb
2011-05-31 15:11:43 +00:00
Attilio Rao
5b6ea0b538 MFC 2011-05-31 14:18:10 +00:00
Attilio Rao
da3dd8b7ab MFC 2011-05-29 18:33:13 +00:00
Mikolaj Golub
3204c8e596 In soreceive_generic(), if MSG_WAITALL is set but the request is
larger than the receive buffer, we have to receive in sections.
When notifying the protocol that some data has been drained the
lock is released for a moment. Returning we block waiting for the
rest of data. There is a race, when data could arrive while the
lock was released and then the connection stalls in sbwait.

Fix this by checking for data before blocking and skip blocking
if there are some.

PR:		kern/154504
Reported by:	Andrey Simonenko <simon@comsys.ntu-kpi.kiev.ua>
Tested by:	Andrey Simonenko <simon@comsys.ntu-kpi.kiev.ua>
Reviewed by:	rwatson
Approved by:	kib (co-mentor)
MFC after:	2 weeks
2011-05-29 18:00:50 +00:00
Attilio Rao
c7df91af4b MFC 2011-05-29 00:59:38 +00:00
Edward Tomasz Napierala
7e2548ae0a Remove definitions for RACCT_FSIZE and RACCT_SBSIZE - these two are rather
performance-sensitive and not that useful, so I won't be merging them
before 9.0.
2011-05-27 19:57:58 +00:00
Attilio Rao
9cb46334ee MFC 2011-05-27 16:09:10 +00:00
Edward Tomasz Napierala
b8fdb0d94d Fix support for RACCT_CORE by merging forgotten file. 2011-05-26 18:54:07 +00:00
Attilio Rao
7fcdc9a26f MFC 2011-05-26 17:38:00 +00:00
John Baldwin
5b41f90fd1 Silly spelling typos.
Submitted by:	"b. f."
2011-05-24 19:55:57 +00:00
John Baldwin
47ad691f87 Fix an issue with critical sections and SMP rendezvous handlers.
Specifically, a critical_exit() call that drops the nesting level to zero
has a brief window where the pending preemption flag is set and the
nesting level is set to zero.  This is done purposefully to avoid races
where a preemption scheduled by an interrupt could be lost otherwise (see
revision 144777).  However, this does mean that if an interrupt fires
during this window and enters and exits a critical section, it may preempt
from the interrupt context.  This is generally fine as the interrupt code
is careful to arrange critical sections so that they are not exited until
it is safe to preempt (e.g. interrupts EOI'd and masked if necessary).

However, the SMP rendezvous IPI handler does not quite follow this rule,
and in general a rendezvous can never be preempted.  Rendezvous handlers
are also not permitted to schedule threads to execute, so they will not
typically trigger preemptions.  SMP rendezvous handlers may use
spinlocks (carefully) such as the rm_cleanIPI() handler used in rmlocks,
but using a spinlock also enters and exits a critical section.  If the
interrupted top-half code is in the brief window of critical_exit() where
the nesting level is zero but a preemption is pending, then releasing the
spinlock can trigger a preemption.  Because we know that SMP rendezvous
handlers can never schedule a thread, we know that a critical_exit() in
an SMP rendezvous handler will only preempt in this edge case.  We also
know that the top-half thread will happily handle the deferred preemption
once the SMP rendezvous has completed, so the preemption will not be lost.

This makes it safe to employ a workaround where we use a nested critical
section in the SMP rendezvous code itself around rendezvous action
routines to prevent any preemptions during an SMP rendezvous.  The
workaround intentionally avoids checking for a deferred preemption
when leaving the critical section on the assumption that if there is a
pending preemption it will be handled by the interrupted top-half code.

Submitted by:	mlaier (variation specific to rm_cleanIPI())
Obtained from:	Isilon
MFC after:	1 week
2011-05-24 13:36:41 +00:00
John Baldwin
af21235ac4 Update comments for DEVICE_PROBE() to reflect that BUS_PROBE_DEFAULT is
now the preferred typical return value from a probe routine.  Discourage
the use of 0 (BUS_PROBE_SPECIFIC) as it should be used very rarely.
Point the reader to the DEVICE_PROBE(9) manpage for more detailed notes
on possible probe return values.

Submitted by:	Philip Soeberg  philip-dev of soeberg net
2011-05-24 13:22:40 +00:00
John Baldwin
211d4a2c42 Simplify a stale assertion. We have not called mi_switch() from a nested
critical section during a preemption for several years.

MFC after:	1 week
2011-05-24 13:17:08 +00:00
Attilio Rao
3ac3f6002b MFC 2011-05-23 23:58:02 +00:00
Attilio Rao
217e1c0ebc Revert a patch that unvolountary sneaked in while I was MFCing. 2011-05-23 23:50:21 +00:00
Ruslan Ermilov
5e863acb63 BKVASIZE was bumped to 16k more than a decade ago. 2011-05-23 19:59:01 +00:00
Jaakko Heinonen
f53edc909e In init_dynamic_kenv(), ignore environment strings exceeding the
KENV_MNAMELEN + 1 + KENV_MVALLEN + 1 length limit to avoid buffer
overflow in getenv(). Currenly loader(8) doesn't limit the length of
environment strings.

PR:		kern/132104
MFC after:	1 month
2011-05-23 16:40:44 +00:00
Attilio Rao
a9ff18a210 MFC 2011-05-23 01:17:30 +00:00
Attilio Rao
e3071102d6 Merge r221912 from largeSMP project branch:
Fix a long-standing bug in cpuset_thread0() where only the first part
of cs_mask is set full.

Submitted by:	anonymous
MFC after:	1 week
2011-05-22 21:35:03 +00:00
Attilio Rao
8c4431d022 MFC 2011-05-22 20:41:10 +00:00
Attilio Rao
34a1e065bd Make cpusetobj_strprint() prepare the string in order to print the
least significant cpuset_t word at the outmost right part of the string
(more far from the beginning of it).  This follows the natural build of
bits rappresentation in the words.
2011-05-22 20:29:47 +00:00