Commit Graph

12773 Commits

Author SHA1 Message Date
Andrey V. Elsukov
199aa9756b Add debug.kld_debug loader tunable.
MFC after:	2 weeks
2012-03-15 05:11:29 +00:00
Jaakko Heinonen
db62ced238 Add an assert for proctree_lock to proc_to_reap().
Discussed with:	kib
MFC after:	1 week
2012-03-14 15:52:23 +00:00
Konstantin Belousov
7335ed90a0 Lock the process around manipulations with p_flag.
Reported and reviewed by:	jh
MFC after:	3 days
2012-03-13 22:00:46 +00:00
Adrian Chadd
a9a282f672 Add module load/unload stubs. 2012-03-13 20:27:48 +00:00
Alexander Motin
fd053fae73 Add kern.eventtimer.activetick tunable/sysctl, specifying whether each
hardclock() tick should be run on every active CPU, or on only one.

On my tests, avoiding extra interrupts because of this on 8-CPU Core i7
system with HZ=10000 saves about 2% of performance. At this moment option
implemented only for global timers, as reprogramming per-CPU timers is
too expensive now to be compensated by this benefit, especially since we
still have to regularly run hardclock() on at least one active CPU to
update system uptime. For global timer it is quite trivial: timer runs
always, but we just skip IPIs to other CPUs when possible.

Option is enabled by default now, keeping previous behavior, as periodic
hardclock() calls are still used at least to implement setitimer(2) with
ITIMER_VIRTUAL and ITIMER_PROF arguments. But since default schedulers don't
depend on it since r232917, we are much more free to experiment with it.

MFC after:	1 month
2012-03-13 10:21:08 +00:00
Alexander Motin
7295465e33 Rewrite thread CPU usage percentage math to not depend on periodic calls
with HZ rate through the sched_tick() calls from hardclock().

Potentially it can be used to improve precision, but now it is just minus
one more reason to call hardclock() for every HZ tick on every active CPU.
SCHED_4BSD never used sched_tick(), but keep it in place for now, as at
least SCHED_FBFS existing in patches out of the tree depends on it.

MFC after:	1 month
2012-03-13 08:18:54 +00:00
Peter Holm
62a9fc76df Allways call fdrop(). 2012-03-12 11:56:57 +00:00
Konstantin Belousov
1a9c7dec1f ELF image can have several PT_NOTE program headers. Look for the ELF
brand note in each header, instead of using only first one.

Reviewed by:	kan
Tested by:	andrew (arm), flo (sparc64)
MFC after:	3 weeks
2012-03-11 19:38:49 +00:00
Konstantin Belousov
b80dcb55aa Remove fifo.h. The only used function declaration from the header is
migrated to sys/vnode.h.

Submitted by:	gianni
2012-03-11 12:19:58 +00:00
Alexander Motin
5f3818a56e Revert r175376 and tune cpufreq(4) frequency comparison logic instead.
Instead of using 25MHz equality threshold, look for the nearest value when
handling dev.cpu.0.freq sysctl and for exact match when it is expected.

ACPI may report extra level with frequency 1MHz above the nominal to
control Intel Turbo Boost operation. It is not a bug, but feature:
dev.cpu.0.freq_levels: 2934/106000 2933/95000 2800/82000 ...
In this case value 2933 means 2.93GHz, but 2934 means 3.2-3.6GHz.

I've found that my Core i7-870 based system has Intel Turbo Boost disabled
by default and without this change it was absolutely invisible and hard
to control.

MFC after:	2 weeks
2012-03-10 18:56:16 +00:00
Alexander Motin
bcfd016cff Idle ticks optimization:
- Pass number of events to the statclock() and profclock() functions
   same as to hardclock() before to not call them many times in a loop.
 - Rename them into statclock_cnt() and profclock_cnt().
 - Turn statclock() and profclock() into compatibility wrappers,
   still needed for arm.
 - Rename hardclock_anycpu() into hardclock_cnt() for unification.

MFC after:	1 week
2012-03-10 14:57:21 +00:00
Edward Tomasz Napierala
0a53cd5742 Remove useless thread_{lock,unlock}() in raccd. 2012-03-10 14:38:49 +00:00
Juli Mallett
85729c2c44 Export intrcnt correctly when running under 32-bit compatibility.
Reviewed by:	gonzo, nwhitehorn
2012-03-09 22:30:54 +00:00
Peter Holm
39e77c4c50 Perform the parameter validation before assigning it to a signed int
variable. This fixes the problem seen with readdir(3) fuzzing.

Submitted by:	bde
MFC after:	1 week
2012-03-09 21:31:12 +00:00
Alexander Motin
b3f40a4107 Make kern.sched.idlespinthresh default value adaptive depending of HZ.
Otherwise with HZ above 8000 CPU may never skip timer ticks on idle.
2012-03-09 19:09:08 +00:00
Alexander Motin
55c71d634f Be more polite when setting state->nextevent inside cpu_new_callout().
Hardclock is not the only who wakes idle CPU since kdtrace cyclic addition.

MFC after:	2 weeks
2012-03-09 07:30:48 +00:00
Konstantin Belousov
38ddb5725b Decomission mnt_noasync. Introduce MNTK_NOASYNC mnt_kern_flag which
allows a filesystem to request VFS to not allow MNTK_ASYNC.

MFC after:	1 week
2012-03-09 00:12:05 +00:00
Peter Holm
ffae9d4d7c Free up allocated memory used by posix_fadvise(2). 2012-03-08 20:34:13 +00:00
John Baldwin
b47f624183 Add KTR_VFS traces to track modifications to a vnode's writecount. 2012-03-08 20:27:20 +00:00
John Baldwin
44ad547522 Add a new sched_clear_name() method to the scheduler interface to clear
the cached name used for KTR_SCHED traces when a thread's name changes.
This way KTR_SCHED traces (and thus schedgraph) will notice when a thread's
name changes, most commonly via execve().

MFC after:	2 weeks
2012-03-08 19:41:05 +00:00
Konstantin Belousov
f950879e16 The pipe_poll() performs lockless access to the vnode to test
fifo_iseof() condition, allowing the v_fifoinfo to be reset and freed
by fifo_cleanup().

Precalculate EOF at the places were fo_wgen is changed, and cache the
state in a new pipe state flag PIPE_SAMEWGEN.

Reported and tested by:	bf
Submitted by:	gianni
MFC after:	1 week (a backport)
2012-03-07 07:31:50 +00:00
Edward Tomasz Napierala
c34bbd2ada Make racct and rctl correctly handle jail renaming. Previously
they would continue using old name, the one jail was created with.

PR:		bin/165207
2012-03-06 11:05:50 +00:00
Ivan Voras
2573ea5f76 Print out process name and thread id in the debugging message.
This is useful because the message can end up in system logs in
non-debugging operation.

Reviewed by:	attilio (earlier version)
2012-03-05 14:19:43 +00:00
Konstantin Belousov
e7f19c3d81 pipe_read(): change the type of size to int, and remove signed clamp.
pipe_write(): change the type of desiredsize back to int, its value fits.

Requested by: bde
MFC after:    3 weeks
2012-03-04 15:09:01 +00:00
Konstantin Belousov
8bb9a904d5 Instead of incomplete handling of read(2)/write(2) return values that
does not fit into registers, declare that we do not support this case
using CTASSERT(), and remove endianess-unsafe code to split return value
into td_retval.

While there, change the style of the sysctl debug.iosize_max_clamp
definition.

Requested by:	bde
MFC after:	3 weeks
2012-03-04 14:55:37 +00:00
Mikolaj Golub
e0fcf639d2 Make kern.proc.umask sysctl readonly.
Requested by:	src
MFC after:	1 week
2012-03-03 11:53:35 +00:00
Alexander Motin
6022f0bcb3 Fix bug of r232207, when cpu_search() could prefer CPU group with best
load, but with no CPU matching given limitations. It caused kernel panics
in some cases when thread was bound to specific CPUs with cpuset(1).
2012-03-03 11:50:48 +00:00
Juli Mallett
9624d94701 o) Add COMPAT_FREEBSD32 support for MIPS kernels using the n64 ABI with userlands
using the o32 ABI.  This mostly follows nwhitehorn's lead in implementing
   COMPAT_FREEBSD32 on powerpc64.
o) Add a new type to the freebsd32 compat layer, time32_t, which is time_t in the
   32-bit ABI being used.  Since the MIPS port is relatively-new, even the 32-bit
   ABIs use a 64-bit time_t.
o) Because time{spec,val}32 has the same size and layout as time{spec,val} on MIPS
   with 32-bit compatibility, then, disable some code which assumes otherwise
   wrongly when built for MIPS.  A more general macro to check in this case would
   seem like a good idea eventually.  If someone adds support for using n32
   userland with n64 kernels on MIPS, then they will have to add a variety of
   flags related to each piece of the ABI that can vary.  That's probably the
   right time to generalize further.
o) Add MIPS to the list of architectures which use PAD64_REQUIRED in the
   freebsd32 compat code.  Probably this should be generalized at some point.

Reviewed by:	gonzo
2012-03-03 08:19:18 +00:00
Rick Macklem
5e99212d36 Post r230394, the Lookup RPC counts for both NFS clients increased
significantly. Upon investigation this was caused by name cache
misses for lookups of "..". For name cache entries for non-".."
directories, the cache entry serves double duty. It maps both the
named directory plus ".." for the parent of the directory. As such,
two ctime values (one for each of the directory and its parent) need
to be saved in the name cache entry.
This patch adds an entry for ctime of the parent directory to the
name cache. It also adds an additional uma zone for large entries
with this time value, in order to minimize memory wastage.
As well, it fixes a couple of cases where the mtime of the parent
directory was being saved instead of ctime for positive name cache
entries. With this patch, Lookup RPC counts return to values similar
to pre-r230394 kernels.

Reported by:	bde
Discussed with:	kib
Reviewed by:	jhb
MFC after:	2 weeks
2012-03-03 01:06:54 +00:00
John Baldwin
831ce4cb3d - Change contigmalloc() to use the vm_paddr_t type instead of an unsigned
long for specifying a boundary constraint.
- Change bus_dma tags to use bus_addr_t instead of bus_size_t for boundary
  constraints.

These allow boundary constraints to be fully expressed for cases where
sizeof(bus_addr_t) != sizeof(bus_size_t).  Specifically, it allows a
driver to properly specify a 4GB boundary in a PAE kernel.

Note that this cannot be safely MFC'd without a lot of compat shims due
to KBI changes, so I do not intend to merge it.

Reviewed by:	scottl
2012-03-01 19:58:34 +00:00
Kirk McKusick
35338e6091 This change avoids a kernel deadlock on "snaplk" when using
snapshots on UFS filesystems running with journaled soft updates.
This is the first of several bugs that need to be fixed before
removing the restriction added in -r230250 to prevent the use
of snapshots on filesystems running with journaled soft updates.

The deadlock occurs when holding the snapshot lock (snaplk)
and then trying to flush an inode via ffs_update(). We become
blocked by another process trying to flush a different inode
contained in the same inode block that we need. It holds the
inode block for which we are waiting locked. When it tries to
write the inode block, it gets blocked waiting for the our
snaplk when it calls ffs_copyonwrite() to see if the inode
block needs to be copied in our snapshot.

The most obvious place that this deadlock arises is in the
ffs_copyonwrite() routine when it updates critical metadata
in a snapshot and tries to write it out before proceeding.
The fix here is to write the data and indirect block pointer
for the snapshot, but to skip the call to ffs_update() to
write the snapshot inode. To ensure that we will never have
to update a pointer in the inode itself, the ffs_snapshot()
routine that creates the snapshot has to ensure that all the
direct blocks are allocated as part of the creation of the
snapshot.

A less obvious place that this deadlock occurs is when we hold
the snaplk because we are deleting a snapshot. In the course of
doing the deletion, we need to allocate various soft update
dependency structures and allocate some journal space. If we
hit a resource limit while doing this we decrease the resources
in use by flushing out an existing dirty file to get it to give
up the soft dependency resources that it holds. The flush can
cause an ffs_update() to be done on the inode for the file that
we have selected to flush resulting in the same deadlock as
described above when the inode that we have chosen to flush
resides in the same inode block as the snapshot inode that we hold.
The fix is to defer cleaning up any time that the inode on which
we are operating is a snapshot.

Help and review by:    Jeff Roberson
Tested by:             Peter Holm
MFC (to 9 only) after: 2 weeks
2012-03-01 18:45:25 +00:00
Mikolaj Golub
c7e41c8b50 Introduce VOP_UNP_BIND(), VOP_UNP_CONNECT(), and VOP_UNP_DETACH()
operations for setting and accessing vnode's v_socket field.

The operations are necessary to implement proper unix socket handling
on layered file systems like nullfs(5).

This change fixes the long standing issue with nullfs(5) being in that
unix sockets did not work between lower and upper layers: if we bound
to a socket on the lower layer we could connect only to the lower
path; if we bound to the upper layer we could connect only to the
upper path. The new behavior is one can connect to both the lower and
the upper paths regardless what layer path one binds to.

PR:		kern/51583, kern/159663
Suggested by:	kib
Reviewed by:	arch
MFC after:	2 weeks
2012-02-29 21:38:31 +00:00
David Xu
c9b01ed581 initialize clock ID and flags only when copying timespec, a _umtx_time
copy already contains these fields.
2012-02-29 02:01:48 +00:00
Martin Matuska
41c0675e6e Add procfs to jail-mountable filesystems.
Reviewed by:	jamie
MFC after:	1 week
2012-02-29 00:30:18 +00:00
Dimitry Andric
ac382cb7f1 Change definition of pipe_chmod() from K&R to C99, to avoid the
following clang warning:

sys/kern/sys_pipe.c:1556:10: error: promoted type 'int' of K&R function parameter is not compatible with the parameter type 'mode_t'
      (aka 'unsigned short') declared in a previous prototype [-Werror]
        mode_t mode;
               ^
sys/kern/sys_pipe.c:155:19: note: previous declaration is here
static fo_chmod_t       pipe_chmod;
                        ^
2012-02-28 21:45:21 +00:00
John Baldwin
0428cbe4df Properly clear a device's devclass if DEVICE_ATTACH() fails if the device
does not have a fixed devclass.

Reviewed by:	imp
MFC after:	2 weeks
2012-02-28 19:16:02 +00:00
Konstantin Belousov
1d7ca9bb8e Currently, the debugger attached to the process executing vfork() does
not get syscall exit notification until the child performed exec of
exit.  Swap the order of doing ptracestop() and waiting for P_PPWAIT
clearing, by postponing the wait into syscallret after ptracestop()
notification is done.

Reported, tested and reviewed by:	Dmitry Mikulin <dmitrym juniper net>
MFC after:	 2 weeks
2012-02-27 21:10:10 +00:00
John Baldwin
ef7b427562 Clear the a device's description string anytime it's driver changes.
Descriptions  are specific to drivers and we don't change drivers on attached
devices.  This fixes a few places where we were not clearing the description
when detaching a driver (e.g. with device_attach() failed).  While here, fix
a few other nits:
- Remove spurious call to remove a device's driver from
  devclass_driver_deleted().  device_detach() removes it already.
- Fix a typo.
2012-02-27 16:08:18 +00:00
David Xu
24c209494a Follow changes made in revision 232144, pass absolute timeout to kernel,
this eliminates a clock_gettime() syscall.
2012-02-27 13:38:52 +00:00
Alexander Motin
36acfc6507 Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
 - In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
 - Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
 - Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
 - Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
 - Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.

All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.

Reviewed by:	jeff
Tested by:	flo, hackers@
MFC after:	1 month
Sponsored by:	iXsystems, Inc.
2012-02-27 10:31:54 +00:00
Poul-Henning Kamp
f9a61f7dcb Also call the low-level driver if ->c_iflag & (IXON|IXOFF|IXANY) changes.
Uftdi(4) examines (c_iflag & (IXON|IXOFF)) to control hw XON-XOFF support.
This is obviously no good, if changes to those bits are not communicated
down the stack.
2012-02-26 20:56:49 +00:00
Alan Cox
e5c92401dd Fix typo.
MFC after:	1 week
2012-02-26 19:10:14 +00:00
Martin Matuska
e7af90ab00 Analogous to r232059, add a parameter for the ZFS file system:
allow.mount.zfs:
	allow mounting the zfs filesystem inside a jail

This way the permssions for mounting all current VFCF_JAIL filesystems
inside a jail are controlled wia allow.mount.* jail parameters.

Update sysctl descriptions.
Update jail(8) and zfs(8) manpages.

TODO:	document the connection of allow.mount.* and VFCF_JAIL for kernel
	developers

MFC after:	10 days
2012-02-26 16:30:39 +00:00
Jilles Tjoelker
581400dfed Fix fchmod() and fchown() on fifos.
The new fifo implementation in r232055 broke fchmod() and fchown() on fifos.
Postfix needs this.

Submitted by:	gianni
Reported by:	dougb
2012-02-26 15:14:29 +00:00
Mikolaj Golub
6ce13747dc Add sysctl to retrieve or set umask of another process.
Submitted by:	Dmitry Banschikov <me ubique spb ru>
Discussed with:	kib, rwatson
Reviewed by:	kib
MFC after:	2 weeks
2012-02-26 14:25:48 +00:00
Konstantin Belousov
747d2fa178 Add SO_PROTOCOL/SO_PROTOTYPE socket SOL_SOCKET-level option to get the
socket protocol number.  This is useful since the socket type can
be implemented by different protocols in the same protocol family,
e.g. SOCK_STREAM may be provided by both TCP and SCTP.

Submitted by:	Jukka A. Ukkonen <jau iki fi>
PR:	  kern/162352
Discussed with:	bz
Reviewed by:	glebius
MFC after:	2 weeks
2012-02-26 13:55:43 +00:00
Konstantin Belousov
9493639e35 Remove apparently redundand checks for socket so_proto being non-NULL
from sosetopt() and sogetopt().  No exposed sockets may have so_proto
invalid.

Discussed with:	bz, rwatson
Reviewed by:	glebius
MFC after:	2 weeks
2012-02-26 13:51:05 +00:00
Maxim Konovalov
7dfdd83d56 o Reduce chances for integer overflow.
o More verbose sysctl description added.

MFC after:	2 weeks
Sponsored by:	Nginx, Inc.
2012-02-25 12:06:40 +00:00
Mikolaj Golub
662c901c54 When detaching an unix domain socket, uipc_detach() checks
unp->unp_vnode pointer to detect if there is a vnode associated with
(binded to) this socket and does necessary cleanup if there is.

The issue is that after forced unmount this check may be too late as
the unp_vnode is reclaimed and the reference is stale.

To fix this provide a helper function that is called on a socket vnode
reclamation to do necessary cleanup.

Pointed by:	kib
Reviewed by:	kib
MFC after:	2 weeks
2012-02-25 10:15:41 +00:00
David Xu
df1f1bae9e In revision 231989, we pass a 16-bit clock ID into kernel, however
according to POSIX document, the clock ID may be dynamically allocated,
it unlikely will be in 64K forever. To make it future compatible, we
pack all timeout information into a new structure called _umtx_time, and
use fourth argument as a size indication, a zero means it is old code
using timespec as timeout value, but the new structure also includes flags
and a clock ID, so the size argument is different than before, and it is
non-zero. With this change, it is possible that a thread can sleep
on any supported clock, though current kernel code does not have such a
POSIX clock driver system.
2012-02-25 02:12:17 +00:00
Konstantin Belousov
dcdc6c361b Restore the return statement erronously removed in the r232048.
Submitted by:	cognet
Pointy hat to:	kib (reuse the one I already got today)
MFC after:	13 days
2012-02-24 11:02:35 +00:00
Martin Matuska
bf3db8aa65 To improve control over the use of mount(8) inside a jail(8), introduce
a new jail parameter node with the following parameters:

allow.mount.devfs:
	allow mounting the devfs filesystem inside a jail

allow.mount.nullfs:
	allow mounting the nullfs filesystem inside a jail

Both parameters are disabled by default (equals the behavior before
devfs and nullfs in jails). Administrators have to explicitly allow
mounting devfs and nullfs for each jail. The value "-1" of the
devfs_ruleset parameter is removed in favor of the new allow setting.

Reviewed by:	jamie
Suggested by:	pjd
MFC after:	2 weeks
2012-02-23 18:51:24 +00:00
Kip Macy
11ac7ec076 merge pipe and fifo implementations
Also reviewed by: jhb, jilles (initial revision)
Tested by: pho, jilles

Submitted by:	gianni
Reviewed by:	bde
2012-02-23 18:37:30 +00:00
Christian Brueffer
6bdc1841a9 Catch up with r195837 (2.5 years ago) which renamed net_add_domain() to domain_add().
PR:		165424
Submitted by:	Lachlan Kang
MFC after:	1 week
2012-02-23 17:47:19 +00:00
Konstantin Belousov
dcd432817e Allow the parent to gather the exit status of the children reparented
to the debugger.  When reparenting for debugging, keep the child in
the new orphan list of old parent.  When looping over the children in
kern_wait(), iterate over both children list and orphan list to search
for the process by pid.

Submitted by:	Dmitry Mikulin <dmitrym juniper.net>
MFC after:	2 weeks
2012-02-23 11:50:23 +00:00
David Xu
f911d9fa4d Fix typo. 2012-02-22 07:34:23 +00:00
David Xu
b13a8fa78f Use unused fourth argument of umtx_op to pass flags to kernel for operation
UMTX_OP_WAIT. Upper 16bits is enough to hold a clock id, and lower
16bits is used to pass flags. The change saves a clock_gettime() syscall
from libthr.
2012-02-22 03:22:49 +00:00
Mikolaj Golub
a95852edf3 unp_connect() may use a shared lock on the vnode to fetch the socket.
Suggested by:	jhb
Reviewed by:	jhb, kib, rwatson
MFC after:	2 weeks
2012-02-21 19:40:13 +00:00
Konstantin Belousov
526d0bd547 Fix found places where uio_resid is truncated to int.
Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the
sysctl to zero allows to perform the SSIZE_MAX-sized i/o requests from
the usermode.

Discussed with:	bde, das (previous versions)
MFC after:	1 month
2012-02-21 01:05:12 +00:00
Xin LI
fcdd3d322b Revert r231923 for now. Further work is needed to make sure that the
behavior is consistent.
2012-02-20 09:32:32 +00:00
Xin LI
5bfbb59851 Use uprintf instead of printf for the reason why a kernel module can not
be loaded.  This way, the administrator can get response immediately from
the shell session rather than relying on dmesg.

MFC after:	1 month
2012-02-20 01:05:17 +00:00
Alan Cox
7dc0ace10e Close a race due to dropping of the map lock between creating a map entry
for a shared mapping and marking the entry for inheritance.

Reviewed by:	kib
X-MFC after:	r231526
2012-02-19 00:28:49 +00:00
Konstantin Belousov
3494f31ad2 Fix misuse of the kernel map in miscellaneous image activators.
Vnode-backed mappings cannot be put into the kernel map, since it is a
system map.

Use exec_map for transient mappings, and remove the mappings with
kmem_free_wakeup() to notify the waiters on available map space.

Do not map the whole executable into KVA at all to copy it out into
usermode.  Directly use vn_rdwr() for the case of not page aligned
binary.

There is one place left where the potentially unbounded amount of data
is mapped into exec_map, namely, in the COFF image activator
enumeration of the needed shared libraries.

Reviewed by:   alc
MFC after:     2 weeks
2012-02-17 23:47:16 +00:00
Bjoern A. Zeeb
9dba179d5e IFC @231845
Sponsored by:	Cisco Systems, Inc.
2012-02-17 00:27:48 +00:00
Eitan Adler
f17a6f1b17 Add a timestamp to the msgbuf output in order to determine when when
messages were printed.

This can be enabled with the kern.msgbuf_show_timestamp sysctl

PR:		kern/161553
Reviewed by:	avg
Submitted by:	Arnaud Lacombe <lacombar@gmail.com>
Approved by:	cperciva
MFC after:	1 month
2012-02-16 05:11:35 +00:00
Konstantin Belousov
343b391f20 The PTRACESTOP() macro is used only once. Inline the only use and remove
the macro.

MFC after:	1 week
2012-02-11 14:49:25 +00:00
Ed Schouten
852b05c5b5 Remove unneeded newline. It fits in 80 columns now.
Pointed out by:	jh
2012-02-10 14:55:47 +00:00
Ed Schouten
8fac9b7b7d Merge si_name and __si_namebuf.
The si_name pointer always points to the __si_namebuf member inside the
same object. Remove it and rename __si_namebuf to si_name.
2012-02-10 12:40:50 +00:00
Kevin Lo
de02885a7b Add a missing break. This bug was introduced in r228856. 2012-02-10 06:30:52 +00:00
Konstantin Belousov
db3273398b Mark the automatically attached child with PL_FLAG_CHILD in struct
lwpinfo flags, for PT_FOLLOWFORK auto-attachment.

In collaboration with:	Dmitry Mikulin <dmitrym juniper net>
MFC after:	 1 week
2012-02-10 00:02:13 +00:00
Martin Matuska
0cc207a6f5 Add support for mounting devfs inside jails.
A new jail(8) option "devfs_ruleset" defines the ruleset enforcement for
mounting devfs inside jails. A value of -1 disables mounting devfs in
jails, a value of zero means no restrictions. Nested jails can only
have mounting devfs disabled or inherit parent's enforcement as jails are
not allowed to view or manipulate devfs(8) rules.

Utilizes new functions introduced in r231265.

Reviewed by:	jamie
MFC after:	1 month
2012-02-09 10:22:08 +00:00
Konstantin Belousov
9cfb2326bc Unbreak detection of the async mode for clustered writes after r231075.
Submitted by:	bde
MFC after:	12 days
2012-02-08 15:07:19 +00:00
Pawel Jakub Dawidek
12075c0936 Allow to set kern.ipc.shmmax from /boot/loader.conf.
MFC after:	1 week
2012-02-08 09:18:22 +00:00
Ed Schouten
cd864a19a5 Fix whitespace inconsistencies in TTY code. 2012-02-06 18:15:46 +00:00
John Baldwin
bf40d24a3f Rename cache_lookup_times() to cache_lookup() and retire the old API and
ABI stub for cache_lookup().
2012-02-06 17:00:28 +00:00
Konstantin Belousov
c480f781ea Current implementations of sync(2) and syncer vnode fsync() VOP uses
mnt_noasync counter to temporary remove MNTK_ASYNC mount option, which
is needed to guarantee a synchronous completion of the initiated i/o
before syscall or VOP return.  Global removal of MNTK_ASYNC option is
harmful because not only i/o started from corresponding thread becomes
synchronous, but all i/o is synchronous on the filesystem which is
initiated during sync(2) or syncer activity.

Instead of removing MNTK_ASYNC from mnt_kern_flag, provide a local
thread flag to disable async i/o for current thread only. Use the
opportunity to move DOINGASYNC() macro into sys/vnode.h and
consistently use it through places which tested for MNTK_ASYNC.

Some testing demonstrated 60-70% improvements in run time for the
metadata-intensive operations on async-mounted UFS volumes, but still
with great deviation due to other reasons.

Reviewed by:	mckusick
Tested by:	scottl
MFC after:	2 weeks
2012-02-06 11:04:36 +00:00
Kevin Lo
0ca2381d9b - Use uint8_t for the variable x and spell the size of the variable
as sizeof(x)
- Capitalized comment
- Parentheses around return value

Requested by:	bde
2012-02-06 06:03:16 +00:00
Martin Matuska
a91d2201f9 Analogous to r230407 a separate path buffer in vfs_mount.c is required
for r230129. Fixes a out of bounds write to fspath.

MFC after:	10 days
2012-02-05 10:59:50 +00:00
David Xu
d56e058a79 Add 32-bit compat code for AIO kevent flags introduced in revision 230857. 2012-02-05 04:49:31 +00:00
Ryan Stone
312ac3a23a Whenever a new kernel thread is spawned, explicitly clear any CPU affinity
set on the new thread.  This prevents the thread from inadvertently
inheriting affinity from a random sibling.

Submitted by:	attilio
Tested by:	pho
MFC after:	1 week
2012-02-04 16:49:29 +00:00
Hiroki Sato
cf8b832511 Fix input validation in SO_SETFIB.
Reviewed by:	bz
MFC after:	1 day
2012-02-04 15:00:26 +00:00
Bjoern A. Zeeb
ee799639e8 Add SO_SETFIB option support on PF_INET6 sockets and allow inheriting the
FIB number from the process, as set by setfib(2), on socket creation.

Sponsored by:	Cisco Systems, Inc.
2012-02-03 11:00:53 +00:00
Konstantin Belousov
6af519cf18 Add kqueue support to /dev/klog.
Submitted by:	Mateusz Guzik <mjguzik gmail com>
PR:	  kern/156423
MFC after:	1 weeks
2012-02-01 14:34:52 +00:00
David Xu
fde809356a If multiple threads call kevent() to get AIO events on same kqueue fd,
it is possible that a single AIO event will be reported to multiple
threads, it is not threading friendly, and the existing API can not
control this behavior.
Allocate a kevent flags field sigev_notify_kevent_flags for AIO event
notification in sigevent, and allow user to pass EV_CLEAR, EV_DISPATCH
or EV_ONESHOT to AIO kernel code, user can control whether the event
should be cleared once it is retrieved by a thread. This change should
be comptaible with existing application, because the field should have
already been zero-filled, and no additional action will be taken by
kernel.

PR:	kern/156567
2012-02-01 02:53:06 +00:00
Konstantin Belousov
6ad1ff09cc A debugger which requested PT_FOLLOW_FORK should get the notification
about new child not only when doing PT_TO_SCX, but also for PT_CONTINUE.
If TDB_FORK flag is set, always issue a stop, the same as is done for
TDB_EXEC.

Reported by:	Dmitry Mikulin <dmitrym juniper net>
MFC after:	1 week
2012-01-30 20:00:29 +00:00
John Baldwin
2bd3e4c2c2 Refine the implementation of POSIX_FADV_NOREUSE for the read(2) case such
that instead of using direct I/O it allows read-ahead similar to
POSIX_FADV_NORMAL, but invokes VOP_ADVISE(POSIX_FADV_DONTNEED) after the
read(2) has completed to purge just-read data.  The write(2) path continues
to use direct I/O for POSIX_FADV_NOREUSE for now.  Note that NOREUSE works
optimally if an application reads and writes full fs blocks.
2012-01-30 19:35:15 +00:00
Doug Ambrisko
8e9fc27818 When detaching an AIO or LIO requests grab the lock and tell knlist_remove
that we have the lock now.  This cleans up a locking panic ASSERT when
knlist_empty is called without a lock when INVARIANTS etc. are turned.

Reviewed by:	kib jhb
MFC after:	1 week
2012-01-30 19:19:22 +00:00
Konstantin Belousov
62c625fdd2 Finally, try to enable the nxstacks on amd64 and powerpc64 for both 64bit
and 32bit ABIs. Also try to enable nxstacks for PAE/i386 when supported,
and some variants of powerpc32.

MFC after:	2 months (if ever)
2012-01-30 07:56:00 +00:00
Attilio Rao
5d7380f8e3 Avoid to check the same cache line/variable from all the locking
primitives by breaking stop_scheduler into a per-thread variable.
Also, store the new td_stopsched very close to td_*locks members as
they will be accessed mostly in the same codepaths as td_stopsched and
this results in avoiding a further cache-line pollution, possibly.

STOP_SCHEDULER() was pondered to use a new 'thread' argument, in order to
take advantage of already cached curthread, but in the end there should
not really be a performance benefit, while introducing a KPI breakage.

In collabouration with:	flo
Reviewed by:	avg
MFC after:	3 months (or never)
X-MFC:		r228424
2012-01-28 14:00:21 +00:00
Gleb Smirnoff
94fce84763 Fix size check, that prevents getting negative after casting
to a signed type

Reviewed by:	bde
2012-01-27 08:58:58 +00:00
Kenneth D. Merry
7e949c467c Xen netback driver rewrite.
share/man/man4/Makefile,
share/man/man4/xnb.4,
sys/dev/xen/netback/netback.c,
sys/dev/xen/netback/netback_unit_tests.c:

	Rewrote the netback driver for xen to attach properly via newbus
	and work properly in both HVM and PVM mode (only HVM is tested).
	Works with the in-tree FreeBSD netfront driver or the Windows
	netfront driver from SuSE.  Has not been extensively tested with
	a Linux netfront driver.  Does not implement LRO, TSO, or
	polling.  Includes unit tests that may be run through sysctl
	after compiling with XNB_DEBUG defined.

sys/dev/xen/blkback/blkback.c,
sys/xen/interface/io/netif.h:

	Comment elaboration.

sys/kern/uipc_mbuf.c:

	Fix page fault in kernel mode when calling m_print() on a
	null mbuf.  Since m_print() is only used for debugging, there
	are no performance concerns for extra error checking code.

sys/kern/subr_scanf.c:

	Add the "hh" and "ll" width specifiers from C99 to scanf().
	A few callers were already using "ll" even though scanf()
	was handling it as "l".

Submitted by:	Alan Somers <alans@spectralogic.com>
Submitted by:	John Suykerbuyk <johns@spectralogic.com>
Sponsored by:	Spectra Logic
MFC after:	1 week
Reviewed by:	ken
2012-01-26 16:35:09 +00:00
Gleb Smirnoff
434ea137cc Although aio_nbytes is size_t, later is is signed to
casted types: to ssize_t in filesystem code and to
int in buf code, thus supplying a negative argument
leads to kernel panic later. To fix that check user
supplied argument in the beginning of syscall.

Submitted by:	Maxim Dounin <mdounin mdounin.ru>, maxim@
2012-01-26 11:59:48 +00:00
Konstantin Belousov
abc942b56c When doing vflush(WRITECLOSE), clean vnode pages.
Unmounts do vfs_msync() before calling VFS_UNMOUNT(), but there is
still a race allowing a process to dirty pages after msync
finished. Remounts rw->ro just left dirty pages in system.

Reviewed by:	alc, tegge (long time ago)
Tested by:	pho
MFC after:	2 weeks
2012-01-25 20:54:09 +00:00
Konstantin Belousov
d5210589b7 Fix remaining calls to cache_enter() in both NFS clients to provide
appropriate timestamps.  Restore the assertions which verify that
NCF_TS is set when timestamp is asked for.

Reviewed by:  jhb (previous version)
MFC after:    2 weeks
2012-01-25 20:48:20 +00:00
Mikolaj Golub
45efc9b4aa Fix CTL flags in the declarations of KERN_PROC_ENV, AUXV and
PS_STRINGS sysctls: they are read only.

MFC after:	1 week
2012-01-25 20:15:58 +00:00
Konstantin Belousov
7a7e609a32 Apparently, both nfs clients do not use cache_enter_time()
consistently, creating some namecache entries without NCF_TS flag.
This causes panic due to failed assertion.

As a temporal relief, remove the assert. Return epoch timestamp for
the entries without timestamp if asked.

While there, consolidate the code which returns timestamps, into a
helper cache_out_ts().

Discussed with:	 jhb
MFC after: 2 weeks
2012-01-23 17:09:23 +00:00
Gleb Smirnoff
93a1b4c4cf Convert panic()s to KASSERT()s. This is an optimisation for
hashdestroy() since in absence of INVARIANTS a compiler
will drop the entire for() cycle.
2012-01-23 16:31:46 +00:00
Mikolaj Golub
8854fe3915 Change kern.proc.rlimit sysctl to:
- retrive only one, specified limit for a process, not the whole
  array, as it was previously (the sysctl has been added recently and
  has not been backported to stable yet, so this change is ok);

- allow to set a resource limit for another process.

Submitted by:	Andrey Zonov <andrey at zonov.org>
Discussed with:	kib
Reviewed by:	kib
MFC after:	2 weeks
2012-01-22 20:25:00 +00:00
Pawel Jakub Dawidek
9b9a01792d TDF_* flags should be used with td_flags field and TDP_* flags should be used
with td_pflags field. Correct two places where it was not the case.

Discussed with:	kib
MFC after:	1 week
2012-01-22 11:01:36 +00:00
Konstantin Belousov
c2b396f294 Remove the nc_time and nc_ticks elements from struct namecache, and
provide struct namecache_ts which is the old struct namecache. Only
allocate struct namecache_ts if non-null struct timespec *tsp was
passed to cache_enter_time, otherwise use struct namecache.

Change struct namecache allocation and deallocation macros into static
functions, since logic becomes somewhat twisty.  Provide accessor for
the nc_name member of struct namecache to hide difference between
struct namecache and namecache_ts.

The aim of the change is to not waste 20 bytes per small namecache
entry.

Reviewed by:	 jhb
MFC after: 2 weeks
X-MFC-note:  after r230394
2012-01-22 01:11:06 +00:00
Martin Matuska
6dfe0a3dc2 Use separate buffer for global path to avoid overflow of path buffer.
Reviewed by:	jamie@
MFC after:	3 weeks
2012-01-21 00:06:21 +00:00
John Baldwin
5aefb4cbbf Close a race in NFS lookup processing that could result in stale name cache
entries on one client when a directory was renamed on another client.  The
root cause for the stale entry being trusted is that each per-vnode nfsnode
structure has a single 'n_ctime' timestamp used to validate positive name
cache entries.  However, if there are multiple entries for a single vnode,
they all share a single timestamp.  To fix this, extend the name cache
to allow filesystems to optionally store a timestamp value in each name
cache entry.  The NFS clients now fetch the timestamp associated with
each name cache entry and use that to validate cache hits instead of the
timestamps previously stored in the nfsnode.  Another part of the fix is
that the NFS clients now use timestamps from the post-op attributes of
RPCs when adding name cache entries rather than pulling the timestamps out
of the file's attribute cache.  The latter is subject to races with other
lookups updating the attribute cache concurrently.  Some more details:
- Add a variant of nfsm_postop_attr() to the old NFS client that can return
  a vattr structure with a copy of the post-op attributes.
- Handle lookups of "." as a special case in the NFS clients since the name
  cache does not store name cache entries for ".", so we cannot get a
  useful timestamp.  It didn't really make much sense to recheck the
  attributes on the the directory to validate the namecache hit for "."
  anyway.
- ABI compat shims for the name cache routines are present in this commit
  so that it is safe to MFC.

MFC after:	2 weeks
2012-01-20 20:02:01 +00:00
Konstantin Belousov
2974cc36f7 Use shared lock for the executable vnode in the exec path after the
VV_TEXT changes are handled. Assert that vnode is exclusively locked at
the places that modify VV_TEXT.

Discussed with:	alc
MFC after:	3 weeks
2012-01-19 23:03:31 +00:00
Alan Cox
1dfab8025e Explain why it is safe to unlock the vnode.
Requested by:	kib
2012-01-17 16:20:50 +00:00
Kirk McKusick
cc672d3599 Make sure all intermediate variables holding mount flags (mnt_flag)
and that all internal kernel calls passing mount flags are declared
as uint64_t so that flags in the top 32-bits are not lost.

MFC after: 2 weeks
2012-01-17 01:08:01 +00:00
Alan Cox
292177e67a Improve abstraction. Eliminate direct access by elf*_load_section()
to an OBJT_VNODE-specific field of the vm object.  The same
information can be just as easily obtained from the struct vattr that
is in struct image_params if the latter is passed to
elf*_load_section().  Moreover, by replacing the vmspace and vm
object parameters to elf*_load_section() with a struct image_params
parameter, we actually reduce the size of the object code.

In collaboration with:	kib
2012-01-17 00:27:32 +00:00
Sergey Kandaurov
037f43d3ef Be pedantic and change // comment to C-style one.
Noticed by:		Bruce Evans
2012-01-16 20:42:56 +00:00
Kevin Lo
575cabed9e Fix a style bug
Spotted by:	avg
2012-01-16 14:54:48 +00:00
David Xu
29a06690ca Eliminate branch and insert an explicit reader memory barrier to ensure
that waiter bit is set before reading semaphore count.
2012-01-16 04:39:10 +00:00
Mikolaj Golub
fe7f89b71a Abrogate nchr argument in proc_getargv() and proc_getenvv(): we always want
to read strings completely to know the actual size.

As a side effect it fixes the issue with kern.proc.args and kern.proc.env
sysctls, which didn't return the size of available data when calling
sysctl(3) with the NULL argument for oldp.

Note, in get_ps_strings(), which does actual work for proc_getargv() and
proc_getenvv(), we still have a safety limit on the size of data read in
case of a corrupted procces stack.

Suggested by:	kib
MFC after:	3 days
2012-01-15 18:47:24 +00:00
Martin Matuska
9cbe30e1d5 Fix missing in r230129:
kern_jail.c: initialize fullpath_disabled to zero
vfs_cache.c: add missing dot in comment

Reported by:	kib
MFC after:	1 month
2012-01-15 18:08:15 +00:00
Ulrich Spörlein
9a14aa017b Convert files to UTF-8 2012-01-15 13:23:18 +00:00
Martin Matuska
f6e633a9e1 Introduce vn_path_to_global_path()
This function updates path string to vnode's full global path and checks
the size of the new path string against the pathlen argument.

In vfs_domount(), sys_unmount() and kern_jail_set() this new function
is used to update the supplied path argument to the respective global path.

Unbreaks jailed zfs(8) with enforce_statfs set to 1.

Reviewed by:	kib
MFC after:	1 month
2012-01-15 12:08:20 +00:00
Eitan Adler
886e862866 - Fix undefined behavior when device_get_name is null
- Make error message more informative

PR:		kern/149800
Submitted by:	olgeni
Approved by:	cperciva
MFC after:	1 week
2012-01-15 07:09:18 +00:00
Oleksandr Tymoshenko
4104e83567 Fix kernel modules loading for MIPS64 kernel:
On amd64, link_elf_obj.c must specify KERNBASE rather than
    VM_MIN_KERNEL_ADDRESS to vm_map_find() because kernel loadable
    modules must be mapped for execution in the same upper region
    of the kernel map as the kernel code and data segments.

    For MIPS32 KERNBASE lies below KVA area (it's less than
    VM_MIN_KERNEL_ADDRESS) so basically vm_map_find got whole
    KVA to look through. On MIPS64 it's not the case because
    KERNBASE is set to the very end of XKSEG, well out of KVA
    bounds, so vm_map_find always fails. We should use
    VM_MIN_KERNEL_ADDRESS as a base for vm_map_find.

Details obtained from: alc@
2012-01-14 00:36:07 +00:00
John Baldwin
fbcebf7f71 Convert the per-interface address list lock from a mutex to a reader/writer
lock.

Reviewed by:	bz
2012-01-09 19:34:12 +00:00
Andriy Gapon
90d8265326 enable stop_scheduler_on_panic by default
My plan is to make this behavior unconditional before 10.0 release.

X-MFC after:	r228424 (if ever)
2012-01-09 12:06:09 +00:00
Konstantin Belousov
3ab0160340 Avoid LOR between vfs_busy() lock and covered vnode lock on quotaon().
The vfs_busy() is after covered vnode lock in the global lock order, but
since quotaon() does recursive VFS call to open quota file, we usually
end up locking covered vnode after mp is busied in sys_quotactl().

Change the interface of VFS_QUOTACTL(), requiring that mp was unbusied
by fs code, and do not try to pick up vfs_busy() reference in ufs quotaon,
esp. if vfs_busy cannot succeed due to unmount being performed.

Reported and tested by:	pho
MFC after:	1 week
2012-01-08 23:06:53 +00:00
Alan Cox
2971897d51 Correct an error of omission in the implementation of the truncation
operation on POSIX shared memory objects and tmpfs.  Previously, neither of
these modules correctly handled the case in which the new size of the object
or file was not a multiple of the page size.  Specifically, they did not
handle partial page truncation of data stored on swap.  As a result, stale
data might later be returned to an application.

Interestingly, a data inconsistency was less likely to occur under tmpfs
than POSIX shared memory objects.  The reason being that a different mistake
by the tmpfs truncation operation helped avoid a data inconsistency.  If the
data was still resident in memory in a PG_CACHED page, then the tmpfs
truncation operation would reactivate that page, zero the truncated portion,
and leave the page pinned in memory.  More precisely, the benevolent error
was that the truncation operation didn't add the reactivated page to any of
the paging queues, effectively pinning the page.  This page would remain
pinned until the file was destroyed or the page was read or written.  With
this change, the page is now added to the inactive queue.

Discussed with:	jhb
Reviewed by:	kib (an earlier version)
MFC after:	3 weeks
2012-01-08 20:09:26 +00:00
Hiroki Sato
ca54e1aee3 Fix a typo. (s/nessesary/necessary/) 2012-01-08 18:48:36 +00:00
John Baldwin
71eeeaf256 Add 5 spare VOPs as placeholders to avoid breaking the KBI in the future
when new VOPs are MFC'd to a branch.

Reviewed by:	kib, bz
MFC after:	3 days
2012-01-06 20:06:45 +00:00
John Baldwin
908cac07ce Use proper argument structure types for the extattr post-VOP hooks.
The wrong structure happened to work since the only argument used was
the vnode which is in the same place in both VOP_SETATTR() and the two
extattr VOPs.

MFC after:	3 days
2012-01-06 20:05:48 +00:00
John Baldwin
948c460971 Fix a logic bug in change 228207 in the check for a thread's new user
priority being a realtime priority.

MFC after:	3 days
2012-01-05 19:02:52 +00:00
John Baldwin
137f91e80f Convert all users of IF_ADDR_LOCK to use new locking macros that specify
either a read lock or write lock.

Reviewed by:	bz
MFC after:	2 weeks
2012-01-05 19:00:36 +00:00
John Baldwin
7e3a96ea37 Some small fixes to CPU accounting for threads:
- Only initialize the per-cpu switchticks and switchtime in sched_throw()
  for the very first context switch on APs during boot.  This avoids a
  small gap between the middle of thread_exit() and sched_throw() where
  time is not accounted to any thread.
- In thread_exit(), update the timestamp bookkeeping to track the changes
  to mi_switch() introduced by td_rux so that the code once again matches
  the comment claiming it is mimicing mi_switch().  Specifically, only
  update the per-thread stats directly and depend on ruxagg() to update
  p_rux rather than adjusting p_rux directly.  While here, move the
  timestamp bookkeeping as late in the function as possible.

Reviewed by:	bde, kib
MFC after:	1 week
2012-01-03 21:03:28 +00:00
Ed Schouten
dc15eac046 Use strchr() and strrchr().
It seems strchr() and strrchr() are used more often than index() and
rindex(). Therefore, simply migrate all kernel code to use it.

For the XFS code, remove an empty line to make the code identical to
the code in the Linux kernel.
2012-01-02 12:12:10 +00:00
Konstantin Belousov
cdb7a43117 Avoid double-unlock or double unreference for ndp->ni_dvp when the vnode dp
lock upgrade right after the 'success' label fails.

In collaboration with:	pho
MFC after:	1 week
2012-01-01 18:45:59 +00:00
John Baldwin
0c0d27d5dd Cap the priority calculated from the current thread's running tick count
at SCHED_PRI_RANGE to prevent overflows in the priority value.  This can
happen due to irregularities with clock interrupts under certain
virtualization environments.

Tested by:	Larry Rosenman  ler lerctr org
MFC after:	2 weeks
2011-12-29 16:17:16 +00:00
Lawrence Stewart
6cedd609b7 Introduce the sysclock_getsnapshot() and sysclock_snap2bintime() KPIs. The
sysclock_getsnapshot() function allows the caller to obtain a snapshot of all
the system clock and timecounter state required to create time stamps at a later
point. The sysclock_snap2bintime() function converts a previously obtained
snapshot into a bintime time stamp according to the specified flags e.g. which
system clock, uptime vs absolute time, etc.

These KPIs enable useful functionality, including direct comparison of the
feedback and feed-forward system clocks and generation of multiple time stamps
with different formats from a single timecounter read.

Committed on behalf of Julien Ridoux and Darryl Veitch from the University of
Melbourne, Australia, as part of the FreeBSD Foundation funded "Feed-Forward
Clock Synchronization Algorithms" project.

For more information, see http://www.synclab.org/radclock/

In collaboration with:	Julien Ridoux (jridoux at unimelb edu au)
2011-12-24 01:32:01 +00:00
John Baldwin
f0d6c5caf0 Add post-VOP hooks for VOP_DELETEEXTATTR() and VOP_SETEXTATTR() and use
these to trigger a NOTE_ATTRIB EVFILT_VNODE kevent when the extended
attributes of a vnode are changed.

Note that OS X already implements this behavior.

Reviewed by:	rwatson
MFC after:	2 weeks
2011-12-23 20:11:37 +00:00
John Baldwin
268e76d86e Use TASK_INITIALIZER() for dev_dtr_task rather than a dedicated SYSINIT(). 2011-12-22 16:01:10 +00:00
Andriy Gapon
167057914b ule: ensure that batch timeshare threads are scheduled fairly
With the previous code, if the range of priorities for timeshare batch
threads was greater than RQ_NQS, then the threads with low priorities in
the part of the range above RQ_NQS would be scheduled to the run-queues
as if they had high priorities at the beginning of the range.
In other words, threads with a nice level of +N could be scheduled as
if they had a nice level of -M.

Reported by:	George Mitchell <george@m5p.com>
Reviewed by:	jhb
Tested by:	George Mitchell <george@m5p.com> (earlier version)
MFC after:	1 week
2011-12-19 20:01:21 +00:00
Mikolaj Golub
547b155eb1 Fix style and white spaces.
MFC after:	1 week
2011-12-17 22:18:26 +00:00
Mikolaj Golub
fa3935bcea On start most of sysctl_kern_proc functions use the same pattern:
locate a process calling pfind() and do some additional checks like
p_candebug(). To reduce this code duplication a new function pget() is
introduced and used.

As the function may be useful not only in kern_proc.c it is in the
kernel name space.

Suggested by:	kib
Reviewed by:	kib
MFC after:	2 weeks
2011-12-17 16:59:22 +00:00
Andriy Gapon
f389bc9585 belatedly transfer copyrights from libkern/gets.c to kern_cons.c
MFC after:	2 months
MFC with:	r228642
2011-12-17 15:50:45 +00:00
Andriy Gapon
f6ce353e58 replace uses of libkern gets with cngets
MFC after:	2 months
2011-12-17 15:26:34 +00:00
Andriy Gapon
8e62854265 introduce cngets, a method for kernel to read a string from console
This is intended as a replacement for libkern's gets and mostly borrows
its implementation.  It uses cngrab/cnungrab to delimit kernel's access
to console input.

Note: libkern's gets obviously doesn't share any bits of implementation
iwth libc's gets.  They also have different APIs and the former doesn't
have the overflow problems of the latter.

Inspired by:	bde
MFC after:	2 months
2011-12-17 15:16:54 +00:00
Andriy Gapon
bf8696b408 introduce cngrab/cnungrab stub calls in some places where they make sense
MFC after:	2 months
2011-12-17 15:11:22 +00:00
Andriy Gapon
9976156f12 kern cons: introduce infrastructure for console grabbing by kernel
At the moment grab and ungrab methods of all console drivers are no-ops.

Current intended meaning of the calls is that the kernel takes control of
console input.  In the future the semantics may be extended to mean that
the calling thread takes full ownership of the console (e.g. console
output from other threads could be suspended).

Inspired by:	bde
MFC after:	2 months
2011-12-17 15:08:43 +00:00
John Baldwin
f427c78b19 Fire a kevent if necessary after seeking on a regular file. This fixes a
case where a kevent would not fire on a regular file if an application read
to EOF and then seeked backwards into the file.

Reviewed by:	kib
MFC after:	2 weeks
2011-12-16 20:10:00 +00:00
John Baldwin
338e7cf235 Use vm_mmap_to_errno().
Submitted by:	kib
2011-12-15 15:17:19 +00:00
Jilles Tjoelker
6d1c58f8a2 Fix select/poll/kqueue for write on reverse direction before first write.
The reverse direction of a pipe is lazily allocated on the first write in
that direction (because pipes are usually used in one direction only).  A
special case is needed to ensure the pipe appears writable before the first
write because there are 0 bytes of pending data in 0 bytes of buffer space
at that point, leaving 0 bytes of data that can be written with the normal
code.

Note that the first write returns [ENOMEM] if kern.ipc.maxpipekva is
exceeded and does not block or return [EAGAIN], so selecting true for write
is correct even in that case.

PR:		kern/93685
Submitted by:	gianni
MFC after:	2 weeks
2011-12-14 22:26:39 +00:00
John Baldwin
fb680e16f4 Add a helper API to allow in-kernel code to map portions of shared memory
objects created by shm_open(2) into the kernel's address space.  This
provides a convenient way for creating shared memory buffers between
userland and the kernel without requiring custom character devices.
2011-12-14 22:22:19 +00:00
David E. O'Brien
1c5151f3f8 Match other formatting. 2011-12-14 02:31:32 +00:00
David E. O'Brien
3d7618d8bf Disallow various debug.kdb sysctl's when securelevel is raised.
PR:	161350
2011-12-13 17:59:16 +00:00
Eitan Adler
9910b854c6 - Add a sysctl to allow non-root users the ability to set idle
priorities.

- While here fix up some style nits.

Discussed with: cperciva (breifly)
Reviewed by:	pjd (earlier version)
Reviewed by:	bde
Approved by:	jhb
MFC after:	1 month
2011-12-13 14:00:27 +00:00
Eitan Adler
3eb9ab5255 Document a large number of currently undocumented sysctls. While here
fix some style(9) issues and reduce redundancy.

PR:		kern/155491
PR:		kern/155490
PR:		kern/155489
Submitted by:	Galimov Albert <wtfcrap@mail.ru>
Approved by:	bde
Reviewed by:	jhb
MFC after:	1 week
2011-12-13 00:38:50 +00:00
Andriy Gapon
7a7ce668ef put sys/systm.h at its proper place or add it if missing
Reported by:	lstewart, tinderbox
Pointyhat to:	avg, attilio
MFC after:	1 week
MFC with:	r228430
2011-12-12 10:05:13 +00:00
Andriy Gapon
0e225211a0 kern_racct: move sys/systm.h inclusion to its proper place
This should fix the build failure introduced with r228424.
Also remove duplicate inclusion of sys/param.h.

Pointyhat to:	avg
MFC after:	1 week
2011-12-12 07:46:10 +00:00
Andriy Gapon
353705930f panic: add a switch and infrastructure for stopping other CPUs in SMP case
Historical behavior of letting other CPUs merily go on is a default for
time being.  The new behavior can be switched on via
kern.stop_scheduler_on_panic tunable and sysctl.

Stopping of the CPUs has (at least) the following benefits:
- more of the system state at panic time is preserved intact
- threads and interrupts do not interfere with dumping of the system
  state

Only one thread runs uninterrupted after panic if stop_scheduler_on_panic
is set.  That thread might call code that is also used in normal context
and that code might use locks to prevent concurrent execution of certain
parts.  Those locks might be held by the stopped threads and would never
be released.  To work around this issue, it was decided that instead of
explicit checks for panic context, we would rather put those checks
inside the locking primitives.

This change has substantial portions written and re-written by attilio
and kib at various times.  Other changes are heavily based on the ideas
and patches submitted by jhb and mdf.  bde has provided many insights
into the details and history of the current code.

The new behavior may cause problems for systems that use a USB keyboard
for interfacing with system console.  This is because of some unusual
locking patterns in the ukbd code which have to be used because on one
hand ukbd is below syscons, but on the other hand it has to interface
with other usb code that uses regular mutexes/Giant for its concurrency
protection.  Dumping to USB-connected disks may also be affected.

PR:			amd64/139614 (at least)
In cooperation with:	attilio, jhb, kib, mdf
Discussed with:		arch@, bde
Tested by:		Eugene Grosbein <eugen@grosbein.net>,
			gnn,
			Steven Hartland <killing@multiplay.co.uk>,
			glebius,
			Andrew Boyer <aboyer@averesystems.com>
			(various versions of the patch)
MFC after:		3 months (or never)
2011-12-11 21:02:01 +00:00
Peter Holm
cdea31e305 Move cpu_set_upcall(newtd, td) up before the first call of
thread_free(newtd).  This to avoid a possible page fault in
cpu_thread_clean() as seen on amd64 with syscall fuzzing.

Reviewed by:	kib
MFC after:	1 week
2011-12-09 17:19:41 +00:00
Eitan Adler
5a01b72672 - Fix ktrace leakage if error is set
PR:		kern/163098
Submitted by:	Loganaden Velvindron <loganaden@devio.us>
Approved by:	sbruno@
MFC after:	1 month
2011-12-08 03:20:38 +00:00
Alan Cox
ea3f07d3a0 Eliminate stale numbers from a comment. 2011-12-07 16:27:23 +00:00
Alan Cox
c749c003b8 Eliminate the possibility of 32-bit arithmetic overflow in the calculation
of vm_kmem_size that may occur if the system administrator has specified a
vm.vm_kmem_size tunable value that exceeds the hard cap.

PR:		162741
Submitted by:	Adam McDougall
Reviewed by:	bde@
MFC after:	3 weeks
2011-12-07 07:03:14 +00:00
Konstantin Belousov
93c26de0ad Most users of pipe(2) do not call fstat(2) on the returned pipe descriptors.
Optimize for the case, by lazily allocating the pipe inode number at the
fstat(2) time. If alloc_unr(9) returns failure, do not fail fstat(2), since
uses of inode numbers are even rare then fstat(2), but provide zero inode
forever. Note that alloc_unr() failure is unlikely due to total number
of pipes in the system limited by the number of file descriptors.

Based on the submission by:	gianni
MFC after:	2 weeks
2011-12-06 11:24:03 +00:00
Mikolaj Golub
9e94d5b83f Really protect kern.proc.ps_strings sysctls with p_candebug(). This
was intended to be in r228288.

Spotted by:	many
MFC after:	1 week
2011-12-06 06:40:14 +00:00
Mikolaj Golub
c65932be9d Protect kern.proc.auxv and kern.proc.ps_strings sysctls with p_candebug().
Citing jilles:

If we are ever going to do ASLR, the AUXV information tells an attacker
where the stack, executable and RTLD are located, which defeats much of
the point of randomizing the addresses in the first place.

Given that the AUXV information seems to be used by debuggers only anyway,
I think it would be good to move it to p_candebug() now.

The full virtual memory maps (KERN_PROC_VMMAP, procstat -v) are already
under p_candebug().

Suggested by:	jilles
Discussed with:	rwatson
MFC after:	1 week
2011-12-05 19:34:02 +00:00
Kevin Lo
2b69bb1f27 Add a missing curly bracket 2011-12-05 10:34:52 +00:00
Andriy Gapon
5e27a60372 critical_exit: ignore td_owepreempt if kdb_active is set
calling mi_switch in such a context results in a recursion via
kdb_switch

Suggested by:	jhb
Reviewed by:	jhb
MFC after:	5 weeks
2011-12-04 21:27:41 +00:00
Mikolaj Golub
0f60ecdaa4 In sysctl_kern_proc_ps_strings() there is no much sense in checking
for P_WEXIT and P_SYSTEM flags.

Reviewed by:	kib
2011-12-04 21:24:01 +00:00
Hans Petter Selasky
494b6fec82 Make sure the description of pause() is
equivalent to its implementation.
No code change.

Suggested by:	Bruce Evans
MFC after:	3 days
2011-12-03 15:51:15 +00:00
Eitan Adler
f565f395c6 - Fix typos s/(more|less) then|\1 than/
Submitted by:	Davide Italiano <davide.italiano@gmail.com>
Approved by:	brucec
MFC after:	3 days
2011-12-03 15:41:37 +00:00
Peter Holm
9a1d0cf68f Use umtx_copyin_timeout() to copy and check timeout parameter.
In collaboration with:	kib
MFC after:	1 week
2011-12-03 12:35:13 +00:00
Peter Holm
662ebe9b53 Add umtx_copyin_timeout() and move parameter checks here.
In collaboration with:	kib
MFC after:	1 week
2011-12-03 12:30:58 +00:00
Peter Holm
ff77dfb0c1 Rename copyin_timeout32 to umtx_copyin_timeout32 and move parameter
check here. Include check for negative seconds value.

In collaboration with:	kib
MFC after:	1 week
2011-12-03 12:28:33 +00:00
Marius Strobl
002214d6bb It doesn't make much sense to check whether child is NULL after already
having dereferenced it. We either should generally check the device_t's
supplied to bus functions before using them (which we seem to virtually
never do) or just assume that they are not NULL.
While at it make this code fit 78 columns.

Found with:	Coverity Prevent(tm)
CID:		4230
2011-12-02 22:03:27 +00:00
Marius Strobl
f60d6c2bdf - In device_probe_child(9) check the return value of device_set_driver(9)
when actually setting a driver as especially ENOMEM is fatal in these
  cases.
- Annotate other calls to device_set_devclass(9) and device_set_driver(9)
  without the return value being checked and that are okay to fail.

Reviewed by:	yongari (slightly earlier version)
2011-12-02 21:19:14 +00:00
John Baldwin
593dd43eee When changing the user priority of a thread, change the real priority
in addition to the user priority for threads whose current real priority
is equal to the previous user priority or if the new priority is a
real-time priority.  This allows priority changes of other threads to
have an immediate effect.

MFC after:	2 weeks
2011-12-02 19:59:46 +00:00
Konstantin Belousov
5ed954efd1 If alloc_unr() call in the pipe_create() failed, then pipe->pipe_ino is
-1. But, because ino_t is unsigned, this case was not covered by the
test ino > 0 in pipeclose(), leading to the free_unr(-1). Fix it by
explicitely comparing with 0 and -1. [1]

Do no access freed memory, the inode number was cached to prevent access
to cpipe after it possibly was freed, but I failed to commit the right
patch.

Noted by:	gianni [1]
Pointy hat to:	kib
MFC after:	3 days
2011-12-01 11:36:41 +00:00
Lawrence Stewart
66dcfed32a Revise the sysctl handling code and restructure the hierarchy of sysctls
introduced when feed-forward clock support is enabled in the kernel:

- Rename the "choice" variable to "available".

- Streamline the implementation of the "active" variable's sysctl handler
  function.

- Create a kern.sysclock sysctl node for general sysclock related configuration
  options. Place the "available" and "active" variables under this node.

- Create a kern.sysclock.ffclock sysctl node for feed-forward clock specific
  configuration options. Place the "version" and "ffcounter_bypass" variables
  under this node.

- Tweak some of the description strings.

Discussed with:	Julien Ridoux (jridoux at unimelb edu au)
2011-12-01 07:19:13 +00:00
Konstantin Belousov
dc874f9881 Rename vm_page_set_valid() to vm_page_set_valid_range().
The vm_page_set_valid() is the most reasonable name for the m->valid
accessor.

Reviewed by:	attilio, alc
2011-11-30 17:39:00 +00:00
Lawrence Stewart
6f83fc5112 Make sysclock_active publicly available to external consumers.
Committed on behalf of Julien Ridoux and Darryl Veitch from the University of
Melbourne, Australia, as part of the FreeBSD Foundation funded "Feed-Forward
Clock Synchronization Algorithms" project.

For more information, see http://www.synclab.org/radclock/

Discussed with:	Julien Ridoux (jridoux at unimelb edu au)
Submitted by:	Julien Ridoux (jridoux at unimelb edu au)
2011-11-29 08:43:04 +00:00
Lawrence Stewart
88394fe42c Do away with the somewhat clunky sysclock_ops structure and associated code,
reimplementing the [get]{bin,nano,micro}[up]time() wrapper functions in terms of
the new "fromclock" API instead.

Committed on behalf of Julien Ridoux and Darryl Veitch from the University of
Melbourne, Australia, as part of the FreeBSD Foundation funded "Feed-Forward
Clock Synchronization Algorithms" project.

For more information, see http://www.synclab.org/radclock/

Discussed with:	Julien Ridoux (jridoux at unimelb edu au)
Submitted by:	Julien Ridoux (jridoux at unimelb edu au)
2011-11-29 08:33:40 +00:00
Lawrence Stewart
e977bac333 Make the fbclock_[get]{bin,nano,micro}[up]time() function prototypes public so
that new APIs with some performance sensitivity can be built on top of them.
These functions should not be called directly except in special circumstances.

Committed on behalf of Julien Ridoux and Darryl Veitch from the University of
Melbourne, Australia, as part of the FreeBSD Foundation funded "Feed-Forward
Clock Synchronization Algorithms" project.

For more information, see http://www.synclab.org/radclock/

Discussed with:	Julien Ridoux (jridoux at unimelb edu au)
Submitted by:	Julien Ridoux (jridoux at unimelb edu au)
2011-11-29 06:53:36 +00:00
Lawrence Stewart
c2a4ee9906 Fix an oversight in r227747 by calling fbclock_bin{up}time() directly from the
fbclock_{nanouptime|microuptime|bintime|nanotime|microtime}() functions to avoid
indirecting through a sysclock_ops wrapper function.

Committed on behalf of Julien Ridoux and Darryl Veitch from the University of
Melbourne, Australia, as part of the FreeBSD Foundation funded "Feed-Forward
Clock Synchronization Algorithms" project.

For more information, see http://www.synclab.org/radclock/

Submitted by:	Julien Ridoux (jridoux at unimelb edu au)
2011-11-29 06:12:19 +00:00
Mikolaj Golub
9732458f35 Add sysctl to retrieve ps_strings structure location of another process.
Suggested by:	kib
Reviewed by:	kib
2011-11-27 17:05:26 +00:00
Mikolaj Golub
4fd6053b43 In sysctl_kern_proc_auxv the process was released too early: we still
need to hold it when checking process sv_flags.

MFC after:	2 weeks
2011-11-27 16:56:01 +00:00
Lawrence Stewart
66761af34f Export the "ffclock" feature for kernels compiled with feed-forward clock
support.

Suggested by:	netchild
Reviewed by:	netchild
2011-11-26 01:44:37 +00:00
Mikolaj Golub
9e7d058351 Add sysctl to get process resource limits.
Reviewed by:	kib
MFC after:	2 weeks
2011-11-24 20:43:37 +00:00
Konstantin Belousov
561984be06 Fix a race between getvnode() dereferencing half-constructed file
and dupfdopen().

Reported and tested by:	pho
MFC after:	3 days
2011-11-24 20:34:06 +00:00
Mikolaj Golub
7ad9baae41 Fix build without INVARIANTS.
Discussed with:	kib
2011-11-23 08:11:04 +00:00
Hans Petter Selasky
3b12bdb58f Rename device_delete_all_children() into device_delete_children().
Suggested by:	jhb @ and marius @
MFC after:	1 week
2011-11-22 21:56:55 +00:00
Hans Petter Selasky
5b288d2abf Style change.
Suggested by:	jhb @ and marius @
MFC after:	1 week
2011-11-22 21:53:19 +00:00
Mikolaj Golub
c5cfcb1c19 Add new sysctls, KERN_PROC_ENV and KERN_PROC_AUXV, to return
environment strings and ELF auxiliary vectors from a process stack.

Make sysctl_kern_proc_args to read not cached arguments from the
process stack.

Export proc_getargv() and proc_getenvv() so they can be reused by
procfs and linprocfs.

Suggested by:	kib
Reviewed by:	kib
Discussed with:	kib, rwatson, jilles
Tested by:	pho
MFC after:	2 weeks
2011-11-22 20:40:18 +00:00
Lawrence Stewart
65e359a15c - Add Pulse-Per-Second timestamping using raw ffcounter and corresponding
ffclock time in seconds.

- Add IOCTL to retrieve ffclock timestamps from userland.

Committed on behalf of Julien Ridoux and Darryl Veitch from the University of
Melbourne, Australia, as part of the FreeBSD Foundation funded "Feed-Forward
Clock Synchronization Algorithms" project.

For more information, see http://www.synclab.org/radclock/

Submitted by:	Julien Ridoux (jridoux at unimelb edu au)
2011-11-21 13:34:29 +00:00
Attilio Rao
9fde98bba3 Introduce the same mutex-wise fix in r227758 for sx locks.
The functions that offer file and line specifications are:
- sx_assert_
- sx_downgrade_
- sx_slock_
- sx_slock_sig_
- sx_sunlock_
- sx_try_slock_
- sx_try_xlock_
- sx_try_upgrade_
- sx_unlock_
- sx_xlock_
- sx_xlock_sig_
- sx_xunlock_

Now vm_map locking is fully converted and can avoid to know specifics
about locking procedures.
Reviewed by:	kib
MFC after:	1 month
2011-11-21 12:59:52 +00:00
Sergey Kandaurov
ca4aa8c363 Remove no more relevant XXXRW comments since accessing the vmspace is now
properly done with the acquired vmspace reference.

Pointed out by:		kib
2011-11-21 12:21:00 +00:00
Sergey Kandaurov
18be8527e9 Use the acquired reference to the vmspace instead of direct dereferencing
of p->p_vmspace like it is done in sysctl_kern_proc_vmmap().
2011-11-21 10:36:57 +00:00
Lawrence Stewart
cf13a58510 - Add the ffclock_getcounter(), ffclock_getestimate() and ffclock_setestimate()
system calls to provide feed-forward clock management capabilities to
  userspace processes. ffclock_getcounter() returns the current value of the
  kernel's feed-forward clock counter. ffclock_getestimate() returns the current
  feed-forward clock parameter estimates and ffclock_setestimate() updates the
  feed-forward clock parameter estimates.

- Document the syscalls in the ffclock.2 man page.

- Regenerate the script-derived syscall related files.

Committed on behalf of Julien Ridoux and Darryl Veitch from the University of
Melbourne, Australia, as part of the FreeBSD Foundation funded "Feed-Forward
Clock Synchronization Algorithms" project.

For more information, see http://www.synclab.org/radclock/

Submitted by:	Julien Ridoux (jridoux at unimelb edu au)
2011-11-21 01:26:10 +00:00
Attilio Rao
ccdf233323 Introduce macro stubs in the mutex implementation that will be always
defined and will allow consumers, willing to provide options, file and
line to locking requests, to not worry about options redefining the
interfaces.
This is typically useful when there is the need to build another
locking interface on top of the mutex one.

The introduced functions that consumers can use are:
- mtx_lock_flags_
- mtx_unlock_flags_
- mtx_lock_spin_flags_
- mtx_unlock_spin_flags_
- mtx_assert_
- thread_lock_flags_

Spare notes:
- Likely we can get rid of all the 'INVARIANTS' specification in the
  ppbus code by using the same macro as done in this patch (but this is
  left to the ppbus maintainer)
- all the other locking interfaces may require a similar cleanup, where
  the most notable case is sx which will allow a further cleanup of
  vm_map locking facilities
- The patch should be fully compatible with older branches, thus a MFC
  is previewed (infact it uses all the underlying mechanisms already
  present).

Comments review by:	eadler, Ben Kaduk
Discussed with:		kib, jhb
MFC after:	1 month
2011-11-20 16:33:09 +00:00
Hans Petter Selasky
9e3ae31c7a Given that the typical usage of pause() is pause("zzz", hz / N), where N can
be greater than hz in some cases, simply ignore a timeout value of zero.

Suggested by:	Bruce Evans
MFC after:	1 week
2011-11-20 08:36:18 +00:00
Hans Petter Selasky
f1a1612fc2 Minor style change:
Simplify the description of pause() and shorten the KASSERT message in pause.
Also add a clamp for the timo argument in the non-KASSERT case.

Suggested by:	Bruce Evans
MFC after:	1 week
2011-11-20 08:29:23 +00:00
Lawrence Stewart
9bce0f05fe - Provide a sysctl interface to change the active system clock at runtime.
- Wrap [get]{bin,nano,micro}[up]time() functions of sys/time.h to allow
  requesting time from either the feedback or the feed-forward clock. If a
  feedback (e.g. ntpd) and feed-forward (e.g. radclock) daemon are both running
  on the system, both kernel clocks are updated but only one serves time.

- Add similar wrappers for the feed-forward difference clock.

Committed on behalf of Julien Ridoux and Darryl Veitch from the University of
Melbourne, Australia, as part of the FreeBSD Foundation funded "Feed-Forward
Clock Synchronization Algorithms" project.

For more information, see http://www.synclab.org/radclock/

Submitted by:	Julien Ridoux (jridoux at unimelb edu au)
2011-11-20 05:32:12 +00:00
Lawrence Stewart
f464c5ccb5 Provide high-level functions to access the feed-forward absolute and difference
clocks. Each routine can output an upper bound on the absolute time or time
interval requested. Different flavours of absolute time can be requested, for
example with or without leap seconds, monotonic or not, etc.

Committed on behalf of Julien Ridoux and Darryl Veitch from the University of
Melbourne, Australia, as part of the FreeBSD Foundation funded "Feed-Forward
Clock Synchronization Algorithms" project.

For more information, see http://www.synclab.org/radclock/

Submitted by:	Julien Ridoux (jridoux at unimelb edu au)
2011-11-20 01:20:50 +00:00
Lawrence Stewart
b0fdc83713 Core structure and functions to support a feed-forward clock within the kernel.
Implement ffcounter, a monotonically increasing cumulative counter on top of the
active timecounter. Provide low-level functions to read the ffcounter and
convert it to absolute time or a time interval in seconds using the current
ffclock estimates, which track the drift of the oscillator. Add a ring of
fftimehands to track passing of time on each kernel tick and pick up updates of
ffclock estimates.

Committed on behalf of Julien Ridoux and Darryl Veitch from the University of
Melbourne, Australia, as part of the FreeBSD Foundation funded "Feed-Forward
Clock Synchronization Algorithms" project.

For more information, see http://www.synclab.org/radclock/

Submitted by:	Julien Ridoux (jridoux at unimelb edu au)
2011-11-19 14:10:16 +00:00
Hans Petter Selasky
3ddd1777d5 Simplify the usb_pause_mtx() function by factoring out the generic parts
to the kernel's pause() function. The pause() function can now be used
when cold != 0. Also assert that the timeout in system ticks must be
positive.

Suggested by:	Bruce Evans
MFC after:	1 week
2011-11-19 11:17:27 +00:00
Hans Petter Selasky
11bcf702f4 Move the device_delete_all_children() function from usb_util.c
to kern/subr_bus.c. Simplify this function so that it no longer
depends on malloc() to execute. Identify a few other places where
it makes sense to use device_delete_all_children().

MFC after:	1 week
2011-11-19 10:11:50 +00:00
Konstantin Belousov
f82360acf2 Existing VOP_VPTOCNP() interface has a fatal flow that is critical for
nullfs.  The problem is that resulting vnode is only required to be
held on return from the successfull call to vop, instead of being
referenced.

Nullfs VOP_INACTIVE() method reclaims the vnode, which in combination
with the VOP_VPTOCNP() interface means that the directory vnode
returned from VOP_VPTOCNP() is reclaimed in advance, causing
vn_fullpath() to error with EBADF or like.

Change the interface for VOP_VPTOCNP(), now the dvp must be
referenced. Convert all in-tree implementations of VOP_VPTOCNP(),
which is trivial, because vhold(9) and vref(9) are similar in the
locking prerequisites. Out-of-tree fs implementation of VOP_VPTOCNP(),
if any, should have no trouble with the fix.

Tested by:	pho
Reviewed by:	mckusick
MFC after:	3 weeks (subject of re approval)
2011-11-19 07:50:49 +00:00
Ed Schouten
51cfb9474f Regenerate system call tables. 2011-11-19 06:36:11 +00:00
Ed Schouten
d3a993d46b Improve *access*() parameter name consistency.
The current code mixes the use of `flags' and `mode'. This is a bit
confusing, since the faccessat() function as a `flag' parameter to store
the AT_ flag.

Make this less confusing by using the same name as used in the POSIX
specification -- `amode'.
2011-11-19 06:35:15 +00:00
Navdeep Parhar
059d10c7b6 Do not increment the parent firmware's reference count when any other
firmware image in the module is registered.  Instead, do it when the
other image is itself referenced.

This allows a module with multiple firmware images to be automatically
unloaded when none of the firmware images are in use.

Discussed with:	jhb@ (on -hackers)
2011-11-19 00:20:28 +00:00
Peter Holm
7bbcd22d38 Added check for negative seconds value. Found by syscall() fuzzing.
MFC after:	1 week
2011-11-18 19:14:42 +00:00
Konstantin Belousov
7b5190779b Consistently use process spin lock for protection of the
p->p_boundary_count. Race could cause the execve(2) from the threaded
process to hung since thread boundary counter was incorrect and
single-threading never finished.

Reported by:	pluknet, pho
Tested by:	pho
MFC after:	1 week
2011-11-18 09:12:26 +00:00
Kevin Lo
41f1dccceb Add unicode support to msdosfs and smbfs; original pathes from imura,
bug fixes by Kuan-Chung Chiu <buganini at gmail dot com>.

Tested by me in production for several days at work.
2011-11-18 03:05:20 +00:00
Pawel Jakub Dawidek
d576deedb5 Constify arguments for locking KPIs where possible.
This enables locking consumers to pass their own structures around as const and
be able to assert locks embedded into those structures.

Reviewed by:	ed, kib, jhb
2011-11-16 21:51:17 +00:00
Pawel Jakub Dawidek
a20358302f Constify stack argument for functions that don't modify it.
Reviewed by:	ed, kib, jhb
2011-11-16 19:06:55 +00:00
Marius Strobl
d7ecd801ed As it turns out, r186347 actually is insufficient to avoid the use of the
curthread-accessing part of mtx_{,un}lock(9) when using a r210623-style
curthread implementation on sparc64, crashing the kernel in its early
cycles as PCPU isn't set up, yet (and can't be set up as OFW is one of the
things we need for that, which leads to a chicken-and-egg problem). What
happens is that due to the fact that the idea of r210623 actually is to
allow the compiler to cache invocations of curthread, it factors out
obtaining curthread needed for both mtx_lock(9) and mtx_unlock(9) to
before the branch based on kobj_mutex_inited when compiling the kernel
without the debugging options. So change kobj_class_compile_static(9)
to just never acquire kobj_mtx, effectively restricting it to its
documented use, and add a kobj_init_static(9) for initializing objects
using a class compiled with the former and that also avoids using mutex(9)
(and malloc(9)). Also assert in both of these functions that they are
used in their intended way only.
While at it, inline kobj_register_method() and kobj_unregister_method()
as there wasn't much point for factoring them out in the first place
and so that a reader of the code has to figure out the locking for
fewer functions missing a KOBJ_ASSERT.
Tested on powerpc{,64} by andreast.

Reviewed by:	nwhitehorn (earlier version), jhb
MFC after:	3 days
2011-11-15 20:11:03 +00:00
David E. O'Brien
0e31b3c15f Reformat comment to be more readable in standard Xterm.
(while I'm here, wrap other long lines)
2011-11-15 01:48:53 +00:00
Robert Millan
ea4d9a14f1 Remove a few bits of FreeBSD 2.x compatibility code.
Approved by:	kib (mentor)
2011-11-14 18:21:27 +00:00
John Baldwin
7edec6214e - Split out a kern_posix_fadvise() from the posix_fadvise() system call so
it can be used by in-kernel consumers.
- Make kern_posix_fallocate() public.
- Use kern_posix_fadvise() and kern_posix_fallocate() to implement the
  freebsd32 wrappers for the two system calls.
2011-11-14 18:00:15 +00:00
Alfred Perlstein
cfb09e00e6 Constify args to copyiniov and copyinuio. 2011-11-14 07:12:10 +00:00
Konstantin Belousov
56be1b9a7a To limit amount of the kernel memory allocated, and to optimize the
iteration over the fdsets, kern_select() limits the length of the
fdsets copied in by the last valid file descriptor index. If any bit
is set in a mask above the limit, current implementation ignores the
filedescriptor, instead of returning EBADF.

Fix the issue by scanning the tails of fdset before entering the
select loop and returning EBADF if any bit above last valid
filedescriptor index is set. The performance impact of the additional
check is only imposed on the (somewhat) buggy applications that pass
bad file descriptors to select(2) or pselect(2).

PR:	kern/155606, kern/162379
Discussed with:	cognet, glebius
Tested by:	andreast (powerpc, all 64/32bit ABI combinations, big-endian),
       marius (sparc64, big-endian)
MFC after:    2 weeks
2011-11-13 10:28:01 +00:00
Konstantin Belousov
4d651f4e5f Style.
MFC after:	1 week
2011-11-11 04:13:47 +00:00
Konstantin Belousov
f403cfb19c Guard against the unlikely case of the alias path containing the '%' symbols.
Reported by:	arundel
MFC after:	1 week
2011-11-11 04:12:58 +00:00
Ryan Stone
493b584dbd Correct the types of the arguments to return probes of the syscall
provider.  Previously we were erroneously supplying the argument types of
the corresponding entry probe.

Reviewed by:	rpaulo
MFC after:	1 week
2011-11-11 03:49:42 +00:00
Ed Schouten
d09ebcec17 Simplify the code emitted by makeobjops.awk slightly.
Just place the default kobj_method inside the kobjop_desc structure.
There's no need to give these kobj_methods their own symbol. This shaves
off 10 KB of a GENERIC kernel binary.
2011-11-09 11:00:29 +00:00
Ed Schouten
3f3f6bc302 Make kobj_methods constant.
These structures hold no information that is modified during runtime. By
marking this constant, we see approximately 600 symbols become
read-only (amd64 GENERIC). While there, also mark the kobj_method
structures generated by makeobjops.awk static. They are only referenced
by the kobjop_desc structures within the same file.

Before:

	$ ls -l kernel
	-rwxr-xr-x  1 ed  wheel  15937309 Nov  8 16:29 kernel*
	$ size kernel
	    text    data     bss      dec    hex filename
	12260854 1358468 2848832 16468154 fb48ba kernel
	$ nm kernel | fgrep -c ' r '
	8240

After:

	$ ls -l kernel
	-rwxr-xr-x  1 ed  wheel  15922469 Nov  8 16:25 kernel*
	$ size kernel
	    text    data     bss      dec    hex filename
	12302869 1302660 2848704 16454233 fb1259 kernel
	$ nm kernel | fgrep -c ' r '
	8838
2011-11-08 15:38:21 +00:00
Ryan Stone
6f6924e5a6 The in-kernel CTF parser caches the result of its first attempt to parse
CTF data from a module.  On subsequent attempts to retrieve CTF data for
a module, return an error if there no CTF data.

This fixes a panic if you try to enable fbt probes on a module with CTF
data twice.

Submitted by:	Paul Ambrose (ambrosehua AT gmail DOT com)
MFC after:	3 days
2011-11-08 15:17:54 +00:00
Attilio Rao
ed1f6dc235 Introduce the option VFS_ALLOW_NONMPSAFE and turn it on by default on
all the architectures.
The option allows to mount non-MPSAFE filesystem. Without it, the
kernel will refuse to mount a non-MPSAFE filesytem.

This patch is part of the effort of killing non-MPSAFE filesystems
from the tree.

No MFC is expected for this patch.

Tested by:	gianni
Reviewed by:	kib
2011-11-08 10:18:07 +00:00
Mikolaj Golub
5384d08913 Add KVME_FLAG_SUPER and use it in sysctl_kern_proc_vmmap for marking
entries with superpages.

Submitted by:	Mel Flynn <mel.flynn+fbsd.hackers@mailing.thruhere.net>
Reviewed by:	alc, rwatson
2011-11-07 21:13:19 +00:00
Mikolaj Golub
bde886fba4 In lim_fork() assert that processes locks are held.
Suggested by:	kib
2011-11-07 21:09:04 +00:00
Ed Schouten
6472ac3d8a Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.
The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.
2011-11-07 15:43:11 +00:00
Ed Schouten
d745c852be Mark MALLOC_DEFINEs static that have no corresponding MALLOC_DECLAREs.
This means that their use is restricted to a single C file.
2011-11-07 06:44:47 +00:00
Max Khon
4e313b699e Add KLD_DEBUG option. 2011-11-06 08:10:41 +00:00
John Baldwin
cd06ae5c1b Regen. 2011-11-04 04:06:31 +00:00
John Baldwin
936c09ac0f Add the posix_fadvise(2) system call. It is somewhat similar to
madvise(2) except that it operates on a file descriptor instead of a
memory region.  It is currently only supported on regular files.

Just as with madvise(2), the advice given to posix_fadvise(2) can be
divided into two types.  The first type provide hints about data access
patterns and are used in the file read and write routines to modify the
I/O flags passed down to VOP_READ() and VOP_WRITE().  These modes are
thus filesystem independent.  Note that to ease implementation (and
since this API is only advisory anyway), only a single non-normal
range is allowed per file descriptor.

The second type of hints are used to hint to the OS that data will or
will not be used.  These hints are implemented via a new VOP_ADVISE().
A default implementation is provided which does nothing for the WILLNEED
request and attempts to move any clean pages to the cache page queue for
the DONTNEED request.  This latter case required two other changes.
First, a new V_CLEANONLY flag was added to vinvalbuf().  This requests
vinvalbuf() to only flush clean buffers for the vnode from the buffer
cache and to not remove any backing pages from the vnode.  This is
used to ensure clean pages are not wired into the buffer cache before
attempting to move them to the cache page queue.  The second change adds
a new vm_object_page_cache() method.  This method is somewhat similar to
vm_object_page_remove() except that instead of freeing each page in the
specified range, it attempts to move clean pages to the cache queue if
possible.

To preserve the ABI of struct file, the f_cdevpriv pointer is now reused
in a union to point to the currently active advice region if one is
present for regular files.

Reviewed by:	jilles, kib, arch@
Approved by:	re (kib)
MFC after:	1 month
2011-11-04 04:02:50 +00:00
John Baldwin
dccc45e4c0 Move the cleanup of f_cdevpriv when the reference count of a devfs
file descriptor drops to zero out of _fdrop() and into devfs_close_f()
as it is only relevant for devfs file descriptors.

Reviewed by:	kib
MFC after:	1 week
2011-11-04 03:39:31 +00:00
Attilio Rao
2b10b1f872 Disable interrupt and preemption for smp_rendezvous() also in the
UP/!SMP case.
The callbacks may be relying on this feature and having 2 different
ways to deal with them is not correct.

Reported by:	rstone
Reviewed by:	jhb
MFC after:	2 weeks
2011-11-03 14:36:56 +00:00
Marcel Moolenaar
b2f1a8f2b3 Revert rev. 226893: subr_syscall.c is being included from C files and
on amd64 with FREEBSD32 enabled, this means that systrace_probe_func
gets defined twice.
2011-10-30 02:19:39 +00:00
Marcel Moolenaar
056f0ec755 Define systrace_probe_func in subr_syscall.c where it's used, instead
of defining it in MD code. This eliminates porting to other architectures.
2011-10-29 01:26:36 +00:00
Sergey Kandaurov
c241c5e49a Fix arguments list for proc:::signal-discard DTrace probe.
Reported by:	Anton Yuzhaninov <citrin citrin ru>
MFC after:	1 week
2011-10-28 15:22:51 +00:00
John Baldwin
62238a6791 Whitespace fix. 2011-10-27 17:43:36 +00:00
Alan Cox
703dec68bf Eliminate vestiges of page coloring in VM_ALLOC_NOOBJ calls to
vm_page_alloc().  While I'm here, for the sake of consistency, always
specify the allocation class, such as VM_ALLOC_NORMAL, as the first of
the flags.
2011-10-27 16:39:17 +00:00
Sergey Kandaurov
3bedc94069 Remove the long reprecated ``/stand/sysinstall'' from the init_path.
It can be put back using the INIT_PATH config option or init_path
loader variable, if still needed (which I doubt).

MFC after:	1 week
2011-10-27 10:25:11 +00:00
Alan Cox
f346986b76 contigmalloc(9) and contigfree(9) are now implemented in terms of other
more general VM system interfaces.  So, their implementation can now
reside in kern_malloc.c alongside the other functions that are declared
in malloc.h.
2011-10-27 02:52:24 +00:00
John Baldwin
c48fb4da4c - Fixup filenames in a few more places where they are used.
- Some whitespace fixes.
2011-10-26 15:17:42 +00:00
Pawel Jakub Dawidek
4c11f091df The v_data field is a pointer, so set it to NULL, not 0.
MFC after:	3 days
2011-10-25 14:01:17 +00:00
Marcel Moolenaar
421b7fe574 Don't terminate the interactive root mount prompt on mount failure.
This restores the previous behaviour. While here, match '?' and '.'
inputs exactly and improve the error message.

Requested by: avg@
Derived from a patch by: Arnaud Lacombe <lacombar@gmail.com>
2011-10-23 20:03:33 +00:00
Dag-Erling Smørgrav
e141be6f79 Revisit the capability failure trace points. The initial implementation
only logged instances where an operation on a file descriptor required
capabilities which the file descriptor did not have.  By adding a type enum
to struct ktr_cap_fail, we can catch other types of capability failures as
well, such as disallowed system calls or attempts to wrap a file descriptor
with more capabilities than it had to begin with.
2011-10-18 07:28:58 +00:00
Marcel Moolenaar
80f1c58b0a Fix double vision syndrome (read: double output) when in the
debugger without a panic.
2011-10-16 14:16:46 +00:00
Konstantin Belousov
126b36a21e Control the execution permission of the readable segments for
i386 binaries on the amd64 and ia64 with the sysctl, instead of
unconditionally enabling it.

Reviewed by:	marcel
2011-10-15 12:35:18 +00:00
Marcel Moolenaar
676eda08d0 In elf32_trans_prot() and when compiling for amd64 or ia64, add
PROT_EXECUTE when PROT_READ is needed. By default i386 allows
execution when reading is allowed and JDK 1.4.x depends on that.
2011-10-13 16:16:46 +00:00
Gleb Smirnoff
8d689e042f Make memguard(9) capable to guard uma(9) allocations. 2011-10-12 18:08:28 +00:00
Robert Watson
b160c14194 Correct a bug in export of capability-related information from the sysctls
supporting procstat -f: properly provide capability rights information to
userspace.  The bug resulted from a merge-o during upstreaming (or rather,
a failure to properly merge FreeBSD-side changed downstream).

Spotted by:     des, kibab
MFC after:      3 days
2011-10-12 12:08:03 +00:00
Adrian Chadd
df46ae53f6 Don't call fixup_filename() on each witness lock call.
This has been irking me for a while. This causes significant
CPU use on bottlenecked CPUs (eg my older EEEPC w/ an earlier
Celeron CPU and my MIPS24k boards) when they're passing
a lot of traffic.

Since the file/line values are only used for printing, this
should only affect display. It should have no operational
change on the code, besides reducing CPU use.
2011-10-12 09:21:02 +00:00
Dag-Erling Smørgrav
c601ad8eeb Add a new trace point, KTRFAC_CAPFAIL, which traces capability check
failures.  It is included in the default set for ktrace(1) and kdump(1).
2011-10-11 20:37:10 +00:00
Kirk McKusick
cd795a6e1f When unmounting a filesystem always wait for the vfs_busy lock to clear
so that if no vnodes in the filesystem are actively in use the unmount
will succeed rather than failing with EBUSY.

Reported by: Garrett Cooper
Reviewed by: Attilio Rao and Kostik Belousov
Tested by:   Garrett Cooper
PR:          kern/161016
MFC after:   3 weeks
2011-10-11 18:46:41 +00:00
Marius Strobl
f305d1b0db In device_get_children() avoid malloc(0) in order to increase portability
to other operating systems.

PR:     154287
2011-10-09 21:21:37 +00:00
Alan Cox
1549ed03ff Fix the handling of an empty kmem map by sysctl_kmem_map_free(). In
the unlikely event that sysctl_kmem_map_free() was performed on an
empty kmem map, it would incorrectly report the free space as zero.

Discussed with:	avg
MFC after:	1 week
2011-10-08 18:29:30 +00:00
Jonathan Anderson
25e33e625f Change one printf() to log().
As noted in kern/159780, printf() is not very jail-friendly, since it can't be easily monitored by jail management tools. This patch reports an error via log() instead, which, if nobody is watching the log file, still prints to the console.

Approved by: mentor (rwatson)
Submitted by: Eugene Grosbein <eugen@eg.sd.rdtc.ru>
MFC after: 5 days
2011-10-07 09:51:12 +00:00