12625 Commits

Author SHA1 Message Date
eadler
169b46c915 - Clean up timestamps in msgbuf code. The timestamps should now be
inserted after the priority token thus cleaning up the output.
- Remove the needless double internal do_add_char function.
- Resolve a possible deadlock if interrupts are
    disabled and getnanotime is called

Reviewed by:	bde  kmacy, avg, sbruno (various versions)
Approved by:	cperciva
MFC after:	2 weeks
2012-03-19 00:36:32 +00:00
jh
683a986c03 Cast wallclock.tv_sec to uint64_t to avoid overflow in the calculation.
PR:		kern/161552
Reviewed by:	trasz
Tested by:	Nikos Vassiliadis
MFC after:	1 week
2012-03-18 19:13:32 +00:00
davide
cd0c342e57 Add rudimentary profiling of the hash table used in the in the umtx code to
hold active lock queues.

Reviewed by:	attilio
Approved by:	davidxu, gnn (mentor)
MFC after:	3 weeks
2012-03-16 20:32:11 +00:00
tuexen
b8b34b6ecf Fix bugs which can result in a panic when an non-SCTP socket it
used with an sctp_ system-call which expects an SCTP socket.

MFC after: 3 days.
2012-03-15 14:13:38 +00:00
ae
894c8dc15b Add CTLFLAG_TUN to the sysctl definition and fix style.
Pointed by:	Garrett Cooper
MFC after:	2 weeks
2012-03-15 06:01:21 +00:00
ae
9be115302d Add debug.kld_debug loader tunable.
MFC after:	2 weeks
2012-03-15 05:11:29 +00:00
jh
59d9d84ca4 Add an assert for proctree_lock to proc_to_reap().
Discussed with:	kib
MFC after:	1 week
2012-03-14 15:52:23 +00:00
kib
6e85340add Lock the process around manipulations with p_flag.
Reported and reviewed by:	jh
MFC after:	3 days
2012-03-13 22:00:46 +00:00
adrian
f2bb6a85d7 Add module load/unload stubs. 2012-03-13 20:27:48 +00:00
mav
5b5fc4e585 Add kern.eventtimer.activetick tunable/sysctl, specifying whether each
hardclock() tick should be run on every active CPU, or on only one.

On my tests, avoiding extra interrupts because of this on 8-CPU Core i7
system with HZ=10000 saves about 2% of performance. At this moment option
implemented only for global timers, as reprogramming per-CPU timers is
too expensive now to be compensated by this benefit, especially since we
still have to regularly run hardclock() on at least one active CPU to
update system uptime. For global timer it is quite trivial: timer runs
always, but we just skip IPIs to other CPUs when possible.

Option is enabled by default now, keeping previous behavior, as periodic
hardclock() calls are still used at least to implement setitimer(2) with
ITIMER_VIRTUAL and ITIMER_PROF arguments. But since default schedulers don't
depend on it since r232917, we are much more free to experiment with it.

MFC after:	1 month
2012-03-13 10:21:08 +00:00
mav
ffaa080e67 Rewrite thread CPU usage percentage math to not depend on periodic calls
with HZ rate through the sched_tick() calls from hardclock().

Potentially it can be used to improve precision, but now it is just minus
one more reason to call hardclock() for every HZ tick on every active CPU.
SCHED_4BSD never used sched_tick(), but keep it in place for now, as at
least SCHED_FBFS existing in patches out of the tree depends on it.

MFC after:	1 month
2012-03-13 08:18:54 +00:00
pho
e35bb21f2c Allways call fdrop(). 2012-03-12 11:56:57 +00:00
kib
4e790f9b2b ELF image can have several PT_NOTE program headers. Look for the ELF
brand note in each header, instead of using only first one.

Reviewed by:	kan
Tested by:	andrew (arm), flo (sparc64)
MFC after:	3 weeks
2012-03-11 19:38:49 +00:00
kib
8adabb0356 Remove fifo.h. The only used function declaration from the header is
migrated to sys/vnode.h.

Submitted by:	gianni
2012-03-11 12:19:58 +00:00
mav
4be9351f8b Revert r175376 and tune cpufreq(4) frequency comparison logic instead.
Instead of using 25MHz equality threshold, look for the nearest value when
handling dev.cpu.0.freq sysctl and for exact match when it is expected.

ACPI may report extra level with frequency 1MHz above the nominal to
control Intel Turbo Boost operation. It is not a bug, but feature:
dev.cpu.0.freq_levels: 2934/106000 2933/95000 2800/82000 ...
In this case value 2933 means 2.93GHz, but 2934 means 3.2-3.6GHz.

I've found that my Core i7-870 based system has Intel Turbo Boost disabled
by default and without this change it was absolutely invisible and hard
to control.

MFC after:	2 weeks
2012-03-10 18:56:16 +00:00
mav
1324baa4eb Idle ticks optimization:
- Pass number of events to the statclock() and profclock() functions
   same as to hardclock() before to not call them many times in a loop.
 - Rename them into statclock_cnt() and profclock_cnt().
 - Turn statclock() and profclock() into compatibility wrappers,
   still needed for arm.
 - Rename hardclock_anycpu() into hardclock_cnt() for unification.

MFC after:	1 week
2012-03-10 14:57:21 +00:00
trasz
a0d48d6f11 Remove useless thread_{lock,unlock}() in raccd. 2012-03-10 14:38:49 +00:00
jmallett
d25fa497f7 Export intrcnt correctly when running under 32-bit compatibility.
Reviewed by:	gonzo, nwhitehorn
2012-03-09 22:30:54 +00:00
pho
c84e05a07c Perform the parameter validation before assigning it to a signed int
variable. This fixes the problem seen with readdir(3) fuzzing.

Submitted by:	bde
MFC after:	1 week
2012-03-09 21:31:12 +00:00
mav
d6e827162d Make kern.sched.idlespinthresh default value adaptive depending of HZ.
Otherwise with HZ above 8000 CPU may never skip timer ticks on idle.
2012-03-09 19:09:08 +00:00
mav
fb50c869a4 Be more polite when setting state->nextevent inside cpu_new_callout().
Hardclock is not the only who wakes idle CPU since kdtrace cyclic addition.

MFC after:	2 weeks
2012-03-09 07:30:48 +00:00
kib
5abd2bb7cb Decomission mnt_noasync. Introduce MNTK_NOASYNC mnt_kern_flag which
allows a filesystem to request VFS to not allow MNTK_ASYNC.

MFC after:	1 week
2012-03-09 00:12:05 +00:00
pho
81cae127b0 Free up allocated memory used by posix_fadvise(2). 2012-03-08 20:34:13 +00:00
jhb
19feaba08b Add KTR_VFS traces to track modifications to a vnode's writecount. 2012-03-08 20:27:20 +00:00
jhb
4fea355eb2 Add a new sched_clear_name() method to the scheduler interface to clear
the cached name used for KTR_SCHED traces when a thread's name changes.
This way KTR_SCHED traces (and thus schedgraph) will notice when a thread's
name changes, most commonly via execve().

MFC after:	2 weeks
2012-03-08 19:41:05 +00:00
kib
9d4d411642 The pipe_poll() performs lockless access to the vnode to test
fifo_iseof() condition, allowing the v_fifoinfo to be reset and freed
by fifo_cleanup().

Precalculate EOF at the places were fo_wgen is changed, and cache the
state in a new pipe state flag PIPE_SAMEWGEN.

Reported and tested by:	bf
Submitted by:	gianni
MFC after:	1 week (a backport)
2012-03-07 07:31:50 +00:00
trasz
613d36617d Make racct and rctl correctly handle jail renaming. Previously
they would continue using old name, the one jail was created with.

PR:		bin/165207
2012-03-06 11:05:50 +00:00
ivoras
5fe0ebe46a Print out process name and thread id in the debugging message.
This is useful because the message can end up in system logs in
non-debugging operation.

Reviewed by:	attilio (earlier version)
2012-03-05 14:19:43 +00:00
kib
6f473d5748 pipe_read(): change the type of size to int, and remove signed clamp.
pipe_write(): change the type of desiredsize back to int, its value fits.

Requested by: bde
MFC after:    3 weeks
2012-03-04 15:09:01 +00:00
kib
273d08b6bc Instead of incomplete handling of read(2)/write(2) return values that
does not fit into registers, declare that we do not support this case
using CTASSERT(), and remove endianess-unsafe code to split return value
into td_retval.

While there, change the style of the sysctl debug.iosize_max_clamp
definition.

Requested by:	bde
MFC after:	3 weeks
2012-03-04 14:55:37 +00:00
trociny
f331516921 Make kern.proc.umask sysctl readonly.
Requested by:	src
MFC after:	1 week
2012-03-03 11:53:35 +00:00
mav
1504832681 Fix bug of r232207, when cpu_search() could prefer CPU group with best
load, but with no CPU matching given limitations. It caused kernel panics
in some cases when thread was bound to specific CPUs with cpuset(1).
2012-03-03 11:50:48 +00:00
jmallett
50c253779f o) Add COMPAT_FREEBSD32 support for MIPS kernels using the n64 ABI with userlands
using the o32 ABI.  This mostly follows nwhitehorn's lead in implementing
   COMPAT_FREEBSD32 on powerpc64.
o) Add a new type to the freebsd32 compat layer, time32_t, which is time_t in the
   32-bit ABI being used.  Since the MIPS port is relatively-new, even the 32-bit
   ABIs use a 64-bit time_t.
o) Because time{spec,val}32 has the same size and layout as time{spec,val} on MIPS
   with 32-bit compatibility, then, disable some code which assumes otherwise
   wrongly when built for MIPS.  A more general macro to check in this case would
   seem like a good idea eventually.  If someone adds support for using n32
   userland with n64 kernels on MIPS, then they will have to add a variety of
   flags related to each piece of the ABI that can vary.  That's probably the
   right time to generalize further.
o) Add MIPS to the list of architectures which use PAD64_REQUIRED in the
   freebsd32 compat code.  Probably this should be generalized at some point.

Reviewed by:	gonzo
2012-03-03 08:19:18 +00:00
rmacklem
f633984c25 Post r230394, the Lookup RPC counts for both NFS clients increased
significantly. Upon investigation this was caused by name cache
misses for lookups of "..". For name cache entries for non-".."
directories, the cache entry serves double duty. It maps both the
named directory plus ".." for the parent of the directory. As such,
two ctime values (one for each of the directory and its parent) need
to be saved in the name cache entry.
This patch adds an entry for ctime of the parent directory to the
name cache. It also adds an additional uma zone for large entries
with this time value, in order to minimize memory wastage.
As well, it fixes a couple of cases where the mtime of the parent
directory was being saved instead of ctime for positive name cache
entries. With this patch, Lookup RPC counts return to values similar
to pre-r230394 kernels.

Reported by:	bde
Discussed with:	kib
Reviewed by:	jhb
MFC after:	2 weeks
2012-03-03 01:06:54 +00:00
jhb
5013ab31bd - Change contigmalloc() to use the vm_paddr_t type instead of an unsigned
long for specifying a boundary constraint.
- Change bus_dma tags to use bus_addr_t instead of bus_size_t for boundary
  constraints.

These allow boundary constraints to be fully expressed for cases where
sizeof(bus_addr_t) != sizeof(bus_size_t).  Specifically, it allows a
driver to properly specify a 4GB boundary in a PAE kernel.

Note that this cannot be safely MFC'd without a lot of compat shims due
to KBI changes, so I do not intend to merge it.

Reviewed by:	scottl
2012-03-01 19:58:34 +00:00
mckusick
b23f922edf This change avoids a kernel deadlock on "snaplk" when using
snapshots on UFS filesystems running with journaled soft updates.
This is the first of several bugs that need to be fixed before
removing the restriction added in -r230250 to prevent the use
of snapshots on filesystems running with journaled soft updates.

The deadlock occurs when holding the snapshot lock (snaplk)
and then trying to flush an inode via ffs_update(). We become
blocked by another process trying to flush a different inode
contained in the same inode block that we need. It holds the
inode block for which we are waiting locked. When it tries to
write the inode block, it gets blocked waiting for the our
snaplk when it calls ffs_copyonwrite() to see if the inode
block needs to be copied in our snapshot.

The most obvious place that this deadlock arises is in the
ffs_copyonwrite() routine when it updates critical metadata
in a snapshot and tries to write it out before proceeding.
The fix here is to write the data and indirect block pointer
for the snapshot, but to skip the call to ffs_update() to
write the snapshot inode. To ensure that we will never have
to update a pointer in the inode itself, the ffs_snapshot()
routine that creates the snapshot has to ensure that all the
direct blocks are allocated as part of the creation of the
snapshot.

A less obvious place that this deadlock occurs is when we hold
the snaplk because we are deleting a snapshot. In the course of
doing the deletion, we need to allocate various soft update
dependency structures and allocate some journal space. If we
hit a resource limit while doing this we decrease the resources
in use by flushing out an existing dirty file to get it to give
up the soft dependency resources that it holds. The flush can
cause an ffs_update() to be done on the inode for the file that
we have selected to flush resulting in the same deadlock as
described above when the inode that we have chosen to flush
resides in the same inode block as the snapshot inode that we hold.
The fix is to defer cleaning up any time that the inode on which
we are operating is a snapshot.

Help and review by:    Jeff Roberson
Tested by:             Peter Holm
MFC (to 9 only) after: 2 weeks
2012-03-01 18:45:25 +00:00
trociny
1aad0004ee Introduce VOP_UNP_BIND(), VOP_UNP_CONNECT(), and VOP_UNP_DETACH()
operations for setting and accessing vnode's v_socket field.

The operations are necessary to implement proper unix socket handling
on layered file systems like nullfs(5).

This change fixes the long standing issue with nullfs(5) being in that
unix sockets did not work between lower and upper layers: if we bound
to a socket on the lower layer we could connect only to the lower
path; if we bound to the upper layer we could connect only to the
upper path. The new behavior is one can connect to both the lower and
the upper paths regardless what layer path one binds to.

PR:		kern/51583, kern/159663
Suggested by:	kib
Reviewed by:	arch
MFC after:	2 weeks
2012-02-29 21:38:31 +00:00
davidxu
49fb0a40aa initialize clock ID and flags only when copying timespec, a _umtx_time
copy already contains these fields.
2012-02-29 02:01:48 +00:00
mm
77766742e1 Add procfs to jail-mountable filesystems.
Reviewed by:	jamie
MFC after:	1 week
2012-02-29 00:30:18 +00:00
dim
e045194768 Change definition of pipe_chmod() from K&R to C99, to avoid the
following clang warning:

sys/kern/sys_pipe.c:1556:10: error: promoted type 'int' of K&R function parameter is not compatible with the parameter type 'mode_t'
      (aka 'unsigned short') declared in a previous prototype [-Werror]
        mode_t mode;
               ^
sys/kern/sys_pipe.c:155:19: note: previous declaration is here
static fo_chmod_t       pipe_chmod;
                        ^
2012-02-28 21:45:21 +00:00
jhb
22eaf01bc1 Properly clear a device's devclass if DEVICE_ATTACH() fails if the device
does not have a fixed devclass.

Reviewed by:	imp
MFC after:	2 weeks
2012-02-28 19:16:02 +00:00
kib
0c3998cc9e Currently, the debugger attached to the process executing vfork() does
not get syscall exit notification until the child performed exec of
exit.  Swap the order of doing ptracestop() and waiting for P_PPWAIT
clearing, by postponing the wait into syscallret after ptracestop()
notification is done.

Reported, tested and reviewed by:	Dmitry Mikulin <dmitrym juniper net>
MFC after:	 2 weeks
2012-02-27 21:10:10 +00:00
jhb
4110bb206b Clear the a device's description string anytime it's driver changes.
Descriptions  are specific to drivers and we don't change drivers on attached
devices.  This fixes a few places where we were not clearing the description
when detaching a driver (e.g. with device_attach() failed).  While here, fix
a few other nits:
- Remove spurious call to remove a device's driver from
  devclass_driver_deleted().  device_detach() removes it already.
- Fix a typo.
2012-02-27 16:08:18 +00:00
davidxu
96aacc2279 Follow changes made in revision 232144, pass absolute timeout to kernel,
this eliminates a clock_gettime() syscall.
2012-02-27 13:38:52 +00:00
mav
8a35bc6c4f Rework CPU load balancing in SCHED_ULE:
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
 - In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
 - Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
 - Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
 - Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
 - Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.

All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.

Reviewed by:	jeff
Tested by:	flo, hackers@
MFC after:	1 month
Sponsored by:	iXsystems, Inc.
2012-02-27 10:31:54 +00:00
phk
c1bac816f5 Also call the low-level driver if ->c_iflag & (IXON|IXOFF|IXANY) changes.
Uftdi(4) examines (c_iflag & (IXON|IXOFF)) to control hw XON-XOFF support.
This is obviously no good, if changes to those bits are not communicated
down the stack.
2012-02-26 20:56:49 +00:00
alc
351ef48158 Fix typo.
MFC after:	1 week
2012-02-26 19:10:14 +00:00
mm
d974ef7be1 Analogous to r232059, add a parameter for the ZFS file system:
allow.mount.zfs:
	allow mounting the zfs filesystem inside a jail

This way the permssions for mounting all current VFCF_JAIL filesystems
inside a jail are controlled wia allow.mount.* jail parameters.

Update sysctl descriptions.
Update jail(8) and zfs(8) manpages.

TODO:	document the connection of allow.mount.* and VFCF_JAIL for kernel
	developers

MFC after:	10 days
2012-02-26 16:30:39 +00:00
jilles
06d1ac3238 Fix fchmod() and fchown() on fifos.
The new fifo implementation in r232055 broke fchmod() and fchown() on fifos.
Postfix needs this.

Submitted by:	gianni
Reported by:	dougb
2012-02-26 15:14:29 +00:00
trociny
a9902044d1 Add sysctl to retrieve or set umask of another process.
Submitted by:	Dmitry Banschikov <me ubique spb ru>
Discussed with:	kib, rwatson
Reviewed by:	kib
MFC after:	2 weeks
2012-02-26 14:25:48 +00:00