12900 Commits

Author SHA1 Message Date
kib
de90907af2 Style fixes for r242958.
Reported and reviewed by:	bde
MFC after:	28 days
2012-11-16 06:22:14 +00:00
trasz
f25f7f2e87 Improve KASSERT messages in racct, to make it clear which resource
caused the problem.

Submitted by:	mjg
2012-11-15 15:55:49 +00:00
trasz
ec6f935202 Fix kassert that's not really valid for %CPU accounting. The problem
here is race between decaying the resource usage in containers, and updating
per-process usage; basically, the former may cause per-container usage
to get smaller than per-process usage.

Submitted by:	Rudo Tomori
2012-11-15 14:11:34 +00:00
mav
f2dcd36473 Fix bug in r242852 that prevented CPU from becoming idle if kernel built
without SMP support.
2012-11-15 14:10:51 +00:00
jeff
f40f3c3255 - Implement run-time expansion of the KTR buffer via sysctl.
- Implement a function to ensure that all preempted threads have switched
   back out at least once.  Use this to make sure there are no stale
   references to the old ktr_buf or the lock profiling buffers before
   updating them.

Reviewed by:	marius (sparc64 parts), attilio (earlier patch)
Sponsored by:	EMC / Isilon Storage Division
2012-11-15 00:51:57 +00:00
bapt
f7eb521a37 Style fix
MFC after:	1 day
2012-11-14 10:33:12 +00:00
bapt
554b1ce9d7 return ERANGE if the buffer is too small to contain the login as documented in
the manpage

Reviewed by:	cognet, kib
MFC after:	1 month
2012-11-14 10:32:12 +00:00
mjg
72315ca484 enterpgrp: get rid of pgrp2 variable and use KASSERT directly on pgfind result.
pgrp2 was used only for debugging, but pgrp2 = pgfind(..) was present in compiled code even for kernels without INVARIANTS

Approved by:	trasz (mentor)
MFC after:	1 week
2012-11-13 22:01:25 +00:00
kib
63c9e066e5 Regen 2012-11-13 12:53:41 +00:00
kib
1409e8df20 Add the wait6(2) system call. It takes POSIX waitid()-like process
designator to select a process which is waited for. The system call
optionally returns siginfo_t which would be otherwise provided to
SIGCHLD handler, as well as extended structure accounting for child
and cumulative grandchild resource usage.

Allow to get the current rusage information for non-exited processes
as well, similar to Solaris.

The explicit WEXITED flag is required to wait for exited processes,
allowing for more fine-grained control of the events the waiter is
interested in.

Fix the handling of siginfo for WNOWAIT option for all wait*(2)
family, by not removing the queued signal state.

PR:	standards/170346
Submitted by:	"Jukka A. Ukkonen" <jau@iki.fi>
MFC after:	1 month
2012-11-13 12:52:31 +00:00
trasz
974b82f77d Don't divide by zero.
Tested by:	swills
2012-11-13 11:29:08 +00:00
mav
4edd3bb7df Several optimizations to sched_idletd():
- Do not try to steal load from other CPUs if there was no contest switches
on this CPU (i.e. it was idle all the time and woke up just for bus mastering
or TLB shutdown). If current CPU was idle, then it is quite unlikely that some
other CPU has load to steal.  Under high I/O rate, when TLB shutdowns cause
numerous CPU wakeups, on 24-CPU system load stealing code may consume up to
25% of all CPU time without giving any benefits.
 - Change code that implements spinning for load to restart spin in case of
context switch.  Previous code periodically called cpu_idle() even under
high interrupt/context switch rate.
 - Rise spinning threshold to 10KHz, where it gives at least some effect
that may worth consumed power.

Reviewed by:	jeff@
2012-11-10 07:02:57 +00:00
alfred
7beb738c8a Allow maxusers to scale on machines with large address space.
Some hooks are added to clamp down maxusers and nmbclusters for
small address space systems.

VM_MAX_AUTOTUNE_MAXUSERS - the max maxusers that will be autotuned based on
physical memory.
VM_MAX_AUTOTUNE_NMBCLUSTERS - max nmbclusters based on physical memory.

These are set to the old values on i386 to preserve the clamping that was
being done to all arches.

Another macro VM_AUTOTUNE_NMBCLUSTERS is provided to allow an override
for the calculation on a MD basis.  Currently no arch defines this.

Reviewed by: peter
MFC after: 2 weeks
2012-11-10 02:08:40 +00:00
attilio
d5d551ec46 Complete MPSAFE VFS interface and remove MNTK_MPSAFE flag.
Porters should refer to __FreeBSD_version 1000021 for this change as
it may have happened at the same timeframe.
2012-11-09 18:02:25 +00:00
marius
dd60b8658b Make r242655 build on sparc64. While at it, make vm_{max,min}_kernel_address
vm_offset_t as they should be.
2012-11-08 08:10:32 +00:00
jeff
15bd5ad44d - Change ULE to use dynamic slice sizes for the timeshare queue in order
to further reduce latency for threads in this queue.  This should help
   as threads transition from realtime to timeshare.  The latency is
   bound to a max of sched_slice until we have more than sched_slice / 6
   threads runnable.  Then the min slice is allotted to all threads and
   latency becomes (nthreads - 1) * min_slice.

Discussed with: mav
2012-11-08 01:46:47 +00:00
kevlo
25611f9cf9 Fix typo; s/ouput/output 2012-11-07 07:00:59 +00:00
alfred
5d14749363 export VM_MIN_KERNEL_ADDRESS and VM_MAX_KERNEL_ADDRESS via sysctl.
On several platforms the are determined by too many nested #defines to be
easily discernible.  This will aid in development of auto-tuning.
2012-11-06 04:10:32 +00:00
kib
0e30ee6e51 A clarification to the behaviour of the active vnode list management
regarding the vnode page cleaning.

In collaboration with:	pho
MFC after:	1 week
2012-11-05 16:40:42 +00:00
kib
58ab2ca2eb Add decoding of the missed MNT_KERN_ flags to ddb "show mount" command.
MFC after:	3 weeks
2012-11-04 13:33:13 +00:00
kib
f8b34c9be8 Add decoding of the missed VI_ and VV_ flags to ddb "show vnode" command.
MFC after:	3 days
2012-11-04 13:32:45 +00:00
kib
84d617582f Order the enumeration of the MNT_ flags to be the same as the order of
their definitions.

MFC after:	3 days
2012-11-04 13:31:41 +00:00
ed
e77bf633a0 Add tty_set_winsize().
This removes some of the signalling magic from the Syscons driver and
puts it in the TTY layer, where it belongs.
2012-11-03 22:21:37 +00:00
attilio
c754915a07 Merge r242395,242483 from mutex implementation:
give rwlock(9) the ability to crunch different type of structures, with
the only constraint that they have a lock cookie named rw_lock.
This name, then, becames reserved from the struct that wants to use
the rwlock(9) KPI and other locking primitives cannot reuse it for
their members.

Namely such structs are the current struct rwlock and the new struct
rwlock_padalign. The new structure will define an object which has the
same layout of a struct rwlock but will be allocated in areas aligned
to the cache line size and will be as big as a cache line.

For further details check comments on above mentioned revisions.

Reviewed by:	jimharris, jeff
2012-11-03 15:57:37 +00:00
alfred
8c8997ccb9 Merge 242488, better use of strlcpy.
Submitted by:	Eric van Gyzen <eric@vangyzen.net>
2012-11-02 18:57:38 +00:00
kib
f16ea99007 The r241025 fixed the case when a binary, executed from nullfs mount,
was still possible to open for write from the lower filesystem.  There
is a symmetric situation where the binary could already has file
descriptors opened for write, but it can be executed from the nullfs
overlay.

Handle the issue by passing one v_writecount reference to the lower
vnode if nullfs vnode has non-zero v_writecount.  Note that only one
write reference can be donated, since nullfs only keeps one use
reference on the lower vnode.  Always use the lower vnode v_writecount
for the checks.

Introduce the VOP_GET_WRITECOUNT to read v_writecount, which is
currently always bypassed to the lower vnode, and VOP_ADD_WRITECOUNT
to manipulate the v_writecount value, which manages a single bypass
reference to the lower vnode.  Caling the VOPs instead of directly
accessing v_writecount provide the fix described in the previous
paragraph.

Tested by:	pho
MFC after:	3 weeks
2012-11-02 13:56:36 +00:00
alfred
4a74d2e51a Provide a device name in the sysctl tree for programs to query the
state of crashdump target devices.

This will be used to add a "-l" (ell) flag to dumpon(8) to list the
currently configured dumpdev.

Reviewed by:	phk
2012-11-01 17:01:05 +00:00
attilio
d38d7bb245 Rework the known mutexes to benefit about staying on their own
cache line in order to avoid manual frobbing but using
struct mtx_padalign.

The sole exception being nvme and sxfge drivers, where the author
redefined CACHE_LINE_SIZE manually, so they need to be analyzed and
dealt with separately.

Reviwed by:	jimharris, alc
2012-10-31 18:07:18 +00:00
jimharris
8fbe050915 Pad and align the callout_cpu mtx to its own cacheline to reduce false
sharing especially on the default CPU 0 callout_cpu structure.

This will be followed up by attilio@ with a conversion to the new struct
mtx_padalign but doing this manual conversion first gives an easy MFC
candidate since mtx_padalign is a more extensive system change.

Sponsored by:	Intel
Reviewed by:	jeff, attilio
MFC after:	1 week
2012-10-31 17:12:12 +00:00
attilio
b9e5ac8e1c Give mtx(9) the ability to crunch different type of structures, with the
only constraint that they have a lock cookie named mtx_lock.
This name, then, becames reserved from the struct that wants to use the
mtx(9) KPI and other locking primitives cannot reuse it for their
members.

Namely such structs are the current struct mtx and the new
struct mtx_padalign.  The new structure will define an object which is
the same as the same layout of a struct mtx but will be allocated in
areas aligned to the cache line size and will be as big as a cache line.

This is supposed to give higher performance for highly contented mutexes
both spin or sleep (because of the adaptive spinning), where the cache
line contention results in too much traffic on the system bus.

The struct mtx_padalign can be used in a completely transparent way
with the mtx(9) KPI.

At the moment, a possibility to MFC the patch should be carefully
evaluated because this patch breaks the low level KPI
(not its representation though).

Discussed with:	jhb
Reviewed by:	jeff, andre
Reviewed by:	mdf (earlier version)
Tested by:	jimharris
2012-10-31 13:38:56 +00:00
attilio
279b97daea Fixup r240246: hwpmc needs to retain the pinning until ASTs are not
executed. This means past the point where userret() is generally
executed.

Skip the td_pinned check if a callchain tracing is currently happening
and add a more robust check to pmc_capture_user_callchain() in order to
catch td_pinned leak past ast() in hwpmc case.

Reported and tested by:	fabient
MFC after:	1 week
X-MFC:	r240246
2012-10-30 15:10:50 +00:00
attilio
97b0f9880b tdq_lock_pair() already does spinlock_enter() so migration is not
possible in sched_balance_pair(). Remove redundant sched_pin().

Reviewed by:	marius, jeff
2012-10-30 12:25:52 +00:00
andre
845b471915 In soreceive_stream() don't drop an already dequeued mbuf chain by
overwriting the return mbuf pointer with newly received data after
a loop.  Instead append the new mbuf chain to the existing one.

Fix up sb_lastrecord when dequeuing mbuf's so that sbappend_stream()
doesn't get confused.

For the remainder copy case in the mbuf delivery part deduct the
copied length len instead of the whole mbuf length.  Additionally
don't depend on 'n' being being available which isn't true in the
case of MSG_PEEK.

Fix the MSG_WAITALL case by comparing against sb_hiwat.  Before
it was looping for every receive as sb_lowat normally is zero.
Add comment about issue with (MSG_WAITALL | MSG_PEEK) which isn't
properly handled.

Submitted by:	trociny (except for the change in last paragraph)
2012-10-29 12:31:12 +00:00
andre
48996a8b7a Add logging for socket attach failures in sonewconn() during accept(2).
Include the pointer to the PCB so it can be attributed to a particular
application by corresponding it to "netstat -A" output.

MFC after:	2 weeks
2012-10-29 12:14:57 +00:00
kevlo
a9293cf962 Since the macro dtom() has been removed, fix comments about the dtom.
Reviewed by:	glebius
2012-10-29 10:04:28 +00:00
andre
d4bd5c150c Improve m_cat() by being able to also merge contents from M_EXT
mbuf's by doing proper testing with M_WRITABLE().

In m_collapse() replace an incomplete manual check for M_RDONLY
with the M_WRITABLE() macro that also tests for shared buffers
and other cases that make a particular mbuf immutable.

MFC after:	2 weeks
2012-10-28 18:38:51 +00:00
davide
1f47916a20 The fields of struct timespec32 should be int32_t and not uint32_t.
Make this change.

Reviewed by:	bde, davidxu
Tested by:	pho
MFC after:	1 week
2012-10-27 23:42:41 +00:00
trasz
d97338334a Add CPU percentage limit enforcement to RCTL. The resouce name is "pcpu".
It was implemented by Rudolf Tomori during Google Summer of Code 2012.
2012-10-26 16:01:08 +00:00
ed
ae88b22791 Correct SIGTTIN handling.
In the old TTY layer, SIGTTIN was correctly handled like this:

	while (data should be read) {
		send SIGTTIN if not foreground process group
		read data
	}

In the new TTY layer, however, this behaviour was changed, based on a
false interpretation of the standard:

	send SIGTTIN if not foreground process group
	while (data should be read) {
		read data
	}

Correct this by pushing tty_wait_background() into the ttydisc_read_*()
functions.

Reported by:	koitsu
PR:		kern/173010
MFC after:	2 weeks
2012-10-25 09:05:21 +00:00
alfred
2873c69678 Allow autotune maxusers > 384 on 64 bit machines
A default install on large memory machines with multiple 10gigE interfaces
were not being given enough mbufs to do full bandwidth TCP or NFS traffic.

To keep the value somewhat reasonable, we scale back the number of
maxuers by 1/6 past the 384 point.  This gives us enough mbufs for most
of our pretty basic 10gigE line-speed tests to complete.
2012-10-25 01:46:20 +00:00
jimharris
e56b4fdc17 Pad tdq_lock to avoid false sharing with tdq_load and tdq_cpu_idle.
This enables CPU searches (which read tdq_load) to operate independently
of any contention on the spinlock.  Some scheduler-intensive workloads
running on an 8C single-socket SNB Xeon show considerable improvement with
this change (2-3% perf improvement, 5-6% decrease in CPU util).

Sponsored by:	Intel
Reviewed by:	jeff
2012-10-24 18:36:41 +00:00
andre
c59c83ebb6 Replace the ill-named ZERO_COPY_SOCKET kernel option with two
more appropriate named kernel options for the very distinct
send and receive path.

"options SOCKET_SEND_COW" enables VM page copy-on-write based
sending of data on an outbound socket.

NB: The COW based send mechanism is not safe and may result
in kernel crashes.

"options SOCKET_RECV_PFLIP" enables VM kernel/userspace page
flipping for special disposable pages attached as external
storage to mbufs.

Only the naming of the kernel options is changed and their
corresponding #ifdef sections are adjusted.  No functionality
is added or removed.

Discussed with:	alc (mechanism and limitations of send side COW)
2012-10-23 14:19:44 +00:00
ed
8467024240 Remove unused `vfslocked' variable.
I have no idea what this `vfslocked' thing means. I wonder how it ended
up here.
2012-10-22 21:14:26 +00:00
kib
560aa751e0 Remove the support for using non-mpsafe filesystem modules.
In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.

Conducted and reviewed by:	attilio
Tested by:	pho
2012-10-22 17:50:54 +00:00
eadler
24cc54b983 Correct the killpg(2) return values:
Return EPERM if processes were found but they
were unable to be signaled.

Return the first error from p_cansignal if no signal was successful.

Reviewed by:	jilles
Approved by:	cperciva
MFC after:	1 week
2012-10-22 03:43:02 +00:00
eadler
f38062a582 Colin acked the wrong diff originally. fixed version coming soon.
Approved by:	cperciva (implicit)
2012-10-22 03:36:44 +00:00
eadler
f48bd672c6 Correct the killpg(2) return values:
Return EPERM if processes were found but they
were unable to be signaled.

Return the first error from p_cansignal if no signal was successful.

Reviewed by:	jilles
Approved by:	cperciva
MFC after:	1 week
2012-10-22 03:34:43 +00:00
eadler
3f7a414911 remove duplicate semicolons where possible.
Approved by:	cperciva
MFC after:	1 week
2012-10-22 03:00:37 +00:00
andre
6062df3e66 Grammar fixes to r241781.
Submitted by:	alc
2012-10-20 19:38:22 +00:00
andre
1211b4598c Hide the unfortunate named sysctl kern.ipc.somaxconn from sysctl -a
output and replace it with a new visible sysctl kern.ipc.acceptqueue
of the same functionality.  It specifies the maximum length of the
accept queue on a listen socket.

The old kern.ipc.somaxconn remains available for reading and writing
for compatibility reasons so that existing programs, scripts and
configurations continue to work.  There no plans to ever remove the
orginal and now hidden kern.ipc.somaxconn.
2012-10-20 12:53:14 +00:00