Commit Graph

12912 Commits

Author SHA1 Message Date
Konstantin Belousov
daee0f0b0b Schedule garbage collection run for the in-flight rights passed over
the unix domain sockets to the next tick, coalescing the serial calls
until the collection fires.  The thought is that more work for the
collector could arise in the near term, allowing it to clean more and not
spend too much CPU on repeated collection when there is no garbage.

Currently the collection task is fired immediately upon unix domain
socket close if there are any rights in flight, which caused excessive
CPU usage and overly long blocking of threads waiting for
unp_list_lock and unp_link_rwlock in write mode.

Robert noted that it would be nice if we could find some heuristic by
which we decide whether to run GC a bit more quickly.  E.g., if the
number of UNIX domain sockets is close to its resource limit, but not
quite.

Reported and tested by:	Markus Gebert <markus.gebert@hostpoint.ch>
Reviewed by:	rwatson
MFC after:	2 weeks
2012-11-20 15:45:48 +00:00
Konstantin Belousov
b7c8d2f2f5 Add a special meaning to the negative ticks argument for
taskqueue_enqueue_timeout().  Do not rearm the callout if it is
already armed and ticks is negative.  Otherwise rearm it to fire
in abs(ticks) ticks in the future.

The intended use is to call taskqueue_enqueue_timeout() for the given
timeout_task with the same negative ticks argument.  As a result, the
task is scheduled to execute no later than abs(ticks) ticks in the
future, and subsequent enqueues are coalesced until the already
scheduled task is finished.

Reviewed by:	rwatson
Tested by:	Markus Gebert <markus.gebert@hostpoint.ch>
MFC after:	2 weeks
2012-11-20 15:33:48 +00:00
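For the taskqueue_enqueue_timeout() commit above, a minimal sketch of how a kernel consumer might use the negative-ticks convention; the task name, handler, and hz/10 interval are illustrative, not from the commit.

	#include <sys/param.h>
	#include <sys/kernel.h>
	#include <sys/taskqueue.h>

	static struct timeout_task example_task;	/* hypothetical consumer task */

	static void
	example_task_fn(void *arg, int pending)
	{
		/* Do the deferred work here, e.g. a garbage collection pass. */
	}

	static void
	example_init(void)
	{
		TIMEOUT_TASK_INIT(taskqueue_thread, &example_task, 0,
		    example_task_fn, NULL);
	}

	static void
	example_poke(void)
	{
		/*
		 * Negative ticks: run the task no later than hz/10 ticks from
		 * now.  If it is already scheduled, the callout is not rearmed,
		 * so repeated calls coalesce into the one pending run.
		 */
		taskqueue_enqueue_timeout(taskqueue_thread, &example_task, -hz / 10);
	}
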
Attilio Rao
973b795b64 insmntque() is always called with the lock held in exclusive mode,
therefore:
- assume the lock is held in exclusive mode and remove a moot check
  about the lock acquisition.
- in the destructor, remove the !MPSAFE-specific chunk.

Reviewed by:	kib
MFC after:	2 weeks
2012-11-19 20:43:19 +00:00
Andriy Gapon
ab49c952d9 assert_vop_locked should treat LK_EXCLOTHER as the not locked case
... from the perspective of the current thread.

Spotted by:	mjg
Discussed with:	kib
MFC after:	18 days
2012-11-19 11:35:56 +00:00
Andriy Gapon
c496727c54 vnode_if: fix locking protocol description for lookup and cachedlookup
Also remove the checks from vop_lookup_pre and vop_lookup_post, which
are now completely redundant (before this change they were partially
redundant).

Discussed with:	kib
MFC after:	10 days
2012-11-19 11:32:56 +00:00
Mateusz Guzik
dd103d4d06 Fix possible fp reference leak in posix_openpt
Reviewed by:	ed
Approved by:	trasz (mentor)
MFC after:	3 days
2012-11-18 15:48:34 +00:00
Gleb Smirnoff
716963cb5d Update comment. 2012-11-16 14:00:54 +00:00
Konstantin Belousov
134eb42e24 In pget(9), if the PGET_NOTWEXIT flag is not specified, also search the
zombie list for the pid. This allows several kern.proc sysctls to
report useful information for zombies.

Hold the allproc_lock around all searches instead of relocking it.
Remove private pfind_locked() from the new nfs client code.

Requested and reviewed by:	pjd
Tested by:	pho
MFC after:	3 weeks
2012-11-16 08:25:06 +00:00
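A hedged sketch of how a pget(9) caller benefits from the change above; the helper name is hypothetical and the flag/locking details follow the manpage as recalled.

	#include <sys/param.h>
	#include <sys/proc.h>

	/*
	 * Hypothetical lookup helper: no PGET_NOTWEXIT in the flags, so with
	 * this change the zombie list is searched as well, letting kern.proc
	 * style consumers report on zombies.  PGET_CANSEE applies the usual
	 * visibility checks; on success *pp is returned locked.
	 */
	static int
	example_pid_lookup(pid_t pid, struct proc **pp)
	{
		return (pget(pid, PGET_CANSEE, pp));
	}
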
Konstantin Belousov
ea293f3f1d Restore the proper handling of pid 0 for waitpid(2).
Fix the surrounding style.

Reported and reviewed by:	bde (previous version)
MFC after:	28 days
2012-11-16 06:32:38 +00:00
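For reference, the pid 0 semantics restored above are the standard "any child in the caller's process group" case; a minimal userland sketch:

	#include <sys/types.h>
	#include <sys/wait.h>
	#include <stdio.h>
	#include <unistd.h>

	int
	main(void)
	{
		int status;
		pid_t pid;

		if (fork() == 0)
			_exit(7);		/* child stays in our process group */

		/* pid 0: wait for any child whose process group ID equals ours. */
		pid = waitpid(0, &status, 0);
		if (pid > 0 && WIFEXITED(status))
			printf("child %d exited with %d\n", (int)pid,
			    WEXITSTATUS(status));
		return (0);
	}
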
Konstantin Belousov
a2a8559624 Style fixes for r242958.
Reported and reviewed by:	bde
MFC after:	28 days
2012-11-16 06:22:14 +00:00
Edward Tomasz Napierala
baf85d0a22 Improve KASSERT messages in racct, to make it clear which resource
caused the problem.

Submitted by:	mjg
2012-11-15 15:55:49 +00:00
Edward Tomasz Napierala
84c9193ba0 Fix a KASSERT that's not really valid for %CPU accounting.  The problem
here is a race between decaying the resource usage in containers and updating
per-process usage; basically, the former may cause per-container usage
to become smaller than per-process usage.

Submitted by:	Rudo Tomori
2012-11-15 14:11:34 +00:00
Alexander Motin
2fd4047f32 Fix a bug in r242852 that prevented the CPU from becoming idle if the
kernel was built without SMP support.
2012-11-15 14:10:51 +00:00
Jeff Roberson
28d91af30f - Implement run-time expansion of the KTR buffer via sysctl.
- Implement a function to ensure that all preempted threads have switched
   back out at least once.  Use this to make sure there are no stale
   references to the old ktr_buf or the lock profiling buffers before
   updating them.

Reviewed by:	marius (sparc64 parts), attilio (earlier patch)
Sponsored by:	EMC / Isilon Storage Division
2012-11-15 00:51:57 +00:00
Baptiste Daroussin
6f0a5dea71 Style fix
MFC after:	1 day
2012-11-14 10:33:12 +00:00
Baptiste Daroussin
6f68699fbd Return ERANGE if the buffer is too small to contain the login name, as
documented in the manpage.

Reviewed by:	cognet, kib
MFC after:	1 month
2012-11-14 10:32:12 +00:00
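The documented contract from the userland side, as a small illustrative sketch (the program and output strings are made up):

	#include <sys/param.h>
	#include <errno.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	int
	main(void)
	{
		char name[MAXLOGNAME];
		int error;

		/* getlogin_r(3) returns an errno value rather than setting errno. */
		error = getlogin_r(name, sizeof(name));
		if (error == ERANGE)
			fprintf(stderr, "buffer too small for the login name\n");
		else if (error != 0)
			fprintf(stderr, "getlogin_r: %s\n", strerror(error));
		else
			printf("login: %s\n", name);
		return (0);
	}
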
Mateusz Guzik
4419a8a88c enterpgrp: get rid of the pgrp2 variable and use KASSERT directly on the pgfind result.
pgrp2 was used only for debugging, but pgrp2 = pgfind(..) was present in compiled code even for kernels built without INVARIANTS.

Approved by:	trasz (mentor)
MFC after:	1 week
2012-11-13 22:01:25 +00:00
Konstantin Belousov
552e993580 Regen 2012-11-13 12:53:41 +00:00
Konstantin Belousov
f13b5a0f01 Add the wait6(2) system call.  It takes a POSIX waitid()-like process
designator to select the process to be waited for.  The system call
optionally returns the siginfo_t which would otherwise be provided to
the SIGCHLD handler, as well as an extended structure accounting for
child and cumulative grandchild resource usage.

Allow getting the current rusage information for non-exited processes
as well, similar to Solaris.

The explicit WEXITED flag is required to wait for exited processes,
allowing for more fine-grained control of the events the waiter is
interested in.

Fix the handling of siginfo for WNOWAIT option for all wait*(2)
family, by not removing the queued signal state.

PR:	standards/170346
Submitted by:	"Jukka A. Ukkonen" <jau@iki.fi>
MFC after:	1 month
2012-11-13 12:52:31 +00:00
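A hedged userland sketch of the new call, assuming the wait6(2) signature wait6(idtype, id, status, options, wrusage, infop) described above; error handling is omitted and the output is illustrative.

	#include <sys/types.h>
	#include <sys/resource.h>
	#include <sys/wait.h>
	#include <signal.h>
	#include <stdio.h>
	#include <unistd.h>

	int
	main(void)
	{
		struct __wrusage wru;
		siginfo_t si;
		int status;
		pid_t child, pid;

		if ((child = fork()) == 0)
			_exit(0);

		/*
		 * WEXITED must be passed explicitly to reap exited children.
		 * The siginfo that would have gone to a SIGCHLD handler and
		 * the child/grandchild rusage are returned as well.
		 */
		pid = wait6(P_PID, child, &status, WEXITED, &wru, &si);
		if (pid > 0)
			printf("pid %d exited, si_status %d\n", (int)pid,
			    si.si_status);
		return (0);
	}
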
Edward Tomasz Napierala
84590fd8e5 Don't divide by zero.
Tested by:	swills
2012-11-13 11:29:08 +00:00
Alexander Motin
2c27cb3a34 Several optimizations to sched_idletd():
- Do not try to steal load from other CPUs if there were no context switches
on this CPU (i.e. it was idle all the time and woke up just for bus mastering
or TLB shutdown).  If the current CPU was idle, then it is quite unlikely that
some other CPU has load to steal.  Under a high I/O rate, when TLB shutdowns
cause numerous CPU wakeups, on a 24-CPU system the load stealing code may
consume up to 25% of all CPU time without giving any benefit.
- Change the code that implements spinning for load to restart the spin in
case of a context switch.  The previous code periodically called cpu_idle()
even under a high interrupt/context switch rate.
- Raise the spinning threshold to 10KHz, where it gives at least some effect
that may be worth the consumed power.

Reviewed by:	jeff@
2012-11-10 07:02:57 +00:00
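A rough sketch of the first heuristic from the commit above, with hypothetical names; the counter and steal routine are stand-ins, not the actual ULE identifiers.

	/* Hypothetical stand-ins; not the actual ULE fields or functions. */
	static int example_oldswitchcnt;

	static void
	example_try_steal_load(void)
	{
		/* Search other CPUs' run queues for work (expensive). */
	}

	static void
	example_idle_iteration(int switchcnt)
	{
		/*
		 * Only search for load to steal if this CPU context-switched
		 * since the last iteration; a wakeup just for a TLB shutdown
		 * or bus mastering leaves the counter unchanged and skips the
		 * search.
		 */
		if (switchcnt != example_oldswitchcnt) {
			example_oldswitchcnt = switchcnt;
			example_try_steal_load();
		}
	}
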
Alfred Perlstein
79f62ed690 Allow maxusers to scale on machines with large address space.
Some hooks are added to clamp down maxusers and nmbclusters for
small address space systems.

VM_MAX_AUTOTUNE_MAXUSERS - the max maxusers that will be autotuned based on
physical memory.
VM_MAX_AUTOTUNE_NMBCLUSTERS - max nmbclusters based on physical memory.

These are set to the old values on i386 to preserve the clamping that was
being done to all arches.

Another macro, VM_AUTOTUNE_NMBCLUSTERS, is provided to allow an override
for the calculation on an MD basis.  Currently no arch defines this.

Reviewed by: peter
MFC after: 2 weeks
2012-11-10 02:08:40 +00:00
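A sketch of how the clamp described above might be expressed; only the macro name comes from the commit, the surrounding function and variable names are illustrative.

	/*
	 * Illustrative autotuning sketch: derive maxusers from physical
	 * memory elsewhere, then clamp it with the MD cap when the platform
	 * defines one.
	 */
	int
	example_tune_maxusers(int maxusers_from_physmem)
	{
		int maxusers = maxusers_from_physmem;

	#ifdef VM_MAX_AUTOTUNE_MAXUSERS
		if (maxusers > VM_MAX_AUTOTUNE_MAXUSERS)
			maxusers = VM_MAX_AUTOTUNE_MAXUSERS;
	#endif
		return (maxusers);
	}
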
Attilio Rao
bc2258da88 Complete MPSAFE VFS interface and remove MNTK_MPSAFE flag.
Porters should refer to __FreeBSD_version 1000021 for this change, as
it may have happened in the same timeframe.
2012-11-09 18:02:25 +00:00
Marius Strobl
c882264c95 Make r242655 build on sparc64. While at it, make vm_{max,min}_kernel_address
vm_offset_t as they should be.
2012-11-08 08:10:32 +00:00
Jeff Roberson
5e5c387373 - Change ULE to use dynamic slice sizes for the timeshare queue in order
to further reduce latency for threads in this queue.  This should help
   as threads transition from realtime to timeshare.  The latency is
   bound to a max of sched_slice until we have more than sched_slice / 6
   threads runnable.  Then the min slice is allotted to all threads and
   latency becomes (nthreads - 1) * min_slice.

Discussed with: mav
2012-11-08 01:46:47 +00:00
Kevin Lo
0f5e7edc14 Fix typo; s/ouput/output 2012-11-07 07:00:59 +00:00
Alfred Perlstein
fc6874bcbb Export VM_MIN_KERNEL_ADDRESS and VM_MAX_KERNEL_ADDRESS via sysctl.
On several platforms these are determined by too many nested #defines to be
easily discernible.  This will aid in the development of auto-tuning.
2012-11-06 04:10:32 +00:00
Konstantin Belousov
76fd782cd9 A clarification to the behaviour of the active vnode list management
regarding the vnode page cleaning.

In collaboration with:	pho
MFC after:	1 week
2012-11-05 16:40:42 +00:00
Konstantin Belousov
90af57930c Add decoding of the missed MNT_KERN_ flags to ddb "show mount" command.
MFC after:	3 weeks
2012-11-04 13:33:13 +00:00
Konstantin Belousov
fb81941575 Add decoding of the missed VI_ and VV_ flags to ddb "show vnode" command.
MFC after:	3 days
2012-11-04 13:32:45 +00:00
Konstantin Belousov
df3161c7df Order the enumeration of the MNT_ flags to be the same as the order of
their definitions.

MFC after:	3 days
2012-11-04 13:31:41 +00:00
Ed Schouten
305921c48e Add tty_set_winsize().
This removes some of the signalling magic from the Syscons driver and
puts it in the TTY layer, where it belongs.
2012-11-03 22:21:37 +00:00
Attilio Rao
19d4153329 Merge r242395,242483 from mutex implementation:
give rwlock(9) the ability to crunch different types of structures, with
the only constraint that they have a lock cookie named rw_lock.
This name, then, becomes reserved for the struct that wants to use
the rwlock(9) KPI, and other locking primitives cannot reuse it for
their members.

Namely such structs are the current struct rwlock and the new struct
rwlock_padalign. The new structure will define an object which has the
same layout as a struct rwlock but will be allocated in areas aligned
to the cache line size and will be as big as a cache line.

For further details check comments on above mentioned revisions.

Reviewed by:	jimharris, jeff
2012-11-03 15:57:37 +00:00
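Since the padded type is meant to be usable through the regular rwlock(9) KPI, a consumer might look like the sketch below; the lock and function names are illustrative.

	#include <sys/param.h>
	#include <sys/lock.h>
	#include <sys/rwlock.h>

	/* Cache-line sized and aligned, but used exactly like a struct rwlock. */
	static struct rwlock_padalign example_lock;

	static void
	example_init(void)
	{
		rw_init(&example_lock, "example");
	}

	static void
	example_update(void)
	{
		rw_wlock(&example_lock);
		/* modify the state protected by the lock */
		rw_wunlock(&example_lock);
	}
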
Alfred Perlstein
5a3a8ec037 Merge 242488, better use of strlcpy.
Submitted by:	Eric van Gyzen <eric@vangyzen.net>
2012-11-02 18:57:38 +00:00
Konstantin Belousov
140dedb81c r241025 fixed the case where a binary executed from a nullfs mount
could still be opened for write from the lower filesystem.  There
is a symmetric situation where the binary could already have file
descriptors opened for write, yet it can still be executed from the
nullfs overlay.

Handle the issue by passing one v_writecount reference to the lower
vnode if the nullfs vnode has a non-zero v_writecount.  Note that only one
write reference can be donated, since nullfs only keeps one use
reference on the lower vnode.  Always use the lower vnode's v_writecount
for the checks.

Introduce VOP_GET_WRITECOUNT to read v_writecount, which is
currently always bypassed to the lower vnode, and VOP_ADD_WRITECOUNT
to manipulate the v_writecount value, which manages a single bypass
reference to the lower vnode.  Calling the VOPs instead of directly
accessing v_writecount provides the fix described in the previous
paragraph.

Tested by:	pho
MFC after:	3 weeks
2012-11-02 13:56:36 +00:00
Alfred Perlstein
bad7e7f3dd Provide a device name in the sysctl tree for programs to query the
state of crashdump target devices.

This will be used to add a "-l" (ell) flag to dumpon(8) to list the
currently configured dumpdev.

Reviewed by:	phk
2012-11-01 17:01:05 +00:00
Attilio Rao
4ceaf45de5 Rework the known mutexes to benefit from staying on their own
cache line, avoiding manual frobbing by using
struct mtx_padalign.

The sole exceptions are the nvme and sfxge drivers, where the author
redefined CACHE_LINE_SIZE manually, so they need to be analyzed and
dealt with separately.

Reviewed by:	jimharris, alc
2012-10-31 18:07:18 +00:00
Jim Harris
84e7a2ebb7 Pad and align the callout_cpu mtx to its own cacheline to reduce false
sharing especially on the default CPU 0 callout_cpu structure.

This will be followed up by attilio@ with a conversion to the new struct
mtx_padalign but doing this manual conversion first gives an easy MFC
candidate since mtx_padalign is a more extensive system change.

Sponsored by:	Intel
Reviewed by:	jeff, attilio
MFC after:	1 week
2012-10-31 17:12:12 +00:00
Attilio Rao
7f44c61839 Give mtx(9) the ability to crunch different types of structures, with the
only constraint that they have a lock cookie named mtx_lock.
This name, then, becomes reserved for the struct that wants to use the
mtx(9) KPI, and other locking primitives cannot reuse it for their
members.

Namely such structs are the current struct mtx and the new
struct mtx_padalign.  The new structure will define an object which has
the same layout as a struct mtx but will be allocated in
areas aligned to the cache line size and will be as big as a cache line.

This is supposed to give higher performance for highly contended mutexes,
both spin and sleep (because of the adaptive spinning), where the cache
line contention results in too much traffic on the system bus.

The struct mtx_padalign can be used in a completely transparent way
with the mtx(9) KPI.

At the moment, the possibility of MFCing the patch should be carefully
evaluated because this patch breaks the low-level KPI
(not its representation, though).

Discussed with:	jhb
Reviewed by:	jeff, andre
Reviewed by:	mdf (earlier version)
Tested by:	jimharris
2012-10-31 13:38:56 +00:00
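As the commit above notes, the padded type works transparently with the mtx(9) KPI; a minimal illustrative consumer (the mutex and function names are made up):

	#include <sys/param.h>
	#include <sys/lock.h>
	#include <sys/mutex.h>

	/* Cache-line sized and aligned, but used through the ordinary mtx(9) KPI. */
	static struct mtx_padalign example_mtx;

	static void
	example_init(void)
	{
		mtx_init(&example_mtx, "example", NULL, MTX_DEF);
	}

	static void
	example_update(void)
	{
		mtx_lock(&example_mtx);
		/* modify the state protected by the mutex */
		mtx_unlock(&example_mtx);
	}
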
Attilio Rao
5584e91718 Fixup r240246: hwpmc needs to retain the pinning until ASTs have
executed.  This means past the point where userret() is generally
executed.

Skip the td_pinned check if callchain tracing is currently happening,
and add a more robust check to pmc_capture_user_callchain() in order to
catch a td_pinned leak past ast() in the hwpmc case.

Reported and tested by:	fabient
MFC after:	1 week
X-MFC:	r240246
2012-10-30 15:10:50 +00:00
Attilio Rao
a049aa05c9 tdq_lock_pair() already does spinlock_enter() so migration is not
possible in sched_balance_pair(). Remove redundant sched_pin().

Reviewed by:	marius, jeff
2012-10-30 12:25:52 +00:00
Andre Oppermann
e8ad36aba4 In soreceive_stream() don't drop an already dequeued mbuf chain by
overwriting the return mbuf pointer with newly received data after
a loop.  Instead append the new mbuf chain to the existing one.

Fix up sb_lastrecord when dequeuing mbufs so that sbappend_stream()
doesn't get confused.

For the remainder copy case in the mbuf delivery part, deduct the
copied length len instead of the whole mbuf length.  Additionally,
don't depend on 'n' being available, which isn't true in the
case of MSG_PEEK.

Fix the MSG_WAITALL case by comparing against sb_hiwat.  Before,
it was looping for every receive, as sb_lowat normally is zero.
Add a comment about the issue with (MSG_WAITALL | MSG_PEEK), which isn't
properly handled.

Submitted by:	trociny (except for the change in last paragraph)
2012-10-29 12:31:12 +00:00
Andre Oppermann
fdd1b7f52a Add logging for socket attach failures in sonewconn() during accept(2).
Include the pointer to the PCB so it can be attributed to a particular
application by correlating it with "netstat -A" output.

MFC after:	2 weeks
2012-10-29 12:14:57 +00:00
Kevin Lo
a2c36a0234 Since the macro dtom() has been removed, fix comments that refer to dtom.
Reviewed by:	glebius
2012-10-29 10:04:28 +00:00
Andre Oppermann
14d7c5b11c Improve m_cat() so that it can also merge contents from M_EXT
mbufs, by testing properly with M_WRITABLE().

In m_collapse() replace an incomplete manual check for M_RDONLY
with the M_WRITABLE() macro that also tests for shared buffers
and other cases that make a particular mbuf immutable.

MFC after:	2 weeks
2012-10-28 18:38:51 +00:00
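A rough sketch of the kind of test the improved m_cat() relies on, not the actual implementation; the helper is hypothetical.

	#include <sys/param.h>
	#include <sys/systm.h>
	#include <sys/mbuf.h>

	/*
	 * Illustrative only: append the data of 'n' into the trailing space
	 * of 'm' when 'm' is safe to modify.  M_WRITABLE() rejects read-only
	 * and shared external buffers, which a bare M_RDONLY check misses.
	 */
	static int
	example_try_merge(struct mbuf *m, struct mbuf *n)
	{
		if (!M_WRITABLE(m) || M_TRAILINGSPACE(m) < n->m_len)
			return (0);
		bcopy(mtod(n, caddr_t), mtod(m, caddr_t) + m->m_len, n->m_len);
		m->m_len += n->m_len;
		return (1);
	}
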
Davide Italiano
ba4be2110a The fields of struct timespec32 should be int32_t and not uint32_t.
Make this change.

Reviewed by:	bde, davidxu
Tested by:	pho
MFC after:	1 week
2012-10-27 23:42:41 +00:00
Edward Tomasz Napierala
36af98697d Add CPU percentage limit enforcement to RCTL. The resource name is "pcpu".
It was implemented by Rudolf Tomori during Google Summer of Code 2012.
2012-10-26 16:01:08 +00:00
Ed Schouten
1da7bb41ed Correct SIGTTIN handling.
In the old TTY layer, SIGTTIN was correctly handled like this:

	while (data should be read) {
		send SIGTTIN if not foreground process group
		read data
	}

In the new TTY layer, however, this behaviour was changed, based on a
false interpretation of the standard:

	send SIGTTIN if not foreground process group
	while (data should be read) {
		read data
	}

Correct this by pushing tty_wait_background() into the ttydisc_read_*()
functions.

Reported by:	koitsu
PR:		kern/173010
MFC after:	2 weeks
2012-10-25 09:05:21 +00:00
Alfred Perlstein
7b6d92c0a0 Allow autotuned maxusers > 384 on 64-bit machines.
A default install on large-memory machines with multiple 10gigE interfaces
was not being given enough mbufs to do full-bandwidth TCP or NFS traffic.

To keep the value somewhat reasonable, we scale back the number of
maxusers by 1/6 past the 384 point.  This gives us enough mbufs for most
of our pretty basic 10gigE line-speed tests to complete.
2012-10-25 01:46:20 +00:00
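One possible reading of the 1/6 scaling described above, purely as an illustration; the actual cutoff and divisor live in the autotuning code and may differ.

	/*
	 * Illustrative only: past the 384 mark, grow maxusers at a reduced
	 * rate by keeping 1/6 of the excess, per the description above.
	 */
	static int
	example_scale_maxusers(int raw)
	{
		if (raw > 384)
			return (384 + (raw - 384) / 6);
		return (raw);
	}
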
Jim Harris
39f819e2fc Pad tdq_lock to avoid false sharing with tdq_load and tdq_cpu_idle.
This enables CPU searches (which read tdq_load) to operate independently
of any contention on the spinlock.  Some scheduler-intensive workloads
running on an 8C single-socket SNB Xeon show considerable improvement with
this change (2-3% perf improvement, 5-6% decrease in CPU util).

Sponsored by:	Intel
Reviewed by:	jeff
2012-10-24 18:36:41 +00:00