Commit Graph

12917 Commits

Author SHA1 Message Date
pjd
02c0badfc1 Allow kern.sugid_coredump and kern.corefile to be modified from loader.conf.
Obtained from:	WHEEL Systems
2012-11-27 10:16:48 +00:00
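
A hedged example of what this enables in /boot/loader.conf (the tunable names come from the commit; the values and file name pattern are illustrative):

    # Dump core from setuid/setgid processes, naming files by uid and pid.
    kern.sugid_coredump=1
    kern.corefile="/var/coredumps/%U.%P.core"
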
pjd
8dbfcd9003 More style fixes. 2012-11-27 10:15:58 +00:00
pjd
0b5aef9e2b Style fixes (mostly whitespaces). 2012-11-27 10:11:54 +00:00
davidxu
adc108b87e Take the first active vnode correctly.
Reviewed by:	kib
MFC after:	3 days
2012-11-27 06:07:58 +00:00
pjd
aec1bace62 Look for a zombie process only if we were given a process id.
Reviewed by:	kib
MFC after:	2 weeks
X-MFC-after-or-with:	243142
2012-11-25 19:31:42 +00:00
avg
fa7647f75a remove stop_scheduler_on_panic knob
There have not been any complaints about the default behavior, so there
is no need to keep a knob that enables the worse alternative.

Now that the hard-stopping of other CPUs is the only behavior, the panic_cpu
spinlock-like logic can be dropped, because only a single CPU is
supposed to win the stop_cpus_hard(other_cpus) race and proceed past that
call.

MFC after:	1 month
2012-11-25 14:22:08 +00:00
avg
9c2d52ecde assert_vop_locked: make the assertion race-free and more efficient
This is really a minor improvement for the sake of correctness.

MFC after:	6 days
2012-11-24 13:11:47 +00:00
avg
c20b4131f0 remove vop_lookup_pre and vop_lookup_post
Suggested by:	kib
MFC after:	5 days
2012-11-22 10:36:10 +00:00
kib
6d46d7b7ab Schedule the garbage collection run for the in-flight rights passed over
unix domain sockets to the next tick, coalescing the serial calls
until the collection fires.  The thought is that more work for the
collector could arise in the near future, allowing it to clean more and not
spend too much CPU on repeated collection when there is no garbage.

Currently the collection task is fired immediately upon unix domain
socket close if there are any rights in flight, which causes excessive
CPU usage and overly long blocking of the threads waiting for
unp_list_lock and unp_link_rwlock in write mode.

Robert noted that it would be nice if we could find some heuristic by
which we decide whether to run GC a bit more quickly.  E.g., if the
number of UNIX domain sockets is close to its resource limit, but not
quite.

Reported and tested by:	Markus Gebert <markus.gebert@hostpoint.ch>
Reviewed by:	rwatson
MFC after:	2 weeks
2012-11-20 15:45:48 +00:00
kib
f0eb44bc70 Add a special meaning to the negative ticks argument for
taskqueue_enqueue_timeout().  Do not rearm the callout if it is
already armed and ticks is negative.  Otherwise rearm it to fire
in abs(ticks) ticks in the future.

The intended use is to call taskqueue_enqueue_timeout() for the given
timeout_task with the same negative ticks argument.  As a result, the
task is scheduled to execute no later than abs(ticks) ticks in the
future, and subsequent enqueues are coalesced until the already
scheduled task is finished.

Reviewed by:	rwatson
Tested by:	Markus Gebert <markus.gebert@hostpoint.ch>
MFC after:	2 weeks
2012-11-20 15:33:48 +00:00
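
A minimal sketch of the coalescing idiom this enables; the task, handler, and use of the system taskqueue_thread queue are illustrative, not taken from the commit:

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/taskqueue.h>

    static struct timeout_task gc_task;

    static void
    gc_handler(void *arg, int pending)
    {
            /* do the deferred work here */
    }

    static void
    gc_init(void)
    {
            TIMEOUT_TASK_INIT(taskqueue_thread, &gc_task, 0, gc_handler, NULL);
    }

    static void
    gc_poke(void)
    {
            /*
             * Negative ticks: arm the callout to fire within 1 tick,
             * but leave it alone if it is already armed, so bursts of
             * calls collapse into a single task run.
             */
            taskqueue_enqueue_timeout(taskqueue_thread, &gc_task, -1);
    }
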
attilio
e331787780 insmntque() is always called with the lock held in exclusive mode,
so:
- assume the lock is held in exclusive mode and remove a moot check
  of the lock acquisition.
- in the destructor, remove the !MPSAFE-specific chunk.

Reviewed by:	kib
MFC after:	2 weeks
2012-11-19 20:43:19 +00:00
avg
4726ba44fc assert_vop_locked should treat LK_EXCLOTHER as the not locked case
... from the perspective of the current thread.

Spotted by:	mjg
Discussed with:	kib
MFC after:	18 days
2012-11-19 11:35:56 +00:00
avg
4d2c561ebf vnode_if: fix locking protocol description for lookup and cachedlookup
Also remove the checks from vop_lookup_pre and vop_lookup_post, which
are now completely redundant (before this change they were partially
redundant).

Discussed with:	kib
MFC after:	10 days
2012-11-19 11:32:56 +00:00
mjg
fb4bab611c Fix possible fp reference leak in posix_openpt
Reviewed by:	ed
Approved by:	trasz (mentor)
MFC after:	3 days
2012-11-18 15:48:34 +00:00
glebius
b821e8658c Update comment. 2012-11-16 14:00:54 +00:00
kib
801de09716 In pget(9), if PGET_NOTWEXIT flag is not specified, also search the
zombie list for the pid. This allows several kern.proc sysctls to
report useful information for zombies.

Hold the allproc_lock around all searches instead of relocking it.
Remove private pfind_locked() from the new nfs client code.

Requested and reviewed by:	pjd
Tested by:	pho
MFC after:	3 weeks
2012-11-16 08:25:06 +00:00
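
A hedged sketch of the effect on a pget(9) caller (PGET_CANDEBUG is just one example flag; error handling trimmed):

    #include <sys/param.h>
    #include <sys/proc.h>

    static int
    inspect_pid(pid_t pid)
    {
            struct proc *p;
            int error;

            /* Without PGET_NOTWEXIT this may now find a zombie too. */
            error = pget(pid, PGET_CANDEBUG, &p);
            if (error == 0) {
                    /* ... report on p, which pget() returned locked ... */
                    PROC_UNLOCK(p);
            }
            return (error);
    }
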
kib
8123554495 Restore the proper handling of pid 0 for waitpid(2).
Fix the surrounding style.

Reported and reviewed by:	bde (previous version)
MFC after:	28 days
2012-11-16 06:32:38 +00:00
kib
de90907af2 Style fixes for r242958.
Reported and reviewed by:	bde
MFC after:	28 days
2012-11-16 06:22:14 +00:00
trasz
f25f7f2e87 Improve KASSERT messages in racct, to make it clear which resource
caused the problem.

Submitted by:	mjg
2012-11-15 15:55:49 +00:00
trasz
ec6f935202 Fix a KASSERT that's not really valid for %CPU accounting.  The problem
here is a race between decaying the resource usage in containers and updating
per-process usage; basically, the former may cause per-container usage
to drop below per-process usage.

Submitted by:	Rudo Tomori
2012-11-15 14:11:34 +00:00
mav
f2dcd36473 Fix a bug in r242852 that prevented the CPU from becoming idle if the
kernel was built without SMP support.
2012-11-15 14:10:51 +00:00
jeff
f40f3c3255 - Implement run-time expansion of the KTR buffer via sysctl.
- Implement a function to ensure that all preempted threads have switched
  back out at least once.  Use this to make sure there are no stale
  references to the old ktr_buf or the lock profiling buffers before
  updating them.

Reviewed by:	marius (sparc64 parts), attilio (earlier patch)
Sponsored by:	EMC / Isilon Storage Division
2012-11-15 00:51:57 +00:00
bapt
f7eb521a37 Style fix
MFC after:	1 day
2012-11-14 10:33:12 +00:00
bapt
554b1ce9d7 Return ERANGE if the buffer is too small to contain the login name, as
documented in the manpage.

Reviewed by:	cognet, kib
MFC after:	1 month
2012-11-14 10:32:12 +00:00
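
A small userland check of the documented behaviour (assuming a login name is set for the session):

    #include <errno.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
            char name[2];   /* deliberately too small */

            /* getlogin_r() returns the error number directly. */
            if (getlogin_r(name, sizeof(name)) == ERANGE)
                    printf("ERANGE, as the manpage documents\n");
            return (0);
    }
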
mjg
72315ca484 enterpgrp: get rid of the pgrp2 variable and use KASSERT directly on the pgfind result.
pgrp2 was used only for debugging, but pgrp2 = pgfind(..) was present in compiled code even for kernels without INVARIANTS.

Approved by:	trasz (mentor)
MFC after:	1 week
2012-11-13 22:01:25 +00:00
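
The pattern being applied, as a sketch (the panic message text is illustrative, not the actual diff):

    /* Before: the pgfind() call executed even without INVARIANTS. */
    pgrp2 = pgfind(pgid);
    KASSERT(pgrp2 == NULL, ("enterpgrp: pgrp with pgid exists"));

    /* After: compiled out entirely when INVARIANTS is off. */
    KASSERT(pgfind(pgid) == NULL, ("enterpgrp: pgrp with pgid exists"));
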
kib
63c9e066e5 Regen 2012-11-13 12:53:41 +00:00
kib
1409e8df20 Add the wait6(2) system call.  It takes a POSIX waitid()-like process
designator to select a process to wait for.  The system call
optionally returns the siginfo_t that would otherwise be provided to
the SIGCHLD handler, as well as an extended structure accounting for child
and cumulative grandchild resource usage.

Allow getting the current rusage information for non-exited processes
as well, similar to Solaris.

The explicit WEXITED flag is required to wait for exited processes,
allowing for more fine-grained control of the events the waiter is
interested in.

Fix the handling of siginfo for the WNOWAIT option for the whole wait*(2)
family, by not removing the queued signal state.

PR:	standards/170346
Submitted by:	"Jukka A. Ukkonen" <jau@iki.fi>
MFC after:	1 month
2012-11-13 12:52:31 +00:00
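
A hedged userland sketch of the interface as described above, using the waitid()-style process designator and the now-explicit WEXITED flag:

    #include <sys/types.h>
    #include <sys/resource.h>
    #include <sys/wait.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
            struct __wrusage wru;   /* child + cumulative grandchild rusage */
            siginfo_t info;         /* what the SIGCHLD handler would see */
            int status;
            pid_t pid;

            if (fork() == 0)
                    _exit(7);

            /* WEXITED must be given explicitly to reap exited children. */
            pid = wait6(P_ALL, 0, &status, WEXITED, &wru, &info);
            if (pid > 0)
                    printf("child %d exited with %d\n", (int)pid,
                        WEXITSTATUS(status));
            return (0);
    }
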
trasz
974b82f77d Don't divide by zero.
Tested by:	swills
2012-11-13 11:29:08 +00:00
mav
4edd3bb7df Several optimizations to sched_idletd():
- Do not try to steal load from other CPUs if there were no context switches
on this CPU (i.e. it was idle all the time and woke up just for bus mastering
or TLB shutdown).  If the current CPU was idle, then it is quite unlikely that
some other CPU has load to steal.  Under a high I/O rate, when TLB shutdowns
cause numerous CPU wakeups, on a 24-CPU system the load stealing code may
consume up to 25% of all CPU time without giving any benefit.
- Change the code that implements spinning for load to restart the spin in
case of a context switch.  The previous code periodically called cpu_idle()
even under a high interrupt/context switch rate.
- Raise the spinning threshold to 10KHz, where it gives at least some effect
that may be worth the power consumed.

Reviewed by:	jeff@
2012-11-10 07:02:57 +00:00
alfred
7beb738c8a Allow maxusers to scale on machines with a large address space.
Some hooks are added to clamp down maxusers and nmbclusters for
small address space systems.

VM_MAX_AUTOTUNE_MAXUSERS - the max maxusers that will be autotuned based on
physical memory.
VM_MAX_AUTOTUNE_NMBCLUSTERS - max nmbclusters based on physical memory.

These are set to the old values on i386 to preserve the clamping that was
being done to all arches.

Another macro, VM_AUTOTUNE_NMBCLUSTERS, is provided to allow an override
of the calculation on an MD basis.  Currently no arch defines this.

Reviewed by: peter
MFC after: 2 weeks
2012-11-10 02:08:40 +00:00
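
A sketch of the shape of the clamping; the macro comes from the commit, while the function, its scaling rule, and the 2MB-per-user ratio are illustrative:

    #include <sys/param.h>

    static int
    autotune_maxusers(long physpages)
    {
            /* Roughly one user per 2MB of physical memory (illustrative). */
            long maxusers = physpages / ((2 * 1024 * 1024) / PAGE_SIZE);

    #ifdef VM_MAX_AUTOTUNE_MAXUSERS
            if (maxusers > VM_MAX_AUTOTUNE_MAXUSERS)
                    maxusers = VM_MAX_AUTOTUNE_MAXUSERS;
    #endif
            return ((int)maxusers);
    }
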
attilio
d5d551ec46 Complete MPSAFE VFS interface and remove MNTK_MPSAFE flag.
Porters should refer to __FreeBSD_version 1000021 for this change, as
it may have happened in the same timeframe.
2012-11-09 18:02:25 +00:00
marius
dd60b8658b Make r242655 build on sparc64. While at it, make vm_{max,min}_kernel_address
vm_offset_t as they should be.
2012-11-08 08:10:32 +00:00
jeff
15bd5ad44d - Change ULE to use dynamic slice sizes for the timeshare queue in order
  to further reduce latency for threads in this queue.  This should help
  as threads transition from realtime to timeshare.  The latency is
  bound to a max of sched_slice until we have more than sched_slice / 6
  threads runnable.  Then the min slice is allotted to all threads and
  latency becomes (nthreads - 1) * min_slice.

Discussed with: mav
2012-11-08 01:46:47 +00:00
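
A worked sketch of the policy (the function name and the concrete tick values are illustrative, not the actual ULE code): with sched_slice = 60 ticks and a 6-tick floor, 5 runnable threads get 12 ticks each (latency at most 60), while 30 threads each get the 6-tick floor and worst-case latency grows to 29 * 6 = 174 ticks.

    static int
    slice_for(int nthreads)
    {
            int sched_slice = 60, min_slice = 6;    /* illustrative values */

            if (nthreads <= 1)
                    return (sched_slice);
            /* Divide the round among runnables, but never below the floor. */
            if (sched_slice / nthreads < min_slice)
                    return (min_slice);
            return (sched_slice / nthreads);
    }
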
kevlo
25611f9cf9 Fix typo; s/ouput/output 2012-11-07 07:00:59 +00:00
alfred
5d14749363 Export VM_MIN_KERNEL_ADDRESS and VM_MAX_KERNEL_ADDRESS via sysctl.
On several platforms they are determined by too many nested #defines to be
easily discernible.  This will aid in the development of auto-tuning.
2012-11-06 04:10:32 +00:00
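
Reading the new values from userland; the sysctl names below are assumed from the commit's intent rather than quoted from it:

    #include <sys/types.h>
    #include <sys/sysctl.h>
    #include <stdio.h>

    int
    main(void)
    {
            unsigned long minkva, maxkva;
            size_t len = sizeof(minkva);

            if (sysctlbyname("vm.min_kernel_address", &minkva, &len,
                NULL, 0) == 0 &&
                sysctlbyname("vm.max_kernel_address", &maxkva, &len,
                NULL, 0) == 0)
                    printf("KVA range: %#lx - %#lx\n", minkva, maxkva);
            return (0);
    }
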
kib
0e30ee6e51 Clarify the behaviour of the active vnode list management
regarding vnode page cleaning.

In collaboration with:	pho
MFC after:	1 week
2012-11-05 16:40:42 +00:00
kib
58ab2ca2eb Add decoding of the missing MNT_KERN_ flags to the ddb "show mount" command.
MFC after:	3 weeks
2012-11-04 13:33:13 +00:00
kib
f8b34c9be8 Add decoding of the missing VI_ and VV_ flags to the ddb "show vnode" command.
MFC after:	3 days
2012-11-04 13:32:45 +00:00
kib
84d617582f Order the enumeration of the MNT_ flags to be the same as the order of
their definitions.

MFC after:	3 days
2012-11-04 13:31:41 +00:00
ed
e77bf633a0 Add tty_set_winsize().
This removes some of the signalling magic from the Syscons driver and
puts it in the TTY layer, where it belongs.
2012-11-03 22:21:37 +00:00
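
A hedged sketch of a driver using the helper; the function, tty pointer, and dimensions are illustrative, and locking is assumed to follow the usual tty(9) convention:

    #include <sys/param.h>
    #include <sys/tty.h>

    static void
    resize_console(struct tty *tp)
    {
            struct winsize ws = { .ws_row = 25, .ws_col = 80 };

            tty_lock(tp);
            /* Assumed to update the window size and post SIGWINCH to the
               foreground process group, replacing the open-coded
               signalling that used to live in syscons. */
            tty_set_winsize(tp, &ws);
            tty_unlock(tp);
    }
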
attilio
c754915a07 Merge r242395,242483 from the mutex implementation:
give rwlock(9) the ability to crunch different types of structures, with
the only constraint that they have a lock cookie named rw_lock.
This name then becomes reserved by any struct that wants to use
the rwlock(9) KPI, and other locking primitives cannot reuse it for
their members.

Namely, such structs are the current struct rwlock and the new struct
rwlock_padalign. The new structure will define an object which has the
same layout as a struct rwlock but will be allocated in areas aligned
to the cache line size and will be as big as a cache line.

For further details check comments on above mentioned revisions.

Reviewed by:	jimharris, jeff
2012-11-03 15:57:37 +00:00
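
A sketch of the transparency claim for rwlock(9); the variable, lock name, and init function are illustrative:

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/rwlock.h>

    /* Padded to a cache line; used exactly like a struct rwlock. */
    static struct rwlock_padalign foo_lock;

    static void
    foo_init(void)
    {
            rw_init(&foo_lock, "foo");
            rw_wlock(&foo_lock);
            /* ... */
            rw_wunlock(&foo_lock);
    }
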
alfred
8c8997ccb9 Merge 242488, better use of strlcpy.
Submitted by:	Eric van Gyzen <eric@vangyzen.net>
2012-11-02 18:57:38 +00:00
kib
f16ea99007 r241025 fixed the case when a binary, executed from a nullfs mount,
could still be opened for write from the lower filesystem.  There
is a symmetric situation where the binary could already have file
descriptors opened for write, yet still be executed from the nullfs
overlay.

Handle the issue by passing one v_writecount reference to the lower
vnode if the nullfs vnode has a non-zero v_writecount.  Note that only one
write reference can be donated, since nullfs only keeps one use
reference on the lower vnode.  Always use the lower vnode's v_writecount
for the checks.

Introduce VOP_GET_WRITECOUNT to read v_writecount, which is
currently always bypassed to the lower vnode, and VOP_ADD_WRITECOUNT
to manipulate the v_writecount value, which manages a single bypass
reference to the lower vnode.  Calling the VOPs instead of directly
accessing v_writecount provides the fix described in the previous
paragraph.

Tested by:	pho
MFC after:	3 weeks
2012-11-02 13:56:36 +00:00
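
A simplified sketch of the donated-reference rule described above (not the actual nullfs diff; the function name is hypothetical, vp is the nullfs vnode, lvp the lower vnode, and the required vnode locking is omitted):

    #include <sys/param.h>
    #include <sys/vnode.h>

    static int
    donate_writecount(struct vnode *vp, struct vnode *lvp, int inc)
    {
            int error = 0;

            /*
             * The single donated reference changes hands only when the
             * upper v_writecount crosses zero in either direction.
             */
            if (vp->v_writecount > 0 && vp->v_writecount + inc == 0)
                    error = VOP_ADD_WRITECOUNT(lvp, -1);
            else if (vp->v_writecount == 0 && vp->v_writecount + inc > 0)
                    error = VOP_ADD_WRITECOUNT(lvp, 1);
            if (error == 0)
                    vp->v_writecount += inc;
            return (error);
    }
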
alfred
4a74d2e51a Provide a device name in the sysctl tree for programs to query the
state of crashdump target devices.

This will be used to add a "-l" (ell) flag to dumpon(8) to list the
currently configured dumpdev.

Reviewed by:	phk
2012-11-01 17:01:05 +00:00
attilio
d38d7bb245 Rework the known mutexes to benefit from staying on their own
cache line, avoiding manual frobbing by using
struct mtx_padalign.

The sole exceptions are the nvme and sfxge drivers, where the authors
redefined CACHE_LINE_SIZE manually, so they need to be analyzed and
dealt with separately.

Reviewed by:	jimharris, alc
2012-10-31 18:07:18 +00:00
jimharris
8fbe050915 Pad and align the callout_cpu mtx to its own cacheline to reduce false
sharing especially on the default CPU 0 callout_cpu structure.

This will be followed up by attilio@ with a conversion to the new struct
mtx_padalign but doing this manual conversion first gives an easy MFC
candidate since mtx_padalign is a more extensive system change.

Sponsored by:	Intel
Reviewed by:	jeff, attilio
MFC after:	1 week
2012-10-31 17:12:12 +00:00
attilio
b9e5ac8e1c Give mtx(9) the ability to crunch different types of structures, with the
only constraint that they have a lock cookie named mtx_lock.
This name then becomes reserved by any struct that wants to use the
mtx(9) KPI, and other locking primitives cannot reuse it for their
members.

Namely, such structs are the current struct mtx and the new
struct mtx_padalign.  The new structure will define an object which
has the same layout as a struct mtx but will be allocated in
areas aligned to the cache line size and will be as big as a cache line.

This is supposed to give higher performance for highly contended mutexes,
both spin and sleep (because of the adaptive spinning), where the cache
line contention results in too much traffic on the system bus.

The struct mtx_padalign can be used in a completely transparent way
with the mtx(9) KPI.

At the moment, a possibility to MFC the patch should be carefully
evaluated because this patch breaks the low level KPI
(not its representation though).

Discussed with:	jhb
Reviewed by:	jeff, andre
Reviewed by:	mdf (earlier version)
Tested by:	jimharris
2012-10-31 13:38:56 +00:00
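
A sketch of the transparent use promised here; the mutex name and init function are illustrative:

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>

    /* Same KPI as a plain struct mtx; the object just lands on its
       own cache line. */
    static struct mtx_padalign stats_mtx;

    static void
    stats_init(void)
    {
            mtx_init(&stats_mtx, "stats", NULL, MTX_DEF);
            mtx_lock(&stats_mtx);
            /* ... update hot, contended state ... */
            mtx_unlock(&stats_mtx);
    }
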
attilio
279b97daea Fixup r240246: hwpmc needs to retain the pinning until ASTs have been
executed.  This means past the point where userret() is generally
executed.

Skip the td_pinned check if callchain tracing is currently happening,
and add a more robust check to pmc_capture_user_callchain() in order to
catch td_pinned leaks past ast() in the hwpmc case.

Reported and tested by:	fabient
MFC after:	1 week
X-MFC:	r240246
2012-10-30 15:10:50 +00:00
attilio
97b0f9880b tdq_lock_pair() already does spinlock_enter(), so migration is not
possible in sched_balance_pair().  Remove the redundant sched_pin().

Reviewed by:	marius, jeff
2012-10-30 12:25:52 +00:00
andre
845b471915 In soreceive_stream() don't drop an already dequeued mbuf chain by
overwriting the return mbuf pointer with newly received data after
a loop.  Instead append the new mbuf chain to the existing one.

Fix up sb_lastrecord when dequeuing mbufs so that sbappend_stream()
doesn't get confused.

For the remainder copy case in the mbuf delivery part, deduct the
copied length len instead of the whole mbuf length.  Additionally
don't depend on 'n' being available, which isn't true in the
case of MSG_PEEK.

Fix the MSG_WAITALL case by comparing against sb_hiwat.  Before,
it was looping for every receive, as sb_lowat is normally zero.
Add a comment about the issue with (MSG_WAITALL | MSG_PEEK), which isn't
properly handled.

Submitted by:	trociny (except for the change in last paragraph)
2012-10-29 12:31:12 +00:00