Commit Graph

12948 Commits

Author SHA1 Message Date
pjd
24b0c94606 Configure UMA warnings for the following zones:
- unp_zone: kern.ipc.maxsockets limit reached
- socket_zone: kern.ipc.maxsockets limit reached
- zone_mbuf: kern.ipc.nmbufs limit reached
- zone_clust: kern.ipc.nmbclusters limit reached
- zone_jumbop: kern.ipc.nmbjumbop limit reached
- zone_jumbo9: kern.ipc.nmbjumbo9 limit reached
- zone_jumbo16: kern.ipc.nmbjumbo16 limit reached

Note that those warnings are printed not often than every five minutes and can
be globally turned off by setting sysctl/tunable vm.zone_warnings to 0.

Discussed on:	arch
Obtained from:	WHEEL Systems
MFC after:	2 weeks
2012-12-07 22:30:30 +00:00
pjd
0ae3a2ffee Make use of the fact that uma_zone_set_max(9) already returns actual limit set. 2012-12-07 22:23:53 +00:00
pjd
d71ec288dc More style cleanups. 2012-12-07 22:22:04 +00:00
pjd
8cb7b219b4 Style cleanups. 2012-12-07 22:19:41 +00:00
pjd
4a320b1cc2 - Make socket_zone static - it is used only in this file.
- Update maxsockets on uma_zone_set_max().

Obtained from:	WHEEL Systems
2012-12-07 22:15:51 +00:00
pjd
e7a7ec6407 Style cleanups. 2012-12-07 22:13:33 +00:00
pjd
3702962778 There is no need anymore to include vm/uma.h after r241726.
Obtained from:	WHEEL Systems
2012-12-07 22:05:42 +00:00
alfred
d4467dc033 Allow KASSERT to log instead of panic.
This is to allow debug images to be used without taking down the
system when non-fatal asserts are hit.

The following sysctls are added:

debug.kassert.warn_only: 1 = log, 0 = panic

debug.kassert.do_ktr: set to a ktr mask for logging via KTR

debug.kassert.do_log: 1 = log, 0 = quiet

debug.kassert.warnings: stats, number of kasserts hit

debug.kassert.log_panic_at:
  number of kasserts before we actually panic, 0 = never

debug.kassert.log_pps_limit: pps limit for log messages

debug.kassert.log_mute_at: stop warning after N kasserts, 0 = never stop

debug.kassert.kassert: set this sysctl to trigger a kassert

Discussed with: scottl, gnn, marcel
Sponsored by: iXsystems
2012-12-07 08:25:08 +00:00
alfred
471892517a Use uint instead of int for flags exported via sysctl. 2012-12-07 05:55:48 +00:00
kevlo
c71d284884 - according to POSIX, make socket(2) return EAFNOSUPPORT rather than
EPROTONOSUPPORT if the address family is not supported.
- introduce pffinddomain() to find a domain by family and use it as
  appropriate.

Reviewed by:	glebius
2012-12-07 02:22:48 +00:00
davidxu
386e9d0db5 Eliminate superfluous code. 2012-12-06 06:29:08 +00:00
attilio
3145eea56c Fixup r243901:
- As the comment report, CALLOUT_LOCAL_ALLOC cannot be checked
  directly from the callout flags but might be checked by a cached
  value.  Hence, do so before to actually remove the callout, when
  needed, in softclock_call_cc().
- In softclock_call_cc() also add a comment in the waiting and deferred
  migration case explaining that the dereference should be safe
  because of the migration dereference invariants.

Additively:
- In softclock_call_cc(), for the deferred migration case, move all the
  accesses to callout structure after the comment stating the callout
  must not be destroyed.
- For consistency with this last tweak, use cached c_flags for the
  KASSERT() in the deferred migration case.  It is not strictly necessary
  but this way all the callout accesses happen after the above mentioned
  comment, improving consistency.

Pointy hat to:	me
Sponsored by:	Isilon Systems / EMC Corporation
Reviewed by:	kib
MFC after:	2 weeks
X-MFC:		243901
2012-12-05 22:32:12 +00:00
kib
efc06bd801 The softclock_call_cc() is executing with the callout already removed
from the callwheel. Calculate the cc->cc_next before removing the
callout, otherwise the code followed the invalid tailq links.  After
this, make softclock_call_cc() return void, since it always return
cc->cc_next, which is immediately available to the softclock()
anyway. This also allows to eliminate a label under #ifdef SMP.

Remove the assignment of cc->cc_next from callout_cc_del(), since the
function is called with the callout already removed from callwheel.

If cancelling the migration, also clear the CALLOUT_DFRMIGRATION flag.

Postpone the free of the timeout(9) allocated callouts after the
migration checks are done.

Add some more strict asserts about the state of the callout in
callout_call_cc().

Reviewed by:	attilio
Reported and tested by:	pho (previous version)
MFC after:	2 weeks
2012-12-05 19:02:22 +00:00
attilio
97d8ae3890 Check for lockmgr recursion in case of disown and downgrade and panic
also in !debugging kernel rather than having "undefined" behaviour.

Tested by:	avg
MFC after:	1 week
2012-12-05 15:11:01 +00:00
glebius
8e20fa5ae9 Mechanically substitute flags from historic mbuf allocator with
malloc(9) flags within sys.

Exceptions:

- sys/contrib not touched
- sys/mbuf.h edited manually
2012-12-05 08:04:20 +00:00
kib
54d4ef7790 Fix a race between kern_setitimer() and realitexpire(), where the
callout is started before kern_setitimer() acquires process mutex, but
looses a race and kern_setitimer() gets the process mutex before the
callout.  Then, assuming that new specified struct itimerval has
it_interval zero, but it_value non-zero, the callout, after it starts
executing again, clears p->p_realtimer.it_value, but kern_setitimer()
already rescheduled the callout.

As the result of the race, both p_realtimer is zero, and the callout
is rescheduled. Then, in the exit1(), the exit code sees that it_value
is zero and does not even try to stop the callout. This allows the
struct proc to be reused and eventually the armed callout is
re-initialized.  The consequence is the corrupted callwheel tailq.

Use process mutex to interlock the callout start, which fixes the race.

Reported and tested by:	pho
Reviewed by:	jhb
MFC after:	2 weeks
2012-12-04 20:49:39 +00:00
kib
f366a4aadf Do not allocate buffer of the 255 bytes length on the stack.
Reported and tested by:	sig6247@gmail.com
MFC after:	1 week
2012-12-04 20:49:04 +00:00
alfred
d7137f4490 replace bit shifting loop with 1<<fls(n), improve comments.
Reviewed by: davide
2012-12-04 05:28:20 +00:00
kib
69dfcf0272 The vnode_free_list_mtx is required unconditionally when iterating
over the active list. The mount interlock is not enough to guarantee
the validity of the tailq link pointers. The __mnt_vnode_next_active()
and __mnt_vnode_first_active() active lists iterators helper functions
did not provided the neccessary stability for the list, allowing the
iterators to pick garbage.

This was uncovered after the r243599 made the active list iterators
non-nop.

Since a vnode interlock is before the vnode_free_list_mtx, obtain the
vnode ilock in the non-blocking manner when under vnode_free_list_mtx,
and restart iteration after the yield if the lock attempt failed.

Assert that a vnode found on the list is active, and assert that the
helpers return the vnode with interlock owned.

Reported and tested by:	pho
MFC after:	1 week
2012-12-03 22:15:16 +00:00
pjd
c6ea39d1ef Fix one more compilation issue. 2012-12-01 08:59:36 +00:00
pjd
632d7191a2 IFp4 @208451:
Fix path handling for *at() syscalls.

Before the change directory descriptor was totally ignored,
so the relative path argument was appended to current working
directory path and not to the path provided by descriptor, thus
wrong paths were stored in audit logs.

Now that we use directory descriptor in vfs_lookup, move
AUDIT_ARG_UPATH1() and AUDIT_ARG_UPATH2() calls to the place where
we hold file descriptors table lock, so we are sure paths will
be resolved according to the same directory in audit record and
in actual operation.

Sponsored by:	FreeBSD Foundation (auditdistd)
Reviewed by:	rwatson
MFC after:	2 weeks
2012-11-30 23:18:49 +00:00
pjd
a3ce94291c IFp4 @208450:
Remove redundant call to AUDIT_ARG_UPATH1().
Path will be remembered by the following NDINIT(AUDITVNODE1) call.

Sponsored by:	FreeBSD Foundation (auditdistd)
MFC after:	2 weeks
2012-11-30 22:49:28 +00:00
andre
051df828de Using a long is the wrong type to represent the realmem and maxmbufmem
variable as they may overflow on i386/PAE and i386 with > 2GB RAM.

Use 64bit quad_t instead.  It has broader kernel infrastructure support
with TUNABLE_QUAD_FETCH() and qmin/qmax() than other available types.

Pointed out by:	alc, bde
2012-11-29 07:30:42 +00:00
andre
b0ff37a790 Complete r243631 by applying the remainder of kern_mbuf.c that got
lost while merging into the commit tree.

MFC after:	1 month
X-MFC-with:	r243631
2012-11-27 23:16:56 +00:00
andre
48786251b2 Fix r243627 by testing against the head socket instead of the socket
just created.

MFC after:	1 week
X-MFC-with:	r243627
2012-11-27 22:35:48 +00:00
andre
d85b510608 Base the mbuf related limits on the available physical memory or
kernel memory, whichever is lower.  The overall mbuf related memory
limit must be set so that mbufs (and clusters of various sizes)
can't exhaust physical RAM or KVM.

The limit is set to half of the physical RAM or KVM (whichever is
lower) as the baseline.  In any normal scenario we want to leave
at least half of the physmem/kvm for other kernel functions and
userspace to prevent it from swapping too easily.  Via a tunable
kern.maxmbufmem the limit can be upped to at most 3/4 of physmem/kvm.

At the same time divorce maxfiles from maxusers and set maxfiles to
physpages / 8 with a floor based on maxusers.  This way busy servers
can make use of the significantly increased mbuf limits with a much
larger number of open sockets.

Tidy up ordering in init_param2() and check up on some users of
those values calculated here.

Out of the overall mbuf memory limit 2K clusters and 4K (page size)
clusters to get 1/4 each because these are the most heavily used mbuf
sizes.  2K clusters are used for MTU 1500 ethernet inbound packets.
4K clusters are used whenever possible for sends on sockets and thus
outbound packets.  The larger cluster sizes of 9K and 16K are limited
to 1/6 of the overall mbuf memory limit.  When jumbo MTU's are used
these large clusters will end up only on the inbound path.  They are
not used on outbound, there it's still 4K.  Yes, that will stay that
way because otherwise we run into lots of complications in the
stack.  And it really isn't a problem, so don't make a scene.

Normal mbufs (256B) weren't limited at all previously.  This was
problematic as there are certain places in the kernel that on
allocation failure of clusters try to piece together their packet
from smaller mbufs.

The mbuf limit is the number of all other mbuf sizes together plus
some more to allow for standalone mbufs (ACK for example) and to
send off a copy of a cluster.  Unfortunately there isn't a way to
set an overall limit for all mbuf memory together as UMA doesn't
support such a limiting.

NB: Every cluster also has an mbuf associated with it.

Two examples on the revised mbuf sizing limits:

1GB KVM:
 512MB limit for mbufs
 419,430 mbufs
  65,536 2K mbuf clusters
  32,768 4K mbuf clusters
   9,709 9K mbuf clusters
   5,461 16K mbuf clusters

16GB RAM:
 8GB limit for mbufs
 33,554,432 mbufs
  1,048,576 2K mbuf clusters
    524,288 4K mbuf clusters
    155,344 9K mbuf clusters
     87,381 16K mbuf clusters

These defaults should be sufficient for even the most demanding
network loads.

MFC after:	1 month
2012-11-27 21:19:58 +00:00
andre
e323c31816 Fix a race on listen socket teardown where while draining the
accept queues a new socket/connection may be added to the queue
due to a race on the ACCEPT_LOCK.

The submitted patch is slightly changed in comments, teardown
and locking order and extended with KASSERT's.

Submitted by:	Vijay Singh <vijju.singh-at-gmail-dot-com>
Found by:	His team.
MFC after:	1 week
2012-11-27 20:04:52 +00:00
pjd
e4de9f38a2 Add kern.capmode_coredump sysctl/tunable to allow processes in capability mode
to dump core.

Reviewed by:	rwatson
Obtained from:	WHEEL Systems
MFC after:	2 weeks
2012-11-27 10:38:11 +00:00
pjd
7a831b4b8c - Add NOCAPCHECK flag to namei that allows lookup to work even if the process
is in capability mode.
- Add VN_OPEN_NOCAPCHECK flag for vn_open_cred() to will ne converted into
  NOCAPCHECK namei flag.

This functionality will be used to enable core dumps for sandboxed processes.

Reviewed by:	rwatson
Obtained from:	WHEEL Systems
MFC after:	2 weeks
2012-11-27 10:32:35 +00:00
pjd
6a3d82fdf4 Regenerate after r243610. 2012-11-27 10:25:03 +00:00
pjd
c079174e14 Allow to use kill(2) in capability mode, but process can send a signal only
to himself. For example abort(3) at first tries to do kill(getpid(), SIGABRT)
which was failing in capability mode, so the code was failing back to exit(1).

Reviewed by:	rwatson
Obtained from:	WHEEL Systems
MFC after:	2 weeks
2012-11-27 10:22:40 +00:00
pjd
02c0badfc1 Allow to modify kern.sugid_coredump and kern.corefile from loader.conf.
Obtained from:	WHEEL Systems
2012-11-27 10:16:48 +00:00
pjd
8dbfcd9003 More style fixes. 2012-11-27 10:15:58 +00:00
pjd
0b5aef9e2b Style fixes (mostly whitespaces). 2012-11-27 10:11:54 +00:00
davidxu
adc108b87e Take first active vnode correctly.
Reviewed by:	kib
MFC after:	3 days
2012-11-27 06:07:58 +00:00
pjd
aec1bace62 Look for zombie process only if we were given process id.
Reviewed by:	kib
MFC after:	2 weeks
X-MFC-after-or-with:	243142
2012-11-25 19:31:42 +00:00
avg
fa7647f75a remove stop_scheduler_on_panic knob
There has not been any complaints about the default behavior, so there
is no need to keep a knob that enables the worse alternative.

Now that the hard-stopping of other CPUs is the only behavior, the panic_cpu
spinlock-like logic can be dropped, because only a single CPU is
supposed to win stop_cpus_hard(other_cpus) race and proceed past that
call.

MFC after:	1 month
2012-11-25 14:22:08 +00:00
avg
9c2d52ecde assert_vop_locked: make the assertion race-free and more efficient
this is really a minor improvement for the sake of correctness

MFC after:	6 days
2012-11-24 13:11:47 +00:00
avg
c20b4131f0 remove vop_lookup_pre and vop_lookup_post
Suggested by:	kib
MFC after:	5 days
2012-11-22 10:36:10 +00:00
kib
6d46d7b7ab Schedule garbage collection run for the in-flight rights passed over
the unix domain sockets to the next tick, coalescing the serial calls
until the collection fires.  The thought is that more work for the
collector could arise in the near time, allowing to clean more and not
spend too much CPU on repeated collection when there is no garbage.

Currently the collection task is fired immediately upon unix domain
socket close if there are any rights in flight, which caused excessive
CPU usage and too long blocking of the threads waiting for
unp_list_lock and unp_link_rwlock in write mode.

Robert noted that it would be nice if we could find some heuristic by
which we decide whether to run GC a bit more quickly.  E.g., if the
number of UNIX domain sockets is close to its resource limit, but not
quite.

Reported and tested by:	Markus Gebert <markus.gebert@hostpoint.ch>
Reviewed by:	rwatson
MFC after:	2 weeks
2012-11-20 15:45:48 +00:00
kib
f0eb44bc70 Add a special meaning to the negative ticks argument for
taskqueue_enqueue_timeout().  Do not rearm the callout if it is
already armed and the ticks is negative.  Otherwise rearm it to fire
in abs(ticks) ticks in the future.

The intended use is to call taskqueue_enqueue_timeout() for the given
timeout_task with the same negative ticks argument.  As result, the
task is scheduled to execute not further than abs(ticks) ticks in
future, and the consequent enqueues are coalesced until the already
scheduled task is finished.

Reviewed by:	rwatson
Tested by:	Markus Gebert <markus.gebert@hostpoint.ch>
MFC after:	2 weeks
2012-11-20 15:33:48 +00:00
attilio
e331787780 insmntque() is always called with the lock held in exclusive mode,
then:
- assume the lock is held in exclusive mode and remove a moot check
  about the lock acquisition.
- in the destructor remove !MPSAFE specific chunk.

Reviewed by:	kib
MFC after:	2 weeks
2012-11-19 20:43:19 +00:00
avg
4726ba44fc assert_vop_locked should treat LK_EXCLOTHER as the not locked case
... from a perspective of the current thread.

Spotted by:	mjg
Discussed with:	kib
MFC after:	18 days
2012-11-19 11:35:56 +00:00
avg
4d2c561ebf vnode_if: fix locking protocol description for lookup and cachedlookup
Also remove the checks from vop_lookup_pre and vop_lookup_post, which
are now completely redundant (before this change they were partially
redundant).

Discussed with:	kib
MFC after:	10 days
2012-11-19 11:32:56 +00:00
mjg
fb4bab611c Fix possible fp reference leak in posix_openpt
Reviewed by:	ed
Approved by:	trasz (mentor)
MFC after:	3 days
2012-11-18 15:48:34 +00:00
glebius
b821e8658c Update comment. 2012-11-16 14:00:54 +00:00
kib
801de09716 In pget(9), if PGET_NOTWEXIT flag is not specified, also search the
zombie list for the pid. This allows several kern.proc sysctls to
report useful information for zombies.

Hold the allproc_lock around all searches instead of relocking it.
Remove private pfind_locked() from the new nfs client code.

Requested and reviewed by:	pjd
Tested by:	pho
MFC after:	3 weeks
2012-11-16 08:25:06 +00:00
kib
8123554495 Restore the proper handling of the pid 0 for waitpid(2).
Fix the style around.

Reported and reviewed by:	bde (previous version)
MFC after:	28 days
2012-11-16 06:32:38 +00:00
kib
de90907af2 Style fixes for r242958.
Reported and reviewed by:	bde
MFC after:	28 days
2012-11-16 06:22:14 +00:00
trasz
f25f7f2e87 Improve KASSERT messages in racct, to make it clear which resource
caused the problem.

Submitted by:	mjg
2012-11-15 15:55:49 +00:00