Commit Graph

13277 Commits

Mikolaj Golub
9e89077c65 Plug up the lock leakage when exporting to a short buffer.
Reported by:	Alexander Leidinger
Submitted by:	mjg
MFC after:	1 week
2013-07-01 03:27:14 +00:00
Konstantin Belousov
70a7dd5d5b Fix issues with zeroing and fetching the counters, on x86 and ppc64.
Issues were noted by Bruce Evans and are present on all architectures.

On i386, a counter fetch should use an atomic read of the 64-bit value;
otherwise a carry from an increment on another CPU could be lost for the
given fetch, producing an error of 2^32.  If a 64-bit read (cmpxchg8b) is
not available on the machine, it cannot be SMP and it is enough to disable
preemption around the read to avoid a split read.

On x86 the counter increment is not atomic on purpose, which makes it
possible for the store of the incremented result to overwrite a
just-zeroed per-CPU slot.  The effect would be a counter going off by an
arbitrary value after zeroing.  Perform the counter zeroing on the same
processor which does the increments, making the operations mutually
exclusive.  On i386, as with fetching, if cmpxchg8b is not available the
machine is not SMP and we disable preemption for zeroing.

PowerPC64 is treated the same as amd64.

For other architectures, the changes were made only to allow the
compilation to succeed, without fixing the issues with zeroing or
fetching.  It should be possible to handle them by using 64-bit loads and
stores that are atomic WRT preemption (assuming the architectures are
also converted from using critical sections to proper asm).  If an
architecture does not provide that facility, using a global (spin) mutex
would be a non-optimal but working solution.

Noted by:  bde
Sponsored by:	The FreeBSD Foundation
2013-07-01 02:48:27 +00:00
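A minimal userland model of the fetch issue described in the commit above, using C11 atomics in place of the kernel's cmpxchg8b-based read. The per-CPU slot layout, CPU count and function names here are illustrative assumptions, not the counter(9) implementation:

    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NCPU 4                          /* hypothetical CPU count */

    /* One 64-bit slot per CPU; a fetch must read each slot atomically,
     * otherwise it can observe the low word before a carry and the high
     * word after it, and be off by 2^32. */
    static _Atomic uint64_t pcpu_slot[NCPU];

    static uint64_t
    counter_fetch(void)
    {
        uint64_t sum = 0;

        for (int cpu = 0; cpu < NCPU; cpu++)
            sum += atomic_load_explicit(&pcpu_slot[cpu],
                memory_order_relaxed);
        return (sum);
    }

    int
    main(void)
    {
        /* Simulate an increment that carries into the high 32 bits. */
        atomic_fetch_add_explicit(&pcpu_slot[0], 1ULL << 32,
            memory_order_relaxed);
        printf("%llu\n", (unsigned long long)counter_fetch());
        return (0);
    }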
Mateusz Guzik
f15ba03632 acct: create a special plimit object and set it for exiting processes
instead of allocating a new one each time

All limits are set to RLIM_INFINITY, which should be ok (even though we
care only about RLIMIT_FSIZE in this case).

MFC after:	1 week
2013-06-30 19:08:06 +00:00
Mateusz Guzik
4a3c4f41e9 acct: reduce code duplication by using acct_disable as cleanup for
failed kproc_create

MFC after:	1 week
2013-06-30 13:17:37 +00:00
Peter Wemm
0a943e59c9 Revert accidental commit.
Pointy hat to:	 peter
2013-06-29 05:05:57 +00:00
Peter Wemm
76665df9bf Help out gcc. clang understands.
sys_generic.c:1510: warning: 'precision' may be used uninitialized
*** [sys_generic.o] Error code 1
2013-06-29 04:35:04 +00:00
Davide Italiano
7563068847 Correct the comment above the _sleep() function, which still mentions 'timo'
instead of 'sbintime_t'.

Reported by:	kan
2013-06-28 21:04:15 +00:00
Davide Italiano
237abf0c56 - Trim an unused and bogus Makefile for mount_smbfs.
- Reconnect with some minor modifications; in particular, selsocket()
internals are now adapted to use sbintime units after the recent-ish
calloutng switch.
2013-06-28 21:00:08 +00:00
Mateusz Guzik
07bd8bf929 Remove duplicate NULL check in kern_proc_filedesc_out.
No functional changes.

MFC after:	1 week
2013-06-28 18:32:46 +00:00
Mikolaj Golub
6359d169ef Rework r252313:
The filedesc lock may not be dropped unconditionally before exporting
fd to sbuf: fd might go away during execution.  While it is ok for
DTYPE_VNODE and DTYPE_FIFO because the export is from a vrefed vnode
here, for other types it is unsafe.

Instead, drop the lock in export_fd_to_sb(), after preparing data in
memory and before writing to sbuf.

Spotted by:	mjg
Suggested by:	kib
Reviewed by:	kib
MFC after:	1 week
2013-06-28 18:07:41 +00:00
Ryan Stone
4a7d0bfcaa Correct a bug that prevented deadlkres from (almost) ever firing.
deadlkres was using a reversed test to check whether ticks had rolled over.
This meant that deadlkres could only fire after ticks had rolled over.
This test was actually unnecessary, as deadlkres only ever took the
difference of ticks values, which is safe even in the presence of ticks
rollover.  Remove the tests entirely.  Now deadlkres will properly fire
after a lock has been held longer than the timeout period.

MFC after:	1 month
2013-06-28 15:55:30 +00:00
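A small standalone illustration of why taking the difference of two tick counts is rollover-safe; the names and values are made up and this is not the deadlkres code itself:

    #include <assert.h>
    #include <stdio.h>

    int
    main(void)
    {
        /* With modular (unsigned) arithmetic, "now - then" yields the
         * elapsed ticks even when the counter wrapped in between, so no
         * explicit rollover test is needed. */
        unsigned int then = 0xfffffff0u;    /* just before wraparound */
        unsigned int now = 0x00000010u;     /* just after wraparound */
        unsigned int elapsed = now - then;  /* 0x20, despite the wrap */

        assert(elapsed == 0x20);
        printf("elapsed=%u ticks\n", elapsed);
        return (0);
    }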
Jeff Roberson
5f51836645 - Add a general purpose resource allocator, vmem, from NetBSD. It was
originally inspired by the Solaris vmem detailed in the proceedings
   of usenix 2001.  The NetBSD version was heavily refactored for bugs
   and simplicity.
 - Use this resource allocator to allocate the buffer and transient maps.
   Buffer cache defrags are reduced by 25% when used by filesystems with
   mixed block sizes.  Ultimately this may permit dynamic buffer cache
   sizing on low KVA machines.

Discussed with:	alc, kib, attilio
Tested by:	pho
Sponsored by:	EMC / Isilon Storage Division
2013-06-28 03:51:20 +00:00
John Baldwin
e35ce1f271 Make detaching drivers from PCI devices more robust. While here, fix a
bug where a PCI device would be powered down if it failed to probe, but
not when its driver was detached (e.g. via kldunload).
- Add a new helper method resource_list_release_active() which forcefully
  releases any active resources of a specified type from a resource list.
- Add a bus_child_detached method for the PCI bus driver which forces any
  active resources to be released (and whines to the console if it finds
  any) and then powers the device down.
- Call pci_child_detached() if we fail to probe a device when a driver
  is kldloaded.  This isn't perfect but can avoid leaking resources
  from a probe() routine in the kldload case.

Reviewed by:	imp, brooks
MFC after:	1 month
2013-06-27 20:21:54 +00:00
Mikolaj Golub
bd973910c8 To avoid LOR, always drop the filedesc lock before exporting fd to sbuf.
Reviewed by:	kib
MFC after:	3 days
2013-06-27 19:14:03 +00:00
John Baldwin
b5fb43e572 A few mostly cosmetic nits to aid in debugging:
- Call lock_init() first before setting any lock_object fields in
  lock init routines.  This way if the machine panics due to a duplicate
  init the lock's original state is preserved.
- Somewhat similarly, don't decrement td_locks and td_slocks until after
  an unlock operation has completed successfully.
2013-06-25 20:23:08 +00:00
John Baldwin
cd32bd7ad1 Several improvements to rmlock(9). Many of these are based on patches
provided by Isilon.
- Add an rm_assert() supporting various lock assertions similar to other
  locking primitives.  Because rmlocks track readers the assertions are
  always fully accurate unlike rw_assert() and sx_assert().
- Flesh out the lock class methods for rmlocks to support sleeping via
  condvars and rm_sleep() (but only while holding write locks), rmlock
  details in 'show lock' in DDB, and the lc_owner method used by
  dtrace.
- Add an internal destroyed cookie so that API functions can assert
  that an rmlock is not destroyed.
- Make use of rm_assert() to add various assertions to the API (e.g.
  to assert locks are held when an unlock routine is called).
- Give RM_SLEEPABLE locks their own lock class and always use the
  rmlock's own lock_object with WITNESS.
- Use THREAD_NO_SLEEPING() / THREAD_SLEEPING_OK() to disallow sleeping
  while holding a read lock on an rmlock.

Submitted by:	andre
Obtained from:	EMC/Isilon
2013-06-25 18:44:15 +00:00
Lawrence Stewart
0963c8e431 When a previous call to sbsndptr() leaves sb->sb_sndptroff at the start of an
mbuf that was fully consumed by that call, the mbuf ptr returned by the
current call ends up being the mbuf in the sb chain immediately before the
one that contains the data we want.

This does not cause any observable issues because the mbuf copy routines happily
walk the mbuf chain to get to the data at the moff offset, which in this case
means they effectively skip over the mbuf returned by sbsndptr().

We can't adjust sb->sb_sndptr during the previous call for this case because the
next mbuf in the chain may not exist yet. We therefore need to detect the
condition and make the adjustment during the current call.

Fix by detecting the special case of moff being at the start of the next mbuf in
the chain and adjust the required accounting variables accordingly.

Reviewed by:	andre
MFC after:	2 weeks
2013-06-19 03:08:01 +00:00
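A simplified, self-contained model of the boundary condition described in the commit above; it uses a generic buffer chain instead of real mbufs and sockbufs, and the struct and function names are made up for illustration:

    #include <stddef.h>
    #include <stdio.h>

    struct chainbuf {                  /* toy stand-in for an mbuf */
        struct chainbuf *next;
        int len;
    };

    /* Return the buffer that actually holds the byte at *off.  If the
     * previous call consumed the cached buffer exactly (*off == ptr->len),
     * advance to the next buffer and reset the offset -- the special case
     * the commit detects and adjusts for. */
    static struct chainbuf *
    sndptr_adjust(struct chainbuf *ptr, int *off)
    {
        if (*off == ptr->len && ptr->next != NULL) {
            ptr = ptr->next;
            *off = 0;
        }
        return (ptr);
    }

    int
    main(void)
    {
        struct chainbuf b2 = { NULL, 100 };
        struct chainbuf b1 = { &b2, 50 };
        int off = 50;                  /* a previous call consumed all of b1 */
        struct chainbuf *p = sndptr_adjust(&b1, &off);

        printf("advanced=%d off=%d\n", p == &b2, off);
        return (0);
    }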
Lawrence Stewart
ec41a9a1bd The fix committed in r250951 replaced the reported panic with a deadlock... gold
star for me. EVENTHANDLER_DEREGISTER() attempts to acquire the lock which is
held by the event handler framework while executing event handler functions,
leading to deadlock.

Move EVENTHANDLER_DEREGISTER() to alq_load_handler() and thus deregister the ALQ
shutdown_pre_sync handler at module unload time, which takes care of the
originally reported panic and fixes the deadlock introduced in r250951.

Reported by:	Luiz Otavio O Souza
MFC after:	3 days
X-MFC with:	250951
2013-06-17 09:49:07 +00:00
Ed Schouten
2381f6ef8c Change callout use counter to use C11 atomics.
In order to get some coverage of C11 atomics in kernelspace, switch at
least one piece of code in kernelspace to use C11 atomics instead of
<machine/atomic.h>.

While there, slightly improve the code by adding an assertion to prevent
the use count from going negative.
2013-06-16 09:30:35 +00:00
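A generic standalone sketch of the pattern: a use counter maintained with <stdatomic.h> plus an assertion that keeps the count from going negative. The variable and function names are invented; this is not the callout code itself:

    #include <assert.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_int usecount;        /* illustrative use counter */

    static void
    use_acquire(void)
    {
        atomic_fetch_add_explicit(&usecount, 1, memory_order_relaxed);
    }

    static void
    use_release(void)
    {
        int old = atomic_fetch_sub_explicit(&usecount, 1,
            memory_order_relaxed);

        assert(old > 0);               /* count must never go negative */
    }

    int
    main(void)
    {
        use_acquire();
        use_release();
        printf("final=%d\n", atomic_load(&usecount));
        return (0);
    }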
Lawrence Stewart
38f080cb04 Move hhook's per-vnet initialisation to an earlier SYSINIT SI_SUB stage to
ensure all per-vnet related hhook initialisation is completed prior to any
virtualised hhook points attempting registration.

vnet_register_sysinit() requires that a stage later than SI_SUB_VNET be chosen.
There are no per-vnet initialisers in the source tree at this time which run
earlier than SI_SUB_INIT_IF. A quick audit of non-virtualised SYSINITs indicates
there are no subsystems pre SI_SUB_MBUF that would likely be interested in
registering a virtualised hhook point.

Settle on SI_SUB_MBUF as hhook's per-vnet initialisation stage as it's the first
overtly network-related initialisation stage to run after SI_SUB_VNET. If a
subsystem that initialises earlier than SI_SUB_MBUF ends up wanting to register
virtualised hhook points in future, hhook's use of SI_SUB_MBUF will need to be
revisited and would probably warrant creating a dedicated SI_SUB_HHOOK which
runs immediately after SI_SUB_VNET.

MFC after:	1 week
2013-06-15 10:08:34 +00:00
Lawrence Stewart
933a8bff73 Cleanup and simplification in khelp_{register|deregister}_helper(). No
functional changes.

MFC after:	1 week
2013-06-15 06:45:17 +00:00
Lawrence Stewart
58261d30e1 Add a private KPI between hhook and khelp that allows khelp modules to insert
hook functions into hhook points which register after the modules were loaded -
potentially useful during boot or if hhook points are dynamically registered.

MFC after:	1 week
2013-06-15 05:57:29 +00:00
Lawrence Stewart
b1f53277ec Internalise handling of virtualised hook points inside
hhook_{add|remove}_hook_lookup() so that khelp (and other potential API
consumers) do not have to care when they attempt to (un)hook a particular hook
point identified by id and type.

Reviewed by:	scottl
MFC after:	1 week
2013-06-15 04:03:40 +00:00
Lawrence Stewart
bfe72a58e2 Fix a major oversight in r251732 which causes non-VIMAGE kernels to trigger a
KASSERT during TCP hhook registration at boot. Virtualised hook points only
require extra housekeeping and sanity checking when "options VIMAGE" is present.

Reported by:	bdrewery,jh,dhw
Tested by:	dhw
MFC after:	1 week
X-MFC with:	251732
2013-06-14 18:11:21 +00:00
Lawrence Stewart
601d4c7543 Add support for non-virtualised hhook points, which are uniquely identified by
type and id, as compared to virtualised hook points which are now uniquely
identified by type, id and a vid (which for vimage is the pointer to the vnet
that the hhook resides in).

All hhook_head structs for both virtualised and non-virtualised hook points
coexist in hhook_head_list, and a separate list is maintained for hhook points
within each vnet to simplify some vimage-related housekeeping.

Reviewed by:	scottl
MFC after:	1 week
2013-06-14 04:10:34 +00:00
Lawrence Stewart
86241d89a9 Fix a potential NULL-pointer dereference that would trigger if the hhook
registration site did not provide storage for a copy of the hhook_head struct.

MFC after:	3 days
2013-06-14 02:25:40 +00:00
Jeff Roberson
17a2737732 - Add a BIT_FFS() macro and use it to replace cpusetffs_obj()
Discussed with:	attilio
Sponsored by:	EMC / Isilon Storage Division
2013-06-13 20:46:03 +00:00
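A standalone model of what a BIT_FFS()-style operation does: return the 1-based index of the first set bit in a multi-word bit set, or 0 if the set is empty. The word layout below is a simplified assumption, not the bitset(9) internals:

    #include <stdio.h>

    #define NWORDS 4
    #define BITS_PER_WORD ((int)(sizeof(long) * 8))

    /* 1-based index of the first set bit, or 0 if no bit is set. */
    static int
    bitset_ffs(const long set[NWORDS])
    {
        for (int i = 0; i < NWORDS; i++)
            if (set[i] != 0)
                return (i * BITS_PER_WORD + __builtin_ffsl(set[i]));
        return (0);
    }

    int
    main(void)
    {
        long set[NWORDS] = { 0, 0, 1L << 5, 0 };

        /* On LP64 this prints 134 (2 * 64 + 6). */
        printf("first set bit: %d\n", bitset_ffs(set));
        return (0);
    }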
Konstantin Belousov
1d7466bca4 Fix two issues with the spin loops in the umtx(2) implementation.
- When looping, check for the pending suspension.  Otherwise, another
  usermode thread which races with the looping one could try to
  prevent the process from stopping or exiting.

- Add missed checks for the faults from casuword*().  The code is
  structured in a way which makes the loops exit if the specified
  address is invalid, since both fuword() and casuword() return -1 on
  the fault.  But if the address is mapped readonly, the typical value
  read by fuword() is different from -1, while casuword() returns -1.
  Absent the checks for casuword() faults, this is interpreted as the
  race with other thread and causes non-interruptible spinning in the
  kernel.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-06-13 09:33:22 +00:00
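A simplified standalone model of the failure mode described in the commit above: if a casuword()-style primitive reports a fault by returning -1 and the caller only compares the return value against the expected one, a read-only mapping turns into endless spinning. The helper below is a stand-in with an artificial fault flag, not the real casuword():

    #include <stdint.h>
    #include <stdio.h>

    /* Stand-in for casuword(): compare-and-set a word in "user" memory,
     * returning the old value, or -1 if the access faults (for example,
     * because the page is mapped read-only). */
    static intptr_t
    casuword_model(intptr_t *addr, intptr_t old, intptr_t new, int faults)
    {
        intptr_t cur;

        if (faults)
            return (-1);
        cur = *addr;
        if (cur == old)
            *addr = new;
        return (cur);
    }

    int
    main(void)
    {
        intptr_t word = 0, expected = 0, rv;

        for (;;) {
            rv = casuword_model(&word, expected, 1, 1 /* read-only page */);
            if (rv == -1) {
                /* The check the commit adds: treat -1 as a fault and bail
                 * out, instead of misreading it as "another thread changed
                 * the word" and spinning forever. */
                puts("fault detected, would return EFAULT");
                return (1);
            }
            if (rv == expected)
                break;                 /* CAS succeeded */
            expected = rv;             /* lost a race; retry */
        }
        puts("acquired");
        return (0);
    }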
Marcel Moolenaar
4612275fdb Revert r251590. It unexpectedly broke the build and there were some
questions on locking. As part of commit-bit grooming, I'd like Steve
to handle this, but can't leave things broken in the meantime.
2013-06-10 15:22:27 +00:00
Marcel Moolenaar
8c7ca16f63 Add vfs_mounted and vfs_unmounted events so that components can be informed
about mount and unmount events. This is used by Juniper for a more
optimal implementation of NetBSD's veriexec.

Submitted by:	stevek@juniper.net
Obtained from:	Juniper Networks, Inc
2013-06-09 23:51:26 +00:00
Gleb Smirnoff
8d1aa3c6b4 aio_mlock() added:
- Regen for r251526.
- Bump __FreeBSD_version.
2013-06-08 13:30:13 +00:00
Gleb Smirnoff
6160e12c10 Add a new system call, aio_mlock(). The name speaks for itself. It allows
the mlock(2) operation, which can consume a lot of time, to be performed
under the control of aio(4).

Reviewed by:	kib, jilles
Sponsored by:	Nginx, Inc.
2013-06-08 13:27:57 +00:00
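A hedged userland usage sketch for the new call, assuming a FreeBSD system with aio(4) available and aio_mlock() declared in <aio.h>; error handling is kept minimal and the buffer size is arbitrary:

    #include <aio.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int
    main(void)
    {
        size_t len = 16 * 1024 * 1024;     /* arbitrary 16MB region */
        void *buf = malloc(len);
        struct aiocb cb;
        const struct aiocb *list[1];
        int err;

        if (buf == NULL)
            return (1);
        memset(&cb, 0, sizeof(cb));
        cb.aio_buf = buf;                  /* region to wire */
        cb.aio_nbytes = len;

        if (aio_mlock(&cb) != 0) {         /* queue the request with aio(4) */
            perror("aio_mlock");
            return (1);
        }
        list[0] = &cb;
        aio_suspend(list, 1, NULL);        /* block until it completes */
        err = aio_error(&cb);
        if (err == 0)
            printf("wired, aio_return=%zd\n", aio_return(&cb));
        else
            fprintf(stderr, "aio_mlock failed: %s\n", strerror(err));
        free(buf);
        return (0);
    }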
Gleb Smirnoff
f95c13db04 Separate LIO_SYNC processing into a separate function aio_process_sync(),
and rename aio_process() into aio_process_rw().

Reviewed by:	kib
Sponsored by:	Nginx, Inc.
2013-06-08 13:02:43 +00:00
John Baldwin
c9813d0a37 Do not compare the existing mask of a cpuset with a new mask when changing
the mask of a cpuset.  Also, change the cpuset's mask before updating the
masks of all children.  Previously changing a cpuset's mask first required
setting the mask to a super-set of both the old and new masks and then
changing it a second time to the new mask.
2013-06-06 14:43:19 +00:00
Alan Cox
27a18d6a23 Don't busy the page unless we are likely to release the object lock.
Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2013-06-06 06:17:20 +00:00
Jeff Roberson
ba39d89bc9 - Consolidate duplicate code into support functions.
- Split the bqlock into bqclean and bqdirty locks.
 - Only acquire the wakeup synchronization locks when we cross a
   threshold requiring them.
 - Restructure the way flushbufqueues() targets work so they are more
   smp friendly and sane.

Reviewed by:	kib
Discussed with:	mckusick, attilio
Sponsored by:	EMC / Isilon Storage Division
2013-06-05 23:53:00 +00:00
Gleb Smirnoff
82e825c4c9 Improve r250890, so that we stop processing of a message with zero
descriptors as early as possible, and assert that the number of descriptors
is positive in unp_freerights().

Reviewed by:	mjg, pjd, jilles
2013-06-04 11:19:08 +00:00
John Baldwin
24150d37d3 - Fix a couple of inverted panic messages for shared/exclusive mismatches
of a lock within a single thread.
- Fix handling of interlocks in WITNESS by properly requiring the interlock
  to be held exactly once if it is specified.
2013-06-03 17:41:11 +00:00
John Baldwin
95d28652af - Handle the recursed/not recursed flags with RA_RLOCKED in rw_assert().
- Tweak a panic message.
2013-06-03 17:38:57 +00:00
Konstantin Belousov
d39116f5d5 Be more generous when donating the current thread time to the owner of
the vnode lock while iterating over the free vnode list.  Instead of
yielding, pause for 1 tick.  The change is reported to help in some
virtualized environments.

Submitted by:	Roger Pau Monné <roger.pau@citrix.com>
Discussed with:	jilles
Tested by:	pho
MFC after:	2 weeks
2013-06-03 17:36:43 +00:00
Konstantin Belousov
1e65d73c74 Do not map the shared page COW. If the process wired its address
space, fork(2) would cause shadowing of the physical object and
copying of the shared page into private copy, effectively preventing
updates for the exported timehands structure and stopping the clock.

Specify the maximum allowed permissions for the page to be read and
execute, preventing write from the user mode.

Reported and tested by:	<huanghwh@yahoo.com>
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-06-03 04:32:53 +00:00
Konstantin Belousov
92fab43f7f When auto-sizing the buffer cache, limit the amount of physical memory
used as the size estimate to 32GB.  This provides around 100K buffer
headers and corresponding KVA for the buffer map at the peak.  Sizing the
cache larger is not useful; it also wastes and exhausts KVA on large
machines.

Reported and tested by:	bdrewery
Sponsored by:	The FreeBSD Foundation
2013-06-03 04:16:48 +00:00
Alan Cox
39a4cd0cec Reduce the scope of the VM object locking in brelse(). In my tests, this
change reduced the total number of VM object lock acquisitions by brelse()
by 74%.

Sponsored by:	EMC / Isilon Storage Division
2013-06-02 16:18:03 +00:00
Marius Strobl
0ad17e4b32 Move an assertion to the right spot; only bus_dmamap_load_mbuf(9)
requires a pkthdr to be present, but that's not the case for either
_bus_dmamap_load_mbuf_sg() or bus_dmamap_load_mbuf_sg(9).

Reported by:	sbruno
MFC after:	1 week
2013-06-01 11:42:47 +00:00
John Baldwin
3d4c503cf0 Style fixes to vn_ioctl().
Suggested by:	bde
2013-05-31 16:15:22 +00:00
Jeff Roberson
22a722605d - Convert the bufobj lock to rwlock.
- Use a shared bufobj lock in getblk() and inmem().
 - Convert softdep's lk to rwlock to match the bufobj lock.
 - Move INFREECNT to b_flags and protect it with the buf lock.
 - Remove unnecessary locking around bremfree() and BKGRDINPROG.

Sponsored by:	EMC / Isilon Storage Division
Discussed with:	mckusick, kib, mdf
2013-05-31 00:43:41 +00:00
Julian Elischer
4591f0d339 Initialising the new fibnum field to a known value turns out to
be a GOOD IDEA (TM).
Apparently MOST users set this (e.g. tcp and friends), but there are a few
users that never set it yet go on to read it, assuming it holds a sensible value.
These include SCTP, pf and the FLOWTABLE option (and maybe others).
2013-05-24 02:18:37 +00:00
Lawrence Stewart
7639c9be45 Ensure alq's shutdown_pre_sync event handler is deregistered on module unload to
avoid a dangling pointer and eventual panic on system shutdown.

Reported by:	Ali <comnetboy at gmail.com>
Tested by:	Ali <comnetboy at gmail.com>
MFC after:	1 week
2013-05-24 00:49:12 +00:00
Pawel Jakub Dawidek
92981fdf9e Use proper malloc type for ioctls white-list.
Reported by:	pho
Tested by:	pho
2013-05-23 21:07:26 +00:00
Luigi Rizzo
4b62214f4a Increase the (arbitrary) limit for the number of packets per tick
from 1k to 20k.  The previous value was good 10 years ago, but not
anymore.

More importantly, lots of good surprises:
polling is incredibly effective under virtualization, and not only
prevents livelock but also saves most of the VM exit overhead in
receive mode.

Using polling, a FreeBSD instance under qemu-kvm remains perfectly
responsive even when bombed with 10 Mpps over an emulated e1000,
and happily processes 1.7 Mpps through ipfw.

Note that some incompatibilities still remain: e.g. polling is not
(yet) compatible with netmap, and seems to freeze the guest when
kern.polling.idle_poll=1

MFC after:	3 days
2013-05-22 16:32:18 +00:00