Commit Graph

13452 Commits

Author SHA1 Message Date
ken
5591de079d Change the way that unmapped I/O capability is advertised.
The previous method was to set the D_UNMAPPED_IO flag in the cdevsw
for the driver.  The problem with this is that in many cases (e.g.
sa(4)) there may be some instances of the driver that can handle
unmapped I/O and some that can't.  The isp(4) driver can handle
unmapped I/O, but the esp(4) driver currently cannot.  The cdevsw
is shared among all driver instances.

So instead of setting a flag on the cdevsw, set a flag on the cdev.
This allows drivers to indicate support for unmapped I/O on a
per-instance basis.

sys/conf.h:	Remove the D_UNMAPPED_IO cdevsw flag and replace it
		with an SI_UNMAPPED cdev flag.

kern_physio.c:	Look at the cdev SI_UNMAPPED flag to determine
		whether or not a particular driver can handle
		unmapped I/O.

geom_dev.c:	Set the SI_UNMAPPED flag for all GEOM cdevs.
		Since GEOM will create a temporary mapping when
		needed, setting SI_UNMAPPED unconditionally will
		work.

		Remove the D_UNMAPPED_IO flag.

nvme_ns.c:	Set the SI_UNMAPPED flag on cdevs created here
		if NVME_UNMAPPED_BIO_SUPPORT is enabled.

vfs_aio.c:	In aio_qphysio(), check the SI_UNMAPPED flag on a
		cdev instead of the D_UNMAPPED_IO flag on the cdevsw.

sys/param.h:	Bump __FreeBSD_version to 1000045 for the switch from
		setting the D_UNMAPPED_IO flag in the cdevsw to setting
		SI_UNMAPPED in the cdev.

Reviewed by:	kib, jimharris
MFC after:	1 week
Sponsored by:	Spectra Logic
2013-08-15 22:52:39 +00:00
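For the SI_UNMAPPED change above, a minimal sketch of how a driver instance might advertise unmapped-I/O support; the foo_* names and the capability check are hypothetical, only SI_UNMAPPED and the cdev si_flags field come from the commit:

    #include <sys/param.h>
    #include <sys/conf.h>

    static int
    foo_attach(struct foo_softc *sc)
    {
            /* One cdev per instance; foo_cdevsw is shared by all of them. */
            sc->foo_dev = make_dev(&foo_cdevsw, sc->foo_unit, UID_ROOT,
                GID_WHEEL, 0600, "foo%d", sc->foo_unit);
            /*
             * Advertise unmapped I/O for this instance only, instead of
             * setting D_UNMAPPED_IO in the cdevsw shared by all instances.
             */
            if (foo_handles_unmapped_io(sc))
                    sc->foo_dev->si_flags |= SI_UNMAPPED;
            return (0);
    }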
cperciva
8544609e64 Change the queue of locks in kern_rangelock.c from holding lock requests in
the order that they arrive, to holding
(a) granted write lock requests, followed by
(b) granted read lock requests, followed by
(c) ungranted requests, in order of arrival.

This changes the stopping condition for iterating through granted locks to
see if a new request can be granted: When considering a read lock request,
we can stop iterating as soon as we see a read lock request, since anything
after that point is either a granted read lock request or a request which
has not yet been granted.  (For write lock requests, we must still compare
against all granted lock requests.)

For workloads with R parallel reads and W parallel writes, this improves
the time spent from O((R+W)^2) to O(W*(R+W)); i.e., heavy parallel-read
workloads become significantly more scalable.

No statistically significant change in buildworld time has been measured,
but synthetic tests of parallel 'dd > /dev/null' and 'openssl enc >/dev/null'
with the input file cached yield dramatic (up to 10x) improvement with high
(up to 128 processes) levels of parallelism.

Reviewed by:	kib
2013-08-15 20:19:17 +00:00
glebius
722a1a5e5d Make sendfile() a method in the struct fileops. Currently only
vnode backed file descriptors have this method implemented.

Reviewed by:	kib
Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2013-08-15 07:54:31 +00:00
markj
6c24c9fb32 Specify SDT probe argument types in the probe definition itself rather than
using SDT_PROBE_ARGTYPE(). This will make it easy to extend the SDT(9) API
to allow probes with dynamically-translated types.

There is no functional change.

MFC after:	2 weeks
2013-08-15 04:08:55 +00:00
markj
cee1e037da Use kld_{load,unload} instead of mod_{load,unload} for the linker file load
and unload event handlers added in r254266.

Reported by:	jhb
X-MFC with:	r254266
2013-08-14 00:42:21 +00:00
jeff
ca13e69c2f - Disable quantum caches on the kmem_arena. This can make fragmentation
worse on small KVA systems.  I had intended to only enable it for
   debugging.

Sponsored by:	EMC / Isilon Storage Division
2013-08-13 22:41:24 +00:00
jeff
d330a11545 - Add a statically allocated memguard arena since it is needed very early
on.
 - Pass the appropriate flags to vmem_xalloc() when allocating space for
   the arena from kmem_arena.

Sponsored by:	EMC / Isilon Storage Division
2013-08-13 22:40:43 +00:00
jhb
7120845712 Some small cleanups to the fixes in r180340:
- Set NOTE_TRACKERR before running filt_proc().  If the knote did not
  have NOTE_FORK set in fflags when registered, then the TRACKERR event
  could miss being posted.
- Don't pass the pid in to filt_proc() for NOTE_FORK events.  The special
  handling for pids is done in knote_fork() directly and no longer in
  filt_proc().

MFC after:	2 weeks
2013-08-13 18:45:58 +00:00
glebius
33afc0388b - Minor style(9) fix.
- Bring a comment up to date.
2013-08-13 13:40:31 +00:00
markj
5a3f78714c FreeBSD's DTrace implementation has a few problems with respect to handling
probes declared in a kernel module when that module is unloaded. In
particular,

* Unloading a module with active SDT probes will cause a panic. [1]
* A module's (FBT/SDT) probes aren't destroyed when the module is unloaded;
  trying to use them after the fact will generally cause a panic.

This change fixes both problems by porting the DTrace module load/unload
handlers from illumos and registering them with the corresponding
EVENTHANDLER(9) handlers. This allows the DTrace framework to destroy all
probes defined in a module when that module is unloaded, and to prevent a
module unload from proceeding if some of its probes are active. The latter
problem has already been fixed for FBT probes by checking lf->nenabled in
kern_kldunload(), but moving the check into the DTrace framework generalizes
it to all kernel providers and also fixes a race in the current
implementation (since a probe may be activated between the check and the
call to linker_file_unload()).

Additionally, the SDT implementation has been reworked to define SDT
providers/probes/argtypes in linker sets rather than using SYSINIT/SYSUNINIT
to create and destroy SDT probes when a module is loaded or unloaded. This
simplifies things quite a bit since it means that pretty much all of the SDT
code can live in sdt.ko, and since it becomes easier to integrate SDT with
the DTrace framework. Furthermore, this allows FreeBSD to be quite flexible
in that SDT providers spanning multiple modules can be created on the fly
when a module is loaded; at the moment it looks like illumos' SDT
implementation requires all SDT probes to be statically defined in a single
kernel table.

PR:		166927, 166926, 166928
Reported by:	davide [1]
Reviewed by:	avg, trociny (earlier version)
MFC after:	1 month
2013-08-13 03:10:39 +00:00
markj
5423ffaa89 Remove some unused fields from struct linker_file. They were added in
r172862 for use by the DTrace SDT framework but don't seem to have ever
been used.

MFC after:	2 weeks
2013-08-13 03:09:00 +00:00
markj
80dd3f5e73 Add event handlers for module load and unload events. The load handlers are
called after the module has been loaded, and the unload handlers are called
before the module is unloaded. Moreover, the module unload handlers may
return an error to prevent the unload from proceeding.

Reviewed by:	avg
MFC after:	2 weeks
2013-08-13 03:07:49 +00:00
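A hedged sketch of how a subsystem might consume these events via EVENTHANDLER(9), using the kld_load/kld_unload names from the follow-up commit above; the handler prototypes and the mysub_* helpers are assumptions, not copied from the tree:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/eventhandler.h>
    #include <sys/linker.h>

    static void
    mysub_kld_load(void *arg, struct linker_file *lf)
    {
            /* Scan the newly loaded file for anything we care about. */
    }

    static void
    mysub_kld_unload(void *arg, struct linker_file *lf, int *error)
    {
            /* Veto the unload by setting *error while our probes are busy. */
            if (mysub_file_busy(lf))
                    *error = EBUSY;
    }

    static void
    mysub_init(void *dummy)
    {
            EVENTHANDLER_REGISTER(kld_load, mysub_kld_load, NULL,
                EVENTHANDLER_PRI_ANY);
            EVENTHANDLER_REGISTER(kld_unload, mysub_kld_unload, NULL,
                EVENTHANDLER_PRI_ANY);
    }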
kib
58a6f3bcbe r254167 moved initialization of the sleepqueues to before witness
is operational.  init_sleepqueues() initializes 256 mutexes, which,
due to witness still being cold, started to overflow the pending_locks
array.

As stated in the reported panic message, increase WITNESS_PENDLIST
from 768 to 1024, which provides space for an additional 256 locks.

Reported by:	many
Tested by:	rakuco, bdrewery
2013-08-10 21:42:14 +00:00
cognet
333a884980 Don't call sleepinit() from proc0_init(), make it a SYSINIT instead.
vmem needs the sleepq locks to be initialized when free'ing kva, so we want it
called as early as possible.
2013-08-09 23:13:52 +00:00
cognet
51c3f72bfa Instead of just trying to do it for arm, make sure vm_kmem_size is properly
aligned in kmeminit(), where it'll work for any arch.

Suggested by:	alc
2013-08-09 22:30:54 +00:00
attilio
e9f37cac74 On all architectures, avoid preallocating the physical memory
for nodes used in vm_radix.
On architectures supporting direct mapping, also avoid preallocating
the KVA for such nodes.

In order to do so, make the operations derived from vm_radix_insert()
able to fail, and handle all the resulting failures.

On the vm_radix side, introduce a new function, vm_radix_replace(),
which can replace an already present leaf node with a new one,
and take into account the possibility, during vm_radix_insert()
allocation, that the operations on the radix trie can recurse.
This means that if operations in vm_radix_insert() recursed,
vm_radix_insert() will start from scratch again.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc (older version)
Reviewed by:	jeff
Tested by:	pho, scottl
2013-08-09 11:28:55 +00:00
attilio
3f74b0e634 Give mutex(9) the ability to recurse on a per-instance basis.
Now the MTX_RECURSE flag can be passed to the mtx_*_flags() calls.
This helps in cases where we want to limit the ability of some locks
to recurse to specific call sites.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	jeff, alc
Tested by:	pho
2013-08-09 11:24:29 +00:00
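A hedged sketch of the per-call-site recursion this enables; the foo_* code is hypothetical, only MTX_RECURSE being accepted by mtx_lock_flags() is taken from the commit:

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>

    static struct mtx foo_mtx;

    static void
    foo_init(void)
    {
            /* The mutex itself is not created with MTX_RECURSE. */
            mtx_init(&foo_mtx, "foo_mtx", NULL, MTX_DEF);
    }

    static void
    foo_reentrant_path(void)
    {
            /* Only this call site is allowed to recurse on foo_mtx. */
            mtx_lock_flags(&foo_mtx, MTX_RECURSE);
            /* ... */
            mtx_unlock(&foo_mtx);
    }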
attilio
16c7563cf4 The soft and hard busy mechanisms rely on the vm object lock to work.
Unify the two concepts into a real, minimal sxlock where a shared
acquisition represents the soft busy and an exclusive acquisition
represents the hard busy.
The old VPO_WANTED mechanism becomes the hard path for this new lock,
and it becomes per-page rather than per-object.
The vm_object lock becomes an interlock for this functionality:
it can be held in either read or write mode.
However, if the vm_object lock is held in read mode while acquiring
or releasing the busy state, the thread owner cannot make any
assumption about the busy state unless it is also busying it.

Also:
- Add a new flag to directly shared-busy pages while vm_page_alloc()
  and vm_page_grab() are being executed.  This will be very helpful
  once these functions happen under a read object lock.
- Move the swapping sleep into its own per-object flag.

The KPI is heavily changed, which is why the version is bumped.
It is very likely that some users of the VM KPI will need to change
their own code.

Sponsored by:	EMC / Isilon storage division
Discussed with:	alc
Reviewed by:	jeff, kib
Tested by:	gavin, bapt (older version)
Tested by:	pho, scottl
2013-08-09 11:11:11 +00:00
trasz
fcb31f05a9 Don't dereference a null pointer should acl_alloc() be passed M_NOWAIT
and the allocation fail.  Nothing in the tree passes M_NOWAIT.

Obtained from:	mjg
MFC after:	1 month
2013-08-09 08:40:31 +00:00
scottl
40b11a1746 Add a message that can help point out why a sysctl tree removal failed.
Obtained from:	Netflix
MFC after:	3 days
2013-08-09 01:04:44 +00:00
rstone
d9719f74bc Allow drivers to return BUS_PROBE_NOWILDCARD from their attach routine to
match devices where the driver class was fixed but the unit number was
wildcarded.  This better matches the documented behaviour in
DEVICE_PROBE(9).

Reviewed by:	imp
2013-08-08 19:30:49 +00:00
jhb
9481e259bb Don't emit a spurious EVFILT_PROC event with no fflags set on process exit
if NOTE_EXIT is not being monitored.  The rationale is that a listener
should only get an event for exit() if they registered interest via
NOTE_EXIT.  This matches the behavior on OS X.
- Don't save the exit status on process exit unless NOTE_EXIT is being
  monitored.
- Add an internal EV_DROP flag that requests kqueue_scan() to free the
  knote without signalling it to userland and use this when a process
  exits but the fflags in the knote is zero.

Reviewed by:	jmg
MFC after:	1 month
2013-08-07 19:56:35 +00:00
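A small userland illustration of the behaviour described above: a listener only receives an exit event if it registered NOTE_EXIT explicitly (the watched pid is arbitrary):

    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>
    #include <err.h>
    #include <stdio.h>

    static void
    watch_exit(pid_t pid)
    {
            struct kevent kev, ev;
            int kq;

            if ((kq = kqueue()) == -1)
                    err(1, "kqueue");
            /* Without NOTE_EXIT in fflags, no event is posted at exit. */
            EV_SET(&kev, pid, EVFILT_PROC, EV_ADD, NOTE_EXIT, 0, NULL);
            if (kevent(kq, &kev, 1, NULL, 0, NULL) == -1)
                    err(1, "kevent register");
            if (kevent(kq, NULL, 0, &ev, 1, NULL) == 1 && (ev.fflags & NOTE_EXIT))
                    printf("pid %d exited with status %d\n", (int)pid,
                        (int)ev.data);
    }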
kevlo
52419f21e1 Remove unsigned comparison < 0
Found by:	LLVM
Reviewed by:	luigi
2013-08-07 07:22:56 +00:00
jeff
de4ecca213 Replace kernel virtual address space allocation with vmem. This provides
transparent layering and better fragmentation.

 - Normalize functions that allocate memory to use kmem_*
 - Those that allocate address space are named kva_*
 - Those that operate on maps are named kmap_*
 - Implement recursive allocation handling for kmem_arena in vmem.

Reviewed by:	alc
Tested by:	pho
Sponsored by:	EMC / Isilon Storage Division
2013-08-07 06:21:20 +00:00
kib
103825c951 Do not override the ENOENT error for the empty path, or EFAULT errors
from copyins, with the relative lookup check.

Discussed with:	rwatson
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-08-05 19:42:03 +00:00
attilio
899ab64514 Revert r253939:
We cannot busy a page before doing pagefaults.
In fact, it can deadlock against the vnode lock, as it tries to vget().
Other functions currently have the opposite lock ordering, like
vm_object_sync(), which acquires the vnode lock first and then
sleeps on the busy mechanism.

Before this patch is reinserted we need to break this ordering.

Sponsored by:	EMC / Isilon storage division
Reported by:	kib
2013-08-05 08:55:35 +00:00
attilio
19b2ea9f81 The page hold mechanism is fast but it has a couple of drawbacks:
- It does not let pages respect the LRU policy
- It bloats the active/inactive queues with a few pages

Try to avoid it as much as possible, with the long-term goal of
removing it completely.
Use the soft-busy mechanism to protect page content accesses during
short-term operations (like uiomove_fromphys()).

After this change only vm_fault_quick_hold_pages() is still using the
hold mechanism for page content access.
There is additional complexity there, as the quick path cannot
immediately access the page object to busy the page, while the slow
path cannot busy more than one page at a time (to avoid deadlocks).

Fixing that primitive can lead to the complete removal of the page
hold mechanism.

Sponsored by:	EMC / Isilon storage division
Discussed with:	alc
Reviewed by:	jeff
Tested by:	pho
2013-08-04 21:07:24 +00:00
attilio
e825889721 Remove the unnecessary soft busy of the page before doing vn_rdwr() in
kern_sendfile().
The page is already wired, so it will not be subject to a pagefault.
The content cannot be effectively protected as it is full of races
already.
Multiple accesses to the same indexes are serialized through vn_rdwr().

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc, jeff
Tested by:	pho
2013-08-04 15:56:19 +00:00
marcel
65c945d583 Add a tunable for the default timeout. 2013-08-03 04:25:25 +00:00
glebius
7eacd3a0af Remove extra zeroing after M_ZERO allocation. 2013-08-02 13:06:49 +00:00
kib
04cde0067f Remove unused malloc type.
Requested by:	alc
MFC after:	1 week
2013-08-01 12:55:41 +00:00
ian
24c5871f5a Changes to allow using BOOTP_NFSROOT and mounting an nfs root filesystem
other than the one specified by the BOOTP server.  This configures NFS
using the BOOTP protocol while also respecting other root-path options such
as setting vfs.root.mountfrom in the environment or using the RB_DFLTROOT
boot option.  It allows you to override the root path provided by the
server, or to supply a root path when the server provides IP configuration
but no root path info.

This maintains the historical BOOTP_NFSROOT behavior of panicking on a
failure to mount the root path provided by the server, unless you've
provided an alternative via the ROOTDEVNAME kernel option or by setting
vfs.root.mountfrom.  The behavior of panicking when given no other options
is preserved because it amounts to a bit of a retry loop that could
eventually recover from a transient network or server problem.

The user can now override the root path from loader(8) even if the
kernel is compiled with BOOTP_NFSROOT.  If vfs.root.mountfrom is set in
the environment it is used unconditionally -- it always overrides the
BOOTP info.  If it begins with [old]nfs: then the BOOTP code uses it
instead of the server-provided info.  If it specifies some other
filesystem then the bootp code will not panic like it used to and the code
in vfs_mountroot.c will invoke the right filesystem to do the mount.

If the kernel is compiled with the ROOTDEVNAME option, then that name is
used by the BOOTP code if either
      * The server doesn't provide a pathname.
      * The boothowto flags include RB_DFLTROOT.
The latter allows the user to compile in alternate path in ROOTDEVNAME
such as ufs:/dev/da0s1a and boot from that path by setting
boot_dftlroot=1 in loader(8) or using the '-r' option in boot(8).

The one thing not provided here is automatic failover from a
server-provided path to a compiled-in one without the user manually
requesting that.  The code just isn't currently structured in a way that
makes that possible without a lot of rewriting.  I think the ability to set
vfs.root.mountfrom and to use ROOTDEVNAME automatically when the server
doesn't provide a name covers the most common needs.

A set of patches submitted by Lars Eggert provided the part I couldn't
figure out by myself when I tried to do this last year; many thanks.

Reviewed by:	rodrigc
2013-07-31 19:14:00 +00:00
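As a hedged illustration of the override described above, a loader.conf(5) fragment (device and server names are placeholders):

    # Ignore the BOOTP-provided root path and mount a local filesystem.
    vfs.root.mountfrom="ufs:/dev/da0s1a"

    # Or stay on NFS but use a different server/path than BOOTP supplied.
    #vfs.root.mountfrom="nfs:192.168.1.10:/exports/altroot"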
scottl
8ceb091210 Another fix for r253823; retain the default of 1 readahead block for sendfile.
Submitted by:	glebius
Obtained from:	Netflix
MFC after:	3 days
2013-07-31 15:55:01 +00:00
scottl
b9c4e3dc58 Fix r253823. Some WIP patches snuck in.
Submitted by:	zont
2013-07-30 23:50:09 +00:00
scottl
0eaffce7b3 Create a knob, kern.ipc.sfreadahead, that allows one to tune the amount of
readahead that sendfile() will do.  Default remains the same.

Obtained from:	Netflix
MFC after:	3 days
2013-07-30 23:26:05 +00:00
kib
6660649d5c When creation of the v_pollinfo raced and our instance of vpollinfo
must be destroyed, knlist_clear() and seldrain() calls could be
avoided, since vpollinfo was not used.  Moreover, the knlist_clear()
calling protocol requires the knlist locked, which is not true at the
call site.

Split the destruction into the helper destroy_vpollinfo_free(), and
call it when raced, instead of destroy_vpollinfo().

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:   3 days
2013-07-28 06:59:29 +00:00
jhb
1e215bb36e Use VMFS_OPTIMAL_SPACE instead of VMFS_ALIGNED_SPACE in shm_map(). 2013-07-24 20:34:25 +00:00
marcel
13d997511f Further restrict the MAC addresses that we use for UUID generation
to those that are universally administered. While it is possible to
add locally administered MAC addresses, it's unclear whether those
are (or are expected to be) any more unique than random multicast MAC
addresses.

With many U-Boot configurations assigning fixed and non-official MAC
addresses to ethernet ports and without setting the 'X' flag, this
change may have very little value in the embedded (development)
space. Uniqueness of the universally administered addresses is non-
existent on the (H/W) bench and questionable under the (S/W) desk.
In short: this change is aimed at production environments...
2013-07-24 18:13:43 +00:00
marcel
d7c9064369 In uuid_ether_add(), avoid false positives due to the limited type
used to hold the sum of the bytes of the MAC address. While here,
rename the variable that holds the sum from 'c' to 'sum'.

Pointed out by: thompsa@
2013-07-24 16:22:27 +00:00
avg
9e6374b6a9 rename scheduler->swapper and SI_SUB_RUN_SCHEDULER->SI_SUB_LAST
Also directly call swapper() at the end of mi_startup instead of
relying on swapper being the last thing in sysinits order.

Rationale:

- "RUN_SCHEDULER" was misleading, scheduling already takes place at that stage
- "scheduler" was misleading, the function swaps in the swapped out processes
- another SYSINIT(SI_SUB_RUN_SCHEDULER, SI_ORDER_ANY) could never be
  invoked depending on its relative order with scheduler; this was not obvious
  and the bug actually used to exist

Reviewed by:	kib (earlier version)
MFC after:	14 days
2013-07-24 09:45:31 +00:00
glebius
a0ef2f4f21 Remove unused argument from vmem_add1().
Reviewed by:	jeff
2013-07-24 08:02:56 +00:00
marcel
4ca16da195 Decouple the UUID generator from network interfaces by having MAC
addresses added to the UUID generator using uuid_ether_add(). The
UUID generator keeps an arbitrary number of MAC addresses, under
the assumption that they are rarely removed (= uuid_ether_del()).
This achieves the following:
1.  It brings us closer to having the network stack as a loadable
    module.
2.  It allows the UUID generator to filter MAC addresses for best
    results (= highest chance of uniqueness).
3.  MAC addresses can come from anywhere, irrespective of whether
    they are used for an interface or not.

A side effect of the change is that when no MAC addresses have been
added, a random multicast MAC address is created once and re-used if
needed. Previously, when a random MAC address was needed, it was
created for every call. Thus, a change in behaviour is introduced
for when no MAC addresses exist.

Obtained from:	Juniper Networks, Inc.
2013-07-24 04:24:21 +00:00
glebius
8a9169a4ba Revert r249590 and, in case mp_ncpus isn't initialized, use MAXCPU.  This
allows us to init the counter zone at an early stage of boot.

Reviewed by:	kib
Tested by:	Lytochkin Boris <lytboris gmail.com>
2013-07-23 11:16:40 +00:00
mjg
d467dfaf08 Remove cr_prison NULL check from proc_to_reap.
Userspace processes always have a prison.

MFC after:	2 weeks
2013-07-22 02:07:15 +00:00
mjg
de1e379f48 Remove duplicate assertion from tdsendsignal.
MFC after:	2 weeks
2013-07-22 00:44:37 +00:00
kib
a7dacef5ab Implement compat32 wrappers for the ktimer_* syscalls.
Reported, reviewed and tested by:	Petr Salinger <Petr.Salinger@seznam.cz>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-07-21 19:43:52 +00:00
kib
e9d8b81db7 Wrap kmq_notify(2) for compat32 to properly consume struct sigevent32
argument.

Reviewed and tested by:	Petr Salinger <Petr.Salinger@seznam.cz>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-07-21 19:40:30 +00:00
kib
97d40396c6 Move the convert_sigevent32() utility function into freebsd32_misc.c
for consumption outside the vfs_aio.c.

For SIGEV_THREAD_ID and SIGEV_SIGNAL notification delivery methods,
also copy in the sigev_value, since librt event pumping loop compares
note generation number with the value passed through sigev_value.

Tested by:	Petr Salinger <Petr.Salinger@seznam.cz>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-07-21 19:33:48 +00:00
kib
82f12b6237 id_t is 64bit, provide the compat32 wrapper for clock_getcpuclockid2(2).
Reported and tested by:	Petr Salinger <Petr.Salinger@seznam.cz>
PR:	threads/180652
Sponsored by:	The FreeBSD Foundation
2013-07-20 13:39:41 +00:00
jhb
d67e7a1cc9 Be more aggressive in using superpages in all mappings of objects:
- Add a new address space allocation method (VMFS_OPTIMAL_SPACE) for
  vm_map_find() that will try to alter the alignment of a mapping to match
  any existing superpage mappings of the object being mapped.  If no
  suitable address range is found with the necessary alignment,
  vm_map_find() will fall back to using the simple first-fit strategy
  (VMFS_ANY_SPACE).
- Change mmap() without MAP_FIXED, shmat(), and the GEM mapping ioctl to
  use VMFS_OPTIMAL_SPACE instead of VMFS_ANY_SPACE.

Reviewed by:	alc (earlier version)
MFC after:	2 weeks
2013-07-19 19:06:15 +00:00
kib
fcfcea28a9 Clear the vnode knotes before destroying vpollinfo.
Reported and tested by:	Patrick Lamaiziere <patfbsd@davenulle.org>
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-07-17 10:56:21 +00:00
glebius
8f9d71bb1b Nuke mbstat.  It hasn't been used for mbuf statistics since FreeBSD 5.
Now that r253351 moved sendfile() stats to a separate struct, the
last field used in mbstat is m_mcfail, which is updated, but never
read or obtained from userland.
2013-07-15 12:18:36 +00:00
ae
6f8e41d6cb Introduce a new structure, sfstat, for collecting sendfile's statistics,
and remove the corresponding fields from struct mbstat.  Use PCPU counters
and the SFSTAT_INC() macro to update these statistics.

Discussed with:	glebius
2013-07-15 06:16:57 +00:00
rodrigc
7e3e1747c8 PR: 168520 170096
Submitted by: adrian, zec

Fix multiple kernel panics when VIMAGE is enabled in the kernel.
These fixes are based on patches submitted by Adrian Chadd and Marko Zec.

(1)  Set curthread->td_vnet to vnet0 in device_probe_and_attach() just before calling
     device_attach().  This fixes multiple VIMAGE related kernel panics
     when trying to attach Bluetooth or USB Ethernet devices because
     curthread->td_vnet is NULL.

(2)  Set curthread->td_vnet in if_detach().  This fixes kernel panics when detaching networking
     interfaces, especially USB Ethernet devices.

(3)  Use VNET_DOMAIN_SET() in ng_btsocket.c

(4)  In ng_unref_node() set curthread->td_vnet.  This fixes kernel panics
     when detaching Netgraph nodes.
2013-07-15 01:32:55 +00:00
kib
cbf04bd9c1 Assert that runningbufspace does not underflow.
Sponsored by:	The FreeBSD Foundation
2013-07-13 19:36:18 +00:00
kib
0c582b5a5c There is no need to count waiters for the runningbufspace.
Sponsored by:	The FreeBSD Foundation
2013-07-13 19:34:34 +00:00
kib
66a95162d6 Allow calling clock_gettime() on the clock id of a zombie process.
Reported by:	Petr Salinger <Petr.Salinger@seznam.cz>
PR:	threads/180496
Sponsored by:	The FreeBSD Foundation
2013-07-13 19:32:50 +00:00
andre
19a467c450 Make use of the fact that uma_zone_set_max(9) already returns the
rounded limit, making a call to uma_zone_get_max(9) unnecessary.

MFC after:	1 day
2013-07-11 12:53:13 +00:00
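A tiny sketch of the simplification; the foo_* wrapper is hypothetical:

    #include <sys/param.h>
    #include <vm/uma.h>

    static int
    foo_set_zone_limit(uma_zone_t zone, int wanted)
    {
            /*
             * uma_zone_set_max(9) returns the limit after rounding, so
             * the old follow-up uma_zone_get_max(9) call is unnecessary.
             */
            return (uma_zone_set_max(zone, wanted));
    }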
andre
a54d54c890 Fix style issues and a typo in "kern.ipc.nmbufs", and correctly place
and expose the value of the tunable maxmbufmem as "kern.ipc.maxmbufmem"
through sysctl.

Reported by:	smh
MFC after:	1 day
2013-07-11 12:46:35 +00:00
kib
6d30588666 Do not invalidate pages of a B_NOCACHE buffer, or of a buffer after an
I/O error, if any user-wired mappings exist.  Doing the invalidation
destroys the user wiring.

The change is a temporary measure to close the bug; the more proper
fix is to always delegate the invalidation of the page to the upper
layers.

Reported and tested by:	pho
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-07-11 05:36:26 +00:00
marcel
c660176671 Add vfs_mounted and vfs_unmounted events so that components can be informed
about mount and unmount events. This is used by Juniper to implement a more
optimal implementation of NetBSD's veriexec.

This change differs from r253224 in the following way:
o   The vfs_mounted handler is called before mountcheckdirs() and with
    newdp locked. vp is unlocked.
o   The event handlers are declared in <sys/eventhandler.h> and not in
    <sys/mount.h>. The <sys/mount.h> header is used in user land code
    that pretends to be kernel code and as such creates a very convoluted
    environment. It's hard to untangle.

Submitted by:	stevek@juniper.net
Discussed with:	pjd@
Obtained from:	Juniper Networks, Inc.
2013-07-10 15:35:25 +00:00
kib
a7b76b76e1 There are several code sequences like
	vfs_busy(mp);
	vfs_write_suspend(mp);
which are problematic if another thread starts an unmount between the
two calls.  The unmount starts a write, while vfs_write_suspend()
drains writers.  On the other hand, the unmount drains busy references,
causing the deadlock.

Add a flag argument to vfs_write_suspend() and require its callers to
specify the VS_SKIP_UNMOUNT flag when the call is performed outside the
mount path, i.e. when the covered vnode is not locked.  The suspension
is not attempted if VS_SKIP_UNMOUNT is specified and an unmount is in
progress.

Reported and tested by:	Andreas Longwitz <longwitz@incore.de>
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
2013-07-09 20:49:32 +00:00
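A hedged sketch of the new calling convention for code outside the mount path; the wrapper is hypothetical and error/resume handling is elided:

    #include <sys/param.h>
    #include <sys/mount.h>

    static int
    foo_suspend_writes(struct mount *mp)
    {
            int error;

            error = vfs_busy(mp, 0);
            if (error != 0)
                    return (error);
            /*
             * The covered vnode is not locked here, so pass
             * VS_SKIP_UNMOUNT: if an unmount is already draining busy
             * references the suspension is simply not attempted,
             * avoiding the deadlock described above.
             */
            error = vfs_write_suspend(mp, VS_SKIP_UNMOUNT);
            vfs_unbusy(mp);
            return (error);
    }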
avg
44c288e149 should_yield: protect from td_swvoltick being uninitialized or too stale
The distance between ticks and td_swvoltick should be calculated as
an unsigned number.  Previously we could end up comparing a negative
number with hogticks in which case should_yield() would give incorrect
answer.

We should probably ensure that td_swvoltick is properly initialized.

Sponsored by:	HybridCluster
MFC after:	5 days
2013-07-09 09:01:44 +00:00
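A sketch of the unsigned comparison described above (not the committed code; the ticks/hogticks globals and the td_swvoltick field are referenced as in the commit message):

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/kernel.h>
    #include <sys/proc.h>

    static int
    should_yield_sketch(struct thread *td)
    {
            /* The unsigned distance cannot go negative for stale values. */
            return ((u_int)ticks - (u_int)td->td_swvoltick >= (u_int)hogticks);
    }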
avg
a9e12e57cb namecache sdt: freebsd doesn't support structured characters yet
:-)

MFC after:	 7 days
2013-07-09 08:58:34 +00:00
jhb
581b7760a5 Fix build with INVARIANT_SUPPORT enabled but not INVARIANTS.
Reported by:	"Matthew D. Fuller" <fullermd@over-yonder.net>
2013-07-08 21:17:20 +00:00
alfred
89aac2b3a1 Make kassert_printf use __printflike.
Fix associated errors/warnings while I'm here.

Requested by: avg
2013-07-07 21:39:37 +00:00
jamie
7dc297170a Make the comments a little more clear about PRIV_KMEM_*, explicitly
referring to /dev/[k]mem and noting it's about opening the files rather
than actually reading and writing.

Reviewed by:	jmallett
2013-07-06 00:10:52 +00:00
jamie
33714247f6 Add new privileges, PRIV_KMEM_READ and PRIV_KMEM_WRITE, used in opening
/dev/kmem and /dev/mem (in addition to traditional file permission checks).
PRIV_KMEM_READ is different from other PRIV_* checks in that it's allowed
by default.

Reviewed by:	kib, mckusick
2013-07-05 21:31:16 +00:00
alfred
9bc7dd4897 The change in r236456 (atomic_store_rel not locked) exposed a bug
in the ithread code where we could lose ithread interrupts if
intr_event_schedule_thread() is called while the ithread is already
running.  Effectively, memory writes could be ordered incorrectly
such that the non-atomic write of 0 to ithd->it_need (in ithread_loop)
could be delayed until after it was set to be triggered again by
intr_event_schedule_thread().

This was a particularly big problem for CAM because CAM optimizes
scheduling of ithreads such that it only signals camisr() when it
queues to an empty queue.  This means that additional completion
events would not unstick CAM and quickly lead to a complete lockup
of the CAM stack.

To fix this use atomics in all places where ithd->it_need is accessed.

Submitted by: delphij, mav
Obtained from: TrueOS, iXsystems
MFC After: 1 week
2013-07-04 05:53:05 +00:00
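A hedged sketch of the it_need handshake with acquire/release atomics; the struct is a stand-in, only the field name and the two functions mentioned above come from the commit:

    #include <sys/param.h>
    #include <machine/atomic.h>

    struct ithd_sketch {
            volatile u_int it_need;         /* stand-in for the real field */
    };

    /* Producer, as in intr_event_schedule_thread(): mark work pending. */
    static void
    sketch_schedule(struct ithd_sketch *ithd)
    {
            atomic_store_rel_int(&ithd->it_need, 1);
            /* ... wake up the ithread ... */
    }

    /* Consumer, as in ithread_loop(): consume the flag before running. */
    static void
    sketch_loop(struct ithd_sketch *ithd)
    {
            while (atomic_cmpset_acq_int(&ithd->it_need, 1, 0) != 0) {
                    /* run this thread's interrupt handlers */
            }
    }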
mjg
80f69955c8 Fix receiving fd over unix socket broken in r247740.
If n fds were passed, it would receive the first one n times.

Reported by:	Shawn Webb <lattera@gmail.com>, koobs, gleb
Tested by:	koobs, gleb
Reviewed by:	pjd
2013-07-02 07:36:04 +00:00
trociny
68a112b157 Plug up the lock leakage when exporting to a short buffer.
Reported by:	Alexander Leidinger
Submitted by:	mjg
MFC after:	1 week
2013-07-01 03:27:14 +00:00
kib
8a0279994d Fix issues with zeroing and fetching the counters, on x86 and ppc64.
Issues were noted by Bruce Evans and are present on all architectures.

On i386, a counter fetch should use an atomic read of the 64bit value,
otherwise a carry from an increment on another CPU could be lost for the
given fetch, producing an error of 2^32.  If a 64bit read (cmpxchg8b) is
not available on the machine, it cannot be SMP and it is enough to
disable preemption around the read to avoid a split read.

On x86 the counter increment is not atomic on purpose, which makes it
possible for the store of the incremented result to overwrite a
just-zeroed per-cpu slot.  The effect would be a counter going off by an
arbitrary value after zeroing.  Perform the counter zeroing on the same
processor which does the increments, making the operations mutually
exclusive.  On i386, as for fetching, if cmpxchg8b is not available the
machine is not SMP and we disable preemption for zeroing.

PowerPC64 is treated the same as amd64.

For other architectures, the changes were made to allow the compilation
to succeed, without fixing the issues with zeroing or fetching.  It
should be possible to handle them by using 64bit loads and stores that
are atomic WRT preemption (assuming the architectures are also converted
from using critical sections to proper asm).  If an architecture does
not provide the facility, using a global (spin) mutex would be a
non-optimal but working solution.

Noted by:  bde
Sponsored by:	The FreeBSD Foundation
2013-07-01 02:48:27 +00:00
mjg
436f750292 acct: create a special plimit object and set it for exiting processes
instead of allocating a new one each time

All limits are set to RLIM_INFINITY, which should be ok (even though we
care only about RLIMIT_FSIZE in this case).

MFC after:	1 week
2013-06-30 19:08:06 +00:00
mjg
523cdd9810 acct: reduce code duplication by using acct_disable as cleanup for
failed kproc_create

MFC after:	1 week
2013-06-30 13:17:37 +00:00
peter
51861f0c71 Revert accidental commit.
Pointy hat to:	 peter
2013-06-29 05:05:57 +00:00
peter
b08615c213 Help out gcc. clang understands.
sys_generic.c:1510: warning: 'precision' may be used uninitialized
*** [sys_generic.o] Error code 1
2013-06-29 04:35:04 +00:00
davide
3ca83727a0 Correct the comment above _sleep() function which still mentions 'timo'
instead of 'sbintime_t'.

Reported by:	kan
2013-06-28 21:04:15 +00:00
davide
0dd1d9c578 - Trim an unused and bogus Makefile for mount_smbfs.
- Reconnect with some minor modifications, in particular now selsocket()
internals are adapted to use sbintime units after recent'ish calloutng
switch.
2013-06-28 21:00:08 +00:00
mjg
19b2af5262 Remove duplicate NULL check in kern_proc_filedesc_out.
No functional changes.

MFC after:	1 week
2013-06-28 18:32:46 +00:00
trociny
9888acd231 Rework r252313:
The filedesc lock may not be dropped unconditionally before exporting
fd to sbuf: fd might go away during execution.  While it is ok for
DTYPE_VNODE and DTYPE_FIFO because the export is from a vrefed vnode
here, for other types it is unsafe.

Instead, drop the lock in export_fd_to_sb(), after preparing data in
memory and before writing to sbuf.

Spotted by:	mjg
Suggested by:	kib
Review by:	kib
MFC after:	1 week
2013-06-28 18:07:41 +00:00
rstone
4b2a7a843c Correct a bug that prevented deadlkres from (almost) ever firing.
deadlkres was using a reversed test to check whether ticks had rolled over.
This meant that deadlkres could only fire after ticks had rolled over.
This test was actually unnecessary as deadlkres only ever took the
difference of ticks values which is safe even in the presence of ticks
rollover.  Remove the tests entirely.  Now deadlkres will properly fire
once a lock has been held past the timeout period.

MFC after:	1 month
2013-06-28 15:55:30 +00:00
jeff
e725dd5c1e - Add a general purpose resource allocator, vmem, from NetBSD. It was
originally inspired by the Solaris vmem detailed in the proceedings
   of usenix 2001.  The NetBSD version was heavily refactored for bugs
   and simplicity.
 - Use this resource allocator to allocate the buffer and transient maps.
   Buffer cache defrags are reduced by 25% when used by filesystems with
   mixed block sizes.  Ultimately this may permit dynamic buffer cache
   sizing on low KVA machines.

Discussed with:	alc, kib, attilio
Tested by:	pho
Sponsored by:	EMC / Isilon Storage Division
2013-06-28 03:51:20 +00:00
jhb
c1207dc20c Make detaching drivers from PCI devices more robust. While here, fix a
bug where a PCI device would be powered down if it failed to probe, but
not when its driver was detached (e.g. via kldunload).
- Add a new helper method resource_list_release_active() which forcefully
  releases any active resources of a specified type from a resource list.
- Add a bus_child_detached method for the PCI bus driver which forces any
  active resources to be released (and whines to the console if it finds
  any) and then powers the device down.
- Call pci_child_detached() if we fail to probe a device when a driver
  is kldloaded.  This isn't perfect but can avoid leaking resources
  from a probe() routine in the kldload case.

Reviewed by:	imp, brooks
MFC after:	1 month
2013-06-27 20:21:54 +00:00
trociny
1e5776a323 To avoid LOR, always drop the filedesc lock before exporting fd to sbuf.
Reviewed by:	kib
MFC after:	3 days
2013-06-27 19:14:03 +00:00
jhb
19d363377e A few mostly cosmetic nits to aid in debugging:
- Call lock_init() first before setting any lock_object fields in
  lock init routines.  This way if the machine panics due to a duplicate
  init the lock's original state is preserved.
- Somewhat similarly, don't decrement td_locks and td_slocks until after
  an unlock operation has completed successfully.
2013-06-25 20:23:08 +00:00
jhb
0807c44cdd Several improvements to rmlock(9). Many of these are based on patches
provided by Isilon.
- Add an rm_assert() supporting various lock assertions similar to other
  locking primitives.  Because rmlocks track readers the assertions are
  always fully accurate unlike rw_assert() and sx_assert().
- Flesh out the lock class methods for rmlocks to support sleeping via
  condvars and rm_sleep() (but only while holding write locks), rmlock
  details in 'show lock' in DDB, and the lc_owner method used by
  dtrace.
- Add an internal destroyed cookie so that API functions can assert
  that an rmlock is not destroyed.
- Make use of rm_assert() to add various assertions to the API (e.g.
  to assert locks are held when an unlock routine is called).
- Give RM_SLEEPABLE locks their own lock class and always use the
  rmlock's own lock_object with WITNESS.
- Use THREAD_NO_SLEEPING() / THREAD_SLEEPING_OK() to disallow sleeping
  while holding a read lock on an rmlock.

Submitted by:	andre
Obtained from:	EMC/Isilon
2013-06-25 18:44:15 +00:00
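A hedged sketch of the new assertion API; the RA_WLOCKED constant mirrors the other lock classes and is an assumption here, as is the foo_* code:

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/rmlock.h>

    static struct rmlock foo_rm;

    static void
    foo_setup(void)
    {
            rm_init(&foo_rm, "foo_rm");
    }

    static void
    foo_modify(void)
    {
            rm_wlock(&foo_rm);
            /* Exact, since rmlocks track their readers. */
            rm_assert(&foo_rm, RA_WLOCKED);
            /* ... */
            rm_wunlock(&foo_rm);
    }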
lstewart
a203bbbf37 When a previous call to sbsndptr() leaves sb->sb_sndptroff at the start of an
mbuf that was fully consumed by the previous call, the mbuf ptr returned by the
current call ends up being the previous mbuf in the sb chain to the one that
contains the data we want.

This does not cause any observable issues because the mbuf copy routines happily
walk the mbuf chain to get to the data at the moff offset, which in this case
means they effectively skip over the mbuf returned by sbsndptr().

We can't adjust sb->sb_sndptr during the previous call for this case because the
next mbuf in the chain may not exist yet. We therefore need to detect the
condition and make the adjustment during the current call.

Fix by detecting the special case of moff being at the start of the next mbuf in
the chain and adjust the required accounting variables accordingly.

Reviewed by:	andre
MFC after:	2 weeks
2013-06-19 03:08:01 +00:00
lstewart
4f358ad954 The fix committed in r250951 replaced the reported panic with a deadlock... gold
star for me. EVENTHANDLER_DEREGISTER() attempts to acquire the lock which is
held by the event handler framework while executing event handler functions,
leading to deadlock.

Move EVENTHANDLER_DEREGISTER() to alq_load_handler() and thus deregister the ALQ
shutdown_pre_sync handler at module unload time, which takes care of the
originally reported panic and fixes the deadlock introduced in r250951.

Reported by:	Luiz Otavio O Souza
MFC after:	3 days
X-MFC with:	250951
2013-06-17 09:49:07 +00:00
ed
99f22b551f Change callout use counter to use C11 atomics.
In order to get some coverage of C11 atomics in kernelspace, switch at
least one piece of code in kernelspace to use C11 atomics instead of
<machine/atomic.h>.

While there, slightly improve the code by adding an assertion to prevent
the use count from going negative.
2013-06-16 09:30:35 +00:00
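An illustrative sketch (not the callout code itself) of kernel C11 atomics with the negative-use-count assertion mentioned above:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <stdatomic.h>

    static atomic_uint foo_users = ATOMIC_VAR_INIT(0);

    static void
    foo_acquire(void)
    {
            atomic_fetch_add_explicit(&foo_users, 1, memory_order_relaxed);
    }

    static void
    foo_release(void)
    {
            unsigned int old;

            old = atomic_fetch_sub_explicit(&foo_users, 1,
                memory_order_release);
            KASSERT(old > 0, ("foo_users went negative"));
    }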
lstewart
035b65bb7d Move hhook's per-vnet initialisation to an earlier SYSINIT SI_SUB stage to
ensure all per-vnet related hhook initialisation is completed prior to any
virtualised hhook points attempting registration.

vnet_register_sysinit() requires that a stage later than SI_SUB_VNET be chosen.
There are no per-vnet initialisers in the source tree at this time which run
earlier than SI_SUB_INIT_IF. A quick audit of non-virtualised SYSINITs indicates
there are no subsystems pre SI_SUB_MBUF that would likely be interested in
registering a virtualised hhook point.

Settle on SI_SUB_MBUF as hhook's per-vnet initialisation stage as it's the first
overtly network-related initialisation stage to run after SI_SUB_VNET. If a
subsystem that initialises earlier than SI_SUB_MBUF ends up wanting to register
virtualised hhook points in future, hhook's use of SI_SUB_MBUF will need to be
revisited and would probably warrant creating a dedicated SI_SUB_HHOOK which
runs immediately after SI_SUB_VNET.

MFC after:	1 week
2013-06-15 10:08:34 +00:00
lstewart
04d0c6d8fa Cleanup and simplification in khelp_{register|deregister}_helper(). No
functional changes.

MFC after:	1 week
2013-06-15 06:45:17 +00:00
lstewart
d367548193 Add a private KPI between hhook and khelp that allows khelp modules to insert
hook functions into hhook points which register after the modules were loaded -
potentially useful during boot or if hhook points are dynamically registered.

MFC after:	1 week
2013-06-15 05:57:29 +00:00
lstewart
1cc3945aab Internalise handling of virtualised hook points inside
hhook_{add|remove}_hook_lookup() so that khelp (and other potential API
consumers) do not have to care when they attempt to (un)hook a particular hook
point identified by id and type.

Reviewed by:	scottl
MFC after:	1 week
2013-06-15 04:03:40 +00:00
lstewart
cbb41e7015 Fix a major oversight in r251732 which causes non-VIMAGE kernels to trigger a
KASSERT during TCP hhook registration at boot. Virtualised hook points only
require extra housekeeping and sanity checking when "options VIMAGE" is present.

Reported by:	bdrewery,jh,dhw
Tested by:	dhw
MFC after:	1 week
X-MFC with:	251732
2013-06-14 18:11:21 +00:00
lstewart
44b2ad15cd Add support for non-virtualised hhook points, which are uniquely identified by
type and id, as compared to virtualised hook points which are now uniquely
identified by type, id and a vid (which for vimage is the pointer to the vnet
that the hhook resides in).

All hhook_head structs for both virtualised and non-virtualised hook points
coexist in hhook_head_list, and a separate list is maintained for hhook points
within each vnet to simplify some vimage-related housekeeping.

Reviewed by:	scottl
MFC after:	1 week
2013-06-14 04:10:34 +00:00
lstewart
1144352c64 Fix a potential NULL-pointer dereference that would trigger if the hhook
registration site did not provide storage for a copy of the hhook_head struct.

MFC after:	3 days
2013-06-14 02:25:40 +00:00
jeff
7ee88fb112 - Add a BIT_FFS() macro and use it to replace cpusetffs_obj()
Discussed with:	attilio
Sponsored by:	EMC / Isilon Storage Division
2013-06-13 20:46:03 +00:00
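A hedged sketch of the usage pattern; like ffs(3), BIT_FFS() is assumed to return a 1-based index with 0 meaning the set is empty:

    #include <sys/param.h>
    #include <sys/_bitset.h>
    #include <sys/bitset.h>
    #include <sys/cpuset.h>

    static int
    first_cpu_in_set(cpuset_t *set)
    {
            int bit;

            bit = BIT_FFS(CPU_SETSIZE, set);        /* 1-based, 0 if empty */
            return (bit == 0 ? -1 : bit - 1);       /* convert to a CPU id */
    }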
kib
26dcc60640 Fix two issues with the spin loops in the umtx(2) implementation.
- When looping, check for the pending suspension.  Otherwise, other
  usermode thread which races with the looping one, could try to
  prevent the process from stopping or exiting.

- Add missed checks for the faults from casuword*().  The code is
  structured in a way which makes the loops exit if the specified
  address is invalid, since both fuword() and casuword() return -1 on
  the fault.  But if the address is mapped readonly, the typical value
  read by fuword() is different from -1, while casuword() returns -1.
  Absent the checks for casuword() faults, this is interpreted as the
  race with other thread and causes non-interruptible spinning in the
  kernel.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-06-13 09:33:22 +00:00
marcel
ded5e8df05 Revert r251590. It unexpectedly broke the build and there were some
questions on locking. As part of commit-bit grooming, I'd like Steve
to handle this, but can't leave things broken in the mean time.
2013-06-10 15:22:27 +00:00
marcel
ac0545a5fe Add vfs_mounted and vfs_unmounted events so that components can be informed
about mount and unmount events. This is used by Juniper to implement a more
optimal implementation of NetBSD's veriexec.

Submitted by:	stevek@juniper.net
Obtained from:	Juniper Networks, Inc
2013-06-09 23:51:26 +00:00
glebius
85cf0e083f aio_mlock() added:
- Regen for r251526.
  - Bump __FreeBSD_version.
2013-06-08 13:30:13 +00:00
glebius
9a02f3097d Add a new system call, aio_mlock(). The name speaks for itself. It allows
performing the mlock(2) operation, which can consume a lot of time, under
the control of aio(4).

Reviewed by:	kib, jilles
Sponsored by:	Nginx, Inc.
2013-06-08 13:27:57 +00:00
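A userland sketch of the new call, assuming the libc wrapper follows the other aio(4) entry points (aio_waitcomplete(2) is used here only to reap the completion):

    #include <aio.h>
    #include <err.h>
    #include <stddef.h>

    static void
    wire_async(void *buf, size_t len)
    {
            struct aiocb cb = { .aio_buf = buf, .aio_nbytes = len };
            struct aiocb *done;

            /* The potentially slow wiring happens under aio(4). */
            if (aio_mlock(&cb) == -1)
                    err(1, "aio_mlock");
            if (aio_waitcomplete(&done, NULL) == -1)
                    err(1, "aio_waitcomplete");
    }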
glebius
98da44e07d Separate LIO_SYNC processing into a separate function aio_process_sync(),
and rename aio_process() into aio_process_rw().

Reviewed by:	kib
Sponsored by:	Nginx, Inc.
2013-06-08 13:02:43 +00:00
jhb
190d5ac85b Do not compare the existing mask of a cpuset with a new mask when changing
the mask of a cpuset.  Also, change the cpuset's mask before updating the
masks of all children.  Previously changing a cpuset's mask first required
setting the mask to a super-set of both the old and new masks and then
changing it a second time to the new mask.
2013-06-06 14:43:19 +00:00
alc
e60ab9c72d Don't busy the page unless we are likely to release the object lock.
Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2013-06-06 06:17:20 +00:00
jeff
38a637b8a0 - Consolidate duplicate code into support functions.
- Split the bqlock into bqclean and bqdirty locks.
 - Only acquire the wakeup synchronization locks when we cross a
   threshold requiring them.
 - Restructure the way flushbufqueues() targets work so they are more
   smp friendly and sane.

Reviewed by:	kib
Discussed with:	mckusick, attilio
Sponsored by:	EMC / Isilon Storage Division

2013-06-05 23:53:00 +00:00
glebius
1bf8d856bd Improve r250890, so that we stop processing of a message with zero
descriptors as early as possible, and assert that number of descriptors
is positive in unp_freerights().

Reviewed by:	mjg, pjd, jilles
2013-06-04 11:19:08 +00:00
jhb
fbe5803939 - Fix a couple of inverted panic messages for shared/exclusive mismatches
of a lock within a single thread.
- Fix handling of interlocks in WITNESS by properly requiring the interlock
  to be held exactly once if it is specified.
2013-06-03 17:41:11 +00:00
jhb
e32d3d0bca - Handle the recursed/not recursed flags with RA_RLOCKED in rw_assert().
- Tweak a panic message.
2013-06-03 17:38:57 +00:00
kib
da0ffbeb0e Be more generous when donating the current thread time to the owner of
the vnode lock while iterating over the free vnode list.  Instead of
yielding, pause for 1 tick.  The change is reported to help in some
virtualized environments.

Submitted by:	Roger Pau Monné <roger.pau@citrix.com>
Discussed with:	jilles
Tested by:	pho
MFC after:	2 weeks
2013-06-03 17:36:43 +00:00
kib
fd6876af27 Do not map the shared page COW. If the process wired its address
space, fork(2) would cause shadowing of the physical object and
copying of the shared page into private copy, effectively preventing
updates for the exported timehands structure and stopping the clock.

Specify the maximum allowed permissions for the page to be read and
execute, preventing write from the user mode.

Reported and tested by:	<huanghwh@yahoo.com>
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-06-03 04:32:53 +00:00
kib
6e7fd63965 When auto-sizing the buffer cache, limit the amount of physical memory
used as the estimation of size to 32GB.  This provides around 100K
buffer headers and corresponding KVA for the buffer map at the peak.
Sizing the cache larger is not useful, and also results in wasting and
exhausting KVA on large machines.

Reported and tested by:	bdrewery
Sponsored by:	The FreeBSD Foundation
2013-06-03 04:16:48 +00:00
alc
3744954e30 Reduce the scope of the VM object locking in brelse(). In my tests, this
change reduced the total number of VM object lock acquisitions by brelse()
by 74%.

Sponsored by:	EMC / Isilon Storage Division
2013-06-02 16:18:03 +00:00
marius
e27884bd80 Move an assertion to the right spot; only bus_dmamap_load_mbuf(9)
requires a pkthdr being present but that's not the case for either
_bus_dmamap_load_mbuf_sg() or bus_dmamap_load_mbuf_sg(9).

Reported by:	sbruno
MFC after:	1 week
2013-06-01 11:42:47 +00:00
jhb
52ec94b217 Style fixes to vn_ioctl().
Suggested by:	bde
2013-05-31 16:15:22 +00:00
jeff
d7efebc4db - Convert the bufobj lock to rwlock.
- Use a shared bufobj lock in getblk() and inmem().
 - Convert softdep's lk to rwlock to match the bufobj lock.
 - Move INFREECNT to b_flags and protect it with the buf lock.
 - Remove unnecessary locking around bremfree() and BKGRDINPROG.

Sponsored by:	EMC / Isilon Storage Division
Discussed with:	mckusick, kib, mdf
2013-05-31 00:43:41 +00:00
julian
49df02549e Initialising the new fibnum field to a known value turns out to
be a GOOD IDEA (TM).
Apparently MOST users set this (e.g. tcp and friends), but there are a few
users that never set it and yet go on to read it, assuming it holds a
sensible value.
These include SCTP, pf and the FLOWTABLE option (and maybe others).
2013-05-24 02:18:37 +00:00
lstewart
f3f987c066 Ensure alq's shutdown_pre_sync event handler is deregistered on module unload to
avoid a dangling pointer and eventual panic on system shutdown.

Reported by:	Ali <comnetboy at gmail.com>
Tested by:	Ali <comnetboy at gmail.com>
MFC after:	1 week
2013-05-24 00:49:12 +00:00
pjd
4a0d14c9b4 Use proper malloc type for ioctls white-list.
Reported by:	pho
Tested by:	pho
2013-05-23 21:07:26 +00:00
luigi
2205d3e0a3 Increase the (arbitrary) limit for the number of packets per tick
from 1k to 20k.  The previous value was good 10 years ago, but not
anymore.

More importantly, lots of good surprises:
polling is incredibly effective under virtualization, and not only
prevents livelock but also saves most of the VM exit overhead in
receive mode.

Using polling, a FreeBSD instance under qemu-kvm remains perfectly
responsive even when bombed with 10 Mpps over an emulated e1000,
and happily processes 1.7 Mpps through ipfw.

Note that some incompatibilities still remain: e.g. polling is not
(yet) compatible with netmap, and seems to freeze the guest when
kern.polling.idle_poll=1

MFC after:	3 days
2013-05-22 16:32:18 +00:00
mjg
4072f856eb passing fd over unix socket: fix a corner case where caller
wants to pass no descriptors.

Previously the kernel would leak memory and try to free a potentially
arbitrary pointer.

Reviewed by:	pjd
2013-05-21 21:58:00 +00:00
attilio
9adbac28e8 vm_object locking is not needed there as pages are already wired.
Sponsored by:	EMC / Isilon storage division
Submitted by:	alc
2013-05-21 20:54:03 +00:00
kib
19425ad923 Regenerate. 2013-05-21 11:41:08 +00:00
kib
ad28c68314 Fix the wait6(2) on 32bit architectures and for the compat32, by using
the right type for the argument in syscalls.master.  Also fix the
posix_fallocate(2) and posix_fadvise(2) compat32 syscalls on the
architectures which require padding of the 64bit argument.

Noted and reviewed by:	jhb
Pointy hat to:	kib
MFC after:	1 week
2013-05-21 11:40:16 +00:00
pjd
3df7f9c0a1 Style nits. 2013-05-19 23:30:24 +00:00
pjd
f945983782 Use SDT_PROBE1() instead of SDT_PROBE(). 2013-05-19 23:29:22 +00:00
jamie
7941fefd80 Refine the "nojail" rc keyword, adding "nojailvnet" for files that don't
apply to most jails but do apply to vnet jails.  This includes adding
a new sysctl "security.jail.vnet" to identify vnet jails.

PR:		conf/149050
Submitted by:	mdodd
MFC after:	3 days
2013-05-19 04:10:34 +00:00
attilio
bd997a13e8 Use readlocking now that assertions on vm_page_lookup() are relaxed.
Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
Tested by:	flo, pho
2013-05-17 20:03:55 +00:00
jh
a3ca3565e2 A library function shall not set errno to 0.
Reviewed by:	mdf
2013-05-16 18:13:10 +00:00
jeff
1cfa4a3bc4 - Add a new general purpose path-compressed radix trie which can be used
with any structure containing a uint64_t index.  The tree code
   auto-generates type safe wrappers.
 - Eliminate the buf splay and replace it with pctrie.  This is not only
   significantly faster with large files but also allows for the possibility
   of shared locking.

Reviewed by:    alc, attilio
Sponsored by:   EMC / Isilon Storage Division
2013-05-12 04:05:01 +00:00
kib
dfd7a7f46d - Fix nullfs vnode reference leak in nullfs_reclaim_lowervp(). The
null_hashget() obtains the reference on the nullfs vnode, which must
  be dropped.

- Fix a wart which existed from the introduction of the nullfs
  caching, do not unlock lower vnode in the nullfs_reclaim_lowervp().
  It should be innocent, but now it is also formally safe.  Inform the
  nullfs_reclaim() about this using the NULLV_NOUNLOCK flag set on
  nullfs inode.

- Add a callback to the upper filesystems for the lower vnode
  unlinking. When inactivating a nullfs vnode, check if the lower
  vnode was unlinked, indicated by nullfs flag NULLV_DROP or VV_NOSYNC
  on the lower vnode, and reclaim upper vnode if so.  This allows
  nullfs to purge cached vnodes for the unlinked lower vnode, avoiding
  excessive caching.

Reported by:	Göran Löwkrantz <goran.lowkrantz@ismobile.com>
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-05-11 11:17:44 +00:00
eadler
4f9ab6c580 Fix a bunch of typos.
PR:	misc/174625
Submitted by:	Jeremy Chadwick <jdc@koitsu.org>
2013-05-10 16:41:26 +00:00
marcel
bd49099038 Add option WITNESS_NO_VNODE to suppress printing LORs between VNODE
locks. To support this, VNODE locks are created with the LK_IS_VNODE
flag. This flag is propagated down using the LO_IS_VNODE flag.

Note that WITNESS still records the LOR. Only the printing and the
optional entering into the kernel debugger is bypassed with the
WITNESS_NO_VNODE option.
2013-05-09 16:28:18 +00:00
kib
3fa3fc83d1 Item 1 in r248830 causes earlier exits from the sendfile(2), before
all requested data was sent.  The reason is that xfsize <= 0 condition
must not be tested at all if space == loopbytes.  Otherwise, done is
set to 1, and sendfile(2) is aborted too early.

Instead of moving the condition to exiting the inner loop after the
xfersize check, directly check for the completed transfer before the
testing of the available space in the socket buffer, and revert item 1
of r248830.  It is arguably another bug to sleep waiting for socket
buffer space (or return EAGAIN for non-blocking socket) if all bytes
are already transferred.

Reported by:	pho
Discussed with:	scottl, gibbs
Tested by:	scottl (stable/9 backport), pho
2013-05-09 16:05:51 +00:00
andre
cdd4fa931f When the accept queue is full print the number of already pending
new connections instead of by how many we're over the limit, which
is always 1.

Noticed by:	jmallet
MFC after:	1 week
2013-05-08 14:13:14 +00:00
scottl
a5718800b1 Add a sysctl vfs.read_min to complement the existing vfs.read_max. It
defaults to 1, meaning that it's off.

When read-ahead is enabled on a file, the vfs cluster code deliberately
breaks a read into 2 I/O transactions; one to satisfy the actual read,
and one to perform read-ahead.  This makes sense in low-latency
circumstances, but often produces unbalanced i/o transactions that
penalize disks.  By setting vfs.read_min, we can tell the algorithm to
fetch a larger transaction than what we asked for, achieving the same
effect as the read-ahead but without the doubled, unbalanced transaction
and the slightly lower latency.  This significantly helps our workloads
with video streaming.

Submitted by:	emax
Reviewed by:	kib
Obtained from:	Netflix
2013-05-07 08:16:21 +00:00
andre
cc8c6e4d01 Back out r249318, r249320 and r249327 due to a heisenbug most
likely related to a race condition in the ipi_hash_lock with
the exact cause currently unknown but under investigation.
2013-05-06 16:42:18 +00:00
mdf
f106bf787a Add missing vdrop() in error case.
Submitted by:	Fahad (mohd.fahadullah@isilon.com)
MFC after:	1 week
2013-05-04 18:38:16 +00:00
jhb
679fa5ed4e Similar to 233760 and 236717, export some more useful info about the
kernel-based POSIX semaphore descriptors to userland via procstat(1) and
fstat(1):
- Change sem file descriptors to track the pathname they are associated
  with and add a ksem_info() method to copy the path out to a
  caller-supplied buffer.
- Use the fo_stat() method of shared memory objects and ksem_info() to
  export the path, mode, and value of a semaphore via struct kinfo_file.
- Add a struct semstat to the libprocstat(3) interface along with a
  procstat_get_sem_info() to export the mode and value of a semaphore.
- Teach fstat about semaphores and to display their path, mode, and value.

MFC after:	2 weeks
2013-05-03 21:11:57 +00:00
jhb
3ddeecb12b Fix FIONREAD on regular files. The computed result was being ignored and
it was being passed down to VOP_IOCTL() where it promptly resulted in
ENOTTY due to a missing else for the past 8 years.  While here, use a
shared vnode lock while fetching the current file's size.

MFC after:	1 week
2013-05-03 19:08:58 +00:00
jilles
49a5937b77 Regenerate files for pipe2(). 2013-05-01 22:45:04 +00:00
jilles
16772c421d Add pipe2() system call.
The pipe2() function is similar to pipe() but allows setting FD_CLOEXEC and
O_NONBLOCK (on both sides) as part of the function.

If p points to two writable ints, pipe2(p, 0) is equivalent to pipe(p).

If the pointer is not valid, behaviour differs: pipe2() writes into the
array from the kernel like socketpair() does, while pipe() writes into the
array from an architecture-specific assembler wrapper.

Reviewed by:	kan, kib
2013-05-01 22:42:42 +00:00
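A minimal userland illustration of the semantics described above: both descriptors come back close-on-exec and non-blocking in a single call:

    #include <err.h>
    #include <fcntl.h>
    #include <unistd.h>

    int
    main(void)
    {
            int fds[2];

            if (pipe2(fds, O_CLOEXEC | O_NONBLOCK) == -1)
                    err(1, "pipe2");
            /* fds[0] and fds[1] otherwise behave like pipe(2) descriptors. */
            return (0);
    }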
jilles
66a7a3379b Regenerate files for accept4(). 2013-05-01 20:12:58 +00:00
jilles
299afd25fd Add accept4() system call.
The accept4() function, compared to accept(), allows setting the new file
descriptor atomically close-on-exec and explicitly controlling the
non-blocking status on the new socket. (Note that the latter point means
that accept() is not equivalent to any form of accept4().)

The linuxulator's accept4 implementation leaves a race window where the new
file descriptor is not close-on-exec because it calls sys_accept(). This
implementation leaves no such race window (by using falloc() flags). The
linuxulator could be fixed and simplified by using the new code.

Like accept(), accept4() is async-signal-safe, a cancellation point and
permitted in capability mode.
2013-05-01 20:10:21 +00:00
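A minimal userland illustration of the atomicity described above: the accepted descriptor is close-on-exec and non-blocking with no window for a concurrent fork/exec to leak it:

    #include <sys/socket.h>
    #include <err.h>

    static int
    accept_one(int lsock)
    {
            int ns;

            ns = accept4(lsock, NULL, NULL, SOCK_CLOEXEC | SOCK_NONBLOCK);
            if (ns == -1)
                    err(1, "accept4");
            return (ns);
    }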
trociny
a5058251d5 Introduce a constant, ELF_NOTE_ROUNDSIZE, which evidently declares our
intention to use 4-byte padding for elf notes.

MFC after:	3 weeks
2013-05-01 14:59:16 +00:00
jilles
51cb2d085a socket: Make shutdown() wake up a blocked accept().
A blocking accept (and some other operations) waits on &so->so_timeo. Once
it wakes up, it will detect the SBS_CANTRCVMORE bit.

The error from accept() is [ECONNABORTED] which is not the nicest one -- the
thread calling accept() needs to know out-of-band what is happening.

A spurious wakeup on so->so_timeo appears harmless (sleep retried) except
when lingering on close (SO_LINGER, and in that case there is no descriptor
to call shutdown() on) so this should be fairly safe.

A shutdown() already woke up a blocked accept() for TCP sockets, but not for
Unix domain sockets. This fix is generic for all domains.

This patch was sent to -hackers@ and -net@ on April 5.

MFC after:	2 weeks
2013-04-30 15:06:30 +00:00
kib
ebecd57ee2 Eliminate the layering violation in kern_sendfile().  When querying
the file size, use VOP_GETATTR() instead of accessing the vnode vm_object's
un_pager.vnp.vnp_size.

Take the shared vnode lock earlier to cover the added VOP_GETATTR()
call and, as a consequence, the whole internal sendfile loop.  Reduce vm
object lock scope to not protect the local calculations.

Note that this is the last misuse of the vnp_size in the tree, the
others were removed from the ELF image activator by r230246.

Reviewed by:	alc
Tested by:	pho, bf (previous version)
MFC after:	1 week
2013-04-28 19:12:09 +00:00
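A hedged kernel sketch of the pattern the commit above moves to: obtain the size via VOP_GETATTR() under a shared vnode lock rather than peeking at the vnode's VM object (the helper itself is hypothetical):

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/proc.h>
    #include <sys/vnode.h>

    static int
    foo_file_size(struct vnode *vp, struct thread *td, off_t *sizep)
    {
            struct vattr va;
            int error;

            vn_lock(vp, LK_SHARED | LK_RETRY);
            error = VOP_GETATTR(vp, &va, td->td_ucred);
            VOP_UNLOCK(vp, 0);
            if (error == 0)
                    *sizep = va.va_size;
            return (error);
    }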
andre
87ef00d48c Base the calculation of maxmbufmem in part on kmem_map size
instead of kernel_map size to prevent kernel memory exhaustion
by mbufs and a subsequent panic on physical page allocation
failure.

On architectures without a direct map all mbuf memory (except
for jumbo mbufs larger than PAGE_SIZE) comes from kmem_map.
Hence it is the limiting factor.

For architectures with a direct map, using the size of kmem_map
is a good proxy of available kernel memory as well.  If it is
much smaller the mbuf limit may be sub-optimal but remains
reasonable, while avoiding panics under exhaustion.

The overall mbuf memory limit calculation may be reconsidered
again later, however due to the many different mbuf sizes and
different backing KVM maps it is a tricky subject.

Found by:	pho's new network stress test
Pointed out by:	alc (kmem_map instead of kernel_map)
Tested by:	pho
2013-04-24 13:54:55 +00:00
jh
f9bfc0cc7f Include PID in the error message which is printed when the maxproc limit
is exceeded. Improve formatting of the message while here.

PR:		kern/60550
Submitted by:	Lowell Gilbert, bde
2013-04-19 15:19:29 +00:00
glebius
b991db2beb Don't compare unsigned socklen_t against < 0.
Reviewed by:	jhb
2013-04-19 13:40:13 +00:00