Commit Graph

13459 Commits

Author SHA1 Message Date
Marcel Moolenaar
8ff6ca1e08 In uuid_ether_add(), avoid false positives due to the limited type
used to hold the sum of the bytes of the MAC address. While here,
rename the variable that holds the sum from 'c' to 'sum'.

Pointed out by: thompsa@
2013-07-24 16:22:27 +00:00
Andriy Gapon
785797c341 rename scheduler->swapper and SI_SUB_RUN_SCHEDULER->SI_SUB_LAST
Also directly call swapper() at the end of mi_startup instead of
relying on swapper being the last thing in sysinits order.

Rationale:

- "RUN_SCHEDULER" was misleading, scheduling already takes place at that stage
- "scheduler" was misleading, the function swaps in the swapped out processes
- another SYSINIT(SI_SUB_RUN_SCHEDULER, SI_ORDER_ANY) could never be
  invoked depending on its relative order with scheduler; this was not obvious
  and the bug actually used to exist

Reviewed by:	kib (ealier version)
MFC after:	14 days
2013-07-24 09:45:31 +00:00
Gleb Smirnoff
9e3cc17647 Remove unused argument from vmem_add1().
Reviewed by:	jeff
2013-07-24 08:02:56 +00:00
Marcel Moolenaar
ef1f916971 Decouple the UUID generator from network interfaces by having MAC
addresses added to the UUID generator using uuid_ether_add(). The
UUID generator keeps an arbitrary number of MAC addresses, under
the assumption that they are rarely removed (= uuid_ether_del()).
This achieves the following:
1.  It brings up closer to having the network stack as a loadable
    module.
2.  It allows the UUID generator to filter MAC addresses for best
    results (= highest chance of uniqeness).
3.  MAC addresses can come from anywhere, irrespactive of whether
    it's used for an interface or not.

A side-effect of the change is that when no MAC addresses have been
added, a random multicast MAC address is created once and re-used if
needed. Previusly, when a random MAC address was needed, it was
created for every call. Thus, a change in behaviour is introduced
for when no MAC addresses exist.

Obtained from:	Juniper Networks, Inc.
2013-07-24 04:24:21 +00:00
Gleb Smirnoff
e28a647db6 Revert r249590 and in case if mp_ncpus isn't initialized use MAXCPU. This
allows us to init counter zone at early stage of boot.

Reviewed by:	kib
Tested by:	Lytochkin Boris <lytboris gmail.com>
2013-07-23 11:16:40 +00:00
Mateusz Guzik
b1051d92fe Remove cr_prison NULL check from proc_to_reap.
Userspace processes always have a prison.

MFC after:	2 weeks
2013-07-22 02:07:15 +00:00
Mateusz Guzik
462314b3f9 Remove duplicate assertion from tdsendsignal.
MFC after:	2 weeks
2013-07-22 00:44:37 +00:00
Konstantin Belousov
643ee87175 Implement compat32 wrappers for the ktimer_* syscalls.
Reported, reviewed and tested by:	Petr Salinger <Petr.Salinger@seznam.cz>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-07-21 19:43:52 +00:00
Konstantin Belousov
058262c69a Wrap kmq_notify(2) for compat32 to properly consume struct sigevent32
argument.

Reviewed and tested by:	Petr Salinger <Petr.Salinger@seznam.cz>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-07-21 19:40:30 +00:00
Konstantin Belousov
9731998997 Move the convert_sigevent32() utility function into freebsd32_misc.c
for consumption outside the vfs_aio.c.

For SIGEV_THREAD_ID and SIGEV_SIGNAL notification delivery methods,
also copy in the sigev_value, since librt event pumping loop compares
note generation number with the value passed through sigev_value.

Tested by:	Petr Salinger <Petr.Salinger@seznam.cz>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2013-07-21 19:33:48 +00:00
Konstantin Belousov
d31e4b3a58 id_t is 64bit, provide the compat32 wrapper for clock_getcpuclockid2(2).
Reported and tested by:	Petr Salinger <Petr.Salinger@seznam.cz>
PR:	threads/180652
Sponsored by:	The FreeBSD Foundation
2013-07-20 13:39:41 +00:00
John Baldwin
ff74a3fa6b Be more aggressive in using superpages in all mappings of objects:
- Add a new address space allocation method (VMFS_OPTIMAL_SPACE) for
  vm_map_find() that will try to alter the alignment of a mapping to match
  any existing superpage mappings of the object being mapped.  If no
  suitable address range is found with the necessary alignment,
  vm_map_find() will fall back to using the simple first-fit strategy
  (VMFS_ANY_SPACE).
- Change mmap() without MAP_FIXED, shmat(), and the GEM mapping ioctl to
  use VMFS_OPTIMAL_SPACE instead of VMFS_ANY_SPACE.

Reviewed by:	alc (earlier version)
MFC after:	2 weeks
2013-07-19 19:06:15 +00:00
Konstantin Belousov
a8b0523ae7 Clear the vnode knotes before destroying vpollinfo.
Reported and tested by:	Patrick Lamaiziere <patfbsd@davenulle.org>
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-07-17 10:56:21 +00:00
Gleb Smirnoff
59b9c4f289 Nuke mbstat. It wasn't used for mbuf statistics since FreeBSD 5.
Now that r253351 moved sendfile() stats to a separate struct, the
last field used in mbstat is m_mcfail, which is updated, but never
read or obtained from userland.
2013-07-15 12:18:36 +00:00
Andrey V. Elsukov
05d1f5bce0 Introduce new structure sfstat for collecting sendfile's statistics
and remove corresponding fields from struct mbstat. Use PCPU counters
and SFSTAT_INC() macro for update these statistics.

Discussed with:	glebius
2013-07-15 06:16:57 +00:00
Craig Rodrigues
719fb72517 PR: 168520 170096
Submitted by: adrian, zec

Fix multiple kernel panics when VIMAGE is enabled in the kernel.
These fixes are based on patches submitted by Adrian Chadd and Marko Zec.

(1)  Set curthread->td_vnet to vnet0 in device_probe_and_attach() just before calling
     device_attach().  This fixes multiple VIMAGE related kernel panics
     when trying to attach Bluetooth or USB Ethernet devices because
     curthread->td_vnet is NULL.

(2)  Set curthread->td_vnet in if_detach().  This fixes kernel panics when detaching networking
     interfaces, especially USB Ethernet devices.

(3)  Use VNET_DOMAIN_SET() in ng_btsocket.c

(4)  In ng_unref_node() set curthread->td_vnet.  This fixes kernel panics
     when detaching Netgraph nodes.
2013-07-15 01:32:55 +00:00
Konstantin Belousov
982d771242 Assert that runningbufspace does not underflow.
Sponsored by:	The FreeBSD Foundation
2013-07-13 19:36:18 +00:00
Konstantin Belousov
da4ca6c8ab There is no need to count waiters for the runningbufspace.
Sponsored by:	The FreeBSD Foundation
2013-07-13 19:34:34 +00:00
Konstantin Belousov
c7c536c7f5 Allow to call clock_gettime() on the clock id for zombie process.
Reported by:	Petr Salinger <Petr.Salinger@seznam.cz>
PR:	threads/180496
Sponsored by:	The FreeBSD Foundation
2013-07-13 19:32:50 +00:00
Andre Oppermann
bc4a1b8ccd Make use of the fact that uma_zone_set_max(9) already returns the
rounded limit making a call to uma_zone_get_max(9) unnecessary.

MFC after:	1 day
2013-07-11 12:53:13 +00:00
Andre Oppermann
e0c00adda2 Fix style issues, a typo in "kern.ipc.nmbufs" and correctly plave and
expose the value of the tunable maxmbufmem as "kern.ipc.maxmbufmem"
through sysctl.

Reported by:	smh
MFC after:	1 day
2013-07-11 12:46:35 +00:00
Konstantin Belousov
92e5367354 Do not invalidate page of the B_NOCACHE buffer or buffer after an I/O
error if any user wired mappings exist.  Doing the invalidation
destroys the user wiring.

The change is the temporal measure to close the bug, the more proper
fix is to delegate the invalidation of the page to upper layers
always.

Reported and tested by:	pho
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-07-11 05:36:26 +00:00
Marcel Moolenaar
8939c0693c Add vfs_mounted and vfs_unmounted events so that components can be informed
about mount and unmount events. This is used by Juniper to implement a more
optimal implementation of NetBSD's veriexec.

This change differs from r253224 in the following way:
o   The vfs_mounted handler is called before mountcheckdirs() and with
    newdp locked. vp is unlocked.
o   The event handlers are declared in <sys/eventhandler.h> and not in
    <sys/mount.h>. The <sys/mount.h> header is used in user land code
    that pretends to be kernel code and as such creates a very convoluted
    environment. It's hard to untangle.

Submitted by:	stevek@juniper.net
Discussed with:	pjd@
Obtained from:	Juniper Networks, Inc.
2013-07-10 15:35:25 +00:00
Konstantin Belousov
cc3d8c35f5 There are several code sequences like
vfs_busy(mp);
      vfs_write_suspend(mp);
which are problematic if other thread starts unmount between two
calls.  The unmount starts a write, while vfs_write_suspend() drain
writers.  On the other hand, unmount drains busy references, causing
the deadlock.

Add a flag argument to vfs_write_suspend and require the callers of it
to specify VS_SKIP_UNMOUNT flag, when the call is performed not in the
mount path, i.e. the covered vnode is not locked.  The suspension is
not attempted if VS_SKIP_UNMOUNT is specified and unmount is in
progress.

Reported and tested by:	Andreas Longwitz <longwitz@incore.de>
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
2013-07-09 20:49:32 +00:00
Andriy Gapon
d6fc869ebd should_yield: protect from td_swvoltick being uninitialized or too stale
The distance between ticks and td_swvoltick should be calculated as
an unsigned number.  Previously we could end up comparing a negative
number with hogticks in which case should_yield() would give incorrect
answer.

We should probably ensure that td_swvoltick is properly initialized.

Sponsored by:	HybridCluster
MFC after:	5 days
2013-07-09 09:01:44 +00:00
Andriy Gapon
4633a4c379 namecache sdt: freebsd doesn't support structured characters yet
:-)

MFC after:	 7 days
2013-07-09 08:58:34 +00:00
John Baldwin
c64bc3a076 Fix build with INVARIANT_SUPPORT enabled but not INVARIANTS.
Reported by:	"Matthew D. Fuller" <fullermd@over-yonder.net>
2013-07-08 21:17:20 +00:00
Alfred Perlstein
d7b5c50b92 Make kassert_printf use __printflike.
Fix associated errors/warnings while I'm here.

Requested by: avg
2013-07-07 21:39:37 +00:00
Jamie Gritton
1e7df84305 Make the comments a little more clear about PRIV_KMEM_*, explicitly
referring to /dev/[k]mem and noting it's about opening the files rather
than actually reading and writing.

Reviewed by:	jmallett
2013-07-06 00:10:52 +00:00
Jamie Gritton
c71e336230 Add new privileges, PRIV_KMEM_READ and PRIV_KMEM_WRITE, used in opening
/dev/kmem and /dev/mem (in addition to traditional file permission checks).
PRIV_KMEM_READ is different from other PRIV_* checks in that it's allowed
by default.

Reviewed by:	kib, mckusick
2013-07-05 21:31:16 +00:00
Alfred Perlstein
3eebd44d0c The change in r236456 (atomic_store_rel not locked) exposed a bug
in the ithread code where we could lose ithread interrupts if
intr_event_schedule_thread() is called while the ithread is already
running.  Effectively memory writes could be ordered incorrectly
such that the unatomic write of 0 to ithd->it_need (in ithread_loop)
could be delayed until after it was set to be triggered again by
intr_event_schedule_thread().

This was particularly a big problem for CAM because CAM optimizes
scheduling of ithreads such that it only signals camisr() when it
queues to an empty queue.  This means that additional completion
events would not unstick CAM and quickly lead to a complete lockup
of the CAM stack.

To fix this use atomics in all places where ithd->it_need is accessed.

Submitted by: delphij, mav
Obtained from: TrueOS, iXsystems
MFC After: 1 week
2013-07-04 05:53:05 +00:00
Mateusz Guzik
a82a370603 Fix receiving fd over unix socket broken in r247740.
If n fds were passed, it would receive the first one n times.

Reported by:	Shawn Webb <lattera@gmail.com>, koobs, gleb
Tested by:	koobs, gleb
Reviewed by:	pjd
2013-07-02 07:36:04 +00:00
Mikolaj Golub
9e89077c65 Plug up the lock lock leakage when exporting to a short buffer.
Reported by:	Alexander Leidinger
Submitted by:	mjg
MFC after:	1 week
2013-07-01 03:27:14 +00:00
Konstantin Belousov
70a7dd5d5b Fix issues with zeroing and fetching the counters, on x86 and ppc64.
Issues were noted by Bruce Evans and are present on all architectures.

On i386, a counter fetch should use atomic read of 64bit value,
otherwise carry from the increment on other CPU could be lost for the
given fetch, making error of 2^32.  If 64bit read (cmpxchg8b) is not
available on the machine, it cannot be SMP and it is enough to disable
preemption around read to avoid the split read.

On x86 the counter increment is not atomic on purpose, which makes it
possible for the store of the incremented result to override just
zeroed per-cpu slot.  The effect would be a counter going off by
arbitrary value after zeroing.  Perform the counter zeroing on the
same processor which does the increments, making the operations
mutually exclusive.  On i386, same as for the fetching, if the
cmpxchg8b is not available, machine is not SMP and we disable
preemption for zeroing.

PowerPC64 is treated the same as amd64.

For other architectures, the changes made to allow the compilation to
succeed, without fixing the issues with zeroing or fetching.  It
should be possible to handle them by using the 64bit loads and stores
atomic WRT preemption (assuming the architectures also converted from
using critical sections to proper asm).  If architecture does not
provide the facility, using global (spin) mutex would be non-optimal
but working solution.

Noted by:  bde
Sponsored by:	The FreeBSD Foundation
2013-07-01 02:48:27 +00:00
Mateusz Guzik
f15ba03632 acct: create a special plimit object and set it for exiting processes
instead of allocating new one each time

All limits are set to RLIM_INFINITY which sould be ok (even though we
care only about RLIMT_FSIZE in this case).

MFC after:	1 week
2013-06-30 19:08:06 +00:00
Mateusz Guzik
4a3c4f41e9 acct: reduce code duplication by using acct_disable as cleanup for
failed kproc_create

MFC after:	1 week
2013-06-30 13:17:37 +00:00
Peter Wemm
0a943e59c9 Revert accidental commit.
Pointy hat to:	 peter
2013-06-29 05:05:57 +00:00
Peter Wemm
76665df9bf Help out gcc. clang understands.
sys_generic.c:1510: warning: 'precision' may be used uninitialized
*** [sys_generic.o] Error code 1
2013-06-29 04:35:04 +00:00
Davide Italiano
7563068847 Correct the comment above _sleep() function which still mentions 'timo'
instead of 'sbintime_t'.

Reported by:	kan
2013-06-28 21:04:15 +00:00
Davide Italiano
237abf0c56 - Trim an unused and bogus Makefile for mount_smbfs.
- Reconnect with some minor modifications, in particular now selsocket()
internals are adapted to use sbintime units after recent'ish calloutng
switch.
2013-06-28 21:00:08 +00:00
Mateusz Guzik
07bd8bf929 Remove duplicate NULL check in kern_proc_filedesc_out.
No functional changes.

MFC after:	1 week
2013-06-28 18:32:46 +00:00
Mikolaj Golub
6359d169ef Rework r252313:
The filedesc lock may not be dropped unconditionally before exporting
fd to sbuf: fd might go away during execution.  While it is ok for
DTYPE_VNODE and DTYPE_FIFO because the export is from a vrefed vnode
here, for other types it is unsafe.

Instead, drop the lock in export_fd_to_sb(), after preparing data in
memory and before writing to sbuf.

Spotted by:	mjg
Suggested by:	kib
Review by:	kib
MFC after:	1 week
2013-06-28 18:07:41 +00:00
Ryan Stone
4a7d0bfcaa Correct a bug that prevented deadlkres from (almost) ever firing.
deadlkres was using a reversed test to check whether ticks had rolled over.
This meant that deadlkres could only fire after ticks had rolled over.
This test was actually unnecessary as deadlkres only ever took the
difference of ticks values which is safe even in the presence of ticks
rollover.  Remove the tests entirely.  Now deadlkres will properly fire
after a lock has been held after the timeout period.

MFC after:	1 month
2013-06-28 15:55:30 +00:00
Jeff Roberson
5f51836645 - Add a general purpose resource allocator, vmem, from NetBSD. It was
originally inspired by the Solaris vmem detailed in the proceedings
   of usenix 2001.  The NetBSD version was heavily refactored for bugs
   and simplicity.
 - Use this resource allocator to allocate the buffer and transient maps.
   Buffer cache defrags are reduced by 25% when used by filesystems with
   mixed block sizes.  Ultimately this may permit dynamic buffer cache
   sizing on low KVA machines.

Discussed with:	alc, kib, attilio
Tested by:	pho
Sponsored by:	EMC / Isilon Storage Division
2013-06-28 03:51:20 +00:00
John Baldwin
e35ce1f271 Make detaching drivers from PCI devices more robust. While here, fix a
bug where a PCI device would be powered down if it failed to probe, but
not when its driver was detached (e.g. via kldunload).
- Add a new helper method resource_list_release_active() which forcefully
  releases any active resources of a specified type from a resource list.
- Add a bus_child_detached method for the PCI bus driver which forces any
  active resources to be released (and whines to the console if it finds
  any) and then powers the device down.
- Call pci_child_detached() if we fail to probe a device when a driver
  is kldloaded.  This isn't perfect but can avoid leaking resources
  from a probe() routine in the kldload case.

Reviewed by:	imp, brooks
MFC after:	1 month
2013-06-27 20:21:54 +00:00
Mikolaj Golub
bd973910c8 To avoid LOR, always drop the filedesc lock before exporting fd to sbuf.
Reviewed by:	kib
MFC after:	3 days
2013-06-27 19:14:03 +00:00
John Baldwin
b5fb43e572 A few mostly cosmetic nits to aid in debugging:
- Call lock_init() first before setting any lock_object fields in
  lock init routines.  This way if the machine panics due to a duplicate
  init the lock's original state is preserved.
- Somewhat similarly, don't decrement td_locks and td_slocks until after
  an unlock operation has completed successfully.
2013-06-25 20:23:08 +00:00
John Baldwin
cd32bd7ad1 Several improvements to rmlock(9). Many of these are based on patches
provided by Isilon.
- Add an rm_assert() supporting various lock assertions similar to other
  locking primitives.  Because rmlocks track readers the assertions are
  always fully accurate unlike rw_assert() and sx_assert().
- Flesh out the lock class methods for rmlocks to support sleeping via
  condvars and rm_sleep() (but only while holding write locks), rmlock
  details in 'show lock' in DDB, and the lc_owner method used by
  dtrace.
- Add an internal destroyed cookie so that API functions can assert
  that an rmlock is not destroyed.
- Make use of rm_assert() to add various assertions to the API (e.g.
  to assert locks are held when an unlock routine is called).
- Give RM_SLEEPABLE locks their own lock class and always use the
  rmlock's own lock_object with WITNESS.
- Use THREAD_NO_SLEEPING() / THREAD_SLEEPING_OK() to disallow sleeping
  while holding a read lock on an rmlock.

Submitted by:	andre
Obtained from:	EMC/Isilon
2013-06-25 18:44:15 +00:00
Lawrence Stewart
0963c8e431 When a previous call to sbsndptr() leaves sb->sb_sndptroff at the start of an
mbuf that was fully consumed by the previous call, the mbuf ptr returned by the
current call ends up being the previous mbuf in the sb chain to the one that
contains the data we want.

This does not cause any observable issues because the mbuf copy routines happily
walk the mbuf chain to get to the data at the moff offset, which in this case
means they effectively skip over the mbuf returned by sbsndptr().

We can't adjust sb->sb_sndptr during the previous call for this case because the
next mbuf in the chain may not exist yet. We therefore need to detect the
condition and make the adjustment during the current call.

Fix by detecting the special case of moff being at the start of the next mbuf in
the chain and adjust the required accounting variables accordingly.

Reviewed by:	andre
MFC after:	2 weeks
2013-06-19 03:08:01 +00:00
Lawrence Stewart
ec41a9a1bd The fix committed in r250951 replaced the reported panic with a deadlock... gold
star for me. EVENTHANDLER_DEREGISTER() attempts to acquire the lock which is
held by the event handler framework while executing event handler functions,
leading to deadlock.

Move EVENTHANDLER_DEREGISTER() to alq_load_handler() and thus deregister the ALQ
shutdown_pre_sync handler at module unload time, which takes care of the
originally reported panic and fixes the deadlock introduced in r250951.

Reported by:	Luiz Otavio O Souza
MFC after:	3 days
X-MFC with:	250951
2013-06-17 09:49:07 +00:00
Ed Schouten
2381f6ef8c Change callout use counter to use C11 atomics.
In order to get some coverage of C11 atomics in kernelspace, switch at
least one piece of code in kernelspace to use C11 atomics instead of
<machine/atomic.h>.

While there, slightly improve the code by adding an assertion to prevent
the use count from going negative.
2013-06-16 09:30:35 +00:00
Lawrence Stewart
38f080cb04 Move hhook's per-vnet initialisation to an earlier SYSINIT SI_SUB stage to
ensure all per-vnet related hhook initialisation is completed prior to any
virtualised hhook points attempting registration.

vnet_register_sysinit() requires that a stage later than SI_SUB_VNET be chosen.
There are no per-vnet initialisors in the source tree at this time which run
earlier than SI_SUB_INIT_IF. A quick audit of non-virtualised SYSINITs indicates
there are no subsystems pre SI_SUB_MBUF that would likely be interested in
registering a virtualised hhook point.

Settle on SI_SUB_MBUF as hhook's per-vnet initialisation stage as it's the first
overtly network-related initilisation stage to run after SI_SUB_VNET. If a
subsystem that initialises earlier than SI_SUB_MBUF ends up wanting to register
virtualised hhook points in future, hhook's use of SI_SUB_MBUF will need to be
revisited and would probably warrant creating a dedicated SI_SUB_HHOOK which
runs immediately after SI_SUB_VNET.

MFC after:	1 week
2013-06-15 10:08:34 +00:00
Lawrence Stewart
933a8bff73 Cleanup and simplification in khelp_{register|deregister}_helper(). No
functional changes.

MFC after:	1 week
2013-06-15 06:45:17 +00:00
Lawrence Stewart
58261d30e1 Add a private KPI between hhook and khelp that allows khelp modules to insert
hook functions into hhook points which register after the modules were loaded -
potentially useful during boot or if hhook points are dynamically registered.

MFC after:	1 week
2013-06-15 05:57:29 +00:00
Lawrence Stewart
b1f53277ec Internalise handling of virtualised hook points inside
hhook_{add|remove}_hook_lookup() so that khelp (and other potential API
consumers) do not have to care when they attempt to (un)hook a particular hook
point identified by id and type.

Reviewed by:	scottl
MFC after:	1 week
2013-06-15 04:03:40 +00:00
Lawrence Stewart
bfe72a58e2 Fix a major oversight in r251732 which causes non-VIMAGE kernels to trigger a
KASSERT during TCP hhook registration at boot. Virtualised hook points only
require extra housekeeping and sanity checking when "options VIMAGE" is present.

Reported by:	bdrewery,jh,dhw
Tested by:	dhw
MFC after:	1 week
X-MFC with:	251732
2013-06-14 18:11:21 +00:00
Lawrence Stewart
601d4c7543 Add support for non-virtualised hhook points, which are uniquely identified by
type and id, as compared to virtualised hook points which are now uniquely
identified by type, id and a vid (which for vimage is the pointer to the vnet
that the hhook resides in).

All hhook_head structs for both virtualised and non-virtualised hook points
coexist in hhook_head_list, and a separate list is maintained for hhook points
within each vnet to simplify some vimage-related housekeeping.

Reviewed by:	scottl
MFC after:	1 week
2013-06-14 04:10:34 +00:00
Lawrence Stewart
86241d89a9 Fix a potential NULL-pointer dereference that would trigger if the hhook
registration site did not provide storage for a copy of the hhook_head struct.

MFC after:	3 days
2013-06-14 02:25:40 +00:00
Jeff Roberson
17a2737732 - Add a BIT_FFS() macro and use it to replace cpusetffs_obj()
Discussed with:	attilio
Sponsored by:	EMC / Isilon Storage Division
2013-06-13 20:46:03 +00:00
Konstantin Belousov
1d7466bca4 Fix two issues with the spin loops in the umtx(2) implementation.
- When looping, check for the pending suspension.  Otherwise, other
  usermode thread which races with the looping one, could try to
  prevent the process from stopping or exiting.

- Add missed checks for the faults from casuword*().  The code is
  structured in a way which makes the loops exit if the specified
  address is invalid, since both fuword() and casuword() return -1 on
  the fault.  But if the address is mapped readonly, the typical value
  read by fuword() is different from -1, while casuword() returns -1.
  Absent the checks for casuword() faults, this is interpreted as the
  race with other thread and causes non-interruptible spinning in the
  kernel.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-06-13 09:33:22 +00:00
Marcel Moolenaar
4612275fdb Revert r251590. It unexpectedly broke the build and there were some
questions on locking. As part of commit-bit grooming, I'd like Steve
to handle this, but can't leave things broken in the mean time.
2013-06-10 15:22:27 +00:00
Marcel Moolenaar
8c7ca16f63 Add vfs_mounted and vfs_unmounted events so that components can be informed
about mount and unmount events. This is used by Juniper to implement a more
optimal implementation of NetBSD's veriexec.

Submitted by:	stevek@juniper.net
Obtained from:	Juniper Networks, Inc
2013-06-09 23:51:26 +00:00
Gleb Smirnoff
8d1aa3c6b4 aio_mlock() added:
- Regen for r251526.
  - Bump __FreeBSD_version.
2013-06-08 13:30:13 +00:00
Gleb Smirnoff
6160e12c10 Add new system call - aio_mlock(). The name speaks for itself. It allows
to perform the mlock(2) operation, which can consume a lot of time, under
control of aio(4).

Reviewed by:	kib, jilles
Sponsored by:	Nginx, Inc.
2013-06-08 13:27:57 +00:00
Gleb Smirnoff
f95c13db04 Separate LIO_SYNC processing into a separate function aio_process_sync(),
and rename aio_process() into aio_process_rw().

Reviewed by:	kib
Sponsored by:	Nginx, Inc.
2013-06-08 13:02:43 +00:00
John Baldwin
c9813d0a37 Do not compare the existing mask of a cpuset with a new mask when changing
the mask of a cpuset.  Also, change the cpuset's mask before updating the
masks of all children.  Previously changing a cpuset's mask first required
setting the mask to a super-set of both the old and new masks and then
changing it a second time to the new mask.
2013-06-06 14:43:19 +00:00
Alan Cox
27a18d6a23 Don't busy the page unless we are likely to release the object lock.
Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2013-06-06 06:17:20 +00:00
Jeff Roberson
ba39d89bc9 - Consolidate duplicate code into support functions.
- Split the bqlock into bqclean and bqdirty locks.
 - Only acquire the wakeup synchronization locks when we cross a
   threshold requiring them.
 - Restructure the way flushbufqueues() targets work so they are more
   smp friendly and sane.

Reviewed by:	kib
Discussed with:	mckusick, attilio
Sponsored by:	EMC / Isilon Storage Division

M    vfs_bio.c
2013-06-05 23:53:00 +00:00
Gleb Smirnoff
82e825c4c9 Improve r250890, so that we stop processing of a message with zero
descriptors as early as possible, and assert that number of descriptors
is positive in unp_freerights().

Reviewed by:	mjg, pjd, jilles
2013-06-04 11:19:08 +00:00
John Baldwin
24150d37d3 - Fix a couple of inverted panic messages for shared/exclusive mismatches
of a lock within a single thread.
- Fix handling of interlocks in WITNESS by properly requiring the interlock
  to be held exactly once if it is specified.
2013-06-03 17:41:11 +00:00
John Baldwin
95d28652af - Handle the recursed/not recursed flags with RA_RLOCKED in rw_assert().
- Tweak a panic message.
2013-06-03 17:38:57 +00:00
Konstantin Belousov
d39116f5d5 Be more generous when donating the current thread time to the owner of
the vnode lock while iterating over the free vnode list.  Instead of
yielding, pause for 1 tick.  The change is reported to help in some
virtualized environments.

Submitted by:	Roger Pau Monn? <roger.pau@citrix.com>
Discussed with:	jilles
Tested by:	pho
MFC after:	2 weeks
2013-06-03 17:36:43 +00:00
Konstantin Belousov
1e65d73c74 Do not map the shared page COW. If the process wired its address
space, fork(2) would cause shadowing of the physical object and
copying of the shared page into private copy, effectively preventing
updates for the exported timehands structure and stopping the clock.

Specify the maximum allowed permissions for the page to be read and
execute, preventing write from the user mode.

Reported and tested by:	<huanghwh@yahoo.com>
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-06-03 04:32:53 +00:00
Konstantin Belousov
92fab43f7f When auto-sizing the buffer cache, limit the amount of physical memory
used as the estimation of size, to 32GB.  This provides around 100K of
buffer headers and corresponding KVA for buffer map at the peak.
Sizing the cache larger is not useful, also resulting in the wasting
and exhausting of KVA for large machines.

Reported and tested by:	bdrewery
Sponsored by:	The FreeBSD Foundation
2013-06-03 04:16:48 +00:00
Alan Cox
39a4cd0cec Reduce the scope of the VM object locking in brelse(). In my tests, this
change reduced the total number of VM object lock acquisitions by brelse()
by 74%.

Sponsored by:	EMC / Isilon Storage Division
2013-06-02 16:18:03 +00:00
Marius Strobl
0ad17e4b32 Move an assertion to the right spot; only bus_dmamap_load_mbuf(9)
requires a pkthdr being present but that's not the case for either
_bus_dmamap_load_mbuf_sg() or bus_dmamap_load_mbuf_sg(9).

Reported by:	sbruno
MFC after:	1 week
2013-06-01 11:42:47 +00:00
John Baldwin
3d4c503cf0 Style fixes to vn_ioctl().
Suggested by:	bde
2013-05-31 16:15:22 +00:00
Jeff Roberson
22a722605d - Convert the bufobj lock to rwlock.
- Use a shared bufobj lock in getblk() and inmem().
 - Convert softdep's lk to rwlock to match the bufobj lock.
 - Move INFREECNT to b_flags and protect it with the buf lock.
 - Remove unnecessary locking around bremfree() and BKGRDINPROG.

Sponsored by:	EMC / Isilon Storage Division
Discussed with:	mckusick, kib, mdf
2013-05-31 00:43:41 +00:00
Julian Elischer
4591f0d339 Initialising the new fibnum field to a known value turns out to
be a GOOD IDEA (TM).
Apparently MOST users set this (e.g. tcp and friends) but there are a few
users that just assume that it is a sensible value but then go on to read it.
These include SCTP, pf and the FLOWTABLE option (and maybe others).
2013-05-24 02:18:37 +00:00
Lawrence Stewart
7639c9be45 Ensure alq's shutdown_pre_sync event handler is deregistered on module unload to
avoid a dangling pointer and eventual panic on system shutdown.

Reported by:	Ali <comnetboy at gmail.com>
Tested by:	Ali <comnetboy at gmail.com>
MFC after:	1 week
2013-05-24 00:49:12 +00:00
Pawel Jakub Dawidek
92981fdf9e Use proper malloc type for ioctls white-list.
Reported by:	pho
Tested by:	pho
2013-05-23 21:07:26 +00:00
Luigi Rizzo
4b62214f4a Increase the (arbitrary) limit for the number of packets per tick
from 1k to 20k The previous value was good 10 years ago, but not
anymore now.

More importantly, lots of good surprises:
polling is incredibly effective under virtualization, and not only
prevents livelock but also saves most of the VM exit overhead in
receive mode.

Using polling, a FreeBSD instance under qemu-kvm remains perfectly
responsive even when bombed with 10 Mpps over an emulated e1000,
and happily processes 1.7 Mpps through ipfw.

Note that some incompatibilities still remain: e.g. polling is not
(yet) compatible with netmap, and seems to freeze the guest when
kern.polling.idle_poll=1

MFC after:	3 days
2013-05-22 16:32:18 +00:00
Mateusz Guzik
ecbb2a1819 passing fd over unix socket: fix a corner case where caller
wants to pass no descriptors.

Previously the kernel would leak memory and try to free a potentially
arbitrary pointer.

Reviewed by:	pjd
2013-05-21 21:58:00 +00:00
Attilio Rao
bed927ee17 vm_object locking is not needed there as pages are already wired.
Sponsored by:	EMC / Isilon storage division
Submitted by:	alc
2013-05-21 20:54:03 +00:00
Konstantin Belousov
f85769eb75 Regenerate. 2013-05-21 11:41:08 +00:00
Konstantin Belousov
48947eccee Fix the wait6(2) on 32bit architectures and for the compat32, by using
the right type for the argument in syscalls.master.  Also fix the
posix_fallocate(2) and posix_fadvise(2) compat32 syscalls on the
architectures which require padding of the 64bit argument.

Noted and reviewed by:	jhb
Pointy hat to:	kib
MFC after:	1 week
2013-05-21 11:40:16 +00:00
Pawel Jakub Dawidek
9b9ff7d390 Style nits. 2013-05-19 23:30:24 +00:00
Pawel Jakub Dawidek
9b1040a574 Use SDT_PROBE1() instead of SDT_PROBE(). 2013-05-19 23:29:22 +00:00
Jamie Gritton
761d2bb5b9 Refine the "nojail" rc keyword, adding "nojailvnet" for files that don't
apply to most jails but do apply to vnet jails.  This includes adding
a new sysctl "security.jail.vnet" to identify vnet jails.

PR:		conf/149050
Submitted by:	mdodd
MFC after:	3 days
2013-05-19 04:10:34 +00:00
Attilio Rao
e3ed7ff03f Use readlocking now that assertions on vm_page_lookup() are relaxed.
Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
Tested by:	flo, pho
2013-05-17 20:03:55 +00:00
Jaakko Heinonen
c532f8c4d6 A library function shall not set errno to 0.
Reviewed by:	mdf
2013-05-16 18:13:10 +00:00
Jeff Roberson
f2cc1285c2 - Add a new general purpose path-compressed radix trie which can be used
with any structure containing a uint64_t index.  The tree code
   auto-generates type safe wrappers.
 - Eliminate the buf splay and replace it with pctrie.  This is not only
   significantly faster with large files but also allows for the possibility
   of shared locking.

Reviewed by:    alc, attilio
Sponsored by:   EMC / Isilon Storage Division
2013-05-12 04:05:01 +00:00
Konstantin Belousov
0fc6daa72d - Fix nullfs vnode reference leak in nullfs_reclaim_lowervp(). The
null_hashget() obtains the reference on the nullfs vnode, which must
  be dropped.

- Fix a wart which existed from the introduction of the nullfs
  caching, do not unlock lower vnode in the nullfs_reclaim_lowervp().
  It should be innocent, but now it is also formally safe.  Inform the
  nullfs_reclaim() about this using the NULLV_NOUNLOCK flag set on
  nullfs inode.

- Add a callback to the upper filesystems for the lower vnode
  unlinking. When inactivating a nullfs vnode, check if the lower
  vnode was unlinked, indicated by nullfs flag NULLV_DROP or VV_NOSYNC
  on the lower vnode, and reclaim upper vnode if so.  This allows
  nullfs to purge cached vnodes for the unlinked lower vnode, avoiding
  excessive caching.

Reported by:	G??ran L??wkrantz <goran.lowkrantz@ismobile.com>
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-05-11 11:17:44 +00:00
Eitan Adler
7a2b450ff8 Fxi a bunch of typos.
PR:	misc/174625
Submitted by:	Jeremy Chadwick <jdc@koitsu.org>
2013-05-10 16:41:26 +00:00
Marcel Moolenaar
e63091ea6c Add option WITNESS_NO_VNODE to suppress printing LORs between VNODE
locks. To support this, VNODE locks are created with the LK_IS_VNODE
flag. This flag is propagated down using the LO_IS_VNODE flag.

Note that WITNESS still records the LOR. Only the printing and the
optional entering into the kernel debugger is bypassed with the
WITNESS_NO_VNODE option.
2013-05-09 16:28:18 +00:00
Konstantin Belousov
3328431dec Item 1 in r248830 causes earlier exits from the sendfile(2), before
all requested data was sent.  The reason is that xfsize <= 0 condition
must not be tested at all if space == loopbytes.  Otherwise, the done
is set to 1, and sendfile(2) is aborted too early.

Instead of moving the condition to exiting the inner loop after the
xfersize check, directly check for the completed transfer before the
testing of the available space in the socket buffer, and revert item 1
of r248830.  It is arguably another bug to sleep waiting for socket
buffer space (or return EAGAIN for non-blocking socket) if all bytes
are already transferred.

Reported by:	pho
Discussed with:	scottl, gibbs
Tested by:	scottl (stable/9 backport), pho
2013-05-09 16:05:51 +00:00
Andre Oppermann
6753da1356 When the accept queue is full print the number of already pending
new connections instead of by how many we're over the limit, which
is always 1.

Noticed by:	jmallet
MFC after:	1 week
2013-05-08 14:13:14 +00:00
Scott Long
ab8f55b9fd Add a sysctl vfs.read_min to complement the exiting vfs.read_max. It
defaults to 1, meaning that it's off.

When read-ahead is enabled on a file, the vfs cluster code deliberately
breaks a read into 2 I/O transactions; one to satisfy the actual read,
and one to perform read-ahead.  This makes sense in low-latency
circumstances, but often produces unbalanced i/o transactions that
penalize disks.  By setting vfs.read_min, we can tell the algorithm to
fetch a larger transaction that what we asked for, achieving the same
effect as the read-ahead but without the doubled, unbalanced transaction
and the slightly lower latency.  This significantly helps our workloads
with video streaming.

Submitted by:	emax
Reviewed by:	kib
Obtained from:	Netflix
2013-05-07 08:16:21 +00:00
Andre Oppermann
f89d4c3acf Back out r249318, r249320 and r249327 due to a heisenbug most
likely related to a race condition in the ipi_hash_lock with
the exact cause currently unknown but under investigation.
2013-05-06 16:42:18 +00:00
Matthew D Fleming
770b41b349 Add missing vdrop() in error case.
Submitted by:	Fahad (mohd.fahadullah@isilon.com)
MFC after:	1 week
2013-05-04 18:38:16 +00:00
John Baldwin
958aa57537 Similar to 233760 and 236717, export some more useful info about the
kernel-based POSIX semaphore descriptors to userland via procstat(1) and
fstat(1):
- Change sem file descriptors to track the pathname they are associated
  with and add a ksem_info() method to copy the path out to a
  caller-supplied buffer.
- Use the fo_stat() method of shared memory objects and ksem_info() to
  export the path, mode, and value of a semaphore via struct kinfo_file.
- Add a struct semstat to the libprocstat(3) interface along with a
  procstat_get_sem_info() to export the mode and value of a semaphore.
- Teach fstat about semaphores and to display their path, mode, and value.

MFC after:	2 weeks
2013-05-03 21:11:57 +00:00
John Baldwin
dfa66c01ae Fix FIONREAD on regular files. The computed result was being ignored and
it was being passed down to VOP_IOCTL() where it promptly resulted in
ENOTTY due to a missing else for the past 8 years.  While here, use a
shared vnode lock while fetching the current file's size.

MFC after:	1 week
2013-05-03 19:08:58 +00:00
Jilles Tjoelker
b201f4a0dc Regenerate files for pipe2(). 2013-05-01 22:45:04 +00:00
Jilles Tjoelker
dc570d5e56 Add pipe2() system call.
The pipe2() function is similar to pipe() but allows setting FD_CLOEXEC and
O_NONBLOCK (on both sides) as part of the function.

If p points to two writable ints, pipe2(p, 0) is equivalent to pipe(p).

If the pointer is not valid, behaviour differs: pipe2() writes into the
array from the kernel like socketpair() does, while pipe() writes into the
array from an architecture-specific assembler wrapper.

Reviewed by:	kan, kib
2013-05-01 22:42:42 +00:00
Jilles Tjoelker
1bf6b724f1 Regenerate files for accept4(). 2013-05-01 20:12:58 +00:00
Jilles Tjoelker
da7d2afb6d Add accept4() system call.
The accept4() function, compared to accept(), allows setting the new file
descriptor atomically close-on-exec and explicitly controlling the
non-blocking status on the new socket. (Note that the latter point means
that accept() is not equivalent to any form of accept4().)

The linuxulator's accept4 implementation leaves a race window where the new
file descriptor is not close-on-exec because it calls sys_accept(). This
implementation leaves no such race window (by using falloc() flags). The
linuxulator could be fixed and simplified by using the new code.

Like accept(), accept4() is async-signal-safe, a cancellation point and
permitted in capability mode.
2013-05-01 20:10:21 +00:00
Mikolaj Golub
1b8388cde9 Introduce a constant, ELF_NOTE_ROUNDSIZE, which evidently declare our
intention to use 4-byte padding for elf notes.

MFC after:	3 weeks
2013-05-01 14:59:16 +00:00
Jilles Tjoelker
cd31b6dd08 socket: Make shutdown() wake up a blocked accept().
A blocking accept (and some other operations) waits on &so->so_timeo. Once
it wakes up, it will detect the SBS_CANTRCVMORE bit.

The error from accept() is [ECONNABORTED] which is not the nicest one -- the
thread calling accept() needs to know out-of-band what is happening.

A spurious wakeup on so->so_timeo appears harmless (sleep retried) except
when lingering on close (SO_LINGER, and in that case there is no descriptor
to call shutdown() on) so this should be fairly safe.

A shutdown() already woke up a blocked accept() for TCP sockets, but not for
Unix domain sockets. This fix is generic for all domains.

This patch was sent to -hackers@ and -net@ on April 5.

MFC after:	2 weeks
2013-04-30 15:06:30 +00:00
Konstantin Belousov
3d31767952 Eliminate the layering violation in the kern_sendfile(). When quering
the file size, use VOP_GETATTR() instead of accessing vnode vm_object
un_pager.vnp.vnp_size.

Take the shared vnode lock earlier to cover the added VOP_GETATTR()
call and, as consequence, the whole internal sendfile loop.  Reduce vm
object lock scope to not protect the local calculations.

Note that this is the last misuse of the vnp_size in the tree, the
others were removed from the ELF image activator by r230246.

Reviewed by:	alc
Tested by:	pho, bf (previous version)
MFC after:	1 week
2013-04-28 19:12:09 +00:00
Andre Oppermann
2ebcc8ac4a Base the calculation of maxmbufmem in part on kmem_map size
instead of kernel_map size to prevent kernel memory exhaustion
by mbufs and a subsequent panic on physical page allocation
failure.

On architectures without a direct map all mbuf memory (except
for jumbo mbufs larger than PAGE_SIZE) comes from kmem_map.
It is the limiting factor hence.

For architectures with a direct map using the size of kmem_map
is a good proxy of available kernel memory as well.  If it is
much smaller the mbuf limit may be sub-optimal but remains
reasonable, while avoiding panics under exhaustion.

The overall mbuf memory limit calculation may be reconsidered
again later, however due to the many different mbuf sizes and
different backing KVM maps it is a tricky subject.

Found by:	pho's new network stress test
Pointed out by:	alc (kmem_map instead of kernel_map)
Tested by:	pho
2013-04-24 13:54:55 +00:00
Jaakko Heinonen
a208417c41 Include PID in the error message which is printed when the maxproc limit
is exceeded. Improve formatting of the message while here.

PR:		kern/60550
Submitted by:	Lowell Gilbert, bde
2013-04-19 15:19:29 +00:00
Gleb Smirnoff
14658a80fe Don't compare unsigned socklen_t against < 0.
Reviewed by:	jhb
2013-04-19 13:40:13 +00:00
Jilles Tjoelker
1e367efa8b sem: Restart the POSIX sem_* calls after signals with SA_RESTART set.
Programs often do not expect an [EINTR] return from sem_wait() and POSIX
only allows it if the signal was installed without SA_RESTART. The timeout
in sem_timedwait() is absolute so it can be restarted normally.

The umtx call can be invoked with a relative timeout and in that case
[ERESTART] must be changed to [EINTR]. However, libc does not do this.

The old POSIX semaphore implementation did this correctly (before r249566),
unlike the new umtx one.

It may be desirable to avoid [EINTR] completely, which matches the pthread
functions and is explicitly permitted by POSIX. However, the kernel must
return [EINTR] at least for signals with SA_RESTART clear, otherwise pthread
cancellation will not abort a semaphore wait. In this commit, only restore
the 8.x behaviour which is also permitted by POSIX.

Discussed with:	jhb
MFC after:	1 week
2013-04-19 10:16:00 +00:00
Gleb Smirnoff
8f779cc541 On non-ACPI i386 mp_ncpus is initialized at SI_SUB_CPU, and this
prevents us from creating UMA_ZONE_PCPU zones earlier.

As bandaid shift initialization of counter(9) zone later.

Reviewed by:		kib
Reported & tested by:	Lytochkin Boris <lytboris gmail.com>
2013-04-17 18:43:33 +00:00
Gabor Kovesdan
a2098fea6d - Correct mispellings of the word necessary
Submitted by:	Christoph Mallon <christoph.mallon@gmx.de> (via private mail)
2013-04-17 11:42:40 +00:00
Gabor Kovesdan
ab3f6b347e - Correct mispellings of the word occurrence
Submitted by:	Christoph Mallon <christoph.mallon@gmx.de> (via private mail)
2013-04-17 11:40:10 +00:00
Warner Losh
11c601447d r249408 and r249436 cause a NULL pointer dereference on the CUBIEBOARD
since it doesn't set the kernel envrionment at all. Work around this
by making sure kern_envp is non-NULL before dereferencing it.
2013-04-16 22:09:08 +00:00
John Baldwin
8916af883c - Document that sem_wait() can fail with EINTR if it is interrupted by a
signal.
- Fix the old ksem implementation for POSIX semaphores to not restart
  sem_wait() or sem_timedwait() if interrupted by a signal.

MFC after:	1 week
2013-04-16 20:26:31 +00:00
Mikolaj Golub
f1fca82ed5 Add a new set of notes to a process core dump to store procstat data.
The notes format is a header of sizeof(int), which stores the size of
the corresponding data structure to provide some versioning, and data
in the format as it is returned by a related sysctl call.

The userland tools (procstat(1)) will be taught to extract this data,
providing additional info for postmortem analysis.

PR:		kern/173723
Suggested by:	jhb
Discussed with:	jhb, kib
Reviewed by:	jhb (initial version), kib
MFC after:	1 month
2013-04-16 19:19:14 +00:00
Rick Macklem
64fa8df6e0 Allow the vnode to be unlocked for the weird case of
LK_EXCLOTHER. LK_EXCLOTHER is only used to acquire a
usecount on a vnode during NFSv4 recovery from an
expired lease.

Reported and tested by:	pho
MFC after:	2 weeks
2013-04-16 14:22:16 +00:00
Konstantin Belousov
44d95698ba Some compilers issue a warning when wider integer is casted to narrow
pointer.  Supposedly shut down the warning by casting through
uintptr_t.

Reported by:	ian
2013-04-16 07:11:52 +00:00
George V. Neville-Neil
8f2ba63493 Point args[0] not at the thread that is ending but at the one that
is starting.  This is in line with practice in OpenSolaris.

Note that this change is only in ULE and not in the 4BSD scheduler.
Once this change settles in (MFC timeout has expired) we'll try it out
on 4BSD as well.

PR:		177706
Submitted by:	Tiwei Bie
MFC after:	1 month
2013-04-15 17:21:02 +00:00
Mikolaj Golub
5ea21e6904 Similarly to proc_getargv() and proc_getenvv(), export proc_getauxv()
to be able to reuse the code.

MFC after:	3 weeks
2013-04-14 20:03:48 +00:00
Mikolaj Golub
fe52cf5475 Re-factor the code to provide kern_proc_filedesc_out(), kern_proc_out(),
and kern_proc_vmmap_out() functions to output process kinfo structures
to sbuf, to make the code reusable.

The functions are going to be used in the coredump routine to store
procstat info in the core program header notes.

Reviewed by:	kib
MFC after:	3 weeks
2013-04-14 20:01:36 +00:00
Mikolaj Golub
bd3902134c Re-factor coredump routines. For each type of notes an output
function is provided, which is used either to calculate the note size
or output it to sbuf.  On the first pass the notes are registered in a
list and the resulting size is found, on the second pass the list is
traversed outputing notes to sbuf.  For the sbuf a drain routine is
provided that writes data to a core file.

The main goal of the change is to make coredump to write notes
directly to the core file, without preliminary preparing them all in a
memory buffer.  Storing notes in memory is not a problem for the
current, rather small, set of notes we write to the core, but it may
becomes an issue when we start to store procstat notes.

Reviewed by:	jhb (initial version), kib
Discussed with:	jhb, kib
MFC after:	3 weeks
2013-04-14 19:59:38 +00:00
Mateusz Guzik
db8f33fd32 Add fdallocn function and use it when passing fds over unix socket.
This gets rid of "unp_externalize fdalloc failed" panic.

Reviewed by:	pjd
MFC after:	1 week
2013-04-14 17:08:34 +00:00
Jayachandran C.
f46206c270 Fix changes made in r249408.
In some cases, kern_envp is set by the architecture code and env_pos does
not contain the length of the static kernel environment. In these cases
r249408 causes the kernel to discard the environment.

Fix this by updating the check for empty static env to *kern_envp != '\0'

Reported by:	np@
2013-04-13 07:23:37 +00:00
Jayachandran C.
15f9c9ed69 Fix kenv behavior when there is no static environment
In case where there are no static kernel environment entries, the
function init_dynamic_kenv() adds an incorrect entry at position 0 of
the dynamic kernel environment. This in turn causes kenv(1) to print
and empty list even though there are dynamic entries added later.

Fix this by checking env_pos in init_dynamic_kenv() and adding dynamic
entries only if there are static entries.
2013-04-12 15:58:53 +00:00
Mikolaj Golub
ddb9b61248 Add sbuf_start_section() and sbuf_end_section() functions, which can
be used for automatic section alignment.

Discussed with:	kib
Reviewed by:	kib
MFC after:	1 month
2013-04-11 19:49:18 +00:00
Jim Harris
d58a96538f Fix the build. 2013-04-10 00:35:08 +00:00
Andre Oppermann
e8b3186b6a Change certain heavily used network related mutexes and rwlocks to
reside on their own cache line to prevent false sharing with other
nearby structures, especially for those in the .bss segment.

NB: Those mutexes and rwlocks with variables next to them that get
changed on every invocation do not benefit from their own cache line.
Actually it may be net negative because two cache misses would be
incurred in those cases.
2013-04-09 21:02:20 +00:00
Attilio Rao
bc403f030d Switch some "low-hanging fruit" to acquire read lock on vmobjects
rather than write locks.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
Tested by:	pho
2013-04-08 19:58:32 +00:00
Gleb Smirnoff
4e76af6a41 Merge from projects/counters: counter(9).
Introduce counter(9) API, that implements fast and raceless counters,
provided (but not limited to) for gathering of statistical data.

See http://lists.freebsd.org/pipermail/freebsd-arch/2013-April/014204.html
for more details.

In collaboration with:	kib
Reviewed by:		luigi
Tested by:		ae, ray
Sponsored by:		Nginx, Inc.
2013-04-08 19:40:53 +00:00
Mikolaj Golub
c9d59a63e3 Use pget(9) to reduce code duplication.
MFC after:	1 week
2013-04-07 17:44:30 +00:00
Mikolaj Golub
fb5ea9d1c8 Fill p_flags and p_align fields of the core dump note segement.
Reviewed by:	kib
MFC after:	2 weeks
2013-04-07 17:42:27 +00:00
Mikolaj Golub
27b056480e Use 4-byte padding for core dump notes on both 32 and 64bit archs.
Although native word padding (i.e. 8-byte on 64bit arch) looks to be
in agreement with standards, other parts of our code and other OSes
use 4-byte alignment.

This is not expected to change alignment for currently generated core
dump notes, as the notes look to consist of structures with sizes
multiple of 8 on 64-bit archs. But there are plans to add additional
notes, where 4-byte vs 8-byte alignment makes difference.

Discussed with:	kib
Reviewed by:	kib
MFC after:	2 weeks
2013-04-07 17:40:49 +00:00
Jilles Tjoelker
b68cf25fe6 mqueue,ksem,shm: Fix race condition with setting UF_EXCLOSE.
POSIX mqueue, compatibility ksem and POSIX shm create a file descriptor that
has close-on-exec set. However, they do this incorrectly, leaving a window
where a thread may fork and exec while the flag has not been set yet. The
race is easily reproduced on a multicore system with one thread doing
shm_open and close and another thread doing posix_spawnp and waitpid.

Set UF_EXCLOSE via falloc()'s flags argument instead. This also simplifies
the code.

MFC after:	1 week
2013-04-07 15:26:09 +00:00
Jeff Roberson
26089666b6 Prepare to replace the buf splay with a trie:
- Don't insert BKGRDMARKER bufs into the splay or dirty/clean buf lists.
   No consumers need to find them there and it complicates the tree.
   These flags are all FFS specific and could be moved out of the buf
   cache.
 - Use pbgetvp() and pbrelvp() to associate the background and journal
   bufs with the vp.  Not only is this much cheaper it makes more sense
   for these transient bufs.
 - Fix the assertions in pbget* and pbrel*.  It's not safe to check list
   pointers which were never initialized.  Use the BX flags instead.  We
   also check B_PAGING in reassignbuf() so this should cover all cases.

Discussed with:	kib, mckusick, attilio
Sponsored by:	EMC / Isilon Storage Division
2013-04-06 22:21:23 +00:00
Gleb Smirnoff
b9ce4f67ae Fix memory leak in coredump().
Reviewed by:	kib
2013-04-05 20:24:51 +00:00
Konstantin Belousov
b887a1555c If filter of the interrupt event is not null, print it, in addition to
the handler address.  Add a mark to distinguish between filter and
handler.

Note that the arguments for both filter and handler are same.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	jhb
MFC after:	1 week
2013-04-05 14:30:51 +00:00
Brooks Davis
56fddc5d8c MFP4 change 210763
Allow boothowto and bootverbose to be set via kernel options, which
is useful on architectures that are unable to rely on a boot loader
to pass configuration variables to the kernel.

Submitted by:	rwatson
2013-04-03 22:24:36 +00:00
Kenneth D. Merry
a358cf3aec Add support for XPT_CONT_TARGET_IO CCBs in _bus_dmamap_load_ccb().
Declare CCB types in their respective switch blocks.

Sponsored by:	Spectra Logic
2013-04-02 16:49:49 +00:00
Matthew D Fleming
b3e6bbc676 Regen.
MFC after:	1 week
2013-04-02 05:30:52 +00:00
Matthew D Fleming
e324bf91e8 Fix return type of extattr_set_* and fix rmextattr(8) utility.
extattr_set_{fd,file,link} is logically a write(2)-like operation and
should return ssize_t, just like extattr_get_*.  Also, the user-space
utility was using an int for the return value of extattr_get_* and
extattr_list_*, both of which return an ssize_t.

MFC after:	1 week
2013-04-02 05:30:41 +00:00
Konstantin Belousov
c686ee4685 Do not call the VOP_LOOKUP() for the doomed directory vnode. The
vnode could be reclaimed while lock upgrade was performed.

Sponsored by:	The FreeBSD Foundation
Reported and tested by:	pho
Diagnosed and reviewed by:	rmacklem
MFC after:	1 week
2013-04-01 09:59:38 +00:00
Jilles Tjoelker
d289dc7b73 Rename do_pipe() to kern_pipe2() and declare it properly. 2013-03-31 17:42:54 +00:00
Matthew D Fleming
926cd204c7 Use a shared lock for VOP_GETEXTATTR, as it is a read-like operation.
MFC after:	1 week
2013-03-30 15:09:04 +00:00
Jim Harris
10a93479b9 Add bus_dmamap_load_bio for non-CAM disk drivers that wish to enable
unmapped I/O.

Sponsored by:	Intel
Reviewed by:	kib
2013-03-29 16:26:25 +00:00
Jim Harris
86675b5c0d Add CTR5() to bus_dmamap_load_ccb, similar to other bus_dmamap_load_*
functions.

Sponsored by:	Intel
2013-03-29 16:00:16 +00:00
Jim Harris
ab72998ef7 Do not add 1 to nsegs before passing to CTR5(), since nsegs
has already been incremented before these calls.

Sponsored by:	Intel
2013-03-29 15:54:12 +00:00
Jim Harris
b327350604 Pass correct parameter to CTR5() in bus_dmamap_load_uio.
Sponsored by:	Intel
2013-03-29 15:51:45 +00:00
Gleb Smirnoff
21f398487c Fix bug in m_split() in a case when split len matches len of the
first mbuf, and the first mbuf is M_PKTHDR.

PR:		kern/176144
Submitted by:	Jacques Fourie <jacques.fourie gmail.com>
2013-03-29 14:10:40 +00:00
Gleb Smirnoff
844cacd17c Once ng_ksocket(4) is fixed, re-apply r194662. See this revision for
longer description.

Discussed with:	andre, rwatson
Sponsored by:	Nginx, Inc.
2013-03-29 14:06:04 +00:00
Gleb Smirnoff
a307eb26ed When soreceive_generic() hands off an mbuf from buffer,
clear its pointer to next record, since next record
belongs to the buffer, and shouldn't be leaked.

The ng_ksocket(4) used to clear this pointer itself,
but the correct place is here.

Sponsored by:	Nginx, Inc
2013-03-29 13:57:55 +00:00
Scott Long
07dbf2c768 Several fixes and improvements to sendfile()
1.  If we wanted to send exactly as many bytes as the socket buffer is
    sized for, the inner loop of kern_sendfile() would see that the
    socket is full before seeing that it had no more bytes left to send.
    This would cause it to return EAGAIN to the caller instead of
    success.  Fix by changing the order that these conditions are tested.
2.  Simplify the calculation for the bytes to send in each iteration of
    the inner loop of kern_sendfile()
3.  Fix some calls with bogus arguments to sf_buf_ext().  These would
    only trigger on mbuf allocation failure, but would be hilariously
    bad if they did trigger.

Submitted by:	gibbs(3), andre(2)
Reviewed by:	emax, andre
Obtained from:	Netflix
MFC after:	1 week
2013-03-28 14:14:28 +00:00
Jim Harris
47301c53ed deferal -> deferral 2013-03-27 23:07:43 +00:00
Konstantin Belousov
f3215a60fd Fix a race with the vnode reclamation in the aio_qphysio(). Obtain
the thread reference on the vp->v_rdev and use the returned struct
cdev *dev instead of using vp->v_rdev.  Call dev_strategy_csw()
instead of dev_strategy(), since we now own the reference.

Since the csw was already calculated, test d_flags to avoid mapping
the buffer if the driver supports unmapped requests [*].

Suggested by:	kan [*]
Reviewed by:	kan (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-03-27 11:47:52 +00:00
Konstantin Belousov
d1e99f43ed Add dev_strategy_csw() function, which is similar to dev_strategy()
but assumes that a thread reference was already obtained on the passed
device.  Use the function from physio(), to avoid two extra dev_mtx
lock and unlock.  Note that physio() is always used as the cdevsw
method, or is called from a cdevsw method, and the caller already owns
the reference.

dev_strategy() is left to keep KPI intact, but now it is implemented
as a wrapper around dev_strategy_csw().

Do some style cleanup in physio().

Requested and reviewed by:	kan (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-03-27 11:34:27 +00:00
Konstantin Belousov
88c8c0a70f On i386, double the default size of the bio transient map. With the
maxbcache size fixed, the auto-tuned transient map is too small for
real-world load on i386.

Tested by:	David Wolfskill
Sponsored by:	The FreeBSD Foundation
2013-03-27 10:56:15 +00:00
Alexander Kabaev
31932fae1e Do not pass unmapped buffers to drivers that cannot handle them
In physio, check if device can handle unmapped IO and pass an
appropriately mapped buffer to the driver strategy routine. The
only driver in the tree that can handle unmapped buffers is one
exposed by GEOM, so mark it as such with the new flag in the
driver cdevsw structure.

This fixes insta-panics on hosts, running dconschat, as /dev/fwmem
is an example of the driver that makes use of physio routine, but
bypasses the g_down thread, where the buffer gets mapped normally.

Discussed with: kib (earlier version)
2013-03-26 01:17:06 +00:00
Davide Italiano
3f321a4eac Cache the callout precision argument as part of the informations required
for migrating callouts to new CPU. This value is passed to
callout_cc_add() in order to update properly precision field in case of
rescheduling/migration.

Reviewed by:	mav
2013-03-25 09:43:50 +00:00
Will Andrews
fdbc71742b Extend taskqueue(9) to enable per-taskqueue callbacks.
The scope of these callbacks is primarily to support actions that affect the
taskqueue's thread environments.  They are entirely optional, and
consequently are introduced as a new API: taskqueue_set_callback().

This interface allows the caller to specify that a taskqueue requires a
callback and optional context pointer for a given callback type.

The callback types included in this commit can be used to register a
constructor and destructor for thread-local storage using osd(9).  This
allows a particular taskqueue to define that its threads require a specific
type of TLS, without the need for a specially-orchestrated task-based
mechanism for startup and shutdown in order to accomplish it.

Two callback types are supported at this point:

- TASKQUEUE_CALLBACK_TYPE_INIT, called by every thread when it starts, prior
  to processing any tasks.
- TASKQUEUE_CALLBACK_TYPE_SHUTDOWN, called by every thread when it exits,
  after it has processed its last task but before the taskqueue is
  reclaimed.

While I'm here:

- Add two new macros, TQ_ASSERT_LOCKED and TQ_ASSERT_UNLOCKED, and use them
  in appropriate locations.
- Fix taskqueue.9 to mention taskqueue_start_threads(), which is a required
  interface for all consumers of taskqueue(9).

Reviewed by:	kib (all), eadler (taskqueue.9), brd (taskqueue.9)
Approved by:	ken (mentor)
Sponsored by:	Spectra Logic
MFC after:	1 month
2013-03-23 15:11:53 +00:00
Andriy Gapon
ca84e042a3 post mountroot event after a real/final root is mounted
not every time an intermediate root (including the first devfs) is
mounted.
This is also consistent with waking up via root_mount_complete.

Reviewed by:	jhb
MFC after:	13 days
2013-03-23 08:59:34 +00:00
Pawel Jakub Dawidek
051a23d4e8 - Constify local path variable for chflagsat().
- Use correct format characters (%lx) for u_long.

This fixes the build broken in r248599.
2013-03-22 07:40:34 +00:00
Pawel Jakub Dawidek
5d46382415 Regenerate after r248599.
Sponsored by:	The FreeBSD Foundation
2013-03-21 23:02:19 +00:00
Pawel Jakub Dawidek
e948704e4b Implement chflagsat(2) system call, similar to fchmodat(2), but operates on
file flags.

Reviewed by:	kib, jilles
Sponsored by:	The FreeBSD Foundation
2013-03-21 22:59:01 +00:00
Pawel Jakub Dawidek
14cd1ffdf8 Regenerate after r248597.
Sponsored by:	The FreeBSD Foundation
2013-03-21 22:47:03 +00:00
Pawel Jakub Dawidek
b4b2596b97 - Make 'flags' argument to chflags(2), fchflags(2) and lchflags(2) of type
u_long. Before this change it was of type int for syscalls, but prototypes
  in sys/stat.h and documentation for chflags(2) and fchflags(2) (but not
  for lchflags(2)) stated that it was u_long. Now some related functions
  use u_long type for flags (strtofflags(3), fflagstostr(3)).
- Make path argument of type 'const char *' for consistency.

Discussed on:	arch
Sponsored by:	The FreeBSD Foundation
2013-03-21 22:44:33 +00:00
Jilles Tjoelker
46f10cc265 Allow O_CLOEXEC in posix_openpt() flags.
PR:		kern/162374
Reviewed by:	ed
2013-03-21 21:39:15 +00:00
Attilio Rao
d52d7aa871 Fix a bug in UMTX_PROFILING:
UMTX_PROFILING should really analyze the distribution of locks as they
index entries in the umtxq_chains hash-table.
However, the current implementation does add/dec the length counters
for *every* thread insert/removal, measuring at all really userland
contention and not the hash distribution.

Fix this by correctly add/dec the length counters in the points where
it is really needed.

Please note that this bug brought us questioning in the past the quality
of the umtx hash table distribution.
To date with all the benchmarks I could try I was not able to reproduce
any issue about the hash distribution on umtx.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	jeff, davide
MFC after:	2 weeks
2013-03-21 19:58:25 +00:00
John Baldwin
d071a6fa33 Another NFS SIGSTOP related fix: Ignore thread suspend requests due to
SIGSTOP if stop signals are currently deferred.  This can occur if a
process is stopped via SIGSTOP while a thread is running or runnable
but before it has set TDF_SBDRY.

Tested by:	pho
Reviewed by:	kib
MFC after:	1 week
2013-03-21 14:06:27 +00:00
Konstantin Belousov
7db07e1c85 Only size and create the bio_transient_map when unmapped buffers are
enabled.  Now, disabling the unmapped buffers should result in the
kernel memory map identical to pre-r248550.

Sponsored by:	The FreeBSD Foundation
2013-03-21 07:28:15 +00:00
Konstantin Belousov
e3269b5096 In bufwrite(), a dirty buffer is moved to the clean queue before the
bufobj counter of the writes in progress is incremented.  Other thread
inspecting the bufobj would consider it clean.

For the regular vnodes, the vnode lock is typically held both by the
thread performing the bufwrite() and an other thread doing syncing,
which prevents the situation.  On the other hand, writes to the VCHR
vnodes are done without holding vnode lock.

Increment the write ref counter for the buffer object before calling
bundirty().

Sponsored by:	The FreeBSD Foundation
Tested by:	pho
MFC after:	2 weeks
2013-03-20 21:08:00 +00:00
Konstantin Belousov
8d6884ce9c When the journaled FFS volume is suspended due to the journal space
becoming too low, the softdep flush thread processes the workitems,
which frees the space in journal, and then unsuspends the fs.  The
softdep_flush() and other workitem processing functions busy the
filesystem before iterating over the worklist, to prevent the parallel
unmount from freeing the mount data. The vfs_busy() is called with
MBF_NOWAIT flag.

Now, if the unmount is already started and the filesystem is suspended
due to low journal space, the journal is never flushed and filesystem
is never unsuspended, because vfs_busy(MBF_NOWAIT) call cannot succeed
for the unmounting fs, and softdep_flush() does not process the
workitems. Unmount needs to write metadata, where it hangs in the
"suspfs" state.

Move the vn_start_write() call in the dounmount() before setting the
MNTK_UNMOUNT flag. This practically ensures that softdep_flush()
processed the pending journal writes by making dounmount() wait for
the lift of the suspension.

Sponsored by:	The FreeBSD Foundation
Reported and tested by:	pho
MFC after:	2 weeks
2013-03-20 21:07:49 +00:00
Kirk McKusick
3289d5877a When renaming a directory from one parent directory to another,
we need to call ufs_checkpath() to walk from our new location to
the root of the filesystem to ensure that we do not encounter
ourselves along the way. Until now, we accomplished this by reading
the ".." entries of each directory in our path until we reached
the root (or encountered an error). This change tries to avoid the
I/O of reading the ".." entries by first looking them up in the
name cache and only doing the I/O when the name cache lookup fails.

Reviewed by: kib
Tested by:   Peter Holm
MFC after:   4 weeks
2013-03-20 17:57:00 +00:00
Jilles Tjoelker
c2e3c52e0d Implement SOCK_CLOEXEC, SOCK_NONBLOCK and MSG_CMSG_CLOEXEC.
This change allows creating file descriptors with close-on-exec set in some
situations. SOCK_CLOEXEC and SOCK_NONBLOCK can be OR'ed in socket() and
socketpair()'s type parameter, and MSG_CMSG_CLOEXEC to recvmsg() makes file
descriptors (SCM_RIGHTS) atomically close-on-exec.

The numerical values for SOCK_CLOEXEC and SOCK_NONBLOCK are as in NetBSD.
MSG_CMSG_CLOEXEC is the first free bit for MSG_*.

The SOCK_* flags are not passed to MAC because this may cause incorrect
failures and can be done later via fcntl() anyway. On the other hand, audit
is expected to cope with the new flags.

For MSG_CMSG_CLOEXEC, unp_externalize() is extended to take a flags
argument.

Reviewed by:	kib
2013-03-19 20:58:17 +00:00
Konstantin Belousov
e81ff91e62 Do not remap usermode pages into KVA for physio.
Sponsored by:	The FreeBSD Foundation
Tested by:	pho
2013-03-19 14:43:57 +00:00
Konstantin Belousov
7d5365c70b Add a helper function vfs_bio_bzero_buf() to zero the portion of the
buffer, transparently handling mapped or unmapped buffers.  Its intent
is to replace the use of bzero(bp->b_data) in cases where the buffer
might be unmapped, to avoid unneeded upgrades.

Sponsored by:	The FreeBSD Foundation
Tested by:	pho
2013-03-19 14:27:14 +00:00
Konstantin Belousov
ee75e7de7b Implement the concept of the unmapped VMIO buffers, i.e. buffers which
do not map the b_pages pages into buffer_map KVA.  The use of the
unmapped buffers eliminate the need to perform TLB shootdown for
mapping on the buffer creation and reuse, greatly reducing the amount
of IPIs for shootdown on big-SMP machines and eliminating up to 25-30%
of the system time on i/o intensive workloads.

The unmapped buffer should be explicitely requested by the GB_UNMAPPED
flag by the consumer.  For unmapped buffer, no KVA reservation is
performed at all. The consumer might request unmapped buffer which
does have a KVA reserve, to manually map it without recursing into
buffer cache and blocking, with the GB_KVAALLOC flag.

When the mapped buffer is requested and unmapped buffer already
exists, the cache performs an upgrade, possibly reusing the KVA
reservation.

Unmapped buffer is translated into unmapped bio in g_vfs_strategy().
Unmapped bio carry a pointer to the vm_page_t array, offset and length
instead of the data pointer.  The provider which processes the bio
should explicitely specify a readiness to accept unmapped bio,
otherwise g_down geom thread performs the transient upgrade of the bio
request by mapping the pages into the new bio_transient_map KVA
submap.

The bio_transient_map submap claims up to 10% of the buffer map, and
the total buffer_map + bio_transient_map KVA usage stays the
same. Still, it could be manually tuned by kern.bio_transient_maxcnt
tunable, in the units of the transient mappings.  Eventually, the
bio_transient_map could be removed after all geom classes and drivers
can accept unmapped i/o requests.

Unmapped support can be turned off by the vfs.unmapped_buf_allowed
tunable, disabling which makes the buffer (or cluster) creation
requests to ignore GB_UNMAPPED and GB_KVAALLOC flags.  Unmapped
buffers are only enabled by default on the architectures where
pmap_copy_page() was implemented and tested.

In the rework, filesystem metadata is not the subject to maxbufspace
limit anymore. Since the metadata buffers are always mapped, the
buffers still have to fit into the buffer map, which provides a
reasonable (but practically unreachable) upper bound on it. The
non-metadata buffer allocations, both mapped and unmapped, is
accounted against maxbufspace, as before. Effectively, this means that
the maxbufspace is forced on mapped and unmapped buffers separately.
The pre-patch bufspace limiting code did not worked, because
buffer_map fragmentation does not allow the limit to be reached.

By Jeff Roberson request, the getnewbuf() function was split into
smaller single-purpose functions.

Sponsored by:	The FreeBSD Foundation
Discussed with:	jeff (previous version)
Tested by:	pho, scottl (previous version), jhb, bf
MFC after:	2 weeks
2013-03-19 14:13:12 +00:00
John Baldwin
1968f37bc9 Tweak some comments. 2013-03-18 18:04:09 +00:00
John Baldwin
3cf3b9f097 Partially revert r195702. Deferring stops is now implemented via a set of
calls to toggle TDF_SBDRY rather than passing PBDRY to individual sleep
calls.
- Remove the stop_allowed parameters from cursig() and issignal().
  issignal() checks TDF_SBDRY directly.
- Remove the PBDRY and SLEEPQ_STOP_ON_BDRY flags.
2013-03-18 17:23:58 +00:00
Gleb Smirnoff
4f67e14304 In m_align() add assertions that mbuf is virgin, similar to assertions
in M_ALIGN(), MH_ALIGN, MEXT_ALIGN() macros.
2013-03-17 07:41:14 +00:00
Pawel Jakub Dawidek
943c3bb968 Require CAP_SEEK if both O_APPEND and O_TRUNC flags are absent.
In other words we don't require CAP_SEEK if either O_APPEND or O_TRUNC flag is
given, because O_APPEND doesn't allow to overwrite existing data and O_TRUNC
requires CAP_FTRUNCATE already.

Sponsored by:	The FreeBSD Foundation
2013-03-16 23:19:13 +00:00
Pawel Jakub Dawidek
d6b2bd0bc9 Style: Whitespace fixes. 2013-03-16 22:37:30 +00:00
Pawel Jakub Dawidek
1ea67dd9e5 Style: Remove redundant space. 2013-03-16 22:36:24 +00:00
Gleb Smirnoff
c95be8b536 - Replace compat macros with function calls.
- Remove superfluous cleaning of m_len after allocating.

Sponsored by:	Nginx, Inc.
2013-03-16 08:57:36 +00:00
Gleb Smirnoff
5368b81eb0 Contrary to what the deleted comment said, the m_move_pkthdr()
will not smash the M_EXT and data pointer, so it is safe to
pass an mbuf with external storage procuded by m_getcl() to
m_move_pkthdr().

Reviewed by:	andre
Sponsored by:	Nginx, Inc.
2013-03-16 08:55:21 +00:00
Pawel Jakub Dawidek
c9cea47007 Sort syscalls properly. 2013-03-15 23:00:13 +00:00
Konstantin Belousov
aed5a114d7 Separate the copyright lines and the informational block by a blank line.
Requested by:	joel
MFC after:	2 weeks
2013-03-15 14:01:37 +00:00
Konstantin Belousov
5791cee883 Add my copyright for the 2012 year work, in particular vn_io_fault()
and f_offset locking.  Add required Foundation notice for r248319.

Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-03-15 12:57:30 +00:00
Konstantin Belousov
5f5f055441 Implement the helper function vn_io_fault_pgmove(), intended to use by
the filesystem VOP_READ() and VOP_WRITE() implementations in the same
way as vn_io_fault_uiomove() over the unmapped buffers.  Helper
provides the convenient wrapper over the pmap_copy_pages() for struct
uio consumers, taking care of the TDP_UIOHELD situations.

Sponsored by:	The FreeBSD Foundation
Tested by:	pho
MFC after:	2 weeks
2013-03-15 11:16:12 +00:00
Gleb Smirnoff
c8b59ea750 Use m_get() and m_getcl() instead of compat macros. 2013-03-15 10:21:18 +00:00
Gleb Smirnoff
93cfe76349 - Use m_get2() instead of hand allocating.
- No need for u_int cast here.

Sponsored by:	Nginx, Inc.
2013-03-15 10:17:24 +00:00
Gleb Smirnoff
3112ae7644 Make m_get2() never use clusters that are bigger than PAGE_SIZE.
Requested by:	andre, jhb
Sponsored by:	Nginx, Inc.
2013-03-15 10:15:07 +00:00
Edward Tomasz Napierala
a8efb53478 When throttling a process to enforce RACCT limits, do not use neither
PBDRY (which simply doesn't make any sense) nor PCATCH (which could
be used by a malicious process to work around the PCPU limit).

Submitted by:	Rudo Tomori
Reviewed by:	kib
2013-03-14 23:25:42 +00:00
Edward Tomasz Napierala
16befafd16 Accessing td_state requires thread lock to be held.
Submitted by:	Rudo Tomori
Reviewed by:	kib
2013-03-14 23:20:18 +00:00
Konstantin Belousov
70e198dd07 Some style fixes.
Sponsored by:	The FreeBSD Foundation
2013-03-14 20:31:39 +00:00
Konstantin Belousov
c535690b33 Add currently unused flag argument to the cluster_read(),
cluster_write() and cluster_wbuild() functions.  The flags to be
allowed are a subset of the GB_* flags for getblk().

Sponsored by:	The FreeBSD Foundation
Tested by:	pho
2013-03-14 20:28:26 +00:00
Konstantin Belousov
a1143a3ba8 Rewrite the vfs_bio_clrbuf(9) to not access the b_data for B_VMIO
buffers directly, use pmap_zero_page_area(9) for each zeroing page
region instead.

Sponsored by:	The FreeBSD Foundation
Tested by:	pho
MFC after:	2 weeks
2013-03-14 19:48:25 +00:00
Tijl Coosemans
d19d5bf443 - Fix two possible overflows when testing if ELF program headers are on
the first page:
  1. Cast uint16_t operands in a multiplication to unsigned int because
     otherwise the implicit promotion to int results in a signed
     multiplication that can overflow and the behaviour on integer
     overflow is undefined.
  2. Replace (offset + size > PAGE_SIZE) with (size > PAGE_SIZE - offset)
     because the sum may overflow.
- Use the same tests to see if the path to the interpreter is on the first
  page. There's no overflow here because size is already limited by
  MAXPATHLEN, but the compiler optimises the new tests better. Also fix an
  off-by-one error.
- Simplify tests to see if an ELF note program header is on the first page.
  This also fixes an off-by-one error.

Reviewed by:	kib
MFC after:	1 week
2013-03-13 22:01:31 +00:00