Commit Graph

13459 Commits

Author SHA1 Message Date
Sean Bruno
d3baefa809 Change len checks for fstypelen and fspathlen to be against absolute len
not strlen as they are *not* strings.

Discovered by GSOC student, Mike Ma <mikemandarine@gmail.com> during his
fuse.glusterfs port to FreeBSD.

Final patch from mckusick@

Submitted by:	mckusick@
Approved by:	re (hrs)
MFC after:	2 weeks
2013-10-03 22:52:03 +00:00
Konstantin Belousov
432e79fc33 When helping the bufdaemon from the buffer allocation context, there
is no sense to walk the whole dirty buffer queue.  We are only
interested in, and can operate on, the buffers owned by the current
vnode [1].  Instead of calling generic queue flush routine, do
VOP_FSYNC() if possible.

Holding the dirty buffer queue lock in the bufdaemon, without dropping
it, can cause starvation of buffer writes from other threads. This is
esp. easy to reproduce on the big memory machines, where large files
are written, causing almost all dirty buffers accumulating in several
big files, which vnodes are locked by writers. Bufdaemon cannot flush
any buffer, but is iterating over the whole dirty queue
continuously. Since dirty queue mutex is not dropped, bufdone() in
g_up thread is starved, usually deadlocking the machine [2]. Mitigate
this by dropping the queue lock after the vnode is locked, allowing
other queue lock contenders to make a progress.

Discussed with:	Jeff [1]
Reported by:	pho [2]
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Approved by:	re (hrs)
2013-10-02 06:00:34 +00:00
Konstantin Belousov
d6498b153e When printing the vnode information from ddb, print the lengths of the
dirty and clean buffer queues.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (gjb)
2013-10-01 20:18:33 +00:00
Konstantin Belousov
fe39412e99 For vunref(), try to upgrade the vnode lock if the function was called
with the vnode shared-locked.  If upgrade succeeded, the inactivation
can be done immediately, instead of being postponed.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (glebius)
2013-09-29 18:07:14 +00:00
Konstantin Belousov
ac34145005 Reimplement r255797 using LK_TRYUPGRADE.
The  r255797 was:
Increase the chance of the buffer write from the bufdaemon helper
context to succeed.  If the locked vnode which owns the buffer to be
written is shared locked, try the non-blocking upgrade of the lock to
exclusive.

PR:	kern/178997
Reported and tested by:	Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (glebius)
2013-09-29 18:04:57 +00:00
Konstantin Belousov
7c6fe80353 Add LK_TRYUPGRADE operation for lockmgr(9), which attempts to
atomically upgrade shared lock to exclusive.  On failure, error is
returned and lock is not dropped in the process.

Tested by:      pho (previous version)
No objections from:     attilio
Sponsored by:   The FreeBSD Foundation
MFC after:      1 week
Approved by:	re (glebius)
2013-09-29 18:02:23 +00:00
John-Mark Gurney
da9442ef43 it must be the last member, not might...
Reviewed by:	attilio
Approved by:	re (delphij, gjb)
2013-09-26 17:55:04 +00:00
Konstantin Belousov
9d2abcd01a Do not allow negative timeouts for kqueue timers, check for the
negative timeout both before and after the conversion to sbintime_t.

For periodic kqueue timer, convert zero timeout into 1ms, to avoid
interrupt storm on fast event timers.

Reported and tested by:	pho
Discussed with:	mav
Reviewed by:	davide
Sponsored by:	The FreeBSD Foundation
Approved by:	re (marius)
2013-09-26 13:17:31 +00:00
Konstantin Belousov
27884e3bd1 Acquire a hold reference on the vnode when a knote is instantiated.
Otherwise, knote keeps a pointer to a vnode which could become invalid
any time.

Reported by:	many
Tested by:	Patrick Lamaiziere <patfbsd@davenulle.org>
Discussed with:	jmg
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (marius)
2013-09-26 13:14:51 +00:00
Davide Italiano
1b0c144fc2 Make the callout arithmetic more robust adding checks for overflow.
Without these, if the timeout value passed is "large enough", the
value of the sum of it and other factors (e.g. current time as
returned by sbinuptime() or 'precision' argument) might result in a
negative number. This negative number is then passed to
eventtimers(4), which causes et_start() routine to load et_min_period
into eventtimer, making the CPU where the thread is stuck forever in
timer interrupt handler routine. This is now avoided rounding to
INT64_MAX the timeout period in case of overflow.

Reported by:	kib, pho
Discussed with:	kib, mav
Tested by:	pho (stress2 suite, kevent7.sh scenario)
Approved by:	re (kib)
2013-09-26 10:06:50 +00:00
Attilio Rao
57a9eeb4ed Avoid memory accesses reordering which can result in fget_unlocked()
seeing a stale fd_ofiles table once fd_nfiles is already updated,
resulting in OOB accesses.

Approved by:	re (kib)
Sponsored by:	EMC / Isilon storage division
Reported and tested by:	pho
Reviewed by:	benno
2013-09-25 13:37:52 +00:00
Alexander Motin
ea4af9c09a Make load average sampling asynchronous to hardclock ticks. This improves
measurement of load caused by time-related events still using hardclock.
For example, without this change dummynet, scheduling events each hardclock
tick, was always miscounted as load of 1.

There is still aliasing with events delayed by the new precision mechanism,
but it probably can't be avoided without moving this sampling from using
callout to some lower-level code or handling it in some other special way.

Reviewed by:	davide
Approved by:	re (marius)
2013-09-24 07:03:16 +00:00
Dag-Erling Smørgrav
0f7bc112c0 Always request zeroed memory, in case we're dumb enough to leak it later.
Approved by:	re (gjb)
2013-09-22 23:47:56 +00:00
Konstantin Belousov
12af71a69f Revert r255797. The LK_UPGRADE | LK_NOWAIT drops the lock.
Approved by:	re (marius, implicit)
2013-09-22 20:29:03 +00:00
Konstantin Belousov
19f6a6a1ca Pre-acquire the filedesc sx when a possibility exists that the later
code could need to remove a kqueue from the filedesc list.  Global
lock is already locked, which causes sleepable after non-sleepable
lock acquisition.

Reported and tested by:	pho
Reviewed by:	jmg
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Approved by:	re (gjb)
2013-09-22 19:54:47 +00:00
Konstantin Belousov
d1f8ca485d Increase the chance of the buffer write from the bufdaemon helper
context to succeed.  If the locked vnode which owns the buffer to be
written is shared locked, try the non-blocking upgrade of the lock to
exclusive.

PR:	kern/178997
Reported and tested by:	Klaus Weber <fbsd-bugs-2013-1@unix-admin.de>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (marius)
2013-09-22 19:23:48 +00:00
Davide Italiano
cf6b879fad Consistently use the same value to indicate exclusively-held and
shared-held locks for all the primitives in lc_lock/lc_unlock routines.
This fixes the problems introduced in r255747, which indeed introduced an
inversion in the logic.

Reported by:	many
Tested by:	bdrewery, pho, lme, Adam McDougall, O. Hartmann
Approved by:	re (glebius)
2013-09-22 14:09:07 +00:00
Gleb Smirnoff
255c1caae3 - Create kern.ipc.sendfile namespace, and put the new "readhead" OID
there as "kern.ipc.sendfile.readahead".
- Push all nsfbuf related tunables into MD code. Don't move them
  to new namespace in favor of POLA.

Reviewed by:	scottl
Approved by:	re (gjb)
2013-09-22 13:36:52 +00:00
Justin T. Gibbs
255424ddb7 Fix ia64 and mips kernel builds due to XENHVM=>GENERIC integration in
revision 255744.

sys/kern/subr_smp.c:
	IPI_SUSPEND is only available on amd64 and i386.  Protect
	new uses of this constant with #ifdefs to avoid impacting
	other platforms.

Approved by:	re (blanket Xen)
2013-09-22 02:46:13 +00:00
Mark Johnston
8d305ba0dc Regenerate syscall argument strings after r255777.
Approved by:	re (gjb)
MFC after:	1 week
2013-09-21 23:06:36 +00:00
Mark Johnston
f17f2ffcdd Omit "__restrict" when generating syscall argument strings. DTrace doesn't
handle it and cannot determine the argument type when it's present.

Approved by:	re (gjb)
MFC after:	1 week
2013-09-21 23:05:44 +00:00
Davide Italiano
1f96759fb1 Fix callout_init_rm() in the shared case, allocating storage for 'struct
rm_priotracker' directly in the softclock thread. Now consumers can
pass CALLOUT_SHAREDLOCK flag to callout initialization routine safely.
The choice of the already existing flags  instead of special casing
shared rmlocks is done to prevent consumer footshooting.

Suggested by:	jhb
Reviewed by:	jhb
Approved by:	re (delphij)
2013-09-20 23:16:15 +00:00
Davide Italiano
7faf4d90e8 Fix lc_lock/lc_unlock() support for rmlocks held in shared mode. With
current lock classes KPI it was really difficult because there was no
way to pass an rmtracker object to the lock/unlock routines. In order
to accomplish the task, modify the aforementioned functions so that
they can return (or pass as argument) an uinptr_t, which is in the rm
case used to hold a pointer to struct rm_priotracker for current
thread. As an added bonus, this fixes rm_sleep() in the rm shared
case, which right now can communicate priotracker structure between
lc_unlock()/lc_lock().

Suggested by:	jhb
Reviewed by:	jhb
Approved by:	re (delphij)
2013-09-20 23:06:21 +00:00
Justin T. Gibbs
566a5f5020 Merge Xen PVHVM support into the GENERIC kernel config for both
amd64 and i386.

Submitted by:	Roger Pau Monné
Sponsored by:	Citrix Systems R&D
Reviewed by:	gibbs
Approved by:	re (blanket Xen)
MFC after:	2 weeks

sys/amd64/amd64/mp_machdep.c:
sys/amd64/include/cpu.h:
sys/i386/i386/mp_machdep.c:
sys/i386/include/cpu.h:
	- Introduce two new CPU hooks for initialization and resume
	  purposes. This allows us to get rid of the XENHVM ifdefs in
	  mp_machdep, and also sets some hooks into common code that can be
	  used by other hypervisor implementations.

sys/amd64/conf/XENHVM:
sys/i386/conf/XENHVM:
	- Remove these configs now that GENERIC has builtin support for Xen
	  HVM.

sys/kern/subr_smp.c:
	- Make sure there are no pending IPIs when suspending a system.

sys/x86/xen/hvm.c:
	- Add cpu init and resume vectors that are called from mp_machdep
	  using the new hooks.
	- Only clear the vcpu_info mapping data on resume.  It is already
	  clear for the BSP on a cold boot and is set correctly as APs
	  are started.
	- Gate xen_hvm_init_cpu only to systems running under Xen.

sys/x86/xen/xen_intr.c:
	 - Gate the setup of event channels only to systems running under Xen.
2013-09-20 22:59:22 +00:00
Justin T. Gibbs
428b7ca290 Add support for suspend/resume/migration operations when running as a
Xen PVHVM guest.

Submitted by:	Roger Pau Monné
Sponsored by:	Citrix Systems R&D
Reviewed by:	gibbs
Approved by:	re (blanket Xen)
MFC after:	2 weeks

sys/amd64/amd64/mp_machdep.c:
sys/i386/i386/mp_machdep.c:
	- Make sure that are no MMU related IPIs pending on migration.
	- Reset pending IPI_BITMAP on resume.
	- Init vcpu_info on resume.

sys/amd64/include/intr_machdep.h:
sys/i386/include/intr_machdep.h:
sys/x86/acpica/acpi_wakeup.c:
sys/x86/x86/intr_machdep.c:
sys/x86/isa/atpic.c:
sys/x86/x86/io_apic.c:
sys/x86/x86/local_apic.c:
	- Add a "suspend_cancelled" parameter to pic_resume().  For the
	  Xen PIC, restoration of interrupt services differs between
	  the aborted suspend and normal resume cases, so we must provide
	  this information.

sys/dev/acpica/acpi_timer.c:
sys/dev/xen/timer/timer.c:
sys/timetc.h:
	- Don't swap out "suspend safe" timers across a suspend/resume
	  cycle.  This includes the Xen PV and ACPI timers.

sys/dev/xen/control/control.c:
	- Perform proper suspend/resume process for PVHVM:
		- Suspend all APs before going into suspension, this allows us
		  to reset the vcpu_info on resume for each AP.
		- Reset shared info page and callback on resume.

sys/dev/xen/timer/timer.c:
	- Implement suspend/resume support for the PV timer. Since FreeBSD
	  doesn't perform a per-cpu resume of the timer, we need to call
	  smp_rendezvous in order to correctly resume the timer on each CPU.

sys/dev/xen/xenpci/xenpci.c:
	- Don't reset the PCI interrupt on each suspend/resume.

sys/kern/subr_smp.c:
	- When suspending a PVHVM domain make sure there are no MMU IPIs
	  in-flight, or we will get a lockup on resume due to the fact that
	  pending event channels are not carried over on migration.
	- Implement a generic version of restart_cpus that can be used by
	  suspended and stopped cpus.

sys/x86/xen/hvm.c:
	- Implement resume support for the hypercall page and shared info.
	- Clear vcpu_info so it can be reset by APs when resuming from
	  suspension.

sys/dev/xen/xenpci/xenpci.c:
sys/x86/xen/hvm.c:
sys/x86/xen/xen_intr.c:
	- Support UP kernel configurations.

sys/x86/xen/xen_intr.c:
	- Properly rebind per-cpus VIRQs and IPIs on resume.
2013-09-20 05:06:03 +00:00
John Baldwin
a566e8e3c5 Regen.
Approved by:	re (delphij)
2013-09-19 18:56:00 +00:00
John Baldwin
55648840de Extend the support for exempting processes from being killed when swap is
exhausted.
- Add a new protect(1) command that can be used to set or revoke protection
  from arbitrary processes.  Similar to ktrace it can apply a change to all
  existing descendants of a process as well as future descendants.
- Add a new procctl(2) system call that provides a generic interface for
  control operations on processes (as opposed to the debugger-specific
  operations provided by ptrace(2)).  procctl(2) uses a combination of
  idtype_t and an id to identify the set of processes on which to operate
  similar to wait6().
- Add a PROC_SPROTECT control operation to manage the protection status
  of a set of processes.  MADV_PROTECT still works for backwards
  compatability.
- Add a p_flag2 to struct proc (and a corresponding ki_flag2 to kinfo_proc)
  the first bit of which is used to track if P_PROTECT should be inherited
  by new child processes.

Reviewed by:	kib, jilles (earlier version)
Approved by:	re (delphij)
MFC after:	1 month
2013-09-19 18:53:42 +00:00
Pawel Jakub Dawidek
3fded357af Fix panic in ktrcapfail() when no capability rights are passed.
While here, correct all consumers to pass NULL instead of 0 as we pass
capability rights as pointers now, not uint64_t.

Reported by:	Daniel Peyrolon
Tested by:	Daniel Peyrolon
Approved by:	re (marius)
2013-09-18 19:26:08 +00:00
Roman Divacky
b12698e1a1 Revert r255672, it has some serious flaws, leaking file references etc.
Approved by:	re (delphij)
2013-09-18 18:48:33 +00:00
Roman Divacky
253c75c0de Implement epoll support in Linuxulator. This is a tiny wrapper around kqueue
to implement epoll subset of functionality. The kqueue user data are 32bit
on i386 which is not enough for epoll user data so this patch overrides
kqueue fileops to maintain enough space in struct file.

Initial patch developed by me in 2007 and then extended and finished
by Yuri Victorovich.

Approved by:    re (delphij)
Sponsored by:   Google Summer of Code
Submitted by:   Yuri Victorovich <yuri at rawbw dot com>
Tested by:      Yuri Victorovich <yuri at rawbw dot com>
2013-09-18 17:56:04 +00:00
Gleb Smirnoff
85fdd534cf Fix assertion in sendfile_readpage() to assert only the validity
of requested amount of data in a page. Move assertion down below
object unlock.

Approved by:	re (kib)
Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2013-09-17 06:37:21 +00:00
Konstantin Belousov
3846a82284 Remove zero-copy sockets code. It only worked for anonymous memory,
and the equivalent functionality is now provided by sendfile(2) over
posix shared memory filedescriptor.

Remove the cow member of struct vm_page, and rearrange the remaining
members.  While there, make hold_count unsigned.

Requested and reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Approved by:	re (delphij)
2013-09-16 06:25:54 +00:00
Konstantin Belousov
e8de242d3a Use TAILQ instead of STAILQ for kqeueue filedescriptors to ensure constant
time removal on kqueue close.

Reported and tested by:	pho
Reviewed by:	jmg
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (delphij)
2013-09-13 19:50:50 +00:00
Konstantin Belousov
9bec6325ad When opening or closing fifo, ensure that the vnode is locked
exclusively.  Filesystems are assumed to disable shared locking for
the fifo vnode locks, but some do not.

Reported and tested by:	olgeni
Discussed with:	avg
Sponsored by:   The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (glebius)
2013-09-13 06:52:23 +00:00
Konstantin Belousov
8740a7112e Reduce the scope of the proctree_lock. If several processes cause
continuous calls to the uprintf(9), the proctree_lock could be
shared-locked for indefinite amount of time, starving exclusive
requests. Since proctree_lock is needed for fork() and exit(), this
effectively stops the machine.

While there, do the similar reduction for tprintf(9).

Reported and tested by: pho
Reviewed by:    ed
Sponsored by:   The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (glebius)
2013-09-13 06:39:10 +00:00
John Baldwin
eb2e5544d3 Regen.
Approved by:	re (kib)
2013-09-12 18:03:51 +00:00
John Baldwin
ed749cf183 Fix the type of the idtype argument to wait6() in syscalls.master.
(Accidentally missed this in the previous commit)

Approved by:	re (kib)
MFC after:	1 week
2013-09-12 18:01:13 +00:00
John Baldwin
84c21af119 Fix the type of the idtype argument to wait6() in syscalls.master.
Approved by:	re (kib)
MFC after:	1 week
2013-09-12 17:52:18 +00:00
Gleb Smirnoff
e06432800f Provide pr_ctloutput method for AF_LOCAL/SOCK_SEQPACKET sockets.
This makes setsockopt() on them working.

Reported by:	Yuri <yuri rawbw.com>
Approved by:	re (kib)
2013-09-11 18:22:30 +00:00
Konstantin Belousov
64c5de5483 Fix build with gcc.
Build-tested by:	gjb
Approved by:	re (glebius)
2013-09-11 17:31:22 +00:00
Konstantin Belousov
227aaa86ed Implement sendfile(2) for the posix shared memory segment file descriptor,
in addition to the regular files.

Requested by:	alc
Discussed with:	emaste
Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
Approved by:	re (hrs)
2013-09-11 06:41:15 +00:00
Dag-Erling Smørgrav
1a05c762b9 Fix the length calculation for the final block of a sendfile(2)
transmission which could be tricked into rounding up to the nearest
page size, leaking up to a page of kernel memory.  [13:11]

In IPv6 and NetATM, stop SIOCSIFADDR, SIOCSIFBRDADDR, SIOCSIFDSTADDR
and SIOCSIFNETMASK at the socket layer rather than pass them on to the
link layer without validation or credential checks.  [SA-13:12]

Prevent cross-mount hardlinks between different nullfs mounts of the
same underlying filesystem.  [SA-13:13]

Security:	CVE-2013-5666
Security:	FreeBSD-SA-13:11.sendfile
Security:	CVE-2013-5691
Security:	FreeBSD-SA-13:12.ifioctl
Security:	CVE-2013-5710
Security:	FreeBSD-SA-13:13.nullfs
Approved by:	re
2013-09-10 10:05:59 +00:00
John Baldwin
edb572a38c Add a mmap flag (MAP_32BIT) on 64-bit platforms to request that a mapping use
an address in the first 2GB of the process's address space.  This flag should
have the same semantics as the same flag on Linux.

To facilitate this, add a new parameter to vm_map_find() that specifies an
optional maximum virtual address.  While here, fix several callers of
vm_map_find() to use a VMFS_* constant for the findspace argument instead of
TRUE and FALSE.

Reviewed by:	alc
Approved by:	re (kib)
2013-09-09 18:11:59 +00:00
Xin LI
22ecadc03b In r243868, the error message buffer errmsg have been changed from
an on-stack array to a pointer and therefore sizeof(errmsg) would
become 4 or 8 bytes depending on the architecture.

Fix this by using ERRMSGL in place of sizeof().

Submitted by:	J David <j.david.lists@gmail.com>
MFC after:	3 days
Approved by:	re (kib)
2013-09-09 05:01:18 +00:00
Konstantin Belousov
3aaea6efd5 Drain for the xbusy state for two places which potentially do
pmap_remove_all(). Not doing the drain allows the pmap_enter() to
proceed in parallel, making the pmap_remove_all() effects void.

The race results in an invalidated page mapped wired by usermode.

Reported and tested by:	pho
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
Approved by:	re (glebius)
2013-09-08 17:51:22 +00:00
Pawel Jakub Dawidek
013075d557 Sort properly. 2013-09-07 19:16:02 +00:00
Pawel Jakub Dawidek
5a1983cc41 Fix panic in cap_rights_is_valid() when invalid rights are provided -
the right_to_index() function should assert correctness in this case.

Improve other assertions.

Reported by:	pho
Tested by:	pho
2013-09-07 19:03:16 +00:00
Alexander Motin
58909b74b9 Micro-optimize cpu_search(), allowing compiler to use more efficient inline
ffsl() implementation, when it is available, instead of homegrown iteration.

On dual-E5645 amd64 system (2x6x2 cores) under heavy I/O load that reduces
time spent inside cpu_search() from 19% to 13%, while IOPS increased by 5%.
2013-09-07 15:16:30 +00:00
Mark Murray
9d32fc31c7 MFC 2013-09-07 07:58:29 +00:00
Navdeep Parhar
34c916c6d2 Add a vtprintf. It is to tprintf what vprintf is to printf.
Reviewed by:	kib
2013-09-07 07:53:21 +00:00