Commit Graph

13143 Commits

Author SHA1 Message Date
Pawel Jakub Dawidek
14cd1ffdf8 Regenerate after r248597.
Sponsored by:	The FreeBSD Foundation
2013-03-21 22:47:03 +00:00
Pawel Jakub Dawidek
b4b2596b97 - Make 'flags' argument to chflags(2), fchflags(2) and lchflags(2) of type
u_long. Before this change it was of type int for syscalls, but prototypes
  in sys/stat.h and documentation for chflags(2) and fchflags(2) (but not
  for lchflags(2)) stated that it was u_long. Now some related functions
  use u_long type for flags (strtofflags(3), fflagstostr(3)).
- Make path argument of type 'const char *' for consistency.

Discussed on:	arch
Sponsored by:	The FreeBSD Foundation
2013-03-21 22:44:33 +00:00
Jilles Tjoelker
46f10cc265 Allow O_CLOEXEC in posix_openpt() flags.
PR:		kern/162374
Reviewed by:	ed
2013-03-21 21:39:15 +00:00
Attilio Rao
d52d7aa871 Fix a bug in UMTX_PROFILING:
UMTX_PROFILING should really analyze the distribution of locks as they
index entries in the umtxq_chains hash-table.
However, the current implementation does add/dec the length counters
for *every* thread insert/removal, measuring at all really userland
contention and not the hash distribution.

Fix this by correctly add/dec the length counters in the points where
it is really needed.

Please note that this bug brought us questioning in the past the quality
of the umtx hash table distribution.
To date with all the benchmarks I could try I was not able to reproduce
any issue about the hash distribution on umtx.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	jeff, davide
MFC after:	2 weeks
2013-03-21 19:58:25 +00:00
John Baldwin
d071a6fa33 Another NFS SIGSTOP related fix: Ignore thread suspend requests due to
SIGSTOP if stop signals are currently deferred.  This can occur if a
process is stopped via SIGSTOP while a thread is running or runnable
but before it has set TDF_SBDRY.

Tested by:	pho
Reviewed by:	kib
MFC after:	1 week
2013-03-21 14:06:27 +00:00
Konstantin Belousov
7db07e1c85 Only size and create the bio_transient_map when unmapped buffers are
enabled.  Now, disabling the unmapped buffers should result in the
kernel memory map identical to pre-r248550.

Sponsored by:	The FreeBSD Foundation
2013-03-21 07:28:15 +00:00
Konstantin Belousov
e3269b5096 In bufwrite(), a dirty buffer is moved to the clean queue before the
bufobj counter of the writes in progress is incremented.  Other thread
inspecting the bufobj would consider it clean.

For the regular vnodes, the vnode lock is typically held both by the
thread performing the bufwrite() and an other thread doing syncing,
which prevents the situation.  On the other hand, writes to the VCHR
vnodes are done without holding vnode lock.

Increment the write ref counter for the buffer object before calling
bundirty().

Sponsored by:	The FreeBSD Foundation
Tested by:	pho
MFC after:	2 weeks
2013-03-20 21:08:00 +00:00
Konstantin Belousov
8d6884ce9c When the journaled FFS volume is suspended due to the journal space
becoming too low, the softdep flush thread processes the workitems,
which frees the space in journal, and then unsuspends the fs.  The
softdep_flush() and other workitem processing functions busy the
filesystem before iterating over the worklist, to prevent the parallel
unmount from freeing the mount data. The vfs_busy() is called with
MBF_NOWAIT flag.

Now, if the unmount is already started and the filesystem is suspended
due to low journal space, the journal is never flushed and filesystem
is never unsuspended, because vfs_busy(MBF_NOWAIT) call cannot succeed
for the unmounting fs, and softdep_flush() does not process the
workitems. Unmount needs to write metadata, where it hangs in the
"suspfs" state.

Move the vn_start_write() call in the dounmount() before setting the
MNTK_UNMOUNT flag. This practically ensures that softdep_flush()
processed the pending journal writes by making dounmount() wait for
the lift of the suspension.

Sponsored by:	The FreeBSD Foundation
Reported and tested by:	pho
MFC after:	2 weeks
2013-03-20 21:07:49 +00:00
Kirk McKusick
3289d5877a When renaming a directory from one parent directory to another,
we need to call ufs_checkpath() to walk from our new location to
the root of the filesystem to ensure that we do not encounter
ourselves along the way. Until now, we accomplished this by reading
the ".." entries of each directory in our path until we reached
the root (or encountered an error). This change tries to avoid the
I/O of reading the ".." entries by first looking them up in the
name cache and only doing the I/O when the name cache lookup fails.

Reviewed by: kib
Tested by:   Peter Holm
MFC after:   4 weeks
2013-03-20 17:57:00 +00:00
Jilles Tjoelker
c2e3c52e0d Implement SOCK_CLOEXEC, SOCK_NONBLOCK and MSG_CMSG_CLOEXEC.
This change allows creating file descriptors with close-on-exec set in some
situations. SOCK_CLOEXEC and SOCK_NONBLOCK can be OR'ed in socket() and
socketpair()'s type parameter, and MSG_CMSG_CLOEXEC to recvmsg() makes file
descriptors (SCM_RIGHTS) atomically close-on-exec.

The numerical values for SOCK_CLOEXEC and SOCK_NONBLOCK are as in NetBSD.
MSG_CMSG_CLOEXEC is the first free bit for MSG_*.

The SOCK_* flags are not passed to MAC because this may cause incorrect
failures and can be done later via fcntl() anyway. On the other hand, audit
is expected to cope with the new flags.

For MSG_CMSG_CLOEXEC, unp_externalize() is extended to take a flags
argument.

Reviewed by:	kib
2013-03-19 20:58:17 +00:00
Konstantin Belousov
e81ff91e62 Do not remap usermode pages into KVA for physio.
Sponsored by:	The FreeBSD Foundation
Tested by:	pho
2013-03-19 14:43:57 +00:00
Konstantin Belousov
7d5365c70b Add a helper function vfs_bio_bzero_buf() to zero the portion of the
buffer, transparently handling mapped or unmapped buffers.  Its intent
is to replace the use of bzero(bp->b_data) in cases where the buffer
might be unmapped, to avoid unneeded upgrades.

Sponsored by:	The FreeBSD Foundation
Tested by:	pho
2013-03-19 14:27:14 +00:00
Konstantin Belousov
ee75e7de7b Implement the concept of the unmapped VMIO buffers, i.e. buffers which
do not map the b_pages pages into buffer_map KVA.  The use of the
unmapped buffers eliminate the need to perform TLB shootdown for
mapping on the buffer creation and reuse, greatly reducing the amount
of IPIs for shootdown on big-SMP machines and eliminating up to 25-30%
of the system time on i/o intensive workloads.

The unmapped buffer should be explicitely requested by the GB_UNMAPPED
flag by the consumer.  For unmapped buffer, no KVA reservation is
performed at all. The consumer might request unmapped buffer which
does have a KVA reserve, to manually map it without recursing into
buffer cache and blocking, with the GB_KVAALLOC flag.

When the mapped buffer is requested and unmapped buffer already
exists, the cache performs an upgrade, possibly reusing the KVA
reservation.

Unmapped buffer is translated into unmapped bio in g_vfs_strategy().
Unmapped bio carry a pointer to the vm_page_t array, offset and length
instead of the data pointer.  The provider which processes the bio
should explicitely specify a readiness to accept unmapped bio,
otherwise g_down geom thread performs the transient upgrade of the bio
request by mapping the pages into the new bio_transient_map KVA
submap.

The bio_transient_map submap claims up to 10% of the buffer map, and
the total buffer_map + bio_transient_map KVA usage stays the
same. Still, it could be manually tuned by kern.bio_transient_maxcnt
tunable, in the units of the transient mappings.  Eventually, the
bio_transient_map could be removed after all geom classes and drivers
can accept unmapped i/o requests.

Unmapped support can be turned off by the vfs.unmapped_buf_allowed
tunable, disabling which makes the buffer (or cluster) creation
requests to ignore GB_UNMAPPED and GB_KVAALLOC flags.  Unmapped
buffers are only enabled by default on the architectures where
pmap_copy_page() was implemented and tested.

In the rework, filesystem metadata is not the subject to maxbufspace
limit anymore. Since the metadata buffers are always mapped, the
buffers still have to fit into the buffer map, which provides a
reasonable (but practically unreachable) upper bound on it. The
non-metadata buffer allocations, both mapped and unmapped, is
accounted against maxbufspace, as before. Effectively, this means that
the maxbufspace is forced on mapped and unmapped buffers separately.
The pre-patch bufspace limiting code did not worked, because
buffer_map fragmentation does not allow the limit to be reached.

By Jeff Roberson request, the getnewbuf() function was split into
smaller single-purpose functions.

Sponsored by:	The FreeBSD Foundation
Discussed with:	jeff (previous version)
Tested by:	pho, scottl (previous version), jhb, bf
MFC after:	2 weeks
2013-03-19 14:13:12 +00:00
John Baldwin
1968f37bc9 Tweak some comments. 2013-03-18 18:04:09 +00:00
John Baldwin
3cf3b9f097 Partially revert r195702. Deferring stops is now implemented via a set of
calls to toggle TDF_SBDRY rather than passing PBDRY to individual sleep
calls.
- Remove the stop_allowed parameters from cursig() and issignal().
  issignal() checks TDF_SBDRY directly.
- Remove the PBDRY and SLEEPQ_STOP_ON_BDRY flags.
2013-03-18 17:23:58 +00:00
Gleb Smirnoff
4f67e14304 In m_align() add assertions that mbuf is virgin, similar to assertions
in M_ALIGN(), MH_ALIGN, MEXT_ALIGN() macros.
2013-03-17 07:41:14 +00:00
Pawel Jakub Dawidek
943c3bb968 Require CAP_SEEK if both O_APPEND and O_TRUNC flags are absent.
In other words we don't require CAP_SEEK if either O_APPEND or O_TRUNC flag is
given, because O_APPEND doesn't allow to overwrite existing data and O_TRUNC
requires CAP_FTRUNCATE already.

Sponsored by:	The FreeBSD Foundation
2013-03-16 23:19:13 +00:00
Pawel Jakub Dawidek
d6b2bd0bc9 Style: Whitespace fixes. 2013-03-16 22:37:30 +00:00
Pawel Jakub Dawidek
1ea67dd9e5 Style: Remove redundant space. 2013-03-16 22:36:24 +00:00
Gleb Smirnoff
c95be8b536 - Replace compat macros with function calls.
- Remove superfluous cleaning of m_len after allocating.

Sponsored by:	Nginx, Inc.
2013-03-16 08:57:36 +00:00
Gleb Smirnoff
5368b81eb0 Contrary to what the deleted comment said, the m_move_pkthdr()
will not smash the M_EXT and data pointer, so it is safe to
pass an mbuf with external storage procuded by m_getcl() to
m_move_pkthdr().

Reviewed by:	andre
Sponsored by:	Nginx, Inc.
2013-03-16 08:55:21 +00:00
Pawel Jakub Dawidek
c9cea47007 Sort syscalls properly. 2013-03-15 23:00:13 +00:00
Konstantin Belousov
aed5a114d7 Separate the copyright lines and the informational block by a blank line.
Requested by:	joel
MFC after:	2 weeks
2013-03-15 14:01:37 +00:00
Konstantin Belousov
5791cee883 Add my copyright for the 2012 year work, in particular vn_io_fault()
and f_offset locking.  Add required Foundation notice for r248319.

Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-03-15 12:57:30 +00:00
Konstantin Belousov
5f5f055441 Implement the helper function vn_io_fault_pgmove(), intended to use by
the filesystem VOP_READ() and VOP_WRITE() implementations in the same
way as vn_io_fault_uiomove() over the unmapped buffers.  Helper
provides the convenient wrapper over the pmap_copy_pages() for struct
uio consumers, taking care of the TDP_UIOHELD situations.

Sponsored by:	The FreeBSD Foundation
Tested by:	pho
MFC after:	2 weeks
2013-03-15 11:16:12 +00:00
Gleb Smirnoff
c8b59ea750 Use m_get() and m_getcl() instead of compat macros. 2013-03-15 10:21:18 +00:00
Gleb Smirnoff
93cfe76349 - Use m_get2() instead of hand allocating.
- No need for u_int cast here.

Sponsored by:	Nginx, Inc.
2013-03-15 10:17:24 +00:00
Gleb Smirnoff
3112ae7644 Make m_get2() never use clusters that are bigger than PAGE_SIZE.
Requested by:	andre, jhb
Sponsored by:	Nginx, Inc.
2013-03-15 10:15:07 +00:00
Edward Tomasz Napierala
a8efb53478 When throttling a process to enforce RACCT limits, do not use neither
PBDRY (which simply doesn't make any sense) nor PCATCH (which could
be used by a malicious process to work around the PCPU limit).

Submitted by:	Rudo Tomori
Reviewed by:	kib
2013-03-14 23:25:42 +00:00
Edward Tomasz Napierala
16befafd16 Accessing td_state requires thread lock to be held.
Submitted by:	Rudo Tomori
Reviewed by:	kib
2013-03-14 23:20:18 +00:00
Konstantin Belousov
70e198dd07 Some style fixes.
Sponsored by:	The FreeBSD Foundation
2013-03-14 20:31:39 +00:00
Konstantin Belousov
c535690b33 Add currently unused flag argument to the cluster_read(),
cluster_write() and cluster_wbuild() functions.  The flags to be
allowed are a subset of the GB_* flags for getblk().

Sponsored by:	The FreeBSD Foundation
Tested by:	pho
2013-03-14 20:28:26 +00:00
Konstantin Belousov
a1143a3ba8 Rewrite the vfs_bio_clrbuf(9) to not access the b_data for B_VMIO
buffers directly, use pmap_zero_page_area(9) for each zeroing page
region instead.

Sponsored by:	The FreeBSD Foundation
Tested by:	pho
MFC after:	2 weeks
2013-03-14 19:48:25 +00:00
Tijl Coosemans
d19d5bf443 - Fix two possible overflows when testing if ELF program headers are on
the first page:
  1. Cast uint16_t operands in a multiplication to unsigned int because
     otherwise the implicit promotion to int results in a signed
     multiplication that can overflow and the behaviour on integer
     overflow is undefined.
  2. Replace (offset + size > PAGE_SIZE) with (size > PAGE_SIZE - offset)
     because the sum may overflow.
- Use the same tests to see if the path to the interpreter is on the first
  page. There's no overflow here because size is already limited by
  MAXPATHLEN, but the compiler optimises the new tests better. Also fix an
  off-by-one error.
- Simplify tests to see if an ELF note program header is on the first page.
  This also fixes an off-by-one error.

Reviewed by:	kib
MFC after:	1 week
2013-03-13 22:01:31 +00:00
Alexander Motin
ca9feb490c Fix incorrect assertion that caused panic when periodic-only timers used. 2013-03-13 06:42:01 +00:00
Gleb Smirnoff
41a7572b26 Functions m_getm2() and m_get2() have different order of arguments,
and that can drive someone crazy. While m_get2() is young and not
documented yet, change its order of arguments to match m_getm2().

Sorry for churn, but better now than later.
2013-03-12 13:42:47 +00:00
Gleb Smirnoff
3b4a84e757 In kern_sendfile() use m_extadd() instead of MEXTADD() macro, supplying
appropriate wait argument and checking return value. Before this change
m_extadd() could fail, and kern_sendfile() ignored that.

Sponsored by:	Nginx, Inc.
2013-03-12 12:15:24 +00:00
Gleb Smirnoff
8c629bdf05 The m_extadd() can fail due to memory allocation failure, thus:
- Make it return int, not void.
- Add wait parameter.
- Update MEXTADD() macro appropriately, defaults to M_NOWAIT, as
  before this change.

Sponsored by:	Nginx, Inc.
2013-03-12 12:12:16 +00:00
Alexander Motin
0dbf17e6eb Make kern_nanosleep() and pause_sbt() to use per-CPU sleep queues.
This removes significant sleep queue lock congestion on multithreaded
microbenchmarks, making them scale to multiple CPUs almost linearly.
2013-03-12 06:58:49 +00:00
Pawel Jakub Dawidek
be26ba7cd3 Fix memory leak when one process send descriptor over UNIX domain socket,
but the other process exited before receiving it.
2013-03-11 22:59:07 +00:00
Michael Tuexen
fbb3471022 Return an error if sctp_peeloff() fails because a socket can't be allocated.
MFC after: 3 days
2013-03-11 17:43:55 +00:00
Andre Oppermann
a7aea132cf Bring back the comment on the sizing of the callout array that got
lost in r248031.

Requested by:	alc, alfred
2013-03-10 22:55:35 +00:00
Davide Italiano
c5904471dc Fixup r248032:
Change size requested to malloc(9) now that callwheel buckets are
callout_list and not callout_tailq anymore. This change was already
there but it seems it got lost after code churn in r248032.

Reported by:	alc, kib
2013-03-09 20:03:10 +00:00
Attilio Rao
1fc8c346d5 Improve UMTX_PROFILING:
- Use u_int values for length and max_length values
- Add a way to reset the max_length heuristic in order to have the
  possibility to reuse the mechanism consecutively without rebooting
  the machine
- Add a way to quick display top5 contented buckets in the system for
  the max_length value.
  This should give a quick overview on the quality of the hash table
  distribution.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	jeff, davide
2013-03-09 15:31:19 +00:00
Konstantin Belousov
7a61281f22 Correct the lock class for the vm object lock.
Reported and tested by:	joel
2013-03-09 10:16:08 +00:00
Alexander Motin
21a37a7196 Rework overflow checks of r247898 to not let too "intelligent" compiler to
optimize it out.

Submitted by:	bde
2013-03-09 09:07:13 +00:00
Attilio Rao
89f6b8632c Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
  - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
  - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
  - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
  - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
    (in order to avoid visibility of implementation details)
  - The read-mode operations are added:
    VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
    VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
  sys/mutex.h in consumers directly to cater its inlining functions
  using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
  consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
  the compat layer because the name clash between FreeBSD and solaris
  versions must be avoided.
  At this purpose zfs redefines the vm_object locking functions
  directly, isolating the FreeBSD components in specific compat stubs.

The KPI results heavilly broken by this commit.  Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by:	EMC / Isilon storage division
Reviewed by:	jeff
Reviewed by:	pjd (ZFS specific review)
Discussed with:	alc
Tested by:	pho
2013-03-09 02:32:23 +00:00
Andre Oppermann
15ae0c9af9 Move the callout subsystem initialization to its own SYSINIT()
from being indirectly called via cpu_startup()+vm_ksubmap_init().
The boot order position remains the same at SI_SUB_CPU.

Allocation of the callout array is changed to stardard kernel malloc
from a slightly obscure direct kernel_map allocation.

kern_timeout_callwheel_alloc() is renamed to callout_callwheel_init()
to better describe its purpose.
kern_timeout_callwheel_init() is removed simplifying the per-cpu
initialization.

Reviewed by:	davide
2013-03-08 10:37:17 +00:00
Andre Oppermann
f8ccf82a4c Move the auto-sizing of the callout array from init_param2() to
kern_timeout_callwheel_alloc() where it is actually used.

This is a mechanical move and no tuning parameters are changed.

The pre-allocated callout array is only used for legacy timeout(9)
calls and is only allocated and active on cpu0.  Eventually all
remaining users of timeout(9) should switch to the callout_* API.

Reviewed by:	davide
2013-03-08 10:14:58 +00:00
Alexander Motin
836972b877 Fix off-by-one error in nanoseconds validation.
Submitted by:	bde
2013-03-07 16:50:07 +00:00