Commit Graph

13360 Commits

Author SHA1 Message Date
jilles
66a7a3379b Regenerate files for accept4(). 2013-05-01 20:12:58 +00:00
jilles
299afd25fd Add accept4() system call.
The accept4() function, compared to accept(), allows setting the new file
descriptor atomically close-on-exec and explicitly controlling the
non-blocking status on the new socket. (Note that the latter point means
that accept() is not equivalent to any form of accept4().)

The linuxulator's accept4 implementation leaves a race window where the new
file descriptor is not close-on-exec because it calls sys_accept(). This
implementation leaves no such race window (by using falloc() flags). The
linuxulator could be fixed and simplified by using the new code.

Like accept(), accept4() is async-signal-safe, a cancellation point and
permitted in capability mode.
2013-05-01 20:10:21 +00:00
trociny
a5058251d5 Introduce a constant, ELF_NOTE_ROUNDSIZE, which evidently declare our
intention to use 4-byte padding for elf notes.

MFC after:	3 weeks
2013-05-01 14:59:16 +00:00
jilles
51cb2d085a socket: Make shutdown() wake up a blocked accept().
A blocking accept (and some other operations) waits on &so->so_timeo. Once
it wakes up, it will detect the SBS_CANTRCVMORE bit.

The error from accept() is [ECONNABORTED] which is not the nicest one -- the
thread calling accept() needs to know out-of-band what is happening.

A spurious wakeup on so->so_timeo appears harmless (sleep retried) except
when lingering on close (SO_LINGER, and in that case there is no descriptor
to call shutdown() on) so this should be fairly safe.

A shutdown() already woke up a blocked accept() for TCP sockets, but not for
Unix domain sockets. This fix is generic for all domains.

This patch was sent to -hackers@ and -net@ on April 5.

MFC after:	2 weeks
2013-04-30 15:06:30 +00:00
kib
ebecd57ee2 Eliminate the layering violation in the kern_sendfile(). When quering
the file size, use VOP_GETATTR() instead of accessing vnode vm_object
un_pager.vnp.vnp_size.

Take the shared vnode lock earlier to cover the added VOP_GETATTR()
call and, as consequence, the whole internal sendfile loop.  Reduce vm
object lock scope to not protect the local calculations.

Note that this is the last misuse of the vnp_size in the tree, the
others were removed from the ELF image activator by r230246.

Reviewed by:	alc
Tested by:	pho, bf (previous version)
MFC after:	1 week
2013-04-28 19:12:09 +00:00
andre
87ef00d48c Base the calculation of maxmbufmem in part on kmem_map size
instead of kernel_map size to prevent kernel memory exhaustion
by mbufs and a subsequent panic on physical page allocation
failure.

On architectures without a direct map all mbuf memory (except
for jumbo mbufs larger than PAGE_SIZE) comes from kmem_map.
It is the limiting factor hence.

For architectures with a direct map using the size of kmem_map
is a good proxy of available kernel memory as well.  If it is
much smaller the mbuf limit may be sub-optimal but remains
reasonable, while avoiding panics under exhaustion.

The overall mbuf memory limit calculation may be reconsidered
again later, however due to the many different mbuf sizes and
different backing KVM maps it is a tricky subject.

Found by:	pho's new network stress test
Pointed out by:	alc (kmem_map instead of kernel_map)
Tested by:	pho
2013-04-24 13:54:55 +00:00
jh
f9bfc0cc7f Include PID in the error message which is printed when the maxproc limit
is exceeded. Improve formatting of the message while here.

PR:		kern/60550
Submitted by:	Lowell Gilbert, bde
2013-04-19 15:19:29 +00:00
glebius
b991db2beb Don't compare unsigned socklen_t against < 0.
Reviewed by:	jhb
2013-04-19 13:40:13 +00:00
jilles
be7967ddcc sem: Restart the POSIX sem_* calls after signals with SA_RESTART set.
Programs often do not expect an [EINTR] return from sem_wait() and POSIX
only allows it if the signal was installed without SA_RESTART. The timeout
in sem_timedwait() is absolute so it can be restarted normally.

The umtx call can be invoked with a relative timeout and in that case
[ERESTART] must be changed to [EINTR]. However, libc does not do this.

The old POSIX semaphore implementation did this correctly (before r249566),
unlike the new umtx one.

It may be desirable to avoid [EINTR] completely, which matches the pthread
functions and is explicitly permitted by POSIX. However, the kernel must
return [EINTR] at least for signals with SA_RESTART clear, otherwise pthread
cancellation will not abort a semaphore wait. In this commit, only restore
the 8.x behaviour which is also permitted by POSIX.

Discussed with:	jhb
MFC after:	1 week
2013-04-19 10:16:00 +00:00
glebius
a4662e151c On non-ACPI i386 mp_ncpus is initialized at SI_SUB_CPU, and this
prevents us from creating UMA_ZONE_PCPU zones earlier.

As bandaid shift initialization of counter(9) zone later.

Reviewed by:		kib
Reported & tested by:	Lytochkin Boris <lytboris gmail.com>
2013-04-17 18:43:33 +00:00
gabor
b86fa940aa - Correct mispellings of the word necessary
Submitted by:	Christoph Mallon <christoph.mallon@gmx.de> (via private mail)
2013-04-17 11:42:40 +00:00
gabor
d3ee8e3ff6 - Correct mispellings of the word occurrence
Submitted by:	Christoph Mallon <christoph.mallon@gmx.de> (via private mail)
2013-04-17 11:40:10 +00:00
imp
761f4524d4 r249408 and r249436 cause a NULL pointer dereference on the CUBIEBOARD
since it doesn't set the kernel envrionment at all. Work around this
by making sure kern_envp is non-NULL before dereferencing it.
2013-04-16 22:09:08 +00:00
jhb
0ed1bc2e92 - Document that sem_wait() can fail with EINTR if it is interrupted by a
signal.
- Fix the old ksem implementation for POSIX semaphores to not restart
  sem_wait() or sem_timedwait() if interrupted by a signal.

MFC after:	1 week
2013-04-16 20:26:31 +00:00
trociny
f172830e71 Add a new set of notes to a process core dump to store procstat data.
The notes format is a header of sizeof(int), which stores the size of
the corresponding data structure to provide some versioning, and data
in the format as it is returned by a related sysctl call.

The userland tools (procstat(1)) will be taught to extract this data,
providing additional info for postmortem analysis.

PR:		kern/173723
Suggested by:	jhb
Discussed with:	jhb, kib
Reviewed by:	jhb (initial version), kib
MFC after:	1 month
2013-04-16 19:19:14 +00:00
rmacklem
90b89365fd Allow the vnode to be unlocked for the weird case of
LK_EXCLOTHER. LK_EXCLOTHER is only used to acquire a
usecount on a vnode during NFSv4 recovery from an
expired lease.

Reported and tested by:	pho
MFC after:	2 weeks
2013-04-16 14:22:16 +00:00
kib
789ce3de65 Some compilers issue a warning when wider integer is casted to narrow
pointer.  Supposedly shut down the warning by casting through
uintptr_t.

Reported by:	ian
2013-04-16 07:11:52 +00:00
gnn
59c782b807 Point args[0] not at the thread that is ending but at the one that
is starting.  This is in line with practice in OpenSolaris.

Note that this change is only in ULE and not in the 4BSD scheduler.
Once this change settles in (MFC timeout has expired) we'll try it out
on 4BSD as well.

PR:		177706
Submitted by:	Tiwei Bie
MFC after:	1 month
2013-04-15 17:21:02 +00:00
trociny
4bb922ab39 Similarly to proc_getargv() and proc_getenvv(), export proc_getauxv()
to be able to reuse the code.

MFC after:	3 weeks
2013-04-14 20:03:48 +00:00
trociny
335f3dbd91 Re-factor the code to provide kern_proc_filedesc_out(), kern_proc_out(),
and kern_proc_vmmap_out() functions to output process kinfo structures
to sbuf, to make the code reusable.

The functions are going to be used in the coredump routine to store
procstat info in the core program header notes.

Reviewed by:	kib
MFC after:	3 weeks
2013-04-14 20:01:36 +00:00
trociny
99603a0f21 Re-factor coredump routines. For each type of notes an output
function is provided, which is used either to calculate the note size
or output it to sbuf.  On the first pass the notes are registered in a
list and the resulting size is found, on the second pass the list is
traversed outputing notes to sbuf.  For the sbuf a drain routine is
provided that writes data to a core file.

The main goal of the change is to make coredump to write notes
directly to the core file, without preliminary preparing them all in a
memory buffer.  Storing notes in memory is not a problem for the
current, rather small, set of notes we write to the core, but it may
becomes an issue when we start to store procstat notes.

Reviewed by:	jhb (initial version), kib
Discussed with:	jhb, kib
MFC after:	3 weeks
2013-04-14 19:59:38 +00:00
mjg
1798a915c4 Add fdallocn function and use it when passing fds over unix socket.
This gets rid of "unp_externalize fdalloc failed" panic.

Reviewed by:	pjd
MFC after:	1 week
2013-04-14 17:08:34 +00:00
jchandra
33ed81146c Fix changes made in r249408.
In some cases, kern_envp is set by the architecture code and env_pos does
not contain the length of the static kernel environment. In these cases
r249408 causes the kernel to discard the environment.

Fix this by updating the check for empty static env to *kern_envp != '\0'

Reported by:	np@
2013-04-13 07:23:37 +00:00
jchandra
b9f386539e Fix kenv behavior when there is no static environment
In case where there are no static kernel environment entries, the
function init_dynamic_kenv() adds an incorrect entry at position 0 of
the dynamic kernel environment. This in turn causes kenv(1) to print
and empty list even though there are dynamic entries added later.

Fix this by checking env_pos in init_dynamic_kenv() and adding dynamic
entries only if there are static entries.
2013-04-12 15:58:53 +00:00
trociny
156185aa6e Add sbuf_start_section() and sbuf_end_section() functions, which can
be used for automatic section alignment.

Discussed with:	kib
Reviewed by:	kib
MFC after:	1 month
2013-04-11 19:49:18 +00:00
jimharris
649c9a27e9 Fix the build. 2013-04-10 00:35:08 +00:00
andre
306fddaf78 Change certain heavily used network related mutexes and rwlocks to
reside on their own cache line to prevent false sharing with other
nearby structures, especially for those in the .bss segment.

NB: Those mutexes and rwlocks with variables next to them that get
changed on every invocation do not benefit from their own cache line.
Actually it may be net negative because two cache misses would be
incurred in those cases.
2013-04-09 21:02:20 +00:00
attilio
d67371dab6 Switch some "low-hanging fruit" to acquire read lock on vmobjects
rather than write locks.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
Tested by:	pho
2013-04-08 19:58:32 +00:00
glebius
9cf64d6c35 Merge from projects/counters: counter(9).
Introduce counter(9) API, that implements fast and raceless counters,
provided (but not limited to) for gathering of statistical data.

See http://lists.freebsd.org/pipermail/freebsd-arch/2013-April/014204.html
for more details.

In collaboration with:	kib
Reviewed by:		luigi
Tested by:		ae, ray
Sponsored by:		Nginx, Inc.
2013-04-08 19:40:53 +00:00
trociny
dc5f593dd8 Use pget(9) to reduce code duplication.
MFC after:	1 week
2013-04-07 17:44:30 +00:00
trociny
c71ec9b060 Fill p_flags and p_align fields of the core dump note segement.
Reviewed by:	kib
MFC after:	2 weeks
2013-04-07 17:42:27 +00:00
trociny
67b311e2f1 Use 4-byte padding for core dump notes on both 32 and 64bit archs.
Although native word padding (i.e. 8-byte on 64bit arch) looks to be
in agreement with standards, other parts of our code and other OSes
use 4-byte alignment.

This is not expected to change alignment for currently generated core
dump notes, as the notes look to consist of structures with sizes
multiple of 8 on 64-bit archs. But there are plans to add additional
notes, where 4-byte vs 8-byte alignment makes difference.

Discussed with:	kib
Reviewed by:	kib
MFC after:	2 weeks
2013-04-07 17:40:49 +00:00
jilles
fd361c8faa mqueue,ksem,shm: Fix race condition with setting UF_EXCLOSE.
POSIX mqueue, compatibility ksem and POSIX shm create a file descriptor that
has close-on-exec set. However, they do this incorrectly, leaving a window
where a thread may fork and exec while the flag has not been set yet. The
race is easily reproduced on a multicore system with one thread doing
shm_open and close and another thread doing posix_spawnp and waitpid.

Set UF_EXCLOSE via falloc()'s flags argument instead. This also simplifies
the code.

MFC after:	1 week
2013-04-07 15:26:09 +00:00
jeff
fa887dba7b Prepare to replace the buf splay with a trie:
- Don't insert BKGRDMARKER bufs into the splay or dirty/clean buf lists.
   No consumers need to find them there and it complicates the tree.
   These flags are all FFS specific and could be moved out of the buf
   cache.
 - Use pbgetvp() and pbrelvp() to associate the background and journal
   bufs with the vp.  Not only is this much cheaper it makes more sense
   for these transient bufs.
 - Fix the assertions in pbget* and pbrel*.  It's not safe to check list
   pointers which were never initialized.  Use the BX flags instead.  We
   also check B_PAGING in reassignbuf() so this should cover all cases.

Discussed with:	kib, mckusick, attilio
Sponsored by:	EMC / Isilon Storage Division
2013-04-06 22:21:23 +00:00
glebius
49b5e1ded6 Fix memory leak in coredump().
Reviewed by:	kib
2013-04-05 20:24:51 +00:00
kib
d6e3efdde9 If filter of the interrupt event is not null, print it, in addition to
the handler address.  Add a mark to distinguish between filter and
handler.

Note that the arguments for both filter and handler are same.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	jhb
MFC after:	1 week
2013-04-05 14:30:51 +00:00
brooks
2a56708b49 MFP4 change 210763
Allow boothowto and bootverbose to be set via kernel options, which
is useful on architectures that are unable to rely on a boot loader
to pass configuration variables to the kernel.

Submitted by:	rwatson
2013-04-03 22:24:36 +00:00
ken
a19ebe4f90 Add support for XPT_CONT_TARGET_IO CCBs in _bus_dmamap_load_ccb().
Declare CCB types in their respective switch blocks.

Sponsored by:	Spectra Logic
2013-04-02 16:49:49 +00:00
mdf
a3d624db5a Regen.
MFC after:	1 week
2013-04-02 05:30:52 +00:00
mdf
da578c6492 Fix return type of extattr_set_* and fix rmextattr(8) utility.
extattr_set_{fd,file,link} is logically a write(2)-like operation and
should return ssize_t, just like extattr_get_*.  Also, the user-space
utility was using an int for the return value of extattr_get_* and
extattr_list_*, both of which return an ssize_t.

MFC after:	1 week
2013-04-02 05:30:41 +00:00
kib
060b76c803 Do not call the VOP_LOOKUP() for the doomed directory vnode. The
vnode could be reclaimed while lock upgrade was performed.

Sponsored by:	The FreeBSD Foundation
Reported and tested by:	pho
Diagnosed and reviewed by:	rmacklem
MFC after:	1 week
2013-04-01 09:59:38 +00:00
jilles
9d8a3c5c3b Rename do_pipe() to kern_pipe2() and declare it properly. 2013-03-31 17:42:54 +00:00
mdf
4c77a4b020 Use a shared lock for VOP_GETEXTATTR, as it is a read-like operation.
MFC after:	1 week
2013-03-30 15:09:04 +00:00
jimharris
f59af79144 Add bus_dmamap_load_bio for non-CAM disk drivers that wish to enable
unmapped I/O.

Sponsored by:	Intel
Reviewed by:	kib
2013-03-29 16:26:25 +00:00
jimharris
5febbe1181 Add CTR5() to bus_dmamap_load_ccb, similar to other bus_dmamap_load_*
functions.

Sponsored by:	Intel
2013-03-29 16:00:16 +00:00
jimharris
60c7cceb4c Do not add 1 to nsegs before passing to CTR5(), since nsegs
has already been incremented before these calls.

Sponsored by:	Intel
2013-03-29 15:54:12 +00:00
jimharris
7e64e1827b Pass correct parameter to CTR5() in bus_dmamap_load_uio.
Sponsored by:	Intel
2013-03-29 15:51:45 +00:00
glebius
11f04943de Fix bug in m_split() in a case when split len matches len of the
first mbuf, and the first mbuf is M_PKTHDR.

PR:		kern/176144
Submitted by:	Jacques Fourie <jacques.fourie gmail.com>
2013-03-29 14:10:40 +00:00
glebius
06ecb1b7ca Once ng_ksocket(4) is fixed, re-apply r194662. See this revision for
longer description.

Discussed with:	andre, rwatson
Sponsored by:	Nginx, Inc.
2013-03-29 14:06:04 +00:00
glebius
1bccb6e916 When soreceive_generic() hands off an mbuf from buffer,
clear its pointer to next record, since next record
belongs to the buffer, and shouldn't be leaked.

The ng_ksocket(4) used to clear this pointer itself,
but the correct place is here.

Sponsored by:	Nginx, Inc
2013-03-29 13:57:55 +00:00
scottl
84ae5b84bb Several fixes and improvements to sendfile()
1.  If we wanted to send exactly as many bytes as the socket buffer is
    sized for, the inner loop of kern_sendfile() would see that the
    socket is full before seeing that it had no more bytes left to send.
    This would cause it to return EAGAIN to the caller instead of
    success.  Fix by changing the order that these conditions are tested.
2.  Simplify the calculation for the bytes to send in each iteration of
    the inner loop of kern_sendfile()
3.  Fix some calls with bogus arguments to sf_buf_ext().  These would
    only trigger on mbuf allocation failure, but would be hilariously
    bad if they did trigger.

Submitted by:	gibbs(3), andre(2)
Reviewed by:	emax, andre
Obtained from:	Netflix
MFC after:	1 week
2013-03-28 14:14:28 +00:00
jimharris
6ed2dc4d7c deferal -> deferral 2013-03-27 23:07:43 +00:00
kib
c45e5da903 Fix a race with the vnode reclamation in the aio_qphysio(). Obtain
the thread reference on the vp->v_rdev and use the returned struct
cdev *dev instead of using vp->v_rdev.  Call dev_strategy_csw()
instead of dev_strategy(), since we now own the reference.

Since the csw was already calculated, test d_flags to avoid mapping
the buffer if the driver supports unmapped requests [*].

Suggested by:	kan [*]
Reviewed by:	kan (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-03-27 11:47:52 +00:00
kib
448e7c1290 Add dev_strategy_csw() function, which is similar to dev_strategy()
but assumes that a thread reference was already obtained on the passed
device.  Use the function from physio(), to avoid two extra dev_mtx
lock and unlock.  Note that physio() is always used as the cdevsw
method, or is called from a cdevsw method, and the caller already owns
the reference.

dev_strategy() is left to keep KPI intact, but now it is implemented
as a wrapper around dev_strategy_csw().

Do some style cleanup in physio().

Requested and reviewed by:	kan (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-03-27 11:34:27 +00:00
kib
df3795022f On i386, double the default size of the bio transient map. With the
maxbcache size fixed, the auto-tuned transient map is too small for
real-world load on i386.

Tested by:	David Wolfskill
Sponsored by:	The FreeBSD Foundation
2013-03-27 10:56:15 +00:00
kan
49a21b7c2e Do not pass unmapped buffers to drivers that cannot handle them
In physio, check if device can handle unmapped IO and pass an
appropriately mapped buffer to the driver strategy routine. The
only driver in the tree that can handle unmapped buffers is one
exposed by GEOM, so mark it as such with the new flag in the
driver cdevsw structure.

This fixes insta-panics on hosts, running dconschat, as /dev/fwmem
is an example of the driver that makes use of physio routine, but
bypasses the g_down thread, where the buffer gets mapped normally.

Discussed with: kib (earlier version)
2013-03-26 01:17:06 +00:00
davide
f736d35281 Cache the callout precision argument as part of the informations required
for migrating callouts to new CPU. This value is passed to
callout_cc_add() in order to update properly precision field in case of
rescheduling/migration.

Reviewed by:	mav
2013-03-25 09:43:50 +00:00
will
5d3a27c743 Extend taskqueue(9) to enable per-taskqueue callbacks.
The scope of these callbacks is primarily to support actions that affect the
taskqueue's thread environments.  They are entirely optional, and
consequently are introduced as a new API: taskqueue_set_callback().

This interface allows the caller to specify that a taskqueue requires a
callback and optional context pointer for a given callback type.

The callback types included in this commit can be used to register a
constructor and destructor for thread-local storage using osd(9).  This
allows a particular taskqueue to define that its threads require a specific
type of TLS, without the need for a specially-orchestrated task-based
mechanism for startup and shutdown in order to accomplish it.

Two callback types are supported at this point:

- TASKQUEUE_CALLBACK_TYPE_INIT, called by every thread when it starts, prior
  to processing any tasks.
- TASKQUEUE_CALLBACK_TYPE_SHUTDOWN, called by every thread when it exits,
  after it has processed its last task but before the taskqueue is
  reclaimed.

While I'm here:

- Add two new macros, TQ_ASSERT_LOCKED and TQ_ASSERT_UNLOCKED, and use them
  in appropriate locations.
- Fix taskqueue.9 to mention taskqueue_start_threads(), which is a required
  interface for all consumers of taskqueue(9).

Reviewed by:	kib (all), eadler (taskqueue.9), brd (taskqueue.9)
Approved by:	ken (mentor)
Sponsored by:	Spectra Logic
MFC after:	1 month
2013-03-23 15:11:53 +00:00
avg
1c06448efc post mountroot event after a real/final root is mounted
not every time an intermediate root (including the first devfs) is
mounted.
This is also consistent with waking up via root_mount_complete.

Reviewed by:	jhb
MFC after:	13 days
2013-03-23 08:59:34 +00:00
pjd
91184d303f - Constify local path variable for chflagsat().
- Use correct format characters (%lx) for u_long.

This fixes the build broken in r248599.
2013-03-22 07:40:34 +00:00
pjd
f44b21d5e5 Regenerate after r248599.
Sponsored by:	The FreeBSD Foundation
2013-03-21 23:02:19 +00:00
pjd
635dbe90f2 Implement chflagsat(2) system call, similar to fchmodat(2), but operates on
file flags.

Reviewed by:	kib, jilles
Sponsored by:	The FreeBSD Foundation
2013-03-21 22:59:01 +00:00
pjd
5fc1bac315 Regenerate after r248597.
Sponsored by:	The FreeBSD Foundation
2013-03-21 22:47:03 +00:00
pjd
2a3cf7f364 - Make 'flags' argument to chflags(2), fchflags(2) and lchflags(2) of type
u_long. Before this change it was of type int for syscalls, but prototypes
  in sys/stat.h and documentation for chflags(2) and fchflags(2) (but not
  for lchflags(2)) stated that it was u_long. Now some related functions
  use u_long type for flags (strtofflags(3), fflagstostr(3)).
- Make path argument of type 'const char *' for consistency.

Discussed on:	arch
Sponsored by:	The FreeBSD Foundation
2013-03-21 22:44:33 +00:00
jilles
bd09044d61 Allow O_CLOEXEC in posix_openpt() flags.
PR:		kern/162374
Reviewed by:	ed
2013-03-21 21:39:15 +00:00
attilio
83c8ef372d Fix a bug in UMTX_PROFILING:
UMTX_PROFILING should really analyze the distribution of locks as they
index entries in the umtxq_chains hash-table.
However, the current implementation does add/dec the length counters
for *every* thread insert/removal, measuring at all really userland
contention and not the hash distribution.

Fix this by correctly add/dec the length counters in the points where
it is really needed.

Please note that this bug brought us questioning in the past the quality
of the umtx hash table distribution.
To date with all the benchmarks I could try I was not able to reproduce
any issue about the hash distribution on umtx.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	jeff, davide
MFC after:	2 weeks
2013-03-21 19:58:25 +00:00
jhb
1b6f4e466c Another NFS SIGSTOP related fix: Ignore thread suspend requests due to
SIGSTOP if stop signals are currently deferred.  This can occur if a
process is stopped via SIGSTOP while a thread is running or runnable
but before it has set TDF_SBDRY.

Tested by:	pho
Reviewed by:	kib
MFC after:	1 week
2013-03-21 14:06:27 +00:00
kib
9382f70781 Only size and create the bio_transient_map when unmapped buffers are
enabled.  Now, disabling the unmapped buffers should result in the
kernel memory map identical to pre-r248550.

Sponsored by:	The FreeBSD Foundation
2013-03-21 07:28:15 +00:00
kib
fbd9d518d5 In bufwrite(), a dirty buffer is moved to the clean queue before the
bufobj counter of the writes in progress is incremented.  Other thread
inspecting the bufobj would consider it clean.

For the regular vnodes, the vnode lock is typically held both by the
thread performing the bufwrite() and an other thread doing syncing,
which prevents the situation.  On the other hand, writes to the VCHR
vnodes are done without holding vnode lock.

Increment the write ref counter for the buffer object before calling
bundirty().

Sponsored by:	The FreeBSD Foundation
Tested by:	pho
MFC after:	2 weeks
2013-03-20 21:08:00 +00:00
kib
6fbc70a0bc When the journaled FFS volume is suspended due to the journal space
becoming too low, the softdep flush thread processes the workitems,
which frees the space in journal, and then unsuspends the fs.  The
softdep_flush() and other workitem processing functions busy the
filesystem before iterating over the worklist, to prevent the parallel
unmount from freeing the mount data. The vfs_busy() is called with
MBF_NOWAIT flag.

Now, if the unmount is already started and the filesystem is suspended
due to low journal space, the journal is never flushed and filesystem
is never unsuspended, because vfs_busy(MBF_NOWAIT) call cannot succeed
for the unmounting fs, and softdep_flush() does not process the
workitems. Unmount needs to write metadata, where it hangs in the
"suspfs" state.

Move the vn_start_write() call in the dounmount() before setting the
MNTK_UNMOUNT flag. This practically ensures that softdep_flush()
processed the pending journal writes by making dounmount() wait for
the lift of the suspension.

Sponsored by:	The FreeBSD Foundation
Reported and tested by:	pho
MFC after:	2 weeks
2013-03-20 21:07:49 +00:00
mckusick
9904f3d968 When renaming a directory from one parent directory to another,
we need to call ufs_checkpath() to walk from our new location to
the root of the filesystem to ensure that we do not encounter
ourselves along the way. Until now, we accomplished this by reading
the ".." entries of each directory in our path until we reached
the root (or encountered an error). This change tries to avoid the
I/O of reading the ".." entries by first looking them up in the
name cache and only doing the I/O when the name cache lookup fails.

Reviewed by: kib
Tested by:   Peter Holm
MFC after:   4 weeks
2013-03-20 17:57:00 +00:00
jilles
c9066bd014 Implement SOCK_CLOEXEC, SOCK_NONBLOCK and MSG_CMSG_CLOEXEC.
This change allows creating file descriptors with close-on-exec set in some
situations. SOCK_CLOEXEC and SOCK_NONBLOCK can be OR'ed in socket() and
socketpair()'s type parameter, and MSG_CMSG_CLOEXEC to recvmsg() makes file
descriptors (SCM_RIGHTS) atomically close-on-exec.

The numerical values for SOCK_CLOEXEC and SOCK_NONBLOCK are as in NetBSD.
MSG_CMSG_CLOEXEC is the first free bit for MSG_*.

The SOCK_* flags are not passed to MAC because this may cause incorrect
failures and can be done later via fcntl() anyway. On the other hand, audit
is expected to cope with the new flags.

For MSG_CMSG_CLOEXEC, unp_externalize() is extended to take a flags
argument.

Reviewed by:	kib
2013-03-19 20:58:17 +00:00
kib
e5332ab955 Do not remap usermode pages into KVA for physio.
Sponsored by:	The FreeBSD Foundation
Tested by:	pho
2013-03-19 14:43:57 +00:00
kib
51030488d7 Add a helper function vfs_bio_bzero_buf() to zero the portion of the
buffer, transparently handling mapped or unmapped buffers.  Its intent
is to replace the use of bzero(bp->b_data) in cases where the buffer
might be unmapped, to avoid unneeded upgrades.

Sponsored by:	The FreeBSD Foundation
Tested by:	pho
2013-03-19 14:27:14 +00:00
kib
7c26a038f9 Implement the concept of the unmapped VMIO buffers, i.e. buffers which
do not map the b_pages pages into buffer_map KVA.  The use of the
unmapped buffers eliminate the need to perform TLB shootdown for
mapping on the buffer creation and reuse, greatly reducing the amount
of IPIs for shootdown on big-SMP machines and eliminating up to 25-30%
of the system time on i/o intensive workloads.

The unmapped buffer should be explicitely requested by the GB_UNMAPPED
flag by the consumer.  For unmapped buffer, no KVA reservation is
performed at all. The consumer might request unmapped buffer which
does have a KVA reserve, to manually map it without recursing into
buffer cache and blocking, with the GB_KVAALLOC flag.

When the mapped buffer is requested and unmapped buffer already
exists, the cache performs an upgrade, possibly reusing the KVA
reservation.

Unmapped buffer is translated into unmapped bio in g_vfs_strategy().
Unmapped bio carry a pointer to the vm_page_t array, offset and length
instead of the data pointer.  The provider which processes the bio
should explicitely specify a readiness to accept unmapped bio,
otherwise g_down geom thread performs the transient upgrade of the bio
request by mapping the pages into the new bio_transient_map KVA
submap.

The bio_transient_map submap claims up to 10% of the buffer map, and
the total buffer_map + bio_transient_map KVA usage stays the
same. Still, it could be manually tuned by kern.bio_transient_maxcnt
tunable, in the units of the transient mappings.  Eventually, the
bio_transient_map could be removed after all geom classes and drivers
can accept unmapped i/o requests.

Unmapped support can be turned off by the vfs.unmapped_buf_allowed
tunable, disabling which makes the buffer (or cluster) creation
requests to ignore GB_UNMAPPED and GB_KVAALLOC flags.  Unmapped
buffers are only enabled by default on the architectures where
pmap_copy_page() was implemented and tested.

In the rework, filesystem metadata is not the subject to maxbufspace
limit anymore. Since the metadata buffers are always mapped, the
buffers still have to fit into the buffer map, which provides a
reasonable (but practically unreachable) upper bound on it. The
non-metadata buffer allocations, both mapped and unmapped, is
accounted against maxbufspace, as before. Effectively, this means that
the maxbufspace is forced on mapped and unmapped buffers separately.
The pre-patch bufspace limiting code did not worked, because
buffer_map fragmentation does not allow the limit to be reached.

By Jeff Roberson request, the getnewbuf() function was split into
smaller single-purpose functions.

Sponsored by:	The FreeBSD Foundation
Discussed with:	jeff (previous version)
Tested by:	pho, scottl (previous version), jhb, bf
MFC after:	2 weeks
2013-03-19 14:13:12 +00:00
jhb
8604015a2e Tweak some comments. 2013-03-18 18:04:09 +00:00
jhb
8b099870ed Partially revert r195702. Deferring stops is now implemented via a set of
calls to toggle TDF_SBDRY rather than passing PBDRY to individual sleep
calls.
- Remove the stop_allowed parameters from cursig() and issignal().
  issignal() checks TDF_SBDRY directly.
- Remove the PBDRY and SLEEPQ_STOP_ON_BDRY flags.
2013-03-18 17:23:58 +00:00
glebius
338f6e587c In m_align() add assertions that mbuf is virgin, similar to assertions
in M_ALIGN(), MH_ALIGN, MEXT_ALIGN() macros.
2013-03-17 07:41:14 +00:00
pjd
acae942b05 Require CAP_SEEK if both O_APPEND and O_TRUNC flags are absent.
In other words we don't require CAP_SEEK if either O_APPEND or O_TRUNC flag is
given, because O_APPEND doesn't allow to overwrite existing data and O_TRUNC
requires CAP_FTRUNCATE already.

Sponsored by:	The FreeBSD Foundation
2013-03-16 23:19:13 +00:00
pjd
a03e9d6f4c Style: Whitespace fixes. 2013-03-16 22:37:30 +00:00
pjd
27049a86ae Style: Remove redundant space. 2013-03-16 22:36:24 +00:00
glebius
9edd6e1174 - Replace compat macros with function calls.
- Remove superfluous cleaning of m_len after allocating.

Sponsored by:	Nginx, Inc.
2013-03-16 08:57:36 +00:00
glebius
9d8d25cc83 Contrary to what the deleted comment said, the m_move_pkthdr()
will not smash the M_EXT and data pointer, so it is safe to
pass an mbuf with external storage procuded by m_getcl() to
m_move_pkthdr().

Reviewed by:	andre
Sponsored by:	Nginx, Inc.
2013-03-16 08:55:21 +00:00
pjd
552a9fbc05 Sort syscalls properly. 2013-03-15 23:00:13 +00:00
kib
284e554a67 Separate the copyright lines and the informational block by a blank line.
Requested by:	joel
MFC after:	2 weeks
2013-03-15 14:01:37 +00:00
kib
28990b8686 Add my copyright for the 2012 year work, in particular vn_io_fault()
and f_offset locking.  Add required Foundation notice for r248319.

Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-03-15 12:57:30 +00:00
kib
9b5b993aea Implement the helper function vn_io_fault_pgmove(), intended to use by
the filesystem VOP_READ() and VOP_WRITE() implementations in the same
way as vn_io_fault_uiomove() over the unmapped buffers.  Helper
provides the convenient wrapper over the pmap_copy_pages() for struct
uio consumers, taking care of the TDP_UIOHELD situations.

Sponsored by:	The FreeBSD Foundation
Tested by:	pho
MFC after:	2 weeks
2013-03-15 11:16:12 +00:00
glebius
a10c1a7c72 Use m_get() and m_getcl() instead of compat macros. 2013-03-15 10:21:18 +00:00
glebius
4d24ab3136 - Use m_get2() instead of hand allocating.
- No need for u_int cast here.

Sponsored by:	Nginx, Inc.
2013-03-15 10:17:24 +00:00
glebius
4502cb662c Make m_get2() never use clusters that are bigger than PAGE_SIZE.
Requested by:	andre, jhb
Sponsored by:	Nginx, Inc.
2013-03-15 10:15:07 +00:00
trasz
5b0f44c921 When throttling a process to enforce RACCT limits, do not use neither
PBDRY (which simply doesn't make any sense) nor PCATCH (which could
be used by a malicious process to work around the PCPU limit).

Submitted by:	Rudo Tomori
Reviewed by:	kib
2013-03-14 23:25:42 +00:00
trasz
8593db8ad4 Accessing td_state requires thread lock to be held.
Submitted by:	Rudo Tomori
Reviewed by:	kib
2013-03-14 23:20:18 +00:00
kib
7ca94eca24 Some style fixes.
Sponsored by:	The FreeBSD Foundation
2013-03-14 20:31:39 +00:00
kib
9b0c4b125b Add currently unused flag argument to the cluster_read(),
cluster_write() and cluster_wbuild() functions.  The flags to be
allowed are a subset of the GB_* flags for getblk().

Sponsored by:	The FreeBSD Foundation
Tested by:	pho
2013-03-14 20:28:26 +00:00
kib
376c14488b Rewrite the vfs_bio_clrbuf(9) to not access the b_data for B_VMIO
buffers directly, use pmap_zero_page_area(9) for each zeroing page
region instead.

Sponsored by:	The FreeBSD Foundation
Tested by:	pho
MFC after:	2 weeks
2013-03-14 19:48:25 +00:00
tijl
f0c5e71226 - Fix two possible overflows when testing if ELF program headers are on
the first page:
  1. Cast uint16_t operands in a multiplication to unsigned int because
     otherwise the implicit promotion to int results in a signed
     multiplication that can overflow and the behaviour on integer
     overflow is undefined.
  2. Replace (offset + size > PAGE_SIZE) with (size > PAGE_SIZE - offset)
     because the sum may overflow.
- Use the same tests to see if the path to the interpreter is on the first
  page. There's no overflow here because size is already limited by
  MAXPATHLEN, but the compiler optimises the new tests better. Also fix an
  off-by-one error.
- Simplify tests to see if an ELF note program header is on the first page.
  This also fixes an off-by-one error.

Reviewed by:	kib
MFC after:	1 week
2013-03-13 22:01:31 +00:00
mav
06e9d2ec10 Fix incorrect assertion that caused panic when periodic-only timers used. 2013-03-13 06:42:01 +00:00
glebius
37a43650ed Functions m_getm2() and m_get2() have different order of arguments,
and that can drive someone crazy. While m_get2() is young and not
documented yet, change its order of arguments to match m_getm2().

Sorry for churn, but better now than later.
2013-03-12 13:42:47 +00:00
glebius
89cbb92630 In kern_sendfile() use m_extadd() instead of MEXTADD() macro, supplying
appropriate wait argument and checking return value. Before this change
m_extadd() could fail, and kern_sendfile() ignored that.

Sponsored by:	Nginx, Inc.
2013-03-12 12:15:24 +00:00
glebius
ce515e6007 The m_extadd() can fail due to memory allocation failure, thus:
- Make it return int, not void.
- Add wait parameter.
- Update MEXTADD() macro appropriately, defaults to M_NOWAIT, as
  before this change.

Sponsored by:	Nginx, Inc.
2013-03-12 12:12:16 +00:00
mav
ca616198e5 Make kern_nanosleep() and pause_sbt() to use per-CPU sleep queues.
This removes significant sleep queue lock congestion on multithreaded
microbenchmarks, making them scale to multiple CPUs almost linearly.
2013-03-12 06:58:49 +00:00
pjd
46a0405f64 Fix memory leak when one process send descriptor over UNIX domain socket,
but the other process exited before receiving it.
2013-03-11 22:59:07 +00:00
tuexen
e9578ada2f Return an error if sctp_peeloff() fails because a socket can't be allocated.
MFC after: 3 days
2013-03-11 17:43:55 +00:00
andre
c7a8a1916c Bring back the comment on the sizing of the callout array that got
lost in r248031.

Requested by:	alc, alfred
2013-03-10 22:55:35 +00:00
davide
399582186f Fixup r248032:
Change size requested to malloc(9) now that callwheel buckets are
callout_list and not callout_tailq anymore. This change was already
there but it seems it got lost after code churn in r248032.

Reported by:	alc, kib
2013-03-09 20:03:10 +00:00
attilio
e5b2c298d6 Improve UMTX_PROFILING:
- Use u_int values for length and max_length values
- Add a way to reset the max_length heuristic in order to have the
  possibility to reuse the mechanism consecutively without rebooting
  the machine
- Add a way to quick display top5 contented buckets in the system for
  the max_length value.
  This should give a quick overview on the quality of the hash table
  distribution.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	jeff, davide
2013-03-09 15:31:19 +00:00
kib
359add023c Correct the lock class for the vm object lock.
Reported and tested by:	joel
2013-03-09 10:16:08 +00:00
mav
c6f9be8cf5 Rework overflow checks of r247898 to not let too "intelligent" compiler to
optimize it out.

Submitted by:	bde
2013-03-09 09:07:13 +00:00
attilio
7fd2627275 MFC 2013-03-09 01:39:42 +00:00
attilio
bf1dc90446 MFC 2013-03-08 00:03:07 +00:00
mav
98e8f7838c Fix off-by-one error in nanoseconds validation.
Submitted by:	bde
2013-03-07 16:50:07 +00:00
ian
c959afb453 Call sched_prio() to immediately change the priority of the thread in
response to an rtprio_thread() call, when the priority is different
than the old priority, and either the old or the new priority class is
not RTP_PRIO_NORMAL (timeshare).

The reasoning for the second half of the test is that if it's a change in
timeshare priority, then the scheduler is going to adjust that priority
in a way that completely wipes out the requested change anyway, so
what's the point?  (If that's not true, then allowing a thread to change
its own timeshare priority would subvert the scheduler's adjustments and
let a cpu-bound thread monopolize the cpu; if allowed at all, that
should require priveleges.)

On the other hand, if either the old or new priority class is not
timeshare, then the scheduler doesn't make automatic adjustments, so we
should honor the request and make the priority change right away.  The
reason the old class gets caught up in this is the very reason for this
change:  when thread A changes the priority of its child thread B from
idle back to timeshare, thread B never actually gets moved to a
timeshare-range run queue unless there are some idle cycles available
to allow it to first get scheduled again as an idle thread.

Reviewed by:	jhb@
2013-03-07 02:53:29 +00:00
mav
5dfd8a110a Reduce minimal time intervals of setitimer(2) from 1/HZ to 1/(16*HZ) by
using callout_reset_sbt() instead of callout_reset().  We can't remove
lower limit completely in this case because of significant processing
overhead, caused by unability to use direct callout execution due to using
process mutex in callout handler for sending SEGALRM signal.  With support
of periodic events that would allow unprivileged user to abuse the system.

Reviewed by:	davide
2013-03-06 22:40:47 +00:00
mav
eb7c9d2685 Fix time math overflows and improve zero intervals handling in poll(),
select(), nanosleep() and kevent() functions after calloutng changes.

Reported by:	bde
2013-03-06 19:37:38 +00:00
fabient
5e81647255 Add a generic way to call per event allocate / release function.
Reviewed by:	mav
MFC after:	1 month
2013-03-05 10:18:48 +00:00
davide
63dc09eae5 Complete r247813:
Use true/false instead of TRUE/FALSE.

Reported by:	attilio
Requested by:	jhb
2013-03-04 21:52:12 +00:00
davide
322c45390e Use C99 'bool' rather than Machish 'boolean_t'.
Requested by:	jhb
2013-03-04 21:09:22 +00:00
davide
844e77458d MFcalloutng:
- Rewrite kevent() timeout implementation to allow sub-tick precision.
- Make the interval timings for EVFILT_TIMER more accurate. This also
removes an hack introduced in r238424.

Sponsored by:	Google Summer of Code 2012, iXsystems inc.
Tested by:	flo, marius, ian, markj, Fabian Keil
2013-03-04 16:55:16 +00:00
davide
473ef72351 MFcalloutng:
Fix kern_select() and sys_poll() so that they can handle sub-tick
precision for timeouts (in the same fashion it was done for nanosleep()
in r247797).

Sponsored by:	Google Summer of Code 2012, iXsystems inc.
Tested by:	flo, marius, ian, markj, Fabian Keil
2013-03-04 16:41:27 +00:00
davide
0aca596713 MFcalloutng (r244251 with minor changes):
Specify that precision of 0.5s is enough for resource limitation.

Sponsored by:	Google Summer of Code 2012, iXsystems inc.
Tested by:	flo, marius, ian, markj, Fabian Keil
2013-03-04 16:25:12 +00:00
davide
40e58fe548 MFcalloutng (r244255 by mav, with minor changes):
Specify that syslog doesn't need exactly 5 wakeups per second.

Sponsored by:	Google Summer of Code 2012, iXsystems inc.
Tested by:	flo, marius, ian, markj, Fabian Keil
2013-03-04 16:07:55 +00:00
davide
4d11390875 MFcalloutng:
kern_nanosleep() is now converted to use tsleep_sbt(). With this change
nanosleep() and usleep() can handle sub-tick precision for timeouts.
Also, try to help coalesce of events passing as argument to tsleep_bt()
a precision value calculated as a percentage of the sleep time.
This percentage is default 5%, but it can tuned according to users
need via the sysctl interface.

Sponsored by:	Google Summer of Code 2012, iXsystems inc.
Tested by:	flo, marius, ian, markj, Fabian Keil
2013-03-04 15:57:41 +00:00
davide
2ca848c3f6 Fix build with DIAGNOSTIC/CALLOUT_PROFILING options turned on.
Reported by:	kib, David Wolfskill <david at catwhisker dot org>
Pointy-hat to:	davide
2013-03-04 15:03:52 +00:00
davide
8374b0e141 MFcalloutng:
Introduce sbt variants of msleep(), msleep_spin(), pause(), tsleep() in
the KPI, allowing to specify timeout in 'sbintime_t' rather than ticks.

Sponsored by:	Google Summer of Code 2012, iXsystems inc.
Tested by:	flo, marius, ian, markj, Fabian Keil
2013-03-04 12:48:41 +00:00
davide
7395c58e52 MFcalloutng:
Extend condvar(9) KPI introducing sbt variant of cv_timedwait. This
rely on the previously committed sleepq_set_timeout_sbt().

Sponsored by:	Google Summer of Code 2012, iXsystems inc.
Tested by:	flo, marius, ian, markj, Fabian Keil
2013-03-04 12:20:48 +00:00
davide
a893437175 MFcalloutng:
Convert sleepqueue(9) bits to the new callout KPI. Take advantage of
the possibility to run callback directly from hw interrupt context.

Sponsored by:	Google Summer of Code 2012, iXsystems inc.
Tested by:	flo, marius, ian, markj, Fabian Keil
2013-03-04 11:51:46 +00:00
davide
9ad2265733 MFcalloutng (r244355):
Make loadavg calculation callout direct. There are several reasons for it:
 - it is very simple and doesn't worth context switch to SWI;
 - since SWI is no longer used here, we can remove twelve years old hack,
excluding this SWI from from the loadavg statistics;
 - it fixes problem when eventtimer (HPET) shares interrupt with some other
device, and that interrupt thread counted as permanent loadavg of 1; now
loadavg accounted before that interrupt thread is scheduled.

Sponsored by:	Google Summer of Code 2012, iXsystems inc.
Tested by:	flo, marius, ian, Fabian Keil, markj
2013-03-04 11:22:19 +00:00
davide
431035cf16 - Make callout(9) tickless, relying on eventtimers(4) as backend for
precise time event generation. This greatly improves granularity of
callouts which are not anymore constrained to wait next tick to be
scheduled.
- Extend the callout KPI introducing a set of callout_reset_sbt* functions,
which take a sbintime_t as timeout argument. The new KPI also offers a
way for consumers to specify precision tolerance they allow, so that
callout can coalesce events and reduce number of interrupts as well as
potentially avoid scheduling a SWI thread.
- Introduce support for dispatching callouts directly from hardware
interrupt context, specifying an additional flag. This feature should be
used carefully, as long as interrupt context has some limitations
(e.g. no sleeping locks can be held).
- Enhance mechanisms to gather informations about callwheel, introducing
a new sysctl to obtain stats.

This change breaks the KBI. struct callout fields has been changed, in
particular 'int ticks' (4 bytes) has been replaced with 'sbintime_t'
(8 bytes) and another 'sbintime_t' field was added for precision.

Together with:	mav
Reviewed by:	attilio, bde, luigi, phk
Sponsored by:	Google Summer of Code 2012, iXsystems inc.
Tested by:	flo (amd64, sparc64), marius (sparc64), ian (arm),
		markj (amd64), mav, Fabian Keil
2013-03-04 11:09:56 +00:00
pjd
c3b73942d3 For some reason when I started to pass filedescent structures instead of
pointers to the file structure receiving descriptors stopped to work when also
at least few kilobytes of data is being send. In the kernel the
soreceive_generic() function doesn't see control mbuf as the first mbuf and
unp_externalize() is never called, first 6(?) kilobytes of data is missing as
well on receiving end.

This breaks for example tmux.

I don't know yet why going from 8 bytes to sizeof(struct filedescent) per
descriptor (or even to 16 bytes per descriptor) breaks things, but to
work-around it for now use 8 bytes per file descriptor at the cost of memory
allocation.

Reported by:	flo, Diane Bruce, Jan Beich <jbeich@tormail.org>
Simple testcase provided by:	mjg
2013-03-03 23:39:30 +00:00
pjd
386f382f2d Use dedicated malloc type for filecaps-related data, so we can detect any
memory leaks easier.
2013-03-03 23:25:45 +00:00
pjd
1df614f5db Plug memory leaks in file descriptors passing. 2013-03-03 23:23:35 +00:00
davide
bcd9ca99af callwheelmask and callwheelsize are always greater than zero.
Switch their type to u_int.
2013-03-03 15:01:33 +00:00
davide
ff0e3dad9e Remove a couple of unused include. 2013-03-03 14:47:02 +00:00
mav
1cd8093cdd MFcalloutng:
Some whitespace fixes.
2013-03-03 09:11:24 +00:00
pjd
369ed4d4ad Regen after r247667. 2013-03-02 21:12:54 +00:00
pjd
702516e70b - Implement two new system calls:
int bindat(int fd, int s, const struct sockaddr *addr, socklen_t addrlen);
	int connectat(int fd, int s, const struct sockaddr *name, socklen_t namelen);

  which allow to bind and connect respectively to a UNIX domain socket with a
  path relative to the directory associated with the given file descriptor 'fd'.

- Add manual pages for the new syscalls.

- Make the new syscalls available for processes in capability mode sandbox.

- Add capability rights CAP_BINDAT and CAP_CONNECTAT that has to be present on
  the directory descriptor for the syscalls to work.

- Update audit(4) to support those two new syscalls and to handle path
  in sockaddr_un structure relative to the given directory descriptor.

- Update procstat(1) to recognize the new capability rights.

- Document the new capability rights in cap_rights_limit(2).

Sponsored by:	The FreeBSD Foundation
Discussed with:	rwatson, jilles, kib, des
2013-03-02 21:11:30 +00:00
attilio
e98f58faf6 MFC 2013-03-02 14:48:41 +00:00
pjd
73650b4f3a If the target file already exists, check for the CAP_UNLINKAT capabiity right
on the target directory descriptor, but only if this is renameat(2) and real
target directory descriptor is given (not AT_FDCWD). Without this fix regular
rename(2) fails if the target file already exists.

Reported by:	Michael Butler <imb@protected-networks.net>
Reported by:	Larry Rosenman <ler@lerctr.org>
Sponsored by:	The FreeBSD Foundation
2013-03-02 09:58:47 +00:00
pjd
48e0f13795 Regen after r247602. 2013-03-02 00:55:09 +00:00
pjd
f07ebb8888 Merge Capsicum overhaul:
- Capability is no longer separate descriptor type. Now every descriptor
  has set of its own capability rights.

- The cap_new(2) system call is left, but it is no longer documented and
  should not be used in new code.

- The new syscall cap_rights_limit(2) should be used instead of
  cap_new(2), which limits capability rights of the given descriptor
  without creating a new one.

- The cap_getrights(2) syscall is renamed to cap_rights_get(2).

- If CAP_IOCTL capability right is present we can further reduce allowed
  ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
  ioctls can be retrived with cap_ioctls_get(2) syscall.

- If CAP_FCNTL capability right is present we can further reduce fcntls
  that can be used with the new cap_fcntls_limit(2) syscall and retrive
  them with cap_fcntls_get(2).

- To support ioctl and fcntl white-listing the filedesc structure was
  heavly modified.

- The audit subsystem, kdump and procstat tools were updated to
  recognize new syscalls.

- Capability rights were revised and eventhough I tried hard to provide
  backward API and ABI compatibility there are some incompatible changes
  that are described in detail below:

	CAP_CREATE old behaviour:
	- Allow for openat(2)+O_CREAT.
	- Allow for linkat(2).
	- Allow for symlinkat(2).
	CAP_CREATE new behaviour:
	- Allow for openat(2)+O_CREAT.

	Added CAP_LINKAT:
	- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
	- Allow to be target for renameat(2).

	Added CAP_SYMLINKAT:
	- Allow for symlinkat(2).

	Removed CAP_DELETE. Old behaviour:
	- Allow for unlinkat(2) when removing non-directory object.
	- Allow to be source for renameat(2).

	Removed CAP_RMDIR. Old behaviour:
	- Allow for unlinkat(2) when removing directory.

	Added CAP_RENAMEAT:
	- Required for source directory for the renameat(2) syscall.

	Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
	- Allow for unlinkat(2) on any object.
	- Required if target of renameat(2) exists and will be removed by this
	  call.

	Removed CAP_MAPEXEC.

	CAP_MMAP old behaviour:
	- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
	  PROT_WRITE.
	CAP_MMAP new behaviour:
	- Allow for mmap(2)+PROT_NONE.

	Added CAP_MMAP_R:
	- Allow for mmap(PROT_READ).
	Added CAP_MMAP_W:
	- Allow for mmap(PROT_WRITE).
	Added CAP_MMAP_X:
	- Allow for mmap(PROT_EXEC).
	Added CAP_MMAP_RW:
	- Allow for mmap(PROT_READ | PROT_WRITE).
	Added CAP_MMAP_RX:
	- Allow for mmap(PROT_READ | PROT_EXEC).
	Added CAP_MMAP_WX:
	- Allow for mmap(PROT_WRITE | PROT_EXEC).
	Added CAP_MMAP_RWX:
	- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).

	Renamed CAP_MKDIR to CAP_MKDIRAT.
	Renamed CAP_MKFIFO to CAP_MKFIFOAT.
	Renamed CAP_MKNODE to CAP_MKNODEAT.

	CAP_READ old behaviour:
	- Allow pread(2).
	- Disallow read(2), readv(2) (if there is no CAP_SEEK).
	CAP_READ new behaviour:
	- Allow read(2), readv(2).
	- Disallow pread(2) (CAP_SEEK was also required).

	CAP_WRITE old behaviour:
	- Allow pwrite(2).
	- Disallow write(2), writev(2) (if there is no CAP_SEEK).
	CAP_WRITE new behaviour:
	- Allow write(2), writev(2).
	- Disallow pwrite(2) (CAP_SEEK was also required).

	Added convinient defines:

	#define	CAP_PREAD		(CAP_SEEK | CAP_READ)
	#define	CAP_PWRITE		(CAP_SEEK | CAP_WRITE)
	#define	CAP_MMAP_R		(CAP_MMAP | CAP_SEEK | CAP_READ)
	#define	CAP_MMAP_W		(CAP_MMAP | CAP_SEEK | CAP_WRITE)
	#define	CAP_MMAP_X		(CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
	#define	CAP_MMAP_RW		(CAP_MMAP_R | CAP_MMAP_W)
	#define	CAP_MMAP_RX		(CAP_MMAP_R | CAP_MMAP_X)
	#define	CAP_MMAP_WX		(CAP_MMAP_W | CAP_MMAP_X)
	#define	CAP_MMAP_RWX		(CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
	#define	CAP_RECV		CAP_READ
	#define	CAP_SEND		CAP_WRITE

	#define	CAP_SOCK_CLIENT \
		(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
		 CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
	#define	CAP_SOCK_SERVER \
		(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
		 CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
		 CAP_SETSOCKOPT | CAP_SHUTDOWN)

	Added defines for backward API compatibility:

	#define	CAP_MAPEXEC		CAP_MMAP_X
	#define	CAP_DELETE		CAP_UNLINKAT
	#define	CAP_MKDIR		CAP_MKDIRAT
	#define	CAP_RMDIR		CAP_UNLINKAT
	#define	CAP_MKFIFO		CAP_MKFIFOAT
	#define	CAP_MKNOD		CAP_MKNODAT
	#define	CAP_SOCK_ALL		(CAP_SOCK_CLIENT | CAP_SOCK_SERVER)

Sponsored by:	The FreeBSD Foundation
Reviewed by:	Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with:	rwatson, benl, jonathan
ABI compatibility discussed with:	kib
2013-03-02 00:53:12 +00:00
jhb
8857575b13 Replace the TDP_NOSLEEPING flag with a counter so that the
THREAD_NO_SLEEPING() and THREAD_SLEEPING_OK() macros can nest.

Reviewed by:	attilio
2013-03-01 22:03:31 +00:00
pjd
54fb726d9b Remove unnecessary variables. 2013-03-01 21:58:56 +00:00
pjd
ffd2079a18 Reduce lock scope a little. 2013-03-01 21:57:02 +00:00
marius
11e6f714be - Use strdup(9) instead of reimplementing it.
- Use __DECONST instead of strange casts.
- Reduce code duplication and simplify name2oid().

PR:		176373
Submitted by:	Christoph Mallon
MFC after:	1 week
2013-03-01 18:49:14 +00:00
kib
0da1880cbc Make the default implementation of the VOP_VPTOCNP() fail if the
directory entry, matched by the inode number, is ".".

NFSv4 client might instantiate the distinct vnodes which have the same
inode number, since single v4 export can be combined from several
filesystems on the server.  For instance, a case when the nested
server mount point is exactly one directory below the top of the
export, causes directory and its parent to have the same inode number
2.  The vop_stdvptocnp() algorithm then returns "." as the name of the
lower directory.

Filtering out the "." entry with ENOENT works around this behaviour,
the error forces getcwd(3) to fall back to usermode implementation,
which compares both st_dev and st_ino.

Based on the submission by:	rmacklem
Tested by:	rmacklem
MFC after:	1 week
2013-03-01 18:40:14 +00:00
davide
9f70262c67 MFcalloutng:
Style fixes.
2013-02-28 16:22:49 +00:00
mav
6cf7cc6e4d MFcalloutng:
Switch eventtimers(9) from using struct bintime to sbintime_t.
Even before this not a single driver really supported full dynamic range of
struct bintime even in theory, not speaking about practical inexpediency.
This change legitimates the status quo and cleans up the code.
2013-02-28 13:46:03 +00:00
davide
2bf12d0c7c MFcalloutng:
When CPU becomes idle, cpu_idleclock() calculates time to the next timer
event in order to reprogram hw timer. Return that time in sbintime_t to
the caller and pass it to acpi_cpu_idle(), where it can be used as one
more factor (quite precise) to extimate furter sleep time and choose
optimal sleep state. This is a preparatory change for further callout
improvements will be committed in the next days.

The commmit is not targeted for MFC.
2013-02-28 10:46:54 +00:00
attilio
52c57fbbdb MFC 2013-02-27 18:17:34 +00:00
kib
4bffdc6be7 Make recursive getblk() slightly more useful. Keep the buffer state
intact if getblk() is done on the already owned buffer.  Exit from
brelse() early when the lock recursion is detected, otherwise brelse()
might prematurely destroy the buffer under some circumstances.

Sponsored by:	The FreeBSD Foundation
Noted by:	mckusick
Tested by:	pho
MFC after:	2 weeks
2013-02-27 07:34:09 +00:00
attilio
820ab571ec MFC 2013-02-26 21:09:35 +00:00
mav
fdde785247 Add support for good old 8192Hz profiling clock to software PMC.
Reviewed by:	fabient
2013-02-26 18:13:42 +00:00
attilio
756a9b3e47 MFC 2013-02-26 01:05:25 +00:00
attilio
cb47f0509b Merge from vmobj-rwlock branch:
Remove unused inclusion of vm/vm_pager.h and vm/vnode_pager.h.

Sponsored by:	EMC / Isilon storage division
Tested by:	pho
Reviewed by:	alc
2013-02-26 01:00:11 +00:00
pjd
9dc77e1662 Style.
Suggested by:	kib
2013-02-25 20:51:29 +00:00
pjd
d632a8a17e After r237012, the fdgrowtable() doesn't drop the filedesc lock anymore,
so update a stale comment.

Reviewed by:	kib, keramida
2013-02-25 20:50:08 +00:00
attilio
d883da7ba4 MFC 2013-02-21 21:59:35 +00:00
attilio
905e648d42 Hide the details for the assertion for VM_OBJECT_LOCK operations.
Rename current VM_OBJECT_LOCK_ASSERT(foo, RA_WLOCKED) into
VM_OBJECT_ASSERT_WLOCKED(foo)

Sponsored by:	EMC / Isilon storage division
Requested by:	alc
2013-02-21 21:54:53 +00:00
jhb
ca1e2e0739 Further refine the handling of stop signals in the NFS client. The
changes in r246417 were incomplete as they did not add explicit calls to
sigdeferstop() around all the places that previously passed SBDRY to
_sleep().  In addition, nfs_getcacheblk() could trigger a write RPC from
getblk() resulting in sigdeferstop() recursing.  Rather than manually
deferring stop signals in specific places, change the VFS_*() and VOP_*()
methods to defer stop signals for filesystems which request this behavior
via a new VFCF_SBDRY flag.  Note that this has to be a VFC flag rather than
a MNTK flag so that it works properly with VFS_MOUNT() when the mount is
not yet fully constructed.  For now, only the NFS clients are set this new
flag in VFS_SET().

A few other related changes:
- Add an assertion to ensure that TDF_SBDRY doesn't leak to userland.
- When a lookup request uses VOP_READLINK() to follow a symlink, mark
  the request as being on behalf of the thread performing the lookup
  (cnp_thread) rather than using a NULL thread pointer.  This causes
  NFS to properly handle signals during this VOP on an interruptible
  mount.

PR:		kern/176179
Reported by:	Russell Cattelan (sigdeferstop() recursion)
Reviewed by:	kib
MFC after:	1 month
2013-02-21 19:02:50 +00:00
attilio
8746bf6a5f MFC 2013-02-21 15:06:19 +00:00
jamie
7e78ded6e9 Don't worry if a module is already loaded when looking for a fstype to mount
(possible in a race condition).

Reviewed by:	kib
MFC after:	1 week
2013-02-21 02:41:37 +00:00
attilio
15bf891afe Rename VM_OBJECT_LOCK(), VM_OBJECT_UNLOCK() and VM_OBJECT_TRYLOCK() to
their "write" versions.

Sponsored by:	EMC / Isilon storage division
2013-02-20 12:03:20 +00:00
attilio
658534ed5a Switch vm_object lock to be a rwlock.
* VM_OBJECT_LOCK and VM_OBJECT_UNLOCK are mapped to write operations
* VM_OBJECT_SLEEP() is introduced as a general purpose primitve to
  get a sleep operation using a VM_OBJECT_LOCK() as protection
* The approach must bear with vm_pager.h namespace pollution so many
  files require including directly rwlock.h
2013-02-20 10:38:34 +00:00
jhb
82e5f32ac4 Fix a few typos. 2013-02-19 16:35:27 +00:00
pjd
827a11a3e0 Update the comment: we do show the backtrace of misbehaving thread. 2013-02-17 21:37:32 +00:00
pjd
fdfa7cdb9e Style. 2013-02-17 11:56:36 +00:00
pjd
c0b1af2f13 - Require CAP_FSYNC capability right when opening a file with O_SYNC or O_FSYNC
flags.
- While here simplify check for locking flags.

Sponsored by:	The FreeBSD Foundation
2013-02-17 11:53:51 +00:00
pjd
06ac2f97fd Remove redundant parenthesis. 2013-02-17 11:49:21 +00:00
pjd
1f59e8bcc8 Remove redundant space. 2013-02-17 11:48:16 +00:00
pjd
30ca80d1fc Add break to the default case. 2013-02-17 11:47:58 +00:00
pjd
3a0f30d9ae Don't treat pointers as booleans. 2013-02-17 11:47:30 +00:00
pjd
f88c314f24 Remove redundant parenthesis. 2013-02-17 11:47:01 +00:00
mckusick
c04f4382b3 Add barrier write capability to the VFS buffer interface. A barrier
write is a disk write request that tells the disk that the buffer
being written must be committed to the media along with any writes
that preceeded it before any future blocks may be written to the drive.

Barrier writes are provided by adding the functions bbarrierwrite
(bwrite with barrier) and babarrierwrite (bawrite with barrier).

Following a bbarrierwrite the client knows that the requested buffer
is on the media. It does not ensure that buffers written before that
buffer are on the media. It only ensure that buffers written before
that buffer will get to the media before any buffers written after
that buffer. A flush command must be sent to the disk to ensure that
all earlier written buffers are on the media.

Reviewed by: kib
Tested by:   Peter Holm
2013-02-16 14:51:30 +00:00
ian
683ffef109 Add PPS_CANWAIT support for time_pps_fetch(). This adds support for all three
blocking modes described in section 3.4.3 of RFC 2783, allowing the caller
to retrieve the most recent values without blocking, to block for a specified
time, or to block forever.

Reviewed by:	discussion on hackers@
2013-02-15 18:30:32 +00:00
pluknet
a846ac5e8a vn_io_faults_cnt:
- use u_long consistently
- use SYSCTL_ULONG to match the type of variable

Reviewed by:	kib
MFC after:	1 week
2013-02-15 14:22:05 +00:00
pluknet
b585f0db6b Add support of passing SCM_BINTIME ancillary data object for PF_LOCAL
sockets.

PR:		kern/175883
Submitted by:	Andrey Simonenko <simon@comsys.ntu-kpi.kiev.ua>
Discussed with:	glebius, phk
MFC after:	2 weeks
2013-02-15 13:00:20 +00:00
ian
cfdf5de939 Make the F_READAHEAD option to fcntl(2) work as documented: a value of zero
now disables read-ahead.  It used to effectively restore the system default
readahead hueristic if it had been changed; a negative value now restores
the default.

Reviewed by:	kib
2013-02-13 15:09:16 +00:00
kib
bd7f0fa0bb Reform the busdma API so that new types may be added without modifying
every architecture's busdma_machdep.c.  It is done by unifying the
bus_dmamap_load_buffer() routines so that they may be called from MI
code.  The MD busdma is then given a chance to do any final processing
in the complete() callback.

The cam changes unify the bus_dmamap_load* handling in cam drivers.

The arm and mips implementations are updated to track virtual
addresses for sync().  Previously this was done in a type specific
way.  Now it is done in a generic way by recording the list of
virtuals in the map.

Submitted by:	jeff (sponsored by EMC/Isilon)
Reviewed by:	kan (previous version), scottl,
	mjacob (isp(4), no objections for target mode changes)
Discussed with:	     ian (arm changes)
Tested by:	marius (sparc64), mips (jmallet), isci(4) on x86 (jharris),
	amd64 (Fabian Keil <freebsd-listen@fabiankeil.de>)
2013-02-12 16:57:20 +00:00
marius
27d6f70455 Update comments to reflect r246689. 2013-02-11 23:05:10 +00:00
marius
eee08c3a98 Make SYSCTL_{LONG,QUAD,ULONG,UQUAD}(9) work as advertised and also handle
constant values.

Reviewed by:	kib
MFC after:	3 days
2013-02-11 21:50:00 +00:00
kib
f24ec7b4ea Remove the ia64-specific code fragment, which effect is more cleanly
done by the call to trans_prot() function a line before.

Discussed with:	Oliver Pinter <oliver.pntr@gmail.com>
MFC after:	1 week
2013-02-10 20:08:33 +00:00
avg
ce73fe0e97 ktr: correctly handle possible wrap-around in the boot buffer
Older entries should be 'before' newer entries in the new buffer too
and there should be no zero-filled gap between them.

Pointed out by:	jhb
MFC after:	3 days
X-MFC with:	r246282
2013-02-08 07:29:07 +00:00
kib
de5bb93ec1 When vforked child is traced, the debugging events are not generated
until child performs exec().  The behaviour is reasonable when a
debugger is the real parent, because the parent is stopped until
exec(), and sending a debugging event to the debugger would deadlock
both parent and child.

On the other hand, when debugger is not the parent of the vforked
child, not sending debugging signals makes it impossible to debug
across vfork.

Fix the issue by declining generating debug signals only when vfork()
was done and child called ptrace(PT_TRACEME).  Set a new process flag
P_PPTRACE from the attach code for PT_TRACEME, if P_PPWAIT flag is
set, which indicates that the process was created with vfork() and
still did not execed. Check P_PPTRACE from issignal(), instead of
refusing the trace outright for the P_PPWAIT case.  The scope of
P_PPTRACE is exactly contained in the scope of P_PPWAIT.

Found and tested by:  zont
Reviewed by:	pluknet
MFC after:	2 weeks
2013-02-07 15:34:22 +00:00
kib
92d95b8406 Stop translating the ERESTART error from the open(2) into EINTR.
Posix requires that open(2) is restartable for SA_RESTART.

For non-posix objects, in particular, devfs nodes, still disable
automatic restart of the opens. The open call to a driver could have
significant side effects for the hardware.

Noted and reviewed by:	jilles
Discussed with:	bde
MFC after:	2 weeks
2013-02-07 14:53:33 +00:00
neel
3cb1652f75 If an interrupt event's assign_cpu method fails, then restore the original
cpuset mask for the associated interrupt thread.

The text used above is verbatim from r195249 and the code should now be
in line with the intent of that commit.
2013-02-07 06:48:47 +00:00
pjd
1d08d5f677 Audit sockaddr argument for bind(2), connect(2), accept(2), sendto(2) and
recvfrom(2) syscalls.

Sponsored by:	The FreeBSD Foundation
2013-02-07 00:36:00 +00:00
pjd
8753fea2d8 Minor style tweaks. 2013-02-07 00:27:11 +00:00
jhb
0fee3f66b8 Rework the handling of stop signals in the NFS client. The changes in
195702, 195703, and 195821 prevented a thread from suspending while holding
locks inside of NFS by forcing the thread to fail sleeps with EINTR or
ERESTART but defer the thread suspension to the user boundary.  However,
this had the effect that stopping a process during an NFS request could
abort the request and trigger EINTR errors that were visible to userland
processes (previously the thread would have suspended and completed the
request once it was resumed).

This change instead effectively masks stop signals while in the NFS client.
It uses the existing TDF_SBDRY flag to effect this since SIGSTOP cannot
be masked directly.  Also, instead of setting PBDRY on individual sleeps,
the NFS client now sets the TDF_SBDRY flag around each NFS request and
stop signals are masked for all sleeps during that region (the previous
change missed sleeps in lockmgr locks).  The end result is that stop
signals sent to threads performing an NFS request are completely
ignored until after the NFS request has finished processing and the
thread prepares to return to userland.  This restores the behavior of
stop signals being transparent to userland processes while still
preventing threads from suspending while holding NFS locks.

Reviewed by:	kib
MFC after:	1 month
2013-02-06 17:06:51 +00:00
pluknet
85b0c7ccf9 Prezero the acl structure which is to be copied to usermode, to avoid
leakage of the previous content of padding and unitialized fields.

Reported by:	Ilia Noskov <noskov@nic.ru>
Reviewed by:	kib
MFC after:	1 week
2013-02-06 15:18:46 +00:00
pluknet
eb7eb10dd6 Remove reference to the rlist code from comments, and fix a typo visible
in the resulted change.

Reviewed by:	kib
MFC after:	1 week
2013-02-05 20:08:33 +00:00
avg
cfd3c02e72 ktr: prevent possible footshooting with KTR_ENTRIES and KTR_BOOT_ENTRIES
Suggested by:	adrian
MFC after:	14 days
X-MFC with:	r246282
2013-02-04 21:58:57 +00:00
avg
3a33f3e282 ktr: copy content from the early static buffer if KTR_ENTRIES !=
KTR_BOOT_ENTRIES

Reported by:	glebius, jhb
Pointyhat to:	avg
MFC after:	14 days
X-MFC with:	r246282
2013-02-04 21:50:55 +00:00
marius
790d2fce4f Try to improve r242655 take III: move these SYSCTLs describing the kernel
map, which is defined and initialized in vm/vm_kern.c, to the latter.

Submitted by:	alc
2013-02-04 09:35:48 +00:00
marius
a41f9579b9 Further improve r242655 and supply VM_{MIN,MAX}_KERNEL_ADDRESS as constant
values to SYSCTL_ULONG(9) where possible.

Submitted by:	bde
2013-02-03 21:43:55 +00:00
avg
08016a40d7 allow for large KTR_ENTRIES values by allocating ktr_buf using malloc(9)
Only during very early boot, before malloc(9) is functional (SI_SUB_KMEM),
the static ktr_buf_init is used.  Size of the static buffer is determined
by a new kernel option KTR_BOOT_ENTRIES.  Its default value is 1024.

This commit builds on top of r243046.

Reviewed by:	alc
MFC after:	17 days
2013-02-03 09:57:39 +00:00
avg
223577af14 fix some fat-fingering in r246246
Submitted by:	mjg
Pointyhat to:	avg
MFC after:	5 days
X-MFC with:	r246246
2013-02-02 14:19:50 +00:00
avg
8e238c660c print compiler version in the kernel banner
And provide kernel compiler version as a sysctl as well.
This is useful while we have gcc and clang cohabitation.
This could be even more useful when we have support
for external toolchains.

In cooperation with:	mjg
MFC after:		13 days
2013-02-02 11:58:35 +00:00
gber
d005d97a01 Get time of next event from other cores only if SMP is already started.
Reviewed by: mav
Obtained from: Semihalf
2013-02-01 11:39:03 +00:00
pjd
2163564eab Now that MPSAFE flag is gone, we can arrange code a bit better. 2013-01-31 22:20:05 +00:00
pjd
8a682d18ff Remove leftover label after Giant removal from VFS. 2013-01-31 22:15:41 +00:00