Commit Graph

14805 Commits

Author SHA1 Message Date
John Baldwin
6270fa5f72 Mark proc0 as a kernel process via the P_KTHREAD flag.
All other kernel processes have this flag set and all threads in proc0
(including thread0) have the similar TDP_KTHREAD flag set.

PR:		204999
Submitted by:	Oliver Pinter @ HardenedBSD
Reviewed by:	kib
MFC after:	1 week
2016-02-08 23:06:27 +00:00
Konstantin Belousov
2dab579bfc Remove the assert which outlived its usefulness, and, by default,
disable compilation of the code which made it possible to call
stop_all_proc() from usermode at all.

Move the comment to the preamble of stop_all_proc() and reword it to
give overview of the function intent.

proc0 has P_HADTHREADS flag set due to kthread_add(), but no
P_KTHREAD, which triggered the assert, which does not serve a purpose
now.

Reported by:	Oliver Pinter
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-02-08 10:54:27 +00:00
Jilles Tjoelker
f00fb5457e semget(): Check for [EEXIST] error first.
Although POSIX literally permits failing with [EINVAL] if IPC_CREAT and
IPC_EXCL were both passed, the semaphore set already exists and has fewer
semaphores than nsems, this does not allow an application to retry safely:
if the [EINVAL] is actually because of the semmsl limit, an infinite loop
would result.

PR:		206927
2016-02-07 22:12:39 +00:00
Pedro F. Giffuni
7559912039 Minor grammar fix in comment. 2016-02-07 16:18:12 +00:00
Kirk McKusick
b00b459084 Clarify a comment in kern_openat() about the use of falloc_noinstall().
Suggested by: Steve Jacobson
2016-02-07 01:04:47 +00:00
Mateusz Guzik
0c829a301d fork: ansify sys_pdfork
No functional changes.
2016-02-06 09:01:03 +00:00
John Baldwin
5652770d8f Rename aiocblist to kaiocb and use consistent variable names.
Typically <foo>list is used for a structure that holds a list head in
FreeBSD, not for members of a list.  As such, rename 'struct aiocblist'
to 'struct kaiocb' (the kernel version of 'struct aiocb').

While here, use more consistent variable names for AIO control blocks:

- Use 'job' instead of 'aiocbe', 'cb', 'cbe', or 'iocb' for kernel job
  objects.
- Use 'jobn' instead of 'cbn' for use with TAILQ_FOREACH_SAFE().
- Use 'sjob' and 'sjobn' instead of 'scb' and 'scbn' for fsync jobs.
- Use 'ujob' instead of 'aiocbp', 'job', 'uaiocb', or 'uuaiocb' to hold
  a user pointer to a 'struct aiocb'.
- Use 'ujobp' instead of 'aiocbp' for a user pointer to a 'struct aiocb *'.

Reviewed by:	kib
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D5125
2016-02-05 20:38:09 +00:00
Konstantin Belousov
af582aaed1 When matching brand to the ELF binary by notes, try to find a brand
with interpreter name exactly matching one wanted by the binary.  If
no such brand exists, return first brand which accepted the binary by
note.

The change fixes a regression after r292749, where e.g. our two ia32
compat brands, ia32_brand_info and ia32_brand_oinfo, only differ by
the interpeter path and binary matches to a brand by linkage order.
Then old binaries which require /usr/libexec/ld-elf.so.1 but matched
against ia32_brand_info with interp_path /libexec/ld-elf.so.1, were
considered requiring non-standard interpreter name, and magic to force
ld-elf32.so.1 did not happen.

Note that it might make sense to apply the same selection of brands
for other matching criteria, SCO EI_OSABI and 3.x string.

Reported and tested by:	dwmalone
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2016-02-04 20:55:49 +00:00
Konstantin Belousov
76c404fce5 Do not copy by field when converting struct oexport_args to struct
export_args on mount update, bzero() is consistent with
vfs_oexport_conv().
Make the code structure more explicit by using switch.
Return EINVAL if export option layout (deduced from size) is unknown.

Based on the submission by:	bde
Sponsored by:	The FreeBSD Foundation
2016-02-04 16:32:21 +00:00
Konstantin Belousov
4732ae438a Guard against runnable td2 exiting and than being reused for unrelated
process when the parent sleeps waiting for the debugger attach on
fork.

Diagnosed and reviewed by:	mjg
Sponsored by:	The FreeBSD Foundation
2016-02-04 10:49:34 +00:00
Mateusz Guzik
813361c140 fork: plug a use after free of the returned process
fork1 required its callers to pass a pointer to struct proc * which would
be set to the new process (if any). procdesc and racct manipulation also
used said pointer.

However, the process could have exited prior to do_fork return and be
automatically reaped, thus making this a use-after-free.

Fix the problem by letting callers indicate whether they want the pid or
the struct proc, return the process in stopped state for the latter case.

Reviewed by:	kib
2016-02-04 04:25:30 +00:00
Mateusz Guzik
33fd9b9a2b fork: pass arguments to fork1 in a dedicated structure
Suggested by:	kib
2016-02-04 04:22:18 +00:00
Gleb Smirnoff
e60b2fcbeb Redo r292484. Embed task(9) into zone, so that uz_maxaction is called
in a context that can sleep, allowing consumers of the KPI to run their
drain routines without any extra measures.

Discussed with:	jtl
2016-02-03 23:30:17 +00:00
Gleb Smirnoff
9542ea7b80 Move uma_dbg_alloc() and uma_dbg_free() into uma_core.c, which allows
to make uma_dbg.h not depend on uma_int.h, which allows to uninclude
uma_int.h from the mbuf(9) allocator.
2016-02-03 22:02:36 +00:00
Alfred Perlstein
7325dfbb59 Increase max allowed backlog for listen sockets
from short to int.

PR: 203922
Submitted by: White Knight <white_knight@2ch.net>
MFC After: 4 weeks
2016-02-02 05:57:59 +00:00
Gleb Smirnoff
8ec07310fa These files were getting sys/malloc.h and vm/uma.h with header pollution
via sys/mbuf.h
2016-02-01 17:41:21 +00:00
Andrew Turner
f9c2ec64be Fix the logic in the ddb command 'show ktr /a'. Prior to r118269 it would
print until cncheckc returned a non -1, i.e. a character had been entered.
After this change it would print only if cncheckc returned a character.
As this was before each call to db_mach_vtrace the normal outcome was
nothing was printed.

With this change 'show ktr /a' will now keep printing until the user stops
the command with a key press, or there is no more entries to print.
2016-01-31 17:32:20 +00:00
Eric van Gyzen
0e3d6ed44e kqueue EVFILT_PROC: avoid collision between NOTE_CHILD and NOTE_EXIT
NOTE_CHILD and NOTE_EXIT return something in kevent.data: the parent
pid (ppid) for NOTE_CHILD and the exit status for NOTE_EXIT.
Do not let the two events be combined, since one would overwrite
the other's data.

PR:		180385
Submitted by:	David A. Bright <david_a_bright@dell.com>
Reviewed by:	jhb
MFC after:	1 month
Sponsored by:	Dell Inc.
Differential Revision:	https://reviews.freebsd.org/D4900
2016-01-28 20:24:15 +00:00
Kirk McKusick
d2a28cb080 The bread() function was inconsistent about whether it would return
a buffer pointer in the event of an error (for some errors it would
return a buffer pointer and for other errors it would not return a
buffer pointer). The cluster_read() function was similarly inconsistent.

Clients of these functions were inconsistent in handling errors.
Some would assume that no buffer was returned after an error and
would thus lose buffers under certain error conditions. Others would
assume that brelse() should always be called after an error and
would thus panic the system under certain error conditions.

To correct both of these problems with minimal code churn, bread()
and cluster_write() now always free the buffer when returning an
error thus ensuring that buffers will never be lost. The brelse()
routine checks for being passed a NULL buffer pointer and silently
returns to avoid panics. Thus both approaches to handling error
returns from bread() and cluster_read() will work correctly.

Future code should be written assuming that bread() and cluster_read()
will never return a buffer with an error, so should not attempt to
brelse() the buffer when an error is returned.

Reviewed by: kib
2016-01-27 21:23:01 +00:00
Mateusz Guzik
4dd3a21fb3 ktrace: tidy up ktrstruct
- minor style fixes
- avoid doing strlen twice [1]

PR:		206648
Submitted by:	C Turt <ecturt gmail.com> (original version) [1]
2016-01-27 19:55:02 +00:00
Justin Hibbits
2dd1bdf183 Convert rman to use rman_res_t instead of u_long
Summary:
Migrate to using the semi-opaque type rman_res_t to specify rman resources.  For
now, this is still compatible with u_long.

This is step one in migrating rman to use uintmax_t for resources instead of
u_long.

Going forward, this could feasibly be used to specify architecture-specific
definitions of resource ranges, rather than baking a specific integer type into
the API.

This change has been broken out to facilitate MFC'ing drivers back to 10 without
breaking ABI.

Reviewed By: jhb
Sponsored by:	Alex Perez/Inertial Computing
Differential Revision: https://reviews.freebsd.org/D5075
2016-01-27 02:23:54 +00:00
John Baldwin
0dd6c0352b Various style fixes.
- Wrap long lines.
- Fix indentation.
- Remove excessive parens.
- Whitespace fixes in struct definitions.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D5025
2016-01-26 21:24:49 +00:00
Konstantin Belousov
2cd5358a66 Don't clear the software flow control flag before draining for last
close or assert the bug that it is clear when leaving.

Remove an unrelated rotted comment that was attached to the buggy
clearing.

Since draining is not done in more cases, flushing is needed in more
cases, so start fixing flushing:
- do a full flush in ttydisc_close().  State what POSIX requires more
  clearly.  This was missing ttydevsw_pktnotify() calls to tell the
  devsw layer to flush.  Hardware tty drivers don't actually flush
  since they don't understand this API.
- fix 2 missing wakeups in tty_flush().  Most of the wakeups here are
  unnecessary for last close.  But ttydisc_close() did one of the
  missing ones.

This flow control bug ameliorated the design bug of requiring
potentially unbounded waits in draining.  Software flow control is the
easiest way to get an unbounded wait, and a long wait is sometimes
actually useful.  Users can type the xoff character on the receiver
and (if ixon is set on the sender) expect the output to be held until
the user is ready for more.

Hardware flow control can also give the unbounded wait, and this bug
didn't affect hardware flow control.  Unbounded waits from hardware
flow control take a more unusual configuration.  E.g., a terminal
program that controls the modem status lines, or unplugging the cable
in a configuration where this doesn't break the connection.

The design bug is still ameliorated by a newer bug in draining for
last close -- the 1 second timeout.  E.g., if the user types the
xoff character and the sender reaches last close, then output is
not resumed and the wait times out after just 1 second.  This is
broken, but preferable to an unbounded wait.  Before this change,
the output was resumed immediately and usually completed.

Submitted by:	bde
MFC after:	2 weeks
2016-01-26 14:46:39 +00:00
Edward Tomasz Napierala
ed81020097 Fix the way RCTL handles rules' rrl_exceeded on credenials change.
Because of what this variable does, it was probably harmless - but
still incorrect.

MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-01-26 11:28:55 +00:00
Konstantin Belousov
88d74d64d7 Restore flushing of output for revoke(2) again. Document revoke()'s
intended behaviour in its man page.  Simplify tty_drain() to match.
Don't call ttydevsw methods in tty_flush() if the device is gone
since we now sometimes call it then.

The flushing was supposed to be implemented by passing the FNONBLOCK
flag to VOP_CLOSE() for revoke().  The tty driver is one of the few
that can block in close and was one of the fewer that knew about this.

This almost worked in FreeBSD-1 and similarly in Net/2.  These
versions only almost worked because there was and is considerable
confusion between IO_NDELAY and FNONBLOCK (aka O_NONBLOCK).  IO_NDELAY
is only valid for VOP_READ() and VOP_WRITE().  For other VOPs it has
the same value as O_SHLOCK.  But since vfs_subr.c and tty.c
consistently used the wrong flag and the O_SHLOCK flag is rarely set,
this mostly worked.  It also gave the feature than applications could
get the non-blocking close by abusing O_SHLOCK.

This was first broken then fixed in 1995.  I changed only the tty
driver to use FNONBLOCK, as a hack to get non-blocking via the normal
flag FNONBLOCK for last closes.  I didn't know about revoke()'s use
of IO_NDELAY or change it to be consistent, so revoke() was broken.
Then I changed revoke() to match.

This was next broken in 1997 then fixed in 1998.  Importing Lite2 made
the flags inconsistent again by undoing the fix only in vfs_subr.c.

This was next broken in 2008 by replacing everything in tty.c and not
checking any flags in last close.  Other bugs in draining limited the
resulting unbounded waits to drain in some cases.

It is now possible to fix this better using the new FREVOKE flag.
Just restore flushing for revoke() for now.  Don't restore or undo any
hacks for ordinary last closes yet.  But remove dead code in the
1-second relative timeout (r272789).  This did extra work to extend
the buggy draining for revoke() for as long as possible.  The 1-second
timeout made this not very long by usually flushing after 1 second.

Submitted by:	bde
MFC after:	2 weeks
2016-01-26 07:57:44 +00:00
Mark Johnston
87b87bb853 Evaluate the sysctl_running fail point before taking the sysctl lock.
The fail point handler may sleep, but this is not permitted while holding a
rm read lock.

MFC after:	1 week
Sponsored by:	EMC / Isilon Storage Division
2016-01-26 01:15:18 +00:00
Marius Strobl
57169cea64 - Make the code consistent with itself style-wise and bring it closer
to style(9).
- Mark unused arguments as such.
- Make the ttystates table const.
2016-01-25 22:58:06 +00:00
Konstantin Belousov
2e77021e4c Don't allow opening the callout device when the callin device is already
open (in disguise as the console device).  The only allowed combination
was supposed to be the callin device with the console.

Fix the assertion in ttydev_close() that was meant to detect this (it
only detected all 3 devices being open).  Assert this in ttydev_open()
too.

Submitted by:	bde
MFC after:	2 weeks
2016-01-25 16:47:20 +00:00
Konstantin Belousov
3593a18a91 Fix the %b flags string for ddb. All bits above the 5th
(TF_OPENED_CONS) were broken in r188147 by adding TF_OPENED_CONS
without updating the string.  It was especially confusing to display
OPENED_CONS as GONE and BYPASS as ZOMBIE.  2 flags at the end were
not updated in r188487.

Don't print an extra 0x prefix for %p in a ddb command.  In the rest
of the kernel there are more than 6000 lines with %p and only about
40 with this bug.

Print a non-extra 0x prefix for %b in a ddb command.  In the rest
of the kernel, there are approx. 180 lines with %b and 2/3 of them
have this bug.

Submitted by:	bde
MFC after:	2 weeks
2016-01-25 15:37:01 +00:00
Alexander V. Chernikov
61eee0e202 MFP r287070,r287073: split radix implementation and route table structure.
There are number of radix consumers in kernel land (pf,ipfw,nfs,route)
  with different requirements. In fact, first 3 don't have _any_ requirements
  and first 2 does not use radix locking. On the other hand, routing
  structure do have these requirements (rnh_gen, multipath, custom
  to-be-added control plane functions, different locking).
Additionally, radix should not known anything about its consumers internals.

So, radix code now uses tiny 'struct radix_head' structure along with
  internal 'struct radix_mask_head' instead of 'struct radix_node_head'.
  Existing consumers still uses the same 'struct radix_node_head' with
  slight modifications: they need to pass pointer to (embedded)
  'struct radix_head' to all radix callbacks.

Routing code now uses new 'struct rib_head' with different locking macro:
  RADIX_NODE_HEAD prefix was renamed to RIB_ (which stands for routing
  information base).

New net/route_var.h header was added to hold routing subsystem internal
  data. 'struct rib_head' was placed there. 'struct rtentry' will also
  be moved there soon.
2016-01-25 06:33:15 +00:00
Konstantin Belousov
fa28b6e7da In tty_dealloc(), clear the queues. See the comment for a scenario
which explains why ttydev_leave() cleanup might not happen.

Submitted by:	bde
MFC after:	3 weeks
2016-01-22 20:38:46 +00:00
Konstantin Belousov
6adf19481c The struct file f_advice member is overlaid with the devfs f_cdevpriv
data.  If vnode bypass for devfs file failed, vn_read/vn_write are
called and might try to dereference f_advice.  Limit the accesses to
f_advice to VREG vnodes only, which is the type ensured by
posix_fadvise().

The f_advice for regular files is protected by mtxpool lock.  Recheck
that f_advice is not NULL after lock is taken.

Reported and tested by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
2016-01-22 20:35:20 +00:00
Gleb Smirnoff
33a2a37b86 - Separate sendfile(2) implementation from uipc_syscalls.c into
separate file.  Claim my copyright.
- Provide more comments, better function and structure names.
- Sort out unneeded includes from resulting two files.

No functional changes.
2016-01-22 02:23:18 +00:00
John Baldwin
39314b7d99 AIO daemons have always been kernel processes to facilitate switching to
user VM spaces while servicing jobs.  Update various comments and data
structures that refer to AIO daemons as threads to refer to processes
instead.

Reviewed by:	kib
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D4999
2016-01-21 02:20:38 +00:00
John Baldwin
4429f0e2e3 Remove unused variables for socket AIO.
In r55943, a per-process queue of pending socket AIO requests (requests
waiting for the socket to become ready) was added so that they could be
cancelled during process rundown.  In r154765, the rundown code was
changed to handle jobs in this state (JOBST_JOBQSOCK) directly removing
the need for the extra queue.  However, the per-process queue head and
global lock were never removed.

Reviewed by:	kib
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D4997
2016-01-21 01:28:31 +00:00
Mateusz Guzik
b0632ab432 cache: minor changes
1. vhold and zap immediately instead of postponing few lines later
2. increment numneg after new entry is added

No functional changes.

No objections:		kib
2016-01-21 01:09:39 +00:00
Mateusz Guzik
baa2bcf572 cache: perform . lockup without the namecache lock
Reviewed by:	kib
2016-01-21 01:07:05 +00:00
Mateusz Guzik
db709ecbcc cache: provide a helper for computing the hash
Reviewed by:	kib
2016-01-21 01:05:41 +00:00
Mateusz Guzik
76583fa294 cache: use counter(9) API to maintain statistics
Previously the code would just increment statistics while only holding a
shared lock, in effect losing updates.

Separate tracking for nchstats is removed as values can be obtained from
existing counters. Note that some fields are updated by external
consumers and are left unfixed. This should not be a serious issue as
this structure looks quite obsolete.

No strong objections: kib
2016-01-21 01:04:03 +00:00
Mateusz Guzik
17182b1351 session: avoid proctree lock on proc exit when possible
We can get away with the common case with only proc lock held.

Reviewed by:	kib
2016-01-20 23:33:58 +00:00
Mateusz Guzik
6760182c5d session: tidy up fixjobc
This stops abusing the 'p' pointer for iteration over children processes
and gets rid of useless locking around PRS_ZOMBIE check.

Suggested by:	kib
2016-01-20 23:22:36 +00:00
Marius Strobl
9750d9e5b6 Fix tty_drain() and, thus, TIOCDRAIN of the current tty(4) incarnation
to actually wait until the TX FIFOs of UARTs have be drained before
returning. This is done by bringing the equivalent of the TS_BUSY flag
found in the previous implementation back in an ABI-preserving way.
Reported and tested by: Patrick Powell

Most likely, drivers for USB-serial-adapters likewise incorporating
TX FIFOs as well as other terminal devices that buffer output in some
form should also provide implementations of tsw_busy.

MFC after:	3 days
2016-01-19 23:34:27 +00:00
John Baldwin
8a4dc40ff4 Various cleanups to the main function for AIO kernel processes:
- Pull the vmspace logic out into helper functions and reduce duplication.
  Operations on the vmspace are all isolated to vm_map.c, but it now exports
  a new 'vmspace_switch_aio' for use by AIO kernel processes.
- When an AIO kernel process wants to exit, break out of the main loop and
  perform cleanup after the loop end.  This reduces a lot of indentation and
  allows cleanup to more closely mirror setup actions before the loop starts.
- Convert a DIAGNOSTIC to KASSERT().
- Replace mycp with more typical 'p'.

Reviewed by:	kib
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D4990
2016-01-19 21:37:51 +00:00
John Baldwin
f2e7f06a0d Don't create a dedicated session for each AIO kernel process.
This code dates back to the initial AIO support and the commit log does
not explain why it is needed.  However, I cannot find anything in the
AIO code or the various file methods (fo_read/fo_write) that would change
behavior due to using a private session instead of proc0's session.

Reviewed by:	kib
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D4988
2016-01-19 20:46:30 +00:00
Mark Johnston
793c381706 Add vrefl(), a locked variant of vref(9).
This API has no in-tree consumers at the moment but is useful to at least
one out-of-tree consumer, and naturally complements existing vnode refcount
functions (vholdl(9), vdropl(9)).

Obtained from:	kib (sys/ portion)
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D4947
Differential Revision:	https://reviews.freebsd.org/D4953
2016-01-18 22:21:46 +00:00
Konstantin Belousov
ce958bdefe When cleaning up from failed adv locking and checking for write, do
not call VOP_CLOSE() manually.  Instead, delegate the close to
fo_close() performed as part of the fdrop() on the file failed to
open.  For this, finish constructing file on error, in particular, set
f_vnode and f_ops.

Forcibly resetting f_ops to badfileops disabled additional cleanups
performed by fo_close() for some file types, in this case it was noted
that cdevpriv data was corrupted.  Since fo_close() call must be
enabled for some file types, it makes more sense to enable it for all
files opened through vn_open_cred().

In collaboration with:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-01-17 08:40:51 +00:00
John Baldwin
6c8fd02283 Remove aiod_timeout.
It hasn't been used since the AIO code was made MPSAFE 10 years ago.

Reviewed by:	kib
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D4946
2016-01-14 21:28:56 +00:00
John Baldwin
c85650cacc Rename aiod_bio taskqueue to aiod_kick.
This taskqueue is not used to handle bio requests.  It is only used to
run aio_kick_nowait() to spin up new aio daemon processes.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D4904
2016-01-14 20:51:48 +00:00
Gleb Smirnoff
c8358c6e0d Call crextend() before copying old credentials to the new credentials
and replace crcopysafe by crcopy as crcopysafe is is not intended to be
safe in a threaded environment, it drops PROC_LOCK() in while() that
can lead to unexpected results, such as overwrite kernel memory.

In my POV crcopysafe() needs special attention. For now I do not see
any problems with this function, but who knows.

Submitted by:	dchagin
Found by:	trinity
Security:	SA-16:04.linux
2016-01-14 10:16:25 +00:00
Colin Percival
c0ada0377a Fix a bug introduced in r291716:
"The problem with the approach taken both in _bus_dmamap_load_pages and
bus_dmamap_load_ma_triv is that they split the request buffer into
arbitrary chunks based on page boundaries, creating segments that no
longer have a size that's a multiple of the sector size. This breaks
drivers like blkfront (and probably other stuff)." [1]

This was most easily triggered by running `fsck /` on a system running
in Xen (e.g. Amazon EC2) but also showed up via growfs(8) and probably
many other userland tools which access the disk directly.

Patch by:	royger [1]
"Thinks this should be fine" by:	ken
2016-01-11 20:38:39 +00:00
Dmitry Chagin
038c720553 Implement vsyscall hack. Prior to 2.13 glibc uses vsyscall
instead of vdso. An upcoming linux_base-c6 needs it.

Differential Revision:  https://reviews.freebsd.org/D1090

Reviewed by:	kib, trasz
MFC after:	1 week
2016-01-09 20:18:53 +00:00
Mark Johnston
11f9ca696e Prevent cv_waiters wraparound.
r282971 attempted to fix this problem by decrementing cv_waiters after
waking up from sleeping on a condition variable, but this can result in
a use-after-free if the CV is freed before all woken threads have had a
chance to run. Instead, avoid incrementing cv_waiters past INT_MAX, and
have cv_signal() explicitly check for sleeping threads once cv_waiters has
reached this bound.

Reviewed by:	jhb
MFC after:	2 weeks
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D4822
2016-01-09 01:56:46 +00:00
Gleb Smirnoff
2bab0c5535 New sendfile(2) syscall. A joint effort of NGINX and Netflix from 2013 and
up to now.

The new sendfile is the code that Netflix uses to send their multiple tens
of gigabits of data per second. The new implementation features asynchronous
I/O, when I/O operations are launched, but not awaited to be complete. An
explanation of why such behavior is beneficial compared to old one is
going to be too long for a commit message, so we will skip it here.

Additional features of new syscall are extra flags, which provide an
application more control over data sent. The SF_NOCACHE flag tells
kernel that data shouldn't be cached after it was sent. The SF_READAHEAD()
macro allows to specify readahead size in pages.

The new syscalls is a drop in replacement. No modifications are required
to applications. One can take nginx binary for stable/10 and run it
successfully on head. Although SF_NODISKIO lost its original sense, as now
sendfile doesn't block, and now means something completely different (tm),
using the new sendfile the old way is absolutely safe.

Celebrates:	Netflix global launch!
Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
Relnotes:	yes
2016-01-08 20:34:57 +00:00
Gleb Smirnoff
829fae9063 Make it possible for sbappend() to preserve M_NOTREADY on mbufs, just like
sbappendstream() does. Although, M_NOTREADY may appear only on SOCK_STREAM
sockets, due to sendfile(2) supporting only the latter, there is a corner
case of AF_UNIX/SOCK_STREAM socket, that still uses records for the sake
of control data, albeit being stream socket.

Provide private version of m_clrprotoflags(), which understands PRUS_NOTREADY,
similar to m_demote().
2016-01-08 19:03:20 +00:00
Gleb Smirnoff
ffd1c319a9 Revert r293405: it breaks socket buffer INVARIANTS when sending control
data over local sockets.
2016-01-08 17:27:23 +00:00
Gleb Smirnoff
2f2edf0a08 For SOCK_STREAM socket use sbappendstream() instead of sbappend(). 2016-01-08 01:16:03 +00:00
Konstantin Belousov
0de1455462 Convert tty common code to use make_dev_s().
Tty.c was untypical in that it handled the si_drv1 issue consistently
and correctly, by always checking for si_drv1 being non-NULL and
sleeping if NULL.  The removed code also illustrated unneeded
complications in drivers which are eliminated by the use of new KPI.

Reviewed by:	hps, jhb
Discussed with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
Differential revision:	https://reviews.freebsd.org/D4746
2016-01-07 20:15:09 +00:00
Konstantin Belousov
48ce5d4cac Provide yet another KPI for cdev creation, make_dev_s(9).
Immediate problem fixed by the new KPI is the long-standing race
between device creation and assignments to cdev->si_drv1 and
cdev->si_drv2, which allows the window where cdevsw methods might be
called with si_drv1,2 fields not yet set.  Devices typically checked
for NULL and returned spurious errors to usermode, and often left some
methods unchecked.

The new function interface is designed to be extensible, which should
allow to add more features to make_dev_s(9) without inventing yet
another name for function to create devices, while maintaining KPI and
even KBI backward-compatibility.

Reviewed by:	hps, jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
Differential revision:	https://reviews.freebsd.org/D4746
2016-01-07 20:08:02 +00:00
Mateusz Guzik
6b53d1bc6f cache: ansify functions and fix some style issues
No functional changes.
2016-01-07 02:04:17 +00:00
Konstantin Belousov
1041e09089 Two fixes for excessive iterations after r292326.
Advance the logical block number to the lblkno of the found block plus
one, instead of incrementing the block number which was used for
lookup.  This change skips sparcely populated buffer ranges, similar
to r292325, instead of doing useless lookups.

Do not restart the bnoreuselist() from the start of the range if
buffer lock cannot be obtained without sleep.  Only retry lookup and
lock for the same queue and same logical block number.

Reported by:	benno
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2016-01-05 14:48:40 +00:00
Ian Lepore
69dcb7e771 Make the 'env' directive described in config(5) work on all architectures,
providing compiled-in static environment data that is used instead of any
data passed in from a boot loader.

Previously 'env' worked only on i386 and arm xscale systems, because it
required the MD startup code to examine the global envmode variable and
decide whether to use static_env or an environment obtained from the boot
loader, and set the global kern_envp accordingly.  Most startup code wasn't
doing so.  Making things even more complex, some mips startup code uses an
alternate scheme that involves calling init_static_kenv() to pass an empty
buffer and its size, then uses a series of kern_setenv() calls to populate
that buffer.

Now all MD startup code calls init_static_kenv(), and that routine provides
a single point where envmode is checked and the decision is made whether to
use the compiled-in static_kenv or the values provided by the MD code.

The routine also continues to serve its original purpose for mips; if a
non-zero buffer size is passed the routine installs the empty buffer ready
to accept kern_setenv() values.  Now if the size is zero, the provided buffer
full of existing env data is installed.  A NULL pointer can be passed if the
boot loader provides no env data; this allows the static env to be installed
if envmode is set to do so.

Most of the work here is a near-mechanical change to call the init function
instead of directly setting kern_envp.  A notable exception is in xen/pv.c;
that code was originally installing a buffer full of preformatted env data
along with its non-zero size (like mips code does), which would have allowed
kern_setenv() calls to wipe out the preformatted data.  Now it passes a zero
for the size so that the buffer of data it installs is treated as
non-writeable.
2016-01-02 02:53:48 +00:00
Marius Strobl
45005f7907 - (Ab)use udivx for dividing the u_int pc_cpuid when implementing
CPU_ISSET(), CPU_SET etc. in sparc64 asm. This approach has the
  benefit of not clobbering %y, allowing to revert r222827 and
  partially r222828.
- In r222828, CATR() already was changed to use the equivalent of
  PCPU_GET(cpuid) instead of the MD module ID for KTR_CPU, so
  belatedly also catch up with the C side of ktr(9). Originally,
  in r203838 CATR() was moved away from directly reading the
  module ID or equivalent as that became impractical with other
  CPU types than USI/II supported. With r222828 in place, per-CPU
  data generally is set up soon enough, though, that employing
  PCPU things in ktr(9) also for use during early stages works.
- Unfortunately, an exception to the latter is the ktr(9) use
  in pmap_bootstrap(), which actually is run so early that even
  checking for bootverbose being set via the loader doesn't work.
  Consequently, replace the ktr(9) use in pmap_bootstrap() with
  OF_printf(9) and put it under #ifdef DIAGNOSTIC instead.

MFC after:	3 days
2015-12-30 13:49:20 +00:00
John Baldwin
5fcfab6e32 Add ptrace(2) reporting for LWP events.
Add two new LWPINFO flags: PL_FLAG_BORN and PL_FLAG_EXITED for reporting
thread creation and destruction. Newly created threads will stop to report
PL_FLAG_BORN before returning to userland and exiting threads will stop to
report PL_FLAG_EXIT before exiting completely. Both of these events are
only enabled and reported if PT_LWP_EVENTS is enabled on a process.
2015-12-29 23:25:26 +00:00
John Baldwin
d1e7a4a553 Call kern_thr_exit() instead of duplicating it.
This code is missing the racct_subr() call from kern_thr_exit() and would
require further code duplication in future changes.

Reviewed by:	kib
MFC after:	1 week
2015-12-29 23:16:20 +00:00
Dmitry Chagin
3e18d701de Verify that tv_sec value specified in settimeofday() and clock_settime()
(CLOCK_REALTIME case) system calls is non negative.
This commit hides a kernel panic in atrtc_settime() as the clock_ts_to_ct()
does not properly convert negative tv_sec.

ps. in my opinion clock_ts_to_ct() should be rewritten to properly handle
negative tv_sec values.

Differential Revision:	https://reviews.freebsd.org/D4714
Reviewed by:		kib

MFC after:	1 week
2015-12-27 15:37:07 +00:00
Konstantin Belousov
1899507706 Do not substitute interpeter if the brand interpreter path is
different from the interpreter path requested by the binary.

Before this change, it is impossible to activate non-default
interpreter for 32bit image on amd64, when /libexec/ld-elf32.so.1 file
exists.

Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-12-26 15:40:12 +00:00
Jonathan T. Looney
d3ee0a15a4 Only allow one PT_INTERP ELF program header. This also fixes a potential
memory leak for interp_buf.

Differential Revision:	https://reviews.freebsd.org/D4692
Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	Juniper Networks
2015-12-24 00:58:11 +00:00
Enji Cooper
853a17ad6f Fix r292640
vim overzealously removed some trailing `+' and I didn't check the
diff

MFC after: 1 week
X-MFC with: r292640
Pointyhat to: ngie
Sponsored by: EMC / Isilon Storage Division
2015-12-23 03:34:43 +00:00
Enji Cooper
905b145f0a Clean up trailing whitespace; no functional change
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
2015-12-23 03:29:37 +00:00
Enji Cooper
20b4c1d2cb Fold lim_shared into lim_copy to mute a -Wunused compiler warning from
clang when the kernel is compiled without INVARIANTS

Differential Revision: https://reviews.freebsd.org/D4683
Reviewed by: kib, jhb
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
2015-12-22 21:07:33 +00:00
Konstantin Belousov
d943fa35ad If we annoy user with the terminal output due to failed load of
interpreter, also show the actual error code instead of some
interpretation.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-12-22 20:12:52 +00:00
Jonathan T. Looney
54503a13d8 Add a safety net to reclaim mbufs when one of the mbuf zones become
exhausted.

It is possible for a bug in the code (or, theoretically, even unusual
network conditions) to exhaust all possible mbufs or mbuf clusters.
When this occurs, things can grind to a halt fairly quickly. However,
we currently do not call mb_reclaim() unless the entire system is
experiencing a low-memory condition.

While it is best to try to prevent exhaustion of one of the mbuf zones,
it would also be useful to have a mechanism to attempt to recover from
these situations by freeing "expendable" mbufs.

This patch makes two changes:

a) The patch adds a generic API to the UMA zone allocator to set a
function that should be called when an allocation fails because the
zone limit has been reached. Because of the way this function can be
called, it really should do minimal work.

b) The patch uses this API to try to free mbufs when an allocation
fails from one of the mbuf zones because the zone limit has been
reached. The function schedules a callout to run mb_reclaim().

Differential Revision:	https://reviews.freebsd.org/D3864
Reviewed by:	gnn
Comments by:	rrs, glebius
MFC after:	2 weeks
Sponsored by:	Juniper Networks
2015-12-20 02:05:33 +00:00
Mateusz Guzik
aa0241d623 proc: fix a race which could result in dereference of bad p_pgrp pointer on fork
During fork p_starcopy - p_endcopy area of a process is populated with bcopy
with only proc lock held. Another forking thread can find such a process and
proceed to access p_pgrp included in said area.

Fix the problem by moving the field outside. It is being properly assigned
later.

Reviewed by:	kib
Diagnosed by:	kib
Tested by:	Fabian Keil <freebsd-listen fabiankeil.de>
MFC after:	10 days
2015-12-18 16:33:15 +00:00
Adrian Chadd
2b3ad18853 [intrng] Migrate the intrng code from sys/arm/arm to sys/kern/subr_intr.c.
The ci20 port (by kan@) is going to reuse almost all of the intrng code
since the SoC in question looks suspiciously like someone took an ARM
SoC design and replaced the ARM core with a MIPS core.

* migrate out the code;
* rename ARM_ -> INTR_;
* rename arm_ -> intr_;
* move the interrupt flush routine from intr.c / intrng.c into
  arm/machdep_intr.c - removing the code duplication and removing
  the ARM specific bits from here.

Thanks to the Star Wars: The Force Awakens premiere line for allowing
me a couple hours of quiet time to finish the universe builds.

Tested:

* make universe

TODO:

* The structure definitions in subr_intr.c still includes machine/intr.h
  which requires one duplicates all of the intrng definitions in
  the platform code (which kan has done, and I think we don't have to.)

  Instead I should break out the generic things (function declarations,
  common intr structures, etc) into a separate header.

* Kan has requested I make the PIC based IPI stuff optional.
2015-12-18 05:43:59 +00:00
Mark Johnston
8ff6d9dd22 Support an arbitrary number of arguments to DTrace syscall probes.
Rather than pushing all eight possible arguments into dtrace_probe()'s
stack frame, make the syscall_args struct for the current syscall available
via the current thread. Using a custom getargval method for the systrace
provider, this allows any syscall argument to be fetched, even in kernels
that have modified the maximum number of system call arguments.

Sponsored by:	EMC / Isilon Storage Division
2015-12-17 00:00:27 +00:00
Mark Johnston
3616095801 Fix style issues around existing SDT probes.
- Use SDT_PROBE<N>() instead of SDT_PROBE(). This has no functional effect
  at the moment, but will be needed for some future changes.
- Don't hardcode the module component of the probe identifier. This is
  set automatically by the SDT framework.

MFC after:	1 week
2015-12-16 23:39:27 +00:00
Gleb Smirnoff
b0cd20172d A change to KPI of vm_pager_get_pages() and underlying VOP_GETPAGES().
o With new KPI consumers can request contiguous ranges of pages, and
  unlike before, all pages will be kept busied on return, like it was
  done before with the 'reqpage' only. Now the reqpage goes away. With
  new interface it is easier to implement code protected from race
  conditions.

  Such arrayed requests for now should be preceeded by a call to
  vm_pager_haspage() to make sure that request is possible. This
  could be improved later, making vm_pager_haspage() obsolete.

  Strenghtening the promises on the business of the array of pages
  allows us to remove such hacks as swp_pager_free_nrpage() and
  vm_pager_free_nonreq().

o New KPI accepts two integer pointers that may optionally point at
  values for read ahead and read behind, that a pager may do, if it
  can. These pages are completely owned by pager, and not controlled
  by the caller.

  This shifts the UFS-specific readahead logic from vm_fault.c, which
  should be file system agnostic, into vnode_pager.c. It also removes
  one VOP_BMAP() request per hard fault.

Discussed with:	kib, alc, jeff, scottl
Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2015-12-16 21:30:45 +00:00
Konstantin Belousov
106ebb761a Optimize vop_stdadvise(POSIX_FADV_DONTNEED). Instead of looking up a
buffer for each block number in the range with gbincore(), look up the
next instantiated buffer with the logical block number which is
greater or equal to the next lblkno.  This significantly speeds up the
iteration for sparce-populated range.

Move the iteration into new helper bnoreuselist(), which is structured
similarly to flushbuflist().

Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
2015-12-16 08:48:37 +00:00
Konstantin Belousov
8549b4b9fe Simplify the loop step in the flushbuflist() and make it independed on
the type stability of the buffers memory.  Instead of memoizing
pointer to the next buffer and validating it, remember the next
logical block number in the bo list and re-lookup.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2015-12-16 08:39:51 +00:00
Adrian Chadd
45130aa12e Don't call wakeup if we're just returning reserved space; just
return the reservation and wait for more space to appear.

Submitted by:	jeff
Reviewed by:	kib
2015-12-16 00:13:16 +00:00
Jamie Gritton
6ab6058ec4 Fix jail name checking that disallowed anything that starts with '0'.
The intention was to just limit leading zeroes on numeric names.  That
check is now improved to also catch the leading spaces and '+' that
strtoul can pass through.

PR:		204897
MFC after:	3 days
2015-12-15 17:25:00 +00:00
Edward Tomasz Napierala
af1a7b2526 Tweak comments.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2015-12-13 11:30:36 +00:00
Edward Tomasz Napierala
a8e0740e85 Actually make the 'amount' argument to racct_adjust_resource() signed,
as it was always supposed to be.

MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2015-12-13 11:21:13 +00:00
Edward Tomasz Napierala
57e2ebdc08 Avoid useless relocking.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2015-12-13 11:08:29 +00:00
Mark Johnston
d9e2e68d38 Don't make assertions about td_critnest when the scheduler is stopped.
A panicking thread always executes with a critical section held, so any
attempt to allocate or free memory while dumping will otherwise cause a
second panic. This can occur, for example, if xpt_polled_action() completes
non-dump I/O that was pending at the time of the panic. The fact that this
can occur is itself a bug, but asserting in this case does little but
reduce the reliability of kernel dumps.

Suggested by:	kib
Reported by:	pho
2015-12-11 20:05:07 +00:00
Warner Losh
da82615ae2 Create the MDT_PNP_INFO metadata record to communicate PNP info about
modules. External agents may use this data to automatically load those
modules.

Differential Review: https://reviews.freebsd.org/D3461
2015-12-11 05:27:53 +00:00
Steven Hartland
f1b13a89b0 Don't use 0 for pointer comparison
Use NULL instead of 0 for comparison with panicstr.

MFC after:	1 week
Sponsored by:	Multiplay
2015-12-08 18:38:33 +00:00
Mark Johnston
711fbd17ec Add helper functions proc_readmem() and proc_writemem().
These helper functions can be used to read in or write a buffer from or to
an arbitrary process' address space. Without them, this can only be done
using proc_rwmem(), which requires the caller to fill out a uio. This is
onerous and results in code duplication; the new functions provide a simpler
interface which is sufficient for most existing callers of proc_rwmem().

This change also adds a manual page for proc_rwmem() and the new functions.

Reviewed by:	jhb, kib
Differential Revision:	https://reviews.freebsd.org/D4245
2015-12-07 21:33:15 +00:00
Ed Maste
4c22b4686b Replace magic value ELF note type with NT_FREEBSD_ABI_TAG
As of r291909 elf_common.h provides a definition.

Suggested by:	kib
Sponsored by:	The FreeBSD Foundation
2015-12-07 18:43:27 +00:00
Konstantin Belousov
4d22d07a07 Add support for usermode (vdso-like) gettimeofday(2) and
clock_gettime(2) on ARMv7 and ARMv8 systems which have architectural
generic timer hardware. It is similar how the RDTSC timer is used in
userspace on x86.

Fix a permission problem where generic timer access from EL0 (or
userspace on v7) was not properly initialized on APs.

For ARMv7, mark the stack non-executable. The shared page is added for
all arms (including ARMv8 64bit), and the signal trampoline code is
moved to the page.

Reviewed by:	andrew
Discussed with:	emaste, mmel
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D4209
2015-12-07 12:20:26 +00:00
Kirk McKusick
d9ea698c75 We need to zero out the clustering variables in a freed vnode structure.
For completeness add a VNASSERT that there are no threads waiting on a
range lock (this was previously checked on every vnode free).

Reported by; Rick Macklem
Fix from:    Mateusz Guzik
PR:          204949
2015-12-04 03:54:18 +00:00
Kenneth D. Merry
a9934668aa Add asynchronous command support to the pass(4) driver, and the new
camdd(8) utility.

CCBs may be queued to the driver via the new CAMIOQUEUE ioctl, and
completed CCBs may be retrieved via the CAMIOGET ioctl.  User
processes can use poll(2) or kevent(2) to get notification when
I/O has completed.

While the existing CAMIOCOMMAND blocking ioctl interface only
supports user virtual data pointers in a CCB (generally only
one per CCB), the new CAMIOQUEUE ioctl supports user virtual and
physical address pointers, as well as user virtual and physical
scatter/gather lists.  This allows user applications to have more
flexibility in their data handling operations.

Kernel memory for data transferred via the queued interface is
allocated from the zone allocator in MAXPHYS sized chunks, and user
data is copied in and out.  This is likely faster than the
vmapbuf()/vunmapbuf() method used by the CAMIOCOMMAND ioctl in
configurations with many processors (there are more TLB shootdowns
caused by the mapping/unmapping operation) but may not be as fast
as running with unmapped I/O.

The new memory handling model for user requests also allows
applications to send CCBs with request sizes that are larger than
MAXPHYS.  The pass(4) driver now limits queued requests to the I/O
size listed by the SIM driver in the maxio field in the Path
Inquiry (XPT_PATH_INQ) CCB.

There are some things things would be good to add:

1. Come up with a way to do unmapped I/O on multiple buffers.
   Currently the unmapped I/O interface operates on a struct bio,
   which includes only one address and length.  It would be nice
   to be able to send an unmapped scatter/gather list down to
   busdma.  This would allow eliminating the copy we currently do
   for data.

2. Add an ioctl to list currently outstanding CCBs in the various
   queues.

3. Add an ioctl to cancel a request, or use the XPT_ABORT CCB to do
   that.

4. Test physical address support.  Virtual pointers and scatter
   gather lists have been tested, but I have not yet tested
   physical addresses or scatter/gather lists.

5. Investigate multiple queue support.  At the moment there is one
   queue of commands per pass(4) device.  If multiple processes
   open the device, they will submit I/O into the same queue and
   get events for the same completions.  This is probably the right
   model for most applications, but it is something that could be
   changed later on.

Also, add a new utility, camdd(8) that uses the asynchronous pass(4)
driver interface.

This utility is intended to be a basic data transfer/copy utility,
a simple benchmark utility, and an example of how to use the
asynchronous pass(4) interface.

It can copy data to and from pass(4) devices using any target queue
depth, starting offset and blocksize for the input and ouptut devices.
It currently only supports SCSI devices, but could be easily extended
to support ATA devices.

It can also copy data to and from regular files, block devices, tape
devices, pipes, stdin, and stdout.  It does not support queueing
multiple commands to any of those targets, since it uses the standard
read(2)/write(2)/writev(2)/readv(2) system calls.

The I/O is done by two threads, one for the reader and one for the
writer.  The reader thread sends completed read requests to the
writer thread in strictly sequential order, even if they complete
out of order.  That could be modified later on for random I/O patterns
or slightly out of order I/O.

camdd(8) uses kqueue(2)/kevent(2) to get I/O completion events from
the pass(4) driver and also to send request notifications internally.

For pass(4) devcies, camdd(8) uses a single buffer (CAM_DATA_VADDR)
per CAM CCB on the reading side, and a scatter/gather list
(CAM_DATA_SG) on the writing side.  In addition to testing both
interfaces, this makes any potential reblocking of I/O easier.  No
data is copied between the reader and the writer, but rather the
reader's buffers are split into multiple I/O requests or combined
into a single I/O request depending on the input and output blocksize.

For the file I/O path, camdd(8) also uses a single buffer (read(2),
write(2), pread(2) or pwrite(2)) on reads, and a scatter/gather list
(readv(2), writev(2), preadv(2), pwritev(2)) on writes.

Things that would be nice to do for camdd(8) eventually:

1.  Add support for I/O pattern generation.  Patterns like all
    zeros, all ones, LBA-based patterns, random patterns, etc. Right
    Now you can always use /dev/zero, /dev/random, etc.

2.  Add support for a "sink" mode, so we do only reads with no
    writes.  Right now, you can use /dev/null.

3.  Add support for automatic queue depth probing, so that we can
    figure out the right queue depth on the input and output side
    for maximum throughput.  At the moment it defaults to 6.

4.  Add support for SATA device passthrough I/O.

5.  Add support for random LBAs and/or lengths on the input and
    output sides.

6.  Track average per-I/O latency and busy time.  The busy time
    and latency could also feed in to the automatic queue depth
    determination.

sys/cam/scsi/scsi_pass.h:
	Define two new ioctls, CAMIOQUEUE and CAMIOGET, that queue
	and fetch asynchronous CAM CCBs respectively.

	Although these ioctls do not have a declared argument, they
	both take a union ccb pointer.  If we declare a size here,
	the ioctl code in sys/kern/sys_generic.c will malloc and free
	a buffer for either the CCB or the CCB pointer (depending on
	how it is declared).  Since we have to keep a copy of the
	CCB (which is fairly large) anyway, having the ioctl malloc
	and free a CCB for each call is wasteful.

sys/cam/scsi/scsi_pass.c:
	Add asynchronous CCB support.

	Add two new ioctls, CAMIOQUEUE and CAMIOGET.

	CAMIOQUEUE adds a CCB to the incoming queue.  The CCB is
	executed immediately (and moved to the active queue) if it
	is an immediate CCB, but otherwise it will be executed
	in passstart() when a CCB is available from the transport layer.

	When CCBs are completed (because they are immediate or
	passdone() if they are queued), they are put on the done
	queue.

	If we get the final close on the device before all pending
	I/O is complete, all active I/O is moved to the abandoned
	queue and we increment the peripheral reference count so
	that the peripheral driver instance doesn't go away before
	all pending I/O is done.

	The new passcreatezone() function is called on the first
	call to the CAMIOQUEUE ioctl on a given device to allocate
	the UMA zones for I/O requests and S/G list buffers.  This
	may be good to move off to a taskqueue at some point.
	The new passmemsetup() function allocates memory and
	scatter/gather lists to hold the user's data, and copies
	in any data that needs to be written.  For virtual pointers
	(CAM_DATA_VADDR), the kernel buffer is malloced from the
	new pass(4) driver malloc bucket.  For virtual
	scatter/gather lists (CAM_DATA_SG), buffers are allocated
	from a new per-pass(9) UMA zone in MAXPHYS-sized chunks.
	Physical pointers are passed in unchanged.  We have support
	for up to 16 scatter/gather segments (for the user and
	kernel S/G lists) in the default struct pass_io_req, so
	requests with longer S/G lists require an extra kernel malloc.

	The new passcopysglist() function copies a user scatter/gather
	list to a kernel scatter/gather list.  The number of elements
	in each list may be different, but (obviously) the amount of data
	stored has to be identical.

	The new passmemdone() function copies data out for the
	CAM_DATA_VADDR and CAM_DATA_SG cases.

	The new passiocleanup() function restores data pointers in
	user CCBs and frees memory.

	Add new functions to support kqueue(2)/kevent(2):

	passreadfilt() tells kevent whether or not the done
	queue is empty.

	passkqfilter() adds a knote to our list.

	passreadfiltdetach() removes a knote from our list.

	Add a new function, passpoll(), for poll(2)/select(2)
	to use.

	Add devstat(9) support for the queued CCB path.

sys/cam/ata/ata_da.c:
	Add support for the BIO_VLIST bio type.

sys/cam/cam_ccb.h:
	Add a new enumeration for the xflags field in the CCB header.
	(This doesn't change the CCB header, just adds an enumeration to
	use.)

sys/cam/cam_xpt.c:
	Add a new function, xpt_setup_ccb_flags(), that allows specifying
	CCB flags.

sys/cam/cam_xpt.h:
	Add a prototype for xpt_setup_ccb_flags().

sys/cam/scsi/scsi_da.c:
	Add support for BIO_VLIST.

sys/dev/md/md.c:
	Add BIO_VLIST support to md(4).

sys/geom/geom_disk.c:
	Add BIO_VLIST support to the GEOM disk class.  Re-factor the I/O size
	limiting code in g_disk_start() a bit.

sys/kern/subr_bus_dma.c:
	Change _bus_dmamap_load_vlist() to take a starting offset and
	length.

	Add a new function, _bus_dmamap_load_pages(), that will load a list
	of physical pages starting at an offset.

	Update _bus_dmamap_load_bio() to allow loading BIO_VLIST bios.
	Allow unmapped I/O to start at an offset.

sys/kern/subr_uio.c:
	Add two new functions, physcopyin_vlist() and physcopyout_vlist().

sys/pc98/include/bus.h:
	Guard kernel-only parts of the pc98 machine/bus.h header with
	#ifdef _KERNEL.

	This allows userland programs to include <machine/bus.h> to get the
	definition of bus_addr_t and bus_size_t.

sys/sys/bio.h:
	Add a new bio flag, BIO_VLIST.

sys/sys/uio.h:
	Add prototypes for physcopyin_vlist() and physcopyout_vlist().

share/man/man4/pass.4:
	Document the CAMIOQUEUE and CAMIOGET ioctls.

usr.sbin/Makefile:
	Add camdd.

usr.sbin/camdd/Makefile:
	Add a makefile for camdd(8).

usr.sbin/camdd/camdd.8:
	Man page for camdd(8).

usr.sbin/camdd/camdd.c:
	The new camdd(8) utility.

Sponsored by:	Spectra Logic
MFC after:	1 week
2015-12-03 20:54:55 +00:00
Kirk McKusick
003a7c2b01 We need to zero out the union of pointers in a freed vnode structure.
PR:        204949
Fix from:  Mateusz Guzik
Tested by: Jason Unovitch
2015-12-03 02:04:22 +00:00
Nathan Whitehorn
f19d421ac6 Missed header_supported call from r291020: make really, really sure the brand
likes the executable.
2015-12-01 17:00:31 +00:00
Mateusz Guzik
d6767a972c capsicum: plug spurious memset in __cap_rights_init
Reviewed by:	pjd
2015-12-01 02:48:42 +00:00
Kirk McKusick
41d4f10391 As the kernel allocates and frees vnodes, it fully initializes them
on every allocation and fully releases them on every free.  These
are not trivial costs: it starts by zeroing a large structure then
initializes a mutex, a lock manager lock, an rw lock, four lists,
and six pointers. And looking at vfs.vnodes_created, these operations
are being done millions of times an hour on a busy machine.

As a performance optimization, this code update uses the uma_init
and uma_fini routines to do these initializations and cleanups only
as the vnodes enter and leave the vnode_zone. With this change the
initializations are only done kern.maxvnodes times at system startup
and then only rarely again. The frees are done only if the vnode_zone
shrinks which never happens in practice. For those curious about the
avoided work, look at the vnode_init() and vnode_fini() functions in
kern/vfs_subr.c to see the code that has been removed from the main
vnode allocation/free path.

Reviewed by: kib
Tested by:   Peter Holm
2015-11-29 21:42:26 +00:00
Konstantin Belousov
724f4b62b0 Remove sv_prepsyscall, sv_sigsize and sv_sigtbl members of the struct
sysent.

sv_prepsyscall is unused.

sv_sigsize and sv_sigtbl translate signal number from the FreeBSD
namespace into the ABI domain.  It is only utilized on i386 for iBCS2
binaries.  The issue with this approach is that signals for iBCS2 were
delivered with the FreeBSD signal frame layout, which does not follow
iBCS2.  The same note is true for any other potential user if
sv_sigtbl.  In other words, if ABI needs signal number translation, it
really needs custom sv_sendsig method instead.

Sponsored by:	The FreeBSD Foundation
2015-11-28 08:49:07 +00:00
Konstantin Belousov
f186a80d8a Remove VI_AGE vnode iflag, it is unused.
Noted by:	bde
Sponsored by:	The FreeBSD Foundation
2015-11-27 01:45:40 +00:00
Konstantin Belousov
b3162b45e1 Move the comment about resident pages preventing vnode from leaving
active list, into the header comment for vdrop(), which is the
function that decides whether to leave the vnode on the list.  Note
that dirty page write-out in vinactive() is asynchronous.

Discussed with:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-11-27 01:16:35 +00:00
Andrey V. Elsukov
0991fe0117 Check that hhk_helper pointer isn't NULL before access.
It isn't forbidden to use NULL pointer for hook_helper in hookinfo
structure when hhook_add_hook() adds new helper hook.
2015-11-25 07:14:58 +00:00
Konstantin Belousov
547831b6fd Rework the vnode cache recycling to meet free and unused vnodes
targets.  See the comment above wantfreevnodes variable for the
description of the algorithm.

The vfs.vlru_alloc_cache_src sysctl is removed.  New code frees
namecache sources as the last chance to satisfy the highest watermark,
instead of selecting the source vnodes randomly. This provides good
enough behaviour to keep vn_fullpath() working in most situations.
The filesystem layout with deep trees, where the removed knob was
required, is thus handled automatically.

Submitted by:	bde
Discussed with:	mckusick
Tested by:	pho
MFC after:	1 month
2015-11-24 09:45:36 +00:00
Mark Johnston
1155462e3a The buffer passed to an sbuf drain callback is not necessarily
null-terminated, so don't assume that it is.

Reported by:	pho
X-MFC-With:	r291059
2015-11-23 18:45:35 +00:00
Konstantin Belousov
5e27d79314 Split kerne timekeep ABI structure vdso_sv_tk out of the struct
sysentvec.  This allows the timekeep data to be shared between similar
ABIs which cannot share sysentvec.

Make the timekeep_push_vdso() tick callback to the timekeep structures
instead of sysentvecs.  If several sysentvec share the vdso_sv_tk
structure, we would update the userspace data several times on each
tick, without the change.

Only allocate vdso_sv_tk in the exec_sysvec_init() sysinit when
sysentvec is marked with the new SV_TIMEKEEP flag.  This saves
allocation and update of unneeded vdso_sv_tk for ABIs which do not
provide userspace gettimeofday yet, which are PowerPCs arches right
now.

Make vdso_sv_tk allocator public, namely split out and export
alloc_sv_tk() and alloc_sv_tk_compat32().  ABIs which share timekeep
data now can allocate it manually and share as appropriate.

Requested by:	nwhitehorn
Tested by:	nwhitehorn, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-11-23 07:09:35 +00:00
Gleb Smirnoff
09c837b897 Remove remnants of the old NFS from vnode pager.
Reviewed by:	kib
Sponsored by:	Netflix
2015-11-20 23:52:27 +00:00
Edward Tomasz Napierala
a8723fb881 The freebsd4_getfsstat() was broken in r281551 to always return 0 on success.
All versions of getfsstat(3) are supposed to return the number of [o]statfs
structs in the array that was copied out.

Also fix missing bounds checking and signed comparison of unsigned types.

Submitted by:	bde@
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2015-11-20 14:08:12 +00:00
Jonathan T. Looney
1067a2ba68 Consistently enforce the restriction against calling malloc/free when in a
critical section.

uma_zalloc_arg()/uma_zalloc_free() may acquire a sleepable lock on the
zone. The malloc() family of functions may call uma_zalloc_arg() or
uma_zalloc_free().

The malloc(9) man page currently claims that free() will never sleep.
It also implies that the malloc() family of functions will not sleep
when called with M_NOWAIT. However, it is more correct to say that
these functions will not sleep indefinitely. Indeed, they may acquire
a sleepable lock. However, a developer may overlook this restriction
because the WITNESS check that catches attempts to call the malloc()
family of functions within a critical section is inconsistenly
applied.

This change clarifies the language of the malloc(9) man page to clarify
the restriction against calling the malloc() family of functions
while in a critical section or holding a spin lock. It also adds
KASSERTs at appropriate points to make the enforcement of this
restriction more consistent.

PR:		204633
Differential Revision:	https://reviews.freebsd.org/D4197
Reviewed by:	markj
Approved by:	gnn (mentor)
Sponsored by:	Juniper Networks
2015-11-19 14:04:53 +00:00
Mark Johnston
150b709793 Remove a commented-out debug print.
MFC after:	1 week
2015-11-19 05:58:51 +00:00
Mark Johnston
1b9254f885 Add support for a configurable output channel to witness(4).
This is useful in environments where system configuration is performed by
automated interaction with the system console, since unexpected witness
output makes such automation difficult. With this change, the new
debug.witness.output_channel sysctl allows one to specify that witness
output is to be printed to the kernel log (using log(9)) rather than the
console.

Reviewed by:	cem, jhb
MFC after:	2 weeks
Relnotes:	yes
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D4183
2015-11-19 05:56:59 +00:00
Mark Johnston
07713dde22 Add vlog(9).
Reviewed by:	cem, jhb
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D4183
2015-11-19 05:50:22 +00:00
Nathan Whitehorn
686d2f317a Extend r270123 to run the brand info's header_supported() routine for
branded as well as unbranded binaries. This will be required to add
support for the new ELFv2 ABI on powerpc64, which is distinguished from
ELFv1 by the contents of the ELF header's flags field.

Reviewed by:	imp
MFC after:	2 weeks
2015-11-18 17:03:22 +00:00
Marius Strobl
7888a51f8d - Unbreak dumpsys(9) on sparc64 after r276772
- While at it, arrange #ifndefs in kern_dump.c more intelligently; it's
  rather confusing to have multiple competing and/or unused functions in
  the kernel.
2015-11-16 23:02:33 +00:00
Edward Tomasz Napierala
15db3c0738 Speed up rctl operation with large rulesets, by holding the lock
during iteration instead of relocking it for each traversed rule.

Reviewed by:	mjg@
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D4110
2015-11-15 12:10:51 +00:00
Randall Stewart
7c4676ddee This fixes several places where callout_stops return is examined. The
new return codes of -1 were mistakenly being considered "true". Callout_stop
now returns -1 to indicate the callout had either already completed or
was not running and 0 to indicate it could not be stopped.  Also update
the manual page to make it more consistent no non-zero in the callout_stop
or callout_reset descriptions.

MFC after:	1 Month with associated callout change.
2015-11-13 22:51:35 +00:00
John Baldwin
645743ea99 Export various helper variables describing the layout and size of
certain kernel structures for use by debuggers. This mostly aids
in examining cores from a kernel without debug symbols as a debugger
can infer these values if debug symbols are available.

One set of variables describes the layout of 'struct linker_file' to
walk the list of loaded kernel modules.

A second set of variables describes the layout of 'struct proc' and
'struct thread' to walk the list of processes in the kernel and the
threads in each process.

The 'pcb_size' variable is used to index into the stoppcbs[] array.

The 'vm_maxuser_address' is used to distinguish kernel virtual addresses
from user addresses. This doesn't have to be perfect, and
'vm_maxuser_address' is a cheap and simple way to differentiate kernel
pointers from simple values like TIDs and PIDs.

While here, annotate the fields in struct pcb used by kgdb on amd64
and i386 to note that their ABI should be preserved.  Annotations for
other platforms will be added in the future.

Reviewed by:	kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D3773
2015-11-12 22:00:59 +00:00
Randall Stewart
18b4fd62e0 Add new async_drain to the callout system. This is so-far not used but
should be used by TCP for sure in its cleanup of the IN-PCB (will be coming shortly).

Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D4076
2015-11-10 14:49:32 +00:00
Josh Paetzel
09766dd52a Fix a bug in the CPU % limiting code
If you attempt to set a pcpu limit that is higher than
110% using rctl (for instance, you want a jail to be
able to use 2 cores on your system so you set pcpu to
200%) the thing you are trying to limit becomes unthrottled.

PR:	189870
Submitted by:	dustinwenz@ebureau.com
Reviewed by:	trasz
MFC after:	1 week
2015-11-10 14:14:32 +00:00
Edward Tomasz Napierala
ea228b482e Make naming more consistent; no functional changes.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2015-11-08 18:11:24 +00:00
Edward Tomasz Napierala
2b4035eeb6 Speed up rctl(8) rule retrieval; the difference shows mostly in "rctl -n",
as otherwise most of the time is spent resolving UIDs to names.

Reviewed by:	mjg@
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D4059
2015-11-08 18:08:31 +00:00
Tijl Coosemans
27f38a8d69 Since r289279 bufinit() uses mp_ncpus, but some architectures set this
variable during mp_start() which is too late.  Move this to mp_setmaxid()
where other architectures set it and move x86 assertions to MI code.

Reviewed by:	kib (x86 part)
2015-11-08 14:26:50 +00:00
Mark Johnston
d28713378a - Consistently use PROC_ASSERT_HELD() to verify that a process' hold count
is non-zero.
- Include the process address in the PROC_ASSERT_HELD() and
  PROC_ASSERT_NOT_HELD() assertion messages so that the corresponding
  process can be found easily when debugging.

MFC after:	1 week
2015-11-08 01:38:56 +00:00
Conrad Meyer
e072f955ff Flesh out sysctl types further (follow-up of r290475)
Use the right intmax_t type instead of intptr_t in a few remaining
places.

Add support for CTLFLAG_TUN for the new fixed with types.  Bruce will be
upset that the new handlers silently truncate tuned quad-sized inputs,
but so do all of the existing handlers.

Add the new types to debug_dump_node, for whatever use that is.

Bump FreeBSD_version again, for good measure.  We are changing
SYSCTL_HANDLER_ARGS and a member of struct sysctl_oid to intmax_t.

Correct the sysctl typed NULL values for the fixed-width types.  (Hat
tip: hps@.)

Suggested by:	hps (partial)
Sponsored by:	EMC / Isilon Storage Division
2015-11-07 18:26:32 +00:00
Adrian Chadd
2eb936acb9 Add a sched_yield() to work around low memory conditions in the current code.
Things seem to get stuck in low memory conditions where no bufs are available,
the reclamation path is called to wakeup the daemon, but no sleeping is done.
Because of this, we are stuck in a tight loop in the current process and
never run said reclamation path.

This was introduced in r289279 . This is only a temporary workaround
to restore system usefulness until the more permanent solutions can be
found.

Tested:

* Carambola2, 64MB (and 32MB by manual config.)
2015-11-07 04:04:00 +00:00
Conrad Meyer
be87839e56 Round out SYSCTL macros to the full set of fixed-width types
Add S8, S16, S32, and U32 types;  add SYSCTL*() macros for them, as well
as for the existing 64-bit types.  (While SYSCTL*QUAD and UQUAD macros
already exist, they do not take the same sort of 'val' parameter that
the other macros do.)

Clean up the documented "types" in the sysctl.9 document.  (These are
macros and thus not real types, but the manual page documents intent.)

The sysctl_add_oid(9) arg2 has been bumped from intptr_t to intmax_t to
accommodate 64-bit types on 32-bit pointer architectures.

This is just the kernel support piece; the userspace sysctl(1) support
will follow in a later patch.

Submitted by:	Ravi Pokala <rpokala@panasas.com>
Reviewed by:	cem
Relnotes:	no
Sponsored by:	Panasas
Differential Revision:	https://reviews.freebsd.org/D4091
2015-11-07 01:43:01 +00:00
Mateusz Guzik
b577e693aa fd: implement kern.proc.nfds sysctl
Intended purpose is to provide an equivalent of OpenBSD's getdtablecount
syscall for the compat library..
2015-11-07 00:18:14 +00:00
John Baldwin
db41d262d3 When dumping an rman in DDB, include the RID of each resource.
Submitted by:	Ravi Pokala (rpokala@panasas.com)
Reviewed by:	imp
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D4086
2015-11-05 23:12:23 +00:00
Mark Johnston
e8e0fac552 Have elf_lookup() return an error if the specified non-weak symbol could
not be found. Otherwise, relocations against such symbols will be silently
ignored instead of causing an error to be raised.

Reviewed by:	kib
MFC after:	1 week
2015-11-03 03:29:35 +00:00
Enji Cooper
aaca704590 Define fhard in pps_event(..) only when PPS_SYNC is defined to mute
an -Wunused-but-set-variable warning

Reported by: FreeBSD_HEAD_amd64_gcc4.9 jenkins job
Sponsored by: EMC / Isilon Storage Division
2015-11-02 03:14:37 +00:00
Enji Cooper
9a12e28212 Define compress in __elfN(coredump) when #ifdef GZIO is true to mute
an -Wunused-but-set-variable warning

Reported by: FreeBSD_HEAD_amd64_gcc4.9 jenkins job
Sponsored by: EMC / Isilon Storage Division
2015-11-02 01:47:26 +00:00
Warner Losh
a1f26210f3 The error classification from lower layers is a poor indicator of
whether an error is recoverable. Always re-dirty the buffer on errors
from write requests. The invalidation we used to do for errors not EIO
doesn't need to be done for a device that's really gone, since that's
done in a different path.

Reviewed by: mckusick@, kib@
2015-10-31 04:53:07 +00:00
Konstantin Belousov
50657fd342 Minor (and incomplete) style cleanup.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-10-30 20:47:42 +00:00
Konstantin Belousov
2936e0013c Also mark compat32 umtx op table as constant.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-10-30 19:32:30 +00:00
Konstantin Belousov
c539e87014 Use C99 array initialization, which also makes the code
self-documented, and eases addition of new ops.

For the similar reasons, eliminate UMTX_OP_MAX.  nitems() handles the
only use of the symbol.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-10-30 19:20:40 +00:00
Edward Tomasz Napierala
665aea9323 After r290196, the kernel won't wait for stuff like gmirror nodes
if they are not required for mounting rootfs.  However, it's possible
that some setups try to mount them in mountcritlocal (ie from fstab).

Export the list of current root mount holds using a new sysctl,
vfs.root_mount_hold, and make mountcritlocal retry if "mount -a" fails
and the list is not empty.

MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D3709
2015-10-30 15:52:10 +00:00
Edward Tomasz Napierala
a3ba3d09c2 Make root mount wait mechanism smarter, by making it wait only if the root
device doesn't yet exist.

Reviewed by:	kib@, marcel@
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D3709
2015-10-30 15:35:04 +00:00
Bryan Drewery
156c04f793 getnewbuf: Initialize bp to avoid uninitialized pointer dereference and brelse().
This came in recently in r289279.

Coverity CID:	1331561
2015-10-29 19:02:24 +00:00
Hans Petter Selasky
cb3450e26e Add missing NULL check in physio().
When destroying a character device the si_devsw field is set to NULL
before all references are gone, to indicate the character device is
going away. This can cause a NULL-dereference fault inside physio().

The callers of physio() should own a thread reference on the cdev and
if si_devsw is seen as non-NULL, it is usable during the execution of
the function. Else an ENXIO error code is returned.

Reviewed by:	kib
MFC after:	2 weeks
2015-10-29 13:53:37 +00:00
Warner Losh
83a283cf62 Add a note to the effect that BUS_ADD_CHILD calls
device_add_child_ordered to add the child. device_add_child_ordered
doesn't call BUS_ADD_CHILD.
2015-10-28 18:53:18 +00:00
Kirk McKusick
a57418a761 Bring the tags and links entries for amd64 up to date.
Based on how out of date it is, I doubt that anyone
other than me and my code-reading students still use it.
2015-10-27 22:59:24 +00:00
Pawel Jakub Dawidek
38d68e2d42 The aio_waitcomplete(2) syscall should not sleep when the given timeout
is 0. Without this change it was sleeping for one tick. Maybe not a big
deal, but it makes share/dtrace/blocking script to report that.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D3814
Sponsored by:	Wheel Systems, http://wheelsystems.com
2015-10-25 18:48:09 +00:00
Conrad Meyer
c3220d0b6d Sysctl: Add common support for U8, U16 types
Sponsored by:	EMC / Isilon Storage Division
2015-10-22 23:03:06 +00:00
John Baldwin
ea7b054e99 Missing regen after last change to sys/kern/syscalls.master. 2015-10-22 21:30:39 +00:00
John Baldwin
2f99bcce1e Rename remaining linux32 symbols such as linux_sysent[] and
linux_syscallnames[] from linux_* to linux32_* to avoid conflicts with
linux64.ko.  While here, add support for linux64 binaries to systrace.
- Update NOPROTO entries in amd64/linux/syscalls.master to match the
  main table to fix systrace build.
- Add a special case for union l_semun arguments to the systrace
  generation.
- The systrace_linux32 module now only builds the systrace_linux32.ko.
  module on amd64.
- Add a new systrace_linux module that builds on both i386 and amd64.
  For i386 it builds the existing systrace_linux.ko.  For amd64 it
  builds a systrace_linux.ko for 64-bit binaries.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D3954
2015-10-22 21:28:20 +00:00
Ed Schouten
aff5735784 Add a way to distinguish between forking and thread creation in schedtail.
For CloudABI we need to initialize the registers of new threads
differently based on whether the thread got created through a fork or
through simple thread creation.

Add a flag, TDP_FORKING, that is set by do_fork() and cleared by
fork_exit(). This can be tested against in schedtail.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D3973
2015-10-22 09:33:34 +00:00
Konstantin Belousov
8fc3db038e Trim spaces at end of line to record the proper commit message for
r289660:

Do not allow to execute ptrace(PT_TRACE_ME) when the process is
already traced.

Do not allow to execute ptrace(PT_TRACE_ME) when there is no parent
which can trace the process, i.e. when the parent is already init.
Note that after the PT_TRACE_ME request the process is unkillable and
non-continuable until a debugger is attached, or parent is killed, the
later clears P_TRACED state.  Since init clearly would not debug the
caller, and cannot be killed, disallow creation of unkillable
processes.

Reviewed by:	jhb, pho
Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D3908
2015-10-20 20:38:20 +00:00
Konstantin Belousov
1b253694f4 Mark struct thread zone as type-stable.
When establishing the locking state for several lock types (including
blockable mutexes and sx) failed, locking primitives try to spin while
the owner thread is running.  The spinning loop performs the test for
running condition by dereferencing the owner->td_state field of the
owner thread.  If the owner thread exited while spinner was put off
the processor, it is harmless to access reused struct thread owner,
since in some near future the current processor would notice the owner
change and make appropriate progress.  But it could be that the page
which carried the freed struct thread was unmapped, then we fault
(this cannot happen on amd64).

For now, disallowing free of the struct thread seems to be good
enough, and tests which create a lot of threads once, did not
demonstrated regressions.

Reviewed by:	jhb, pho
Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D3908
2015-10-20 20:29:21 +00:00
Konstantin Belousov
77b9bec37b Reviewed by: jhb, pho
Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D3908
2015-10-20 20:22:57 +00:00
Konstantin Belousov
1ed2e49b55 No need to dereference struct proc to pids when comparing processes
for equality.

Reviewed by:	jhb, pho
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-10-20 20:12:42 +00:00
John Baldwin
c814b86843 Switch pl_child_pid from int to pid_t.
Reviewed by:	emaste, kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D3857
2015-10-20 17:58:21 +00:00
Ian Lepore
11ead989eb Fix printf format to allow for bus_size_t not being u_long on all platforms. 2015-10-20 03:25:17 +00:00
Enji Cooper
3f18b7fa12 Replace /dev/acd0 with /dev/cd1
atapicd(4) has been removed since r249083, and if a system has more than one
optical drive, it will likely be /dev/cd1

Update mount.conf(8) to reflect the change in behavior

MFC after: never
Sponsored by: EMC / Isilon Storage Division
2015-10-17 08:51:10 +00:00
Konstantin Belousov
d8f3dc7871 If falloc_caps() failed, cleanup needs to be performed. This is a bug
in r289026.

Sponsored by:	The FreeBSD Foundation
2015-10-16 14:55:39 +00:00
Konstantin Belousov
6c775eb64e Allow PT_INTERP and PT_NOTES segments to be located anywhere in the
executable image.  Keep one page (arbitrary) limit on the max allowed
size of the PT_NOTES.

The ELF image activators still require that program headers of the
executable are fully contained in the first page of the image file.

Reviewed by:	emaste, jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D3871
2015-10-14 18:27:35 +00:00
Jeff Roberson
21fae96123 Parallelize the buffer cache and rewrite getnewbuf(). This results in a
8x performance improvement in a micro benchmark on a 4 socket machine.

 - Get buffer headers from a per-cpu uma cache that sits in from of the
   free queue.
 - Use a per-cpu quantum cache in vmem to eliminate contention for kva.
 - Use multiple clean queues according to buffer cache size to eliminate
   clean queue lock contention.
 - Introduce a bufspace daemon that attempts to prevent getnewbuf() callers
   from blocking or doing direct recycling.
 - Close some bufspace allocation races that could lead to endless
   recycling.
 - Further the transition to a more modern style of small functions grouped
   by prefix in order to improve growing complexity.

Sponsored by:	EMC / Isilon
Reviewed by:	kib
Tested by:	pho
2015-10-14 02:10:07 +00:00
Hiren Panchasara
86a996e6bd There are times when it would be really nice to have a record of the last few
packets and/or state transitions from each TCP socket. That would help with
narrowing down certain problems we see in the field that are hard to reproduce
without understanding the history of how we got into a certain state. This
change provides just that.

It saves copies of the last N packets in a list in the tcpcb. When the tcpcb is
destroyed, the list is freed. I thought this was likely to be more
performance-friendly than saving copies of the tcpcb. Plus, with the packets,
you should be able to reverse-engineer what happened to the tcpcb.

To enable the feature, you will need to compile a kernel with the TCPPCAP
option. Even then, the feature defaults to being deactivated. You can activate
it by setting a positive value for the number of captured packets. You can do
that on either a global basis or on a per-socket basis (via a setsockopt call).

There is no way to get the packets out of the kernel other than using kmem or
getting a coredump. I thought that would help some of the legal/privacy concerns
regarding such a feature. However, it should be possible to add a future effort
to export them in PCAP format.

I tested this at low scale, and found that there were no mbuf leaks and the peak
mbuf usage appeared to be unchanged with and without the feature.

The main performance concern I can envision is the number of mbufs that would be
used on systems with a large number of sockets. If you save five packets per
direction per socket and have 3,000 sockets, that will consume at least 30,000
mbufs just to keep these packets. I tried to reduce the concerns associated with
this by limiting the number of clusters (not mbufs) that could be used for this
feature. Again, in my testing, that appears to work correctly.

Differential Revision:	D3100
Submitted by:		Jonathan Looney <jlooney at juniper dot net>
Reviewed by:		gnn, hiren
2015-10-14 00:35:37 +00:00
Edward Tomasz Napierala
92001b9497 Change the default setting of kern.ipc.shm_allow_removed from 0 to 1.
This removes the need for manually changing this flag for Google Chrome
users. It also improves compatibility with Linux applications running under
Linuxulator compatibility layer, and possibly also helps in porting software
from Linux.

Generally speaking, the flag allows applications to create the shared memory
segment, attach it, remove it, and then continue to use it and to reattach it
later. This means that the kernel will automatically "clean up" after the
application exits.

It could be argued that it's against POSIX. However, SUSv3 says this
about IPC_RMID: "Remove the shared memory identifier specified by shmid from
the system and destroy the shared memory segment and shmid_ds data structure
associated with it." From my reading, we break it in any case by deferring
removal of the segment until it's detached; we won't break it any more
by also deferring removal of the identifier.

This is the behaviour exhibited by Linux since... probably always, and
also by OpenBSD since the following commit:

revision 1.54
date: 2011/10/27 07:56:28; author: robert; state: Exp; lines: +3 -8;
Allow segments to be used even after they were marked for deletion with
the IPC_RMID flag.
This is permitted as an extension beyond the standards and this is similar
to what other operating systems like linux do.

MFC after:	1 month
Relnotes:	yes
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D3603
2015-10-10 09:29:47 +00:00
Edward Tomasz Napierala
b9a5c7b595 Provide better debug message on kernel module name clash.
Reviewed by:	kib@
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2015-10-10 09:21:55 +00:00
Edward Tomasz Napierala
8d90e66066 Remove root_mount_wait(). It's not used anywhere.
Reviewed by:	bapt@
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D3787
2015-10-09 12:11:37 +00:00
Konstantin Belousov
4b48959f9f Enforce the maxproc limitation before allocating struct proc, initial
struct thread and kernel stack for the thread.  Otherwise, a load
similar to a fork bomb would exhaust KVA and possibly kmem, mostly due
to the struct proc being type-stable.

The nprocs counter is changed from being protected by allproc_lock sx
to be an atomic variable.  Note that ddb/db_ps.c:db_ps() use of nprocs
was unsafe before, and is still unsafe, but it seems that the only
possible undesired consequence is the harmless warning printed when
allproc linked list length does not match nprocs.

Diagnosed by:	Svatopluk Kraus <onwahe@gmail.com>
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-10-08 11:07:09 +00:00
Fabien Thomas
78e79434d2 Fix r283998 that broke mapin events for hwpmc.
Reviewed by:	jhb
Sponsored by:	Stormshield
2015-10-08 09:54:33 +00:00
Gleb Smirnoff
e40e8705db Fix regression from r248371. We need to copy packet header to new
mbuf. Unlike in the pre-r248371 code, assert that M_PKTHDR is set
only on a first mbuf.

Reported & tested by:	Andriy Voskoboinyk <s3erios gmail.com>
Sponsored by:		Nginx, Inc.
2015-10-07 12:40:00 +00:00
John Baldwin
189ac973de Fix various edge cases related to system call tracing.
- Always set td_dbg_sc_* when P_TRACED is set on system call entry
  even if the debugger is not tracing system call entries.  This
  ensures the fields are valid when reporting other stops that
  occur at system call boundaries such as for PT_FOLLOW_FORKS or
  when only tracing system call exits.
- Set TDB_SCX when reporting the stop for a new child process in
  fork_return().  This causes the event to be reported as a system
  call exit.
- Report a system call exit event in fork_return() for new threads in
  a traced process.
- Copy td_dbg_sc_* to new threads instead of zeroing.  This ensures
  that td_dbg_sc_code in particular will report the system call that
  created the new thread or process when it reports a system call
  exit event in fork_return().
- Add new ptrace tests to verify that new child processes and threads
  report system call exit events with a valid pl_syscall_code via
  PT_LWPINFO.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D3822
2015-10-06 19:29:05 +00:00
Conrad Meyer
e6b95927f3 Fix core corruption caused by race in note_procstat_vmmap
This fix is spiritually similar to r287442 and was discovered thanks to
the KASSERT added in that revision.

NT_PROCSTAT_VMMAP output length, when packing kinfo structs, is tied to
the length of filenames corresponding to vnodes in the process' vm map
via vn_fullpath.  As vnodes may move during coredump, this is racy.

We do not remove the race, only prevent it from causing coredump
corruption.

- Add a sysctl, kern.coredump_pack_vmmapinfo, to allow users to disable
  kinfo packing for PROCSTAT_VMMAP notes.  This avoids VMMAP corruption
  and truncation, even if names change, at the cost of up to PATH_MAX
  bytes per mapped object.  The new sysctl is documented in core.5.

- Fix note_procstat_vmmap to self-limit in the second pass.  This
  addresses corruption, at the cost of sometimes producing a truncated
  result.

- Fix PROCSTAT_VMMAP consumers libutil (and libprocstat, via copy-paste)
  to grok the new zero padding.

Reported by:	pho (https://people.freebsd.org/~pho/stress/log/datamove4-2.txt)
Relnotes:	yes
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D3824
2015-10-06 18:07:00 +00:00
Gleb Smirnoff
640082d498 Remove debugging variable from r143761. 2015-10-06 09:43:49 +00:00
John Baldwin
3edd0ffffe Include additional info in ptrace(2) KTR traces:
- The new PC value and signal passed to PT_CONTINUE, PT_DETACH, PT_SYSCALL,
  and PT_TO_SC[EX].
- The system call code returned via PT_LWPINFO.

MFC after:	1 week
2015-10-05 21:36:53 +00:00
Mark Johnston
403ec61cbb Revert r288628 and instead fix a discrepancy between the posix_fadvise(2)
man page and POSIX: posix_fadvise(2) returns an error number on failure.

Reported by:	jilles
MFC after:	1 week
2015-10-03 22:27:14 +00:00
Mark Johnston
a7713f7631 The return value of posix_fadvise(2) is just an error status, so
sys_posix_fadvise() should simply return the errno (or 0) to syscallenter()
rather than setting a return value.

MFC after:	1 week
2015-10-03 19:37:41 +00:00
Alan Cox
acada7aef0 Perform a single batched update to the object's paging-in-progress count
rather than updating it for each page.
2015-10-03 17:04:52 +00:00
Poul-Henning Kamp
d58b610faa Fail the sbuf if vsnprintf(3) fails. 2015-10-02 09:23:14 +00:00
Mark Johnston
0a19cfd454 Ensure that vop_stdadvise() does not call getblk() on vnodes that have an
empty bufobj. Otherwise, vnodes belonging to filesystems that do not use the
buffer cache may trigger assertion failures.

Reported by:	Fabien Keil
2015-10-01 16:34:53 +00:00
Colin Percival
2eb0015ab7 Disable suspend when we're shutting down. This solves the "tell FreeBSD
to shut down; close laptop lid" scenario which otherwise tended to end
with a laptop overheating or the battery dying.

The implementation uses a new sysctl, kern.suspend_blocked; init(8) sets
this while rc.suspend runs, and the ACPI sleep code ignores requests while
the sysctl is set.

Discussed on:	freebsd-acpi (35 emails)
MFC after:	1 week
2015-10-01 10:52:26 +00:00
Mark Johnston
3138cd3670 As a step towards the elimination of PG_CACHED pages, rework the handling
of POSIX_FADV_DONTNEED so that it causes the backing pages to be moved to
the head of the inactive queue instead of being cached.

This affects the implementation of POSIX_FADV_NOREUSE as well, since it
works by applying POSIX_FADV_DONTNEED to file ranges after they have been
read or written.  At that point the corresponding buffers may still be
dirty, so the previous implementation would coalesce successive ranges and
apply POSIX_FADV_DONTNEED to the result, ensuring that pages backing the
dirty buffers would eventually be cached.  To preserve this behaviour in an
efficient manner, this change adds a new buf flag, B_NOREUSE, which causes
the pages backing a VMIO buf to be placed at the head of the inactive queue
when the buf is released.  POSIX_FADV_NOREUSE then works by setting this
flag in bufs that underlie the specified range.

Reviewed by:	alc, kib
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D3726
2015-09-30 23:06:29 +00:00
Andriy Gapon
2f2f522b5d save some bytes by using more concise SDT_PROBE<n> instead of SDT_PROBE
SDT_PROBE requires 5 parameters whereas SDT_PROBE<n> requires n parameters
where n is typically smaller than 5.

Perhaps SDT_PROBE should be made a private implementation detail.

MFC after:	20 days
2015-09-28 12:14:16 +00:00
Jeff Roberson
4615830db2 - Collapse vfs_vmio_truncate & vfs_vmio_release into a single function.
- Allow vfs_vmio_invalidate() to free the pages, leaving us with a
   single loop and bufobj lock when B_NOCACHE/B_INVAL is used.
 - Eliminate the special B_ASYNC handling on free that has not been
   relevant for some time.
 - Remove the extraneous page busy from vfs_vmio_truncate().

Reviewed by:	kib
Tested by:	pho
Sponsored by:	EMC / Isilon storage division
2015-09-27 05:16:06 +00:00
Mark Johnston
0a805de6f3 Remove a check for a condition that is always false by a preceding KASSERT
that was added in r144704.
2015-09-26 22:26:55 +00:00
Mark Johnston
d925c2e800 Fix argument ordering in vn_printf().
MFC after:	3 days
2015-09-26 22:16:54 +00:00
Conrad Meyer
2f1c4e0ebf sbuf: Process more than one char at a time
Revamp sbuf_put_byte() to sbuf_put_bytes() in the obvious fashion and
fixup callers.

Add a thin shim around sbuf_put_bytes() with the old ABI to avoid ugly
changes to some callers.

Reviewed by:	jhb, markj
Obtained from:	Dan Sledz
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D3717
2015-09-25 18:37:14 +00:00
Konstantin Belousov
b2557db607 Use per-cpu values for base and last in tc_cpu_ticks(). The values
are updated lockess, different CPUs write its own view of timecounter
state.  The critical section is done for safety, callers of
tc_cpu_ticks() are supposed to already enter critical section, or to
own a spinlock.

The change fixes sporadical reports of too high values reported for
the (W)CPU on platforms that do not provide cpu ticker and use
tc_cpu_ticks(), in particular, arm*.

Diagnosed and reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-09-25 13:03:57 +00:00
Mateusz Guzik
3c44a3495f kqueue: simplify kern_kqueue by not refing/unrefing creds too early
No functional changes.
2015-09-23 12:45:08 +00:00
Jeff Roberson
589c956a5a - Fix a nonsense reordering that somehow slipped into my last diff.
Reported by:	pho
2015-09-23 07:44:07 +00:00
Jeff Roberson
8264830c95 Some refactoring of the buf/vm interface.
- Eliminate bogus page replacement that is inconsistently applied in the
   invalidation loop in brelse.  This has been a no-op in modern times as
   biodone() is responsible for cleaning up after bogus pages.  This
   would've spammed the console with printfs at a minimum.
 - Allow the compiler and human readers alike to reason about allocbuf()
   by splitting it into constituent parts.
 - Separate the VM manipulating and buf manipulating code in brelse() and
   bufdone() so that the intentions are clear.  This makes it evident that
   there are several duplicated buf pages loops that will be consolidated
   at a later time.

Reviewed by:	kib
Tested by:	pho
Sponsored by:	EMC / Isilon Storage Division
2015-09-22 23:57:52 +00:00
Alan Cox
15aaea7892 Change vm_page_unwire() such that it (1) accepts PQ_NONE as the specified
queue and (2) returns a Boolean indicating whether the page's wire count
transitioned to zero.

Exploit this change in vfs_vmio_release() to avoid pointlessly enqueueing
a page that is about to be freed.

(An earlier version of this change was developed by attilio@ and kmacy@.
Any errors in this version are my own.)

Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2015-09-22 18:16:52 +00:00
Hans Petter Selasky
c55f4c9445 Revert r287780 until more developers have their say.
Differential Revision:	https://reviews.freebsd.org/D3521
Requested by:		gnn
2015-09-22 06:51:55 +00:00
Bryan Drewery
6c5c24c98c vfs_mountroot_shuffle() never returns non-zero. 2015-09-22 03:34:07 +00:00
Konstantin Belousov
1f57d8c66b Ensure that maxproc does not exceed pid_max, at the time of boot.
Changes to kern.pid_max mib after the boot can break this relation.

The maxfiles value was calculated by the MAXFILES formula based on
maxproc value, but this change decouples them, and MAXFILES now
references maxusers.  Without manual tuning, the maxfiles default
value remains as it was prior to this commit.  But for systems which
have tuned maxproc and rely on maxfiles to adjust, additional
reconfiguration is needed.

Reported by:	rwatson
Reviewed by:	emaste
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-09-21 15:02:59 +00:00
Konstantin Belousov
cff8c6f2d1 Add support for weak symbols to the kernel linkers. It means that
linkers no longer raise an error when undefined weak symbols are
found, but relocate as if the symbol value was 0.  Note that we do not
repeat the mistake of userspace dynamic linker of making the symbol
lookup prefer non-weak symbol definition over the weak one, if both
are available.  In fact, kernel linker uses the first definition
found, and ignores duplicates.

Signature of the elf_lookup() and elf_obj_lookup() functions changed
to split result/error code and the symbol address returned.
Otherwise, it is impossible to return zero address as the symbol
value, to MD relocation code.  This explains the mechanical changes in
elf_machdep.c sources.

The powerpc64 R_PPC_JMP_SLOT handler did not checked error from the
lookup() call, the patch leaves the code as is (untested).

Reported by:	glebius
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-09-20 01:27:59 +00:00
Edward Tomasz Napierala
0d3d0cc358 Kernel part of reroot support - a way to change rootfs without reboot.
Note that the mountlist manipulations are somewhat fragile, and not very
pretty.  The reason for this is to avoid changing vfs_mountroot(), which
is (obviously) rather mission-critical, but not very well documented,
and thus hard to test properly.  It might be possible to rework it to use
its own simple root mount mechanism instead of vfs_mountroot().

Reviewed by:	kib@
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D2698
2015-09-18 17:32:22 +00:00
John Baldwin
bdd64116b0 Always clear TDB_USERWR before fetching system call arguments. The
TDB_USERWR flag may still be set after a debugger detaches from a
process via PT_DETACH.  Previously the flag would never be cleared
forcing a double fetch of the system call arguments for each system
call.  Note that the flag cannot be cleared at PT_DETACH time in case
one of the threads in the process is currently stopped in
syscallenter() and the debugger has modified the arguments for that
pending system call before detaching.

Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D3678
2015-09-16 20:55:00 +00:00
John Baldwin
4295bec16a When a process group leader exits, all of the processes in the group are
sent SIGHUP and SIGCONT if any of the processes are stopped.  Currently this
behavior is triggered for any type of process stop including ptrace() stops
and transient stops for single threading during exit() and execve().
Thus, if a debugger is attached to a process in a group when the leader
exits, the entire group can be HUPed.  Instead, only send the signals if a
process in the group is stopped due to SIGSTOP.

PR:		201149
Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D3681
2015-09-16 16:40:07 +00:00
Mateusz Guzik
7665e341ca sysctl: switch sysctllock to a sleepable rmlock, take 2
This restores r285125. Previous attempt was reverted due to a bug in rmlocks,
which is fixed since r287833.
2015-09-15 23:06:56 +00:00
John Baldwin
e89d5f43da Threads holding a read lock of a sleepable rm lock are not permitted
to sleep.  The rmlock implementation enforces this by disabling
sleeping when a read lock is acquired. To simplify the implementation,
sleeping is disabled for most of the duration of rm_rlock.  However,
it doesn't need to be disabled until the lock is acquired.  If a
sleepable rm lock is contested, then rm_rlock may need to acquire the
backing sx lock.  This tripped the overly-broad assertion.  Fix by
relaxing the assertion around the call to sx_xlock().

Reported by:	mjg
Reviewed by:	kib, mjg
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D3324
2015-09-15 22:16:21 +00:00
Conrad Meyer
55d33667ee kevent(2): Note DOOMED vnodes with NOTE_REVOKE
In poll mode, check for and wake VBAD vnodes.  (Vnodes that are VBAD at
registration will never be woken by the RECLAIM trigger.)

Add post-VOP_RECLAIM hook to trigger notes on vnode reclamation.  (Vnodes that
were fine at registration but are vgoned while being monitored should signal
waiters.)

Reviewed by:	kib
Approved by:	markj (mentor)
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D3675
2015-09-15 20:22:30 +00:00
Hans Petter Selasky
9acc0eafd7 Implement callout_drain_async(), inspired by the projects/hps_head
branch.

This function is used to drain a callout via a callback instead of
blocking the caller until the drain is complete. Refer to the
callout_drain_async() manual page for a detailed description.

Limitation: If a lock is used with the callout, the callout can only
be drained asynchronously one time unless the callout_init_mtx()
function is called again. This limitation is not present in
projects/hps_head and will require more invasive changes to the
timeout code, which was not in the scope of this patch.

Differential Revision:	https://reviews.freebsd.org/D3521
Reviewed by:		wblock
MFC after:		1 month
2015-09-14 10:52:26 +00:00
Warner Losh
7297c5e535 bufdonebio is now unused. Retire it too. 2015-09-11 04:20:04 +00:00
Mark Johnston
610141cebb Add stack_save_td_running(), a function to trace the kernel stack of a
running thread.

It is currently implemented only on amd64 and i386; on these
architectures, it is implemented by raising an NMI on the CPU on which
the target thread is currently running. Unlike stack_save_td(), it may
fail, for example if the thread is running in user mode.

This change also modifies the kern.proc.kstack sysctl to use this function,
so that stacks of running threads are shown in the output of "procstat -kk".
This is handy for debugging threads that are stuck in a busy loop.

Reviewed by:	bdrewery, jhb, kib
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D3256
2015-09-11 03:54:37 +00:00
Warner Losh
ad8d57a99d dev_strategy and dev_strategy_csw are unused since r281825. Remove
them.

Differential Revision: https://reviews.freebsd.org/D3620
2015-09-11 00:38:58 +00:00
Adrian Chadd
32766cd281 Also make kern.maxfilesperproc a boot time tunable.
Auto-tuning threshold discussions aside, it turns out that if you want
to lower this on say, rather memory-packed machines, you either set maxusers
or kern.maxfiles, or you set it in sysctl.  The former is a non-exact
way to tune this; the latter doesn't actually affect anything in the
startup scripts.

This first occured because I wondered why the hell screen would take upwards
of 10 seconds to spawn a new screen.  I then found python doing the same
thing during fork/exec of child processes - it calls close() on each FD
up to the current openfiles limit.  On a 1TB machine this is like, 26 million
FDs per process.  Ugh.

So:

* This allows it to be set early in /boot/loader.conf;
* It can be used to work around the ridiculous situation of
  screen, python, etc doing a close() on potentially millions of FDs
  even though you only have four open.

Tested:

* 4GB, 32GB, 64GB, 128GB, 384GB, 1TB systems with autotune, ensuring
  screen and python forking doesn't result in some pretty hilariously
  bad behaviour.

TODO:

* Note that the default login.conf sets openfiles-cur to unlimited,
  effectively obeying kern.maxfilesperproc.  Perhaps we should fix
  this.

* .. and even if we do, we need to also ensure that daemons get
  a soft limit of something reasonable and capped - they can request
  more FDs themselves.

MFC after:	1 week
Sponsored by:	Norse Corp, Inc.
2015-09-10 04:05:58 +00:00
Konstantin Belousov
9e18c9eb27 For open("name", O_DIRECTORY | O_CREAT), do not try to create the
named node, open(2) cannot create directories.  But do allow the flag
combination to succeed if the directory already exists.

Declare the open("name", O_DIRECTORY | O_CREAT | O_EXCL) always
invalid for the same reason, since open(2) cannot create directory.

Note that there is an argument that O_DIRECTORY | O_CREAT should be
invalid always, regardless of the target directory existence or
O_EXCL.  The current fix is conservative and allows the call to
succeed in the situation where it succeeded before the patch.

Reported by:	Tom Ridge <freebsd@tom-ridge.com>
Reviewed by:	rwatson
PR:	 202892
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-09-09 19:31:08 +00:00
Mateusz Guzik
9af8c8b72b fd: make rights a mandatory argument to fgetvp_rights
The only caller already always passes rights.
2015-09-07 20:05:56 +00:00
Mateusz Guzik
d7832811a7 fd: make the common case in filecaps_copy work lockless
The filedesc lock is only needed if ioctls caps are present, which is a
rare situation. This is a step towards reducing the scope of the filedesc
lock.
2015-09-07 20:02:56 +00:00
Conrad Meyer
bcb60d52e6 Follow-up to r287442: Move sysctl to compiled-once file
Avoid duplicate sysctl nodes.

Found by:	tijl
Approved by:	markj (mentor)
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D3586
2015-09-07 16:44:28 +00:00