Commit Graph

14616 Commits

Author SHA1 Message Date
Mateusz Guzik
17182b1351 session: avoid proctree lock on proc exit when possible
We can get away with the common case with only proc lock held.

Reviewed by:	kib
2016-01-20 23:33:58 +00:00
Mateusz Guzik
6760182c5d session: tidy up fixjobc
This stops abusing the 'p' pointer for iteration over children processes
and gets rid of useless locking around PRS_ZOMBIE check.

Suggested by:	kib
2016-01-20 23:22:36 +00:00
Marius Strobl
9750d9e5b6 Fix tty_drain() and, thus, TIOCDRAIN of the current tty(4) incarnation
to actually wait until the TX FIFOs of UARTs have be drained before
returning. This is done by bringing the equivalent of the TS_BUSY flag
found in the previous implementation back in an ABI-preserving way.
Reported and tested by: Patrick Powell

Most likely, drivers for USB-serial-adapters likewise incorporating
TX FIFOs as well as other terminal devices that buffer output in some
form should also provide implementations of tsw_busy.

MFC after:	3 days
2016-01-19 23:34:27 +00:00
John Baldwin
8a4dc40ff4 Various cleanups to the main function for AIO kernel processes:
- Pull the vmspace logic out into helper functions and reduce duplication.
  Operations on the vmspace are all isolated to vm_map.c, but it now exports
  a new 'vmspace_switch_aio' for use by AIO kernel processes.
- When an AIO kernel process wants to exit, break out of the main loop and
  perform cleanup after the loop end.  This reduces a lot of indentation and
  allows cleanup to more closely mirror setup actions before the loop starts.
- Convert a DIAGNOSTIC to KASSERT().
- Replace mycp with more typical 'p'.

Reviewed by:	kib
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D4990
2016-01-19 21:37:51 +00:00
John Baldwin
f2e7f06a0d Don't create a dedicated session for each AIO kernel process.
This code dates back to the initial AIO support and the commit log does
not explain why it is needed.  However, I cannot find anything in the
AIO code or the various file methods (fo_read/fo_write) that would change
behavior due to using a private session instead of proc0's session.

Reviewed by:	kib
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D4988
2016-01-19 20:46:30 +00:00
Mark Johnston
793c381706 Add vrefl(), a locked variant of vref(9).
This API has no in-tree consumers at the moment but is useful to at least
one out-of-tree consumer, and naturally complements existing vnode refcount
functions (vholdl(9), vdropl(9)).

Obtained from:	kib (sys/ portion)
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D4947
Differential Revision:	https://reviews.freebsd.org/D4953
2016-01-18 22:21:46 +00:00
Konstantin Belousov
ce958bdefe When cleaning up from failed adv locking and checking for write, do
not call VOP_CLOSE() manually.  Instead, delegate the close to
fo_close() performed as part of the fdrop() on the file failed to
open.  For this, finish constructing file on error, in particular, set
f_vnode and f_ops.

Forcibly resetting f_ops to badfileops disabled additional cleanups
performed by fo_close() for some file types, in this case it was noted
that cdevpriv data was corrupted.  Since fo_close() call must be
enabled for some file types, it makes more sense to enable it for all
files opened through vn_open_cred().

In collaboration with:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-01-17 08:40:51 +00:00
John Baldwin
6c8fd02283 Remove aiod_timeout.
It hasn't been used since the AIO code was made MPSAFE 10 years ago.

Reviewed by:	kib
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D4946
2016-01-14 21:28:56 +00:00
John Baldwin
c85650cacc Rename aiod_bio taskqueue to aiod_kick.
This taskqueue is not used to handle bio requests.  It is only used to
run aio_kick_nowait() to spin up new aio daemon processes.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D4904
2016-01-14 20:51:48 +00:00
Gleb Smirnoff
c8358c6e0d Call crextend() before copying old credentials to the new credentials
and replace crcopysafe by crcopy as crcopysafe is is not intended to be
safe in a threaded environment, it drops PROC_LOCK() in while() that
can lead to unexpected results, such as overwrite kernel memory.

In my POV crcopysafe() needs special attention. For now I do not see
any problems with this function, but who knows.

Submitted by:	dchagin
Found by:	trinity
Security:	SA-16:04.linux
2016-01-14 10:16:25 +00:00
Colin Percival
c0ada0377a Fix a bug introduced in r291716:
"The problem with the approach taken both in _bus_dmamap_load_pages and
bus_dmamap_load_ma_triv is that they split the request buffer into
arbitrary chunks based on page boundaries, creating segments that no
longer have a size that's a multiple of the sector size. This breaks
drivers like blkfront (and probably other stuff)." [1]

This was most easily triggered by running `fsck /` on a system running
in Xen (e.g. Amazon EC2) but also showed up via growfs(8) and probably
many other userland tools which access the disk directly.

Patch by:	royger [1]
"Thinks this should be fine" by:	ken
2016-01-11 20:38:39 +00:00
Dmitry Chagin
038c720553 Implement vsyscall hack. Prior to 2.13 glibc uses vsyscall
instead of vdso. An upcoming linux_base-c6 needs it.

Differential Revision:  https://reviews.freebsd.org/D1090

Reviewed by:	kib, trasz
MFC after:	1 week
2016-01-09 20:18:53 +00:00
Mark Johnston
11f9ca696e Prevent cv_waiters wraparound.
r282971 attempted to fix this problem by decrementing cv_waiters after
waking up from sleeping on a condition variable, but this can result in
a use-after-free if the CV is freed before all woken threads have had a
chance to run. Instead, avoid incrementing cv_waiters past INT_MAX, and
have cv_signal() explicitly check for sleeping threads once cv_waiters has
reached this bound.

Reviewed by:	jhb
MFC after:	2 weeks
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D4822
2016-01-09 01:56:46 +00:00
Gleb Smirnoff
2bab0c5535 New sendfile(2) syscall. A joint effort of NGINX and Netflix from 2013 and
up to now.

The new sendfile is the code that Netflix uses to send their multiple tens
of gigabits of data per second. The new implementation features asynchronous
I/O, when I/O operations are launched, but not awaited to be complete. An
explanation of why such behavior is beneficial compared to old one is
going to be too long for a commit message, so we will skip it here.

Additional features of new syscall are extra flags, which provide an
application more control over data sent. The SF_NOCACHE flag tells
kernel that data shouldn't be cached after it was sent. The SF_READAHEAD()
macro allows to specify readahead size in pages.

The new syscalls is a drop in replacement. No modifications are required
to applications. One can take nginx binary for stable/10 and run it
successfully on head. Although SF_NODISKIO lost its original sense, as now
sendfile doesn't block, and now means something completely different (tm),
using the new sendfile the old way is absolutely safe.

Celebrates:	Netflix global launch!
Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
Relnotes:	yes
2016-01-08 20:34:57 +00:00
Gleb Smirnoff
829fae9063 Make it possible for sbappend() to preserve M_NOTREADY on mbufs, just like
sbappendstream() does. Although, M_NOTREADY may appear only on SOCK_STREAM
sockets, due to sendfile(2) supporting only the latter, there is a corner
case of AF_UNIX/SOCK_STREAM socket, that still uses records for the sake
of control data, albeit being stream socket.

Provide private version of m_clrprotoflags(), which understands PRUS_NOTREADY,
similar to m_demote().
2016-01-08 19:03:20 +00:00
Gleb Smirnoff
ffd1c319a9 Revert r293405: it breaks socket buffer INVARIANTS when sending control
data over local sockets.
2016-01-08 17:27:23 +00:00
Gleb Smirnoff
2f2edf0a08 For SOCK_STREAM socket use sbappendstream() instead of sbappend(). 2016-01-08 01:16:03 +00:00
Konstantin Belousov
0de1455462 Convert tty common code to use make_dev_s().
Tty.c was untypical in that it handled the si_drv1 issue consistently
and correctly, by always checking for si_drv1 being non-NULL and
sleeping if NULL.  The removed code also illustrated unneeded
complications in drivers which are eliminated by the use of new KPI.

Reviewed by:	hps, jhb
Discussed with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
Differential revision:	https://reviews.freebsd.org/D4746
2016-01-07 20:15:09 +00:00
Konstantin Belousov
48ce5d4cac Provide yet another KPI for cdev creation, make_dev_s(9).
Immediate problem fixed by the new KPI is the long-standing race
between device creation and assignments to cdev->si_drv1 and
cdev->si_drv2, which allows the window where cdevsw methods might be
called with si_drv1,2 fields not yet set.  Devices typically checked
for NULL and returned spurious errors to usermode, and often left some
methods unchecked.

The new function interface is designed to be extensible, which should
allow to add more features to make_dev_s(9) without inventing yet
another name for function to create devices, while maintaining KPI and
even KBI backward-compatibility.

Reviewed by:	hps, jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
Differential revision:	https://reviews.freebsd.org/D4746
2016-01-07 20:08:02 +00:00
Mateusz Guzik
6b53d1bc6f cache: ansify functions and fix some style issues
No functional changes.
2016-01-07 02:04:17 +00:00
Konstantin Belousov
1041e09089 Two fixes for excessive iterations after r292326.
Advance the logical block number to the lblkno of the found block plus
one, instead of incrementing the block number which was used for
lookup.  This change skips sparcely populated buffer ranges, similar
to r292325, instead of doing useless lookups.

Do not restart the bnoreuselist() from the start of the range if
buffer lock cannot be obtained without sleep.  Only retry lookup and
lock for the same queue and same logical block number.

Reported by:	benno
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2016-01-05 14:48:40 +00:00
Ian Lepore
69dcb7e771 Make the 'env' directive described in config(5) work on all architectures,
providing compiled-in static environment data that is used instead of any
data passed in from a boot loader.

Previously 'env' worked only on i386 and arm xscale systems, because it
required the MD startup code to examine the global envmode variable and
decide whether to use static_env or an environment obtained from the boot
loader, and set the global kern_envp accordingly.  Most startup code wasn't
doing so.  Making things even more complex, some mips startup code uses an
alternate scheme that involves calling init_static_kenv() to pass an empty
buffer and its size, then uses a series of kern_setenv() calls to populate
that buffer.

Now all MD startup code calls init_static_kenv(), and that routine provides
a single point where envmode is checked and the decision is made whether to
use the compiled-in static_kenv or the values provided by the MD code.

The routine also continues to serve its original purpose for mips; if a
non-zero buffer size is passed the routine installs the empty buffer ready
to accept kern_setenv() values.  Now if the size is zero, the provided buffer
full of existing env data is installed.  A NULL pointer can be passed if the
boot loader provides no env data; this allows the static env to be installed
if envmode is set to do so.

Most of the work here is a near-mechanical change to call the init function
instead of directly setting kern_envp.  A notable exception is in xen/pv.c;
that code was originally installing a buffer full of preformatted env data
along with its non-zero size (like mips code does), which would have allowed
kern_setenv() calls to wipe out the preformatted data.  Now it passes a zero
for the size so that the buffer of data it installs is treated as
non-writeable.
2016-01-02 02:53:48 +00:00
Marius Strobl
45005f7907 - (Ab)use udivx for dividing the u_int pc_cpuid when implementing
CPU_ISSET(), CPU_SET etc. in sparc64 asm. This approach has the
  benefit of not clobbering %y, allowing to revert r222827 and
  partially r222828.
- In r222828, CATR() already was changed to use the equivalent of
  PCPU_GET(cpuid) instead of the MD module ID for KTR_CPU, so
  belatedly also catch up with the C side of ktr(9). Originally,
  in r203838 CATR() was moved away from directly reading the
  module ID or equivalent as that became impractical with other
  CPU types than USI/II supported. With r222828 in place, per-CPU
  data generally is set up soon enough, though, that employing
  PCPU things in ktr(9) also for use during early stages works.
- Unfortunately, an exception to the latter is the ktr(9) use
  in pmap_bootstrap(), which actually is run so early that even
  checking for bootverbose being set via the loader doesn't work.
  Consequently, replace the ktr(9) use in pmap_bootstrap() with
  OF_printf(9) and put it under #ifdef DIAGNOSTIC instead.

MFC after:	3 days
2015-12-30 13:49:20 +00:00
John Baldwin
5fcfab6e32 Add ptrace(2) reporting for LWP events.
Add two new LWPINFO flags: PL_FLAG_BORN and PL_FLAG_EXITED for reporting
thread creation and destruction. Newly created threads will stop to report
PL_FLAG_BORN before returning to userland and exiting threads will stop to
report PL_FLAG_EXIT before exiting completely. Both of these events are
only enabled and reported if PT_LWP_EVENTS is enabled on a process.
2015-12-29 23:25:26 +00:00
John Baldwin
d1e7a4a553 Call kern_thr_exit() instead of duplicating it.
This code is missing the racct_subr() call from kern_thr_exit() and would
require further code duplication in future changes.

Reviewed by:	kib
MFC after:	1 week
2015-12-29 23:16:20 +00:00
Dmitry Chagin
3e18d701de Verify that tv_sec value specified in settimeofday() and clock_settime()
(CLOCK_REALTIME case) system calls is non negative.
This commit hides a kernel panic in atrtc_settime() as the clock_ts_to_ct()
does not properly convert negative tv_sec.

ps. in my opinion clock_ts_to_ct() should be rewritten to properly handle
negative tv_sec values.

Differential Revision:	https://reviews.freebsd.org/D4714
Reviewed by:		kib

MFC after:	1 week
2015-12-27 15:37:07 +00:00
Konstantin Belousov
1899507706 Do not substitute interpeter if the brand interpreter path is
different from the interpreter path requested by the binary.

Before this change, it is impossible to activate non-default
interpreter for 32bit image on amd64, when /libexec/ld-elf32.so.1 file
exists.

Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-12-26 15:40:12 +00:00
Jonathan T. Looney
d3ee0a15a4 Only allow one PT_INTERP ELF program header. This also fixes a potential
memory leak for interp_buf.

Differential Revision:	https://reviews.freebsd.org/D4692
Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	Juniper Networks
2015-12-24 00:58:11 +00:00
Enji Cooper
853a17ad6f Fix r292640
vim overzealously removed some trailing `+' and I didn't check the
diff

MFC after: 1 week
X-MFC with: r292640
Pointyhat to: ngie
Sponsored by: EMC / Isilon Storage Division
2015-12-23 03:34:43 +00:00
Enji Cooper
905b145f0a Clean up trailing whitespace; no functional change
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
2015-12-23 03:29:37 +00:00
Enji Cooper
20b4c1d2cb Fold lim_shared into lim_copy to mute a -Wunused compiler warning from
clang when the kernel is compiled without INVARIANTS

Differential Revision: https://reviews.freebsd.org/D4683
Reviewed by: kib, jhb
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
2015-12-22 21:07:33 +00:00
Konstantin Belousov
d943fa35ad If we annoy user with the terminal output due to failed load of
interpreter, also show the actual error code instead of some
interpretation.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-12-22 20:12:52 +00:00
Jonathan T. Looney
54503a13d8 Add a safety net to reclaim mbufs when one of the mbuf zones become
exhausted.

It is possible for a bug in the code (or, theoretically, even unusual
network conditions) to exhaust all possible mbufs or mbuf clusters.
When this occurs, things can grind to a halt fairly quickly. However,
we currently do not call mb_reclaim() unless the entire system is
experiencing a low-memory condition.

While it is best to try to prevent exhaustion of one of the mbuf zones,
it would also be useful to have a mechanism to attempt to recover from
these situations by freeing "expendable" mbufs.

This patch makes two changes:

a) The patch adds a generic API to the UMA zone allocator to set a
function that should be called when an allocation fails because the
zone limit has been reached. Because of the way this function can be
called, it really should do minimal work.

b) The patch uses this API to try to free mbufs when an allocation
fails from one of the mbuf zones because the zone limit has been
reached. The function schedules a callout to run mb_reclaim().

Differential Revision:	https://reviews.freebsd.org/D3864
Reviewed by:	gnn
Comments by:	rrs, glebius
MFC after:	2 weeks
Sponsored by:	Juniper Networks
2015-12-20 02:05:33 +00:00
Mateusz Guzik
aa0241d623 proc: fix a race which could result in dereference of bad p_pgrp pointer on fork
During fork p_starcopy - p_endcopy area of a process is populated with bcopy
with only proc lock held. Another forking thread can find such a process and
proceed to access p_pgrp included in said area.

Fix the problem by moving the field outside. It is being properly assigned
later.

Reviewed by:	kib
Diagnosed by:	kib
Tested by:	Fabian Keil <freebsd-listen fabiankeil.de>
MFC after:	10 days
2015-12-18 16:33:15 +00:00
Adrian Chadd
2b3ad18853 [intrng] Migrate the intrng code from sys/arm/arm to sys/kern/subr_intr.c.
The ci20 port (by kan@) is going to reuse almost all of the intrng code
since the SoC in question looks suspiciously like someone took an ARM
SoC design and replaced the ARM core with a MIPS core.

* migrate out the code;
* rename ARM_ -> INTR_;
* rename arm_ -> intr_;
* move the interrupt flush routine from intr.c / intrng.c into
  arm/machdep_intr.c - removing the code duplication and removing
  the ARM specific bits from here.

Thanks to the Star Wars: The Force Awakens premiere line for allowing
me a couple hours of quiet time to finish the universe builds.

Tested:

* make universe

TODO:

* The structure definitions in subr_intr.c still includes machine/intr.h
  which requires one duplicates all of the intrng definitions in
  the platform code (which kan has done, and I think we don't have to.)

  Instead I should break out the generic things (function declarations,
  common intr structures, etc) into a separate header.

* Kan has requested I make the PIC based IPI stuff optional.
2015-12-18 05:43:59 +00:00
Mark Johnston
8ff6d9dd22 Support an arbitrary number of arguments to DTrace syscall probes.
Rather than pushing all eight possible arguments into dtrace_probe()'s
stack frame, make the syscall_args struct for the current syscall available
via the current thread. Using a custom getargval method for the systrace
provider, this allows any syscall argument to be fetched, even in kernels
that have modified the maximum number of system call arguments.

Sponsored by:	EMC / Isilon Storage Division
2015-12-17 00:00:27 +00:00
Mark Johnston
3616095801 Fix style issues around existing SDT probes.
- Use SDT_PROBE<N>() instead of SDT_PROBE(). This has no functional effect
  at the moment, but will be needed for some future changes.
- Don't hardcode the module component of the probe identifier. This is
  set automatically by the SDT framework.

MFC after:	1 week
2015-12-16 23:39:27 +00:00
Gleb Smirnoff
b0cd20172d A change to KPI of vm_pager_get_pages() and underlying VOP_GETPAGES().
o With new KPI consumers can request contiguous ranges of pages, and
  unlike before, all pages will be kept busied on return, like it was
  done before with the 'reqpage' only. Now the reqpage goes away. With
  new interface it is easier to implement code protected from race
  conditions.

  Such arrayed requests for now should be preceeded by a call to
  vm_pager_haspage() to make sure that request is possible. This
  could be improved later, making vm_pager_haspage() obsolete.

  Strenghtening the promises on the business of the array of pages
  allows us to remove such hacks as swp_pager_free_nrpage() and
  vm_pager_free_nonreq().

o New KPI accepts two integer pointers that may optionally point at
  values for read ahead and read behind, that a pager may do, if it
  can. These pages are completely owned by pager, and not controlled
  by the caller.

  This shifts the UFS-specific readahead logic from vm_fault.c, which
  should be file system agnostic, into vnode_pager.c. It also removes
  one VOP_BMAP() request per hard fault.

Discussed with:	kib, alc, jeff, scottl
Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2015-12-16 21:30:45 +00:00
Konstantin Belousov
106ebb761a Optimize vop_stdadvise(POSIX_FADV_DONTNEED). Instead of looking up a
buffer for each block number in the range with gbincore(), look up the
next instantiated buffer with the logical block number which is
greater or equal to the next lblkno.  This significantly speeds up the
iteration for sparce-populated range.

Move the iteration into new helper bnoreuselist(), which is structured
similarly to flushbuflist().

Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
2015-12-16 08:48:37 +00:00
Konstantin Belousov
8549b4b9fe Simplify the loop step in the flushbuflist() and make it independed on
the type stability of the buffers memory.  Instead of memoizing
pointer to the next buffer and validating it, remember the next
logical block number in the bo list and re-lookup.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2015-12-16 08:39:51 +00:00
Adrian Chadd
45130aa12e Don't call wakeup if we're just returning reserved space; just
return the reservation and wait for more space to appear.

Submitted by:	jeff
Reviewed by:	kib
2015-12-16 00:13:16 +00:00
Jamie Gritton
6ab6058ec4 Fix jail name checking that disallowed anything that starts with '0'.
The intention was to just limit leading zeroes on numeric names.  That
check is now improved to also catch the leading spaces and '+' that
strtoul can pass through.

PR:		204897
MFC after:	3 days
2015-12-15 17:25:00 +00:00
Edward Tomasz Napierala
af1a7b2526 Tweak comments.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2015-12-13 11:30:36 +00:00
Edward Tomasz Napierala
a8e0740e85 Actually make the 'amount' argument to racct_adjust_resource() signed,
as it was always supposed to be.

MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2015-12-13 11:21:13 +00:00
Edward Tomasz Napierala
57e2ebdc08 Avoid useless relocking.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2015-12-13 11:08:29 +00:00
Mark Johnston
d9e2e68d38 Don't make assertions about td_critnest when the scheduler is stopped.
A panicking thread always executes with a critical section held, so any
attempt to allocate or free memory while dumping will otherwise cause a
second panic. This can occur, for example, if xpt_polled_action() completes
non-dump I/O that was pending at the time of the panic. The fact that this
can occur is itself a bug, but asserting in this case does little but
reduce the reliability of kernel dumps.

Suggested by:	kib
Reported by:	pho
2015-12-11 20:05:07 +00:00
Warner Losh
da82615ae2 Create the MDT_PNP_INFO metadata record to communicate PNP info about
modules. External agents may use this data to automatically load those
modules.

Differential Review: https://reviews.freebsd.org/D3461
2015-12-11 05:27:53 +00:00
Steven Hartland
f1b13a89b0 Don't use 0 for pointer comparison
Use NULL instead of 0 for comparison with panicstr.

MFC after:	1 week
Sponsored by:	Multiplay
2015-12-08 18:38:33 +00:00
Mark Johnston
711fbd17ec Add helper functions proc_readmem() and proc_writemem().
These helper functions can be used to read in or write a buffer from or to
an arbitrary process' address space. Without them, this can only be done
using proc_rwmem(), which requires the caller to fill out a uio. This is
onerous and results in code duplication; the new functions provide a simpler
interface which is sufficient for most existing callers of proc_rwmem().

This change also adds a manual page for proc_rwmem() and the new functions.

Reviewed by:	jhb, kib
Differential Revision:	https://reviews.freebsd.org/D4245
2015-12-07 21:33:15 +00:00
Ed Maste
4c22b4686b Replace magic value ELF note type with NT_FREEBSD_ABI_TAG
As of r291909 elf_common.h provides a definition.

Suggested by:	kib
Sponsored by:	The FreeBSD Foundation
2015-12-07 18:43:27 +00:00