Commit Graph

14722 Commits

Author SHA1 Message Date
Ian Lepore
85143dd18d Allow a dynamic env to override a compiled-in static env by passing in the
override indication in the env data.

Submitted by:	bde
2016-02-21 18:35:01 +00:00
Justin Hibbits
7915adb560 Introduce a RMAN_IS_DEFAULT_RANGE() macro, and use it.
This simplifies checking for default resource range for bus_alloc_resource(),
and improves readability.

This is part of, and related to, the migration of rman_res_t from u_long to
uintmax_t.

Discussed with:	jhb
Suggested by:	marcel
2016-02-20 01:32:58 +00:00
Mark Johnston
88c2beac9c Ensure that we test the event condition when a disabled kevent is enabled.
r274560 modified kqueue_register() to only test the event condition if the
corresponding knote is not disabled. However, this check takes place before
the EV_ENABLE flag is used to clear the KN_DISABLED flag on the knote, so
enabling a previously-disabled kevent would not result in a notification for
a triggered event. This change fixes the problem by testing for EV_ENABLED
before possibly checking the event condition.

This change also updates a kqueue regression test to exercise this case.

PR:		206368
Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D5307
2016-02-19 01:49:33 +00:00
Mark Johnston
fe169828c3 Return an error if both EV_ENABLE and EV_DISABLE are specified for a kevent.
Currently, this combination results in EV_DISABLE being ignored.

Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D5307
2016-02-19 01:35:01 +00:00
Zbigniew Bodek
910905c74f Fix build for i386 and arm64 after r295755
- Take bus_space_tag_t type into consideration when returning
  default, zero value.
- Include missing rman.h required by ofw_pci.h
2016-02-18 15:44:45 +00:00
Zbigniew Bodek
b998c9656b Introduce bus_get_bus_tag() method
Provide bus_get_bus_tag() for sparc64, powerpc, arm, arm64 and mips
nexus and its children in order to return a platform specific default tag.

This is required to ensure generic correctness of the bus_space tag.
It is especially needed for arches where child bus tag does not match
the parent bus tag. This solves the problem with ppc architecture
where the PCI bus tag differs from parent bus tag which is big-endian.

This commit is a part of the following patch:
https://reviews.freebsd.org/D4879

Submitted by:  Marcin Mazurek <mma@semihalf.com>
Obtained from: Semihalf
Sponsored by:  Annapurna Labs
Reviewed by:   jhibbits, mmel
Differential Revision: https://reviews.freebsd.org/D4879
2016-02-18 13:00:04 +00:00
Konstantin Belousov
fa48f413ef In bnoreuselist(), check both ends of the specified logical block
numbers range.

This effectively skips indirect and extdata blocks on the buffer
queue.  Since their logical block numbers are negative, bnoreuselist()
could loop infinitely.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-02-17 19:39:57 +00:00
Warner Losh
c55f57071a Create an API to reset a struct bio (g_reset_bio). This is mandatory
for all struct bio you get back from g_{new,alloc}_bio. Temporary
bios that you create on the stack or elsewhere should use this before
first use of the bio, and between uses of the bio. At the moment, it
is nothing more than a wrapper around bzero, but that may change in
the future. The wrapper also removes one place where we encode the
size of struct bio in the KBI.
2016-02-17 17:16:02 +00:00
Andrew Turner
96922be6b5 Remove an unused FDT header, fdt_common.h should only be needed in a few
places, mostly in sys/dev/fdt and legacy code.

Sponsored by:	ABT Systems Ltd
2016-02-15 17:05:03 +00:00
Adrian Chadd
0cc5515a85 Allow MIPS INTRNG code to be built without FDT support.
This patch allows the newly imported INTRNG code to be built without necessarily
having FDT support in the kernel.  This may be useful for some MIPS platforms
that wish to move to INTRNG, but not to FDT at the same time.

Basically all the code is already within ifdef's where FDT is concerned,
it's just the headers that aren't.

Submitted by:	Stanislav Galabov <sgalabov@gmail.com>
Differential Revision:	https://reviews.freebsd.org/D5249
2016-02-15 14:34:35 +00:00
Gleb Smirnoff
5e4bc63b7c o Gather all mbuf(9) allocation functions into kern_mbuf.c, and all
mbuf(9) manipulation functions into uipc_mbuf.c.  This looks like
  the initial intent, but had diffused in the last decade.

o Gather all declarations in mbuf.h in one place and sort them.

o Uninline m_clget() and m_cljget().

There are no functional changes in this patch.

The patch comes from a larger version, where all mbuf(9) allocation was
uninlined, which allowed to make mbuf(9) UMA zones private to kern_mbuf.c.
The performance impact of the total uninlining is still unclear, so we
are holding on now with larger version.

Together with:	melifaro, olivier
2016-02-11 21:32:23 +00:00
Konstantin Belousov
0be1e0e879 Remove useless checks for NULL before calling free(9), in the kernel
elf linkers.

Found by:	Related PVS-Studio diagnostic
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-02-10 21:35:00 +00:00
Konstantin Belousov
86a448c3a4 Finish r173600. There is no need to test a condition if both cases
result in the same value.

Found by:	PVS-Studio
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-02-10 21:16:37 +00:00
Gleb Smirnoff
b4b12e52fb Garbage collect unused arguments of m_init(). 2016-02-10 18:54:18 +00:00
Gleb Smirnoff
b28cc462ad Include sys/_task.h into uma_int.h, so that taskqueue.h isn't a
requirement for uma_int.h.

Suggested by:	jhb
2016-02-09 20:22:35 +00:00
Konstantin Belousov
db57c70a5b Rename P_KTHREAD struct proc p_flag to P_KPROC.
I left as is an apparent bug in ntoskrnl_var.h:AT_PASSIVE_LEVEL()
definition.

Suggested by:	jhb
Sponsored by:	The FreeBSD Foundation
2016-02-09 16:30:16 +00:00
John Baldwin
fb1f4582ff Call kthread_exit() rather than kproc_exit() for a premature kthread exit.
Kernel threads (and processes) are supposed to call kthread_exit() (or
kproc_exit()) to terminate.  However, the kernel includes a fallback in
fork_exit() to force a kthread exit if a kernel thread's "main" routine
returns.  This fallback was added back when the kernel only had processes
and was not updated to call kthread_exit() instead of kproc_exit() when
threads were added to the kernel.

This mistake was particular exciting when the errant thread belonged to
proc0.  Due to the missing P_KTHREAD flag the fallback did not kick in
and instead tried to return to userland via whatever garbage was in the
trapframe.  With P_KTHREAD set it tried to terminate proc0 resulting in
other amusements.

PR:		204999
MFC after:	1 week
2016-02-08 23:11:23 +00:00
John Baldwin
6270fa5f72 Mark proc0 as a kernel process via the P_KTHREAD flag.
All other kernel processes have this flag set and all threads in proc0
(including thread0) have the similar TDP_KTHREAD flag set.

PR:		204999
Submitted by:	Oliver Pinter @ HardenedBSD
Reviewed by:	kib
MFC after:	1 week
2016-02-08 23:06:27 +00:00
Konstantin Belousov
2dab579bfc Remove the assert which outlived its usefulness, and, by default,
disable compilation of the code which made it possible to call
stop_all_proc() from usermode at all.

Move the comment to the preamble of stop_all_proc() and reword it to
give overview of the function intent.

proc0 has P_HADTHREADS flag set due to kthread_add(), but no
P_KTHREAD, which triggered the assert, which does not serve a purpose
now.

Reported by:	Oliver Pinter
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-02-08 10:54:27 +00:00
Jilles Tjoelker
f00fb5457e semget(): Check for [EEXIST] error first.
Although POSIX literally permits failing with [EINVAL] if IPC_CREAT and
IPC_EXCL were both passed, the semaphore set already exists and has fewer
semaphores than nsems, this does not allow an application to retry safely:
if the [EINVAL] is actually because of the semmsl limit, an infinite loop
would result.

PR:		206927
2016-02-07 22:12:39 +00:00
Pedro F. Giffuni
7559912039 Minor grammar fix in comment. 2016-02-07 16:18:12 +00:00
Kirk McKusick
b00b459084 Clarify a comment in kern_openat() about the use of falloc_noinstall().
Suggested by: Steve Jacobson
2016-02-07 01:04:47 +00:00
Mateusz Guzik
0c829a301d fork: ansify sys_pdfork
No functional changes.
2016-02-06 09:01:03 +00:00
John Baldwin
5652770d8f Rename aiocblist to kaiocb and use consistent variable names.
Typically <foo>list is used for a structure that holds a list head in
FreeBSD, not for members of a list.  As such, rename 'struct aiocblist'
to 'struct kaiocb' (the kernel version of 'struct aiocb').

While here, use more consistent variable names for AIO control blocks:

- Use 'job' instead of 'aiocbe', 'cb', 'cbe', or 'iocb' for kernel job
  objects.
- Use 'jobn' instead of 'cbn' for use with TAILQ_FOREACH_SAFE().
- Use 'sjob' and 'sjobn' instead of 'scb' and 'scbn' for fsync jobs.
- Use 'ujob' instead of 'aiocbp', 'job', 'uaiocb', or 'uuaiocb' to hold
  a user pointer to a 'struct aiocb'.
- Use 'ujobp' instead of 'aiocbp' for a user pointer to a 'struct aiocb *'.

Reviewed by:	kib
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D5125
2016-02-05 20:38:09 +00:00
Konstantin Belousov
af582aaed1 When matching brand to the ELF binary by notes, try to find a brand
with interpreter name exactly matching one wanted by the binary.  If
no such brand exists, return first brand which accepted the binary by
note.

The change fixes a regression after r292749, where e.g. our two ia32
compat brands, ia32_brand_info and ia32_brand_oinfo, only differ by
the interpeter path and binary matches to a brand by linkage order.
Then old binaries which require /usr/libexec/ld-elf.so.1 but matched
against ia32_brand_info with interp_path /libexec/ld-elf.so.1, were
considered requiring non-standard interpreter name, and magic to force
ld-elf32.so.1 did not happen.

Note that it might make sense to apply the same selection of brands
for other matching criteria, SCO EI_OSABI and 3.x string.

Reported and tested by:	dwmalone
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2016-02-04 20:55:49 +00:00
Konstantin Belousov
76c404fce5 Do not copy by field when converting struct oexport_args to struct
export_args on mount update, bzero() is consistent with
vfs_oexport_conv().
Make the code structure more explicit by using switch.
Return EINVAL if export option layout (deduced from size) is unknown.

Based on the submission by:	bde
Sponsored by:	The FreeBSD Foundation
2016-02-04 16:32:21 +00:00
Konstantin Belousov
4732ae438a Guard against runnable td2 exiting and than being reused for unrelated
process when the parent sleeps waiting for the debugger attach on
fork.

Diagnosed and reviewed by:	mjg
Sponsored by:	The FreeBSD Foundation
2016-02-04 10:49:34 +00:00
Mateusz Guzik
813361c140 fork: plug a use after free of the returned process
fork1 required its callers to pass a pointer to struct proc * which would
be set to the new process (if any). procdesc and racct manipulation also
used said pointer.

However, the process could have exited prior to do_fork return and be
automatically reaped, thus making this a use-after-free.

Fix the problem by letting callers indicate whether they want the pid or
the struct proc, return the process in stopped state for the latter case.

Reviewed by:	kib
2016-02-04 04:25:30 +00:00
Mateusz Guzik
33fd9b9a2b fork: pass arguments to fork1 in a dedicated structure
Suggested by:	kib
2016-02-04 04:22:18 +00:00
Gleb Smirnoff
e60b2fcbeb Redo r292484. Embed task(9) into zone, so that uz_maxaction is called
in a context that can sleep, allowing consumers of the KPI to run their
drain routines without any extra measures.

Discussed with:	jtl
2016-02-03 23:30:17 +00:00
Gleb Smirnoff
9542ea7b80 Move uma_dbg_alloc() and uma_dbg_free() into uma_core.c, which allows
to make uma_dbg.h not depend on uma_int.h, which allows to uninclude
uma_int.h from the mbuf(9) allocator.
2016-02-03 22:02:36 +00:00
Alfred Perlstein
7325dfbb59 Increase max allowed backlog for listen sockets
from short to int.

PR: 203922
Submitted by: White Knight <white_knight@2ch.net>
MFC After: 4 weeks
2016-02-02 05:57:59 +00:00
Gleb Smirnoff
8ec07310fa These files were getting sys/malloc.h and vm/uma.h with header pollution
via sys/mbuf.h
2016-02-01 17:41:21 +00:00
Andrew Turner
f9c2ec64be Fix the logic in the ddb command 'show ktr /a'. Prior to r118269 it would
print until cncheckc returned a non -1, i.e. a character had been entered.
After this change it would print only if cncheckc returned a character.
As this was before each call to db_mach_vtrace the normal outcome was
nothing was printed.

With this change 'show ktr /a' will now keep printing until the user stops
the command with a key press, or there is no more entries to print.
2016-01-31 17:32:20 +00:00
Eric van Gyzen
0e3d6ed44e kqueue EVFILT_PROC: avoid collision between NOTE_CHILD and NOTE_EXIT
NOTE_CHILD and NOTE_EXIT return something in kevent.data: the parent
pid (ppid) for NOTE_CHILD and the exit status for NOTE_EXIT.
Do not let the two events be combined, since one would overwrite
the other's data.

PR:		180385
Submitted by:	David A. Bright <david_a_bright@dell.com>
Reviewed by:	jhb
MFC after:	1 month
Sponsored by:	Dell Inc.
Differential Revision:	https://reviews.freebsd.org/D4900
2016-01-28 20:24:15 +00:00
Kirk McKusick
d2a28cb080 The bread() function was inconsistent about whether it would return
a buffer pointer in the event of an error (for some errors it would
return a buffer pointer and for other errors it would not return a
buffer pointer). The cluster_read() function was similarly inconsistent.

Clients of these functions were inconsistent in handling errors.
Some would assume that no buffer was returned after an error and
would thus lose buffers under certain error conditions. Others would
assume that brelse() should always be called after an error and
would thus panic the system under certain error conditions.

To correct both of these problems with minimal code churn, bread()
and cluster_write() now always free the buffer when returning an
error thus ensuring that buffers will never be lost. The brelse()
routine checks for being passed a NULL buffer pointer and silently
returns to avoid panics. Thus both approaches to handling error
returns from bread() and cluster_read() will work correctly.

Future code should be written assuming that bread() and cluster_read()
will never return a buffer with an error, so should not attempt to
brelse() the buffer when an error is returned.

Reviewed by: kib
2016-01-27 21:23:01 +00:00
Mateusz Guzik
4dd3a21fb3 ktrace: tidy up ktrstruct
- minor style fixes
- avoid doing strlen twice [1]

PR:		206648
Submitted by:	C Turt <ecturt gmail.com> (original version) [1]
2016-01-27 19:55:02 +00:00
Justin Hibbits
2dd1bdf183 Convert rman to use rman_res_t instead of u_long
Summary:
Migrate to using the semi-opaque type rman_res_t to specify rman resources.  For
now, this is still compatible with u_long.

This is step one in migrating rman to use uintmax_t for resources instead of
u_long.

Going forward, this could feasibly be used to specify architecture-specific
definitions of resource ranges, rather than baking a specific integer type into
the API.

This change has been broken out to facilitate MFC'ing drivers back to 10 without
breaking ABI.

Reviewed By: jhb
Sponsored by:	Alex Perez/Inertial Computing
Differential Revision: https://reviews.freebsd.org/D5075
2016-01-27 02:23:54 +00:00
John Baldwin
0dd6c0352b Various style fixes.
- Wrap long lines.
- Fix indentation.
- Remove excessive parens.
- Whitespace fixes in struct definitions.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D5025
2016-01-26 21:24:49 +00:00
Konstantin Belousov
2cd5358a66 Don't clear the software flow control flag before draining for last
close or assert the bug that it is clear when leaving.

Remove an unrelated rotted comment that was attached to the buggy
clearing.

Since draining is not done in more cases, flushing is needed in more
cases, so start fixing flushing:
- do a full flush in ttydisc_close().  State what POSIX requires more
  clearly.  This was missing ttydevsw_pktnotify() calls to tell the
  devsw layer to flush.  Hardware tty drivers don't actually flush
  since they don't understand this API.
- fix 2 missing wakeups in tty_flush().  Most of the wakeups here are
  unnecessary for last close.  But ttydisc_close() did one of the
  missing ones.

This flow control bug ameliorated the design bug of requiring
potentially unbounded waits in draining.  Software flow control is the
easiest way to get an unbounded wait, and a long wait is sometimes
actually useful.  Users can type the xoff character on the receiver
and (if ixon is set on the sender) expect the output to be held until
the user is ready for more.

Hardware flow control can also give the unbounded wait, and this bug
didn't affect hardware flow control.  Unbounded waits from hardware
flow control take a more unusual configuration.  E.g., a terminal
program that controls the modem status lines, or unplugging the cable
in a configuration where this doesn't break the connection.

The design bug is still ameliorated by a newer bug in draining for
last close -- the 1 second timeout.  E.g., if the user types the
xoff character and the sender reaches last close, then output is
not resumed and the wait times out after just 1 second.  This is
broken, but preferable to an unbounded wait.  Before this change,
the output was resumed immediately and usually completed.

Submitted by:	bde
MFC after:	2 weeks
2016-01-26 14:46:39 +00:00
Edward Tomasz Napierala
ed81020097 Fix the way RCTL handles rules' rrl_exceeded on credenials change.
Because of what this variable does, it was probably harmless - but
still incorrect.

MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-01-26 11:28:55 +00:00
Konstantin Belousov
88d74d64d7 Restore flushing of output for revoke(2) again. Document revoke()'s
intended behaviour in its man page.  Simplify tty_drain() to match.
Don't call ttydevsw methods in tty_flush() if the device is gone
since we now sometimes call it then.

The flushing was supposed to be implemented by passing the FNONBLOCK
flag to VOP_CLOSE() for revoke().  The tty driver is one of the few
that can block in close and was one of the fewer that knew about this.

This almost worked in FreeBSD-1 and similarly in Net/2.  These
versions only almost worked because there was and is considerable
confusion between IO_NDELAY and FNONBLOCK (aka O_NONBLOCK).  IO_NDELAY
is only valid for VOP_READ() and VOP_WRITE().  For other VOPs it has
the same value as O_SHLOCK.  But since vfs_subr.c and tty.c
consistently used the wrong flag and the O_SHLOCK flag is rarely set,
this mostly worked.  It also gave the feature than applications could
get the non-blocking close by abusing O_SHLOCK.

This was first broken then fixed in 1995.  I changed only the tty
driver to use FNONBLOCK, as a hack to get non-blocking via the normal
flag FNONBLOCK for last closes.  I didn't know about revoke()'s use
of IO_NDELAY or change it to be consistent, so revoke() was broken.
Then I changed revoke() to match.

This was next broken in 1997 then fixed in 1998.  Importing Lite2 made
the flags inconsistent again by undoing the fix only in vfs_subr.c.

This was next broken in 2008 by replacing everything in tty.c and not
checking any flags in last close.  Other bugs in draining limited the
resulting unbounded waits to drain in some cases.

It is now possible to fix this better using the new FREVOKE flag.
Just restore flushing for revoke() for now.  Don't restore or undo any
hacks for ordinary last closes yet.  But remove dead code in the
1-second relative timeout (r272789).  This did extra work to extend
the buggy draining for revoke() for as long as possible.  The 1-second
timeout made this not very long by usually flushing after 1 second.

Submitted by:	bde
MFC after:	2 weeks
2016-01-26 07:57:44 +00:00
Mark Johnston
87b87bb853 Evaluate the sysctl_running fail point before taking the sysctl lock.
The fail point handler may sleep, but this is not permitted while holding a
rm read lock.

MFC after:	1 week
Sponsored by:	EMC / Isilon Storage Division
2016-01-26 01:15:18 +00:00
Marius Strobl
57169cea64 - Make the code consistent with itself style-wise and bring it closer
to style(9).
- Mark unused arguments as such.
- Make the ttystates table const.
2016-01-25 22:58:06 +00:00
Konstantin Belousov
2e77021e4c Don't allow opening the callout device when the callin device is already
open (in disguise as the console device).  The only allowed combination
was supposed to be the callin device with the console.

Fix the assertion in ttydev_close() that was meant to detect this (it
only detected all 3 devices being open).  Assert this in ttydev_open()
too.

Submitted by:	bde
MFC after:	2 weeks
2016-01-25 16:47:20 +00:00
Konstantin Belousov
3593a18a91 Fix the %b flags string for ddb. All bits above the 5th
(TF_OPENED_CONS) were broken in r188147 by adding TF_OPENED_CONS
without updating the string.  It was especially confusing to display
OPENED_CONS as GONE and BYPASS as ZOMBIE.  2 flags at the end were
not updated in r188487.

Don't print an extra 0x prefix for %p in a ddb command.  In the rest
of the kernel there are more than 6000 lines with %p and only about
40 with this bug.

Print a non-extra 0x prefix for %b in a ddb command.  In the rest
of the kernel, there are approx. 180 lines with %b and 2/3 of them
have this bug.

Submitted by:	bde
MFC after:	2 weeks
2016-01-25 15:37:01 +00:00
Alexander V. Chernikov
61eee0e202 MFP r287070,r287073: split radix implementation and route table structure.
There are number of radix consumers in kernel land (pf,ipfw,nfs,route)
  with different requirements. In fact, first 3 don't have _any_ requirements
  and first 2 does not use radix locking. On the other hand, routing
  structure do have these requirements (rnh_gen, multipath, custom
  to-be-added control plane functions, different locking).
Additionally, radix should not known anything about its consumers internals.

So, radix code now uses tiny 'struct radix_head' structure along with
  internal 'struct radix_mask_head' instead of 'struct radix_node_head'.
  Existing consumers still uses the same 'struct radix_node_head' with
  slight modifications: they need to pass pointer to (embedded)
  'struct radix_head' to all radix callbacks.

Routing code now uses new 'struct rib_head' with different locking macro:
  RADIX_NODE_HEAD prefix was renamed to RIB_ (which stands for routing
  information base).

New net/route_var.h header was added to hold routing subsystem internal
  data. 'struct rib_head' was placed there. 'struct rtentry' will also
  be moved there soon.
2016-01-25 06:33:15 +00:00
Konstantin Belousov
fa28b6e7da In tty_dealloc(), clear the queues. See the comment for a scenario
which explains why ttydev_leave() cleanup might not happen.

Submitted by:	bde
MFC after:	3 weeks
2016-01-22 20:38:46 +00:00
Konstantin Belousov
6adf19481c The struct file f_advice member is overlaid with the devfs f_cdevpriv
data.  If vnode bypass for devfs file failed, vn_read/vn_write are
called and might try to dereference f_advice.  Limit the accesses to
f_advice to VREG vnodes only, which is the type ensured by
posix_fadvise().

The f_advice for regular files is protected by mtxpool lock.  Recheck
that f_advice is not NULL after lock is taken.

Reported and tested by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
2016-01-22 20:35:20 +00:00
Gleb Smirnoff
33a2a37b86 - Separate sendfile(2) implementation from uipc_syscalls.c into
separate file.  Claim my copyright.
- Provide more comments, better function and structure names.
- Sort out unneeded includes from resulting two files.

No functional changes.
2016-01-22 02:23:18 +00:00
John Baldwin
39314b7d99 AIO daemons have always been kernel processes to facilitate switching to
user VM spaces while servicing jobs.  Update various comments and data
structures that refer to AIO daemons as threads to refer to processes
instead.

Reviewed by:	kib
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D4999
2016-01-21 02:20:38 +00:00
John Baldwin
4429f0e2e3 Remove unused variables for socket AIO.
In r55943, a per-process queue of pending socket AIO requests (requests
waiting for the socket to become ready) was added so that they could be
cancelled during process rundown.  In r154765, the rundown code was
changed to handle jobs in this state (JOBST_JOBQSOCK) directly removing
the need for the extra queue.  However, the per-process queue head and
global lock were never removed.

Reviewed by:	kib
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D4997
2016-01-21 01:28:31 +00:00
Mateusz Guzik
b0632ab432 cache: minor changes
1. vhold and zap immediately instead of postponing few lines later
2. increment numneg after new entry is added

No functional changes.

No objections:		kib
2016-01-21 01:09:39 +00:00
Mateusz Guzik
baa2bcf572 cache: perform . lockup without the namecache lock
Reviewed by:	kib
2016-01-21 01:07:05 +00:00
Mateusz Guzik
db709ecbcc cache: provide a helper for computing the hash
Reviewed by:	kib
2016-01-21 01:05:41 +00:00
Mateusz Guzik
76583fa294 cache: use counter(9) API to maintain statistics
Previously the code would just increment statistics while only holding a
shared lock, in effect losing updates.

Separate tracking for nchstats is removed as values can be obtained from
existing counters. Note that some fields are updated by external
consumers and are left unfixed. This should not be a serious issue as
this structure looks quite obsolete.

No strong objections: kib
2016-01-21 01:04:03 +00:00
Mateusz Guzik
17182b1351 session: avoid proctree lock on proc exit when possible
We can get away with the common case with only proc lock held.

Reviewed by:	kib
2016-01-20 23:33:58 +00:00
Mateusz Guzik
6760182c5d session: tidy up fixjobc
This stops abusing the 'p' pointer for iteration over children processes
and gets rid of useless locking around PRS_ZOMBIE check.

Suggested by:	kib
2016-01-20 23:22:36 +00:00
Marius Strobl
9750d9e5b6 Fix tty_drain() and, thus, TIOCDRAIN of the current tty(4) incarnation
to actually wait until the TX FIFOs of UARTs have be drained before
returning. This is done by bringing the equivalent of the TS_BUSY flag
found in the previous implementation back in an ABI-preserving way.
Reported and tested by: Patrick Powell

Most likely, drivers for USB-serial-adapters likewise incorporating
TX FIFOs as well as other terminal devices that buffer output in some
form should also provide implementations of tsw_busy.

MFC after:	3 days
2016-01-19 23:34:27 +00:00
John Baldwin
8a4dc40ff4 Various cleanups to the main function for AIO kernel processes:
- Pull the vmspace logic out into helper functions and reduce duplication.
  Operations on the vmspace are all isolated to vm_map.c, but it now exports
  a new 'vmspace_switch_aio' for use by AIO kernel processes.
- When an AIO kernel process wants to exit, break out of the main loop and
  perform cleanup after the loop end.  This reduces a lot of indentation and
  allows cleanup to more closely mirror setup actions before the loop starts.
- Convert a DIAGNOSTIC to KASSERT().
- Replace mycp with more typical 'p'.

Reviewed by:	kib
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D4990
2016-01-19 21:37:51 +00:00
John Baldwin
f2e7f06a0d Don't create a dedicated session for each AIO kernel process.
This code dates back to the initial AIO support and the commit log does
not explain why it is needed.  However, I cannot find anything in the
AIO code or the various file methods (fo_read/fo_write) that would change
behavior due to using a private session instead of proc0's session.

Reviewed by:	kib
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D4988
2016-01-19 20:46:30 +00:00
Mark Johnston
793c381706 Add vrefl(), a locked variant of vref(9).
This API has no in-tree consumers at the moment but is useful to at least
one out-of-tree consumer, and naturally complements existing vnode refcount
functions (vholdl(9), vdropl(9)).

Obtained from:	kib (sys/ portion)
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D4947
Differential Revision:	https://reviews.freebsd.org/D4953
2016-01-18 22:21:46 +00:00
Konstantin Belousov
ce958bdefe When cleaning up from failed adv locking and checking for write, do
not call VOP_CLOSE() manually.  Instead, delegate the close to
fo_close() performed as part of the fdrop() on the file failed to
open.  For this, finish constructing file on error, in particular, set
f_vnode and f_ops.

Forcibly resetting f_ops to badfileops disabled additional cleanups
performed by fo_close() for some file types, in this case it was noted
that cdevpriv data was corrupted.  Since fo_close() call must be
enabled for some file types, it makes more sense to enable it for all
files opened through vn_open_cred().

In collaboration with:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-01-17 08:40:51 +00:00
John Baldwin
6c8fd02283 Remove aiod_timeout.
It hasn't been used since the AIO code was made MPSAFE 10 years ago.

Reviewed by:	kib
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D4946
2016-01-14 21:28:56 +00:00
John Baldwin
c85650cacc Rename aiod_bio taskqueue to aiod_kick.
This taskqueue is not used to handle bio requests.  It is only used to
run aio_kick_nowait() to spin up new aio daemon processes.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D4904
2016-01-14 20:51:48 +00:00
Gleb Smirnoff
c8358c6e0d Call crextend() before copying old credentials to the new credentials
and replace crcopysafe by crcopy as crcopysafe is is not intended to be
safe in a threaded environment, it drops PROC_LOCK() in while() that
can lead to unexpected results, such as overwrite kernel memory.

In my POV crcopysafe() needs special attention. For now I do not see
any problems with this function, but who knows.

Submitted by:	dchagin
Found by:	trinity
Security:	SA-16:04.linux
2016-01-14 10:16:25 +00:00
Colin Percival
c0ada0377a Fix a bug introduced in r291716:
"The problem with the approach taken both in _bus_dmamap_load_pages and
bus_dmamap_load_ma_triv is that they split the request buffer into
arbitrary chunks based on page boundaries, creating segments that no
longer have a size that's a multiple of the sector size. This breaks
drivers like blkfront (and probably other stuff)." [1]

This was most easily triggered by running `fsck /` on a system running
in Xen (e.g. Amazon EC2) but also showed up via growfs(8) and probably
many other userland tools which access the disk directly.

Patch by:	royger [1]
"Thinks this should be fine" by:	ken
2016-01-11 20:38:39 +00:00
Dmitry Chagin
038c720553 Implement vsyscall hack. Prior to 2.13 glibc uses vsyscall
instead of vdso. An upcoming linux_base-c6 needs it.

Differential Revision:  https://reviews.freebsd.org/D1090

Reviewed by:	kib, trasz
MFC after:	1 week
2016-01-09 20:18:53 +00:00
Mark Johnston
11f9ca696e Prevent cv_waiters wraparound.
r282971 attempted to fix this problem by decrementing cv_waiters after
waking up from sleeping on a condition variable, but this can result in
a use-after-free if the CV is freed before all woken threads have had a
chance to run. Instead, avoid incrementing cv_waiters past INT_MAX, and
have cv_signal() explicitly check for sleeping threads once cv_waiters has
reached this bound.

Reviewed by:	jhb
MFC after:	2 weeks
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D4822
2016-01-09 01:56:46 +00:00
Gleb Smirnoff
2bab0c5535 New sendfile(2) syscall. A joint effort of NGINX and Netflix from 2013 and
up to now.

The new sendfile is the code that Netflix uses to send their multiple tens
of gigabits of data per second. The new implementation features asynchronous
I/O, when I/O operations are launched, but not awaited to be complete. An
explanation of why such behavior is beneficial compared to old one is
going to be too long for a commit message, so we will skip it here.

Additional features of new syscall are extra flags, which provide an
application more control over data sent. The SF_NOCACHE flag tells
kernel that data shouldn't be cached after it was sent. The SF_READAHEAD()
macro allows to specify readahead size in pages.

The new syscalls is a drop in replacement. No modifications are required
to applications. One can take nginx binary for stable/10 and run it
successfully on head. Although SF_NODISKIO lost its original sense, as now
sendfile doesn't block, and now means something completely different (tm),
using the new sendfile the old way is absolutely safe.

Celebrates:	Netflix global launch!
Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
Relnotes:	yes
2016-01-08 20:34:57 +00:00
Gleb Smirnoff
829fae9063 Make it possible for sbappend() to preserve M_NOTREADY on mbufs, just like
sbappendstream() does. Although, M_NOTREADY may appear only on SOCK_STREAM
sockets, due to sendfile(2) supporting only the latter, there is a corner
case of AF_UNIX/SOCK_STREAM socket, that still uses records for the sake
of control data, albeit being stream socket.

Provide private version of m_clrprotoflags(), which understands PRUS_NOTREADY,
similar to m_demote().
2016-01-08 19:03:20 +00:00
Gleb Smirnoff
ffd1c319a9 Revert r293405: it breaks socket buffer INVARIANTS when sending control
data over local sockets.
2016-01-08 17:27:23 +00:00
Gleb Smirnoff
2f2edf0a08 For SOCK_STREAM socket use sbappendstream() instead of sbappend(). 2016-01-08 01:16:03 +00:00
Konstantin Belousov
0de1455462 Convert tty common code to use make_dev_s().
Tty.c was untypical in that it handled the si_drv1 issue consistently
and correctly, by always checking for si_drv1 being non-NULL and
sleeping if NULL.  The removed code also illustrated unneeded
complications in drivers which are eliminated by the use of new KPI.

Reviewed by:	hps, jhb
Discussed with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
Differential revision:	https://reviews.freebsd.org/D4746
2016-01-07 20:15:09 +00:00
Konstantin Belousov
48ce5d4cac Provide yet another KPI for cdev creation, make_dev_s(9).
Immediate problem fixed by the new KPI is the long-standing race
between device creation and assignments to cdev->si_drv1 and
cdev->si_drv2, which allows the window where cdevsw methods might be
called with si_drv1,2 fields not yet set.  Devices typically checked
for NULL and returned spurious errors to usermode, and often left some
methods unchecked.

The new function interface is designed to be extensible, which should
allow to add more features to make_dev_s(9) without inventing yet
another name for function to create devices, while maintaining KPI and
even KBI backward-compatibility.

Reviewed by:	hps, jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
Differential revision:	https://reviews.freebsd.org/D4746
2016-01-07 20:08:02 +00:00
Mateusz Guzik
6b53d1bc6f cache: ansify functions and fix some style issues
No functional changes.
2016-01-07 02:04:17 +00:00
Konstantin Belousov
1041e09089 Two fixes for excessive iterations after r292326.
Advance the logical block number to the lblkno of the found block plus
one, instead of incrementing the block number which was used for
lookup.  This change skips sparcely populated buffer ranges, similar
to r292325, instead of doing useless lookups.

Do not restart the bnoreuselist() from the start of the range if
buffer lock cannot be obtained without sleep.  Only retry lookup and
lock for the same queue and same logical block number.

Reported by:	benno
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2016-01-05 14:48:40 +00:00
Ian Lepore
69dcb7e771 Make the 'env' directive described in config(5) work on all architectures,
providing compiled-in static environment data that is used instead of any
data passed in from a boot loader.

Previously 'env' worked only on i386 and arm xscale systems, because it
required the MD startup code to examine the global envmode variable and
decide whether to use static_env or an environment obtained from the boot
loader, and set the global kern_envp accordingly.  Most startup code wasn't
doing so.  Making things even more complex, some mips startup code uses an
alternate scheme that involves calling init_static_kenv() to pass an empty
buffer and its size, then uses a series of kern_setenv() calls to populate
that buffer.

Now all MD startup code calls init_static_kenv(), and that routine provides
a single point where envmode is checked and the decision is made whether to
use the compiled-in static_kenv or the values provided by the MD code.

The routine also continues to serve its original purpose for mips; if a
non-zero buffer size is passed the routine installs the empty buffer ready
to accept kern_setenv() values.  Now if the size is zero, the provided buffer
full of existing env data is installed.  A NULL pointer can be passed if the
boot loader provides no env data; this allows the static env to be installed
if envmode is set to do so.

Most of the work here is a near-mechanical change to call the init function
instead of directly setting kern_envp.  A notable exception is in xen/pv.c;
that code was originally installing a buffer full of preformatted env data
along with its non-zero size (like mips code does), which would have allowed
kern_setenv() calls to wipe out the preformatted data.  Now it passes a zero
for the size so that the buffer of data it installs is treated as
non-writeable.
2016-01-02 02:53:48 +00:00
Marius Strobl
45005f7907 - (Ab)use udivx for dividing the u_int pc_cpuid when implementing
CPU_ISSET(), CPU_SET etc. in sparc64 asm. This approach has the
  benefit of not clobbering %y, allowing to revert r222827 and
  partially r222828.
- In r222828, CATR() already was changed to use the equivalent of
  PCPU_GET(cpuid) instead of the MD module ID for KTR_CPU, so
  belatedly also catch up with the C side of ktr(9). Originally,
  in r203838 CATR() was moved away from directly reading the
  module ID or equivalent as that became impractical with other
  CPU types than USI/II supported. With r222828 in place, per-CPU
  data generally is set up soon enough, though, that employing
  PCPU things in ktr(9) also for use during early stages works.
- Unfortunately, an exception to the latter is the ktr(9) use
  in pmap_bootstrap(), which actually is run so early that even
  checking for bootverbose being set via the loader doesn't work.
  Consequently, replace the ktr(9) use in pmap_bootstrap() with
  OF_printf(9) and put it under #ifdef DIAGNOSTIC instead.

MFC after:	3 days
2015-12-30 13:49:20 +00:00
John Baldwin
5fcfab6e32 Add ptrace(2) reporting for LWP events.
Add two new LWPINFO flags: PL_FLAG_BORN and PL_FLAG_EXITED for reporting
thread creation and destruction. Newly created threads will stop to report
PL_FLAG_BORN before returning to userland and exiting threads will stop to
report PL_FLAG_EXIT before exiting completely. Both of these events are
only enabled and reported if PT_LWP_EVENTS is enabled on a process.
2015-12-29 23:25:26 +00:00
John Baldwin
d1e7a4a553 Call kern_thr_exit() instead of duplicating it.
This code is missing the racct_subr() call from kern_thr_exit() and would
require further code duplication in future changes.

Reviewed by:	kib
MFC after:	1 week
2015-12-29 23:16:20 +00:00
Dmitry Chagin
3e18d701de Verify that tv_sec value specified in settimeofday() and clock_settime()
(CLOCK_REALTIME case) system calls is non negative.
This commit hides a kernel panic in atrtc_settime() as the clock_ts_to_ct()
does not properly convert negative tv_sec.

ps. in my opinion clock_ts_to_ct() should be rewritten to properly handle
negative tv_sec values.

Differential Revision:	https://reviews.freebsd.org/D4714
Reviewed by:		kib

MFC after:	1 week
2015-12-27 15:37:07 +00:00
Konstantin Belousov
1899507706 Do not substitute interpeter if the brand interpreter path is
different from the interpreter path requested by the binary.

Before this change, it is impossible to activate non-default
interpreter for 32bit image on amd64, when /libexec/ld-elf32.so.1 file
exists.

Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-12-26 15:40:12 +00:00
Jonathan T. Looney
d3ee0a15a4 Only allow one PT_INTERP ELF program header. This also fixes a potential
memory leak for interp_buf.

Differential Revision:	https://reviews.freebsd.org/D4692
Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	Juniper Networks
2015-12-24 00:58:11 +00:00
Enji Cooper
853a17ad6f Fix r292640
vim overzealously removed some trailing `+' and I didn't check the
diff

MFC after: 1 week
X-MFC with: r292640
Pointyhat to: ngie
Sponsored by: EMC / Isilon Storage Division
2015-12-23 03:34:43 +00:00
Enji Cooper
905b145f0a Clean up trailing whitespace; no functional change
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
2015-12-23 03:29:37 +00:00
Enji Cooper
20b4c1d2cb Fold lim_shared into lim_copy to mute a -Wunused compiler warning from
clang when the kernel is compiled without INVARIANTS

Differential Revision: https://reviews.freebsd.org/D4683
Reviewed by: kib, jhb
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
2015-12-22 21:07:33 +00:00
Konstantin Belousov
d943fa35ad If we annoy user with the terminal output due to failed load of
interpreter, also show the actual error code instead of some
interpretation.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-12-22 20:12:52 +00:00
Jonathan T. Looney
54503a13d8 Add a safety net to reclaim mbufs when one of the mbuf zones become
exhausted.

It is possible for a bug in the code (or, theoretically, even unusual
network conditions) to exhaust all possible mbufs or mbuf clusters.
When this occurs, things can grind to a halt fairly quickly. However,
we currently do not call mb_reclaim() unless the entire system is
experiencing a low-memory condition.

While it is best to try to prevent exhaustion of one of the mbuf zones,
it would also be useful to have a mechanism to attempt to recover from
these situations by freeing "expendable" mbufs.

This patch makes two changes:

a) The patch adds a generic API to the UMA zone allocator to set a
function that should be called when an allocation fails because the
zone limit has been reached. Because of the way this function can be
called, it really should do minimal work.

b) The patch uses this API to try to free mbufs when an allocation
fails from one of the mbuf zones because the zone limit has been
reached. The function schedules a callout to run mb_reclaim().

Differential Revision:	https://reviews.freebsd.org/D3864
Reviewed by:	gnn
Comments by:	rrs, glebius
MFC after:	2 weeks
Sponsored by:	Juniper Networks
2015-12-20 02:05:33 +00:00
Mateusz Guzik
aa0241d623 proc: fix a race which could result in dereference of bad p_pgrp pointer on fork
During fork p_starcopy - p_endcopy area of a process is populated with bcopy
with only proc lock held. Another forking thread can find such a process and
proceed to access p_pgrp included in said area.

Fix the problem by moving the field outside. It is being properly assigned
later.

Reviewed by:	kib
Diagnosed by:	kib
Tested by:	Fabian Keil <freebsd-listen fabiankeil.de>
MFC after:	10 days
2015-12-18 16:33:15 +00:00
Adrian Chadd
2b3ad18853 [intrng] Migrate the intrng code from sys/arm/arm to sys/kern/subr_intr.c.
The ci20 port (by kan@) is going to reuse almost all of the intrng code
since the SoC in question looks suspiciously like someone took an ARM
SoC design and replaced the ARM core with a MIPS core.

* migrate out the code;
* rename ARM_ -> INTR_;
* rename arm_ -> intr_;
* move the interrupt flush routine from intr.c / intrng.c into
  arm/machdep_intr.c - removing the code duplication and removing
  the ARM specific bits from here.

Thanks to the Star Wars: The Force Awakens premiere line for allowing
me a couple hours of quiet time to finish the universe builds.

Tested:

* make universe

TODO:

* The structure definitions in subr_intr.c still includes machine/intr.h
  which requires one duplicates all of the intrng definitions in
  the platform code (which kan has done, and I think we don't have to.)

  Instead I should break out the generic things (function declarations,
  common intr structures, etc) into a separate header.

* Kan has requested I make the PIC based IPI stuff optional.
2015-12-18 05:43:59 +00:00
Mark Johnston
8ff6d9dd22 Support an arbitrary number of arguments to DTrace syscall probes.
Rather than pushing all eight possible arguments into dtrace_probe()'s
stack frame, make the syscall_args struct for the current syscall available
via the current thread. Using a custom getargval method for the systrace
provider, this allows any syscall argument to be fetched, even in kernels
that have modified the maximum number of system call arguments.

Sponsored by:	EMC / Isilon Storage Division
2015-12-17 00:00:27 +00:00
Mark Johnston
3616095801 Fix style issues around existing SDT probes.
- Use SDT_PROBE<N>() instead of SDT_PROBE(). This has no functional effect
  at the moment, but will be needed for some future changes.
- Don't hardcode the module component of the probe identifier. This is
  set automatically by the SDT framework.

MFC after:	1 week
2015-12-16 23:39:27 +00:00
Gleb Smirnoff
b0cd20172d A change to KPI of vm_pager_get_pages() and underlying VOP_GETPAGES().
o With new KPI consumers can request contiguous ranges of pages, and
  unlike before, all pages will be kept busied on return, like it was
  done before with the 'reqpage' only. Now the reqpage goes away. With
  new interface it is easier to implement code protected from race
  conditions.

  Such arrayed requests for now should be preceeded by a call to
  vm_pager_haspage() to make sure that request is possible. This
  could be improved later, making vm_pager_haspage() obsolete.

  Strenghtening the promises on the business of the array of pages
  allows us to remove such hacks as swp_pager_free_nrpage() and
  vm_pager_free_nonreq().

o New KPI accepts two integer pointers that may optionally point at
  values for read ahead and read behind, that a pager may do, if it
  can. These pages are completely owned by pager, and not controlled
  by the caller.

  This shifts the UFS-specific readahead logic from vm_fault.c, which
  should be file system agnostic, into vnode_pager.c. It also removes
  one VOP_BMAP() request per hard fault.

Discussed with:	kib, alc, jeff, scottl
Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2015-12-16 21:30:45 +00:00
Konstantin Belousov
106ebb761a Optimize vop_stdadvise(POSIX_FADV_DONTNEED). Instead of looking up a
buffer for each block number in the range with gbincore(), look up the
next instantiated buffer with the logical block number which is
greater or equal to the next lblkno.  This significantly speeds up the
iteration for sparce-populated range.

Move the iteration into new helper bnoreuselist(), which is structured
similarly to flushbuflist().

Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
2015-12-16 08:48:37 +00:00
Konstantin Belousov
8549b4b9fe Simplify the loop step in the flushbuflist() and make it independed on
the type stability of the buffers memory.  Instead of memoizing
pointer to the next buffer and validating it, remember the next
logical block number in the bo list and re-lookup.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2015-12-16 08:39:51 +00:00
Adrian Chadd
45130aa12e Don't call wakeup if we're just returning reserved space; just
return the reservation and wait for more space to appear.

Submitted by:	jeff
Reviewed by:	kib
2015-12-16 00:13:16 +00:00
Jamie Gritton
6ab6058ec4 Fix jail name checking that disallowed anything that starts with '0'.
The intention was to just limit leading zeroes on numeric names.  That
check is now improved to also catch the leading spaces and '+' that
strtoul can pass through.

PR:		204897
MFC after:	3 days
2015-12-15 17:25:00 +00:00
Edward Tomasz Napierala
af1a7b2526 Tweak comments.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2015-12-13 11:30:36 +00:00
Edward Tomasz Napierala
a8e0740e85 Actually make the 'amount' argument to racct_adjust_resource() signed,
as it was always supposed to be.

MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2015-12-13 11:21:13 +00:00