14900 Commits

Author SHA1 Message Date
Brooks Davis
a72c64b0b6 Generate syscall tables and update pipe() implementation after r302094.
Mark the pipe() system call as COMPAT10.

As of r302092 libc uses pipe2() with a zero flags value instead of pipe().

Approved by:	re (gjb)
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D6816
2016-06-22 21:18:19 +00:00
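This is a minimal userland sketch of the equivalence the libc change relies on. `pipe2()` with a zero flags value creates the same kind of descriptor pair as `pipe()`, with neither close-on-exec nor non-blocking mode set. The helper names `make_plain_pipe` and `pipe_has_default_flags` are hypothetical. This is not the libc or kernel code.

```c
/* _GNU_SOURCE exposes pipe2() in glibc's <unistd.h>; FreeBSD 10.0 and
 * later declare it by default. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: create a pipe the way libc's pipe() is described
 * as doing now, by calling pipe2() with a zero flags value. */
int make_plain_pipe(int fds[2])
{
    return pipe2(fds, 0);
}

/* Hypothetical check: returns 1 if neither end has FD_CLOEXEC or
 * O_NONBLOCK set, i.e. the descriptors look like those from pipe(). */
int pipe_has_default_flags(const int fds[2])
{
    for (int i = 0; i < 2; i++) {
        if (fcntl(fds[i], F_GETFD) & FD_CLOEXEC)
            return 0;
        if (fcntl(fds[i], F_GETFL) & O_NONBLOCK)
            return 0;
    }
    return 1;
}
```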
Brooks Davis
e16e64098c Mark the pipe() system call as COMPAT10.
As of r302092 libc uses pipe2() with a zero flags value instead of pipe().

Commit with regenerated files and implementation to follow.

Approved by:	re (gjb)
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D6816
2016-06-22 21:15:59 +00:00
Brooks Davis
e52e02ba24 Add support for COMPAT10 keywords in syscalls.master.
Approved by:	re (gjb)
Sponsored by:	DARPA, AFRL
2016-06-22 21:12:53 +00:00
John Baldwin
b1012d8036 Account for AIO socket operations in thread/process resource usage.
File and disk-backed I/O requests store counts of read/written disk
blocks in each AIO job so that they can be charged to the thread that
completes an AIO request via aio_return() or aio_waitcomplete().  This
change extends AIO jobs to store counts of received/sent messages and
updates socket backends to set these counts accordingly.  Note that
the socket backends are careful to only charge a single message for
each AIO request even though a single request on a blocking socket might
invoke sosend or soreceive multiple times.  This is to mimic the
resource accounting of synchronous read/write.

Adjust the UNIX socketpair AIO test to verify that the message resource
usage counts update accordingly for aio_read and aio_write.

Approved by:	re (hrs)
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D6911
2016-06-21 22:19:06 +00:00
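This is a simplified model of the accounting rule above. It assumes a hypothetical `struct aio_job_usage` in place of the real per-job fields. One request may need several `sosend`-style passes, yet only one message is recorded for it, which matches how a synchronous `write()` is charged. The kernel's charging of these counts at `aio_return()` time is not modelled.

```c
#include <stddef.h>

/* Hypothetical stand-in for the per-job message counters; the real
 * kaiocb fields are not reproduced here. */
struct aio_job_usage {
    long msgsnd;   /* messages recorded for sends */
    long msgrcv;   /* messages recorded for receives */
};

/* Simulate servicing one aio_write() request of 'len' bytes when each
 * sosend-like pass can move at most 'chunk' bytes (chunk must be > 0).
 * Several passes may be needed, but the job records a single message.
 * Returns the number of passes made. */
int service_aio_write(struct aio_job_usage *u, size_t len, size_t chunk)
{
    int calls = 0;
    size_t done = 0;

    while (done < len) {
        size_t n = len - done < chunk ? len - done : chunk;
        done += n;          /* pretend this pass sent n bytes */
        calls++;
    }
    u->msgsnd += 1;         /* one message per request, not per pass */
    return calls;
}
```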
Bjoern A. Zeeb
89856f7e2d Get closer to a VIMAGE network stack teardown from top to bottom rather
than removing the network interfaces first. This change is rather large
and convoluted as the ordering requirements cannot be separated.

Move the pfil(9) framework to SI_SUB_PROTO_PFIL, move Firewalls and
related modules to their own SI_SUB_PROTO_FIREWALL.
Move initialization of "physical" interfaces to SI_SUB_DRIVERS,
move virtual (cloned) interfaces to SI_SUB_PSEUDO.
Move Multicast to SI_SUB_PROTO_MC.

Re-work parts of multicast initialisation and teardown, not taking the
huge amount of memory into account if used as a module yet.

For interface teardown we try to do as many of them as we can on
SI_SUB_INIT_IF, but for some this makes no sense, e.g., when tunnelling
over a higher layer protocol such as IP. In that case the interface
has to be torn down along with (or before) the higher layer protocol.

Kernel hhooks need to go last on teardown as they may be used at various
higher layers and we cannot remove them before we have cleaned up the higher
layers.

For interface teardown there are multiple paths:
(a) a cloned interface is destroyed (inside a VIMAGE or in the base system),
(b) any interface is moved from a virtual network stack to a different
network stack ("vmove"), or (c) a virtual network stack is being shut down.
All code paths go through if_detach_internal() where we, depending on the
vmove flag or the vnet state, make a decision on how much to shut down;
in case we are destroying a VNET the individual protocol layers will
clean up their own parts, so we cannot do so again for each interface, as
we would end up with, e.g., double-frees, destroying locks twice or acquiring
already destroyed locks.
When calling into protocol cleanups we equally have to tell them
whether they need to detach upper layer protocols ("ulp") or not
(e.g., in6_ifdetach()).

Provide or enhance helper functions to do proper cleanup at a protocol
rather than at an interface level.

Approved by:		re (hrs)
Obtained from:		projects/vnet
Reviewed by:		gnn, jhb
Sponsored by:		The FreeBSD Foundation
MFC after:		2 weeks
Differential Revision:	https://reviews.freebsd.org/D6747
2016-06-21 13:48:49 +00:00
Konstantin Belousov
d9a503bec4 Fix typo. Note that atomic is still required even for interlocked case.
Sponsored by:	The FreeBSD Foundation
Approved by:	re (marius)
2016-06-20 15:45:50 +00:00
Mateusz Guzik
e896fb3bae vfs: ifdef out noop vop_* primitives on !DEBUG_VFS_LOCKS kernels
This removes calls to empty functions like vop_lock_{pre/post} from
common vfs routines.

Approved by:	re (gjb)
2016-06-17 19:41:30 +00:00
Konstantin Belousov
f8a75278dc Add a VFS interface to flush a specified number of free vnodes belonging
to mount points with the given filesystem type, specified by the mount
vfs_ops pointer.

Based on patch by:	mckusick
Reviewed by:	avg, mckusick
Tested by:	allanjude, madpilot
Sponsored by:	The FreeBSD Foundation
Approved by:	re (gjb)
2016-06-17 17:33:25 +00:00
Konstantin Belousov
5c2cf81845 Update comments for the MD functions managing contexts for new
threads, to make them less confusing and to use modern kernel terms.

Rename the functions to reflect current use of the functions, instead
of the historic KSE conventions:
  cpu_set_fork_handler -> cpu_fork_kthread_handler (for kthreads)
  cpu_set_upcall -> cpu_copy_thread (for forks)
  cpu_set_upcall_kse -> cpu_set_upcall (for new threads creation)

Reviewed by:	jhb (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (hrs)
Differential revision:	https://reviews.freebsd.org/D6731
2016-06-16 12:05:44 +00:00
Konstantin Belousov
bd07998e0e Remove XXX comments from kern_thread.c. In one case, there is no
reason for it in modern times.  In the other case, expand the comment
so that it makes a statement instead of expressing doubt.

Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (hrs)
X-Differential revision:	https://reviews.freebsd.org/D6731
2016-06-16 12:01:11 +00:00
Konstantin Belousov
13d2cd3b68 Remove code duplication.
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (hrs)
X-Differential revision:	https://reviews.freebsd.org/D6731
2016-06-16 11:58:46 +00:00
John Baldwin
fe0bdd1d2c Move backend-specific fields of kaiocb into a union.
This reduces the size of kaiocb slightly. I've also added some generic
fields that other backends can use in place of the BIO-specific fields.

Change the socket and Chelsio DDP backends to use 'backend3' instead of
abusing _aiocb_private.status directly. This confines the use of
_aiocb_private to the AIO internals in vfs_aio.c.

Reviewed by:	kib (earlier version)
Approved by:	re (gjb)
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D6547
2016-06-15 20:56:45 +00:00
Konstantin Belousov
9fdbfd3b6c Do not assume that we own the use reference on the covered vnode until
we set the MNTK_UNMOUNT flag on the mp.  Otherwise a parallel unmount which
wins the race with us could dereference the covered vnode, and we are
left with locked freed memory.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
Approved by:	re (gjb)
MFC after:	1 week
2016-06-15 15:56:03 +00:00
Jamie Gritton
932a6e432d Fix a vnode leak when giving a child jail a too-long path when
debug.disablefullpath=1.
2016-06-09 21:59:11 +00:00
Jamie Gritton
cf0313c679 Re-order some jail parameter reading to prevent a vnode leak.
2016-06-09 20:43:14 +00:00
Jamie Gritton
176ff3a066 Clean up some logic in jail error messages, replacing a missing test and
a redundant test with a single correct test.
2016-06-09 20:39:57 +00:00
Mariusz Zaborski
fb4cdc96a8 Define tunable instead of using CTLFLAG_RWTUN flag with kern.corefile.
The allproc_lock lock used in the sysctl_kern_corefile function is initialized
in the procinit function, which is called after sysctl values are set at boot.
That means that if we set kern.corefile at boot, we will be trying to use
a lock which is uninitialized, and the machine will crash.

If we define kern.corefile as a tunable instead of using CTLFLAG_RWTUN, we will
not call the sysctl_kern_corefile function and we will not use an uninitialized
lock. Once the machine has booted, we start using the function that depends on
the lock.

Reviewed by:	pjd
2016-06-09 20:23:30 +00:00
Conrad Meyer
8a3aeac27b Add DDB command "kldstat"
It prints much the same information as kldstat(8) without any arguments.

Suggested by:	jhibbits
Sponsored by:	EMC / Isilon Storage Division
2016-06-09 18:27:41 +00:00
Conrad Meyer
dd6ea7f7bc kvprintf: Pad %*c to width, like %*s
Sponsored by:	EMC / Isilon Storage Division
2016-06-09 18:24:51 +00:00
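The kernel's `kvprintf()` is not reproduced here. As a point of comparison, this sketch uses the standard C library's `snprintf()`, whose `%*c` already pads to the given width the way `%*s` does. A negative width left-justifies the character. `pad_char` is a hypothetical helper.

```c
#include <stdio.h>

/* Hypothetical helper: format one character in a field of 'width'
 * columns using the '*' width argument. A positive width right-aligns;
 * a negative width is treated as the '-' flag and left-aligns.
 * Returns the number of characters written (excluding the NUL). */
int pad_char(char *buf, size_t len, int width, char c)
{
    return snprintf(buf, len, "%*c", width, c);
}
```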
Jamie Gritton
ef0ddea316 Make sure the OSD methods for jail set and remove can't run concurrently,
by holding allprison_lock exclusively (even if only for a moment before
downgrading) on all paths that call PR_METHOD_REMOVE.  Since they may run
on a downgraded lock, it's still possible for them to run concurrently
with PR_METHOD_GET, which will need to use the prison lock.
2016-06-09 16:41:41 +00:00
Jamie Gritton
5f02f22af1 Remove a comment that was part of copied code, and is misleading in
the new location.
2016-06-09 15:34:33 +00:00
Mark Johnston
508d856999 Fix some cosmetic issues in kern_fail.c omitted from r296927.
Obtained from:	Matthew Bryan <matthew.bryan@isilon.com>
2016-06-09 13:17:08 +00:00
Konstantin Belousov
3fc292d56b Old process credentials for setuid execve must not be dereferenced
when the process credentials were not changed.  This can happen if an
error occurred trying to activate the setuid binary.  And on error, if
new credentials were not yet assigned, they must be freed to avoid
leaking them.

Use oldcred == NULL as the predicate to detect credential
reassignment.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
2016-06-08 04:37:03 +00:00
Mariusz Zaborski
b3a734483e Introduce the PD_CLOEXEC flag for pdfork(2).
Reviewed by:	mjg
2016-06-08 02:09:14 +00:00
Svatopluk Kraus
c4263292fe Remove temporary solution for storing interrupt mapping data as
it's not needed after r301451 and follow-ups r301453, r301539.

This makes INTRNG clean of all additions related to various buses.
2016-06-07 09:03:27 +00:00
Michal Meloun
949883bd72 INTRNG: As a follow-up to r301451, implement mapping and configuration
of gpio pin interrupts in the new way.

Note: This removes the last consumer of the intr_ddata machinery, which is
removed in a separate commit.
2016-06-07 05:08:24 +00:00
Bjoern A. Zeeb
3af72c1124 Implement a show panic command to DDB which will helpfully print the
panic string again if set, in case it scrolled out of the active
window.  This avoids having to remember the symbol name.

Also add a show callout <addr> command to DDB in order to inspect
some struct callout fields in case of panics in the callout code.
This may help to see if there was memory corruption or to further
ease debugging problems.

Obtained from:	projects/vnet
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Reviewed by:	jhb (comment only on the show panic initially)
Differential Revision:	https://reviews.freebsd.org/D4527
2016-06-06 20:57:24 +00:00
Konstantin Belousov
93ccd6bf87 Get rid of struct proc p_sched and struct thread td_sched pointers.
p_sched is unused.

The struct td_sched is always co-allocated with the struct thread,
except for thread0.  Avoid the useless indirection; instead, calculate
the td_sched location using simple pointer arithmetic in td_get_sched(9).
For thread0, which is statically allocated, create a structure to
emulate layout of the dynamic allocation.

Reviewed by:	jhb (previous version)
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D6711
2016-06-05 17:04:03 +00:00
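This sketch of the layout trick uses simplified, hypothetical versions of `struct thread` and `struct td_sched`; the real structures are much larger. The `td_get_sched` body here is an assumption consistent with the description. The scheduler data sits immediately after the thread, and `thread0` gets a static wrapper structure with the same layout.

```c
#include <stdlib.h>

/* Hypothetical, heavily simplified stand-ins for the kernel structures. */
struct thread   { int td_tid; };
struct td_sched { int ts_slice; };

/* Assumed accessor: td_sched lives right after struct thread, so its
 * address is computed instead of being stored in a td_sched pointer.
 * This relies on td_sched's alignment being satisfied at that offset. */
static inline struct td_sched *
td_get_sched(struct thread *td)
{
    return ((struct td_sched *)&td[1]);
}

/* Dynamic case (hypothetical allocator): one allocation sized for both. */
struct thread *thread_alloc_sketch(void)
{
    return calloc(1, sizeof(struct thread) + sizeof(struct td_sched));
}

/* Static case (thread0): a wrapper emulating the dynamic layout. */
static struct {
    struct thread   t0;
    struct td_sched ts0;
} thread0_storage;
```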
Konstantin Belousov
314381b529 Use ANSI function definition.
Sponsored by:	The FreeBSD Foundation
2016-06-05 16:55:55 +00:00
Svatopluk Kraus
ad5244ece1 INTRNG - change how interrupt mapping data are provided
to the framework in the OFW (FDT) case.

This is a follow-up to r301451.

Differential Revision:	https://reviews.freebsd.org/D6634
2016-06-05 16:20:12 +00:00
Svatopluk Kraus
0869297dd9 (1) Add a new bus method to get mapping data for an interrupt.
BUS_MAP_INTR() is used to get interrupt mapping data according
to the provided hints. The hints could be modified afterwards, but only
if mapping data was allocated. This method is intended to be called
before BUS_ALLOC_RESOURCE().

Interrupt mapping data describes an interrupt: hardware number,
type, configuration, CPU binding, and whatever else is needed to set it up.

(2) Introduce a method which allows storing additional data
in struct resource to be available for bus drivers. This method is
convenient in two ways:
 - there is no need to rework existing bus drivers as they can simply
   be extended to provide additional data,
 - there is no need to modify any existing bus methods as struct
   resource is already passed to them as argument and thus stored data
   is simply accessible by other bus drivers.
For now, implement this method only for INTRNG.

This is motivated by the needs of modern SOCs where hardware initialization
is not straightforward and resource descriptions are complex, opaque
to everyone but the provider, and may vary from SOC to SOC. A typical
situation is that one bus driver can fetch a resource description for
its child device, but it's opaque to this driver. Another bus driver
knows a provider for this kind of resource and can pass this resource
description to it. In fact, something like device IVARS would be
perfect for that if implemented generally enough. Unfortunately, IVARS
are usable only by their owners now. Only the owner knows its IVARS layout,
thus other bus drivers are not able to use them.

Differential Revision:	https://reviews.freebsd.org/D6632
2016-06-05 16:07:57 +00:00
Andrew Turner
d1605cda2b Add an interface to handle interrupt controllers that have a contiguous
range of interrupts they pass to a second controller driver to handle.
The parent driver is expected to detect when one of these interrupts has
been triggered and call intr_child_irq_handler to pass the interrupt to
a child. The children controllers are then expected to manage the range
by allocating interrupts as needed.

This will initially be used by the ARM GICv3 driver, but it is expected to
be useful for other drivers where this type of allocation applies.

Obtained from:	ABT Systems Ltd
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D6436
2016-06-03 10:13:18 +00:00
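This is a hypothetical model of the delegation described above. None of these names are the INTRNG API, and `intr_child_irq_handler` itself is not reproduced. The parent owns a contiguous range of interrupts and, when one fires, forwards it to the child controller. In this sketch the child receives an interrupt number relative to the start of its range; that detail is an assumption.

```c
#include <stddef.h>

/* Hypothetical description of a range of parent interrupts handed to a
 * child controller: [base, base + count). */
struct child_range {
    unsigned base;
    unsigned count;
    int (*handler)(void *arg, unsigned irq);  /* child's dispatch routine */
    void *arg;
};

/* Hypothetical parent dispatch: if 'irq' falls inside a delegated range,
 * pass it to the child with a range-relative number and return the
 * child's result; otherwise return -1 so the parent handles it itself. */
int parent_dispatch(const struct child_range *ranges, size_t n, unsigned irq)
{
    for (size_t i = 0; i < n; i++) {
        const struct child_range *r = &ranges[i];

        if (irq >= r->base && irq - r->base < r->count)
            return r->handler(r->arg, irq - r->base);
    }
    return -1;
}
```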
Mateusz Guzik
f3c8e16ea5 taskqueue: plug a leak in _taskqueue_create
While here make some style fixes and postpone the sprintf so that it is
only done when the function can no longer fail.

CID:	1356041
2016-06-02 15:52:34 +00:00
Mateusz Guzik
fc4f686d59 Microoptimize locking primitives by avoiding unnecessary atomic ops.
Inline versions of the primitives do an atomic op and, if it fails, fall back to
the actual primitives, which immediately retry the atomic op.

The obvious optimisation is to check if the lock is free and only then proceed
to do an atomic op.

Reviewed by:	jhb, vangyzen
2016-06-01 18:32:20 +00:00
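This is a minimal sketch of the optimisation using C11 atomics. It assumes a hypothetical lock word that holds 0 when free and an owner id otherwise; the real mutex, rwlock and sx encodings are not reproduced. The fast path reads the lock word first and attempts the compare-and-swap only when the lock looks free.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define LOCK_UNOWNED ((uintptr_t)0)

/* Hypothetical lock: the word is 0 when free, otherwise the owner's id. */
struct simple_lock {
    _Atomic uintptr_t owner;
};

/* Check first, then CAS: a plain load typically does not need exclusive
 * ownership of the cache line, so a caller that finds the lock held
 * skips the atomic op instead of issuing one that is bound to fail. */
bool simple_lock_try(struct simple_lock *l, uintptr_t tid)
{
    uintptr_t expected = LOCK_UNOWNED;

    if (atomic_load_explicit(&l->owner, memory_order_relaxed) != LOCK_UNOWNED)
        return false;
    return atomic_compare_exchange_strong_explicit(&l->owner, &expected, tid,
        memory_order_acquire, memory_order_relaxed);
}

void simple_lock_release(struct simple_lock *l)
{
    atomic_store_explicit(&l->owner, LOCK_UNOWNED, memory_order_release);
}
```

A real lock would spin or sleep on failure; this sketch only returns `false` to keep the fast path visible.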
Bjoern A. Zeeb
3f58662dd9 The pr_destroy field does not allow us to run the teardown code in a
specific order.  VNET_SYSUNINITs however are doing exactly that.
Thus remove the VIMAGE conditional field from the domain(9) protosw
structure and replace it with VNET_SYSUNINITs.
This also allows us to change some order and to make the teardown functions
file local static.
Also convert divert(4) as it uses the same mechanism ip(4) and ip6(4) use
internally.

Slightly reshuffle the SI_SUB_* fields in kernel.h and add new ones, e.g.,
for pfil consumers (firewalls), partially for this commit and for others
to come.

Reviewed by:		gnn, tuexen (sctp), jhb (kernel.h)
Obtained from:		projects/vnet
MFC after:		2 weeks
X-MFC:			do not remove pr_destroy
Sponsored by:		The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D6652
2016-06-01 10:14:04 +00:00
Gleb Smirnoff
34e05ebe72 Fix kernel stack disclosures in the Linux and 4.3BSD compat layers.
Submitted by:	CTurt
Security:	SA-16:20
Security:	SA-16:21
2016-05-31 16:56:30 +00:00
Edward Tomasz Napierala
f7bd221730 Cosmetics - add missing space after ellipses in shutdown messages.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-05-31 15:27:33 +00:00
Jamie Gritton
ee8d6bd352 Mark jail(2), and the sysctls that it (and only it) uses as deprecated.
jail(8) has long used jail_set(2), and those sysctls only cause confusion.
2016-05-30 05:21:24 +00:00
Mateusz Guzik
2dbdf49cf4 fd: provide a common exit point for unlock in kern_dup
While here assert dropped filedesc lock on return from closefp.
2016-05-27 17:00:15 +00:00
Mateusz Guzik
cda688443a exec: get rid of one vnode lock/unlock pair in do_execve
The lock was temporarily dropped for vrele calls, but they can be
postponed to a point where the lock is not held in the first place.

While here shuffle other code not needing the lock.
2016-05-27 15:03:38 +00:00
Bryan Drewery
1afd78b34d exec: Provide execpath in imgp for the process_exec hook.
This was previously set after the hook and only if auxargs were present.
Now always provide it if possible.

MFC after:	2 weeks
Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D6546
2016-05-26 23:19:39 +00:00
Bryan Drewery
881010f05d exec: Add credential change information into imgp for process_exec hook.
This allows an EVENTHANDLER(process_exec) hook to see if the new image
will cause credentials to change whether due to setgid/setuid or because
of POSIX saved-id semantics.

This adds 3 new fields into image_params:
  struct ucred *newcred		Non-null if the credentials will change.
  bool credential_setid		True if the new image is setuid or setgid.

This will pre-determine the new credentials before invoking the image
activators, where the process_exec hook is called.  The new credentials
will be installed into the process in the same place as before, after
image activators are done handling the image.

MFC after:	2 weeks
Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D6544
2016-05-26 23:18:54 +00:00
Conrad Meyer
571ebf7685 crypto routines: Hint minimum buffer sizes to the compiler
Use the C99 'static' keyword to hint IV and output digest sizes to the
compiler.  The keyword informs the compiler of the minimum valid size for a given
array.  Obviously not every pointer can be validated (i.e., the compiler can
produce false negative but not false positive reports).

No functional change.  No ABI change.

Sponsored by:	EMC / Isilon Storage Division
2016-05-26 19:29:29 +00:00
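This small sketch shows the C99 `static` array-parameter syntax the commit relies on. `xor_iv` and `HYPO_IV_LEN` are hypothetical and are not taken from the crypto code. The declaration tells the compiler that the caller supplies at least `HYPO_IV_LEN` elements. It adds no runtime check, and the parameter is still passed as a plain pointer, which is consistent with there being no ABI change.

```c
#include <stddef.h>
#include <stdint.h>

#define HYPO_IV_LEN 16   /* hypothetical IV size for this sketch */

/* 'static HYPO_IV_LEN' promises at least HYPO_IV_LEN valid elements.
 * A compiler may warn if it can prove a shorter array or NULL is passed;
 * at run time both parameters are ordinary pointers. */
void xor_iv(uint8_t block[static HYPO_IV_LEN],
    const uint8_t iv[static HYPO_IV_LEN])
{
    for (size_t i = 0; i < HYPO_IV_LEN; i++)
        block[i] ^= iv[i];
}
```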
Hans Petter Selasky
84e717c4cf Add support for boolean sysctl's.
Because the size of bool can be implementation defined, make a bool
sysctl handler which handles bools. Userspace sees the bools like
unsigned 8-bit integers. Values are filtered to either 1 or 0 upon
read and write, similar to what a compiler would do.

Requested by:	kmacy @
Sponsored by:	Mellanox Technologies
2016-05-26 08:41:55 +00:00
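This is a hypothetical sketch of the conversion rules described above, not the actual sysctl handler. Userspace exchanges the value as an unsigned 8-bit integer, and both directions filter it to 0 or 1.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical read direction: kernel bool -> userspace uint8_t. */
uint8_t bool_to_user(bool kval)
{
    return kval ? 1 : 0;
}

/* Hypothetical write direction: any nonzero byte becomes true, the same
 * way C converts an integer to bool. */
bool bool_from_user(uint8_t uval)
{
    return uval != 0;
}
```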
Ian Lepore
a66dc0c52b Include machine/acle-compat.h in cdefs.h on arm if the compiler doesn't
have ACLE support built in.  The ACLE (ARM C Language Extensions) defines
a set of standardized symbols which indicate the architecture version and
features available.  ACLE support is built in to modern compilers (both
clang and gcc), but absent from gcc prior to 4.4.

ARM (the company) provides the acle-compat.h header file to define the
right symbols for older versions of gcc.  Basically, acle-compat.h does
for arm about the same thing cdefs.h does for freebsd: defines
standardized macros that work no matter which compiler you use.  If ARM
hadn't provided this file we would have ended up with a big #ifdef __arm__
section in cdefs.h with our own compatibility shims.

Remove #include <machine/acle-compat.h> from the zillion other places (an
ever-growing list) that it appears.  Since style(9) requires sys/types.h
or sys/param.h early in the include list, and both of those lead to
including cdefs.h, only a couple special cases still need to include
acle-compat.h directly.

Loves it:     imp
2016-05-25 19:44:26 +00:00
Konstantin Belousov
c5e44d6cd5 Silence false LOR report due to the taskqueue mutex and kqueue lock
named the same.

Reported by:	Doug Luce <doug@freebsd.con.com>
Sponsored by:	The FreeBSD Foundation
2016-05-24 21:13:33 +00:00
John Baldwin
778ce4f297 Return the correct status when a partially completed request is cancelled.
After the previous changes to fix requests on blocking sockets to complete
across multiple operations, an edge case exists where a request can be
cancelled after it has partially completed.  POSIX doesn't appear to
dictate exactly how to handle this case, but in general I feel that
aio_cancel() should arrange to cancel any request it can, but that any
partially completed requests should return a partial completion rather
than ECANCELED.  To that end, fix the socket AIO cancellation routine to
return a short read/write if a partially completed request is cancelled
rather than ECANCELED.

Sponsored by:	Chelsio Communications
2016-05-24 21:09:05 +00:00
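This is a hypothetical model of the cancellation rule. It assumes the job records how many bytes it has already transferred (`done`). A partially completed request reports a short transfer. A request that moved no data reports `ECANCELED`, with `aio_return()` yielding -1.

```c
#include <errno.h>
#include <sys/types.h>

/* Hypothetical: compute what a cancelled socket AIO request reports.
 * Returns the byte count (or -1) that aio_return() would give and
 * stores the aio_error() value in *err. */
ssize_t cancelled_result(size_t done, int *err)
{
    if (done > 0) {
        *err = 0;
        return (ssize_t)done;   /* short read/write, not an error */
    }
    *err = ECANCELED;
    return -1;
}
```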
Andrew Turner
974692e3bf Limit calling pmc_hook to when the interrupt comes while running userspace.
We may enable interrupts from within the callback, e.g. in a data abort
during copyin. If we receive an interrupt at that time pmc_hook will be
called again and, as it is handling userspace stack tracing, will hit a
KASSERT as it checks if the trapframe is from userland.

With this I can run hwpmc with intrng on a ThunderX and have it trace all
CPUs.

Obtained from:	ABT Systems Ltd
Sponsored by:	The FreeBSD Foundation
2016-05-24 12:06:56 +00:00
John Baldwin
1717b68af1 Don't prematurely return short completions on blocking sockets.
Always requeue an AIO job at the head of the socket buffer's queue if
sosend() or soreceive() returns EWOULDBLOCK on a blocking socket.
Previously, requests were only requeued if they returned EWOULDBLOCK
and completed no data.  Now after a partial completion on a blocking
socket the request is queued and the remaining request is retried when
the socket is ready.  This allows writes larger than the currently
available space on a blocking socket to fully complete.  Reads on a
blocking socket that satisfy the low watermark can still return a short
read (same as read()).

In order to track previously completed data, the internal 'status'
field of the AIO job is used to store the amount of previously
completed data.

Non-blocking sockets continue to return short completions for both
reads and writes.

Add a test for a "large" AIO write on a blocking socket that writes
twice the socket buffer size to a UNIX domain socket.

Sponsored by:	Chelsio Communications
2016-05-24 03:13:27 +00:00
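This is a simplified simulation of the requeue behaviour. It assumes a hypothetical `struct sim_job` whose `done` field stands in for the internal 'status' field used to remember previously completed data. On a blocking socket, a partial write leaves the job queued for a retry. On a non-blocking socket, the short count is returned at once.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical job: 'done' plays the role of the internal status field. */
struct sim_job {
    size_t len;    /* bytes requested */
    size_t done;   /* bytes completed in earlier attempts */
};

/* One attempt to service a write when 'space' bytes are free in the send
 * buffer. Returns true when the job completes. On a blocking socket a
 * partial transfer returns false (the job stays queued and is retried
 * later); on a non-blocking socket the short count completes the job.
 * A zero-byte non-blocking attempt, which would really fail with
 * EWOULDBLOCK, is not modelled. */
bool sim_write_attempt(struct sim_job *j, size_t space, bool blocking)
{
    size_t want = j->len - j->done;
    size_t n = want < space ? want : space;

    j->done += n;
    if (j->done == j->len)
        return true;
    return !blocking;
}
```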
Alan Somers
37f32e5379 Fix build of kern/subr_unit.c, broken by r300539
Reported by:	peter
Pointyhat to:	asomers
Sponsored by:	Spectra Logic Corp
2016-05-24 00:14:58 +00:00