Commit Graph

120801 Commits

Author SHA1 Message Date
Warner Losh
2d87718fda Use bool instead of int for predicate functions relating to work
available.
2018-02-23 16:06:54 +00:00
Konstantin Belousov
33099716f3 Do not return out of bound pointers from intr_lookup_source().
This hardens the code against driver and upper level bugs causing
invalid indexes used, e.g. on msi release.

Reported by:	gallatin
Reviewed by:	gallatin, hselasky
Sponsored by:	Mellanox Technologies
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D14470
2018-02-23 11:20:59 +00:00
Wojciech Macek
3c41c1d446 powerpc64: add NVMe to GENERIC64
NVMe support is ready and should be compiled-in
to the ppc64 kernel.

Submitted by:          Wojciech Macek <wma@semihalf.org>
Obtained from:         Semihalf
Sponsored by:          IBM, QCM Technologies
2018-02-23 07:43:52 +00:00
Warner Losh
ef1fcaf0f5 Do not include float interfaces when using libsa.
We don't support float in the boot loaders, so don't include
interfaces for float or double in systems headers. In addition, take
the unusual step of spiking double and float to prevent any more
accidental seepage.
2018-02-23 04:04:25 +00:00
David C Somayajulu
b65c0c07b2 1. Added support to offline a port if is error recovery on successful.
2. Sysctls to enable/disable driver_state_dump and error_recovery.
3. Sysctl to control the delay between hw/fw reinitialization and
   restarting the fastpath.
4. Stop periodic stats retrieval if interface has IFF_DRV_RUNNING flag off.
5. Print contents of PEG_HALT_STATUS1 and PEG_HALT_STATUS2 on heartbeat
   failure.
6. Speed up slowpath shutdown during error recovery.
7. link_state update using atomic_store.
8. Added timestamp information on driver state and minidump captures.
9. Added support for Slowpath event logging
10.Added additional failure injection types to simulate failures.
2018-02-23 03:36:24 +00:00
Don Lewis
97e9382d56 Decrease latency by not wrapping the idle loop's potentially lengthy
search for a thread to steal inside a critical section.  Since this
allows the search to be preempted, restart the search if preemption
happens since the search results found earlier may no longer be
valid.

Decrease the latency of starting a thread that may be assigned to
this CPU during the search by polling for incoming threads during
the search and switching to that thread instead of continuing the
search.

Test for stale search results and restart the search before going
through the expense of calling tdq_lock_pair().  Retry some tests
after grabbing the locks since things may have changed while waiting
to get both locks.

Eliminate special case handling for stealing from an SMT peer that
uses 1 as the steal threshold.  This can only succeed if a thread
has been assigned but our SMT peer has not yet started executing
it.  This is quite rare and when it happens the other SMT thread
is generally waiting for the same tdq lock that we hold.  Basically
both SMT threads are racing to grab the same spin lock.

Add the kern.sched.always_steal knob from a ULE patch by jeff@.

Incorporate another idea from Jeff's ULE patch.  If the sched_switch()
detects that the CPU is about to go idle, try to steal a thread
before switching to the idle thread.  Since the search for a thread
to steal has to be done inside a critical section in this context,
limit the impact on latency by adding the knob kern.sched.trysteal_limit
to limit the topological distance of the search and don't restart
the search if we detect stale results.  If this search can't find
an stealable thread, the idle loop can do a more complete search.
Also poll for threads being assigned to this CPU during the search
and switch to them instead of continuing the search.  This change
is responsibile for the majority of the improvement in parallel
buildworld times.

In sched_balance_group() change the minimum threshold from stealing
a thread from 1 to 2.  Poaching a newly assigned thread from a CPU
that is waking up hasn't yet switched to that thread from idle is
likely very rare and is likely to have the same lock race as is
seen when stealing threads in the idle loop.  Also use tdq_notify()
to kick the destintation CPU instead of always sending an IPI.
Update a stale comment, the number of transferable threads is not
calculated.

Reviewed by:	kib (earlier version)
Comments by:	avg, jeff, mav
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D12130
2018-02-23 00:12:51 +00:00
Ravi Pokala
dcd935dfd1 jedec_dimm(4): report asset info and temperatures for DDR3 and DDR4 DIMMs
A super-set of the functionality of jedec_ts(4). jedec_dimm(4) reports asset
information (Part Number, Serial Number) encoded in the "Serial Presence
Detect" (SPD) data on JEDEC DDR3 and DDR4 DIMMs. It also calculates and
reports the memory capacity of the DIMM, in megabytes. If the DIMM includes
a "Thermal Sensor On DIMM" (TSOD), the temperature is also reported.

Reviewed by:	cem
MFC after:	1 week
Relnotes:	yes
Sponsored by:	Panasas
Differential Revision:	https://reviews.freebsd.org/D14392
Discussed with:	avg, cem
Tested by:	avg, cem (previous version, no semantic changes)
2018-02-22 23:18:46 +00:00
Ian Lepore
363b2c7fd2 Add a missing line continuation.
How many commits does it take to get a simple module makefile working?
Apparently at least three.

Pointy hat to:  ian
2018-02-22 22:25:26 +00:00
Mateusz Guzik
a0c722bdbf Fix up sysctl vfs.buffercache broken in r329612
Sample problem:
top: sysctl(vfs.bufspace...) expected 8, got 4

Reported by:	O. Hartmann <ohartmann walstatt.org>
2018-02-22 20:39:25 +00:00
Oleksandr Tymoshenko
94b8a54ae6 [chvgpio] add GPIO driver for Intel Z8xxx SoC family
Add chvgpio(4) driver for Intel Z8xxx SoC family. This product
was formerly known as Cherry Trail but Linux and OpenBSD drivers
refer to it as Cherry View. This driver is derived from OpenBSD
one so the name is kept for alignment with another BSD system.

Submitted by:	Tom Jones <tj@enoti.me>
Reviewed by:	gonzo, wblock(man page)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D13086
2018-02-22 19:12:32 +00:00
Eric van Gyzen
0127914caa sched_ule: update a comment to reflect reality
MFC after:	3 days
Sponsored by:	Dell EMC
2018-02-22 17:09:26 +00:00
Kyle Evans
afdc2600c2 nvme: Unbreak LE builds after r329824
The parameter 'p' is unused if _BYTE_ORDER == _LITTLE_ENDIAN. Add in a
(void)p to fix the build.
2018-02-22 16:16:49 +00:00
Hans Petter Selasky
949440623b Return correct error code to user-space when a system call receives a
signal in the LinuxKPI.

The read(), write() and mmap() system calls can return either EINTR or
ERESTART upon receiving a signal. Add code to figure out the correct
return value by temporarily storing the return code from the relevant
FreeBSD kernel APIs in the Linux task structure.

MFC after:	3 days
Sponsored by:	Mellanox Technologies
2018-02-22 15:29:19 +00:00
Wojciech Macek
0d787e9b35 NVMe: Add big-endian support
Remove bitfields from defined structures as they are not portable.
Instead use shift and mask macros in the driver and nvmecontrol application.

NVMe is now working on powerpc64 host.

Submitted by:          Michal Stanek <mst@semihalf.com>
Obtained from:         Semihalf
Reviewed by:           imp, wma
Sponsored by:          IBM, QCM Technologies
Differential revision: https://reviews.freebsd.org/D13916
2018-02-22 13:32:31 +00:00
Andriy Gapon
de2cb430ad another rework of getzfsvfs / getzfsvfs_impl code
This change is designed to account for yet another difference between
illumos and FreeBSD VFS.  In FreeBSD a filesystem driver is supposed to
clean up mnt_data in its VFS_UNMOUNT method because it's the last call
into the driver before a struct mount object is destroyed.  The VFS
drains all references to the object before destroying it, but for the
driver it's already as good as gone.
In contrast, illumos VFS provides another method, VFS_FREEVFS, that is
called when all references are drained.  So, the driver can keep its
data after VFS_UNMOUNT and clean it up in VFS_FREEVFS after all
references are gone. This is what ZFS does on illumos.
So there a reference to a filesystem is sufficient to guarantee that the
ZFS specific data, aka zfsvfs_t, stays around (even if the filesystem
gets unmounted).  In FreeBSD we need to vfs_busy the filesystem to get
the same guarantee.  vfs_ref guarantees only that the struct mount is
kept.

The following rules should be observed in getzfsvfs / getzfsvfs_impl on
FreeBSD:
- if we need access to zfsvfs_t then we must use vfs_busy
- if only we need to access struct mount (aka vfs_t), then vfs_ref is
  enough
- when illumos code actually needs only the vfs_t, they still can pass
  the zfsvfs_t and get the vfs_t from it;  that can work in FreeBSD if
  the filesystem is busied, but when it's just referenced then we have
  to pass the vfs_t explicitly
- we cannot call vfs_busy while holding a dataset because that creates a
  LOR with dp_config_rwlock

As a result:
- getzfsvfs_impl now only references the filesystem, same as in illumos,
  but unlike illumos it has to return the vfs_t
- the consumers are updated to account for the change
- getzfsvfs busies the filesystem (and drops the reference from
  getzfsvfs_impl)

Also, zfs_unmount_snap() now gets a busied a filesystem, references it
and then unbusies it essentially reverting actions done in getzfsvfs.
This is needed because the code may perform some checks that require the
zfsvfs_t.  So, those are done before the unbusying.

MFC after:	2 weeks
2018-02-22 13:06:27 +00:00
Warner Losh
07e5967a22 Revert r329814 as well. It should have been in r329819. 2018-02-22 11:51:50 +00:00
Andriy Gapon
8d69fe5cc8 followup to r329556, completely remove the covered vnode assert
vrele() acquires the vnode lock only if the hold count drops to zero.
In other scenarios it needs only the interlock.  So,
zfsctl_snapdir_lookup() can race with vfs_mount_destroy() -> vrele()
such that the lookup adds a new reference and then vrele() drops the
mountpoint's reference and only then we check the reference count.
It would be just one in this case.

In fact, the assert should have been removed in r323483 when the code
learned how to deal with the uncovered vnode.

PR:		225795
MFC after:	4 days
X-MFC with:	r329556
2018-02-22 11:41:00 +00:00
Warner Losh
0028abe633 Backout r329818, r329816 and r329815.
These aren't the commits I thought I was testing prior to
commit. Revert until I can sort out what happened and fix it.
2018-02-22 11:18:33 +00:00
Warner Losh
91acaad987 Fix typo in last commit after last rebase before commit... 2018-02-22 10:55:23 +00:00
Warner Losh
4d87e27125 Combine BIO_DELETE requests for nda devices
Now that we're queueing BIO_DELETE requests in the CAM I/O scheduler,
it make sense to try to combine as many as possible into a single
request to send down to hardware. Hopefully, lots of larger requests
like this are better than lots of individual transactions.

Note for future: need to limit based on total size of the trim
request. Should also collapse adjacent ranges where possible to
increase the size of the max payload.

Sponsored by: Netflix
2018-02-22 05:44:00 +00:00
Warner Losh
c5fe3ae9b8 Introduce capacity flags for periphs
Introduce flags word to describe the capacities of the peripheral.
First bit will describe if the periph driver allows multiple
outstanding TRIMS to be active in a device.

Modify the I/O scheduler so that the nda driver can queue trims
for a while after the first one arrives. We'll queue until we see
a I/O scheduler tick, then we'll schedule as many TRIMs as allowed
by other factors (currently this is slocts in the NVMe controller).
This mariginally helps the read latency issues we see with reads,
but sets the stage for the nda driver to do TRIM collapsing like the
da and ada drivers do today.

Sponsored by: Netflix
2018-02-22 05:43:55 +00:00
Warner Losh
c9878d6d63 Note when we tick.
To help implement a policy of 'queue all trims until next I/O sched
tick' policy to help coalesce them, note when we tick so we can do
something special on the first call after the tick to get more work.

Sponsored by: Netflix
2018-02-22 05:43:50 +00:00
Warner Losh
f2b9885036 Wrap an extra long line
This debugging line is too big for even my largest xterm. wrap it at
about 80 columns.

Sponsored by: Netflix
2018-02-22 05:43:45 +00:00
Warner Losh
97f8aa050e Don't sort TRIMs.
While the code for ada and da both assume that the trim list is
ordered when doing the coaleascing the TRIMs, it turns out that
creating the sorted list uses more resources than are saved by having
slightly fewer trims sent to the device.

Sponsored by: Netflix
2018-02-22 05:43:20 +00:00
Alexander Motin
dd9ceab333 MFV r329803:
9080 recursive enter of vdev_indirect_rwlock from vdev_indirect_remap()

illumos/illumos-gate@bdfded42e6

A scenario came up where a callback executed by vdev_indirect_remap() on a vdev, calls
vdev_indirect_remap() on the same vdev and tries to reacquire vdev_indirect_rwlock that
was already acquired from the first call to vdev_indirect_remap(). The specific scenario,
is that we want to remap a block pointer that is snapshoted but its dataset's remap_deadlist
is not cached. So in order to add it we issue a read through a vdev_indirect_remap() on the
same vdev, which brings up the aforementioned issue.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Author: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
2018-02-22 03:54:59 +00:00
Alexander Motin
064827be34 MFV r329799, r329800:
9079 race condition in starting and ending condesing thread for indirect vdevs

illumos/illumos-gate@667ec66f1b

The timeline of the race condition is the following:
[1] Thread A is about to finish condesing the first vdev in spa_condense_indirect_thread(),
so it calls the spa_condense_indirect_complete_sync() sync task which sets the
spa_condensing_indirect field to NULL. Waiting for the sync task to finish, thread A
sleeps until the txg is done. When this happens, thread A will acquire spa_async_lock
and set spa_condense_thread to NULL.
[2] While thread A waits for the txg to finish, thread B which is running spa_sync() checks
whether it should condense the second vdev in vdev_indirect_should_condense() by checking
the spa_condensing_indirect field which was set to NULL by spa_condense_indirect_thread()
from thread A. So it goes on and tries to spawn a new condensing thread in
spa_condense_indirect_start_sync() and the aforementioned assertions fails because thread A
has not set spa_condense_thread to NULL (which is basically the last thing it does before
returning).

The main issue here is that we rely on both spa_condensing_indirect and spa_condense_thread to
signify whether a condensing thread is running. Ideally we would only use one throughout the
codebase. In addition, for managing spa_condense_thread we currently use spa_async_lock which
basically tights condensing to scrubing when it comes to pausing and resuming those actions
during spa export.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
2018-02-22 03:49:06 +00:00
Ed Maste
a0409b6f36 Remove accidental vim droppings
Reported by:	cy
2018-02-22 03:37:01 +00:00
Alexander Motin
1ea10a60f9 MFV r329793, r329795:
9075 Improve ZFS pool import/load process and corrupted pool recovery

illumos/illumos-gate@6f7938128a

Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:

https://www.illumos.org/issues/7638: Refactor spa_load_impl into several functions
https://www.illumos.org/issues/8961: SPA load/import should tell us why it failed
https://www.illumos.org/issues/7277: zdb should be able to print zfs_dbgmsg's

To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.

The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.

The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.

When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.

This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.

With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
Of data, this feature is for now restricted to a read-only import.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
2018-02-22 03:15:35 +00:00
John Baldwin
642ffab5fc Avoid grabbing locks when grabbing the vt(4) console for DDB.
Trying to grab locks during cngrab() when entering the debugger is
deadlock prone as all other CPUs are already halted (and thus unable
to release locks) when cngrab() is invoked.  One could instead use
try-locks.  However, the case that the try-lock fails still has to
be handled.  In addition, if the try-lock works it doesn't provide
any greater ordering guarantees than is already provided by entering
and exiting DDB.  It is simpler to define a simpler path for the
case that the try-lock would fail and always use that when entering
DDB.  Messing with timers, etc. when entering DDB is dubious even if
the try-lock succeeds.

This patch attempts to use the smallest possible set of operations to
grab the vt(4) console when entering DDB without using any locks.

Reviewed by:	emaste
Tested by:	Matthew Macy
MFC after:	1 week
2018-02-22 02:26:29 +00:00
Ed Maste
eae594f7d5 Correct proper nouns in the Linuxulator
- Capitalize Linux
- Spell FreeBSD out in full
- Address some style(9) on changed lines

Sponsored by:	Turing Robotic Industries Inc.
2018-02-22 02:24:17 +00:00
John Baldwin
6619d9fb70 Bring in additional constants and message fields for TLS-related messages.
Sponsored by:	Chelsio Communications
2018-02-22 02:02:31 +00:00
Ed Maste
581bf7cbda Use 'const int *' for sysentvec errno translation table
This allows an sv_errtbl to be read-only .rodata.

Sponsored by:	Turing Robotic Industries Inc.
2018-02-22 01:59:59 +00:00
John Baldwin
125d42fe81 Move DDP PCB state into a helper structure.
This consolidates all of the DDP state in one place.  Also, the code has
now been fixed to ensure that DDP state is only accessed for DDP
connections.  This should not be a functional change but makes it cleaner
and easier to add state for other TOE socket modes in the future.

MFC after:	1 month
Sponsored by:	Chelsio Communications
2018-02-22 01:50:30 +00:00
Alexander Motin
613b0d87da 8942 zfs promote .../%recv should be an error
illumos/illumos-gate@add927f8c8

Reported on the ZFSonLinux https://github.com/zfsonlinux/zfs/issues/4843,
fixed by https://github.com/zfsonlinux/zfs/pull/6339:

If we are in the middle of an incremental zfs receive, the child .../%recv
will exist. If you concurrently run zfs promote .../%recv, it will "work",
but then zfs gets confused. For example, there's no obvious way to destroy
the containing filesystem (because it is now a clone of its invisible child).

Attempting to do this promote should be an error. We could fix this by
having zfs_ioc_promote() check if zc_name contains a %, similar to
zfs_ioc_rename().

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: loli10K <ezomori.nozomu@gmail.com>
2018-02-22 01:42:13 +00:00
Alexander Motin
a33ba3dbde MFV r329776: 8477 Assertion failed in vdev_state_dirty(): spa_writeable(spa)
illumos/illumos-gate@f4c1745bd6

Illumos 4080 allows "zpool clear" to work on readonly pools: i don't think
this is the intended behaviour, we shouldn't be allowed to clear readonly
pools. Probably.

A fix is already in the ZFS on Linux repository to addess this issue:
92e43c1718

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: loli10K <ezomori.nozomu@gmail.com>
2018-02-22 01:00:46 +00:00
Alexander Motin
eea9be67e6 MFV r329774:
8408 dsl_props_set_sync_impl() does not handle nested nvlists correctly

illumos/illumos-gate@85723e5eec

When iterating over the input nvlist in dsl_props_set_sync_impl() when we
don't preserve the nvpair name before looking up ZPROP_VALUE, so when we
later go to process it nvpair_name() is always "value" instead of the actual
property name.

This results in a couple of bugs in the recv code:

 - received properties are not restored correctly when failing to receive
   an incremental send stream
 - received properties are not completely replaced by the new ones when
   successfully receiving an incremental send stream

This was discovered on ZFS on Linux (fixed in
5f1346c299)

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: loli10K <ezomori.nozomu@gmail.com>
2018-02-22 00:55:25 +00:00
Alexander Motin
756595f675 MFV r329770: 9035 zfs: this statement may fall through
illumos/illumos-gate@46ac8fdfc5

Reviewed by: Yuri Pankov <yuripv@yuripv.net>
Reviewed by: Andy Fiddaman <omnios@citrus-it.co.uk>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Toomas Soome <tsoome@me.com>
2018-02-22 00:47:38 +00:00
Alexander Motin
502d18a8f1 MFV r329766: 8962 zdb should work on non-idle pools
illumos/illumos-gate@e144c4e6c9

Currently `zdb` consistently fails to examine non-idle pools as it fails
during the `spa_load()` process. The main problem seems to be that
`spa_load_verify()` fails as can be seen below:

$ sudo zdb -d -G dcenter
    zdb: can't open 'dcenter': I/O error

ZFS_DBGMSG(zdb):
    spa_open_common: opening dcenter
    spa_load(dcenter): LOADING
    disk vdev '/dev/dsk/c4t11d0s0': best uberblock found for spa dcenter. txg 40824950
    spa_load(dcenter): using uberblock with txg=40824950
    spa_load(dcenter): UNLOADING
    spa_load(dcenter): RELOADING
    spa_load(dcenter): LOADING
    disk vdev '/dev/dsk/c3t10d0s0': best uberblock found for spa dcenter. txg 40824952
    spa_load(dcenter): using uberblock with txg=40824952
    spa_load(dcenter): FAILED: spa_load_verify failed [error=5]
    spa_load(dcenter): UNLOADING

This change makes `spa_load_verify()` a dryrun when ran from `zdb`. This is
done by creating a global flag in zfs and then setting it in `zdb`.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
2018-02-22 00:42:12 +00:00
John Baldwin
4f8666989a Add two new ioctls to bhyve for batch register fetch/store operations.
These are a convenience for bhyve's debug server to use a single
ioctl for 'g' and 'G' rather than a loop of individual get/set
ioctl requests.

Reviewed by:	grehan
MFC after:	2 months
Differential Revision:	https://reviews.freebsd.org/D14074
2018-02-22 00:39:25 +00:00
Alexander Motin
fa607d017d MFV r329762: 8961 SPA load/import should tell us why it failed
illumos/illumos-gate@3ee8c80c74

When we fail to open or import a storage pool, we typically don't get any
additional diagnostic information, just "no pool found" or "can not import".

While there may be no additional user-consumable information, we should at
least make this situation easier to debug/diagnose for developers and support.
For example, we could start by using `zfs_dbgmsg()` to log each thing that we
try when importing, and which things failed. E.g. "tried uberblock of txg X
from label Y of device Z". Also, we could log each of the stages that we go
through in `spa_load_impl()`.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
2018-02-22 00:03:14 +00:00
Warner Losh
6ffe523f18 Minor formatting nits. 2018-02-21 23:49:35 +00:00
Alexander Motin
df331510ab MFV r329760: 7638 Refactor spa_load_impl into several functions
illumos/illumos-gate@1fd3785ff6

spa_load_impl has grown out of proportions.  It is currently over 700
lines long and makes it very hard to follow or debug the import process
even for experienced ZFS developers.  The objective is to split it up
in a series of well commented functions.

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
2018-02-21 23:38:30 +00:00
Alexander Motin
b17bfcde3d 9018 Replace kmem_cache_reap_now() with kmem_cache_reap_soon()
illumos/illumos-gate@36a64e6284

To prevent kmem_cache reaping from blocking other system resources, turn
kmem_cache_reap_now() (which blocks) into kmem_cache_reap_soon(). Callers
to kmem_cache_reap_soon() should use kmem_cache_reap_active(), which
exploits #9017's new taskq_empty().

Reviewed by: Bryan Cantrill <bryan@joyent.com>
Reviewed by: Dan McDonald <danmcd@joyent.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Yuri Pankov <yuripv@yuripv.net>
Author: Tim Kordas <tim.kordas@joyent.com>

FreeBSD does not use taskqueue for kmem caches reaping, so this change
is less dramatic then it is on Illumos, just limiting reaping to 1 time
per second.  It may possibly be improved later, if needed.
2018-02-21 23:15:06 +00:00
Alexander Motin
d208c07cf3 MFV r329753: 8809 libzpool should leverage work done in libfakekernel
illumos/illumos-gate@f06dce2c1f

Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Gordon Ross <gordon.w.ross@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Andrew Stormont <astormont@racktopsystems.com>
2018-02-21 21:18:04 +00:00
Kirk McKusick
f686b1710a Refactor fix in r329600 to do its check once in readsuper() rather
than in the two places that call readsuper().

No semantic change intended.

Reviewed by: kib
2018-02-21 19:56:19 +00:00
Ryan Stone
b3b6ff23e7 Allow route change requests to not specify the gateway.
Only require a gateway to be specified on a route add request.  On
a route change request that does not specify the gateway, the
gateway will remain the same.  This allows changing other route
parameters without having to re-specifying the gateway, like in
"route change 10.0.0.0/8 -mtu 9000".

Update the route(8) manpage to explicitly call out this usage
as being supported.

MFC after: 2 weeks
Sponsored by: Dell EMC Isilon
Reviewed By: eugen (rtsock.c change), rgrimes
Differential Revision: https://reviews.freebsd.org/D14291
2018-02-21 19:13:23 +00:00
Stephen Hurd
81ad57b1b8 IFLIB: Make isc_magic unsigned
The IFLIB_MAGIC macro is > INT_MAX, so isc_magic should
be able to contain it.

Reported by:	jeb
Sponsored by:	Limelight  Networks
2018-02-21 18:57:00 +00:00
Alexander Motin
832deba1e5 MFV r329736: 8969 Cannot boot from RAIDZ with parity > 1
illumos/illumos-gate@0fb055e81f

At present it is possible to boot from a root pool that is on RAIDZ but not
one that is on RAIDZ2 or RAIDZ3.  This is because, at the time the pool
version is checked to ensure support for dual/triple parity, the uberblock
has not yet been loaded into the SPA and therefore the code determines that
the pool version is too old and returns ENOTSUP.

Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Andy Fiddaman <omnios@citrus-it.co.uk>

FreeBSD already had this fixed, so this is just a diff reduction.
2018-02-21 18:12:19 +00:00
Alexander Motin
24433f00ea MFV r329502: 7614 zfs device evacuation/removal
illumos/illumos-gate@5cabbc6b49

https://www.illumos.org/issues/7614:
This project allows top-level vdevs to be removed from the storage pool with
“zpool remove”, reducing the total amount of storage in the pool. This
operation copies all allocated regions of the device to be removed onto other
devices, recording the mapping from old to new location. After the removal is
complete, read and free operations to the removed (now “indirect”) vdev must
be remapped and performed at the new location on disk. The indirect mapping
table is kept in memory whenever the pool is loaded, so there is minimal
performance overhead when doing operations on the indirect vdev.

The size of the in-memory mapping table will be reduced when its entries
become “obsolete” because they are no longer used by any block pointers in
the pool. An entry becomes obsolete when all the blocks that use it are
freed. An entry can also become obsolete when all the snapshots that
reference it are deleted, and the block pointers that reference it have been
“remapped” in all filesystems/zvols (and clones). Whenever an indirect block
is written, all the block pointers in it will be “remapped” to their new
(concrete) locations if possible. This process can be accelerated by using
the “zfs remap” command to proactively rewrite all indirect blocks that
reference indirect (removed) vdevs.

Note that when a device is removed, we do not verify the checksum of the data
that is copied. This makes the process much faster, but if it were used on
redundant vdevs (i.e. mirror or raidz vdevs), it would be possible to copy
the wrong data, when we have the correct data on e.g. the other side of the
mirror. Therefore, mirror and raidz devices can not be removed.

Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Prashanth Sreenivasa <pks@delphix.com>
2018-02-21 16:51:02 +00:00
Ian Lepore
94d7be6551 Add required header files.
Reported by:	andreast@
2018-02-21 16:36:44 +00:00
Ian Lepore
3982006ed5 Remove some files that snuck in via cut and paste.
Having these compiled into the module causes the kobj method descriptors
to be resolved incorrectly (by the compile-time linker instead of the
kernel linker), which then leads to hours of frustrating debugging.
2018-02-21 16:34:04 +00:00
Nathan Whitehorn
d9dbc2104f Add definition for the PowerPC A2. 2018-02-21 15:15:58 +00:00
Nathan Whitehorn
dddf28585d Add definitions for the new Radix MMU mode on POWER9+ CPUs. 2018-02-21 15:15:31 +00:00
Andriy Gapon
cfb675a138 MFV r329718: 8520 7198 lzc_rollback_to should support rolling back to origin
illumos/illumos-gate@95643f75d2
95643f75d2

https://www.illumos.org/issues/8520
  lzc_rollback_to() should support rolling back to a clone's origin.
  The current checks in zfs_ioc_rollback() would not allow that because the
  origin snapshot belongs to a different filesystem.
  The overly restrictive check was introduced in 7600, but it was not a
  regression as none of the existing tools provided a way to rollback to the
  origin.

https://www.illumos.org/issues/7198
  EINVAL is returned when a dataset does not have any snapshots, so there is
  nothing to roll back to.
  Although the code in zfs_do_rollback checks for that condition in advance, it's
  still possible that the snapshot(s) gets removed after the check and before the
  rollback sync task is executed.
  At the moment zfs command would crash when that happens.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>
MFC after:	2 weeks
2018-02-21 15:12:14 +00:00
Andriy Gapon
754d27df02 MFV r329715: 8997 ztest assertion failure in zil_lwb_write_issue
illumos/illumos-gate@f864f99efe
f864f99efe

https://www.illumos.org/issues/8997
  When dmu_tx_assign is called from zil_lwb_write_issue, it's possible
  for either ERESTART or EIO to be returned.
  If ERESTART is returned, this will cause an assertion to fail directly
  in zil_lwb_write_issue, where the code assumes the return value is
  EIO if dmu_tx_assign returns a non-zero value. This can occur if the
  SPA is suspended when dmu_tx_assign is called, and most often occurs
  when running zloop.
  If EIO is returned, this can cause assertions to fail elsewhere in the
  ZIL code. For example, zil_commit_waiter_timeout contains the
  following logic:
    lwb_t *nlwb = zil_lwb_write_issue(zilog, lwb);
    ASSERT3S(lwb->lwb_state, !=, LWB_STATE_OPENED);
  In this case, if dmu_tx_assign returned EIO from within
  zil_lwb_write_issue, the lwb variable passed in will not be issued
  to disk. Thus, it's lwb_state field will remain LWB_STATE_OPENED and
  this assertion will fail. zil_commit_waiter_timeout assumes that after
  it calls zil_lwb_write_issue, the lwb will be issued to disk, and
  doesn't handle the case where this is not true; i.e. it doesn't handle
  the case where dmu_tx_assign returns EIO.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>
MFC after:	3 weeks
2018-02-21 15:07:49 +00:00
Andriy Gapon
9d6810819c MFV r329713: 8731 ASSERT3U(nui64s, <=, UINT16_MAX) fails for large blocks
illumos/illumos-gate@a6c1eb3c08
a6c1eb3c08

https://www.illumos.org/issues/8731
  annotate_ecksum() asserts that nui64s, calculated as nui64s = size / sizeof
  (uint64_t), is not greater than UINT16_MAX.
  This restriction is needed because histograms of incorrectly set and cleared
  bits have 16 bit counters and if the buffer consists of too many 64-bit words,
  then a counter can potentially overflow producing an incorrect result.
  When the largest buffer size was 128KB the greatest value of nui64s was 16K,
  well within the limit.
  But now we have support for large buffers and for buffer sizes of 512KB and
  above the restriction is violated.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>
MFC after:	2 weeks
2018-02-21 14:31:48 +00:00
Wojciech Macek
6d13fd638c PowerNV: Put processor to power-save state in idle thread
When processor enters power-save state it releases resources shared with other
cpu threads which makes other cores working much faster.

This patch also implements saving and restoring registers that might get
corrupted in power-save state.

Submitted by:          Patryk Duda <pdk@semihalf.com>
Obtained from:         Semihalf
Reviewed by:           jhibbits, nwhitehorn, wma
Sponsored by:          IBM, QCM Technologies
Differential revision: https://reviews.freebsd.org/D14330
2018-02-21 14:28:40 +00:00
Andriy Gapon
70639f9a5e MFV r329710: 8966 Source file zfs_acl.c, function zfs_aclset_common contains a use after end of the lifetime of a local variable
illumos/illumos-gate@82693e09cc
82693e09cc
https://www.illumos.org/issues/8966

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: WHR <msl0000023508@gmail.com>
PR:		225162
Submitted by:	WHR <msl0000023508@gmail.com>
Reported by:	WHR <msl0000023508@gmail.com>
MFC after:	1 week
2018-02-21 14:17:07 +00:00
Edward Tomasz Napierala
8404ad78db Use proper buffer length (the announce_buf char pointer used to be anarray),
broken in r317143. This fixes those weird "cd0: Attempt" messages at boot.

PR:		222103
Reviewed by:	scottl@
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14369
2018-02-21 14:05:13 +00:00
Hans Petter Selasky
07c757ec25 Allow LinuxKPI character devices to receive mmap() calls from the Linux
binary mode user-space emulation layer. This is a regression issue after
r328436, when LinuxKPI character devices started to use DTYPE_DEV in
the "f_type" field of the associated file structure(s).

MFC after:	3 days
Found by:	Johannes Lundberg <johalun0@gmail.com>
Sponsored by:	Mellanox Technologies
2018-02-21 10:13:17 +00:00
Wojciech Macek
eb96cc1364 PowerNV: add missing RTC_WRITE support
Add function which can store RTC values to OPAL.

Submitted by:          Wojciech Macek <wma@semihalf.org>
Obtained from:         Semihalf
Sponsored by:          IBM, QCM Technologies
2018-02-21 08:13:17 +00:00
Wojciech Macek
19a5b68236 CXGBE: implement prefetch on non-Intel architectures
Submitted by:          Michal Stanek <mst@semihalf.com>
Obtained from:         Semihalf
Reviewed by:           np, pdk@semihalf.com
Sponsored by:          IBM, QCM Technologies
Differential revision: https://reviews.freebsd.org/D14452
2018-02-21 08:05:56 +00:00
Justin Hibbits
fcc491a3fe Split printtrap() into generic and CPU-specific components
Summary:
This compartmentalizes the CPU-specific trap components into its own
function, rather than littering the general printtrap() with various checks.
This will let us replace a series of #ifdef's with a runtime conditional check
in the future.

Reviewed By:	nwhitehorn
Differential Revision:	https://reviews.freebsd.org/D14416
2018-02-21 03:34:33 +00:00
Alexander Motin
d8e89539c8 MFV r324198: 8081 Compiler warnings in zdb
illumos/illumos-gate@3f7978d02b
3f7978d02b

https://www.illumos.org/issues/8081
  zdb(8) is full of minor problems that generate compiler warnings. On FreeBSD,
  which uses -WError, the only way to build it is to disable all compiler
  warnings. This makes it much harder to detect newly introduced bugs. We should
  cleanup all the warnings.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Alan Somers <asomers@gmail.com>
2018-02-21 03:08:47 +00:00
Alexander Motin
03618fe74d MFV r319737: 6939 add sysevents to zfs core for commands
illumos/illumos-gate@ce1577b049
ce1577b049

https://www.illumos.org/issues/6939
  Originally created https://smartos.org/bugview/OS-4489
       sysevents should be fired in the kernel from ZFS whenever a command
       is run that is logged in zpool history.
  Example output
  Terminal 1
  root - gz sunos ~ # zfs create zones/foobar
  root - gz sunos ~ # zfs set quota=10g zones/foobar
  root - gz sunos ~ # zfs destroy zones/foobar
  Terminal 2
  root - gz sunos ~ # sysevent EC_zfs
  nvlist version: 0
      date = 2016-04-28T14:50:08.964Z
      vendor = SUNW
      publisher = zfs
      class = EC_zfs
      subclass = ESC_ZFS_history_event
      pid = 0
      data = (embedded nvlist)
      nvlist version: 0
          pool_name = zones
          pool_guid = 0x40c964e8f9a7a694
          history_record = (embedded nvlist)
          nvlist version: 0
              dsname = zones/foobar
              dsid = 0x1525
              history internal str =
              internal_name = create
              history txg = 0x4c4ef3

Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Joshua M. Clulow <jmc@joyent.com>
Reviewed by: Josh Wilsdon <jwilsdon@joyent.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Alan Somers <asomers@gmail.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Author: Dave Eddy <dave@daveeddy.com>
2018-02-21 02:19:42 +00:00
Alexander Motin
63e739af67 MFV r319736: 6396 remove SVM
illumos/illumos-gate@5f10ef697f
5f10ef697f

https://www.illumos.org/issues/6396
  LVM = SVM = Solaris Volume Manager
  dead code and not using with ZFS based platform.

Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Author: Yuri Pankov <yuri.pankov@nexenta.com>
2018-02-21 00:24:54 +00:00
Alexander Motin
e5a4a83784 MFV r318941: 7446 zpool create should support efi system partition
illumos/illumos-gate@7855d95b30
7855d95b30

https://www.illumos.org/issues/7446
  Since we support whole-disk configuration for boot pool, we also will need
  whole disk support with UEFI boot and for this, zpool create should create efi-
  system partition.
  I have borrowed the idea from oracle solaris, and introducing zpool create -
  B switch to provide an way to specify that boot partition should be created.
  However, there is still an question, how big should the system partition be.
  For time being, I have set default size 256MB (thats minimum size for FAT32
  with 4k blocks). To support custom size, the set on creation "bootsize"
  property is created and so the custom size can be set as: zpool create B -
  o bootsize=34MB rpool c0t0d0
  After pool is created, the "bootsize" property is read only. When -B switch is
  not used, the bootsize defaults to 0 and is shown in zpool get output with
  value ''. Older zfs/zpool implementations are ignoring this property.
  https://www.illumos.org/rb/r/219/

Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Reviewed by: Yuri Pankov <yuri.pankov@gmail.com>
Approved by: Dan McDonald <danmcd@kebe.com>
Author: Toomas Soome <tsoome@me.com>

This commit makes no sense for FreeBSD, that is why I blocked the option,
but it should be good to stay closer to upstream.
2018-02-21 00:18:57 +00:00
Navdeep Parhar
7cb7c6e37a Catch up with the removal of nktr_slot_flags from upstream netmap. No
functional impact intended.

Submitted by:	Vincenzo Maffione <v.maffione@gmail.com>
2018-02-20 21:42:45 +00:00
Jeff Roberson
683ca3a432 Fix the broken subqueue assignment for the cleanq.
Reported by:	pho
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
2018-02-20 21:27:17 +00:00
Mateusz Guzik
500ca73d43 mtx: add debug assertions to mtx_spin_wait_unlocked 2018-02-20 20:39:34 +00:00
Mateusz Guzik
862db53fb5 Fix reaping on process fd close broken after r329449
The only consumer of proc_reap other than proc_to_reap was not updated
to not PROC_SLOCK.

Reported by:    Juan Ramon Molina Menor <listjm club.fr>
2018-02-20 20:19:38 +00:00
Stephen Hurd
a4e5960730 IFLIB: do not remove dmamap on buffer unload
Dmamap is created only on IFC attach. If we remove it on
buffer release, we won't be able to do ifconfig down&up. Only destroy
when in detach.

Reported by:	wma
Reviewed by:	wma
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D14060
2018-02-20 18:33:45 +00:00
Brooks Davis
b81e88d296 Reduce duplication in dynamic syscall registration code.
Remove the unused syscall_(de)register() functions in favor of the
better documented and easier to use syscall_helper_(un)register(9)
functions.

The default and freebsd32 versions differed in which array of struct
sysents they used and a few missing updates to the 32-bit code as
features were added to the main code.

Reviewed by:	cem
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14337
2018-02-20 18:08:57 +00:00
Ian Lepore
7efedde853 Adjust whitespace of things added in the past couple years to match the
original style of the file.  No functional changes.
2018-02-20 14:59:29 +00:00
Mateusz Guzik
681a1b752c Make killpg1 perform process validity checks without proc lock held. 2018-02-20 10:52:07 +00:00
Konstantin Belousov
2c0f13aa59 vm_wait() rework.
Make vm_wait() take the vm_object argument which specifies the domain
set to wait for the min condition pass.  If there is no object
associated with the wait, use curthread' policy domainset.  The
mechanics of the wait in vm_wait() and vm_wait_domain() is supplied by
the new helper vm_wait_doms(), which directly takes the bitmask of the
domains to wait for passing min condition.

Eliminate pagedaemon_wait().  vm_domain_clear() handles the same
operations.

Eliminate VM_WAIT and VM_WAITPFAULT macros, the direct functions calls
are enough.

Eliminate several control state variables from vm_domain, unneeded
after the vm_wait() conversion.

Scetched and reviewed by:	jeff
Tested by:	pho
Sponsored by:	The FreeBSD Foundation, Mellanox Technologies
Differential revision:	https://reviews.freebsd.org/D14384
2018-02-20 10:13:13 +00:00
Wojciech Macek
f32ebdc85c PowerPC: Switch to more accurate unit to avoid division rounding
On POWER8 architecture there is a timer with 512Mhz frequency.
It has about 1,95ns period, but it is rounded to 1ns which is not accurate.

Submitted by:          Patryk Duda <pdk@semihalf.com>
Obtained from:         Semihalf
Reviewed by:           wma
Sponsored by:          IBM, QCM Technologies
Differential revision: https://reviews.freebsd.org/D14433
2018-02-20 07:30:57 +00:00
Wojciech Macek
838070d5f4 PowerNV: Send SIGILL on HEA illegal instruction exception
Currently Hypervisor Emulation Assistance interrupt is unhandled.
Executing an undefined instruction in userland triggers kernel panic.
Handle this the same way as Facility Unavailable Interrupt - send
SIGILL signal to userspace.

Submitted by:          Michal Stanek <mst@semihalf.com>
Obtained from:         Semihalf
Reviewed by:           nwhitehorn, pdk@semihalf.com, wma
Sponsored by:          IBM, QCM Technologies
Differential revision: https://reviews.freebsd.org/D14437
2018-02-20 06:38:55 +00:00
Alexander Motin
89dabdb4ea MFC r316910: 7812 Remove gender specific language
illumos/illumos-gate@48bbca8168
48bbca8168

https://www.illumos.org/issues/7812
  This change removes all gendered language that did not refer specifically
  to an individual person or pet. The convention taken was to use
  variations on "they" when referring to users and/or human beings, while
  using "it" when referring to code, functions, and/or libraries.
  Additionally, we took the liberty to fix up any whitespace issues that
  were found in any files that were already being modified.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Daniel Hoffman <dj.hoffman@delphix.com>
2018-02-20 05:07:21 +00:00
Alexander Motin
3542f1bd3a MFV r307315:
7301 zpool export -f should be able to interrupt file freeing

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Author: Alek Pinchuk <alek@nexenta.com>

Closes #175
2018-02-20 04:36:51 +00:00
Alexander Motin
4a0867c8d2 MFV r302649: 7016 arc_available_memory is not 32-bit safe
illumos/illumos-gate@0dd053d7d8
0dd053d7d8

https://www.illumos.org/issues/7016
  upstream DLPX-39446 arc_available_memory is not 32-bit safe
  https://github.com/delphix/delphix-os/commit/
  6b353ea3b8a1610be22e71e657d051743c64190b
  related to this upstream:
  DLPX-38547 delphix engine hang
  https://github.com/delphix/delphix-os/commit/
  3183a567b3e8c62a74a65885ca60c86f3d693783
  DLPX-38547 delphix engine hang (fix static global)
  https://github.com/delphix/delphix-os/commit/
  22ac551d8ef085ad66cc8f65e51ac372b12993b9
  DLPX-38882 system hung waiting on free segment
  https://github.com/delphix/delphix-os/commit/
  cdd6beef7548cd3b12f0fc0328eeb3af540079c2

Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Gordon Ross <gordon.ross@nexenta.com>
Author: Prakash Surya <prakash.surya@delphix.com>
2018-02-20 04:14:12 +00:00
Ian Lepore
42c52f36e4 Add missing MODULE_DEPENDS(). 2018-02-20 03:51:09 +00:00
Mateusz Guzik
81d68271d7 Reduce contention on the proctree lock during heavy package build.
There is a proctree -> allproc ordering established.

Most of the time it is either xlock -> xlock or slock -> slock.

On fork however there is a slock -> xlock pair which results in
pathological wait times due to threads keeping proctree held for
reading and all waiting on allproc. Switch this to xlock -> xlock.
Longer term fix would get rid of proctree in this place to begin with.
Right now it is necessary to walk the session/process group lists to
determine which id is free. The walk can be avoided e.g. with bitmaps.

The exit path used to have one place which dealt with allproc and
then with proctree. Move the allproc acquire into the section protected
by proctree. This reduces contention against threads waiting on proctree
in the fork codepath - the fork proctree holder does not have to wait
for allproc as often.

Finally, move tidhash manipulation outside of the area protected by
either of these locks. The removal from the hash was already unprotected.
There is no legitimate reason to look up thread ids for a process still
under construction.

This results in about 50% wait time reduction during -j 128 package build.
2018-02-20 02:18:30 +00:00
Jeff Roberson
06220fa737 Further parallelize the buffer cache.
Provide multiple clean queues partitioned into 'domains'.  Each domain manages
its own bufspace and has its own bufspace daemon.  Each domain has a set of
subqueues indexed by the current cpuid to reduce lock contention on the cleanq.

Refine the sleep/wakeup around the bufspace daemon to use atomics as much as
possible.

Add a B_REUSE flag that is used to requeue bufs during the scan to approximate
LRU rather than locking the queue on every use of a frequently accessed buf.

Implement bufspace_reserve with only atomic_fetchadd to avoid loop restarts.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14274
2018-02-20 00:06:07 +00:00
Bryan Venteicher
88126356cf Add more virtqueue getter methods
MFC after:	2 weeks
2018-02-19 19:31:18 +00:00
Bryan Venteicher
985ed053e3 Add VirtIO bus config_generation method
VirtIO buses (PCI, MMIO) can provide a generation field so a driver
can ensure either a 64-bit or array read was stable.

MFC after:	2 weeks
2018-02-19 19:28:24 +00:00
Konstantin Belousov
9f74642385 Do not free(9) uninitialized pointer.
Reported and tested by:	allanjude
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
2018-02-19 19:08:25 +00:00
Bryan Venteicher
7a16dacdfa Add PCI methods to iterate over the PCI capabilities
VirtIO V1 provides configuration in multiple VENDOR capabilities so this
allows all of the configuration to be discovered.

Reviewed by:	jhb
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D14325
2018-02-19 18:41:56 +00:00
Hans Petter Selasky
e44fa94c09 Implement list_safe_reset_next() function macro in the LinuxKPI.
MFC after:	1 week
Submitted by:	Johannes Lundberg <johalun0@gmail.com>
Sponsored by:	Mellanox Technologies
Sponsored by:	Limelight Networks
2018-02-19 16:31:19 +00:00
Nathan Whitehorn
65184f89b6 Set internal error returns for OF_peer(), OF_child(), and OF_parent() to
zero, matching the IEEE 1275 standard. Since these internal error paths
have never, to my knowledge, been taken, behavior is unchanged.

Reported by:	gonzo
MFC after:	2 weeks
2018-02-19 15:49:14 +00:00
Andrey V. Elsukov
15bf717a93 Remove unused variables and sysctl declaration.
MFC after:	1 week
2018-02-19 12:20:51 +00:00
Andrey V. Elsukov
6ca39da354 Check packet length to do not make out of bounds access. Also save ah_nxt
value to use it later, since ah pointer can become invalid.

Reported by:	Maxime Villard <max at m00nbsd dot net>
MFC after:	5 days
2018-02-19 11:14:38 +00:00
Andriy Gapon
8cebd0e419 relax an assert in zfsctl_snapdir_lookup to match r323578
Since r323578 we may remove the last reference to a covered vnode with
vrele() instead of vput().  So, v_usecount may be decremented before
the vnode is locked and zfsctl_snapdir_lookup may "catch" the vnode
with v_usecount of zero and v_holdcnt of one.

PR:		225795
Reported by:	asomers
MFC after:	1 week
2018-02-19 08:55:22 +00:00
Hans Petter Selasky
0f839f3a6d When stepping the radix tree in the LinuxKPI make sure we
clear the least significant bits, so that no entries are
skipped.

MFC after:	1 week
Sponsored by:	Mellanox Technologies
2018-02-19 06:11:58 +00:00
Ian Lepore
eb69d1f144 Build at45d and mx25l SPI flash drivers as modules. 2018-02-19 01:49:19 +00:00
Ian Lepore
63cdf4affb Add ofw_bus_if.h to SRCS. 2018-02-19 01:39:02 +00:00
Ian Lepore
2aa5d9c4c8 Add modules/spi as a gathering point for SPI-related modules, analagous to
modules/i2c for i2c/iicbus modules.  Build spibus as a module.
2018-02-19 01:32:27 +00:00
Mateusz Guzik
2ca66c1ef5 Fix process exit vs reap race introduced in r329449
The race manifested itself mostly in terms of crashes with "spin lock
held too long".

Relevant parts of respective code paths:

exit:				reap:
PROC_LOCK(p);
PROC_SLOCK(p);
p->p_state == PRS_ZOMBIE
PROC_UNLOCK(p);
				PROC_LOCK(p);
/* exit work */
				if (p->p_state == PRS_ZOMBIE) /* true */
					proc_reap()
					free proc
/* more exit work */
PROC_SUNLOCK(p);

Thus a still exiting process is reaped.

Prior to the change the zombie check was followed by slock/sunlock trip
which prevented the problem.

Even code prior to this commit has a bug: the proc is still accessed for
statistic collection purposes. However, the severity is rather small and
the bug may be fixed in a future commit.

Reported by:	many
Tested by:	allanjude
2018-02-19 00:54:08 +00:00
Ian Lepore
a7e31772e7 Build ofw_iicbus as a module if OPT_FDT is defined. 2018-02-19 00:47:03 +00:00
Mateusz Guzik
d257698833 mtx: add mtx_spin_wait_unlocked
The primitive can be used to wait for the lock to be released. Intended
usage is for locks in structures which are about to be freed.

The benefit is the avoided interrupt enable/disable trip + atomic op to
grab the lock and shorter wait if the lock is held (since there is no
worry someone will contend on the lock, re-reads can be more aggressive).

Briefly discussed with:	 kib
2018-02-19 00:38:14 +00:00