Commit Graph

225074 Commits

Author SHA1 Message Date
kp
0725653713 libifconfig: style(9) fixes
Also switch from BSD 3-clause to 2-clause license where possible, and
consolidate duplicate 3-clause license into one.

Submitted by:	Marie Helene Kvello-Aune <marieheleneka@gmail.com>
Reviewed by:	cem, kp
Differential Revision:	https://reviews.freebsd.org/D7764
2016-09-04 20:55:27 +00:00
dim
6507f9fe2d Make some additional -Wconstant-conversion warnings from clang 3.9.0 in
bwn(4) non-fatal for now.
2016-09-04 17:56:55 +00:00
dim
a37cccc119 For kernel builds, instead of suppressing certain clang warnings, make
them non-fatal, so there is some incentive to fix them eventually.
2016-09-04 17:55:22 +00:00
andrew
9b7353a566 Enable superpages on arm64 by default. These seem to be stable, having
survived multiple world and kernel builds, and of poudriere building full
package sets.

I have observed a 3% reduction in buildworld times with superpages enabled,
however further testing is needed to see if this is observed in other
workloads.

Obtained from:	ABT Systems Ltd
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-09-04 17:50:23 +00:00
dim
e0d1234535 With clang 3.9.0, compiling sys/netinet/igmp.c results in the following
warning:

sys/netinet/igmp.c:546:21: error: implicit conversion from 'int' to 'char' changes value from 148 to -108 [-Werror,-Wconstant-conversion]
        p->ipopt_list[0] = IPOPT_RA;    /* Router Alert Option */
                         ~ ^~~~~~~~
sys/netinet/ip.h:153:19: note: expanded from macro 'IPOPT_RA'
#define IPOPT_RA                148             /* router alert */
                                ^~~

This is because ipopt_list is an array of char, so IPOPT_RA is wrapped
to a negative value.  It would be nice to change ipopt_list to an array
of u_char, but it changes the signature of the public struct ipoption,
so add an explicit cast to suppress the warning.

Reviewed by:	imp
MFC after:	3 days
Differential Revision: https://reviews.freebsd.org/D7777
2016-09-04 17:23:10 +00:00
dim
e0844cc82a With clang 3.9.0, compiling uplcom results in the following warnings:
sys/dev/usb/serial/uplcom.c:543:29: error: implicit conversion from 'int' to 'int8_t' (aka 'signed char') changes value from 192 to -64 [-Werror,-Wconstant-conversion]
        if (uplcom_pl2303_do(udev, UT_READ_VENDOR_DEVICE, UPLCOM_SET_REQUEST, 0x8484, 0, 1)
            ~~~~~~~~~~~~~~~~       ^~~~~~~~~~~~~~~~~~~~~
sys/dev/usb/usb.h:179:53: note: expanded from macro 'UT_READ_VENDOR_DEVICE'
#define UT_READ_VENDOR_DEVICE   (UT_READ  | UT_VENDOR | UT_DEVICE)
                                 ~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~

This is because UT_READ is 0x80, so the int8_t argument is wrapped to a
negative value.  Fix this by using uint8_t instead.

Reviewed by:	imp, hselasky
MFC after:	3 days
Differential Revision: https://reviews.freebsd.org/D7776
2016-09-04 16:59:35 +00:00
mjg
46ac794958 cache: defer freeing entries until after the global lock is dropped
This also defers vdrop for held vnodes.

Glanced at by:	kib
2016-09-04 16:52:14 +00:00
bde
fcad0aa7c4 Oops, the previous i386 version of e_fmodf.S and e_fmodl.S was
actually the amd64 version.
2016-09-04 15:08:14 +00:00
bde
df9121eecb Disconnect the "optimized" asm variants of cos(), sin() and tan() from
the build on i386.  Leave them in the source tree for regression tests.

The asm functions were always much less accurate (by a factor of more
than 10**18 in the worst case).  They were faster on old CPUs.  But
with each new generation of CPUs they get relatively slower.  The
double precision C version's average advantage is about a factor of 2
on Haswell.

The asm functions were already intentionally avoided in float and long
double precision on i386 and in all precisions on amd64.  Float
precision and amd64 give larger advantages to the C version.  The long
double precision C code and compilers' understanding of long double
precision are not so good, so the i387 is still slightly faster for
long double precision, except for the unimportant subcase of huge args
where the sub-optimal C code now somehow beats the i387 by about a
factor of 2.
2016-09-04 14:12:19 +00:00
mjg
1326a2ad24 fd: fix up fdeget_file
It was supposed to return NULL if a fp is not installed.

Facepalm-by: mjg
2016-09-04 13:31:57 +00:00
bde
95d1e1376d Add asm versions of fmod(), fmodf() and fmodl() on amd64. Add asm
versions of fmodf() amd fmodl() on i387.

fmod is similar to remainder, and the C versions are 3 to 9 times
slower than the asm versions on x86 for both, but we had the strange
mixture of all 6 variants of remainder in asm and only 1 of 6
variants of fmod in asm.
2016-09-04 12:22:14 +00:00
des
b8710acf41 Upgrade to Unbound 1.5.9. 2016-09-04 12:17:57 +00:00
bde
f2287da07a Fix missing fmodl() on arches with 53-bit long doubles.
PR:		199422, 211965
MFC after:	1 week
2016-09-04 12:01:32 +00:00
mjg
1304091d66 cache: manage negative entry list with a dedicated lock
Since negative entries are managed with a LRU list, a hit requires a
modificaton.

Currently the code tries to upgrade the global lock if needed and is
forced to retry the lookup if it fails.

Provide a dedicated lock for use when the cache is only shared-locked.

Reviewed by:	kib
MFC after:	1 week
2016-09-04 08:58:35 +00:00
mjg
42f18d4efb cache: put all negative entry management code into dedicated functions
Reviewed by:	kib
MFC after:	1 week
2016-09-04 08:55:15 +00:00
landonf
f01dd47816 bhndb(4): Fix probing of bhndb-attached bhnd_nvram devices.
This fixes bhnd(4) nvram handling on devices that map SPROM CSRs via PCI
configuration space.

The probe method previously required that a bhnd(4) device be attached to the
parent bridge; now that the bhnd_nvram device is always attached first, this
unnecessary sanity check always failed.

Approved by:	adrian (mentor, implicit)
2016-09-04 01:47:21 +00:00
landonf
ea428c80f2 bhndb(4): Skip disabled cores when performing bridge configuration probing.
On BCM4321 chipsets, both PCI and PCIe cores are included, with one of
the cores potentially left floating.

Since the PCI core appears first in the device table, and the PCI
profiles appear first in the resource configuration tables, this resulted in
incorrectly matching and using the PCI/v1 resource configuration on PCIe
devices, rather than the correct PCIe/v1 profile.

Approved by:	adrian (mentor, implicit)
2016-09-04 01:43:54 +00:00
landonf
820a65d528 siba(4): Add missing bhnd_device/bhnd_device_quirk table terminator entries.
This resulted in an over-read on siba chipsets that failed to match the
existing entries.

Approved by:	adrian (mentor, implicit)
2016-09-04 01:25:46 +00:00
landonf
43593d5819 Remove empty directories left by r299241, r302190, r304870, and r301410
Approved by:	adrian (mentor, implicit)
2016-09-04 01:17:16 +00:00
landonf
4e0cac59aa Migrate bhndb(4) to the new bhnd_erom API.
Adds support for probing and initializing bhndb(4) bridge state using
the bhnd_erom API, ensuring that full bridge configuration is available
*prior* to actually attaching and enumerating the bhnd(4) child device,
allowing us to safely allocate bus-level agent/device resources during
bhnd(4) bus enumeration.

- Add a bhnd_erom_probe() method usable by bhndb(4). This is an analogue
  to the existing bhnd_erom_probe_static() method, and allows the bhndb
  bridge to discover the best available erom parser class prior to newbus
  probing of its children.
- Add support for supplying identification hints when probing erom
  devices. This is required on early EXTIF-only chipsets, where chip
  identification registers are not available.
- Migrate bhndb over to the new bhnd_erom API, using bhnd_core_info
  records rather than bridged bhnd(4) device_t references to determine
  the bridged chipsets' capability/bridge configuration.
- The bhndb parent (e.g. if_bwn) is now required to supply a hardware
  priority table to the bridge. The default table is currently sufficient
  for our supported devices.
- Drop the two-pass attach approach we used for compatibility with bhndb(4) in
  the bhnd(4) bus drivers, and instead perform bus enumeration immediately,
  and allocate bridged per-child bus-level resources during that enumeration.

Approved by:	adrian (mentor)
Differential Revision:	https://reviews.freebsd.org/D7768
2016-09-04 00:58:19 +00:00
markj
486a2523f7 Micro-optimize sleepq_signal().
Lift a comparison out of the loop that finds the highest-priority thread
on the queue.

MFC after:	1 week
2016-09-04 00:29:48 +00:00
markj
6549c0692e Respect the caller's hints when performing swap readahead.
The pager getpages interface allows the caller to bound the number of
readahead and readbehind pages, and vm_fault_hold() makes use of this
feature. These bounds were ignored after r305056, causing the swap pager
to potentially page in more than the specified number of pages.

Reported and reviewed by:	alc
X-MFC with:	r305056
2016-09-04 00:25:49 +00:00
landonf
982e7e3df8 Implement a generic bhnd(4) device enumeration table API.
This defines a new bhnd_erom_if API, providing a common interface to device
enumeration on siba(4) and bcma(4) devices, for use both in the bhndb bridge
and SoC early boot contexts, and migrates mips/broadcom over to the new API.

This also replaces the previous adhoc device enumeration support implemented
for mips/broadcom.

Migration of bhndb to the new API will be implemented in a follow-up commit.


- Defined new bhnd_erom_if interface for bhnd(4) device enumeration, along
  with bcma(4) and siba(4)-specific implementations.
- Fixed a minor bug in bhndb that logged an error when we attempted to map the
  full siba(4) bus space (18000000-17FFFFFF) in the siba EROM parser.
- Reverted use of the resource's start address as the ChipCommon enum_addr in
  bhnd_read_chipid(). When called from bhndb, this address is found within the
  host address space, resulting in an invalid bridged enum_addr.
- Added support for falling back on standard bus_activate_resource() in
  bhnd_bus_generic_activate_resource(), enabling allocation of the bhnd_erom's
  bhnd_resource directly from a nexus-attached bhnd(4) device.
- Removed BHND_BUS_GET_CORE_TABLE(); it has been replaced by the erom API.
- Added support for statically initializing bhnd_erom instances, for use prior
  to malloc availability. The statically allocated buffer size is verified both
  at runtime, and via a compile-time assertion (see BHND_EROM_STATIC_BYTES).
- bhnd_erom classes are registered within a module via a linker set, allowing
  mips/broadcom to probe available EROM parser instances without creating a
  strong reference to bcma/siba-specific symbols.
- Migrated mips/broadcom to bhnd_erom_if, replacing the previous MIPS-specific
  device enumeration implementation.

Approved by:	adrian (mentor)
Differential Revision:	https://reviews.freebsd.org/D7748
2016-09-03 23:57:17 +00:00
ache
5c5caa6dec The bug:
$ echo x | awk '/[[:cntrl:]]/'
x

The NUL character in cntrl class truncates the pattern, and an empty
pattern matches anything. The patch skips NUL as a quick fix.

PR:     195792
Submitted by:   kdrakehp@zoho.com
Approved by:    bwk@cs.princeton.edu (the author)
MFC after:      3 days
2016-09-03 23:04:56 +00:00
markj
8b45da7d55 Remove redefinitions of some kernel types from mbuf.d.
These override the kernel's definitions and do not match in some cases,
which can break scripts that use these types. With r305055, dtrace is able
to trace fields of struct mbuf's anonymous structs and unions, so there is
no need to redefine types already defined in CTF.

MFC after:	3 days
2016-09-03 20:43:59 +00:00
markj
fb5804c98d Remove support for idle page zeroing.
Idle page zeroing has been disabled by default on all architectures since
r170816 and has some bugs that make it seemingly unusable. Specifically,
the idle-priority pagezero thread exacerbates contention for the free page
lock, and yields the CPU without releasing it in non-preemptive kernels. The
pagezero thread also does not behave correctly when superpage reservations
are enabled: its target is a function of v_free_count, which includes
reserved-but-free pages, but it is only able to zero pages belonging to the
physical memory allocator.

Reviewed by:	alc, imp, kib
Differential Revision:	https://reviews.freebsd.org/D7714
2016-09-03 20:38:13 +00:00
dim
b1940da869 With clang 3.9.0, compiling cxgb results in the following warning:
sys/dev/cxgb/cxgb_sge.c:2873:44: error: implicit conversion from 'int'
to 'char' changes value from 128 to -128 [-Werror,-Wconstant-conversion]
                        *mtod(m, char *) = CPL_ASYNC_NOTIF;
                                         ~ ^~~~~~~~~~~~~~~

This is because CPL_ASYNC_NOTIF is 0x80, so the plain char argument is
wrapped to a negative value.  Fix this by using uint8_t instead.

Reviewed by:	np
MFC after:	3 days
Differential Revision: https://reviews.freebsd.org/D7772
2016-09-03 19:01:11 +00:00
np
e3b310dd0a Use correct CTR<n> variant. 2016-09-03 18:54:26 +00:00
ngie
396ee48cd2 Update contrib/netbsd-tests with new content from NetBSD
This updates the snapshot from 09/30/2014 to 08/11/2016

This brings in a number of new testcases from upstream, most
notably:

- bin/cat
- lib/libc
- lib/msun
- lib/libthr
- usr.bin/sort

lib/libc/tests/stdio/open_memstream_test.c was moved to
lib/libc/tests/stdio/open_memstream2_test.c to accomodate
the new open_memstream test from NetBSD.

MFC after:	2 months
Tested on:	amd64 (VMware fusion VM; various bare metal platforms); i386 (VMware fusion VM); make tinderbox
Sponsored by:	EMC / Isilon Storage Division
2016-09-03 18:11:48 +00:00
ngie
3e19aacd4a Skip testcases 9/10 if jail(8) isn't installed
These testcases require jail support

MFC after:	1 week
Sponsored by:	EMC / Isilon Storage Division
2016-09-03 17:59:46 +00:00
ngie
3b13b049f0 Add a missing "Bail out!" if zpool create fails
This will make the exit info more meaningful if/when zpool create fails,
and establishes parity with the other 2 zfs acl testcases (01, 03).

MFC after:	3 days
Sponsored by:	EMC / Isilon Storage Division
2016-09-03 17:31:13 +00:00
andrew
466a2c378c Explicitly include all .rodata.* sections in the kernel .rodata. This
helps link the kernel with lld as it will then put all these into a single
.rodata section.

MFC after:	1 week
Sponsored by:	ABT Systems Ltd
2016-09-03 17:23:24 +00:00
jmcneill
8d12cd5993 Use the root key in the Security ID EFUSE (when valid) to generate a
MAC address instead of creating a random one each boot.
2016-09-03 15:28:09 +00:00
imp
72c4771413 Don't use -N to set the OMAGIC with data and text writeable and data
not page aligned. To do this, use the ld script gnu ld installs on my
system.

This is imperfect: LDFLAGS_BIN and LD_FLAGS_BIN describe different
things. The loader script could be better named and take into account
other architectures. And having two different mechanisms to do
basically the same thing needs study. However, it's blocking forward
progress on lld, so I'll work in parallel to sort these out.

Differential Revision: https://reviews.freebsd.org/D7409
Reviewed by: emaste
2016-09-03 15:26:28 +00:00
jmcneill
e25bfb37e5 Add support for Allwinner A64 thermal sensors. 2016-09-03 15:26:00 +00:00
jmcneill
5d135bbf6f Add cpu-supply xref to cpu@0 2016-09-03 15:24:30 +00:00
jmcneill
af0e7ee263 Add SID, THS, and CPU operating points. 2016-09-03 15:23:59 +00:00
jmcneill
228693754e Add support for reading root key on A83T/A64. 2016-09-03 15:22:50 +00:00
des
5e4e917055 import unbound 1.5.9 2016-09-03 15:08:13 +00:00
dim
0aa418be67 With clang 3.9.0, compiling ppbus(4) results in the following warnings:
sys/dev/ppbus/ppb_1284.c:296:46: error: implicit conversion from 'int'
to 'char' changes value from 144 to -112 [-Werror,-Wconstant-conversion]
        if ((error = do_peripheral_wait(bus, SELECT | nBUSY, 0))) {
                     ~~~~~~~~~~~~~~~~~~      ~~~~~~~^~~~~~~
sys/dev/ppbus/ppb_1284.c:785:48: error: implicit conversion from 'int'
to 'char' changes value from 240 to -16 [-Werror,-Wconstant-conversion]
                if (do_1284_wait(bus, nACK | SELECT | PERROR | nBUSY,
                    ~~~~~~~~~~~~      ~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~
sys/dev/ppbus/ppb_1284.c:786:29: error: implicit conversion from 'int'
to 'char' changes value from 240 to -16 [-Werror,-Wconstant-conversion]
                                        nACK | SELECT | PERROR | nBUSY)) {
                                        ~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~

This is because nBUSY is 0x80, so the plain char argument is wrapped to
a negative value.  Fix this in a minimal fashion, by using uint8_t in a
few places.

Reviewed by:	emaste
MFC after:	3 days
Differential Revision: https://reviews.freebsd.org/D7771
2016-09-03 13:48:44 +00:00
dim
397ee44a63 Define drmP.h's __OS_HAS_AGP and __OS_HAS_MTRR macros in a defined and
portable way.

Reviewed by:	dumbbell
MFC after:	3 days
Differential Revision: https://reviews.freebsd.org/D7770
2016-09-03 13:33:28 +00:00
emaste
a610805fee remove CONSTRUCTORS from MIPS uboot linker script
The linker script CONSTRUCTORS keyword is only meaningful "when linking
object file formats which do not support arbitrary sections, such as
ECOFF and XCOFF"[1] and is ignored for other object file formats.

LLVM's lld does not yet accept (and ignore) CONSTRUCTORS, so just remove
CONSTRUCTORS from the linker script as it has no effect.

[1] https://sourceware.org/binutils/docs/ld/Output-Section-Keywords.html
2016-09-03 13:01:37 +00:00
mav
25849e5c65 Missed FreeBSD-specific piece of r305338. 2016-09-03 11:17:33 +00:00
mav
b54e25cf10 MFC r305337: 7004 dmu_tx_hold_zap() does dnode_hold() 7x on same object
Using a benchmark which has 32 threads creating 2 million files in the
same directory, on a machine with 16 CPU cores, I observed poor
performance. I noticed that dmu_tx_hold_zap() was using about 30% of
all CPU, and doing dnode_hold() 7 times on the same object (the ZAP
object that is being held).

dmu_tx_hold_zap() keeps a hold on the dnode_t the entire time it is
running, in dmu_tx_hold_t:txh_dnode, so it would be nice to use the
dnode_t that we already have in hand, rather than repeatedly calling
dnode_hold(). To do this, we need to pass the dnode_t down through
all the intermediate calls that dmu_tx_hold_zap() makes, making these
routines take the dnode_t* rather than an objset_t* and a uint64_t
object number. In particular, the following routines will need to have
analogous *_by_dnode() variants created:

dmu_buf_hold_noread()
dmu_buf_hold()
zap_lookup()
zap_lookup_norm()
zap_count_write()
zap_lockdir()
zap_count_write()

This can improve performance on the benchmark described above by 100%,
from 30,000 file creations per second to 60,000. (This improvement is on
top of that provided by working around the object allocation issue. Peak
performance of ~90,000 creations per second was observed with 8 CPUs;
adding CPUs past that decreased performance due to lock contention.) The
CPU used by dmu_tx_hold_zap() was reduced by 88%, from 340 CPU-seconds
to 40 CPU-seconds.

Sponsored by: Intel Corp.

Closes #109

Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Author: Matthew Ahrens <mahrens@delphix.com>

openzfs/openzfs@d3e523d489
2016-09-03 11:00:29 +00:00
mav
c643a1ba03 MFV r305336: 7247 zfs receive of deduplicated stream fails
This resolves two 'zfs recv' issues. First, when receiving into an
existing filesystem, a snapshot created during the receive process is
not added to the guid->dataset map for the stream, resulting in failed
lookups for deduped streams when a WRITE_BYREF record refers to a
snapshot received earlier in the stream. Second, the newly created
snapshot was also not set properly, referencing the snapshot before the
new receiving dataset rather than the existing filesystem.

Closes #159

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Author: Chris Williamson <chris.williamson@delphix.com>

openzfs/openzfs@b09697c8c1
2016-09-03 10:59:05 +00:00
mav
4e3d6e73d3 MFV r305335: 7003 zap_lockdir() should tag hold
zap_lockdir() / zap_unlockdir() should take a "void *tag" argument which
tags the hold on the zap. This will help diagnose programming errors
which misuse the hold on the ZAP.

Sponsored by: Intel Corp.

Closes #108

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Author: Matthew Ahrens <mahrens@delphix.com>

openzfs/openzfs@0780b3eab5
2016-09-03 10:58:14 +00:00
mav
ed2cd23bbf MFV r304157:
7230 add assertions to dmu_send_impl() to verify that stream includes BEGIN and END records

illumos/illumos-gate@12b90ee2d3
https://github.com/illumos/illumos-gate/commit/12b90ee2d3b10689fc45f4930d2392f5f
e1d9cfa

https://www.illumos.org/issues/7230
  A test failure occurred where a send stream had only a BEGIN record. This
  should not be possible if the send returns without error. Prevented this from
  happening in the future by adding an assertion to dmu_send_impl() to verify
  that if the function returns 0 (success) both a BEGIN and END record are
  present. Did this by adding flags to dmu_sendarg_t (indicating whether BEGIN o
r
  END records sent), having dump_record() set flags appropriately, adding VERIFY
  statement to dmu_send_impl().

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matt Krantz <matt.krantz@delphix.com>
2016-09-03 10:10:58 +00:00
mav
5dc4a605c8 MFV r304156: 7235 remove unused func dsl_dataset_set_blkptr
illumos/illumos-gate@bd56f80007
https://github.com/illumos/illumos-gate/commit/bd56f80007857b960e0981ed0797ad8ec
844a96b

https://www.illumos.org/issues/7235
  The function dsl_dataset_set_blkptr() is unused. We should remove it.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
2016-09-03 10:09:23 +00:00
mav
64b6bd408d MFV r304159: 7277 zdb should be able to print zfs_dbgmsg's
illumos/illumos-gate@29bdd2f916
https://github.com/illumos/illumos-gate/commit/29bdd2f916366ece37c4748bca6b3d61f
57a223b

https://www.illumos.org/issues/7277
  ztest always prints the debug messages (zfs_dbgmsg()) by calling
  zfs_dbgmsg_print(). We should add a flag to zdb to make it do this as well
  before exiting.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
2016-09-03 10:07:46 +00:00
mav
3bf1abe8c9 MFV r304155: 7090 zfs should improve allocation order and throttle allocations
illumos/illumos-gate@0f7643c737
https://github.com/illumos/illumos-gate/commit/0f7643c7376dd69a08acbfc9d1d7d548b
10c846a

https://www.illumos.org/issues/7090
  When write I/Os are issued, they are issued in block order but the ZIO pipelin
e
  will drive them asynchronously through the allocation stage which can result i
n
  blocks being allocated out-of-order. It would be nice to preserve as much of
  the logical order as possible.
  In addition, the allocations are equally scattered across all top-level VDEVs
  but not all top-level VDEVs are created equally. The pipeline should be able t
o
  detect devices that are more capable of handling allocations and should
  allocate more blocks to those devices. This allows for dynamic allocation
  distribution when devices are imbalanced as fuller devices will tend to be
  slower than empty devices.
  The change includes a new pool-wide allocation queue which would throttle and
  order allocations in the ZIO pipeline. The queue would be ordered by issued
  time and offset and would provide an initial amount of allocation of work to
  each top-level vdev. The allocation logic utilizes a reservation system to
  reserve allocations that will be performed by the allocator. Once an allocatio
n
  is successfully completed it's scheduled on a given top-level vdev. Each top-
  level vdev maintains a maximum number of allocations that it can handle
  (mg_alloc_queue_depth). The pool-wide reserved allocations (top-levels *
  mg_alloc_queue_depth) are distributed across the top-level vdevs metaslab
  groups and round robin across all eligible metaslab groups to distribute the
  work. As top-levels complete their work, they receive additional work from the
  pool-wide allocation queue until the allocation queue is emptied.

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: George Wilson <george.wilson@delphix.com>
2016-09-03 10:04:37 +00:00