123574 Commits

Author SHA1 Message Date
Andrew Gallatin
5ccac9f972 lagg: allow lacp to manage the link state
Lacp needs to manage the link state itself. Unlike other
lagg protocols, the ability of lacp to pass traffic
depends not only on the lagg members having link, but also
on the lacp protocol converging to a distributing state with the
link partner.

If we prematurely mark the link as up, then we will send a
gratuitous arp (via arp_handle_ifllchange()) before the lacp
interface is capable of passing traffic. When this happens,
the gratuitous arp is lost, and our link partner may cache
a stale mac address (e.g., when the base mac address for the
lagg bundle changes, due to a BIOS change re-ordering NIC
unit numbers).
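
The link-state rule described above can be sketched as follows. This is a minimal illustration, not the lagg(4) code: demo_port_t, its fields, and demo_lagg_is_up() are hypothetical names, and the sketch assumes that for non-LACP protocols member link alone is enough.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-member state; not the real lagg/lacp structures. */
typedef struct demo_port {
	bool dp_has_link;	/* physical link is up */
	bool dp_distributing;	/* LACP has converged to distributing */
} demo_port_t;

/*
 * Report the aggregate as up only when at least one member can pass
 * traffic.  For LACP that requires both link and the distributing
 * state; for other protocols member link alone is assumed sufficient.
 */
static bool
demo_lagg_is_up(const demo_port_t *ports, size_t nports, bool is_lacp)
{
	for (size_t i = 0; i < nports; i++) {
		if (!ports[i].dp_has_link)
			continue;
		if (!is_lacp || ports[i].dp_distributing)
			return (true);
	}
	return (false);
}
```

With this rule a gratuitous ARP tied to the "up" transition would only be sent once a member is actually distributing.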

Reviewed by: jtl, hselasky
Sponsored by: Netflix
2018-08-13 14:13:25 +00:00
Michael Tuexen
839d21d62e Use the stcb instead of the asoc in state macros.
This is not a functional change. Just a preparation for upcoming
dtrace state change provider support.
2018-08-13 13:58:45 +00:00
Michael Tuexen
61a2188021 Consistently use the macros to modify the assoc state.
No functional change.
2018-08-13 11:56:21 +00:00
Michal Meloun
23242e7a9c Add USB ID for rebranded RTL8153 found on NVIDIA Jetson TX1 board.
MFC after:	3 days
2018-08-13 07:28:25 +00:00
Emmanuel Vadot
2421576ca3 Import DTS files from Linux 4.18
2018-08-13 06:40:20 +00:00
Matt Macy
20a3cbe1f8 fix static ZFS linking
Static linking of ZFS is a newish option and LINT doesn't include it
2018-08-12 21:04:53 +00:00
Justin Hibbits
54318d2a6a ipmi/opal: Enable polled mode and proper callback
Fix a NULL dereference that would occur any time an ioctl() was done, due to a
missing ipmi_enqueue_request callback.  Just use the default for now, until we
decide to properly enable IPMI interrupts.

Reported by:	kbowling
2018-08-12 20:33:55 +00:00
Michael Tuexen
812649d86f Add explicit cast to silence a warning for the userland stack.
Thanks to Felix Weinrank for providing the patch.
2018-08-12 14:05:15 +00:00
Navdeep Parhar
4a89444d7e Remove unused stuff from iw_cxgbe.h
2018-08-12 03:36:09 +00:00
Matt Macy
fb8f55f586 MFV/ZoL: Add dbuf hash and dbuf cache kstats
TODO: KSTAT_TYPE_NAMED support

commit 5e021f56d3437d3523904652fe3cc23ea1f4cb70
Author: Giuseppe Di Natale <dinatale2@users.noreply.github.com>
Date:   Mon Jan 29 10:24:52 2018 -0800

    Add dbuf hash and dbuf cache kstats

    Introduce kstats about the dbuf hash and dbuf cache
    to make it easier to inspect state. This should help
    with debugging and understanding of these portions
    of the codebase.

    Correct format of dbuf kstat file.

    Introduce a dbc column to dbufs kstat to indicate if
    a dbuf is in the dbuf cache.

    Introduce field filtering in the dbufstat python script.

    Introduce a no header option to the dbufstat python script.

    Introduce a test case to test basic mru->mfu list movement
    in the ARC.

    Reviewed-by: Tony Hutter <hutter2@llnl.gov>
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
    Closes #6906
2018-08-12 03:15:30 +00:00
Matt Macy
13ae5c6ba8 MFV/ZoL: Fix stack dbuf_hold_impl()
commit fc5bb51f08a6c91ff9ad3559d0266eeeab0b1f61
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Thu Aug 26 10:52:00 2010 -0700

    Fix stack dbuf_hold_impl()

    This commit preserves the recursive function dbuf_hold_impl() but moves
    the local variables and function arguments to the heap to minimize
    the stack frame size.  Enough space is initially allocated on the
    stack for 20 levels of recursion.  This technique was based on commit
    34229a2f2ac07363f64ddd63e014964fff2f0671 which reduced stack usage of
    traverse_visitbp().

    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2018-08-12 02:24:18 +00:00
Matt Macy
6e3d1345d9 fix build DN_MAX_BONUSLEN -> DN_OLD_MAX_BONUSLEN
2018-08-12 02:12:44 +00:00
Matt Macy
0f5add2566 Restore legacy dnode_phys layout on tier 2 arches
Evidently gcc4 doesn't support anonymous union members
2018-08-12 02:09:06 +00:00
Matt Macy
104ed324dd MFV/ZoL: Fix stack noinline
commit 60948de1ef976aabaa3630707bcc8b5867508507
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Thu Aug 26 10:58:36 2010 -0700

    Fix stack noinline

    Certain functions must never be automatically inlined by gcc because
    they are stack heavy or called recursively.  This patch flags all
    such functions I've found as 'noinline' to prevent gcc from making
    the optimization.

    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2018-08-12 01:29:30 +00:00
Matt Macy
71d48dbda3 MFV/ZoL: Fix PANIC: metaslab_free_dva(): bad DVA X:Y:Z
commit 81edd3e83409218879e7af293daa86b0c40eb015
Author: Peng <peng.hse@xtaotech.com>
Date:   Wed Jun 8 15:22:07 2016 +0800

    Fix PANIC: metaslab_free_dva(): bad DVA X:Y:Z

    The following scenario can result in garbage in the dn_spill field.
    The db->db_blkptr must be set to NULL when DNODE_FLAG_SPILL_BLKPTR
    is clear to ensure the dn_spill field is cleared.

    Current txg = A.
    * A new spill buffer is created. Its dbuf is initialized with
      db_blkptr = NULL and it's dirtied.

    Current txg = B.
    * The spill buffer is modified. It's marked as dirty in this txg.
    * Additional changes make the spill buffer unnecessary because the
      xattr fits into the bonus buffer, so it's removed. The dbuf is
      undirtied in this txg, but it's still referenced and cannot be
      destroyed.

    Current txg = C.
    * Starts syncing of txg A
    * dbuf_sync_leaf() is called for the spill buffer. Since db_blkptr
      is NULL, dbuf_check_blkptr() is called.
    * The dbuf starts being written and it reaches the ready state
      (not done yet).
    * A new change makes the spill buffer necessary again.
      sa_build_layouts() ends up calling dbuf_find() to locate the
      dbuf.  It finds the old dbuf because it has not been destroyed yet
      (it will be destroyed when the previous write is done and there
      are no more references). The old dbuf has db_blkptr != NULL.
    * txg A write is complete and the dbuf released. However it's still
      referenced, so it's not destroyed.

    Current txg = D.
    * Starts syncing of txg B
    * dbuf_sync_leaf() is called for the bonus buffer. Its contents are
      directly copied into the dnode, overwriting the blkptr area because,
      in txg B, the bonus buffer was big enough to hold the entire xattr.
    * At this point, the db_blkptr of the spill buffer used in txg C
      gets corrupted.

    Signed-off-by: Peng <peng.hse@xtaotech.com>
    Signed-off-by: Tim Chase <tim@chase2k.com>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Closes #3937
2018-08-12 01:17:32 +00:00
Matt Macy
6f06a36d47 MFV/ZoL: add dbuf stats
NB: disabled pending the addition of KSTAT_TYPE_RAW support to the
SPL

commit e0b0ca983d6897bcddf05af2c0e5d01ff66f90db
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Wed Oct 2 17:11:19 2013 -0700

    Add visibility in to cached dbufs

    Currently there is no mechanism to inspect which dbufs are being
    cached by the system.  There are some coarse counters in arcstats
    but they only give a rough idea of what's being cached.  This patch
    aims to improve the current situation by adding a new dbufs kstat.

    When read this new kstat will walk all cached dbufs linked in to
    the dbuf_hash.  For each dbuf it will dump detailed information
    about the buffer.  It will also dump additional information about
    the referenced arc buffer and its related dnode.  This provides a
    more complete view in to exactly what is being cached.

    With this generic infrastructure in place utilities can be written
    to post-process the data to understand exactly how the caching is
    working.  For example, the data could be processed to show a list
    of all cached dnodes and how much space they're consuming.  Or a
    similar list could be generated based on dnode type.  Many other
    ways to interpret the data exist based on what kinds of questions
    you're trying to answer.

    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Prakash Surya <surya1@llnl.gov>
2018-08-12 01:10:18 +00:00
Matt Macy
cc0fbbb92e MFV/ZoL: Implement large_dnode pool feature
commit 50c957f702ea6d08a634e42f73e8a49931dd8055
Author: Ned Bass <bass6@llnl.gov>
Date:   Wed Mar 16 18:25:34 2016 -0700

    Implement large_dnode pool feature

    Justification
    -------------

    This feature adds support for variable length dnodes. Our motivation is
    to eliminate the overhead associated with using spill blocks.  Spill
    blocks are used to store system attribute data (i.e. file metadata) that
    does not fit in the dnode's bonus buffer. By allowing a larger bonus
    buffer area the use of a spill block can be avoided.  Spill blocks
    potentially incur an additional read I/O for every dnode in a dnode
    block. As a worst case example, reading 32 dnodes from a 16k dnode block
    and all of the spill blocks could issue 33 separate reads. Now suppose
    those dnodes have size 1024 and therefore don't need spill blocks.  Then
    the worst case number of blocks read is reduced from 33 to two--one
    per dnode block. In practice spill blocks may tend to be co-located on
    disk with the dnode blocks so the reduction in I/O would not be this
    drastic. In a badly fragmented pool, however, the improvement could be
    significant.
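
The worst-case arithmetic above can be reproduced with a short sketch. The function name and constant are hypothetical, and the model is deliberately simple: one read per dnode block, plus one read per dnode when every dnode needs a spill block.

```c
#define	DEMO_DNODE_BLOCK_SIZE	16384u	/* 16k dnode block, as in the text */

/*
 * Worst-case number of reads needed to fetch ndnodes dnodes of the
 * given size, assuming each spill block costs one additional read.
 */
static unsigned
demo_worst_case_reads(unsigned ndnodes, unsigned dnode_size, int needs_spill)
{
	unsigned per_block = DEMO_DNODE_BLOCK_SIZE / dnode_size;
	unsigned blocks = (ndnodes + per_block - 1) / per_block;

	return (blocks + (needs_spill ? ndnodes : 0));
}
```

For 32 dnodes of 512 bytes with spill blocks this gives 1 + 32 = 33 reads; for 32 dnodes of 1024 bytes without spill blocks it gives 2.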

    ZFS-on-Linux systems that make heavy use of extended attributes would
    benefit from this feature. In particular, ZFS-on-Linux supports the
    xattr=sa dataset property which allows file extended attribute data
    to be stored in the dnode bonus buffer as an alternative to the
    traditional directory-based format. Workloads such as SELinux and the
    Lustre distributed filesystem often store enough xattr data to force
    spill blocks when xattr=sa is in effect. Large dnodes may therefore
    provide a performance benefit to such systems.

    Other use cases that may benefit from this feature include files with
    large ACLs and symbolic links with long target names. Furthermore,
    this feature may be desirable on other platforms in case future
    applications or features are developed that could make use of a
    larger bonus buffer area.

    Implementation
    --------------

    The size of a dnode may be a multiple of 512 bytes up to the size of
    a dnode block (currently 16384 bytes). A dn_extra_slots field was
    added to the current on-disk dnode_phys_t structure to describe the
    size of the physical dnode on disk. The 8 bits for this field were
    taken from the zero filled dn_pad2 field. The field represents how
    many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
    This convention results in a value of 0 for 512 byte dnodes which
    preserves on-disk format compatibility with older software.

    Similarly, the in-memory dnode_t structure has a new dn_num_slots field
    to represent the total number of dnode_phys_t slots consumed on disk.
    Thus dn->dn_num_slots is 1 greater than the corresponding
    dnp->dn_extra_slots. This difference in convention was adopted
    because, unlike on-disk structures, backward compatibility is not a
    concern for in-memory objects, so we used a more natural way to
    represent size for a dnode_t.
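
The two slot conventions can be illustrated with a small sketch. The helper names and constants below are hypothetical and are not the macros used in the actual change; only the relationships (512-byte slots, 16384-byte block, num_slots = extra_slots + 1) come from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	DEMO_DNODE_SLOT_SIZE	512u	/* one dnode_phys_t slot */
#define	DEMO_DNODE_BLOCK_SIZE	16384u	/* current dnode block size */

/* In-memory convention: total slots = on-disk extra slots + 1. */
static unsigned
demo_num_slots(uint8_t extra_slots)
{
	return ((unsigned)extra_slots + 1);
}

/* On-disk size in bytes of a dnode with the given extra-slot count. */
static size_t
demo_dnode_size(uint8_t extra_slots)
{
	size_t size = (size_t)demo_num_slots(extra_slots) *
	    DEMO_DNODE_SLOT_SIZE;

	/* A dnode may not be larger than its dnode block. */
	assert(size <= DEMO_DNODE_BLOCK_SIZE);
	return (size);
}
```

An extra-slot count of 0 maps to a 512-byte dnode, which is why a zero-filled pad field reads as a legacy dnode.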

    The default size for newly created dnodes is determined by the value of
    a new "dnodesize" dataset property. By default the property is set to
    "legacy" which is compatible with older software. Setting the property
    to "auto" will allow the filesystem to choose the most suitable dnode
    size. Currently this just sets the default dnode size to 1k, but future
    code improvements could dynamically choose a size based on observed
    workload patterns. Dnodes of varying sizes can coexist within the same
    dataset and even within the same dnode block. For example, to enable
    automatically-sized dnodes, run

     # zfs set dnodesize=auto tank/fish

    The user can also specify literal values for the dnodesize property.
    These are currently limited to powers of two from 1k to 16k. The
    power-of-2 limitation is only for simplicity of the user interface.
    Internally the implementation can handle any multiple of 512 up to 16k,
    and consumers of the DMU API can specify any legal dnode value.

    The size of a new dnode is determined at object allocation time and
    stored as a new field in the znode in-memory structure. New DMU
    interfaces are added to allow the consumer to specify the dnode size
    that a newly allocated object should use. Existing interfaces are
    unchanged to avoid having to update every call site and to preserve
    compatibility with external consumers such as Lustre. The new
    interfaces names are given below. The versions of these functions that
    don't take a dnodesize parameter now just call the _dnsize() versions
    with a dnodesize of 0, which means use the legacy dnode size.

    New DMU interfaces:
      dmu_object_alloc_dnsize()
      dmu_object_claim_dnsize()
      dmu_object_reclaim_dnsize()

    New ZAP interfaces:
      zap_create_dnsize()
      zap_create_norm_dnsize()
      zap_create_flags_dnsize()
      zap_create_claim_norm_dnsize()
      zap_create_link_dnsize()

    The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
    spa_maxdnodesize() function should be used to determine the maximum
    bonus length for a pool.

    These are a few noteworthy changes to key functions:

    * The prototype for dnode_hold_impl() now takes a "slots" parameter.
      When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
      ensure the hole at the specified object offset is large enough to
      hold the dnode being created. The slots parameter is also used
      to ensure a dnode does not span multiple dnode blocks. In both of
      these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
      these failure cases are only possible when using DNODE_MUST_BE_FREE.

      If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
      dnode_hold_impl() will check if the requested dnode is already
      consumed as an extra dnode slot by a large dnode, in which case
      it returns ENOENT.

    * The function dmu_object_alloc() advances to the next dnode block
      if dnode_hold_impl() returns an error for a requested object.
      This is because the beginning of the next dnode block is the only
      location it can safely assume to either be a hole or a valid
      starting point for a dnode.

    * dnode_next_offset_level() and other functions that iterate
      through dnode blocks may no longer use a simple array indexing
      scheme. These now use the current dnode's dn_num_slots field to
      advance to the next dnode in the block. This is to ensure we
      properly skip the current dnode's bonus area and don't interpret it
      as a valid dnode.
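
The slot-based traversal in the last bullet can be sketched as follows. demo_slot_t and demo_count_dnodes() are hypothetical stand-ins, not the real dnode_phys_t or dnode_next_offset_level(); the sketch only shows why the index advances by the slot count rather than by one.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one 512-byte dnode_phys_t slot. */
typedef struct demo_slot {
	uint8_t ds_allocated;	/* nonzero if a dnode starts in this slot */
	uint8_t ds_extra_slots;	/* extra slots consumed after this one */
} demo_slot_t;

/*
 * Count the dnodes that start in a block of nslots slots.  An allocated
 * dnode advances the index past its extra slots, so its interior is
 * never interpreted as a dnode header; a free slot advances by one.
 */
static size_t
demo_count_dnodes(const demo_slot_t *slots, size_t nslots)
{
	size_t count = 0;
	size_t i = 0;

	while (i < nslots) {
		if (slots[i].ds_allocated) {
			count++;
			i += (size_t)slots[i].ds_extra_slots + 1;
		} else {
			i++;
		}
	}
	return (count);
}
```

A plain `i++` loop would misread the interior of a large dnode as further dnode headers.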

    zdb
    ---
    The zdb command was updated to display a dnode's size under the
    "dnsize" column when the object is dumped.

    For ZIL create log records, zdb will now display the slot count for
    the object.

    ztest
    -----
    Ztest chooses a random dnodesize for every newly created object. The
    random distribution is more heavily weighted toward small dnodes to
    better simulate real-world datasets.

    Unused bonus buffer space is filled with non-zero values computed from
    the object number, dataset id, offset, and generation number.  This
    helps ensure that the dnode traversal code properly skips the interior
    regions of large dnodes, and that these interior regions are not
    overwritten by data belonging to other dnodes. A new test visits each
    object in a dataset. It verifies that the actual dnode size matches what
    was stored in the ztest block tag when it was created. It also verifies
    that the unused bonus buffer space is filled with the expected data
    patterns.

    ZFS Test Suite
    --------------
    Added six new large dnode-specific tests, and integrated the dnodesize
    property into existing tests for zfs allow and send/recv.

    Send/Receive
    ------------
    ZFS send streams for datasets containing large dnodes cannot be received
    on pools that don't support the large_dnode feature. A send stream with
    large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
    unrecognized by an incompatible receiving pool so that the zfs receive
    will fail gracefully.

    While not implemented here, it may be possible to generate a
    backward-compatible send stream from a dataset containing large
    dnodes. The implementation may be tricky, however, because the send
    object record for a large dnode would need to be resized to a 512
    byte dnode, possibly kicking in a spill block in the process. This
    means we would need to construct a new SA layout and possibly
    register it in the SA layout object. The SA layout is normally just
    sent as an ordinary object record. But if we are constructing new
    layouts while generating the send stream we'd have to build the SA
    layout object dynamically and send it at the end of the stream.

    For sending and receiving between pools that do support large dnodes,
    the drr_object send record type is extended with a new field to store
    the dnode slot count. This field was repurposed from unused padding
    in the structure.

    ZIL Replay
    ----------
    The dnode slot count is stored in the uppermost 8 bits of the lr_foid
    field. The bits were unused as the object id is currently capped at
    48 bits.
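
A minimal sketch of this packing, assuming a 64-bit field with the object id in the low 48 bits and the slot value in bits 56 to 63. The helper names are hypothetical, and whether the real code stores the slot count or the count minus one is not stated here; the sketch stores the value unchanged.

```c
#include <assert.h>
#include <stdint.h>

#define	DEMO_OBJ_BITS	48	/* object ids are capped at 48 bits */
#define	DEMO_SLOT_SHIFT	56	/* uppermost 8 bits of the 64-bit field */
#define	DEMO_OBJ_MASK	((UINT64_C(1) << DEMO_OBJ_BITS) - 1)

/* Pack an object id and an 8-bit slot value into one 64-bit field. */
static uint64_t
demo_foid_pack(uint64_t obj, uint8_t slots)
{
	assert((obj & ~DEMO_OBJ_MASK) == 0);
	return (obj | ((uint64_t)slots << DEMO_SLOT_SHIFT));
}

/* Recover the object id from the low 48 bits. */
static uint64_t
demo_foid_obj(uint64_t foid)
{
	return (foid & DEMO_OBJ_MASK);
}

/* Recover the slot value from the uppermost 8 bits. */
static uint8_t
demo_foid_slots(uint64_t foid)
{
	return ((uint8_t)(foid >> DEMO_SLOT_SHIFT));
}
```

A record written with a slot value of 0 is bit-for-bit identical to a plain object id, so older records decode unchanged.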

    Resizing Dnodes
    ---------------
    It should be possible to resize a dnode when it is dirtied if the
    current dnodesize dataset property differs from the dnode's size, but
    this functionality is not currently implemented. Clearly a dnode can
    only grow if there are sufficient contiguous unused slots in the
    dnode block, but it should always be possible to shrink a dnode.
    Growing dnodes may be useful to reduce fragmentation in a pool with
    many spill blocks in use. Shrinking dnodes may be useful to allow
    sending a dataset to a pool that doesn't support the large_dnode
    feature.

    Feature Reference Counting
    --------------------------
    The reference count for the large_dnode pool feature tracks the
    number of datasets that have ever contained a dnode of size larger
    than 512 bytes. The first time a large dnode is created in a dataset
    the dataset is converted to an extensible dataset. This is a one-way
    operation and the only way to decrement the feature count is to
    destroy the dataset, even if the dataset no longer contains any large
    dnodes. The complexity of reference counting on a per-dnode basis was
    too high, so we chose to track it on a per-dataset basis similarly to
    the large_block feature.

    Signed-off-by: Ned Bass <bass6@llnl.gov>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Closes #3542
2018-08-12 00:45:53 +00:00
Matt Macy
9f3a171221 Enable balanced arc pruning
Taken from:
commit f6046738365571bd647f804958dfdff8a32fbde4
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Sat May 30 09:57:53 2015 -0500

    Make arc_prune() asynchronous

    As described in the comment above arc_adapt_thread() it is critical
    that the arc_adapt_thread() function never sleep while holding a hash
    lock.  This behavior was possible in the Linux implementation because
    the arc_prune() logic was implemented to be synchronous.  Under
    illumos the analogous dnlc_reduce_cache() function is asynchronous.

    To address this the arc_do_user_prune() function has been reworked
    in to two new functions as follows:

    * arc_prune_async() is an asynchronous implementation which dispatches
    the prune callback to be run by the system taskq.  This makes it
    suitable to use in the context of the arc_adapt_thread().

    * arc_prune() is a synchronous implementation which depends on the
    arc_prune_async() implementation but blocks until the outstanding
    callbacks complete.  This is used in arc_kmem_reap_now() where it
    is safe, and expected, that memory will be freed.

    This patch additionally adds the zfs_arc_meta_strategy module option
    which allows the meta reclaim strategy to be configured.  It defaults
    to a balanced strategy which has been proved to work well under Linux
    but the illumos meta-only strategy can be enabled.

    Signed-off-by: Tim Chase <tim@chase2k.com>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2018-08-11 22:01:52 +00:00
Navdeep Parhar
37310a98a8 cxgbe(4): Move all control queues to the adapter.
There used to be one control queue per adapter (the mgmtq) that was
initialized during adapter init and one per port that was initialized
later during port init.  This change moves all the control queues (one
per port/channel) to the adapter so that they are initialized during
adapter init and are available before any port is up.  This allows the
driver to issue ctrlq work requests over any channel without having to
bring up any port.

MFH:		2 weeks
Sponsored by:	Chelsio Communications
2018-08-11 21:10:08 +00:00
Matt Macy
d815f5ba09 buildworld fix: private appears to have special meaning on FreeBSD - revert to priv
2018-08-11 20:41:42 +00:00
Matt Macy
6b55e6fb04 Limit the amount of dnode metadata in the ARC
In addition import most recent arc_prune_async implementation as dependency

commit 25458cbef9e59ef9ee6a7e729ab2522ed308f88f
Author: Tim Chase <tim@chase2k.com>
Date:   Wed Jul 13 07:42:40 2016 -0500

    Limit the amount of dnode metadata in the ARC

    Metadata-intensive workloads can cause the ARC to become permanently
    filled with dnode_t objects as they're pinned by the VFS layer.
    Subsequent data-intensive workloads may only benefit from about
    25% of the potential ARC (arc_c_max - arc_meta_limit).

    In order to help track metadata usage more precisely, the other_size
    metadata arcstat has been replaced with dbuf_size, dnode_size and bonus_size.

    The new zfs_arc_dnode_limit tunable, which defaults to 10% of
    zfs_arc_meta_limit, defines the minimum number of bytes which is desirable
    to be consumed by dnodes.  Attempts to evict non-metadata will trigger
    async prune tasks if the space used by dnodes exceeds this limit.

    The new zfs_arc_dnode_reduce_percent tunable specifies the amount by
    which the excess dnode space is attempted to be pruned as a percentage of
    the amount by which zfs_arc_dnode_limit is being exceeded.  By default,
    it tries to unpin 10% of the dnodes.
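
The interaction of the two tunables can be sketched as plain arithmetic. The function names are hypothetical and the sketch works purely in bytes; how the real code converts the result into a number of objects to prune is not shown here.

```c
#include <stdint.h>

/* Dnode limit as a percentage of the metadata limit (10% by default). */
static uint64_t
demo_dnode_limit(uint64_t meta_limit, unsigned limit_percent)
{
	/* Multiply first to keep precision; fine for realistic sizes. */
	return (meta_limit * limit_percent / 100);
}

/*
 * Amount to try to prune: reduce_percent of the excess over the limit,
 * or nothing when dnode usage is at or below the limit.
 */
static uint64_t
demo_prune_target(uint64_t dnode_size, uint64_t dnode_limit,
    unsigned reduce_percent)
{
	if (dnode_size <= dnode_limit)
		return (0);
	return ((dnode_size - dnode_limit) * reduce_percent / 100);
}
```

For example, with a metadata limit of 3000 units the default dnode limit is 300; dnode usage of 500 is 200 over the limit, so the default 10% reduction targets 20 units.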

    The problem of dnode metadata pinning was observed with the following
    testing procedure (in this example, zfs_arc_max is set to 4GiB):

        - Create a large number of small files until arc_meta_used exceeds
          arc_meta_limit (3GiB with default tuning) and arc_prune
          starts increasing.

        - Create a 3GiB file with dd.  Observe arc_meta_used.  It will still
          be around 3GiB.

        - Repeatedly read the 3GiB file and observe arc_meta_limit as before.
          It will continue to stay around 3GiB.

    With this modification, space for the 3GiB file is gradually made
    available as subsequent demands on the ARC are made.  The previous behavior
    can be restored by setting zfs_arc_dnode_limit to the same value as the
    zfs_arc_meta_limit.

    Signed-off-by: Tim Chase <tim@chase2k.com>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Issue #4345
    Issue #4512
    Issue #4773
    Closes #4858
2018-08-11 19:45:04 +00:00
Alan Cox
c65ed2ff53 Eliminate a redundant assignment.
MFC after:	1 week
2018-08-11 19:21:53 +00:00
Kristof Provost
e9ddca4a40 pf: Take the IF_ADDR_RLOCK() when iterating over the group list
We did do this elsewhere in pf, but the lock was missing here.

Sponsored by:	Essen Hackathon
2018-08-11 16:37:55 +00:00
Kristof Provost
33b242b533 pf: Fix 'set skip on' for groups
The pfi_skip_if() function sometimes caused skipping of groups to work,
if the members of the group used the groupname as a name prefix.
This is often the case, e.g. group lo usually contains lo0, lo1, ...,
but not always.

Rather than relying on the name explicitly check for group memberships.
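
The difference between the two checks can be sketched as follows. The types and helpers are hypothetical simplifications, not the pf data structures or the real pfi_skip_if(); they only contrast a name-prefix match with an explicit membership test.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical interface record with a NULL-terminated group list. */
typedef struct demo_ifnet {
	const char *di_name;
	const char *const *di_groups;
} demo_ifnet_t;

/* Name heuristic: matches only if the name starts with the group name. */
static int
demo_match_by_prefix(const demo_ifnet_t *ifp, const char *group)
{
	return (strncmp(ifp->di_name, group, strlen(group)) == 0);
}

/* Explicit check: walk the interface's group list. */
static int
demo_match_by_membership(const demo_ifnet_t *ifp, const char *group)
{
	for (const char *const *g = ifp->di_groups; *g != NULL; g++) {
		if (strcmp(*g, group) == 0)
			return (1);
	}
	return (0);
}
```

An interface named em0 that belongs to a group called egress is missed by the prefix check but found by the membership check, while lo0 in group lo is found by both.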

Obtained from:	OpenBSD (pf_if.c,v 1.62, pf_if.c,v 1.63)
Sponsored by:	Essen Hackathon
2018-08-11 16:34:30 +00:00
Navdeep Parhar
3098bcfc05 cxgbe(4): Create two variants of service_iq, one for queues with
freelists and one for those without.

MFH:		3 weeks
Sponsored by:	Chelsio Communications
2018-08-11 04:55:47 +00:00
Matt Macy
90df93417e ZFS/MFV: Use cached feature info in spa_add_feature_stats()
commit 417104bdd3c7ce07ec58674dd078f9891c3bc780
Author: Ned Bass <bass6@llnl.gov>
Date:   Thu Feb 26 12:24:11 2015 -0800

    Use cached feature info in spa_add_feature_stats()

    Avoid issuing I/O to the pool when retrieving feature flags information.
    Trying to read the ZAPs from disk means that zpool clear would hang if
    the pool is suspended and recovery would require a reboot. To keep the
    feature stats resident in memory, we hang a cached nvlist off of the
    spa.  It is built up from disk the first time spa_add_feature_stats() is
    called, and refreshed thereafter using the cached feature reference
    counts. spa_add_feature_stats() gets called at pool import time so we
    can be sure the cached nvlist will be available if the pool is later
    suspended.

    Signed-off-by: Ned Bass <bass6@llnl.gov>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Closes #3082
2018-08-10 23:42:11 +00:00
Devin Teske
ab9ed8a1bd Fix misspellings of transmitter/transmitted
Reviewed by:	emaste, bcr
Sponsored by:	Smule, Inc.
Differential Revision:	https://reviews.freebsd.org/D16025
2018-08-10 20:37:32 +00:00
Conrad Meyer
f053ca1f08 Walk back r337554 while discussion continues
The idea was to get the uncontroversial mechanical change out of the way,
then get the meatier functional changes reviewed subsequently.  I had not
realized that the immediately adjacent issue was addressed in a different
direction in r334506 (see Warner's guidance in D15592).

Discussion continues, trying to determine if there is a secondary issue
still[1] and how best to fix it.  With 12-related activities coming up,
while that is ongoing, just take this back for now.

[1]: Shutdown-time eventhandler events fire normally during panic's reboot
path.  Driver callbacks that attempt to issue and wait on interrupt-
completed IO may never complete, hanging the system.  This is particularly
obnoxious in the shutdown/panic path, as the debugger cannot be entered
anymore and the hang prevents reboot restoring availability.

(There's nothing CAM-specific about this problem -- any shutdown
event-triggered driver could do something like this during panic.  But most
NICs, etc.  don't try to send spin-down commands at shutdown. ;-))

Discussed with:	imp, markj
2018-08-10 19:19:07 +00:00
Kyle Evans
0915d9d070 subr_prf: remove think-o that had returned to local patch
Reported by:	cognet
2018-08-10 15:35:02 +00:00
Kyle Evans
170bc29131 boot tagging: minor fixes
msgbufinit may be called multiple times as we initialize the msgbuf into a
progressively larger buffer. This doesn't happen as of now on head, but it
may happen in the future and we generally support this. As such, only print
the boot tag if we've just initialized the buffer for the first time.

The boot tag also now has a newline appended to it for better visibility,
and has been switched to a normal printf, by request of bde, after we've
denoted that the msgbuf is mapped.
2018-08-10 15:29:06 +00:00
Warner Losh
7e299411ac Bring in timespec_get from NetBSD.
Bring in the functionality for timespec_get from NetBSD. I've lightly
edited the .c file to remove _DIAGASSERT because FreeBSD doesn't have
that functionality and the typical #define'ing it to assert isn't
right here. The man page is verbatim from NetBSD, but will be revised
as part of a larger cleanup of the time man pages (they are
inconsistent and vague in all the wrong places).

Differential Review: https://reviews.freebsd.org/D16649
2018-08-10 15:16:30 +00:00
Kyle Evans
84c956df77 ath: Minor style cleanups
device_printf => DPRINTF and two whitespace adjustments

Submitted by:	Augustin Cavalier <waddlesplash@gmail.com>
Obtained from:	Haiku (4a88aa503ad4155a20931e263d24343043994ea9)
MFC after:	1 week
2018-08-10 13:38:23 +00:00
Kyle Evans
8e0cc51b87 ieee80211_node: fix whitespace issues
Submitted by:	Augustin Cavalier <waddlesplash@gmail.com>
Obtained from:	Haiku (dffc3e235360cd7b71261239ee8507b7d62a1471)
MFC after:	1 week
2018-08-10 13:34:23 +00:00
Kyle Evans
58a7c4bfcf net80211: Drain ageq before cleaning it up.
The comment above ieee80211_ageq_cleanup specifically notes that the queue
is assumed to be empty, and in order to make it so, ieee80211_ageq_drain
must be used.

Submitted by:	Augustin Cavalier <waddlesplash@gmail.com>
Obtained from:	Haiku (dffc3e235360cd7b71261239ee8507b7d62a1471)
MFC after:	1 week
2018-08-10 13:32:02 +00:00
Kyle Evans
060b3e4ff1 bwi(4): Set ic->ic_softc before bwi_getradiocaps to avoid bad deref
Submitted by:	François Revol <revol@free.fr>
Obtained from:	Haiku (ba88131cfde64e21bedb4ebedd699cfa5e7fd314)
MFC after:	1 week
2018-08-10 13:06:14 +00:00
Andrey V. Elsukov
16bbf600d9 Remove unneeded ipsec-related includes.
Reviewed by:	rrs
Differential Revision:	https://reviews.freebsd.org/D16637
2018-08-10 07:24:01 +00:00
Matt Macy
648cfe57fd Performance optimization of AVL tree comparator functions
MFV:
commit ee36c709c3d5f7040e1bd11f5c75318aa03e789f
Author: Gvozden Neskovic <neskovic@gmail.com>
Date:   Sat Aug 27 20:12:53 2016 +0200

    perf: 2.75x faster ddt_entry_compare()
        The first 256 bits of ddt_key_t are a block checksum, which is expected
    to be close to random data. Hence, on average, comparison only needs to
    look at first few bytes of the keys. To reduce number of conditional
    jump instructions, the result is computed as: sign(memcmp(k1, k2)).

    Sign of an integer 'a' can be obtained as: `(0 < a) - (a < 0)` := {-1, 0, 1} ,
    which is computed efficiently.  Synthetic performance evaluation of
    original and new algorithm over 1G random keys on 2.6GHz Intel(R) Xeon(R)
    CPU E5-2660 v3:

    old     6.85789 s
    new     2.49089 s
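
The sign trick can be sketched as below. demo_key_t is a hypothetical 256-bit key standing in for the checksum portion of ddt_key_t, and the function names are illustrative rather than the ones used in the patch.

```c
#include <stdint.h>
#include <string.h>

/* Branch-free three-way sign: returns -1, 0, or 1. */
static inline int
demo_sign(int a)
{
	return ((0 < a) - (a < 0));
}

/* Hypothetical 256-bit key; not the real ddt_key_t layout. */
typedef struct demo_key {
	uint8_t dk_bytes[32];
} demo_key_t;

/* Comparator normalised to {-1, 0, 1}, as AVL tree callbacks expect. */
static int
demo_key_compare(const void *x1, const void *x2)
{
	const demo_key_t *k1 = x1;
	const demo_key_t *k2 = x2;

	return (demo_sign(memcmp(k1->dk_bytes, k2->dk_bytes,
	    sizeof (k1->dk_bytes))));
}
```

memcmp() only guarantees the sign of its result, not its magnitude, so the normalisation step is what makes the value safe to return from a comparator that must yield exactly -1, 0, or 1.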

    perf: 2.8x faster vdev_queue_offset_compare() and vdev_queue_timestamp_compare()
        Compute the result directly instead of using conditionals

    perf: zfs_range_compare()
        Speedup between 1.1x - 2.5x, depending on compiler version and
    optimization level.

    perf: spa_error_entry_compare()
        `bcmp()` is not suitable for comparator use. Use `memcmp()` instead.

    perf: 2.8x faster metaslab_compare() and metaslab_rangesize_compare()
    perf: 2.8x faster zil_bp_compare()
    perf: 2.8x faster mze_compare()
    perf: faster dbuf_compare()
    perf: faster compares in spa_misc
    perf: 2.8x faster layout_hash_compare()
    perf: 2.8x faster space_reftree_compare()
    perf: libzfs: faster avl tree comparators
    perf: guid_compare()
    perf: dsl_deadlist_compare()
    perf: perm_set_compare()
    perf: 2x faster range_tree_seg_compare()
    perf: faster unique_compare()
    perf: faster vdev_cache _compare()
    perf: faster vdev_uberblock_compare()
    perf: faster fuid _compare()
    perf: faster zfs_znode_hold_compare()

    Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
    Signed-off-by: Richard Elling <richard.elling@gmail.com>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Closes #5033
2018-08-10 06:42:08 +00:00
Justin Hibbits
7d849dc1a4 powerpc: Add lwsync and ptesync 'sync' opcode variants to ddb disassembler
The canonical form of sync is:

  sync L, E (if Category Elemental Memory Barriers implemented)

The L bits (2) denote the type of sync:

  0 -- hwsync
  1 -- lwsync
  2 -- ptesync or hwsync

It's been found that most 32-bit CPUs designed prior to the introduction of
lwsync will ignore the L bits.  However, some cores, particularly the e500 core,
will trigger an illegal instruction exception.  Adding these variants will make
it easier to see which sync variant is actually being used in case of a trap.
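
As a rough sketch of the decoding described above (not the ddb disassembler's actual code), the L field can be pulled out of a 32-bit instruction word like this. The function and macro names are hypothetical; the encoding assumed is primary opcode 31 with extended opcode 598, with the two L bits at bit positions 22:21 counting from the least significant bit.

```c
#include <stdint.h>

/* Mask covering the primary opcode (top 6 bits) and extended opcode. */
#define	HYP_SYNC_MASK	0xfc0007feu
/* Primary opcode 31, extended opcode 598, L = 0. */
#define	HYP_SYNC_BASE	0x7c0004acu

/*
 * Return the L field (0..3) of a sync instruction, or -1 if the word
 * is not a sync encoding at all.
 */
static int
hyp_sync_l_field(uint32_t insn)
{
	if ((insn & HYP_SYNC_MASK) != HYP_SYNC_BASE)
		return (-1);
	return ((int)((insn >> 21) & 0x3));
}

/* Map an L value to the mnemonic listed in the commit message. */
static const char *
hyp_sync_name(int l)
{
	switch (l) {
	case 0:
		return ("hwsync");
	case 1:
		return ("lwsync");
	case 2:
		return ("ptesync");
	default:
		return ("sync (unknown L)");
	}
}
```

With this layout, `0x7c0004ac` decodes as L = 0, `0x7c2004ac` as L = 1 and `0x7c4004ac` as L = 2.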
2018-08-10 03:28:40 +00:00
Cy Schubert
79476a1c3e Correct a comment. Should have been detected by ipf_nat_in() not
ipf_nat_out().

MFC after:	1 week
X-MFC-with:	r337558
2018-08-10 00:30:15 +00:00
Cy Schubert
e6191e11f0 Identify the return value (rval) that led to the IPv4 NAT failure
in ipf_nat_checkout() and report it in the frb_natv4out and frb_natv4in
dtrace probes.

This is currently being used to diagnose NAT failures in PR/208566. It's
rather handy so this commit makes it available for future diagnosis and
debugging efforts.

PR:		208566
MFC after:	1 week
2018-08-10 00:04:32 +00:00
Glen Barber
b534d57f63 Rename head from -CURRENT to -ALPHA1 as part of the
12.0-RELEASE cycle.  This commit marks the start of
the code slush for the 12.0 cycle.

Approved by:	re (implicit)
Sponsored by:	The FreeBSD Foundation
2018-08-10 00:01:21 +00:00
Conrad Meyer
2077be2b73 cam(4): Add an xpt-neutral flag indicating a valid panic CCB
No functional change.

Note that this change is careful to set the CCB header xflags after
foo_fill_bar() routines, which generally zero existing flags.  An earlier
version of this patch mistakenly set the flag before the fill routines.
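
The ordering issue can be illustrated with a small stand-alone sketch. Everything here is hypothetical (the structure, the flag value and the function names are invented for the example and are not the CAM definitions); it only shows why a flag must be set after a fill routine that resets the flag fields.

```c
#include <stdint.h>

#define	HYP_XFLAG_PANIC	0x01	/* hypothetical "valid panic request" bit */

/* Hypothetical request header with two flag fields. */
struct hyp_hdr {
	uint32_t	flags;
	uint16_t	xflags;
};

/*
 * Hypothetical fill routine.  Like the fill routines mentioned in the
 * commit message, it overwrites the flag fields instead of preserving them.
 */
static void
hyp_fill(struct hyp_hdr *h, uint32_t flags)
{
	h->flags = flags;
	h->xflags = 0;
}

/* Correct order: fill first, then set the flag so it survives. */
static void
hyp_setup_panic_request(struct hyp_hdr *h)
{
	hyp_fill(h, 0);
	h->xflags |= HYP_XFLAG_PANIC;
}
```

Setting `xflags` before calling `hyp_fill()` would silently lose the bit, which is the kind of mistake the earlier version of the patch made.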

Submitted by:	Scott Ferris <sferris AT isilon.com>, jhibbits@
Reviewed by:	bdrewery@, markj@, and non-committer FreeBSD contributor Anton Rang
Sponsored by:	Dell EMC Isilon
2018-08-09 21:53:32 +00:00
Navdeep Parhar
2d73ac5e4a cxgbe(4): Add a sysctl to control the tx credit reclaim mechanism for
netmap tx queues.  There is no change in default behavior.

Sponsored by:	Chelsio Communications
2018-08-09 21:52:51 +00:00
Conrad Meyer
bc812246a0 cam_ccb.h: Remove redundant declarations of static inline functions
No functional change.

They're unnecessarily confusing for tools like grep or ctags.

Sponsored by:	Dell EMC Isilon
2018-08-09 21:20:07 +00:00
Navdeep Parhar
518bca2c21 cxgbe(4): Set fl_pktshift to 0 by default.
Sponsored by:	Chelsio Communications
2018-08-09 21:07:32 +00:00
Kyle Evans
240fcda1e8 subr_prf: style(9) the sizeof
Reported by:	jkim, ian
2018-08-09 19:09:06 +00:00
Mark Johnston
b50a4ea646 Account for the lowmem handlers in the inactive queue scan target.
Before r329882 the target would be computed after lowmem handlers run
and free pages.  On some systems a significant amount of page
reclamation happens this way.  However, with r329882 the target is
computed first, which can lead to unnecessary reclamation from the
page cache, and this in turn may result in excessive swapping.

Instead, adjust the target after running lowmem handlers.  Don't
invoke the lowmem handlers before the PID controller, though, since
that would hide the true rate of page allocation.

Reviewed by:	alc, kib (previous version)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D16606
2018-08-09 18:25:49 +00:00
Kyle Evans
4c793b68da subr_prf: Use "sizeof current_boot_tag" instead 2018-08-09 17:53:18 +00:00
Kyle Evans
2a4650cc11 BOOT_TAG: Make a config(5) option, expose as sysctl and loader tunable
BOOT_TAG lived briefly in sys/msgbuf.h, but that location made it awkward
to change or remove. Move it into subr_prf.c and add options for it to
opt_printf.h.

One can specify both the BOOT_TAG and BOOT_TAG_SZ (really, size of the
buffer that holds the BOOT_TAG). We expose it as kern.boot_tag and also add
a loader tunable by the same name that we'll fetch upon initialization of
the msgbuf.

This allows for flexibility and also ensures that there's a consistent way
to figure out the boot tag of the running kernel, rather than relying on
headers to be in-sync.

Prodded super-super-lightly by:	imp
2018-08-09 17:47:47 +00:00
Kyle Evans
21aa6e8345 msgbuf: Light detailing (const'ify and bool'itize) 2018-08-09 17:42:27 +00:00