Commit Graph

123926 Commits

Author SHA1 Message Date
Kirk McKusick
d530fd484b TRIM consolodation is supposed to be off by default 2018-08-20 21:19:21 +00:00
Warner Losh
49a49b37df Move options INTRNG into std.armv6 and std.armv7
INTRNG is required on all armv6 and armv7 systems, so make it
standard.
2018-08-20 20:31:53 +00:00
Bjoern A. Zeeb
10b070c166 GC inc_isipv6; it was added for "temp" compatibility in 2001, r86764
and does not seem to be used.
2018-08-20 20:06:36 +00:00
Konstantin Belousov
a997bcc015 Update comment about ABI of flush_l1s_sw to match the reality.
CPUID instruction clobbers %rbx and %rdx.

Sponsored by:	The FreeBSD Foundation
MFC after:	13 days
2018-08-20 19:09:39 +00:00
Konstantin Belousov
b0568ddbec Always initialize PCPU kcr3 for vmspace0 pmap.
If an exception or NMI occurs before CPU switched to a pmap different
from vmspace0, PCPU kcr3 is left zero for pti config, which causes
triple-fault in the handler.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2018-08-20 19:07:57 +00:00
Oleksandr Tymoshenko
5747fe4fb9 [ig4] add ACPI Device HID for AMD platforms
Added ACPI Device HID AMDI0010 for the designware I2C controllers in
future AMD platforms. Also, when verifying component version check for
minimal value instead of exact match.

PR:		230641
Submitted by:	Rajesh <rajfbsd@gmail.com>
Reviewed by:	cem, gonzo
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D16670
2018-08-20 18:50:56 +00:00
Alan Cox
44d0efb215 Eliminate kmem_alloc_contig()'s unused arena parameter.
Reviewed by:	hselasky, kib, markj
Discussed with:	jeff
Differential Revision:	https://reviews.freebsd.org/D16799
2018-08-20 15:57:27 +00:00
Alexander Motin
8805f3d7be Remove extra M_ZERO from NG_MKRESPONSE() argument.
NG_MKRESPONSE() sets M_ZERO by itself.

Submitted by:	Dmitry Luhtionov <dmitryluhtionov@gmail.com>
MFC after:	1 week
2018-08-20 14:35:54 +00:00
Randall Stewart
c28440db29 This change represents a substantial restructure of the way we
reassembly inbound tcp segments. The old algorithm just blindly
dropped in segments without coalescing. This meant that every
segment could take up greater and greater room on the linked list
of segments. This of course is now subject to a tighter limit (100)
of segments which in a high BDP situation will cause us to be a
lot more in-efficent as we drop segments beyond 100 entries that
we receive. What this restructure does is cause the reassembly
buffer to coalesce segments putting an emphasis on the two
common cases (which avoid walking the list of segments) i.e.
where we add to the back of the queue of segments and where we
add to the front. We also have the reassembly buffer supporting
a couple of debug options (black box logging as well as counters
for code coverage). These are compiled out by default but can
be added by uncommenting the defines.

Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D16626
2018-08-20 12:43:18 +00:00
John Baldwin
a800b45c18 Merge amd64 and i386 <machine/intr_machdep.h> headers.
Reviewed by:	kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D16803
2018-08-20 12:31:39 +00:00
John-Mark Gurney
b060c61dfd use sbuf so that lines are printed together... As aarch64 often
has SMP enabled, lines can get intermixed with other console output
making these lines hard to read...

Reviewed by:	manu
Differential Revision:	https://reviews.freebsd.org/D16689
2018-08-19 21:37:51 +00:00
Matt Macy
381388b9c4 add snps IP uart support / genaralize UART
This is an amalgam of a patch by Doug Ambrisko to
generalize uart_acpi_find_device, imp moving the
ACPI table to uart_dev_ns8250.c and advice by jhb
to work around a bug in the EPYC 3151 BIOS
(the BIOS incorrectly marks the serial ports as
disabled)

Reviewed by: imp
MFC after: 8 weeks
Differential Revision: https://reviews.freebsd.org/D16432
2018-08-19 21:10:21 +00:00
Justin Hibbits
8d67357c5c powerpc conf: Add PRINTF_BUFR_SIZE option to Book-E configs
Without this, printf is very hard to follow at times on multicore systems.
2018-08-19 19:07:59 +00:00
Justin Hibbits
b793c8ab28 Sort SPR_SPEFSCR in the SPR list
Also remove duplicate definition of SPR_IBAT0U.
2018-08-19 19:03:43 +00:00
Justin Hibbits
290646564b powerpc64: Align frequently used/exclusive data on cacheline boundaries
This is effectively a merge from amd64 of r312888, r323235, and r333486.

I've been running this on my POWER9 Talos for some time now with no ill
effects.

Suggested by:	mjg
2018-08-19 19:00:44 +00:00
Emmanuel Vadot
65aee3a872 arm64: allwinner: Add aw_syscon driver to GENERIC
Recent DTS use the syscon for the emac controller.
We support this but since U-Boot is still using old DTS it was never
needed for us to add this support, but this is a problem when using upstream
recent DTS and will be when U-Boot will catch up.

While here add a new compatible to the aw_syscon driver as Linux changed it ...
2018-08-19 18:55:33 +00:00
Justin Hibbits
340a810bf0 booke pmap: hide debug-ish printf behind bootverbose
It's not necessary during normal operation to know the mapped region size
and wasted space.
2018-08-19 18:54:43 +00:00
Konstantin Belousov
c1141fba00 Update L1TF workaround to sustain L1D pollution from NMI.
Current mitigation for L1TF in bhyve flushes L1D either by an explicit
WRMSR command, or by software reading enough uninteresting data to
fully populate all lines of L1D.  If NMI occurs after either of
methods is completed, but before VM entry, L1D becomes polluted with
the cache lines touched by NMI handlers.  There is no interesting data
which NMI accesses, but something sensitive might be co-located on the
same cache line, and then L1TF exposes that to a rogue guest.

Use VM entry MSR load list to ensure atomicity of L1D cache and VM
entry if updated microcode was loaded.  If only software flush method
is available, try to help the bhyve sw flusher by also flushing L1D on
NMI exit to kernel mode.

Suggested by and discussed with: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D16790
2018-08-19 18:47:16 +00:00
John Baldwin
2734fedc4e Fix a couple of comment nits. 2018-08-19 17:57:51 +00:00
Xin LI
56019a539f Bump __FreeBSD_version after r338059 (Chacha20 based arc4random(3)
and deprecation of arc4random_stir and arc4random_addrandom).
2018-08-19 17:47:30 +00:00
Xin LI
c1e80940f3 Update userland arc4random() with OpenBSD's Chacha20 based arc4random().
ObsoleteFiles.inc:

    Remove manual pages for arc4random_addrandom(3) and
    arc4random_stir(3).

  contrib/ntp/lib/isc/random.c:
  contrib/ntp/sntp/libevent/evutil_rand.c:

    Eliminate in-tree usage of arc4random_addrandom().

  crypto/heimdal/lib/roken/rand.c:
  crypto/openssh/config.h:

    Eliminate in-tree usage of arc4random_stir().

  include/stdlib.h:

    Remove arc4random_stir() and arc4random_addrandom() prototypes,
    provide temporary shims for transistion period.

  lib/libc/gen/Makefile.inc:

    Hook arc4random-compat.c to build, add hint for Chacha20 source for
    kernel, and remove arc4random_addrandom(3) and arc4random_stir(3)
    links.

  lib/libc/gen/arc4random.c:

    Adopt OpenBSD arc4random.c,v 1.54 with bare minimum changes, use the
    sys/crypto/chacha20 implementation of keystream.

  lib/libc/gen/Symbol.map:

    Remove arc4random_stir and arc4random_addrandom interfaces.

  lib/libc/gen/arc4random.h:

    Adopt OpenBSD arc4random.h,v 1.4 but provide _ARC4_LOCK of our own.

  lib/libc/gen/arc4random.3:

    Adopt OpenBSD arc4random.3,v 1.35 but keep FreeBSD r114444 and
    r118247.

  lib/libc/gen/arc4random-compat.c:

    Compatibility shims for arc4random_stir and arc4random_addrandom
    functions to preserve ABI.  Log once when called but do nothing
    otherwise.

  lib/libc/gen/getentropy.c:
  lib/libc/include/libc_private.h:

    Fold __arc4_sysctl into getentropy.c (renamed to arnd_sysctl).
    Remove from libc_private.h as a result.

  sys/crypto/chacha20/chacha.c:
  sys/crypto/chacha20/chacha.h:

    Make it possible to use the kernel implementation in libc.

PR:		182610
Reviewed by:	cem, markm
Obtained from:	OpenBSD
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D16760
2018-08-19 17:40:50 +00:00
John Baldwin
38a13e9002 Fix the MPTable probe code after the 4:4 changes on i386.
The MPTable probe code was using PMAP_MAP_LOW as the PA -> VA offset
when searching for the table signature but still using KERNBASE once
it had found the table.  As a result, the mpfps table pointed into a
random part of the kernel text instead of the actual MP Table.

Rather than adding more #ifdef's, use BIOS_PADDRTOVADDR from
<machine/pc/bios.h> which already uses PMAP_MAP_LOW on i386 and KERNBASE
on amd64.

Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D16802
2018-08-19 17:36:50 +00:00
Kirk McKusick
4de0d16b8c For traditional disks, the filesystem attempts to allocate the
blocks of a file as contiguously as possible. Since the filesystem
does not know how large a file will grow when it is first being
written, it initially places the file in a set of blocks in which
it currently fits. As it grows, it is relocated to areas with
larger contiguous blocks.  In this way it saves its large contiguous
sets of blocks for the files that need them and thus avoids
unnecessaily fragmenting its disk space.

We used to skip reallocating the blocks of a file into a contiguous
sequence if the underlying flash device requested BIO_DELETE
notifications, because devices that benefit from BIO_DELETE also
benefit from not moving the data. However, in the algorithm described
above that reallocates the blocks, the destination for the data is
usually moved before the data is written to the initially allocated
location. So we rarely suffer the penalty of extra writes.  With
the addition of the consolodation of contiguous blocks into single
BIO_DELETE operations, having fewer but larger contiguous blocks
reduces the number of (slow and expensive) BIO_DELETE operations.
So when doing BIO_DELETE consolodation, we do block reallocation.

Reviewed by:  kib
Tested by:    Peter Holm
Sponsored by: Netflix
2018-08-19 17:19:20 +00:00
Kirk McKusick
fc6e171535 Add consolodation of TRIM / BIO_DELETE commands to the UFS/FFS filesystem.
When deleting files on filesystems that are stored on flash-memory
(solid-state) disk drives, the filesystem notifies the underlying
disk of the blocks that it is no longer using. The notification
allows the drive to avoid saving these blocks when it needs to
flash (zero out) one of its flash pages. These notifications of
no-longer-being-used blocks are referred to as TRIM notifications.
In FreeBSD these TRIM notifications are sent from the filesystem
to the drive using the BIO_DELETE command.

Until now, the filesystem would send a separate message to the drive
for each block of the file that was deleted. Each Gigabyte of file
size resulted in over 3000 TRIM messages being sent to the drive.
This burst of messages can overwhelm the drive's task queue causing
multiple second delays for read and write requests.

This implementation collects runs of contiguous blocks in the file
and then consolodates them into a single BIO_DELETE command to the
drive. The BIO_DELETE command describes the run of blocks as a
single large block being deleted. Each Gigabyte of file size can
result in as few as two BIO_DELETE commands and is typically less
than ten.  Though these larger BIO_DELETE commands take longer to
run, they do not clog the drive task queue, so read and write
commands can intersperse effectively with them.

Though this new feature has been throughly reviewed and tested, it
is being added disabled by default so as to minimize the possibility
of disrupting the upcoming 12.0 release. It can be enabled by running
``sysctl vfs.ffs.dotrimcons=1''. Users are encouraged to test it.
If no problems arise, we will consider requesting that it be enabled
by default for 12.0.

Reviewed by:  kib
Tested by:    Peter Holm
Sponsored by: Netflix
2018-08-19 16:56:42 +00:00
John Baldwin
a568818913 Remove some vestiges of IPI_LAZYPMAP on i386.
The support for lazy pmap invalidations on i386 was removed in r281707.
This removes the constant for the IPI and stops accounting for it when
sizing the interrupt count arrays.

Reviewed by:	kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D16801
2018-08-19 16:14:59 +00:00
Michael Tuexen
8e02b4e00c Don't expose the uptime via the TCP timestamps.
The TCP client side or the TCP server side when not using SYN-cookies
used the uptime as the TCP timestamp value. This patch uses in all
cases an offset, which is the result of a keyed hash function taking
the source and destination addresses and port numbers into account.
The keyed hash function is the same a used for the initial TSN.

Reviewed by:		rrs@
MFC after:		1 month
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D16636
2018-08-19 14:56:10 +00:00
Cy Schubert
683a58eeb9 The bucket index is subtracted by one at lines 2304 and 2314. When 0 it
becomes -1, except these are unsigned integers, so they become very large
numbers. Thus are always larger than the maximum bucket; the hash table
insertion fails causing NAT to fail.

This commit ensures that if the index is already zero it is not reduced
prior to insertion into the hash table.

PR:		208566
2018-08-19 13:45:03 +00:00
Cy Schubert
58a290b9f4 Add handy DTrace probes useful in diagnosing NAT issues. DTrace probes
are situated next to error counters and/or in one instance prior to the
-1 return from various functions. This was useful in diagnosis of
PR/208566 and will be handy in the future diagnosing NAT failures.

PR:		208566
MFC after:	3 days
2018-08-19 13:44:59 +00:00
Cy Schubert
1d6e9fe75c Expose np (nat_t - an entry in the nat table structure) in the DTrace
probe when nat fails (label badnat). This is useful in diagnosing
failed NAT issues and was used in PR/208566.

PR:		208566
MFC after:	3 days
2018-08-19 13:44:56 +00:00
Tai-hwa Liang
d17f8070a1 Extending the delay cycles to give the codec more time to pump ADC data across the AC-link.
Without this patch, some CS4614 cards will need users to reload the driver manually or
the hardware won't be initialised properly. Something like:

	# kldload snd_csa
	# kldunload snd_csa
	# kldload snd_csa

Tested with:	Terratec SiXPack 5.1+
2018-08-19 01:14:46 +00:00
Conrad Meyer
b8e771e97a Back out r338035 until Warner is finished churning GSoC PNP patches
I was not aware Warner was making or planning to make forward progress in
this area and have since been informed of that.

It's easy to apply/reapply when churn dies down.
2018-08-19 00:46:22 +00:00
Conrad Meyer
faa319436f Remove unused and easy to misuse PNP macro parameter
Inspired by r338025, just remove the element size parameter to the
MODULE_PNP_INFO macro entirely.  The 'table' parameter is now required to
have correct pointer (or array) type.  Since all invocations of the macro
already had this property and the emitted PNP data continues to include the
element size, there is no functional change.

Mostly done with the coccinelle 'spatch' tool:

  $ cat modpnpsize0.cocci
    @normaltables@
    identifier b,c;
    expression a,d,e;
    declarer MODULE_PNP_INFO;
    @@
     MODULE_PNP_INFO(a,b,c,d,
    -sizeof(d[0]),
     e);

    @singletons@
    identifier b,c,d;
    expression a;
    declarer MODULE_PNP_INFO;
    @@
     MODULE_PNP_INFO(a,b,c,&d,
    -sizeof(d),
     1);

  $ rg -l MODULE_PNP_INFO -- sys | \
    xargs spatch --in-place --sp-file modpnpsize0.cocci

(Note that coccinelle invokes diff(1) via a PATH search and expects diff to
tolerate the -B flag, which BSD diff does not.  So I had to link gdiff into
PATH as diff to use spatch.)

Tinderbox'd (-DMAKE_JUST_KERNELS).
2018-08-19 00:22:21 +00:00
Alan Cox
94d0f0877d Oops. r338030 didn't eliminate the unused arena argument from all of
kmem_alloc_attr()'s callers.  Correct that mistake.
2018-08-18 22:35:19 +00:00
Kirk McKusick
7e038bc257 Replace the TRIM consolodation framework originally added in -r337396
driven by problems found with the algorithms being tested for TRIM
consolodation.

Reported by:  Peter Holm
Suggested by: kib
Reviewed by:  kib
Sponsored by: Netflix
2018-08-18 22:21:59 +00:00
Alan Cox
db7c2a4822 Eliminate the unused arena parameter from kmem_alloc_attr().
Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D16793
2018-08-18 22:07:48 +00:00
Kirk McKusick
cc91864c26 Revert -r337396. It is being replaced with a revised interface that
resulted from testing and further reviews.
2018-08-18 21:21:06 +00:00
Dimitry Andric
a06da7bafe Use the size of one bge_devs element for the MODULE_PNP_INFO macro,
instead of the size of the whole bge_devs array.

This should stop kldxref searching beyond the end of .rodata when it
processes relocations, and emitting "unhandled relocation type" errors,
at least on i386.
2018-08-18 20:41:43 +00:00
Konstantin Belousov
1ace6e5bea Rudimentary AER reading code for ddb(4).
This is very primitive code to inspect the PCI error state and AER
error state, dump the log and clear errors, from ddb.
pci_print_faulted_dev() is made external to allow calling it from
other places.  It was called from NMI handler but this chunk is not
included.

Also there is a tunable-controlled code to clear AER on device attach,
disabled by default.

All this code was useful to me when I debugged ACPI_DMAR failures (not
faults) long time ago.

Reviewed by:	cem, imp (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D7813
2018-08-18 20:35:19 +00:00
John Baldwin
8cd385fda0 Make 'device crypto' lines more consistent.
- In configurations with a pseudo devices section, move 'device crypto'
  into that section.
- Use a consistent comment.  Note that other things common in kernel
  configs such as GELI also require 'device crypto', not just IPSEC.

Reviewed by:	rgrimes, cem, imp
Differential Revision:	https://reviews.freebsd.org/D16775
2018-08-18 20:32:08 +00:00
Kyle Evans
d529de874b res_find: Fix fallback logic
The fallback logic was broken if hints were found in multiple environments.
If we found a hint in either the loader environment or the static
environment, fallback would be incremented excessively when we returned to
the environment-selection bits. These checks should have also been guarded
by the fbacklvl checks. As a result, fbacklvl could quickly get to a point
where we skip either the static environment and/or the static hints
depending on which environments contained valid hints.

The impact of this bug is minimal, mostly affecting mips boards that use
static hints and may have hints in either the loader environment or the
static environment.

There may be better ways to express the searchable environments and
describing their characteristics (immutable, already searched, etc.) but
this may be revisited after 12 branches.

Reported by:	Dan Nelson <dnelson_1901@yahoo.com>
Triaged by:	Dan Nelson <dnelson_1901@yahoo.com>
MFC after:	3 days
2018-08-18 19:45:56 +00:00
Rick Macklem
fdab4d3b29 Fix LORs between vn_start_write() and vn_lock() in nfsrv_copymr().
When coding the pNFS server, I added vn_start_write() calls in nfsrv_copymr()
done while the vnodes were locked, not realizing I had introduced LORs and
possible deadlock when an exported file system on the MDS is suspended.
This patch fixes the LORs by moving the vn_start_write() calls up to before
where the vnodes are locked. For "tvp", the vn_start_write() probaby isn't
necessary, because NFS mounts can't be suspended. However, I think doing
so is harmless.
Thanks go to kib@ for letting me know that I had introduced these LORs.
This patch only affects the behaviour of the pNFS server when pnfsdscopymr(8)
is used to recover a mirrored DS.
2018-08-18 19:14:06 +00:00
Alan Cox
067fd85894 Eliminate the arena parameter to kmem_malloc_domain(). It is redundant.
The domain and flags parameters suffice.  In fact, the related functions
kmem_alloc_{attr,contig}_domain() don't have an arena parameter.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D16713
2018-08-18 18:33:50 +00:00
Konstantin Belousov
9e2d4791d1 Print L1D FLUSH feature.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2018-08-18 12:17:05 +00:00
Xin LI
ed1fa01ac4 Regen after r337998. 2018-08-18 06:33:51 +00:00
Xin LI
0362ec1e8e getrandom(2) should not be restricted in capability mode. 2018-08-18 06:31:49 +00:00
Navdeep Parhar
e7e0844422 cxgbe(4): Replace T4_PKT_TIMESTAMP with something slightly less hackish. 2018-08-18 04:23:51 +00:00
Rick Macklem
3e5ba2e187 Fix LORs between vn_start_write() and vn_lock() in the pNFS server.
When coding the pNFS server, I added several vn_start_write() calls done
while the vnode was locked, not realizing I had introduced LORs and
possible deadlock when an exported file system on the MDS is suspended.
This patch fixes this by removing the added vn_start_write() calls and
modifying the code so that the extant vn_start_write() call before the
NFS RPC/operation is done when needed by the pNFS server.
Flags are changed so that LayoutCommit and LayoutReturn now get a
vn_start_write() done for them.
When the pNFS server is enabled, the code now also changes the flags for
Getattr, so that the vn_start_write() is done for Getattr, since it may
need to do a vn_set_extattr(). The nfs_writerpc flag array was made global
to the NFS server and renamed nfsrv_writerpc, which is consistent naming
for globals in the NFS server.
Thanks go to kib@ for reporting that doing vn_start_write() while the vnode is
locked results in a LOR.
This patch only affects the behaviour of the pNFS server.
2018-08-17 21:12:16 +00:00
Navdeep Parhar
a56e2056a3 cxgbe(4): Adjust ntids to account for nhptids in the TOE case too.
This should have been part of r337538.
2018-08-17 20:28:31 +00:00
Navdeep Parhar
72049e7395 cxgbe/tom: Put the ifnet or VLAN's PCP value in the 802.1Q tag of frames
generated by the TOE.  Works with vid 0 (no VLAN, just priority) too.

MFC after:	1 week
Sponsored by:	Chelsio Communications
2018-08-17 19:22:46 +00:00
Mark Johnston
1436ff1ebb Typo.
X-MFC with:	r337974
2018-08-17 16:07:06 +00:00
Mark Johnston
3ccbdc8254 Add INVARIANTS-only fences around lockless vnode refcount updates.
Some internal KASSERTs access the v_iflag field without the vnode
interlock held after such a refcount update.  The fences are needed for
the assertions to be correct in the face of store reordering.

Reported and tested by:	jhibbits
Reviewed by:	kib, mjg
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D16756
2018-08-17 15:41:01 +00:00
Alexander Motin
cd2315086a 9751 Allocation throttling misplacing ditto blocks
Relax allocation throttling for ditto blocks.  Due to random imbalances
in allocation it tends to push block copies to one vdev, that looks
slightly better at the moment.  Slightly less strict policy allows both
improve data security and surprisingly write performance, since we don't
need to touch extra metaslabs on each vdev to respect the min distance.

Sponsored by:	iXsystems, Inc.
2018-08-17 15:17:09 +00:00
Alexander Motin
a8e93e3cd7 9738 Fix third block copy allocations, broken at 9112.
Use METASLAB_WEIGHT_CLAIM weight to allocate tertiary blocks.
Previous use of METASLAB_WEIGHT_SECONDARY for that caused errors
later on metaslab_activate_allocator() call, leading to massive
load of unneeded metaslabs and write freezes.

Reviewed by:	Paul Dagnelie <pcd@delphix.com>
2018-08-17 15:00:41 +00:00
Kristof Provost
d47023236c pf: Limit the maximum number of fragments per packet
Similar to the network stack issue fixed in r337782 pf did not limit the number
of fragments per packet, which could be exploited to generate high CPU loads
with a crafted series of packets.

Limit each packet to no more than 64 fragments. This should be sufficient on
typical networks to allow maximum-sized IP frames.

This addresses the issue for both IPv4 and IPv6.

MFC after:	3 days
Security:	CVE-2018-5391
Sponsored by:	Klara Systems
2018-08-17 15:00:10 +00:00
Warner Losh
62ee5bbd73 GPT is standard in x86 and arm64 land. Add it to DEFAULTS with the
others.

Differential Revision: https://reviews.freebsd.org/D16740
2018-08-17 14:47:21 +00:00
Mariusz Zaborski
2fe6aefff8 capsicum: allow the setproctitle(3) function in capability mode
Capsicum in past allowed to change the process title.
This was broken with r335939.

PR:		230584
Submitted by:	Yuichiro NAITO <naito.yuichiro@gmail.com>
Reported by:	ian@niw.com.au
MFC after:	1 week
2018-08-17 14:35:10 +00:00
Rick Macklem
9fbb0faf4f Don't set a file's size for the MDS file of a pNFS service.
When a pNFS service is running, the size of the files created on the MDS
are normally 0, since the data is written to the data files on the DS(s).
However, without this patch, if a Setattr with a non-zero size was done by
a client, the MDS file was set to that size.  This was thought to be benign,
but it turns out that files with a non-zero size plus extended attributes
can cause a "ffs_truncate3" panic in UFS. Although the exact cause of this
panic() has not been isolated, this patch avoids the panic() and leaves
the MDS files in a consistent state of always having a size == 0.
Note that these MDS files never store data. The patch also includes an
unnecessary initialization of savsize in case some compiler or static
analyser complains it might not be initialized.
This patch only affects the NFS server when pNFS is enabled via the "-p"
command line option on nfsd.
2018-08-17 12:32:38 +00:00
Conrad Meyer
9ebbebe4f7 cryptosoft: Reduce generality of supported algorithm composition
Fix a regression introduced in r336439.

Rather than allowing any linked list of algorithms, allow at most two
(typically, some combination of encrypt and/or MAC).  Removes a WAITOK
malloc in an unsleepable context (classic LOR) by placing both software
algorithm contexts within the OCF-managed session object.

Tested with 'cryptocheck -a all -d cryptosoft0', which includes some
encrypt-and-MAC modes.

PR:		230304
Reported by:	sef@
2018-08-17 04:40:01 +00:00
Justin Hibbits
b14959dacc random: Add PowerPC 'darn' instruction entropy source
Summary:
PowerISA 3.0 adds a 'darn' instruction to "deliver a random number".  This
driver was modeled after (rather, copied and gutted of) the Ivy Bridge
rdrand driver.

This uses the "Conditional Random Number" behavior to remove input bias.

From the ISA reference the 'darn' instruction, and the random number
generator backing it, conforms to the NIST SP800-90B and SP800-90C
standards, compliant to the extent possible at the time the hardware was
designed, and guarantees a minimum 0.5 bits of entropy per bit returned.

Reviewed By:	markm, secteam (delphij)
Approved by:	secteam (delphij)
Differential Revision: https://reviews.freebsd.org/D16552
2018-08-17 03:49:07 +00:00
Kyle Evans
45625675e7 subr_prf: Don't write kern.boot_tag if it's empty
This change allows one to set kern.boot_tag="" and not get a blank line
preceding other boot messages. While this isn't super critical- blank lines
are easy to filter out both mentally and in processing dmesg later- it
allows for a mode of operation that matches previous behavior.

I intend to MFC this whole series to stable/11 by the end of the month with
boot_tag empty by default to make this effectively a nop in the stable
branch.
2018-08-17 03:42:57 +00:00
Conrad Meyer
08d77c0178 Riscv: Include crypto for IPSec
Similar to r337944.  I think this is the last configuration that includes IPsec
but not crypto.
2018-08-17 01:08:22 +00:00
Conrad Meyer
923d206149 arm: Define crypto option on platforms that include IPsec
Missed in r337940.

(It's not like there are any crypto files IPsec doesn't pull in, so it is
unclear what not defining the crypto option was supposed to achieve.)

Reported by:	np@
2018-08-17 01:04:02 +00:00
Navdeep Parhar
32a52e9e39 if_vlan(4): A VLAN always has a PCP and its ifnet's if_pcp should be set
to the PCP value in use instead of IFNET_PCP_NONE.

MFC after:	1 week
Sponsored by:	Chelsio Communications
2018-08-17 01:03:23 +00:00
Conrad Meyer
25b7033b73 crypto(4): Add cryptosoft, cryptodev support for Poly-1305 2018-08-17 00:31:06 +00:00
Conrad Meyer
01d5de8fca Add xform-conforming auth_hash wrapper for Poly-1305
The wrapper is a thin shim around libsodium's Poly-1305 implementation.  For
now, we just use the C algorithm and do not attempt to build the
SSE-optimized variant for x86 processors.

The algorithm support has not yet been plumbed through cryptodev, or added
to cryptosoft.
2018-08-17 00:30:04 +00:00
Conrad Meyer
f36e41e20b Bring in compatibility glue for libsodium
The idea is untouched upstream sources live in sys/contrib/libsodium.

sys/crypto/libsodium are support routines or compatibility headers to allow
building unmodified upstream code.

This is not yet integrated into the build system, so no functional change.
2018-08-17 00:27:56 +00:00
Conrad Meyer
0ac341f145 Bring in libsodium to sys/contrib
Bring in https://github.com/jedisct1/libsodium at
461ac93b260b91db8ad957f5a576860e3e9c88a1 (August 7, 2018), unmodified.

libsodium is derived from Daniel J. Bernstein et al.'s 2011 NaCl
("Networking and Cryptography Library," pronounced "salt") software library.
At the risk of oversimplifying, libsodium primarily exists to make it easier
to use NaCl.  NaCl and libsodium provide high quality implementations of a
number of useful cryptographic concepts (as well as the underlying
primitics) seeing some adoption in newer network protocols.

I considered but dismissed cleaning up the directory hierarchy and
discarding artifacts of other build systems in favor of remaining close to
upstream (and easing future updates).

Nothing is integrated into the build system yet, so in that sense, no
functional change.
2018-08-17 00:23:50 +00:00
Glen Barber
9dae4c97a0 Rename head from ALPHA1 to ALPHA2 in preparation for the next set
of snapshot builds.

Hashtag:	MaximumEffort
Approved by:	re (implicit)
Sponsored by:	The FreeBSD Foundation
2018-08-16 23:58:22 +00:00
Navdeep Parhar
32d2623ae2 Add the ability to look up the 3b PCP of a VLAN interface. Use it in
toe_l2_resolve to fill up the complete vtag and not just the vid.

Reviewed by:	kib@
MFC after:	1 week
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D16752
2018-08-16 23:46:38 +00:00
Jamie Gritton
c542c43ef1 Revert r337922, except for some documention-only bits. This needs to wait
until user is changed to stop using jail(2).

Differential Revision:	D14791
2018-08-16 19:09:43 +00:00
Alexander Motin
6d14f2c48f Make vfs.zfs.zio.dva_throttle_enabled sysctl writable.
Not sure what I thought originally, but as I see now runtime changes are
working fine, and the code seems like even designed for this.
2018-08-16 18:44:50 +00:00
Jamie Gritton
284001a222 Put jail(2) under COMPAT_FREEBSD11. It has been the "old" way of creating
jails since FreeBSD 7.

Along with the system call, put the various security.jail.allow_foo and
security.jail.foo_allowed sysctls partly under COMPAT_FREEBSD11 (or
BURN_BRIDGES).  These sysctls had two disparate uses: on the system side,
they were global permissions for jails created via jail(2) which lacked
fine-grained permission controls; inside a jail, they're read-only
descriptions of what the current jail is allowed to do.  The first use
is obsolete along with jail(2), but keep them for the second-read-only use.

Differential Revision:	D14791
2018-08-16 18:40:16 +00:00
Doug Ambrisko
3991dbf3fa Fix a module Makefile error on amd64 so the IPMI HW interfaces are built.
When the module is being unloaded and no HW interfaces were created don't
clean up.  This was exposed by the amd64 module build issue.
2018-08-16 15:59:02 +00:00
Andrew Turner
a24d3c094e Remove the L1 and L2 xscale page defines and rename the generic macros to
the common name. While here move the macros to check these into pmap-v4.c
as they're only used there.

Sponsored by:	DARPA, AFRL
2018-08-16 10:00:51 +00:00
Andrey V. Elsukov
8065bd0bca Properly initialize IP version in IPv6 header. This was missed in r334673.
Reported by:	Lars Schotte <lars at gustik dot eu>
2018-08-16 09:19:06 +00:00
Alexander Motin
edc391e922 Add couple tunables/sysctl, missed in r336949. 2018-08-16 00:50:14 +00:00
Emmanuel Vadot
8c000099ec am335x: Add pocketbeagle DTS to the build
U-Boot works for this board since 2018.07 and the DTS is now present
in the tree.
2018-08-15 21:47:03 +00:00
Navdeep Parhar
9f78434942 cxgbe(4): Use VLAN_TRUNKDEV instead of private cookie to figure out the
parent of a VLAN ifnet.

MFC after:	1 week
Sponsored by:	Chelsio Communications
2018-08-15 21:24:05 +00:00
Alexander Motin
8ce70dfcfa Fix mismerge in r337196.
ZoL did the same mistake, and fixed it with separate commit 863522b1f9:

dsl_scan_scrub_cb: don't double-account non-embedded blocks

We were doing count_block() twice inside this function, once
unconditionally at the beginning (intended to catch the embedded block
case) and once near the end after processing the block.

The double-accounting caused the "zpool scrub" progress statistics in
"zpool status" to climb from 0% to 200% instead of 0% to 100%, and
showed double the I/O rate it was actually seeing.

This was apparently a regression introduced in commit 00c405b4b5e8,
which was an incorrect port of this OpenZFS commit:

    https://github.com/openzfs/openzfs/commit/d8a447a7

Reviewed by: Thomas Caputi <tcaputi@datto.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Steven Noonan <steven@uplinklabs.net>
Closes #7720
Closes #7738

Reported by:	sef
2018-08-15 21:01:57 +00:00
Matt Macy
f9be038601 Fix in6_multi double free
This is actually several different bugs:
- The code is not designed to handle inpcb deletion after interface deletion
  - add reference for inpcb membership
- The multicast address has to be removed from interface lists when the refcount
  goes to zero OR when the interface goes away
  - decouple list disconnect from refcount (v6 only for now)
- ifmultiaddr can exist past being on interface lists
  - add flag for tracking whether or not it's enqueued
- deferring freeing moptions makes the incpb cleanup code simpler but opens the
  door wider still to races
  - call inp_gcmoptions synchronously after dropping the the inpcb lock

Fundamentally multicast needs a rewrite - but keep applying band-aids for now.

Tested by: kp
Reported by: novel, kp, lwhsu
2018-08-15 20:23:08 +00:00
Alexander Leidinger
f6c0e63bf7 - Add exec hook "exec.created". This is called when the jail is
created and before exec.start is called.			[1]
- Bump __FreeBSD_version.

This allows to attach ZFS datasets and various other things to be
done before any command/service/rc-script is started in the new
jail.

PR:			228066					[1]
Reviewed by:		jamie					[1]
Submitted by:		Stefan Grönke <stefan@gronke.net>	[1]
Differential Revision:	https://reviews.freebsd.org/D15330	[1]
2018-08-15 18:35:42 +00:00
Conrad Meyer
5cb27f0813 FUSE: Document global sysctl knobs
So that I don't have to keep grepping around the codebase to remember what each
one does.  And maybe it saves someone else some time.

Fix a trivial whitespace issue while here.

No functional change.

Sponsored by:	Dell EMC Isilon
2018-08-15 17:41:19 +00:00
Luiz Otavio O Souza
a0376d4d29 Fix a typo in comment.
MFC after:	3 days
X-MFC with:	r321316
Sponsored by:	Rubicon Communications, LLC (Netgate)
2018-08-15 16:36:29 +00:00
Luiz Otavio O Souza
59b2022f94 Late style follow up on r312770.
Submitted by:	glebius
X-MFC with:	r312770
MFC after:	3 days
2018-08-15 15:44:30 +00:00
Andrew Turner
d8ba426351 Remove pmap_kenter_section from the arm pmap. It's unused.
Sponsored by:	DARPA, AFRL
2018-08-15 14:57:34 +00:00
Andrew Turner
ac2d0191fc Remove ARM_HAVE_SUPERSECTIONS. It was only supported on some XScale CPUs.
Sponsored by:	DARPA, AFRL
2018-08-15 14:52:56 +00:00
Andrew Turner
d20512fab9 Make code and data only used within the arm pmap code as static.
Sponsored by:	DARPA, AFRL
2018-08-15 14:45:01 +00:00
Andrew Turner
fa68430df2 Remove arm pmap variables that are only ever set and never read.
Sponsored by:	DARPA, AFRL
2018-08-15 14:29:04 +00:00
Andrew Turner
8082df3c7e Remove ARM_MMU_GENERIC, it's the only ARMV4/v5 MMU we support.
Sponsored by:	DARPA, AFRL
2018-08-15 14:19:07 +00:00
Andrew Turner
58e9644017 Remove the ARMv5 pmap function pointers. These were to support XScale so
are now unused.

Sponsored by:	DARPA, AFRL
2018-08-15 13:52:31 +00:00
Andrew Turner
795272d885 Remove checks for now unsupported CPU_* values in arm headers.
Sponsored by:	DARPA, AFRL
2018-08-15 13:48:59 +00:00
Luiz Otavio O Souza
02fd7b50a0 The interface name must be sanitized before the search to match the existing
netgraph node.

Fixes the search (and use) of VLANs with dot notation.

Obtained from:	pfSense
Sponsored by:	Rubicon Communications, LLC (Netgate)
2018-08-15 13:42:22 +00:00
Andrew Turner
daa5b12a0a Start to remove XScale support from the ARMv4/v5 pmap. Support for XScale
has been removed from the kernel so we can remove it from here to help
simplify the code.

Sponsored by:	DARPA, AFRL
2018-08-15 13:40:16 +00:00
Andrew Turner
916e7b1252 Set the Execute Never flags in EFI device memory as required by the ARMv8
spec.

Sponsored by:	DARPA, AFRL
2018-08-15 13:19:15 +00:00
Andrew Turner
f13a4096b7 Remove PHYSADDR from kernel configurations that don't need it. The only
place we need to set it is when we also have FLASHADDR set.

Sponsored by:	DARPA, AFRL
2018-08-15 13:13:19 +00:00
Andrew Turner
559cb76c51 Remove the VIRT armv7 kernel config. It is supported by GENERIC.
Sponsored by:	DARPA, AFRL
2018-08-15 13:03:01 +00:00
Konstantin Belousov
54564eda77 Fix early EFIRT on PCID machines after r337773.
Ensure that the valid PCID state is created for proc0 pmap, since it
might be used by efirt enter() before first context switch on the BSP.

Sponsored by:	The FreeBSD Foundation
MFC after:	6 days
2018-08-15 12:48:49 +00:00
Edward Tomasz Napierala
e77b6cfe34 In the help message at the mountroot prompt, suggest something that
actually works and matches the bsdinstall(8) default.

MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
2018-08-15 12:12:21 +00:00
Toomas Soome
527d337fdb cd9660 pointer sign issues and missing __packed attribute
The isonum_* functions are defined to take unsigend char* as an argument,
but the structure fields are defined as char. Change to u_char where needed.

Probably the full structure should be changed, but I'm not sure about the
side affects.

While there, add __packed attribute.

Differential Revision:	https://reviews.freebsd.org/D16564
2018-08-15 06:42:31 +00:00
Navdeep Parhar
51347c3ff1 cxgbe(4): Use two hashes instead of a table to keep track of
hashfilters.  Two because the driver needs to look up a hashfilter by
its 4-tuple or tid.

A couple of fixes while here:
- Reject attempts to add duplicate hashfilters.
- Do not assume that any part of the 4-tuple that isn't specified is 0.
  This makes it consistent with all other mandatory parameters that
  already require explicit user input.

MFC after:	2 weeks
Sponsored by:	Chelsio Communications
2018-08-15 03:03:01 +00:00
Warner Losh
74cc33ce57 Flesh out a comment about what we're doing with read bias and trims.
Sponsored by: Netflix
2018-08-15 00:15:40 +00:00
Warner Losh
56b9659ee7 arm/ralink cleanup
Remove the non-INTRNG code.
Remove left over cut and paste code from the lpc code that was the start for the port.
Set KERNPHYSADDR and KERNVIRTADDR

Tested on Buffalo_WZR2-G300N

Differential Revision: https://reviews.freebsd.org/D16622
2018-08-14 20:45:43 +00:00
Mark Johnston
b8abc9d8f5 Help ensure that the copy loop doesn't get converted to a memcpy() call.
Reported and reviewed by: kib
X-MFC with:	r337715
Sponsored by:	The FreeBSD Foundation
2018-08-14 19:21:31 +00:00
David Bright
53e992cfb9 Fix several memory leaks.
The libkqueue tests have several places that leak memory by using an
idiom like:

puts(kevent_to_str(kevp));

Rework to save the pointer returned from kevent_to_str() and then
free() it after it has been used.

Reported by:	asomers (pointer to Coverity), Coverity
CID:		1296063, 1296064, 1296065, 1296066, 1296067, 1350287, 1394960
Sponsored by:	Dell EMC
2018-08-14 19:12:45 +00:00
Luiz Otavio O Souza
e13a20dad7 Disable the auto negotiation if the port is set to fixed-link.
Tested on SG-3100 (ARMADA38X) and Espresso.bin (A37x0).  Fixes the network
on espresso.bin.

Sponsored by:	Rubicon Communications, LLC (Netgate)
2018-08-14 18:58:16 +00:00
Jonathan T. Looney
2ceeacbe71 Lower the default limits on the IPv6 reassembly queue.
Currently, the limits are quite high. On machines with millions of
mbuf clusters, the reassembly queue limits can also run into
the millions. Lower these values.

Also, try to ensure that no bucket will have a reassembly
queue larger than approximately 100 items. This limits the cost to
find the correct reassembly queue when processing an incoming
fragment.

Due to the low limits on each bucket's length, increase the size of
the hash table from 64 to 1024.

Reviewed by:	jhb
Security:	FreeBSD-SA-18:10.ip
Security:	CVE-2018-6923
2018-08-14 17:32:07 +00:00
Jonathan T. Looney
a967df1c8f Lower the default limits on the IPv4 reassembly queue.
In particular, try to ensure that no bucket will have a reassembly
queue larger than approximately 100 items. This limits the cost to
find the correct reassembly queue when processing an incoming
fragment.

Due to the low limits on each bucket's length, increase the size of
the hash table from 64 to 1024.

Reviewed by:	jhb
Security:	FreeBSD-SA-18:10.ip
Security:	CVE-2018-6923
2018-08-14 17:30:46 +00:00
Konstantin Belousov
c30578feeb Provide part of the mitigation for L1TF-VMM.
On the guest entry in bhyve, flush L1 data cache, using either L1D
flush command MSR if available, or by reading enough uninteresting
data to fill whole cache.

Flush is automatically enabled on CPUs which do not report RDCL_NO,
and can be disabled with the hw.vmm.l1d_flush tunable/kenv.

Security:	CVE-2018-3646
Reviewed by:	emaste. jhb, Tony Luck <tony.luck@intel.com>
Sponsored by:	The FreeBSD Foundation
2018-08-14 17:29:41 +00:00
Jonathan T. Looney
5f9f192dc5 Drop 0-byte IPv6 fragments.
Currently, we process IPv6 fragments with 0 bytes of payload, add them
to the reassembly queue, and do not recognize them as duplicating or
overlapping with adjacent 0-byte fragments. An attacker can exploit this
to create long fragment queues.

There is no legitimate reason for a fragment with no payload. However,
because IPv6 packets with an empty payload are acceptable, allow an
"atomic" fragment with no payload.

Reviewed by:	jhb
Security:	FreeBSD-SA-18:10.ip
Security:	CVE-2018-6923
2018-08-14 17:29:22 +00:00
Jonathan T. Looney
1e9f3b734e Implement a limit on on the number of IPv6 reassembly queues per bucket.
There is a hashing algorithm which should distribute IPv6 reassembly
queues across the available buckets in a relatively even way. However,
if there is a flaw in the hashing algorithm which allows a large number
of IPv6 fragment reassembly queues to end up in a single bucket, a per-
bucket limit could help mitigate the performance impact of this flaw.

Implement such a limit, with a default of twice the maximum number of
reassembly queues divided by the number of buckets. Recalculate the
limit any time the maximum number of reassembly queues changes.
However, allow the user to override the value using a sysctl
(net.inet6.ip6.maxfragbucketsize).

Reviewed by:	jhb
Security:	FreeBSD-SA-18:10.ip
Security:	CVE-2018-6923
2018-08-14 17:27:41 +00:00
Jonathan T. Looney
03c99d7662 Add a limit of the number of fragments per IPv6 packet.
The IPv4 fragment reassembly code supports a limit on the number of
fragments per packet. The default limit is currently 17 fragments.
Among other things, this limit serves to limit the number of fragments
the code must parse when trying to reassembly a packet.

Add a limit to the IPv6 reassembly code. By default, limit a packet
to 65 fragments (64 on the queue, plus one final fragment to complete
the packet). This allows an average fragment size of 1,008 bytes, which
should be sufficient to hold a fragment. (Recall that the IPv6 minimum
MTU is 1280 bytes. Therefore, this configuration allows a full-size
IPv6 packet to be fragmented on a link with the minimum MTU and still
carry approximately 272 bytes of headers before the fragmented portion
of the packet.)

Users can adjust this limit using the net.inet6.ip6.maxfragsperpacket
sysctl.

Reviewed by:	jhb
Security:	FreeBSD-SA-18:10.ip
Security:	CVE-2018-6923
2018-08-14 17:26:07 +00:00
Jonathan T. Looney
2adfd64f35 Make the IPv6 fragment limits be global, rather than per-VNET, limits.
The IPv6 reassembly fragment limit is based on the number of mbuf clusters,
which are a global resource. However, the limit is currently applied
on a per-VNET basis. Given enough VNETs (or given sufficient customization
on enough VNETs), it is possible that the sum of all the VNET fragment
limits will exceed the number of mbuf clusters available in the system.

Given the fact that the fragment limits are intended (at least in part) to
regulate access to a global resource, the IPv6 fragment limit should
be applied on a global basis.

Note that it is still possible to disable fragmentation for a particular
VNET by setting the net.inet6.ip6.maxfragpackets sysctl to 0 for that
VNET. In addition, it is now possible to disable fragmentation globally
by setting the net.inet6.ip6.maxfrags sysctl to 0.

Reviewed by:	jhb
Security:	FreeBSD-SA-18:10.ip
Security:	CVE-2018-6923
2018-08-14 17:24:26 +00:00
Jonathan T. Looney
ff790bbad0 Implement a limit on on the number of IPv4 reassembly queues per bucket.
There is a hashing algorithm which should distribute IPv4 reassembly
queues across the available buckets in a relatively even way. However,
if there is a flaw in the hashing algorithm which allows a large number
of IPv4 fragment reassembly queues to end up in a single bucket, a per-
bucket limit could help mitigate the performance impact of this flaw.

Implement such a limit, with a default of twice the maximum number of
reassembly queues divided by the number of buckets. Recalculate the
limit any time the maximum number of reassembly queues changes.
However, allow the user to override the value using a sysctl
(net.inet.ip.maxfragbucketsize).

Reviewed by:	jhb
Security:	FreeBSD-SA-18:10.ip
Security:	CVE-2018-6923
2018-08-14 17:23:05 +00:00
Jonathan T. Looney
7b9c5eb0a5 Add a global limit on the number of IPv4 fragments.
The IP reassembly fragment limit is based on the number of mbuf clusters,
which are a global resource. However, the limit is currently applied
on a per-VNET basis. Given enough VNETs (or given sufficient customization
of enough VNETs), it is possible that the sum of all the VNET limits
will exceed the number of mbuf clusters available in the system.

Given the fact that the fragment limit is intended (at least in part) to
regulate access to a global resource, the fragment limit should
be applied on a global basis.

VNET-specific limits can be adjusted by modifying the
net.inet.ip.maxfragpackets and net.inet.ip.maxfragsperpacket
sysctls.

To disable fragment reassembly globally, set net.inet.ip.maxfrags to 0.
To disable fragment reassembly for a particular VNET, set
net.inet.ip.maxfragpackets to 0.

Reviewed by:	jhb
Security:	FreeBSD-SA-18:10.ip
Security:	CVE-2018-6923
2018-08-14 17:19:49 +00:00
Konstantin Belousov
8d32b46379 Add definitions related to the L1D flush operation capability and MSR.
Sponsored by:	The FreeBSD Foundation
2018-08-14 17:19:11 +00:00
Jonathan T. Looney
80d7a85390 Improve IPv6 reassembly performance by hashing fragments into buckets.
Currently, all IPv6 fragment reassembly queues are kept in a flat
linked list. This has a number of implications. Two significant
implications are: all reassembly operations share a common lock,
and it is possible for the linked list to grow quite large.

Improve IPv6 reassembly performance by hashing fragments into buckets,
each of which has its own lock. Calculate the hash key using a Jenkins
hash with a random seed.

Reviewed by:	jhb
Security:	FreeBSD-SA-18:10.ip
Security:	CVE-2018-6923
2018-08-14 17:17:37 +00:00
Jonathan T. Looney
5d9bd45518 Improve hashing of IPv4 fragments.
Currently, IPv4 fragments are hashed into buckets based on a 32-bit
key which is calculated by (src_ip ^ ip_id) and combined with a random
seed. However, because an attacker can control the values of src_ip
and ip_id, it is possible to construct an attack which causes very
deep chains to form in a given bucket.

To ensure more uniform distribution (and lower predictability for
an attacker), calculate the hash based on a key which includes all
the fields we use to identify a reassembly queue (dst_ip, src_ip,
ip_id, and the ip protocol) as well as a random seed.

Reviewed by:	jhb
Security:	FreeBSD-SA-18:10.ip
Security:	CVE-2018-6923
2018-08-14 17:15:47 +00:00
Konstantin Belousov
9840c7373c Reserve page at the physical address zero on amd64.
We always zero the invalidated PTE/PDE for superpage, which means that
L1TF CPU vulnerability (CVE-2018-3620) can be only used for reading
from the page at zero.

Note that both i386 and amd64 exclude the page from phys_avail[]
array, so this change is redundant, but I think that phys_avail[] on
UEFI-boot does not need to do that.  Eventually the blacklisting
should be made conditional on CPUs which report that they are not
vulnerable to L1TF.

Reviewed by:	emaste. jhb
Sponsored by:	The FreeBSD Foundation
2018-08-14 17:14:33 +00:00
Konstantin Belousov
8fba5348fc amd64: ensure that curproc->p_vmspace pmap always matches PCPU
curpmap.

When performing context switch on a machine without PCID, if current
%cr3 equals to the new pmap %cr3, which is typical for kernel_pmap
vs. kernel process, I overlooked to update PCPU curpmap value.  Remove
check for %cr3 not equal to pm_cr3 for doing the update.  It is
believed that this case cannot happen at all, due to other changes in
this revision.

Also, do not set the very first curpmap to kernel_pmap, it should be
vmspace0 pmap instead to match curproc.

Move the common code to activate the initial pmap both on BSP and APs
into pmap_activate_boot() helper.

Reported by: eadler, ambrisko
Discussed with: kevans
Reviewed by:	alc, markj (previous version)
Tested by: ambrisko (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D16618
2018-08-14 16:37:14 +00:00
Luiz Otavio O Souza
6f207f5b47 Add support to the Marvell Xenon SDHCI controller.
Tested on Espresso.bin (37x0) and Macchiato.bin (8k) with SD cards and
eMMCs.

Obtained from:	pfSense
Sponsored by:	Rubicon Communications, LLC (Netgate)
2018-08-14 16:33:30 +00:00
Ruslan Bukin
5bcd113c91 Query MVPConf0.PVPE for number of CPUs.
Rather than hard-coding the number of CPUs to 2, look up the PVPE field
in MVPConf0, as the valid VPE numbers are from 0 to PVPE inclusive.

Submitted by:	"James Clarke" <jrtc4@cam.ac.uk>
Reviewed by:	br
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D16644
2018-08-14 16:29:10 +00:00
Konstantin Belousov
ef52dc71eb Fix typo.
Noted by:	alc
MFC after:	3 days
2018-08-14 16:27:17 +00:00
Ruslan Bukin
b3410bc623 Avoid repeated address calculation for malta_ap_boot.
Submitted by:	"James Clarke" <jrtc4@cam.ac.uk>
Reviewed by:	br, arichardson
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D16655
2018-08-14 16:26:44 +00:00
Ruslan Bukin
9aa2d5e4fa Remove unused code.
Sponsored by:	DARPA, AFRL
2018-08-14 16:22:14 +00:00
Ruslan Bukin
2cfd37def0 Rewrite RISC-V disassembler:
- Use macroses from encoding.h generated by riscv-opcodes.
- Add support for C-compressed ISA extension.

Sponsored by:	DARPA, AFRL
2018-08-14 16:03:03 +00:00
Andrew Turner
3f9baabdd0 Remove cpu_pfr from arm. It's unused. 2018-08-14 16:01:25 +00:00
Andrew Turner
52a532939b Remove an old comment now the code it references has been removed. 2018-08-14 15:48:13 +00:00
Andrew Turner
27e0028cdd Fix the spelling of armv4_idcache_inv_all in an END macro. 2018-08-14 15:42:27 +00:00
Luiz Otavio O Souza
37844eaacf Use the correct PTE when changing the attribute of multiple pages.
Submitted by:	andrew (long time ago)
Sponsored by:	Rubicon Communications, LLC (Netgate)
2018-08-14 15:27:50 +00:00
Mark Johnston
27f4c235ee Explain why we aren't using memcpy().
Reported by:	jmg
X-MFC with:	r337715
Sponsored by:	The FreeBSD Foundation
2018-08-14 14:50:06 +00:00
Mark Johnston
845800e190 Don't use memcpy() in the early microcode loading code.
At some point memcpy() may be an ifunc, ifunc resolution cannot be done
until CPU identification has been performed, and CPU identification must
be done after loading any microcode updates.

X-MFC with:	r337715
Sponsored by:	The FreeBSD Foundation
2018-08-14 14:02:53 +00:00
Luiz Otavio O Souza
217643e7da Fix a typo on the PSCI smc call wrapper.
Looks good from:	andrew
Sponsored by:	Rubicon Communications, LLC (Netgate)
2018-08-14 13:56:49 +00:00
Mark Johnston
3571aee662 Fix the !SMP x86 build.
Reported by:	Michael Butler <imb@protected-networks.net>
X-MFC with:	r337715
Sponsored by:	The FreeBSD Foundation
2018-08-14 13:56:42 +00:00
Andrew Turner
398810619c Support reading from the arm64 ID registers from userspace.
Trap reads to the arm64 ID registers and write a safe value into them. This
will allow us to put more useful values in these later and have userland
check them to find what features the hardware supports.

These are currently safe defaults, but will later be populated with better
values from the hardware.

Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D16533
2018-08-14 11:00:54 +00:00
Michael Tuexen
98f04e431e Use a macro to set the assoc state. I missed this in r337706. 2018-08-14 08:33:47 +00:00
Michael Tuexen
0f1346f7f4 Remove a set but not used warning showing up in usrsctp. 2018-08-14 08:32:33 +00:00
Andrey V. Elsukov
62484790e0 Restore ability to send ICMP and ICMPv6 redirects.
It was lost when tryforward appeared. Now ip[6]_tryforward will be enabled
only when sending redirects for corresponding IP version is disabled via
sysctl. Otherwise will be used default forwarding function.

PR:		221137
Submitted by:	mckay@
MFC after:	2 weeks
2018-08-14 07:54:14 +00:00
Matt Macy
81eb4dcf9e Add library and kernel support for AMD Family 17h counters
NB: lacks default sample rate for most counters
2018-08-14 05:18:43 +00:00
Ian Lepore
5af4ab6524 Export the eeprom device size via readonly sysctl. Also export the write
page size and address size, although they are likely to be inherently
less-interesting values outside of the driver.
2018-08-13 23:53:11 +00:00
Brooks Davis
8f4dfca127 Copy out from kernel to data, not the other way around.
MFC after:	3 days
Sponsored by:	DARPA, AFRL
2018-08-13 21:53:18 +00:00
Marius Strobl
73ed47f04f Remove the duplicated CSUM_IP6_TCP introduced in r311849 from the TX
checksum capabilities of IGB-class MACs. While at it, fix the line
wrapping.

PR:	230571
2018-08-13 20:29:39 +00:00
Warner Losh
acc173a6aa Port the mps panic-safe shutdown_final handling to mpr
r330951 by smh fixed the mps driver to avoid deadlocks when panicing.
The same code is needed for mpr, so port it here, along with the fix
which allows the CCBs scheduled to complete avoiding at least a scary
message and likely other unintended consequences.

Sponsored by: Netflix
Differential Review: https://reviews.freebsd.org/D16663
2018-08-13 19:59:42 +00:00
Warner Losh
d4b95382ee Call xpt_sim_poll in shutdown_final handler.
When we're shutting down, we send a number of start/stop commands to
the known targets. We have to wait for them to complete. During a
panic, the interrupts are off, and using pause to wait for them to
fire and complete won't work: we have to poll after pause returns so
the completion routines of the CCBs run so we decrement work
outstanding counts.

Sponsored by: Netflix
Differential Review: https://reviews.freebsd.org/D16663
2018-08-13 19:59:37 +00:00
Warner Losh
0cc28e3cd5 Create xpt_sim_poll and refactor a bit using it.
xpt_sim_poll takes the sim to poll as an argument. It will do the
proper locking protocol, call the SIM polling routine, and then call
camisr_runqueue to process completions on any CCBs the SIM's poll
routine completed. It will be used during late shutdown when a SIM is
waiting for CCBs it sent during shutdown to finish and the scheduler
isn't running because we've panic'd.

This sequence was used twice in cam_xpt, so refactor those to use this
new function.

Sponsored by: Netflix
Differential Review: https://reviews.freebsd.org/D16663
2018-08-13 19:59:32 +00:00
Navdeep Parhar
408954013a Whitespace nit in t4_tom.h 2018-08-13 19:21:28 +00:00
Vladimir Kondratyev
48f2b00648 evdev: Remove evdev.ko linkage dependency on kbd driver
Move evdev_ev_kbd_event() helper from evdev to kbd.c as otherwise evdev
unconditionally requires all keyboard and console stuff to be compiled
into the kernel. This dependency happens as evdev_ev_kbd_event() helper
references kbdsw global variable defined in kbd.c through use of
kbdd_ioctl() macro.

While here make all keyboard drivers respect evdev_rcpt_mask while setting
typematic rate and LEDs with evdev interface.

Requested by:	Milan Obuch <bsd@dino.sk>
Reviewed by:	hselasky, gonzo
Differential Revision:	https://reviews.freebsd.org/D16614
2018-08-13 19:05:53 +00:00
Vladimir Kondratyev
911aed94fa evdev: remove soft context from evdev methods parameter list.
Now softc should be retrieved from struct edvev * pointer
with evdev_get_softc() helper.

wmt(4) is a sample of driver that support both KPI.

Reviewed by:	hselasky, gonzo
Differential Revision:	https://reviews.freebsd.org/D16614
2018-08-13 19:00:42 +00:00
Oleksandr Tymoshenko
b16d03ad6e [ig4] Fix initialization sequence for newer ig4 chips
Newer chips may require assert/deassert after power down for proper
startup. Check respective flag in DEVIDLE_CTRL and perform operation
if neccesssary.

PR:		221777
Submitted by:	marc.priggemeyer@gmail.com
Obtained from:	DragonFly BSD
Tested on:	Thinkpad T470
2018-08-13 18:53:14 +00:00
Mark Johnston
97edfc1b45 Implement kernel support for early loading of Intel microcode updates.
Updates in the format described in section 9.11 of the Intel SDM can
now be applied as one of the first steps in booting the kernel.  Updates
that are loaded this way are automatically re-applied upon exit from
ACPI sleep states, in contrast with the existing cpucontrol(8)-based
method.  For the time being only Intel updates are supported.

Microcode update files are passed to the kernel via loader(8).  The
file type must be "cpu_microcode" in order for the file to be recognized
as a candidate microcode update.  Updates for multiple CPU types may be
concatenated together into a single file, in which case the kernel
will select and apply a matching update.  Memory used to store the
update file will be freed back to the system once the update is applied,
so this approach will not consume more memory than required.

Reviewed by:	kib
MFC after:	6 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D16370
2018-08-13 17:13:09 +00:00
Konstantin Belousov
c1344d2bbe Prevent some parallel swap-ins, rate-limit swapper swap-ins.
If faultin() was called outside swapper (from PHOLD()), do not allow
swapper to initiate additional swap-ins.  Swapper' initiated swap-ins
are serialized because they are synchronous and executed in the
context of the thread0.  With the added limitation, we only allow
parallel swap-ins from PHOLD(), which is up to PHOLD() users to
manage, usually they do not need to.

Rate-limit swapper' swap-ins to one in the MAXSLP / 2 seconds
interval, counting faultin() swapins.

Suggested by:	alc
Reviewed by:	alc, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D16610
2018-08-13 16:48:46 +00:00
Jung-uk Kim
51f42bad71 Merge ACPICA 20180810. 2018-08-13 16:26:26 +00:00
Ruslan Bukin
c1d0e057d8 Add RISC-V instructions encoding.
This is the output of
$ cat opcodes opcodes-rvc-pseudo opcodes-rvc opcodes-custom |
    ./parse-opcodes -c

It is confirmed by author that the output of parse-opcodes is
in the public domain.

This will be required for DDB disassembler.

Discussed with: Andrew Waterman <waterman@eecs.berkeley.edu>
Obtained from:	https://github.com/riscv/riscv-opcodes
Sponsored by:	DARPA, AFRL
2018-08-13 16:07:18 +00:00
Andrew Gallatin
5ccac9f972 lagg: allow lacp to manage the link state
Lacp needs to manage the link state itself. Unlike other
lagg protocols, the ability of lacp to pass traffic
depends not only on the lagg members having link, but also
on the lacp protocol converging to a distributing state with the
link partner.

If we prematurely mark the link as up, then we will send a
gratuitous arp (via arp_handle_ifllchange()) before the lacp
interface is capable of passing traffic. When this happens,
the gratuitous arp is lost, and our link partner may cache
a stale mac address (eg, when the base mac address for the
lagg bundle changes, due to a BIOS change re-ordering NIC
unit numbers)

Reviewed by: jtl, hselasky
Sponsored by: Netflix
2018-08-13 14:13:25 +00:00
Michael Tuexen
839d21d62e Use the stacb instead of the asoc in state macros.
This is not a functional change. Just a preparation for upcoming
dtrace state change provider support.
2018-08-13 13:58:45 +00:00
Michael Tuexen
61a2188021 Use consistently the macors to modify the assoc state.
No functional change.
2018-08-13 11:56:21 +00:00
Michal Meloun
23242e7a9c Add USB ID for rebranded RTL8153 found on NVIDIA Jetson TX1 board.
MFC after:	3 days
2018-08-13 07:28:25 +00:00
Emmanuel Vadot
2421576ca3 Import DTS files from Linux 4.18 2018-08-13 06:40:20 +00:00
Matt Macy
20a3cbe1f8 fix static ZFS linking
Static linking of ZFS is a newish option and LINT doesn't include it
2018-08-12 21:04:53 +00:00
Justin Hibbits
54318d2a6a ipmi/opal: Enable polled mode and proper callback
Fix a NULL dereference that would occur any time an ioctl() was done, due to a
missing ipmi_enqueue_request callback.  Just use the default for now, until we
decide to properly enable IPMI interrupts.

Reported by:	kbowling
2018-08-12 20:33:55 +00:00
Michael Tuexen
812649d86f Add explicit cast to silence a warning for the userland stack.
Thanks to Felix Weinrank for providing the patch.
2018-08-12 14:05:15 +00:00
Navdeep Parhar
4a89444d7e Remove unused stuff from iw_cxgbe.h 2018-08-12 03:36:09 +00:00
Matt Macy
fb8f55f586 MFV/ZoL: Add dbuf hash and dbuf cache kstats
TODO: KSTAT_TYPE_NAMED support

commit 5e021f56d3437d3523904652fe3cc23ea1f4cb70
Author: Giuseppe Di Natale <dinatale2@users.noreply.github.com>
Date:   Mon Jan 29 10:24:52 2018 -0800

    Add dbuf hash and dbuf cache kstats

    Introduce kstats about the dbuf hash and dbuf cache
    to make it easier to inspect state. This should help
    with debugging and understanding of these portions
    of the codebase.

    Correct format of dbuf kstat file.

    Introduce a dbc column to dbufs kstat to indicate if
    a dbuf is in the dbuf cache.

    Introduce field filtering in the dbufstat python script.

    Introduce a no header option to the dbufstat python script.

    Introduce a test case to test basic mru->mfu list movement
    in the ARC.

    Reviewed-by: Tony Hutter <hutter2@llnl.gov>
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
    Closes #6906
2018-08-12 03:15:30 +00:00
Matt Macy
13ae5c6ba8 MFV/ZoL: Fix stack dbuf_hold_impl()
commit fc5bb51f08a6c91ff9ad3559d0266eeeab0b1f61
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Thu Aug 26 10:52:00 2010 -0700

    Fix stack dbuf_hold_impl()

    This commit preserves the recursive function dbuf_hold_impl() but moves
    the local variables and function arguments to the heap to minimize
    the stack frame size.  Enough space is initially allocated on the
    stack for 20 levels of recursion.  This technique was based on commit
    34229a2f2ac07363f64ddd63e014964fff2f0671 which reduced stack usage of
    traverse_visitbp().

    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2018-08-12 02:24:18 +00:00
Matt Macy
6e3d1345d9 fix build DN_MAX_BONUSLEN -> DN_OLD_MAX_BONUSLEN 2018-08-12 02:12:44 +00:00
Matt Macy
0f5add2566 Restore legacy dnode_phys layout on tier 2 arches
Evidently gcc4 doesn't support anonymous union members
2018-08-12 02:09:06 +00:00
Matt Macy
104ed324dd MFV/ZoL: Fix stack noinline
commit 60948de1ef976aabaa3630707bcc8b5867508507
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Thu Aug 26 10:58:36 2010 -0700

    Fix stack noinline

    Certain function must never be automatically inlined by gcc because
    they are stack heavy or called recursively.  This patch flags all
    such functions I've found as 'noinline' to prevent gcc from making
    the optimization.

    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2018-08-12 01:29:30 +00:00
Matt Macy
71d48dbda3 MFV/ZoL: Fix PANIC: metaslab_free_dva(): bad DVA X:Y:Z
commit 81edd3e83409218879e7af293daa86b0c40eb015
Author: Peng <peng.hse@xtaotech.com>
Date:   Wed Jun 8 15:22:07 2016 +0800

    Fix PANIC: metaslab_free_dva(): bad DVA X:Y:Z

    The following scenario can result in garbage in the dn_spill field.
    The db->db_blkptr must be set to NULL when DNODE_FLAG_SPILL_BLKPTR
    is clear to ensure the dn_spill field is cleared.

    Current txg = A.
    * A new spill buffer is created. Its dbuf is initialized with
      db_blkptr = NULL and it's dirtied.

    Current txg = B.
    * The spill buffer is modified. It's marked as dirty in this txg.
    * Additional changes make the spill buffer unnecessary because the
      xattr fits into the bonus buffer, so it's removed. The dbuf is
      undirtied in this txg, but it's still referenced and cannot be
      destroyed.

    Current txg = C.
    * Starts syncing of txg A
    * dbuf_sync_leaf() is called for the spill buffer. Since db_blkptr
      is NULL, dbuf_check_blkptr() is called.
    * The dbuf starts being written and it reaches the ready state
      (not done yet).
    * A new change makes the spill buffer necessary again.
      sa_build_layouts() ends up calling dbuf_find() to locate the
      dbuf.  It finds the old dbuf because it has not been destroyed yet
      (it will be destroyed when the previous write is done and there
      are no more references). The old dbuf has db_blkptr != NULL.
    * txg A write is complete and the dbuf released. However it's still
      referenced, so it's not destroyed.

    Current txg = D.
    * Starts syncing of txg B
    * dbuf_sync_leaf() is called for the bonus buffer. Its contents are
      directly copied into the dnode, overwriting the blkptr area because,
      in txg B, the bonus buffer was big enough to hold the entire xattr.
    * At this point, the db_blkptr of the spill buffer used in txg C
      gets corrupted.

    Signed-off-by: Peng <peng.hse@xtaotech.com>
    Signed-off-by: Tim Chase <tim@chase2k.com>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Closes #3937
2018-08-12 01:17:32 +00:00
Matt Macy
6f06a36d47 MFV/ZoL: add dbuf stats
NB: disabled pending the addition of KSTAT_TYPE_RAW support to the
SPL

commit e0b0ca983d6897bcddf05af2c0e5d01ff66f90db
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Wed Oct 2 17:11:19 2013 -0700

    Add visibility in to cached dbufs

    Currently there is no mechanism to inspect which dbufs are being
    cached by the system.  There are some coarse counters in arcstats
    by they only give a rough idea of what's being cached.  This patch
    aims to improve the current situation by adding a new dbufs kstat.

    When read this new kstat will walk all cached dbufs linked in to
    the dbuf_hash.  For each dbuf it will dump detailed information
    about the buffer.  It will also dump additional information about
    the referenced arc buffer and its related dnode.  This provides a
    more complete view in to exactly what is being cached.

    With this generic infrastructure in place utilities can be written
    to post-process the data to understand exactly how the caching is
    working.  For example, the data could be processed to show a list
    of all cached dnodes and how much space they're consuming.  Or a
    similar list could be generated based on dnode type.  Many other
    ways to interpret the data exist based on what kinds of questions
    you're trying to answer.

    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Prakash Surya <surya1@llnl.gov>
2018-08-12 01:10:18 +00:00
Matt Macy
cc0fbbb92e MFV/ZoL: Implement large_dnode pool feature
commit 50c957f702ea6d08a634e42f73e8a49931dd8055
Author: Ned Bass <bass6@llnl.gov>
Date:   Wed Mar 16 18:25:34 2016 -0700

    Implement large_dnode pool feature

    Justification
    -------------

    This feature adds support for variable length dnodes. Our motivation is
    to eliminate the overhead associated with using spill blocks.  Spill
    blocks are used to store system attribute data (i.e. file metadata) that
    does not fit in the dnode's bonus buffer. By allowing a larger bonus
    buffer area the use of a spill block can be avoided.  Spill blocks
    potentially incur an additional read I/O for every dnode in a dnode
    block. As a worst case example, reading 32 dnodes from a 16k dnode block
    and all of the spill blocks could issue 33 separate reads. Now suppose
    those dnodes have size 1024 and therefore don't need spill blocks.  Then
    the worst case number of blocks read is reduced to from 33 to two--one
    per dnode block. In practice spill blocks may tend to be co-located on
    disk with the dnode blocks so the reduction in I/O would not be this
    drastic. In a badly fragmented pool, however, the improvement could be
    significant.

    ZFS-on-Linux systems that make heavy use of extended attributes would
    benefit from this feature. In particular, ZFS-on-Linux supports the
    xattr=sa dataset property which allows file extended attribute data
    to be stored in the dnode bonus buffer as an alternative to the
    traditional directory-based format. Workloads such as SELinux and the
    Lustre distributed filesystem often store enough xattr data to force
    spill bocks when xattr=sa is in effect. Large dnodes may therefore
    provide a performance benefit to such systems.

    Other use cases that may benefit from this feature include files with
    large ACLs and symbolic links with long target names. Furthermore,
    this feature may be desirable on other platforms in case future
    applications or features are developed that could make use of a
    larger bonus buffer area.

    Implementation
    --------------

    The size of a dnode may be a multiple of 512 bytes up to the size of
    a dnode block (currently 16384 bytes). A dn_extra_slots field was
    added to the current on-disk dnode_phys_t structure to describe the
    size of the physical dnode on disk. The 8 bits for this field were
    taken from the zero filled dn_pad2 field. The field represents how
    many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
    This convention results in a value of 0 for 512 byte dnodes which
    preserves on-disk format compatibility with older software.

    Similarly, the in-memory dnode_t structure has a new dn_num_slots field
    to represent the total number of dnode_phys_t slots consumed on disk.
    Thus dn->dn_num_slots is 1 greater than the corresponding
    dnp->dn_extra_slots. This difference in convention was adopted
    because, unlike on-disk structures, backward compatibility is not a
    concern for in-memory objects, so we used a more natural way to
    represent size for a dnode_t.

    The default size for newly created dnodes is determined by the value of
    a new "dnodesize" dataset property. By default the property is set to
    "legacy" which is compatible with older software. Setting the property
    to "auto" will allow the filesystem to choose the most suitable dnode
    size. Currently this just sets the default dnode size to 1k, but future
    code improvements could dynamically choose a size based on observed
    workload patterns. Dnodes of varying sizes can coexist within the same
    dataset and even within the same dnode block. For example, to enable
    automatically-sized dnodes, run

     # zfs set dnodesize=auto tank/fish

    The user can also specify literal values for the dnodesize property.
    These are currently limited to powers of two from 1k to 16k. The
    power-of-2 limitation is only for simplicity of the user interface.
    Internally the implementation can handle any multiple of 512 up to 16k,
    and consumers of the DMU API can specify any legal dnode value.

    The size of a new dnode is determined at object allocation time and
    stored as a new field in the znode in-memory structure. New DMU
    interfaces are added to allow the consumer to specify the dnode size
    that a newly allocated object should use. Existing interfaces are
    unchanged to avoid having to update every call site and to preserve
    compatibility with external consumers such as Lustre. The new
    interfaces names are given below. The versions of these functions that
    don't take a dnodesize parameter now just call the _dnsize() versions
    with a dnodesize of 0, which means use the legacy dnode size.

    New DMU interfaces:
      dmu_object_alloc_dnsize()
      dmu_object_claim_dnsize()
      dmu_object_reclaim_dnsize()

    New ZAP interfaces:
      zap_create_dnsize()
      zap_create_norm_dnsize()
      zap_create_flags_dnsize()
      zap_create_claim_norm_dnsize()
      zap_create_link_dnsize()

    The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
    spa_maxdnodesize() function should be used to determine the maximum
    bonus length for a pool.

    These are a few noteworthy changes to key functions:

    * The prototype for dnode_hold_impl() now takes a "slots" parameter.
      When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
      ensure the hole at the specified object offset is large enough to
      hold the dnode being created. The slots parameter is also used
      to ensure a dnode does not span multiple dnode blocks. In both of
      these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
      these failure cases are only possible when using DNODE_MUST_BE_FREE.

      If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
      dnode_hold_impl() will check if the requested dnode is already
      consumed as an extra dnode slot by an large dnode, in which case
      it returns ENOENT.

    * The function dmu_object_alloc() advances to the next dnode block
      if dnode_hold_impl() returns an error for a requested object.
      This is because the beginning of the next dnode block is the only
      location it can safely assume to either be a hole or a valid
      starting point for a dnode.

    * dnode_next_offset_level() and other functions that iterate
      through dnode blocks may no longer use a simple array indexing
      scheme. These now use the current dnode's dn_num_slots field to
      advance to the next dnode in the block. This is to ensure we
      properly skip the current dnode's bonus area and don't interpret it
      as a valid dnode.

    zdb
    ---
    The zdb command was updated to display a dnode's size under the
    "dnsize" column when the object is dumped.

    For ZIL create log records, zdb will now display the slot count for
    the object.

    ztest
    -----
    Ztest chooses a random dnodesize for every newly created object. The
    random distribution is more heavily weighted toward small dnodes to
    better simulate real-world datasets.

    Unused bonus buffer space is filled with non-zero values computed from
    the object number, dataset id, offset, and generation number.  This
    helps ensure that the dnode traversal code properly skips the interior
    regions of large dnodes, and that these interior regions are not
    overwritten by data belonging to other dnodes. A new test visits each
    object in a dataset. It verifies that the actual dnode size matches what
    was stored in the ztest block tag when it was created. It also verifies
    that the unused bonus buffer space is filled with the expected data
    patterns.

    ZFS Test Suite
    --------------
    Added six new large dnode-specific tests, and integrated the dnodesize
    property into existing tests for zfs allow and send/recv.

    Send/Receive
    ------------
    ZFS send streams for datasets containing large dnodes cannot be received
    on pools that don't support the large_dnode feature. A send stream with
    large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
    unrecognized by an incompatible receiving pool so that the zfs receive
    will fail gracefully.

    While not implemented here, it may be possible to generate a
    backward-compatible send stream from a dataset containing large
    dnodes. The implementation may be tricky, however, because the send
    object record for a large dnode would need to be resized to a 512
    byte dnode, possibly kicking in a spill block in the process. This
    means we would need to construct a new SA layout and possibly
    register it in the SA layout object. The SA layout is normally just
    sent as an ordinary object record. But if we are constructing new
    layouts while generating the send stream we'd have to build the SA
    layout object dynamically and send it at the end of the stream.

    For sending and receiving between pools that do support large dnodes,
    the drr_object send record type is extended with a new field to store
    the dnode slot count. This field was repurposed from unused padding
    in the structure.

    ZIL Replay
    ----------
    The dnode slot count is stored in the uppermost 8 bits of the lr_foid
    field. The bits were unused as the object id is currently capped at
    48 bits.

    Resizing Dnodes
    ---------------
    It should be possible to resize a dnode when it is dirtied if the
    current dnodesize dataset property differs from the dnode's size, but
    this functionality is not currently implemented. Clearly a dnode can
    only grow if there are sufficient contiguous unused slots in the
    dnode block, but it should always be possible to shrink a dnode.
    Growing dnodes may be useful to reduce fragmentation in a pool with
    many spill blocks in use. Shrinking dnodes may be useful to allow
    sending a dataset to a pool that doesn't support the large_dnode
    feature.

    Feature Reference Counting
    --------------------------
    The reference count for the large_dnode pool feature tracks the
    number of datasets that have ever contained a dnode of size larger
    than 512 bytes. The first time a large dnode is created in a dataset
    the dataset is converted to an extensible dataset. This is a one-way
    operation and the only way to decrement the feature count is to
    destroy the dataset, even if the dataset no longer contains any large
    dnodes. The complexity of reference counting on a per-dnode basis was
    too high, so we chose to track it on a per-dataset basis similarly to
    the large_block feature.

    Signed-off-by: Ned Bass <bass6@llnl.gov>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Closes #3542
2018-08-12 00:45:53 +00:00
Matt Macy
9f3a171221 Enable balanced arc pruning
Taken from:
ommit f6046738365571bd647f804958dfdff8a32fbde4
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Sat May 30 09:57:53 2015 -0500

    Make arc_prune() asynchronous

    As described in the comment above arc_adapt_thread() it is critical
    that the arc_adapt_thread() function never sleep while holding a hash
    lock.  This behavior was possible in the Linux implementation because
    the arc_prune() logic was implemented to be synchronous.  Under
    illumos the analogous dnlc_reduce_cache() function is asynchronous.

    To address this the arc_do_user_prune() function is has been reworked
    in to two new functions as follows:

    * arc_prune_async() is an asynchronous implementation which dispatches
    the prune callback to be run by the system taskq.  This makes it
    suitable to use in the context of the arc_adapt_thread().

    * arc_prune() is a synchronous implementation which depends on the
    arc_prune_async() implementation but blocks until the outstanding
    callbacks complete.  This is used in arc_kmem_reap_now() where it
    is safe, and expected, that memory will be freed.

    This patch additionally adds the zfs_arc_meta_strategy module option
    while allows the meta reclaim strategy to be configured.  It defaults
    to a balanced strategy which has been proved to work well under Linux
    but the illumos meta-only strategy can be enabled.

    Signed-off-by: Tim Chase <tim@chase2k.com>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2018-08-11 22:01:52 +00:00
Navdeep Parhar
37310a98a8 cxgbe(4): Move all control queues to the adapter.
There used to be one control queue per adapter (the mgmtq) that was
initialized during adapter init and one per port that was initialized
later during port init.  This change moves all the control queues (one
per port/channel) to the adapter so that they are initialized during
adapter init and are available before any port is up.  This allows the
driver to issue ctrlq work requests over any channel without having to
bring up any port.

MFH:		2 weeks
Sponsored by:	Chelsio Communications
2018-08-11 21:10:08 +00:00
Matt Macy
d815f5ba09 buildworld fix: private appears to have special meaning on FreeBSD - revert to priv 2018-08-11 20:41:42 +00:00
Matt Macy
6b55e6fb04 Limit the amount of dnode metadata in the ARC
In addition import most recent arc_prune_async implementation as dependency

commit 25458cbef9e59ef9ee6a7e729ab2522ed308f88f
Author: Tim Chase <tim@chase2k.com>
Date:   Wed Jul 13 07:42:40 2016 -0500

    Limit the amount of dnode metadata in the ARC

    Metadata-intensive workloads can cause the ARC to become permanently
    filled with dnode_t objects as they're pinned by the VFS layer.
    Subsequent data-intensive workloads may only benefit from about
    25% of the potential ARC (arc_c_max - arc_meta_limit).

    In order to help track metadata usage more precisely, the other_size
    metadata arcstat has replaced with dbuf_size, dnode_size and bonus_size.

    The new zfs_arc_dnode_limit tunable, which defaults to 10% of
    zfs_arc_meta_limit, defines the minimum number of bytes which is desirable
    to be consumed by dnodes.  Attempts to evict non-metadata will trigger
    async prune tasks if the space used by dnodes exceeds this limit.

    The new zfs_arc_dnode_reduce_percent tunable specifies the amount by
    which the excess dnode space is attempted to be pruned as a percentage of
    the amount by which zfs_arc_dnode_limit is being exceeded.  By default,
    it tries to unpin 10% of the dnodes.

    The problem of dnode metadata pinning was observed with the following
    testing procedure (in this example, zfs_arc_max is set to 4GiB):

        - Create a large number of small files until arc_meta_used exceeds
          arc_meta_limit (3GiB with default tuning) and arc_prune
          starts increasing.

        - Create a 3GiB file with dd.  Observe arc_mata_used.  It will still
          be around 3GiB.

        - Repeatedly read the 3GiB file and observe arc_meta_limit as before.
          It will continue to stay around 3GiB.

    With this modification, space for the 3GiB file is gradually made
    available as subsequent demands on the ARC are made.  The previous behavior
    can be restored by setting zfs_arc_dnode_limit to the same value as the
    zfs_arc_meta_limit.

    Signed-off-by: Tim Chase <tim@chase2k.com>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Issue #4345
    Issue #4512
    Issue #4773
    Closes #4858
2018-08-11 19:45:04 +00:00
Alan Cox
c65ed2ff53 Eliminate a redundant assignment.
MFC after:	1 week
2018-08-11 19:21:53 +00:00
Kristof Provost
e9ddca4a40 pf: Take the IF_ADDR_RLOCK() when iterating over the group list
We did do this elsewhere in pf, but the lock was missing here.

Sponsored by:	Essen Hackathon
2018-08-11 16:37:55 +00:00
Kristof Provost
33b242b533 pf: Fix 'set skip on' for groups
The pfi_skip_if() function sometimes caused skipping of groups to work,
if the members of the group used the groupname as a name prefix.
This is often the case, e.g. group lo usually contains lo0, lo1, ...,
but not always.

Rather than relying on the name explicitly check for group memberships.

Obtained from:	OpenBSD (pf_if.c,v 1.62, pf_if.c,v 1.63)
Sponsored by:	Essen Hackathon
2018-08-11 16:34:30 +00:00
Navdeep Parhar
3098bcfc05 cxgbe(4): Create two variants of service_iq, one for queues with
freelists and one for those without.

MFH:		3 weeks
Sponsored by:	Chelsio Communications
2018-08-11 04:55:47 +00:00
Matt Macy
90df93417e ZFS/MFV: Use cached feature info in spa_add_feature_stats()
commit 417104bdd3c7ce07ec58674dd078f9891c3bc780
Author: Ned Bass <bass6@llnl.gov>
Date:   Thu Feb 26 12:24:11 2015 -0800

    Use cached feature info in spa_add_feature_stats()

    Avoid issuing I/O to the pool when retrieving feature flags information.
    Trying to read the ZAPs from disk means that zpool clear would hang if
    the pool is suspended and recovery would require a reboot. To keep the
    feature stats resident in memory, we hang a cached nvlist off of the
    spa.  It is built up from disk the first time spa_add_feature_stats() is
    called, and refreshed thereafter using the cached feature reference
    counts. spa_add_feature_stats() gets called at pool import time so we
    can be sure the cached nvlist will be available if the pool is later
    suspended.

    Signed-off-by: Ned Bass <bass6@llnl.gov>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Closes #3082
2018-08-10 23:42:11 +00:00
Devin Teske
ab9ed8a1bd Fix misspellings of transmitter/transmitted
Reviewed by:	emaste, bcr
Sponsored by:	Smule, Inc.
Differential Revision:	https://reviews.freebsd.org/D16025
2018-08-10 20:37:32 +00:00
Conrad Meyer
f053ca1f08 Walk back r337554 while discussion continues
The idea was to get the uncontroversial mechanical change out of the way,
then get the meatier functional changes reviewed subsequently.  I had not
realized that the immediately adjacent issue was addressed in a different
direction in r334506 (see Warner's guidance in D15592).

Discussion continues, trying to determine if there is a secondary issue
still[1] and how best to fix it.  With 12-related activities coming up,
while that is ongoing, just take this back for now.

[1]: Shutdown-time eventhandler events fire normally during panic's reboot
path.  Driver callbacks that attempt to issue and wait on interrupt-
completed IO may never complete, hanging the system.  This is particularly
obnoxious in the shutdown/panic path, as the debugger cannot be entered
anymore and the hang prevents reboot restoring availability.

(There's nothing CAM-specific about this problem -- any shutdown
event-triggered driver could do something like this during panic.  But most
NICs, etc.  don't try to send spin-down commands at shutdown. ;-))

Discussed with:	imp, markj
2018-08-10 19:19:07 +00:00
Kyle Evans
0915d9d070 subr_prf: remove think-o that had returned to local patch
Reported by:	cognet
2018-08-10 15:35:02 +00:00
Kyle Evans
170bc29131 boot tagging: minor fixes
msgbufinit may be called multiple times as we initialize the msgbuf into a
progressively larger buffer. This doesn't happen as of now on head, but it
may happen in the future and we generally support this. As such, only print
the boot tag if we've just initialized the buffer for the first time.

The boot tag also now has a newline appended to it for better visibility,
and has been switched to a normal printf, by requesto f bde, after we've
denoted that the msgbuf is mapped.
2018-08-10 15:29:06 +00:00
Warner Losh
7e299411ac Bring in timespce_get form NetBSD.
Bring in the functionality for timespec_get from NetBSD. I've lightly
edited the .c file to remove _DIAGASSERT because FreeBSD doesn't have
that functionality and the typical #define'ing it to assert isn't
right here. The man page is verbatim from NetBSD, but will be revised
as part of a larger cleanup of the time man pages (they are
inconsistent and vague in all the wrong places).

Differential Review: https://reviews.freebsd.org/D16649
2018-08-10 15:16:30 +00:00
Kyle Evans
84c956df77 ath: Minor style cleanups
device_printf => DPRINTF and two whitespace adjustments

Submitted by:	Augustin Cavalier <waddlesplash@gmail.com>
Obtained from:	Haiku (4a88aa503ad4155a20931e263d24343043994ea9)
MFC after:	1 week
2018-08-10 13:38:23 +00:00
Kyle Evans
8e0cc51b87 ieee8021_node: fix whitespace issues
Submitted by:	Augustin Cavalier <waddlesplash@gmail.com>
Obtained from:	Haiku (dffc3e235360cd7b71261239ee8507b7d62a1471)
MFC after:	1 week
2018-08-10 13:34:23 +00:00
Kyle Evans
58a7c4bfcf net80211: Drain ageq before cleaning it up.
The comment above ieee80211_ageq_cleanup specifically notes that the queue
is assumed to be empty, and in order to make it so, ieee80211_ageq_drain
must be used.

Submitted by:	Augustin Cavalier <waddlesplash@gmail.com>
Obtained from:	Haiku (dffc3e235360cd7b71261239ee8507b7d62a1471)
MFC after:	1 week
2018-08-10 13:32:02 +00:00
Kyle Evans
060b3e4ff1 bwi(4): Set ic->ic_softc before bwi_getradiocaps to avoid bad deref
Submitted by:	François Revol <revol@free.fr>
Obtained from:	Haiku (ba88131cfde64e21bedb4ebedd699cfa5e7fd314)
MFC after:	1 week
2018-08-10 13:06:14 +00:00
Andrey V. Elsukov
16bbf600d9 Remove unneeded ipsec-related includes.
Reviewed by:	rrs
Differential Revision:	https://reviews.freebsd.org/D16637
2018-08-10 07:24:01 +00:00
Matt Macy
648cfe57fd Performance optimization of AVL tree comparator functions
MFV:
commit ee36c709c3d5f7040e1bd11f5c75318aa03e789f
Author: Gvozden Neskovic <neskovic@gmail.com>
Date:   Sat Aug 27 20:12:53 2016 +0200

    perf: 2.75x faster ddt_entry_compare()
        First 256bits of ddt_key_t is a block checksum, which are expected
    to be close to random data. Hence, on average, comparison only needs to
    look at first few bytes of the keys. To reduce number of conditional
    jump instructions, the result is computed as: sign(memcmp(k1, k2)).

    Sign of an integer 'a' can be obtained as: `(0 < a) - (a < 0)` := {-1, 0, 1} ,
    which is computed efficiently.  Synthetic performance evaluation of
    original and new algorithm over 1G random keys on 2.6GHz Intel(R) Xeon(R)
    CPU E5-2660 v3:

    old     6.85789 s
    new     2.49089 s

    perf: 2.8x faster vdev_queue_offset_compare() and vdev_queue_timestamp_compare()
        Compute the result directly instead of using conditionals

    perf: zfs_range_compare()
        Speedup between 1.1x - 2.5x, depending on compiler version and
    optimization level.

    perf: spa_error_entry_compare()
        `bcmp()` is not suitable for comparator use. Use `memcmp()` instead.

    perf: 2.8x faster metaslab_compare() and metaslab_rangesize_compare()
    perf: 2.8x faster zil_bp_compare()
    perf: 2.8x faster mze_compare()
    perf: faster dbuf_compare()
    perf: faster compares in spa_misc
    perf: 2.8x faster layout_hash_compare()
    perf: 2.8x faster space_reftree_compare()
    perf: libzfs: faster avl tree comparators
    perf: guid_compare()
    perf: dsl_deadlist_compare()
    perf: perm_set_compare()
    perf: 2x faster range_tree_seg_compare()
    perf: faster unique_compare()
    perf: faster vdev_cache _compare()
    perf: faster vdev_uberblock_compare()
    perf: faster fuid _compare()
    perf: faster zfs_znode_hold_compare()

    Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
    Signed-off-by: Richard Elling <richard.elling@gmail.com>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Closes #5033
2018-08-10 06:42:08 +00:00
Justin Hibbits
7d849dc1a4 powerpc: Add lwsync and ptesync 'sync' opcode variants to ddb disassembler
The canonical form of sync is:

  sync L, E (if Category Elemental Memory Barriers implemented)

The L bits (2) denote the type of sync:

  0 -- hwsync
  1 -- lwsync
  2 -- ptesync or hwsync

It's been found that most 32-bit CPUs designed prior to the introduction of
lwsync will ignore the L bits.  However, some cores, particularly the e500 core,
will trigger an illegal instruction exception.  Adding these variants will make
it easier to see which sync variant is actually being used in case of a trap.
2018-08-10 03:28:40 +00:00
Cy Schubert
79476a1c3e Correct a comment. Should have been detected by ipf_nat_in() not
ipf_nat_out().

MFC after:	1 week
X-MFC-with:	r337558
2018-08-10 00:30:15 +00:00
Cy Schubert
e6191e11f0 Identify the return value (rval) that led to the IPv4 NAT failure
in ipf_nat_checkout() and report it in the frb_natv4out and frb_natv4in
dtrace probes.

This is currently being used to diagnose NAT failures in PR/208566. It's
rather handy so this commit makes it available for future diagnosis and
debugging efforts.

PR:		208566
MFC after:	1 week
2018-08-10 00:04:32 +00:00
Glen Barber
b534d57f63 Rename head from -CURRENT to -ALPHA1 as part of the
12.0-RELEASE cycle.  This commit marks the start of
the code slush for the 12.0 cycle.

Approved by:	re (implicit)
Sponsored by:	The FreeBSD Foundation
2018-08-10 00:01:21 +00:00
Conrad Meyer
2077be2b73 cam(4): Add an xpt-neutral flag indicating a valid panic CCB
No functional change.

Note that this change is careful to set the CCB header xflags after
foo_fill_bar() routines, which generally zero existing flags.  An earlier
version of this patch mistakenly set the flag before the fill routines.

Submitted by:	Scott Ferris <sferris AT isilon.com>, jhibbits@
Reviewed by:	bdrewery@, markj@, and non-committer FreeBSD contributor Anton Rang
Sponsored by:	Dell EMC Isilon
2018-08-09 21:53:32 +00:00
Navdeep Parhar
2d73ac5e4a cxgbe(4): Add a sysctl to control the tx credit reclaim mechanism for
netmap tx queues.  There is no change in default behavior.

Sponsored by:	Chelsio Communications
2018-08-09 21:52:51 +00:00
Conrad Meyer
bc812246a0 cam_ccb.h: Remove redundant declarations of static inline functions
No functional change.

They're unnecessarily confusing for tools like grep or ctags.

Sponsored by:	Dell EMC Isilon
2018-08-09 21:20:07 +00:00
Navdeep Parhar
518bca2c21 cxgbe(4): Set fl_pktshift to 0 by default.
Sponsored by:	Chelsio Communications
2018-08-09 21:07:32 +00:00
Kyle Evans
240fcda1e8 subr_prf: style(9) the sizeof
Reported by:	jkim, ian
2018-08-09 19:09:06 +00:00
Mark Johnston
b50a4ea646 Account for the lowmem handlers in the inactive queue scan target.
Before r329882 the target would be computed after lowmem handlers run
and free pages.  On some systems a significant amount of page
reclamation happens this way.  However, with r329882 the target is
computed first, which can lead to unnecessary reclamation from the
page cache, and this in turn may result in excessive swapping.

Instead, adjust the target after running lowmem handlers.  Don't
invoke the lowmem handlers before the PID controller, though, since
that would hide the true rate of page allocation.

Reviewed by:	alc, kib (previous version)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D16606
2018-08-09 18:25:49 +00:00
Kyle Evans
4c793b68da subr_prf: Use "sizeof current_boot_tag" instead 2018-08-09 17:53:18 +00:00
Kyle Evans
2a4650cc11 BOOT_TAG: Make a config(5) option, expose as sysctl and loader tunable
BOOT_TAG lived shortly in sys/msgbuf.h, but this wasn't necessarily great
for changing it or removing it. Move it into subr_prf.c and add options for
it to opt_printf.h.

One can specify both the BOOT_TAG and BOOT_TAG_SZ (really, size of the
buffer that holds the BOOT_TAG). We expose it as kern.boot_tag and also add
a loader tunable by the same name that we'll fetch upon initialization of
the msgbuf.

This allows for flexibility and also ensures that there's a consistent way
to figure out the boot tag of the running kernel, rather than relying on
headers to be in-sync.

Prodded super-super-lightly by:	imp
2018-08-09 17:47:47 +00:00
Kyle Evans
21aa6e8345 msgbuf: Light detailing (const'ify and bool'itize) 2018-08-09 17:42:27 +00:00
Navdeep Parhar
8a684e1fd1 cxgbe(4): Display pkt-size and burst-size in traffic class parameters. 2018-08-09 14:36:44 +00:00
Navdeep Parhar
5fc0f72f3b cxgbe(4): Add support for high priority filters on T6+. They have their
own region in the TCAM starting with T6, unlike previous chips where
they were in the same region as normal filters.

These filters "hit" before anything else in the LE's lookup.  The exact
order is:
a) High priority filters
b) TOE's active region (TCAM and/or hash)
c) Servers (TOE hw listeners)
d) Normal filters

MFC after:	1 week
Sponsored by:	Chelsio Communications
2018-08-09 14:19:47 +00:00
Leandro Lupori
c8e2123b6a [ppc] Fix kernel panic when using BOOTP_NFSROOT
On PowerPC (and possibly other architectures), that doesn't use
EARLY_AP_STARTUP, the config task queue may be used initialized.
This was observed while trying to mount the root fs from NFS, as
reported here: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=230168.

This patch has 2 main changes:
1- Perform a basic initialization of qgroup_config, similar to
what is done in taskqgroup_adjust, but simpler.
This makes qgroup_config ready to be used during NFS root mount.

2- When EARLY_AP_STARTUP is not used, call inm_init() and
in6m_init() right before SI_SUB_ROOT_CONF, because bootp needs
to send multicast packages to request an IP.

PR:		Bug 230168
Reported by:	sbruno
Reviewed by:	jhibbits, mmacy, sbruno
Approved by:	jhibbits
Differential Revision:	D16633
2018-08-09 14:04:51 +00:00
Olivier Houchard
d8f1ed8d94 Import CK as of commit 08813496570879fbcc2adcdd9ddc0a054361bfde, mostly
to avoid using lwsync on ppc32.
2018-08-09 12:11:49 +00:00
Hans Petter Selasky
25a1e0f636 Implement missing atomic_fcmpset_XXX() support for i386.
This also fixes i386 build after r337527.

MFC after:		1 week
Sponsored by:		Mellanox Technologies
2018-08-09 11:30:13 +00:00
Andriy Gapon
27cfcd9599 add an option for ddb ps command to print process arguments
We use ps to collect the information of all processes in textdump. But
it doesn't contain process arguments which however sometimes are very
useful for debugging.  The new 'a' modifier adds that capability.

While here, remove 'm' modifier from ddb.4.  It was in the manual page
from its very first revision, but I could not find any evidence of the
code ever supporting it.

Submitted by:	Terry Hu <thu@panzura.com>
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	Panzura
Differential Revision: https://reviews.freebsd.org/D16603
2018-08-09 11:21:31 +00:00
Hans Petter Selasky
6402bc3d1e Use atomic_fcmpset_XXX() instead of atomic_cmpset_XXX() when possible
in the LinuxKPI.

Suggested by:	mjg @
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2018-08-09 09:39:32 +00:00
Matt Macy
9fec45d8e5 epoch_block_wait: don't check TD_RUNNING
struct epoch_thread is not type safe (stack allocated) and thus cannot be dereferenced from another CPU

Reported by: novel@
2018-08-09 05:18:27 +00:00
Kyle Evans
2834d61202 kern: Add a BOOT_TAG marker at the beginning of boot dmesg
From the "newly licensed to drive" PR department, add a BOOT_TAG marker (by
default, --<<BOOT>>--, to the beginning of each boot's dmesg. This makes it
easier to do textproc magic to locate the start of each boot and, of
particular interest to some, the dmesg of the current boot.

The PR has a dmesg(8) component as well that I've opted not to include for
the moment- it was the more contentious part of this PR.

bde@ also made the statement that this boot tag should be written with an
ordinary printf, which I've- for the moment- declined to change about this
patch to keep it more transparent to observer of the boot process.

PR:		43434
Submitted by:	dak <aurelien.nephtali@wanadoo.fr> (basically rewritten)
MFC after:	maybe never
2018-08-09 01:32:09 +00:00
Breno Leitao
78f4e2fea0 powerpc64/powernv: re-read RTC after polling
If OPAL_RTC_READ is busy and does not return the information on the first run,
as returning OPAL_BUSY_EVENT, the system will crash since ymd and hmsm variable
will contain junk values.

This is happening because we were not calling OPAL_RTC_READ again after
OPAL_POLL_EVENTS' return, which would finally replace the old/junk hmsm and ymd
values.

The code was also mixing OPAL_RTC_READ and OPAL_POLL_EVENTS return values.

This patch fix this logic and guarantee that we call OPAL_RTC_READ after
OPAL_POLL_EVENTS return, and guarantee the code will only proceed if
OPAL_RTC_READ returns OPAL_SUCCESS.

Reviewed by: jhibbits
Approved by: jhibbits (mentor)
Differential Revision: https://reviews.freebsd.org/D16617
2018-08-08 21:19:07 +00:00
Rick Macklem
41df1b5b47 Assorted fixes to handling of LayoutRecall callbacks, mostly error handling.
After a re-read of the appropriate section of RFC5661, I decided that a
few things should be changed related to LayoutRecall callback handling.
Here are the things fixed by this patch.
- For two of the three cases that LayoutRecall is done, I now think
  setting the clora_changed argument false is correct.
- All errors other than NFSERR_DELAY returned by LayoutRecall appear
  permanent, so don't retry for any of them. (NFSERR_DELAY is retried by
  newnfs_request(), so it is not affected by this patch.)
- Instead of waiting "forever" (actually until the process is SIGTERM'd)
  for Layouts to be returned during a mirror copy, fail and return
  ENXIO after about 1minute.
  Waiting for a <ctrl>C made sense when pnfsdscopymr() was done by itself,
  but did not make sense when done via find(1).
This patch only affects the pNFS server.
2018-08-08 20:21:45 +00:00
Andrey V. Elsukov
5c4aca8218 Use host byte order when comparing mss values.
This fixes tcp-setmss action on little endian machines.

PR:		225536
Submitted by:	John Zielinski
2018-08-08 17:32:02 +00:00
Alan Cox
2bf8cb3804 Add support for pmap_enter(..., psind=1) to the armv6 pmap. In other words,
add support for explicitly requesting that pmap_enter() create a 1 MB page
mapping.  (Essentially, this feature allows the machine-independent layer
to create superpage mappings preemptively, and not wait for automatic
promotion to occur.)

Export pmap_ps_enabled() to the machine-independent layer.

Add a flag to pmap_pv_insert_pte1() that specifies whether it should fail
or reclaim a PV entry when one is not available.

Refactor pmap_enter_pte1() into two functions, one by the same name, that
is a general-purpose function for creating pte1 mappings, and another,
pmap_enter_1mpage(), that is used to prefault 1 MB read- and/or execute-
only mappings for execve(2), mmap(2), and shmat(2).

In addition, as an optimization to pmap_enter(..., psind=0), eliminate the
use of pte2_is_managed() from pmap_enter().  Unlike the x86 pmap
implementations, armv6 does not have a managed bit defined within the PTE.
So, pte2_is_managed() is actually a call to PHYS_TO_VM_PAGE(), which is O(n)
in the number of vm_phys_segs[].  All but one call to PHYS_TO_VM_PAGE() in
pmap_enter() can be avoided.

Reviewed by:	kib, markj, mmel
Tested by:	mmel
MFC after:	6 weeks
Differential Revision:	https://reviews.freebsd.org/D16555
2018-08-08 16:55:01 +00:00
Ruslan Bukin
6371d0bd64 Implement uma_small_alloc(), uma_small_free().
Reviewed by:	markj
Obtained from:	arm64
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D16628
2018-08-08 16:08:38 +00:00
Pedro F. Giffuni
c820acbf0a msdosfs: fixes for Undefined Behavior.
These were found by the Undefined Behaviour GsoC project at NetBSD:

Do not change signedness bit with left shift.
While there avoid signed integer overflow.
Address both issues with using unsigned type.

msdosfs_fat.c:512:42, left shift of 1 by 31 places cannot be represented
in type 'int'
msdosfs_fat.c:521:44, left shift of 1 by 31 places cannot be represented
in type 'int'
msdosfs_fat.c:744:14, left shift of 1 by 31 places cannot be represented
in type 'int'
msdosfs_fat.c:744:24, signed integer overflow: -2147483648 - 1 cannot be
represented in type 'int [20]'
msdosfs_fat.c:840:13, left shift of 1 by 31 places cannot be represented
in type 'int'
msdosfs_fat.c:840:36, signed integer overflow: -2147483648 - 1 cannot be
represented in type 'int [20]'

Detected with micro-UBSan in the user mode.

Hinted from:	NetBSD (CVS 1.33)
MFC after:	2 weeks
Differenctial Revision:	https://reviews.freebsd.org/D16615
2018-08-08 15:08:22 +00:00
Randall Stewart
d18ea344e6 Fix a small bug in rack where it will
end up sending the FIN twice.
Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D16604
2018-08-08 13:36:49 +00:00
Fedor Uporov
53288b712d Split the dir_index and dir_nlink features.
Do not allow to create more that EXT4_LINK_MAX links to directory in case
if the dir_nlink is not set, like it is done in the fresh e2fsprogs updates.

MFC after:      3 months
2018-08-08 12:08:46 +00:00
Fedor Uporov
17c7b27f55 Fix directory blocks checksum updating logic.
The checksum updating functions were not called in case of dir index inode splitting
and in case of dir entry removing, when the entry was first in the block.
Fix and move the dir entry adding logic when i_count == 0 to new function.

MFC after:      3 months
2018-08-08 12:07:45 +00:00
Conrad Meyer
3dc1c7d6bc FUSE: Remove some set-but-not-used variables
No functional change.
2018-08-08 04:46:03 +00:00
Alan Cox
78f1deeffe Defer and aggregate swap_pager_meta_build frees.
Before swp_pager_meta_build replaces an old swapblk with an new one,
it frees the old one.  To allow such freeing of blocks to be
aggregated, have swp_pager_meta_build return the old swap block, and
make the caller responsible for freeing it.

Define a pair of short static functions, swp_pager_init_freerange and
swp_pager_update_freerange, to do the initialization and updating of
blk addresses and counters used in aggregating blocks to be freed.

Submitted by:	Doug Moore <dougm@rice.edu>
Reviewed by:	kib, markj (an earlier version)
Tested by:	pho
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D13707
2018-08-08 02:30:34 +00:00
Kevin Lo
a99020fbf3 - Fix hash calculation by MAC address
- Since rx_cmd_c is an uint16_t, use le16toh() instead of le32toh()

Reviewed by:	emaste
2018-08-08 01:20:02 +00:00
Navdeep Parhar
09a7189fb7 cxgbe(4): Allow the driver to specify a burst size when configuring a
traffic class for rate limiting.

Add experimental knobs that allow the user to specify a default pktsize
and burstsize for traffic classes associated with a port:

dev.<ifname>.<instance>.tc.pktsize
dev.<ifname>.<instance>.tc.burstsize

Sponsored by:	Chelsio Communications
2018-08-07 22:13:03 +00:00
Rick Macklem
93df87f208 Allow newnfs_request() to retry all callback RPCs with an NFSERR_DELAY reply.
The code in newnfs_request() retries RPCs that get a reply of NFSERR_DELAY,
but exempts certain NFSv4 operations. However, for callback RPCs, there
should not be any exemptions at this time. The code would have erroneously
exempted the CBRECALL callback, since it has the same operation number as
the CLOSE operation.
This patch fixes this by checking for a callback RPC (indicated by clp != NULL)
and not checking for exempt operations for callbacks.
This would have only affected the NFSv4 server when delegations are enabled
(they are not enabled by default) and the client replies to CBRECALL with
NFSERR_DELAY. This may never actually happen.
Spotted during code inspection.

MFC after:	2 weeks
2018-08-07 21:29:14 +00:00
Konstantin Belousov
8f94195022 Followup to r337430: only call elf_reloc_ifunc on x86.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-08-07 20:43:50 +00:00
Marius Strobl
e545f5a8a3 Update the list of architectures having atomic_fcmpset_{8,16,64}(9) and
atomic_swap_{64,int}(9) respectively as of r337433.
2018-08-07 18:59:02 +00:00
Marius Strobl
13a10f3414 Implement atomic_swap_{int,long,ptr}(9). 2018-08-07 18:56:51 +00:00
Marius Strobl
ac97c7e4c1 Implement atomic_swap_64(9). 2018-08-07 18:56:01 +00:00
Konstantin Belousov
cb0eecdf92 Futex support functions in linux.ko and linux32.ko on amd64 should be
aware of SMAP.

Reported and tested by:	Johannes Lundberg <johalun0@gmail.com>, wulf
Sponsored by:	The FreeBSD Foundation
2018-08-07 18:29:10 +00:00
Konstantin Belousov
289ead7cb0 Add missed handling of local relocs against ifunc target in the obj modules.
Reported and tested by:	wulf
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2018-08-07 18:26:46 +00:00
Mark Johnston
159f344b84 Recognize ICS1893C PHYs.
Submitted by:	Sebastian Huber <sebastian.huber@embedded-brains.de>
MFC after:	1 week
2018-08-07 17:13:42 +00:00
Mark Johnston
c7902fbeae Improve handling of control message truncation.
If a recvmsg(2) or recvmmsg(2) caller doesn't provide sufficient space
for all control messages, the kernel sets MSG_CTRUNC in the message
flags to indicate truncation of the control messages.  In the case
of SCM_RIGHTS messages, however, we were failing to dispose of the
rights that had already been externalized into the recipient's file
descriptor table.  Add a new function and mbuf type to handle this
cleanup task, and use it any time we fail to copy control messages
out to the recipient.  To simplify cleanup, control message truncation
is now only performed at control message boundaries.

The change also fixes a few related bugs:
- Rights could be leaked to the recipient process if an error occurred
  while copying out a message's contents.
- We failed to set MSG_CTRUNC if the truncation occurred on a control
  message boundary, e.g., if the caller received two control messages
  and provided only the exact amount of buffer space needed for the
  first.

PR:		131876
Reviewed by:	ed (previous version)
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D16561
2018-08-07 16:36:48 +00:00
Colin Percival
0b4d5eb8fd Replace a pair of 8-bit writes to VGA memory with a single 16-bit write.
The VGA "text mode" buffer has a pair of bytes for each character: One
byte for the character symbol, and an "attribute" byte encoding the
foreground and background colours.  When updating the screen, we were
writing these two bytes separately.

On some virtualized systems, every write results in a glyph being redrawn
into a (graphical) virtual screen; writing these two bytes separately
results in twice as much work being done to draw characters, whereas if
we perform a single 16-bit write instead, the character only needs to be
redrawn once.

On an EC2 c5.4xlarge instance, this change cuts 1.30s from the kernel boot,
speeding it up from 8.90s to 7.60s.

MFC after:	1 week
2018-08-07 08:33:40 +00:00
Cy Schubert
95bdea60e0 Remove redundant and incorrect default definition of AF_INET6. AF_INET6
is defined in sys/socket.h where it's defined as 28.

A bit of trivia: On NetBSD AF_INET6 is defined as 24. On Solaris it is
defined as 26. This is probably why Darren defaulted to 26, because
ipfilter was originally written for SunOS 4 and Solaris many moons ago.

MFC after:	2 weeks
2018-08-07 07:12:59 +00:00
John Baldwin
d2aec9714a Make the system C11 atomics headers fully compatible with external GCC.
The <sys/cdefs.h> and <stdatomic.h> headers already included support for
C11 atomics via intrinsincs in modern versions of GCC, but these versions
tried to "hide" atomic variables inside a wrapper structure.  This wrapper
is not compatible with GCC's internal <stdatomic.h> header, so that if
GCC's <stdatomic.h> was used together with <sys/cdefs.h>, use of C11
atomics would fail to compile.  Fix this by not hiding atomic variables
in a structure for modern versions of GCC.  The headers already avoid
using a wrapper structure on clang.

Note that this wrapper was only used if C11 was not enabled (e.g.
via -std=c99), so this also fixes compile failures if a modern version
of GCC was used with -std=c11 but with FreeBSD's <stdatomic.h> instead
of GCC's <stdatomic.h> and this change fixes that case as well.

Reported by:	Mark Millard
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D16585
2018-08-06 23:51:08 +00:00
Navdeep Parhar
1979b51141 cxgbe(4): Allow user-configured and driver-configured traffic classes to
be used simultaneously.  Move sysctl_tc and sysctl_tc_params to
t4_sched.c while here.

MFC after:	3 weeks
Sponsored by:	Chelsio Communications
2018-08-06 23:21:13 +00:00
Navdeep Parhar
7b8f5a200a cxgbe(4): Break up sysctl_bitfield into 8 bit and 16 bit variants. Have
them display the current value of the bitfield rather than the fixed
value that was provided when the sysctl node was created.

MFC after:	1 week
Sponsored by:	Chelsio Communications
2018-08-06 21:54:51 +00:00
Kirk McKusick
68c49bcc40 Put in place the framework for consolodating contiguous blocks into
a smaller number of larger TRIM requests. The hope had been to have
the full TRIM consolodation in place for 12.0, but the algorithms
are still under development and need further testing. With this
framework in place it will be possible to easily add TRIM consolodation
once the optimal strategy has been found.

The only functional change with this patch is the elimination of TRIM
requests for blocks that are freed before they have been likely to
have been written.

Reviewed by: kib
Discussed with: Warner Losh and Chuck Silvers
Sponsored by: Netflix
2018-08-06 21:09:11 +00:00
Navdeep Parhar
564ec04ea8 Fix typo in cxgbe/t4_tom. 2018-08-06 19:09:55 +00:00
Jonathan T. Looney
95a914f631 Address concerns about CPU usage while doing TCP reassembly.
Currently, the per-queue limit is a function of the receive buffer
size and the MSS.  In certain cases (such as connections with large
receive buffers), the per-queue segment limit can be quite large.
Because we process segments as a linked list, large queues may not
perform acceptably.

The better long-term solution is to make the queue more efficient.
But, in the short-term, we can provide a way for a system
administrator to set the maximum queue size.

We set the default queue limit to 100.  This is an effort to balance
performance with a sane resource limit.  Depending on their
environment, goals, etc., an administrator may choose to modify this
limit in either direction.

Reviewed by:	jhb
Approved by:	so
Security:	FreeBSD-SA-18:08.tcp
Security:	CVE-2018-6922
2018-08-06 17:36:57 +00:00
Andrew Turner
b17f3d298d Default to armv5te in LINT on arm. This should fix building LINT there. 2018-08-06 14:40:45 +00:00
Hans Petter Selasky
549dcdb34e Implement current_work() function in the LinuxKPI.
Tested by:	Johannes Lundberg <johalun0@gmail.com>
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2018-08-06 10:48:20 +00:00
Randall Stewart
936b2b64ae This fixes a bug in Rack where we were
not properly using the correct value for
Delayed Ack.

Sponsored by:	Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D16579
2018-08-06 09:22:07 +00:00
Hans Petter Selasky
db119089be Implement atomic_long_cmpxchg() function in the LinuxKPI.
Submitted by:	Johannes Lundberg <johalun0@gmail.com>
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2018-08-06 08:40:02 +00:00
Hans Petter Selasky
f698bc4d76 Define __poll_t type in the LinuxKPI.
Submitted by:	Johannes Lundberg <johalun0@gmail.com>
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2018-08-06 08:35:16 +00:00
Emmanuel Vadot
d19afc9abf aw_thermal: Add nvmem and H5 support
Now that aw_sid expose nvmem interface, use that to read the calibration
data.
Add support for H5 SoC.
Fix the bindings, we used to have non-upstreamed bindings. Switch to the
one that have been sent upstream. They are not stable yet, so we switch
from custom, wrong, bindings to correct, proposed bindings
2018-08-06 05:36:00 +00:00
Emmanuel Vadot
97eb836f8b aw_sid: Add nvmem interface
Rework aw_sid so it can work with the nvmem interface.
Each SoC expose a set of fuses (for now rootkey/boardid and, if available,
the thermal calibration data). A fuse can be private or public, reading private
fuse needs to be done via some registers instead of reading directly.
Each fuse is exposed as a sysctl.
For now leave the possibility for a driver to read any fuse without using
the nvmem interface as the awg and emac driver use this to generate a mac
address.
2018-08-06 05:35:24 +00:00
Rick Macklem
25705dd5d0 Copy all bits of a file handle in case there is padding in the structure.
At least on x86, fhandle_t is a packed structure, so I believe an
assignment will copy all the bits. However, for some current/future
architectures, there might be padding in the structure that doesn't get
copied via an assignment.
Since NFS assumes a file handle is an opaque blob of bits that can be
compared via memcmp()/bcmp(), all the bits including any padding must be
copied.
This patch replaces the assignments with a call to a byte copy function.
Spotted during code inspection.
2018-08-05 19:21:50 +00:00
Kristof Provost
91e0f2d200 pf: Increase default hash table size
Now that we (by default) limit the number of states to 100.000 it makse sense
to also adjust the default size of the hash table.

Based on the benchmarking results in
https://github.com/ocochard/netbenches/blob/master/Atom_C2758_8Cores-Chelsio_T540-CR/pf-states_hashsize/results/fbsd12-head.r332390/README.md
128K entries offers a good compromise between performance and memory use.

Users may still overrule this setting with the net.pf.states_hashsize and
net.pf.source_nodes_hashsize loader(8) tunables.
2018-08-05 13:54:37 +00:00