Commit Graph

15712 Commits

Author SHA1 Message Date
mjoras
88e3a903b6 Properly reset the fields in clean_unrhdr.
In r324542 I neglected to reset the first and last fields of struct
unrhdr. This causes a tmpfs to fail the unr(9) consistency checks with
DIAGNOSTIC on. Fix this by resetting the fields by calling init_unrhdr.
While here, change a loop to use TAILQ_FOREACH_SAFE since it is more
readable and equally fast.

Reported by:	David Wolfskill <david@catwhisker.org>
Approved by:	rstone (mentor)
Sponsored by:	Dell EMC Isilon
2017-10-16 16:14:50 +00:00
tijl
5b4d8cee9d When a Linux program tries to access a /path the kernel tries
/compat/linux/path before /path.  Stop following symbolic links when
looking up /compat/linux/path so dead symbolic links aren't ignored.
This allows syscalls like readlink(2) and lstat(2) to work on such links.
And open(2) will return an error now instead of trying /path.
2017-10-15 18:53:21 +00:00
mjg
e99dca871b mtx: fix up owner_mtx after r324609
Now that MTX_UNOWNED is 0 the test was alwayas false.
2017-10-14 00:47:30 +00:00
alc
e64f2e0930 Address two problems with sendfile(..., SF_NOCACHE) and apply one
"optimization".  First, sendfile(..., SF_NOCACHE) frees pages without
checking whether those pages are mapped.  This can leave the system
with mappings to free or repurposed pages.  Second, a page can be
busied between the time of the current busy test and acquiring the
object lock.  Essentially, the test performed before the object lock
is acquired can only be regarded as an optimization to short-circuit
further work on the page.  It cannot, however, be relied upon to prove
that it is safe to free the page.  Third, when sendfile(..., SF_NOCACHE)
was originally implemented, vm_page_deactivate_noreuse() did not yet
exist.  Use vm_page_deactivate_noreuse() instead of vm_page_deactivate(),
because it comes closer to freeing the page.

In collaboration with:	glebius
Discussed with:	gallatin, kib, markj
X-MFC after:	r324448
2017-10-13 16:31:50 +00:00
avg
13558f906a remove process and jail directory machinations from dounmount
The manipulations done by mountcheckdirs() are not that useful during
the unmount, they can bring about unexpected security consequences.

Thic change effectively reverts the change in r73241.

The change also allows to simplify the handling of rootvnode global
variable.

Discussed with:	mckusick, mjg, kib
Reviewed by:	trasz
MFC after:	1 month
Differential Revision: https://reviews.freebsd.org/D12366
2017-10-13 09:42:05 +00:00
emaste
a97448391f regen init_sysent.c r324560 2017-10-12 15:48:37 +00:00
emaste
5d14c78c8e allow posix_fallocate in capability mode
posix_fallocate is logically equivalent to writing zero blocks to the
desired file size and there is no reason to prevent calling it in
capability mode. posix_fallocate already checked for the CAP_WRITE
right, so we merely need to list it in capabilities.conf.

Reviewed by:	allanjude
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D12640
2017-10-12 15:45:53 +00:00
mjoras
1bac95586a Add clearing function for unr(9).
Previously before you could call unrhdr_delete you needed to
individually free every allocated unit. It is useful to be able to tear
down the unr without having to go through this process, as it is
significantly faster than freeing the individual units.

Reviewed by:	cem, lidl
Approved by:	rstone (mentor)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D12591
2017-10-11 21:53:50 +00:00
kib
49c3eee83d The th_bintime, th_microtime and th_nanotime members of the timehand
all cache the last system time (uptime + boottime).  Only the format
differs.  Do not re-calculate the bintime and simply use the value
used to calculate the microtime and nanotime.

Group all the updates under the relevant comment.  Remove obsoleted
XXX part.

Submitted by:	Sebastian Huber <sebastian.huber@embedded-brains.de>
MFC after:	1 week
2017-10-11 11:03:11 +00:00
sbruno
89fa9e690e match sendfile() error handling to send().
Sendfile() should match the error checking order of send() which
is currently:

SBS_CANTSENDMORE
so_error
SS_ISCONNECTED

Submitted by:	Jason Eggleston <jason@eggnet.com>
Reviewed by:	glebius
MFC after:	2 weeks
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D12633
2017-10-10 22:21:05 +00:00
sbruno
1b75921a14 Revert r324405 at the request of the submitter pending better solution.
Submitted by:	Jason Eggleston <jason@eggnet.com>
Sponsored by:	Limelight Networks
2017-10-10 00:32:21 +00:00
glebius
4bbcc4b9b3 Improvements to sendfile(2) mbuf free routine.
o Fall back to default m_ext free mech, using function pointer in
  m_ext_free, and remove sf_ext_free() called directly from mbuf code.
  Testing on modern CPUs showed no regression.
o Provide internally used flag EXT_FLAG_SYNC, to mark that I/O uses
  SF_SYNC flag.  Lack of the flag allows us not to dereference
  ext_arg2, saving from a cache line miss.
o Create function sendfile_free_page() that later will be used, for
  multi-page mbufs.  For now compiler will inline it into
  sendfile_free_mext().

In collaboration with:	gallatin
Differential Revision:	https://reviews.freebsd.org/D12615
2017-10-09 21:06:16 +00:00
glebius
492b1ec8b6 In mb_dupcl() don't copy full m_ext, to avoid cache miss. Respectively,
in mb_free_ext() always use fields from the original refcount holding
mbuf (see. r296242) mbuf.  Cuts another cache miss from mb_free_ext().

However, treat EXT_EXTREF mbufs differently, since they are different -
they don't have a refcount holding mbuf.

Provide longer comments in m_ext declaration to explain this change
and change from r296242.

In collaboration with:	gallatin
Differential Revision:	https://reviews.freebsd.org/D12615
2017-10-09 20:51:58 +00:00
glebius
a9152a7f99 Shorten list of arguments to mbuf external storage freeing function.
All of these arguments are stored in m_ext, so there is no reason
to pass them in the argument list.  Not all functions need the second
argument, some don't even need the first one.  The second argument
lives in next cache line, so not dereferencing it is a performance
gain.  This was discovered in sendfile(2), which will be covered by
next commits.

The second goal of this commit is to bring even more flexibility
to m_ext mbufs, allowing to create more fields in m_ext, opaque to
the generic mbuf code, and potentially set and dereferenced by
subsystems.

Reviewed by:	gallatin, kbowling
Differential Revision:	https://reviews.freebsd.org/D12615
2017-10-09 20:35:31 +00:00
hselasky
69fd8ab233 When showing the sleepqueues from the in-kernel debugger,
properly dump all the sendqueues and not just the first one

History:
It appears that in the commit which introduced the code,
r165272, the array indexes of "sq_blocked[0]" and "td_name[i]"
were interchanged. In r180927 "td_name[i]" was corrected to
"td_name[0]", but "sq_blocked[0]" was left unchanged.

PR:		222624
Discussed with:	kmacy @
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2017-10-09 18:33:29 +00:00
alc
6a7e568f99 The recent change to initialization of blists (r324420) relied on '-1'
appearing only where the code explicitly set it, but since much of the
data was not initialized, '-1' appeared other places too, and led to
panics.  Clear the allocated data before initializing nonzero values by
allocating with M_ZERO.

Submitted by:	Doug Moore <dougm@rice.edu>
Reported by:	Oleg V. Nauman <oleg@theweb.org.ua>, cy
Tested by:	Oleg V. Nauman <oleg@theweb.org.ua>
MFC after:	1 week
X-MFC with:	r324420
Differential Revision:	https://reviews.freebsd.org/D12627
2017-10-09 18:19:06 +00:00
alc
008747750e The blst_radix_init function has two purposes - to compute the number of
nodes to allocate for the blist, and to initialize them.  The computation
can be done much more quickly by identifying the terminating node, if any,
at every level of the tree and then summing the number of nodes at each
level that precedes the topmost terminator.  The initialization can also be
done quickly, since settings at the root mark the tree as all-allocated, and
only a few terminator nodes need to be marked in the rest of the tree.
Eliminate blst_radix_init, and perform its two functions more simply in
blist_create.

The allocation of the blist takes places in two pieces, but there's no good
reason to do so, when a single allocation is sufficient, and simpler.
Allocate the blist struct, and the array of nodes associated with it, with a
single allocation.

Submitted by:	Doug Moore <dougm@rice.edu>
Reviewed by:	markj (an earlier version)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D11968
2017-10-08 22:17:39 +00:00
ian
5cc2194826 Add eventhandler notifications for newbus device attach/detach.
The detach case is slightly complicated by the fact that some in-kernel
consumers may want to know before a device detaches (so they can release
related resources, stop using the device, etc), but the detach can fail. So
there are pre- and post-detach notifications for those consumers who need to
handle all cases.

A couple salient comments from the review, they amount to some helpful
documentation about these events, but there's currently no good place for
such documentation...

Note that in the current newbus locking model, DETACH_BEGIN and
DETACH_COMPLETE/FAILED sequence of event handler invocation might interweave
with other attach/detach events arbitrarily. The handlers should be prepared
for such situations.

Also should note that detach may be called after the parent bus knows the
hardware has left the building. In-kernel consumers have to be prepared to
cope with this race.

Differential Revision:	https://reviews.freebsd.org/D12557
2017-10-08 17:33:49 +00:00
ian
69fb8e033a Restore the ability to deregister an eventhandler from within the callback.
When the EVENTHANDLER(9) subsystem was created, it was a documented feature
that an eventhandler callback function could safely deregister itself. In
r200652 that feature was inadvertantly broken by adding drain-wait logic to
eventhandler_deregister(), so that it would be safe to unload a module upon
return from deregistering its event handlers.

There are now 145 callers of EVENTHANDLER_DEREGISTER(), and it's likely many
of them are depending on the drain-wait logic that has been in place for 8
years. So instead of creating a separate eventhandler_drain() and adding it
to some or all of those 145 call sites, this creates a separate
eventhandler_drain_nowait() function for the specific purpose of
deregistering a callback from within the running callback.

Differential Revision:	https://reviews.freebsd.org/D12561
2017-10-08 17:21:16 +00:00
sbruno
692f29cb6e Check so_error early in sendfile() call. Prior to this patch, if a
connection was reset by the remote end, sendfile() would just report
ENOTCONN instead of ECONNRESET.

Submitted by:	Jason Eggleston <jason@eggnet.com>
Reviewed by:	glebius
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D12575
2017-10-07 23:30:57 +00:00
mjg
6523dff125 namecache: factor out ~MAKEENTRY lookups from the common path
Lookups of the sort are rare compared to regular ones and succesfull ones
result in removing entries from the cache.

In the current code buckets are rlocked and a trylock dance is performed,
which can fail and cause a restart. Fixing it will require a little bit
of surgery and in order to keep the code maintaineable the 2 cases have
to split.

MFC after:	1 week
2017-10-06 23:05:55 +00:00
markj
08a37c1513 Let stack_create(9) take a malloc flags argument.
Reviewed by:	cem
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D12614
2017-10-06 21:52:28 +00:00
manu
13125fb720 vfs_export_lookup: Fix r324054
When using the default address list nam is still valid, the code in
r324054 assumed that is was NULL.

Reported by:	Guy Yur <guyyur@gmail.com>
Tested by:	Guy Yur <guyyur@gmail.com>
2017-10-06 09:02:36 +00:00
mjg
6f516b023d locks: take the number of readers into account when waiting
Previous code would always spin once before checking the lock. But a lock
with e.g. 6 readers is not going to become free in the duration of once spin
even if they start draining immediately.

Conservatively perform one for each reader.

Note that the total number of allowed spins is still extremely small and is
subject to change later.

MFC after:	1 week
2017-10-05 19:18:02 +00:00
shurd
bb993c697a Fix "taskqgroup_attach: setaffinity failed: 3" with iflib drivers
Improved logging added in r323879 exposed an error during
attach. We need the irq, not the rid to work correctly. em uses
shared irqs, so it will use the same irq for TX as RX. bnxt does
not use shared irqs, or TX irqs at all, so there's no need to set
the TX irq affinity.

Reviewed by:	sbruno
Approved by:	sbruno (mentor)
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D12496
2017-10-05 14:43:30 +00:00
mjg
a059e2d8eb locks: partially tidy up waiting on readers
spin first instant of instantly re-readoing and don't re-read after
spinning is finished - the state is already known.

Note the code is subject to significant changes later.

MFC after:	1 week
2017-10-05 13:01:18 +00:00
avg
341d9df85b sysctl-s in a module should be accessible only when the module is initialized
A sysctl can have a custom handler that may access data that is initialized
via SYSINIT(9) or via a module event handler (also invoked via SYSINIT).
Thus, it is not safe to allow access to the module's sysctl-s until
the initialization is performed.  Likewise, we should not allow access
to teh sysctl-s after the module is uninitialized.
The latter is easy to achieve by properly ordering linker_file_unregister_sysctls
and linker_file_sysuninit.
The former is not as easy for two reasons:
- the initialization may depend on tunables which get set when sysctl-s are
  registered, so we need to set the tunables before running sysinit-s
- the initialization may try to dynamically add more sysctl-s under statically
  defined sysctl nodes
So, this change splits the sysctl setup into two phases.  In the first phase
the sysctl-s are registered as before but they are disabled and hidden from
consumers.  In the second phase, done after sysinit-s, normal access to the
sysctl-s is enabled.

The change should affect only dynamic module loading and unloading after
the system boot-up.  Nothing changes for sysctl-s compiled into the kernel
and sysctl-s in preloaded modules.

Discussed with:	hselasky, ian, jhb
Reviewed by:	julian, kib
MFC after:	2 weeks
Sponsored by:	Panzura
Differential Revision: https://reviews.freebsd.org/D12545
2017-10-05 12:32:14 +00:00
glebius
7168fac388 Hide struct socket and struct unpcb from the userland.
Violators may define _WANT_SOCKET and _WANT_UNPCB respectively and
are not guaranteed for stability of the structures.  The violators
list is the the usual one: libprocstat(3) and netstat(1) internally
and lsof in ports.

In struct xunpcb remove the inclusion of kernel structure and add
a bunch of spare fields.  The xsocket already has socket not included,
but add there spares as well.  Embed xsockbuf into xsocket.

Sort declarations in sys/socketvar.h to separate kernel only from
userland available ones.

PR:		221820 (exp-run)
2017-10-02 23:29:56 +00:00
alc
513b841b01 Use vm_page_active() rather than directly accessing the page's queue
field.

Reviewed by:	kib, markj
MFC after:	2 weeks
X-MFC with:	r324146
2017-10-02 07:30:21 +00:00
avg
8e8e725268 revert r324166, it has an unrelated change in it 2017-10-01 16:37:54 +00:00
avg
391a453e06 MFV r323531: 8521 nvlist memory leak in get_clones_stat() and spa_load_best()
illumos/illumos-gate@7d3000f774
7d3000f774

https://www.illumos.org/issues/8521
  Yuri reported this to the mailing list:
  doing a `reboot -d` on current illumos-gate HEAD gives the following "::
  findleaks -dv" output:
  findleaks: maximum buffers => 301061
  findleaks: actual buffers => 297587
  findleaks:
  findleaks: potential pointers => 29289774
  findleaks: dismissals => 26242305 (89.5%)
  findleaks: misses => 331153 ( 1.1%)
  findleaks: dups => 2419681 ( 8.2%)
  findleaks: follows => 296635 ( 1.0%)
  findleaks:
  findleaks: peak memory usage => 7353 kB
  findleaks: elapsed CPU time => 1.5 seconds
  findleaks: elapsed wall time => 2.0 seconds
  findleaks:
  CACHE LEAKED BUFCTL CALLER
  ffffff03d222b008 120 ffffff03ef7ceb78 nv_alloc_sys+0x1f
  ffffff03d222a448 123 ffffff03f4150cc8 nv_alloc_sys+0x1f
  ffffff03d222b448 5 ffffff03f28bd598 nv_alloc_sys+0x1f
  ffffff03d222b888 87 ffffff03f28c10f0 nv_alloc_sys+0x1f
  ffffff03d222c008 21 ffffff03f4139310 nv_alloc_sys+0x1f
  ffffff03d222b888 43 ffffff040ef3f3e8 nv_alloc_sys+0x1f
  ffffff03d222c008 120 ffffff03f4591e58 nv_alloc_sys+0x1f
  ffffff03d222b008 121 ffffff03f352c068 nv_alloc_sys+0x1f
  ffffff03d222a448 112 ffffff03f414e5f8 nv_alloc_sys+0x1f
  ffffff03d222b008 119 ffffff03ee92fdc0 nv_alloc_sys+0x1f
  ffffff03d222b888 46 ffffff03f28c1378 nv_alloc_sys+0x1f
  ffffff03d222b448 4 ffffff03f28c7708 nv_alloc_sys+0x1f
  ffffff03d222c008 20 ffffff03f2a6e7e8 nv_alloc_sys+0x1f

Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Yuri Pankov <yuripv@gmx.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>

MFC after:	5 weeks
X-MFC after:	r324163
2017-10-01 16:34:16 +00:00
markj
f30cb11686 Have uiomove_object_page() keep accessed pages in the active queue.
Previously, uiomove_object_page() would maintain LRU by requeuing the
accessed page. This involves acquiring one of the heavily contended page
queue locks. Moreover, it is unnecessarily expensive for pages in the
active queue.

As of r254304 the page daemon continually performs a slow scan of the
active queue, with the effect that unreferenced pages are gradually
moved to the inactive queue, from which they can be reclaimed. Prior to
that revision, the active queue was scanned only during shortages of
free and inactive pages, meaning that unreferenced pages could get
"stuck" in the queue. Thus, tmpfs was required to use the inactive queue
and requeue pages in order to maintain LRU. Now that this is no longer
the case, tmpfs I/O operations can use the active queue and avoid the
page queue locks in most cases, instead setting PGA_REFERENCED on
referenced pages to provide pseudo-LRU.

Reviewed by:	alc (previous version)
MFC after:	2 weeks
2017-09-30 23:41:28 +00:00
kib
81faa225ff Revert r323722. A better fix will be committed shortly, as well as
some still useful bits of the reverted revision.

The problem with the committed fix is that there are still issues with
returning from NMI, when NMI interrupted kernel in a moment where the
kernel segments selectors were still not loaded into registers.  If
this happens, the NMI return would loose the userspace selectors
because r323722 does not reload segment registers on return to kernel
mode.

Fixing the problem is complicated.  Since an alternative approach to
handle the original bug exists, it makes sence to stop adding more
complexity.

Discussed with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-09-28 08:38:24 +00:00
jhb
bb8b530dd0 Use UMA_ALIGNOF() for name cache UMA zones.
This fixes kernel crashes due to misaligned accesses to the 64-bit
time_t embedded in struct namecache_ts in MIPS n32 kernels.

MFC after:	1 week
Sponsored by:	DARPA / AFRL
2017-09-27 23:18:57 +00:00
manu
444f0ddd90 vfs_export: Simplify vfs_export_lookup
If the filesystem is not exported directly return NULL.
If no address is given and filesystem is exported using some default
one return it directly, if it doesn't have a default one directly
return NULL.

Reviewed by:	kib, bapt
MFC after:	1 week
Sponsored by:	Gandi.net
Differential Revision:	https://reviews.freebsd.org/D12505
2017-09-27 09:39:16 +00:00
mjg
92eb817a05 sysctl: remove target buffer read/write checks prior to calling the handler
Said checks were inherently racy anyway as jokers could unmap target areas
before the handler got around to accessing them.

This saves time by avoiding locking the address space.

MFC after:	1 week
2017-09-27 01:31:52 +00:00
mjg
84d28f19d5 Annotate sysctlmemlock with __exclusive_cache_line.
MFC after:	1 week
2017-09-27 01:27:43 +00:00
mjg
77579b97a2 mtx: drop the tid argument from _mtx_lock_sleep
tid must be equal to curthread and the target routine was already reading
it anyway, which is not a problem. Not passing it as a parameter allows for
a little bit shorter code in callers.

MFC after:	1 week
2017-09-27 00:57:05 +00:00
jhb
3adfca76a7 Log signal number passed to PT_STEP requests in KTR_PTRACE traces.
MFC after:	1 week
2017-09-25 20:38:55 +00:00
cem
795d0e91a5 ddb(4): Add 'show badstacks' command to show witness badstacks
Add a DDB command that mirrors sysctl debug.witness.badstacks.

Reapply r323935 after fixing trivial deficiency.  I forgot to compile with
WITNESS enabled.  Thanks emaste@ for fixing the build while I was asleep.

Reported by:	rstone
Reviewed by:	rstone (previous version)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D12468
2017-09-23 17:48:49 +00:00
emaste
82492e966c Revert r323935 as it broke the build
subr_witness.c:2577:4: error: use of undeclared identifier 'req'
                        req->oldidx = 0;
                        ^
2017-09-23 12:35:46 +00:00
shurd
07b9a7d42f Make struct grouptask gt_name member a char array
Previously, it was just a pointer which was copied, but
some callers pass in a stack variable which will go out of scope.
Add GROUPTASK_NAMELEN macro (32) and snprintf() the name into it,
using "grouptask" if name is NULL. We can now safely include
gtask->gt_name in console messages.

Reviewed by:	sbruno
Approved by:	sbruno (mentor)
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D12449
2017-09-23 01:39:16 +00:00
cem
80127a72d7 ddb(4): Add 'show badstacks' command to show witness badstacks
Add a DDB command that mirrors sysctl debug.witness.badstacks.

Reported by:	rstone
Reviewed by:	rstone
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D12468
2017-09-22 20:01:12 +00:00
mckusick
4c3c44cdd8 Continuing efforts to provide hardening of FFS, this change adds a
check hash to cylinder groups. If a check hash fails when a cylinder
group is read, no further allocations are attempted in that cylinder
group until it has been fixed by fsck. This avoids a class of
filesystem panics related to corrupted cylinder group maps. The
hash is done using crc32c.

Check hases are added only to UFS2 and not to UFS1 as UFS1 is primarily
used in embedded systems with small memories and low-powered processors
which need as light-weight a filesystem as possible.

Specifics of the changes:

sys/sys/buf.h:
    Add BX_FSPRIV to reserve a set of eight b_xflags that may be used
    by individual filesystems for their own purpose. Their specific
    definitions are found in the header files for each filesystem
    that uses them. Also add fields to struct buf as noted below.

sys/kern/vfs_bio.c:
    It is only necessary to compute a check hash for a cylinder
    group when it is actually read from disk. When calling bread,
    you do not know whether the buffer was found in the cache or
    read. So a new flag (GB_CKHASH) and a pointer to a function to
    perform the hash has been added to breadn_flags to say that the
    function should be called to calculate a hash if the data has
    been read. The check hash is placed in b_ckhash and the B_CKHASH
    flag is set to indicate that a read was done and a check hash
    calculated. Though a rather elaborate mechanism, it should
    also work for check hashing other metadata in the future. A
    kernel internal API change was to change breada into a static
    fucntion and add flags and a function pointer to a check-hash
    function.

sys/ufs/ffs/fs.h:
    Add flags for types of check hashes; stored in a new word in the
    superblock. Define corresponding BX_ flags for the different types
    of check hashes. Add a check hash word in the cylinder group.

sys/ufs/ffs/ffs_alloc.c:
    In ffs_getcg do the dance with breadn_flags to get a check hash and
    if one is provided, check it.

sys/ufs/ffs/ffs_vfsops.c:
    Copy across the BX_FFSTYPES flags in background writes.
    Update the check hash when writing out buffers that need them.

sys/ufs/ffs/ffs_snapshot.c:
    Recompute check hash when updating snapshot cylinder groups.

sys/libkern/crc32.c:
lib/libufs/Makefile:
lib/libufs/libufs.h:
lib/libufs/cgroup.c:
    Include libkern/crc32.c in libufs and use it to compute check
    hashes when updating cylinder groups.

Four utilities are affected:

sbin/newfs/mkfs.c:
    Add the check hashes when building the cylinder groups.

sbin/fsck_ffs/fsck.h:
sbin/fsck_ffs/fsutil.c:
    Verify and update check hashes when checking and writing cylinder groups.

sbin/fsck_ffs/pass5.c:
    Offer to add check hashes to existing filesystems.
    Precompute check hashes when rebuilding cylinder group
    (although this will be done when it is written in fsutil.c
    it is necessary to do it early before comparing with the old
    cylinder group)

sbin/dumpfs/dumpfs.c
    Print out the new check hash flag(s)

sbin/fsdb/Makefile:
    Needs to add libufs now used by pass5.c imported from fsck_ffs.

Reviewed by: kib
Tested by: Peter Holm (pho)
2017-09-22 12:45:15 +00:00
shurd
5116c3be34 Fix undeclared identifier error introduced in r323879
It doesn't appear to be safe to use gtask->gt_name.

Reported by:	Mark Johnston, Jenkins
Reviewed by:	sbruno
Approved by:	sbruno (mentor)
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D12448
2017-09-21 23:27:35 +00:00
jhb
c5f4b54f16 Only handle _PC_MAX_CANON, _PC_MAX_INPUT, and _PC_VDISABLE for TTY devices.
Move handling of these three pathconf() variables out of vop_stdpathconf()
and into devfs_pathconf() as TTY devices can only be devfs files.  In
addition, only return settings for these three variables for devfs devices
whose device switch has the D_TTY flag set.

Discussed with:	bde, kib
Sponsored by:	Chelsio Communications
2017-09-21 23:05:32 +00:00
shurd
c8fa21be21 Improved logging of gtaskqueue failues
Check the return code of intr_setaffinity() and log any errors
it returns. When a qid is not located, log an error before returning
failure.  Also, use __func__ rather than hardcoding the function name

Reviewed by:	sbruno
Approved by:	sbruno (mentor)
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D12436
2017-09-21 21:14:48 +00:00
shurd
8b20e8f83d Fix M_GTASKQUEUE definition
Previously had the same short and long description as taskqueues.
This could cause problems with memguard(9) and vmstat -m which use
the short description as a unique identifier.

Reviewed by:	sbruno
Approved by:	sbruno (mentor)
MFC after:	3 days
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D12438
2017-09-21 20:34:33 +00:00
kib
62cba400a3 Do not vrele() covered vnode under the mp mutex.
If vrele() changes the hold count to zero, it needs to acquire the
vnode lock.

Sponsored by:	The FreeBSD Foundation
Discussed with:	avg
X-MFC with:	r323578
2017-09-19 16:49:45 +00:00
kib
23d65de60e For unlinked files, do not msync(2) or sync on the vnode deactivation.
One consequence of the patch is that msyncing unlinked file mappings
no longer reduces the amount of the dirty memory in the system, but I
do not think that there are users of msync(2) that utilize it for such
side-effect.

Reported and tested by:	tjil
PR:	222356
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D12411
2017-09-19 16:46:37 +00:00
kib
981650a056 Fix handling of the segment registers on i386.
Suppose that userspace is executing with the non-standard segment
descriptors.  Then, until exception or interrupt handler executed
SET_KERNEL_SEGS, kernel is still executing with user %ds, %es and %fs.
If an interrupt occurs in this window, the interrupt handler is
executed unsafely, relying on usability of the usermode registers.  If
the interrupt results in the context switch on return, the
contamination of the kernel state spreads to the thread we switched
to.  As result, kernel data accesses might fault or, if only the base
is changed, completely messed up.

More, if the user segment was allocated in LDT, another thread might
mark the descriptor as invalid before doreti code tried to reload
them.  In this case kernel panics.

The issue exists for all exception entry points which use trap gate,
and thus do not automatically disable interrupts on entry, and for
lcall_handler.

Fix is two-fold: first, we need to disable interrupts for all kernel
entries, changing the IDT descriptor types from trap gate to interrupt
gate.  Interrupts are re-enabled not earlier than the kernel segments
are loaded into the segment registers.  Second, we only load the
segment registers from the trap frame when returning to usermode.  For
the later, all interrupt return paths must happen through the doreti
common code.

There is no way to disable interrupts on call gate, which is the
supposed mode of servicing for lcall $7,$0 syscalls.  Change the LDT
descriptor 0 into a code segment type and point it to the userspace
trampoline which redirects the syscall to int $0x80.

All the measures make the segment register handling similar to that of
amd64.  We do not apply amd64 optimizations of not reloading segment
registers on return from the syscall.

Reported by:	Maxime Villard <max@m00nbsd.net>
Tested by:	pho (the non-lcall part)
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D12402
2017-09-18 20:22:42 +00:00
alc
c7eb020649 Modify blst_leaf_alloc to take only the cursor argument.
Modify blst_leaf_alloc to find allocations that cross the boundary between
one leaf node and the next when those two leaves descend from the same
meta node.

Update the hint field for leaves so that it represents a bound on how
large an allocation can begin in that leaf, where it currently represents
a bound on how large an allocation can be found within the boundaries of
the leaf.

The first phase of blst_leaf_alloc currently shrinks sequences of
consecutive 1-bits in mask until each has been shrunken by count-1 bits,
so that any bits remaining show where an allocation can begin, or until
all the bits have disappeared, in which case the allocation fails. This
change amends that so that the high-order bit is copied, as if, when the
last block was free, it was followed by an endless stream of free
blocks. It also amends the early stopping condition, so that the shrinking
of 1-sequences stops early when there are none, or there is only one
unbounded one remaining.

The search for the first set bit is unchanged, and the code path
thereafter is mostly unchanged unless the first set bit is in a position
that makes some of those copied sign bits matter. In that case, we look
for a next leaf, and at what blocks it can provide, to see if a
cross-boundary allocation is possible.

The hint is updated on a successful allocation that clears the last bit,
but it not updated on a failed allocation that leaves the last bit
set. So, as long as the last block is free, the hint value for the leaf is
large. As long as the last block is free, and there's a next leaf, a large
allocation can begin here, perhaps. A stricter rule than this would mean
that allocations and frees in one leaf could require hint updates to the
preceding leaf, and this change seeks to leave the freeing code
unmodified.

Define BLIST_BMAP_MASK, and use it for bit masking in blst_leaf_free and
blist_leaf_fill, as well as in blst_leaf_alloc.

Correct a panic message in blst_leaf_free.

Submitted by:	Doug Moore <dougm@rice.edu>
Reviewed by:	markj (an earlier version)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D11819
2017-09-16 18:12:15 +00:00
shurd
acad431d06 Revert r323516 (iflib rollup)
This was really too big of a commit even if everything worked, but there
are multiple new issues introduced in the one huge commit, so it's not
worth keeping this until it's fixed.

I'll work on splitting this up into logical chunks and introduce them one
at a time over the next week or two.

Approved by:	sbruno (mentor)
Sponsored by:	Limelight Networks
2017-09-16 02:41:38 +00:00
glebius
13e2631492 Fix locking in soisconnected().
When a newborn socket moves from incomplete queue to complete
one, we need to obtain the listening socket lock after the child,
which is a wrong order.  The old code did that in potentially
endless loop of mtx_trylock().  The new one does only one attempt
of mtx_trylock(), and in case of failure references listening
socket, unlocks child and locks everything in right order.  In
case if listening socket shuts down during that, just bail out.

Reported & tested by:	Jason Eggleston <jeggleston llnw.com>
Reported & tested by:	Jason Wolfe <jason llnw.com>
2017-09-14 18:05:54 +00:00
jhb
e5ea82a50d Add AT_HWCAP and AT_EHDRFLAGS on all platforms.
A new 'u_long *sv_hwcap' field is added to 'struct sysentvec'.  A
process ABI can set this field to point to a value holding a mask of
architecture-specific CPU feature flags.  If an ABI does not wish to
supply AT_HWCAP to processes the field can be left as NULL.

The support code for AT_EHDRFLAGS was already present on all systems,
just the #define was not present.  This is a step towards unifying the
AT_* constants across platforms.

Reviewed by:	kib
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D12290
2017-09-14 14:26:55 +00:00
avg
fa3c3c05d2 dounmount: do not release the mount point's reference on the covered vnode
As long as mnt_ref is not zero there can be a consumer that might try
to access mnt_vnodecovered.  For this reason the covered vnode must not
be freed until mnt_ref goes to zero.
So, move the release of the covered vnode to vfs_mount_destroy.

Reviewed by:	kib
MFC after:	3 weeks
Differential Revision: https://reviews.freebsd.org/D12329
2017-09-14 08:47:06 +00:00
glebius
906e83affb Use soref() in sendfile(2) instead fhold() to reference a socket.
The problem is that fdrop() requires syscall context, as it may
enter sleep in some cases.  The reason to use it in the original
non-blocking sendfile implementation, was to avoid use of global
ACCEPT_LOCK() on every I/O completion. Now in head sorele() no
longer requires this lock.
2017-09-13 22:11:05 +00:00
glebius
c3c8e7f59c Fix two issues with not ready data in sockets (read: sendfile)
in UNIX sockets.

o Check that socket is still connected in uipc_ready(). If not
  we are responsible to free mbufs.
o In uipc_send() if socket appears to be disconnected, but we
  are sending data with pending I/Os, don't free mbufs.

Reported by:	Kevin Bowling <kbowling llnw.com>
Tested by:	Kevin Bowling <kbowling llnw.com>
PR:		222259
Reported by:	Mark Martinec <Mark.Martinec ijs.si>
MFC after:	3 days
2017-09-13 16:47:23 +00:00
shurd
99c641b97c Roll up iflib commits from github. This pulls in most of the work done
by Matt Macy as well as other changes which he has accepted via pull
request to his github repo at https://github.com/mattmacy/networking/

This should bring -CURRENT and the github repo into close enough sync to
allow small feature branches rather than a large chain of interdependant
patches being developed out of tree.  The reset of the synchronization
should be able to be completed on github by splitting the remaining
changes that are not yet ready into short feature branches for later
review as smaller commits.

Here is a summary of changes included in this patch:

1)  More checks when INVARIANTS are enabled for eariler problem
    detection
2)  Group Task Queue cleanups
    - Fix use of duplicate shortdesc for gtaskqueue malloc type.
      Some interfaces such as memguard(9) use the short description to
      identify malloc types, so duplicates should be avoided.
3)  Allow gtaskqueues to use ithreads in addition to taskqueues
    - In some cases, this can improve performance
4)  Better logging when taskqgroup_attach*() fails to set interrupt
    affinity.
5)  Do not start gtaskqueues until they're needed
6)  Have mp_ring enqueue function enter the ABDICATED rather than BUSY
    state.  This moves the TX to the gtaskq and allows processing to
    continue faster as well as make TX batching more likely.
7)  Add an ift_txd_errata function to struct if_txrx.  This allows
    drivers to inspect/modify mbufs before transmission.
8)  Add a new IFLIB_NEED_ZERO_CSUM for drivers to indicate they need
    checksums zeroed for checksum offload to work.  This avoids modifying
    packet data in the TX path when possible.
9)  Use ithreads for iflib I/O instead of taskqueues
10) Clean up ioctl and support async ioctl functions
11) Prefetch two cachlines from each mbuf instead of one up to 128B.  We
    often need to parse packet header info beyond 64B.
12) Fix potential memory corruption due to fence post error in
    bit_nclear() usage.
13) Improved hang detection and handling
14) If the packet is smaller than MTU, disable the TSO flags.
    This avoids extra packet parsing when not needed.
15) Move TCP header parsing inside the IS_TSO?() test.
    This avoids extra packet parsing when not needed.
16) Pass chains of mbufs that are not consumed by lro to if_input()
    rather call if_input() for each mbuf.
17) Re-arrange packet header loads to get as much work as possible done
    before a cache stall.
18) Lock the context when calling IFDI_ATTACH_PRE()/IFDI_ATTACH_POST()/
    IFDI_DETACH();
19) Attempt to distribute RX/TX tasks across cores more sensibly,
    especially when RX and TX share an interrupt.  RX will attempt to
    take the first threads on a core, and TX will attempt to take
    successive threads.
20) Allow iflib_softirq_alloc_generic() to request affinity to the same
    cpus an interrupt has affinity with.  This allows TX queues to
    ensure they are serviced by the socket the device is on.
21) Add new iflib sysctls to net.iflib:
    - timer_int - interval at which to run per-queue timers in ticks
    - force_busdma
22) Add new per-device iflib sysctls to dev.X.Y.iflib
    - rx_budget allows tuning the batch size on the RX path
    - watchdog_events Count of watchdog events seen since load
23) Fix error where netmap_rxq_init() could get called before
    IFDI_INIT()
24) e1000: Fixed version of r323008: post-cold sleep instead of DELAY
    when waiting for firmware
    - After interrupts are enabled, convert all waits to sleeps
    - Eliminates e1000 software/firmware synchronization busy waits after
      startup
25) e1000: Remove special case for budget=1 in em_txrx.c
    - Premature optimization which may actually be incorrect with
      multi-segment packets
26) e1000: Split out TX interrupt rather than share an interrupt for
    RX and TX.
    - Allows better performance by keeping RX and TX paths separate
27) e1000: Separate igb from em code where suitable
    Much easier to understand separate functions and "if (is_igb)" than
    previous tests like "if (reg_icr & (E1000_ICR_RXSEQ | E1000_ICR_LSC))"

#blamebruno

Reviewed by:	sbruno
Approved by:	sbruno (mentor)
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D12235
2017-09-13 01:18:42 +00:00
alc
e7430120e9 To analyze the allocation of swap blocks by blist functions, add a method
for analyzing the radix tree structures and reporting on the number, and
sizes, of maximal intervals of free blocks.  The report includes the number
of maximal intervals, and also the number of them in each of several size
ranges, from small (size 1, or 3 to 4) to large (28657 to 46367) with size
boundaries defined by Fibonacci numbers.  The report is written in the test
tool with the 's' command, or in a running kernel by sysctl.

The analysis of the radix tree frequently computes the position of the lone
bit set in a u_daddr_t, a computation that also appears in leaf allocation.
That computation has been moved into a function of its own, and optimized
for cases where an inlined machine instruction can replace the usual binary
search.

Submitted by:	Doug Moore <dougm@rice.edu>
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D11906
2017-09-10 17:46:03 +00:00
des
64e97ceba3 If the user tries to set kern.randompid to 1 (which is meaningless), set
it to a random value between 100 and 1123, rather than 0 as before.

Submitted by:	Marie Helene Kvello-Aune <marieheleneka@gmail.com>
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D5336
2017-09-10 15:01:29 +00:00
mjg
4161a7cc9b namecache: clean up struct namecache_ts handling
namecache_ts differs from mere namecache by few fields placed mid struct.
The access to the last element (the name) is thus special-cased.

The standard solution is to put new fields at the very beginning anad
embedd the original struct. The pointer shuffled around points to the
embedded part. If needed, access to new fields can be gained through
__containerof.

MFC after:	1 week
2017-09-10 11:17:32 +00:00
mjg
7b3cd5931a namecache: fold the unlock label into the only consumer
No functional changes.

MFC after:	1 week
2017-09-08 06:57:11 +00:00
mjg
4c5b54c23a namecache: factor out dot lookup into a dedicated function
The intent is to move uncommon cases out of the way.

MFC after:	1 week
2017-09-08 06:51:33 +00:00
mjg
21bf4a2f0f Annotate Giant with __exclusive_cache_line 2017-09-08 06:46:24 +00:00
mjg
71e5f0276c Annotate global process locks with __exclusive_cache_line
MFC after:	1 week
2017-09-08 06:46:02 +00:00
mjg
fb0f2cc9b2 Sprinkle __read_frequently on few obvious places.
Note that some of annotated variables should probably change their types
to something smaller, preferably bit-sized.
2017-09-06 20:33:33 +00:00
mjg
4cc87bd651 Start annotating global _padalign locks with __exclusive_cache_line
While these locks are guarnteed to not share their respective cache lines,
their current placement leaves unnecessary holes in lines which preceeded them.

For instance the annotation of vm_page_queue_free_mtx allows 2 neighbour
cachelines (previously separate by the lock) to be collapsed into 1.

The annotation is only effective on architectures which have it implemented in
their linker script (currently only amd64). Thus locks are not converted to
their not-padaligned variants as to not affect the rest.

MFC after:	1 week
2017-09-06 20:28:18 +00:00
trasz
052644307b Make root_mount_rel(9) ignore NULL arguments, like it used to before r313351.
It would be better to fix API consumers to not pass NULL there - most of them,
such as gmirror, already contain the neccessary checks - but this is easier
and much less error-prone.

One known user-visible result is that it fixes panic on a failed "graid label".

PR:		221846
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
2017-09-05 14:32:56 +00:00
imp
6267655c2e Add CAM/NVMe support for CAM_DATA_SG
This adds support in pass(4) for data to be described with a
scatter-gather list (sglist) to augment the existing (single) virtual
address.

Differential Revision: https://reviews.freebsd.org/D11361
Submitted by: Chuck Tuffli
Reviewed by: imp@, scottl@, kenm@
2017-08-29 15:29:57 +00:00
bdrewery
79ee71cc38 Allow vdrop() of a vnode not yet on the per-mount list after r306512.
The old code allowed calling vdrop() before insmntque() to place the vnode back
onto the freelist for later recycling.  Some downstream consumers may rely on
this support.  Normally insmntque() failing is fine since is uses vgone() and
immediately frees the vnode rather than attempting to add it to the freelist if
vdrop() were used instead.

Also assert that vhold() cannot be used on such a vnode.

Reviewed by:	kib, cem, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D12126
2017-08-28 19:29:51 +00:00
cem
c426019dec Enhance debugibility of sysctl leaf re-use warnings
Print the full conflicting oid path, and include the function name in the
warning so it is clear that the warnings are sysctl-related.

PR:		221853
Submitted by:	Fabian Keil <fk AT fabiankeil.de> (earlier version)
Sponsored by:	Dell EMC Isilon
2017-08-27 17:12:30 +00:00
cem
7d770137dd Improve scheduler performance
Improve scheduler performance by flattening nonsensical topology layers
(layers with only one child don't serve any purpose).

This is especially relevant on non-AMD Zen systems after r322776.  On my
dual core Intel laptop, this brings the kern.sched.topology_spec table down
from three levels to two.

Submitted by:	jeff
Reviewed by:	attilio
Sponsored by:	Dell EMC Isilon
2017-08-27 05:14:48 +00:00
jhb
1509e7d93a Don't grab SOCK_LOCK for soref() when queuing an AIO request.
The AIO job holds a reference on the associated file descriptor, so the
socket's count should already be > 0.  This fixes a LOR with the socket
buffer lock after recent socket locking changes in HEAD.

Sponsored by:	Chelsio Communications
2017-08-25 23:10:27 +00:00
alc
aa0a430e9b Correct a regression in the previous change, r322459. Specifically, the
removal of the "blk" parameter from blst_meta_alloc() had the unintended
effect of generating an out-of-range allocation when the cursor reaches
the end of the tree if the number of managed blocks in the tree equals
the so-called "radix" (which in the blist code is not the standard notion
of what a radix is but rather the maximum number of leaves in a tree of
the current height.)  In other words, only certain swap configurations
were affected, which is why earlier testing did not reveal the problem.

Submitted by:	Doug Moore <dougm@rice.edu>
Reported by:	pho, kib
Tested by:	pho
X-MFC with:	r322459
Differential Revision:	https://reviews.freebsd.org/D12106
2017-08-25 18:47:23 +00:00
glebius
3966551520 Third take on the r319685 and r320480. Actually allow for call soisconnected()
via soisdisconnected(), and in the earlier unlock earlier to avoid lock
recursion.

This fixes a situation when a socket on accept queue is reset before being
accepted.

Reported by:	Jason Eggleston <jeggleston llnw.com>
2017-08-24 20:49:19 +00:00
cem
103a56551d Remove unused declaration and update ddb.4
A follow-up to r322836.

Warnings for the unused declaration were breaking some second tier
architectures, but did not show up in Clang on x86.

Reported by:	markj (ddb.4), emaste (declaration)
Sponsored by:	Dell EMC Isilon
2017-08-24 19:16:25 +00:00
cem
cd248afb29 Merge print_lockchain and print_sleepchain
When debugging a deadlock, it is useful to follow the full chain of locks as
far as possible.

Reviewed by:	jhb
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D12115
2017-08-24 15:12:16 +00:00
jkim
cc8928bcd3 Fix size to copyout(9) for cpuset_getid(2).
MFC after:	3 days
2017-08-22 20:46:29 +00:00
cem
7f37053028 subr_smp: Clean up topology analysis, add additional layers
Rather than repeatedly nesting loops, separate concerns with a single loop
per call stack level.  Use a table to drive the recursive routine.  Handle
missing topology layers more gracefully (infer a single unit).

Analyze some additional optional layers which may be present on e.g. AMD Zen
systems (groups, aka dies, per package; and cachegroups, aka CCXes, per
group).

Display that additional information in the boot-time topology information,
when it is relevent (non-one).

Reviewed by:	markj@, mjoras@ (earlier version)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D12019
2017-08-22 00:10:15 +00:00
kib
b8035d686e Allow vinvalbuf() to operate with the shared vnode lock.
This mode allows other clean buffers to arrive while we flush the buf
lists for the vnode, which is fine for the targeted use.  We only need
that all buffers existed at the time of the function start were
flushed.  In fact, only one assert has to be relaxed.

In collaboration with:	pho
Reviewed by:	rmacklem
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
X-Differential revision:	https://reviews.freebsd.org/D12083
2017-08-20 10:07:45 +00:00
markj
a06088f666 Remove some unneeded subroutines for padding writes to dump devices.
Right now we only need to pad when writing kernel dump headers, so
flatten three related subroutines into one. The encrypted kernel dump
code already writes out its key in a dumper.blocksize-sized block.

No functional change intended.

Reviewed by:	cem, def
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D11647
2017-08-18 04:07:25 +00:00
markj
ce8e2801bf Rename mkdumpheader() and group EKCD functions in kern_shutdown.c.
This helps simplify the code in kern_shutdown.c and reduces the number
of globally visible functions.

No functional change intended.

Reviewed by:	cem, def
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D11603
2017-08-18 04:04:09 +00:00
markj
f6dd3eb223 Factor out duplicated kernel dump code into dump_{start,finish}().
dump_start() and dump_finish() are responsible for writing kernel dump
headers, optionally writing the key when encryption is enabled, and
initializing the initial offset into the dump device.

Also remove the unused dump_pad(), and make some functions static now that
they're only called from kern_shutdown.c.

No functional change intended.

Reviewed by:	cem, def
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D11584
2017-08-18 03:52:35 +00:00
lstewart
5b9d72a33b An off-by-one error exists in sbuf_vprintf()'s use of SBUF_HASROOM() when an
sbuf is filled to capacity by vsnprintf(), the loop exits without error, and
the sbuf is not marked as auto-extendable.

SBUF_HASROOM() evaluates true if there is room for one or more non-NULL
characters, but in the case that the sbuf was filled exactly to capacity,
SBUF_HASROOM() evaluates false. Consequently, sbuf_vprintf() incorrectly
assigns an ENOMEM error to the sbuf when in fact everything is fine, in turn
poisoning the buffer for all subsequent operations.

Correct by moving the ENOMEM assignment into the loop where it can be made
unambiguously.

As a related safety net change, explicitly check for the zero bytes drained
case in sbuf_drain() and set EDEADLK as the error. This avoids an infinite loop
in sbuf_vprintf() if a drain function were to inadvertently return a value of
zero to sbuf_drain().

Reviewed by:	cem, jtl, gallatin
MFC after:	2 weeks
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D8535
2017-08-18 02:06:28 +00:00
lstewart
0923281ac1 Implement simple record boundary tracking in sbuf(9) to avoid record splitting
during drain operations. When an sbuf is configured to use this feature by way
of the SBUF_DRAINTOEOR sbuf_new() flag, top-level sections started with
sbuf_start_section() create a record boundary marker that is used to avoid
flushing partial records.

Reviewed by:	cem,imp,wblock
MFC after:	2 weeks
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D8536
2017-08-17 07:20:09 +00:00
ian
03be283a1e Fix compile error with option DEBUG. This is fallout from some long-ago
INTRNG refactoring that didn't get caught at the time because code in a
debugf() statement isn't compiled unless DEBUG is defined.

PR:		221557
2017-08-16 16:51:55 +00:00
cem
796e094ce6 Fix a couple of comment typos
No functional change.

Submitted by:	Anton Rang <anton.rang AT isilon.com>
Sponsored by:	Dell EMC Isilon
2017-08-15 02:21:02 +00:00
ian
3dabf0d777 Add config_intrhook_oneshot(): schedule an intrhook function and unregister
it automatically after it runs.

The config_intrhook mechanism allows a driver to stall the boot process
until device(s) required for booting are available, by not allowing system
inits to proceed until all intrhook functions have been unregistered.
Virtually all existing code simply unregisters from within the hook function
when it gets called.

This new function makes that common usage more convenient. Instead of
allocating and filling in a struct, passing it to a function that might (in
theory) fail, and checking the return code, now a driver can simply call
this cannot-fail routine, passing just the intrhook function and its arg.

Differential Revision:	https://reviews.freebsd.org/D11963
2017-08-13 18:10:24 +00:00
alc
6b281b98b3 The *_meta_* functions include a radix parameter, a blk parameter, and
another parameter that identifies a starting point in the memory address
block.  Radix is a power of two, blk is a multiple of radix, and the
starting point is in the range [blk, blk+radix), so that blk can always be
computed from the other two.  This change drops the blk parameter from the
meta functions and computes it instead.  (On amd64, for example, this
change reduces subr_blist.o's text size by 7%.)

It also makes the radix parameters unsigned to address concerns that the
calculation of '-radix' might overflow without the -fwrapv option.  (See
https://reviews.freebsd.org/D11819.)

Submitted by:	Doug Moore <dougm@rice.edu>
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D11964
2017-08-13 16:39:49 +00:00
markj
bce4478d7d Have sendfile_swapin() use vm_page_grab_pages().
Reviewed by:	alc, kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D11942
2017-08-11 16:32:24 +00:00
markj
6f4724899b Modify vm_page_grab_pages() to handle VM_ALLOC_NOWAIT.
This will allow its use in sendfile_swapin().

Reviewed by:	alc, kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D11942
2017-08-11 16:29:22 +00:00
alc
c85a3e68de An invalid page can't be dirty.
Reviewed by:	kib
MFC after:	1 week
2017-08-11 16:27:54 +00:00
andrew
78ce51bf4f Only return the current cpu if it's in the cpumask. When we restrict the
cpumask it probably means we are unable to sent interrupts to CPUs outside
the map. As such only return the current CPU when it's within the mask
otherwise return the first valid CPU.

This is needed on ThunderX as, in a dual socket configuration, we are
unable to send MSI/MSI-X interrupts between sockets.

Reviewed by:	mmel
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D11957
2017-08-11 12:45:58 +00:00
glebius
3eeca31b85 Plug uninitialized stack variable leak in sendfile(2).
Reported by:	Ilja Van Sprundel <ivansprundel ioactive.com>
Submitted by:	Domagoj Stolfa <domagoj.stolfa gmail.com>
MFC after:	1 week
Security:	uninitialized stack variable leak
2017-08-09 17:48:38 +00:00
alc
318304a5b7 Introduce vm_page_grab_pages(), which is intended to replace loops calling
vm_page_grab() on consecutive page indices.  Besides simplifying the code
in the caller, vm_page_grab_pages() allows for batching optimizations.
For example, the current implementation replaces calls to vm_page_lookup()
on consecutive page indices by cheaper calls to vm_page_next().

Reviewed by:	kib, markj
Tested by:	pho (an earlier version)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D11926
2017-08-09 04:23:04 +00:00
asomers
94c85a3167 Make p1003_1b.aio_listio_max a tunable
p1003_1b.aio_listio_max is now a tunable. Its value is reflected in the
sysctl of the same name, and the sysconf(3) variable _SC_AIO_LISTIO_MAX.
Its value will be bounded from below by the compile-time constant
AIO_LISTIO_MAX and from above by the compile-time constant
MAX_AIO_QUEUE_PER_PROC and the tunable vfs.aio.max_aio_queue.

Reviewed by:	jhb, kib
MFC after:	3 weeks
Relnotes:	yes
Sponsored by:	Spectra Logic Corp
Differential Revision:	https://reviews.freebsd.org/D11601
2017-08-08 16:14:31 +00:00
br
3364e8aea9 o Replace __riscv__ with __riscv
o Replace __riscv64 with (__riscv && __riscv_xlen == 64)

This is required to support new GCC 7.1 compiler.
This is compatible with current GCC 6.1 compiler.

RISC-V is extensible ISA and the idea here is to have built-in define
per each extension, so together with __riscv we will have some subset
of these as well (depending on -march string passed to compiler):

__riscv_compressed
__riscv_atomic
__riscv_mul
__riscv_div
__riscv_muldiv
__riscv_fdiv
__riscv_fsqrt
__riscv_float_abi_soft
__riscv_float_abi_single
__riscv_float_abi_double
__riscv_cmodel_medlow
__riscv_cmodel_medany
__riscv_cmodel_pic
__riscv_xlen

Reviewed by:	ngie
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D11901
2017-08-07 14:09:57 +00:00
alc
590490023c In case readers are misled by expressions that combine multiplication and
division, add parentheses to make the precedence explicit.

Submitted by:	Doug Moore <dougm@rice.edu>
Requested by:	imp
Reviewed by:	imp
MFC after:	1 week
X-MFC after:	r321840
Differential Revision:	https://reviews.freebsd.org/D11815
2017-08-04 04:23:23 +00:00
markj
2350ab0562 Amend r321884 to check the refcount and update the class with w_mtx held.
Reviewed by:	jhb
X-MFC with:	r321884
2017-08-01 23:14:38 +00:00