Commit Graph

16124 Commits

Author SHA1 Message Date
Jonathan T. Looney
0766f278d8 Make UMA and malloc(9) return non-executable memory in most cases.
Most kernel memory that is allocated after boot does not need to be
executable.  There are a few exceptions.  For example, kernel modules
do need executable memory, but they don't use UMA or malloc(9).  The
BPF JIT compiler also needs executable memory and did use malloc(9)
until r317072.

(Note that a side effect of r316767 was that the "small allocation"
path in UMA on amd64 already returned non-executable memory.  This
meant that some calls to malloc(9) or the UMA zone(9) allocator could
return executable memory, while others could return non-executable
memory.  This change makes the behavior consistent.)

This change makes malloc(9) return non-executable memory unless the new
M_EXEC flag is specified.  After this change, the UMA zone(9) allocator
will always return non-executable memory, and a KASSERT will catch
attempts to use the M_EXEC flag to allocate executable memory using
uma_zalloc() or its variants.

Allocations that do need executable memory have various choices.  They
may use the M_EXEC flag to malloc(9), or they may use a different VM
interfact to obtain executable pages.

Now that malloc(9) again allows executable allocations, this change also
reverts most of r317072.

PR:		228927
Reviewed by:	alc, kib, markj, jhb (previous version)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D15691
2018-06-13 17:04:41 +00:00
Warner Losh
a971acbc25 Implement a 'car limit' for bioq.
Allow one to implement a 'car limit' for
bioq_disksort. debug.bioq_batchsize sets the size of car limit. Every
time we queue that many requests, we start over so that we limit the
latency for requests when the software queue depths are large. A value
of '0', the default, means to revert to the old behavior.

Sponsored by: Netflix
2018-06-13 16:48:07 +00:00
Bruce Evans
ab35e1c71b Fix the encoding of major and minor numbers in 64-bit dev_t by restoring
the old encodings for the lower 16 and 32 bits and only using the
higher 32 bits for unusually large major and minor numbers.  This
change breaks compatibility with the previous encoding (which was only
used in -current).

Fix truncation to (essentially) 16-bit dev_t in newnfs v3.

Any encoding of device numbers gives an ABI, so it can't be changed
without translations for compatibility.  Extra bits give the much
larger complication that the translations need to compress into fewer
bits.  Fortunately, more than 32 bits are rarely needed, so
compression is rarely needed except for 16-bit linux dev_t where it
was always needed but never done.

The previous encoding moved the major number into the top 32 bits.
Almost no translation code handled this, so the major number was blindly
truncated away in most 32-bit encodings.  E.g., for ffs, mknod(8) with
major = 1 and minor = 2 gave dev_t = 0x10000002; ffs cannot represent
this and blindly truncated it to 2.  But if this mknod was run on any
released version of FreeBSD, it gives dev_t = 0x102.  ffs can represent
this, but in the previous encoding it was not decoded, giving major = 0,
minor = 0x102.

The presence of bugs was most obvious for exporting dev_t's from an
old system to -current, since bugs in newnfs augment them.  I fixed
oldnfs to support 32-bit dev_t in 1996 (r16634), but this regressed
to 16-bit dev_t in newnfs, first to the old 16-bit encoding and then
further in -current.  E.g., old ad0 with major = 234, minor = 0x10002
had the correct (major, minor) number on the wire, but newnfs truncated
this to (234, 2) and then the previous encoding shifted the major
number into oblivion as seen by ffs or old applications.

I first tried to fix this by translating on every ABI/API boundary, but
there are too many boundaries and too many sloppy translations by blind
truncation.  So use the old encoding for the low 32 bits so that sloppy
translations work no worse than before provided the high 32 bits are
not set.  Add some error checking for when bits are lost.  Keep not
doing any error checking for translations for almost everything in
compat/linux.

compat/freebsd32/freebsd32_misc.c:
Optionally check for losing bits after possibly-truncating assignments as
before.

compat/linux/linux_stats.c:
Depend on the representation being compatible with Linux's (or just with
itself for local use) and spell some of the translations as assignments in
a macro that hides the details.

fs/nfsclient/nfs_clcomsubs.c:
Essentially the same fix as in 1996, except there is now no possible
truncation in makedev() itself.  Also fix nearby style bugs.

kern/vfs_syscalls.c:
As for freebsd32.  Also update the sysctl description to include file
numbers, and change it to describe device ids as device numbers.

sys/types.h:
Use inline functions (wrapped by macros) since the expressions are now
a bit too complicated for plain macros.  Describe the encoding and
some of the reasons for it.  16-bit compatibility didn't leave many
reasonable choices for the 32-bit encoding, and 32-bit compatibility
doesn't leave many reasonable choices for the 64-bit encoding.  My
choice is to put the 8 new minor bits in the low 8 bits of the top 32
bits.  This minimizes discontiguities.

Reviewed by:	kib (except for rewrite of the comment in linux_stats.c)
2018-06-13 12:22:00 +00:00
Bruce Evans
372639f944 Fix some bugs found while fixing the representation and translation
of 64-bit dev_t's (but not ones involving dev_t's).

st_size was supposed to be clamped in cvtstat() and linux's copy_stat(),
but the clamping code wasn't aware that st_size is signed, and also had
an obfuscated off-by-1 value for the unsigned limit, so its effect was
to produce a bizarre negative size instead of clamping.

Change freebsd32's copy_ostat() to be no worse than cvtstat().  It was
missing clamping and bzero()ing of padding.

Reviewed by:	kib (except a final fix of the clamp to the signed maximum)
2018-06-13 08:50:43 +00:00
Ed Maste
00ce0c6258 makesyscalls: simplify capenabled pipeline
Replace cat + 2x grep with one grep.

Sponsored by:	Turing Robotic Industries
2018-06-11 18:57:40 +00:00
Matt Macy
0ea9d9376e limit change to fixing controlp handling pending review 2018-06-11 17:10:19 +00:00
Matt Macy
c34bf30069 soreceive_stream: correctly handle edge cases
- non NULL controlp is not an error, returning EINVAL
  would cause X forwarding to fail

- MSG_PEEK and MSG_WAITALL are fairly exceptional, but we still
  want to handle them - punt to soreceive_generic
2018-06-11 16:31:42 +00:00
Mateusz Guzik
0001edb823 counter: add a bit missed in r334858
It happens to be a noop.
2018-06-08 22:06:32 +00:00
Matt Macy
a62b4665f4 AF_UNIX: bring uipc_ready in compliance with new locking protocol
PR:	228742
Submitted by: markj
Reviewed by:	markj
2018-06-08 20:31:59 +00:00
Jonathan T. Looney
1fbe13cf4b Add a socket destructor callback. This allows kernel providers to set
callbacks to perform additional cleanup actions at the time a socket is
closed.

Michio Honda presented a use for this at BSDCan 2018.
(See https://www.bsdcan.org/2018/schedule/events/965.en.html .)

Submitted by:	Michio Honda <micchie at sfc.wide.ad.jp> (previous version)
Reviewed by:	lstewart (previous version)
Differential Revision:	https://reviews.freebsd.org/D15706
2018-06-08 19:35:24 +00:00
Mateusz Guzik
b8af2820f6 uma: fix up r334824
Turns out there is code which ends up passing M_ZERO to counters.
Since counters zero unconditionally on their own, just ignore drop the
flag in that place.
2018-06-08 05:40:36 +00:00
Matt Macy
eb7c901995 hwpmc: simplify calling convention for hwpmc interrupt handling
pmc_process_interrupt takes 5 arguments when only 3 are needed.
cpu is always available in curcpu and inuserspace can always be
derived from the passed trapframe.

While facially a reasonable cleanup this change was motivated
by the need to workaround a compiler bug.

core2_intr(cpu, tf) ->
  pmc_process_interrupt(cpu, ring, pmc, tf, inuserspace) ->
    pmc_add_sample(cpu, ring, pm, tf, inuserspace)

In the process of optimizing the tail call the tf pointer was getting
clobbered:

(kgdb) up
    at /storage/mmacy/devel/freebsd/sys/dev/hwpmc/hwpmc_mod.c:4709
4709                                pmc_save_kernel_callchain(ps->ps_pc,
(kgdb) up
1205                    error = pmc_process_interrupt(cpu, PMC_HR, pm, tf,

resulting in a crash in pmc_save_kernel_callchain.
2018-06-08 04:58:03 +00:00
Randall Stewart
89e560f441 This commit brings in a new refactored TCP stack called Rack.
Rack includes the following features:
 - A different SACK processing scheme (the old sack structures are not used).
 - RACK (Recent acknowledgment) where counting dup-acks is no longer done
        instead time is used to knwo when to retransmit. (see the I-D)
 - TLP (Tail Loss Probe) where we will probe for tail-losses to attempt
        to try not to take a retransmit time-out. (see the I-D)
 - Burst mitigation using TCPHTPS
 - PRR (partial rate reduction) see the RFC.

Once built into your kernel, you can select this stack by either
socket option with the name of the stack is "rack" or by setting
the global sysctl so the default is rack.

Note that any connection that does not support SACK will be kicked
back to the "default" base  FreeBSD stack (currently known as "default").

To build this into your kernel you will need to enable in your
kernel:
   makeoptions WITH_EXTRA_TCP_STACKS=1
   options TCPHPTS

Sponsored by:	Netflix Inc.
Differential Revision:		https://reviews.freebsd.org/D15525
2018-06-07 18:18:13 +00:00
Alan Cox
5b274055d1 When pidctrl_daemon() is called multiple times within an interval, it
should use the cumulative error to calculate the output.
2018-06-07 07:48:50 +00:00
Matt Macy
fcabd54160 AF_UNIX: check for unp == unp2 on disconnect 2018-06-07 04:57:40 +00:00
Alan Cox
e768070ca9 pidctrl_daemon() implements a variation on the classical, discrete PID
controller that tries to handle early invocations of the controller,
in other words, invocations before the expected end of the interval.
However, there were some calculation errors in this early invocation
case.  Notably, if an early invocation occurred while the error was
negative, the derivative term was off by a large amount.  One visible
effect of this error was that processes were being killed by the
virtual memory system's OOM killer when in fact there was plentiful
free memory.

Correct a couple minor errors in the sysctl descriptions, and apply
some style fixes.

Reviewed by:	jeff, markj
2018-06-07 02:54:11 +00:00
Sean Bruno
1a43cff92a Load balance sockets with new SO_REUSEPORT_LB option.
This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple
programs or threads to bind to the same port and incoming connections will be
load balanced using a hash function.

Most of the code was copied from a similar patch for DragonflyBSD.

However, in DragonflyBSD, load balancing is a global on/off setting and can not
be set per socket. This patch allows for simultaneous use of both the current
SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system.

Required changes to structures:
Globally change so_options from 16 to 32 bit value to allow for more options.
Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets.

Limitations:
As DragonflyBSD, a load balance group is limited to 256 pcbs (256 programs or
threads sharing the same socket).

This is a substantially different contribution as compared to its original
incarnation at svn r332894 and reverted at svn r332967.  Thanks to rwatson@
for the substantive feedback that is included in this commit.

Submitted by:	Johannes Lundberg <johalun0@gmail.com>
Obtained from:	DragonflyBSD
Relnotes:	Yes
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D11003
2018-06-06 15:45:57 +00:00
Justin Hibbits
3f9e1fc8ee Revert r334708
This is the wrong place to put the barrier.
Requested by:	kib,mjg
2018-06-06 15:12:19 +00:00
Justin Hibbits
32c369f40c Add a memory barrier after taking a reference on the vnode holdcnt in _vhold
This is needed to avoid a race between the VNASSERT() below, and another
thread updating the VI_FREE flag, on weakly-ordered architectures.

On a 72-thread POWER9, without this barrier a 'make -j72 buildworld' would
panic on the assert regularly.

It may be possible to use a weaker barrier, and I'll investigate that once
all stability issues are worked out on POWER9.
2018-06-06 12:57:11 +00:00
Matt Macy
ebfaf69cc0 hwpmc: log name->pid, name->tid mappings
By logging all threads and processes 'pmc filter'
can now filter on process or thread name, relieving
the user of the burden of determining which tid or
pid was which when the sample was taken.

% pmc filter -T if_io_tqg -P nginx pmc.log pmc-iflib.log

% pmc filter -x -T idle pmc.log pmc-noidle.log
2018-06-05 04:26:40 +00:00
Mark Johnston
97bc9a9384 Regen after r334626. 2018-06-04 19:36:47 +00:00
Mark Johnston
9f9c9b22ec Reimplement brk() and sbrk() to avoid the use of _end.
Previously, libc.so would initialize its notion of the break address
using _end, a special symbol emitted by the static linker following
the bss section.  Compatibility issues between lld and ld.bfd could
cause the wrong definition of _end (libc.so's definition rather than
that of the executable) to be used, breaking the brk()/sbrk()
interface.

Avoid this problem and future interoperability issues by simply not
relying on _end.  Instead, modify the break() system call to return
the kernel's view of the current break address, and have libc
initialize its state using an extra syscall upon the first use of the
interface.  As a side effect, this appears to fix brk()/sbrk() usage
in executables run with rtld direct exec, since the kernel and libc.so
no longer maintain separate views of the process' break address.

PR:		228574
Reviewed by:	kib (previous version)
MFC after:	2 months
Differential Revision:	https://reviews.freebsd.org/D15663
2018-06-04 19:35:15 +00:00
Alan Cox
3e7cb27cdd Use a single, consistent approach to returning success versus failure in
vm_map_madvise().  Previously, vm_map_madvise() used a traditional Unix-
style "return (0);" to indicate success in the common case, but Mach-
style return values in the edge cases.  Since KERN_SUCCESS equals zero,
the only problem with this inconsistency was stylistic.  vm_map_madvise()
has exactly two callers in the entire source tree, and only one of them
cares about the return value.  That caller, kern_madvise(), can be
simplified if vm_map_madvise() consistently uses Unix-style return
values.

Since vm_map_madvise() uses the variable modify_map as a Boolean, make it
one.

Eliminate a redundant error check from kern_madvise().  Add a comment
explaining where the check is performed.

Explicitly note that exec_release_args_kva() doesn't care about
vm_map_madvise()'s return value.  Since MADV_FREE is passed as the
behavior, the return value will always be zero.

Reviewed by:	kib, markj
MFC after:	7 days
2018-06-04 16:28:06 +00:00
Matt Macy
5de96e33d6 hwpmc: support sampling both kernel and user stacks when interrupted in kernel
This adds the -U options to pmcstat which will attribute in-kernel samples
back to the user stack that invoked the system call. It is not the default,
because when looking at kernel profiles it is generally more desirable to
merge all instances of a given system call together.

Although heavily revised, this change is directly derived from D7350 by
Jonathan T. Looney.

Obtained from: jtl
Sponsored by: Juniper Networks, Limelight Networks
2018-06-04 01:10:23 +00:00
Mateusz Guzik
d0a22279db Remove an unused argument to turnstile_unpend.
PR:	228694
Submitted by:	Julian Pszczołowski <julian.pszczolowski@gmail.com>
2018-06-02 22:37:53 +00:00
Mateusz Guzik
34c538c356 malloc: try to use builtins for zeroing at the callsite
Plenty of allocation sites pass M_ZERO and sizes which are small and known
at compilation time. Handling them internally in malloc loses this information
and results in avoidable calls to memset.

Instead, let the compiler take the advantage of it whenever possible.

Discussed with:	jeff
2018-06-02 22:20:09 +00:00
Mark Johnston
3fb14f61e1 Avoid completing I/O when dumping core after a panic.
Filesystem or pager completion callbacks are generally non-functional
after a panic and may trigger deadlocks if invoked in this context
(e.g., by attempting to destroying a buffer mapping).  To avoid this
situation, short-circuit I/O completion in biodone().

Reviewed by:	imp
Discussed with:	mav
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D15592
2018-06-01 23:49:32 +00:00
Ed Maste
b8d908b71e ANSIfy sys/kern 2018-06-01 13:26:45 +00:00
Warner Losh
c580ca4cf4 Make the data returned by devinfo harder to overflow.
Rather than using fixed-length strings, pack them into a string table
to return. Also expand the buffer from ~300 charaters to 3k. This should
be enough, even for USB.

This fixes a problem where USB pnp info is truncated on return to
userland.

Differential Revision: https://reviews.freebsd.org/D15629
2018-05-31 02:57:58 +00:00
Brooks Davis
64b378f1e1 Remove alternative names that are identical to the default.
Verified by make sysent producing no changes.
2018-05-30 22:22:58 +00:00
Ed Maste
f912a970e6 link_elf_obj: correct an error message
Previously we'd report that a file has "no valid symbol table" if it in
fact had two or more.  Change the message to report that there must be
exactly one.
2018-05-30 12:55:27 +00:00
Matt Macy
e445381f13 epoch(9): make epoch closer to style(9) 2018-05-30 03:39:57 +00:00
Stephen Hurd
3e0e6330b5 iflib: mark irq allocation name parameter as constant
The *name parameter passed to iflib_irq_alloc_generic and
iflib_softirq_alloc_generic is never modified. Many places in code pass
string literals and thus should not be modified.

Mark the *name parameter as a const char * instead, so that we enforce
that the name is not modified before passing to bus_describe_intr()

Submitted by:	Jacob Keller <jacob.e.keller@intel.com>
Reviewed by:	kmacy
Sponsored by:	Intel Corporation
Differential Revision:	https://reviews.freebsd.org/D15343
2018-05-29 21:56:39 +00:00
Gleb Smirnoff
147cd40fe7 Revert second chunk of r333860. The warning from gcc is false positive. The
npages won't be ever used in no space case.
2018-05-29 21:45:15 +00:00
Matt Macy
b99aa0fbb2 hwpmc: don't enter epoch section across mmap hook 2018-05-29 18:03:48 +00:00
Brooks Davis
d8b2f0790b Correct pointer subtraction in KASSERT().
The assertion would never fire without truly spectacular future
programming errors.

Reported by:	Coverity
CID:		1391367, 1391368
Sponsored by:	DARPA, AFRL
2018-05-29 17:49:03 +00:00
Andriy Gapon
ec6faf94c4 add support for console resuming, implement it for uart, use on x86
This change adds a new optional console method cn_resume and a kernel
console interface cnresume.  Consoles that may need to re-initialize
their hardware after suspend (e.g., because firmware does not care to do
it) will implement cn_resume.  Note that it is called in rather early
environment not unlike early boot, so the same restrictions apply.
Platform specific code, for platforms that support hardware suspend,
should call cnresume early after resume, before any console output is
expected.

This change fixes a problem with a system of mine failing to resume when
a serial console is used.  I found that the serial port was in a strange
configuration and an attempt to write to it likely resulted in an
infinite loop.

To avoid adding cn_resume method to every console driver, CONSOLE_DRIVER
macro has been extended to support optional methods.

Reviewed by:	imp, mav
MFC after:	3 weeks
Differential Revision: https://reviews.freebsd.org/D15552
2018-05-29 16:16:24 +00:00
Matt Macy
552b3e1798 witness/hwpmc: fix locking order for pmc locks 2018-05-28 23:14:38 +00:00
Eric van Gyzen
70d66bcf28 kern_cpuset: fix small leak on error path
The "mask" was leaked on some error paths.

Reported by:	Coverity
CID:		1384683
Sponsored by:	Dell EMC
2018-05-26 14:23:11 +00:00
Eric van Gyzen
16b51429d2 kdb_trap: Fix use of uninitialized data
In some cases, other_cpus was used without being initialized.
Thankfully, it was harmless.

Reported by:	Coverity
CID:		1385265
Sponsored by:	Dell EMC
2018-05-26 14:01:44 +00:00
Brooks Davis
659a2e9243 Regen after r334223: make vadvise compat freebsd11. 2018-05-25 20:41:26 +00:00
Brooks Davis
7351a8bdb5 Make vadvise compat freebsd11.
The vadvise syscall (aka ovadvise) is undocumented and has always been
implmented as returning EINVAL.  Put the syscall under COMPAT11 and
provide a userspace implementation.

Reviewed by:	kib
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D15557
2018-05-25 20:40:23 +00:00
Matt Macy
acf9fd05d8 AF_UNIX: It is possible for UNIX datagram sockets to be connected
to themselves. The updated code assumed that that could not happen
and would try to lock the unp mutex twice.

There may be a lingering issue here but this fixes it for the
reporter.

PR:	228458
Reported by:	marieheleneka at gmail.com
2018-05-24 21:13:46 +00:00
Matt Macy
c684c14ce3 AF_UNIX: evidently Samba likes to connect a unix socket to itself, fix locking 2018-05-24 18:22:13 +00:00
Matt Macy
a3a734908b AF_UNIX in connectat unp and unp2 can be the same 2018-05-24 18:22:05 +00:00
Conrad Meyer
a0638b33f7 Yank crufty INTR_FILTER option
It was introduced to the tree in r169320 and r169321 in May 2007.

It never got much use and never became a kernel default.  The code
duplicates the default path quite a bit, with slight modifications.  Just
yank out the cruft.  Whatever goals were being aimed for can probably be met
within the existing framework, without a flag day option.

Mostly mechanical change: 'unifdef -m -UINTR_FILTER'.

Reviewed by:	mmacy
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D15546
2018-05-24 17:06:00 +00:00
Brooks Davis
5f77b8a88b Avoid two suword() calls per auxarg entry.
Instead, construct an auxargs array and copy it out all at once.

Use an array of Elf_Auxinfo rather than pairs of Elf_Addr * to represent
the array. This is the correct type where pairs of words just happend
to work. To reduce the size of the diff, AUXARGS_ENTRY is altered to act
on this array rather than introducing a new macro.

Return errors on copyout() and suword() failures and handle them in the
caller.

Incidentally fixes AT_RANDOM and AT_EXECFN in 32-bit linux on amd64
which incorrectly used AUXARG_ENTRY instead of AUXARGS_ENTRY_32
(now removed due to the use of proper types).

Reviewed by:	kib
Comments from:	emaste, jhb
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D15485
2018-05-24 16:25:18 +00:00
Bjoern A. Zeeb
bb8f162363 Try to be consistent and spell "vnet" lower case like all the
other options (and as we do on command line).

Sponsored by:	iXsystems, Inc.
2018-05-24 15:31:05 +00:00
Bjoern A. Zeeb
36b41cc336 Improve the KASSERT to also have the prison pointer.
Helpful when debugging from ddb.

Sponsored by:		iXsystems, Inc.
2018-05-24 15:28:21 +00:00
Matt Macy
16529dace8 AF_UNIX: assert that we're not acquiring the same lock 2018-05-24 15:28:16 +00:00