the process arguments. New arguments length zero causes the drop of
the pargs instead of allocation of useless zero-length buffer.
Submitted by: Thomas Munro
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D16111
It's likely that the header was needed in the past for swi(9).
But now that code does not use swi(9) or any other interfaces defined
in sys/interrupt.h.
MFC after: 1 week
- Add tracker argument to preemptible epochs
- Inline epoch read path in kernel and tied modules
- Change in_epoch to take an epoch as argument
- Simplify tfb_tcp_do_segment to not take a ti_locked argument,
there's no longer any benefit to dropping the pcbinfo lock
and trying to do so just adds an error prone branchfest to
these functions
- Remove cases of same function recursion on the epoch as
recursing is no longer free.
- Remove the the TAILQ_ENTRY and epoch_section from struct
thread as the tracker field is now stack or heap allocated
as appropriate.
Tested by: pho and Limelight Networks
Reviewed by: kbowling at llnw dot com
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16066
Avoid pulling in all of the <sys/proc.h> dependencies by
automatically generating a stripped down thread_lite exporting
only the fields of interest. The field declarations are type checked
against the original and the offsets of the generated result is
automatically checked.
kib has expressed disagreement and would have preferred to simply
use genassym style offsets (which loses type check enforcement).
jhb has expressed dislike of it due to header pollution and a
duplicate structure. He would have preferred to just have defined
thread in _thread.h. Nonetheless, he admits that this is the only
viable solution at the moment.
The impetus for this came from mjg's D15331:
"Inline critical_enter/exit for amd64"
Reviewed by: jeff
Differential Revision: https://reviews.freebsd.org/D16078
The '%I' format in the kern.corefile sysctl limits the number of
core files that a process can generate to the number stored in the
debug.ncores sysctl. The '%I' format is replaced by the single digit
index. Previously, if all indexes were taken the kernel would overwrite
only a core file with the highest index in a filename.
Currently the system will create a new core file if there is a free
index or if all slots are taken it will overwrite the oldest one.
Reviewed by: kib(code), bcr (updating)
Differential Revision: https://reviews.freebsd.org/D15991
Differential Revision: https://reviews.freebsd.org/D16084
We have a problem with vn_fullpath_global when the file exists. Work
around it by printing the full path if the core file name starts with /,
or current working directory followed by the filename if not.
Sponsored by: Netflix
Differential Review: https://reviews.freebsd.org/D16026
sbuf to make it safe. Callers are expected to add the " " around it,
if needed.
Sponsored by: Netflix
Differential Review: https://reviews.freebsd.org/D16026
If EARLY_AP_STARTUP is not defined it is possible for an epoch to be
allocated prior to it being possible to call epoch_call without
issue.
Based on patch by andrew@
PR: 229014
Reported by: andrew
used portion of the current thread's time slice if the current thread
belongs to the process being queried (i.e., if clock_gettime is invoked
with a clock ID of CLOCK_PROCESS_CPUTIME_ID or the value provided by
passing getpid(2) to clock_getcpuclockid(3)).
The CLOCK_VIRTUAL and CLOCK_PROF timers already make this adjustment via
long-standing code in calcru(), but since those timers are not specified
by POSIX it seems useful to add it here so that the higher accuracy is
available to code which aims to be portable.
PR: 228669
Reported by: Graham Percival
Reviewed by: kib
MFC after: 1 week
The refactoring in r335479 overlooked the fact that the dynamic kenv can
also be switched to if hintmode == 0. This is problematic because the
checkmethod bits are only ever ran once, but it worked previously because
the use_kenv was a global state and the first lookup would enable it if
occurring after the dynamic environment has been setup.
Extending our local definition of use_kenv to include all non-STATIC
hintmodes as long as the dynamic_kenv is setup fixes this. We still have
potential issues if the dynamic kenv comes up while we're doing an anchored
search through the environment, but this is not much of a concern right now
because:
1.) The dynamic environment comes up super early in boot, just after kmem
2.) This is going to get rewritten to provide a safer mechanism for the
anchored searches, ensuring that we continue using the same environment
chain (dynamic env or static fallback) for all anchored search invocations
Reported by: mmamcy
X-MFC-With: r335479
An RFSTOPPED thread can't clean TDB_STOPATFORK, which is done in the
fork_return() in its context, so parent is stuck forever. Triggered
when trying to ptrace linux process. Instead of waiting for the new
thread to clear TDB_STOPATFORK, tag it as traced and reparent to the
debugger in do_fork(), and let it only notify the debugger when run.
Submitted by: Yanko Yankulov <yanko.yankulov@gmail.com>
Reviewed by: jhb
MFC after: 1 week
X-MFC-Note: keep p_dbgwait placeholder intact
Differential revision: https://reviews.freebsd.org/D15857
Summary: POWER8 and POWER9 use a single CPU register, per core, to change clock
speed. Everything else is handled by the on-chip controller. This change
necessitates a change to the cpufreq global kernel driver to bump supported
levels, as the device tree for these systems can have theoretically 256
different options. On my POWER9 Talos, the list consists of 100 items. At
16.67MHz intervals, that allows for a change of roughly 1.67GHz between lowest
and highest.
This has only been tested on the POWER9. However, since they're similar, this
should work on POWER8 as well.
Reviewed By: nwhitehorn
Differential Revision: https://reviews.freebsd.org/D15932
Some complexity exists in these bits that isn't needed. The sysctl handler,
upon change to '2', runs through the current set of hints and sets them in
the kenv.
However, this isn't at all necessary if we're pulling hints from the kenv,
static or dynamic, as the former will get added to the latter in
init_dynamic_kenv (see: kern_environment.c). We can reduce this
configuration to just adding static_hints to the kenv if we were previously
using them.
The changes in res_find are minimal and based on the observation that once
use_kenv gets set to '1' it will never be reset to '0', and it gets set to
'1' as soon as we hit fallback mode. Later work will refactor res_find a
little bit and eliminate this now-local, because it's become clear that
there's some funkiness revolving around use_kenv=1 and it being used to
imply that we're certainly looking at the dynamic_kenv.
Reviewed by: ray
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D15940
VERBOSE_SYSINIT is currently an all-or-nothing option. debug.verbose_sysinit
adds an option to have the code compiled in but quiet by default so that
getting this information from a device in the field doesn't necessarily
require distributing a recompiled kernel.
Its default is VERBOSE_SYSINIT's value as defined in the kernconf. As such,
the default behavior for simply omitting or including this option is
unchanged.
MFC after: 1 week
This is needed for efifb.
arm and ricv pmap (the two arch with arm64 that uses subr_devmap) have very
different implementation so for now only add this for arm64.
Tested with efifb on Pine64 with a few other patches.
Reviewed by: cognet
Differential Revision: https://reviews.freebsd.org/D15294
to FOREACH_PROC_IN_SYSTEM() to have a single pattern to look for.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D15916
Implement a ddb function walking the namecache to do this.
Reviewed by: jhb, mjg
Inspired by: gdb macro from jhb (old version)
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D14898
- Convert inpcbinfo info & hash locks to epoch for read and mutex for write
- Garbage collect code that handled INP_INFO_TRY_RLOCK failures as
INP_INFO_RLOCK which can no longer fail
When running 64 netperfs sending minimal sized packets on a 2x8x2 reduces
unhalted core cycles samples in rwlock rlock/runlock in udp_send from 51% to
3%.
Overall packet throughput rate limited by CPU affinity and NIC driver design
choices.
On the receiver unhalted core cycles samples in in_pcblookup_hash went from
13% to to 1.6%
Tested by LLNW and pho@
Reviewed by: jtl
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15686
Using of rwlock with multiqueue NICs for IP forwarding on high pps
produces high lock contention and inefficient. Rmlock fits better for
such workloads.
Reviewed by: melifaro, olivier
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D15789
counter_rate will be greater than ticks, resulting in counter_ratecheck()
failure. To fix this take an absolute value of the difference between
ticks and cr_ticks.
Reported by: jtl
Sponsored by: Netflix
- Filter out PRS_NEW procs as rufetch() tries taking the thread lock
which may not yet be initialized.
- Hold PROC_LOCK to ensure stability of iterating the threads.
- p_rux fields are protected by the process statlock as well.
MFC after: 2 weeks
Reviewed by: kib
Sponsored by: Dell EMC
Differential Revision: https://reviews.freebsd.org/D15809
The break() system call was renamed (several times) starting in v3
AT&T UNIX when C was invented and break was a language keyword. The
last vestage of a need for it to be called something else (eg obreak)
was removed in r225617 which consistantly prefixed all syscall
implementations.
Reviewed by: emaste, kib (older version)
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15638
Most kernel memory that is allocated after boot does not need to be
executable. There are a few exceptions. For example, kernel modules
do need executable memory, but they don't use UMA or malloc(9). The
BPF JIT compiler also needs executable memory and did use malloc(9)
until r317072.
(Note that a side effect of r316767 was that the "small allocation"
path in UMA on amd64 already returned non-executable memory. This
meant that some calls to malloc(9) or the UMA zone(9) allocator could
return executable memory, while others could return non-executable
memory. This change makes the behavior consistent.)
This change makes malloc(9) return non-executable memory unless the new
M_EXEC flag is specified. After this change, the UMA zone(9) allocator
will always return non-executable memory, and a KASSERT will catch
attempts to use the M_EXEC flag to allocate executable memory using
uma_zalloc() or its variants.
Allocations that do need executable memory have various choices. They
may use the M_EXEC flag to malloc(9), or they may use a different VM
interfact to obtain executable pages.
Now that malloc(9) again allows executable allocations, this change also
reverts most of r317072.
PR: 228927
Reviewed by: alc, kib, markj, jhb (previous version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D15691
Allow one to implement a 'car limit' for
bioq_disksort. debug.bioq_batchsize sets the size of car limit. Every
time we queue that many requests, we start over so that we limit the
latency for requests when the software queue depths are large. A value
of '0', the default, means to revert to the old behavior.
Sponsored by: Netflix
the old encodings for the lower 16 and 32 bits and only using the
higher 32 bits for unusually large major and minor numbers. This
change breaks compatibility with the previous encoding (which was only
used in -current).
Fix truncation to (essentially) 16-bit dev_t in newnfs v3.
Any encoding of device numbers gives an ABI, so it can't be changed
without translations for compatibility. Extra bits give the much
larger complication that the translations need to compress into fewer
bits. Fortunately, more than 32 bits are rarely needed, so
compression is rarely needed except for 16-bit linux dev_t where it
was always needed but never done.
The previous encoding moved the major number into the top 32 bits.
Almost no translation code handled this, so the major number was blindly
truncated away in most 32-bit encodings. E.g., for ffs, mknod(8) with
major = 1 and minor = 2 gave dev_t = 0x10000002; ffs cannot represent
this and blindly truncated it to 2. But if this mknod was run on any
released version of FreeBSD, it gives dev_t = 0x102. ffs can represent
this, but in the previous encoding it was not decoded, giving major = 0,
minor = 0x102.
The presence of bugs was most obvious for exporting dev_t's from an
old system to -current, since bugs in newnfs augment them. I fixed
oldnfs to support 32-bit dev_t in 1996 (r16634), but this regressed
to 16-bit dev_t in newnfs, first to the old 16-bit encoding and then
further in -current. E.g., old ad0 with major = 234, minor = 0x10002
had the correct (major, minor) number on the wire, but newnfs truncated
this to (234, 2) and then the previous encoding shifted the major
number into oblivion as seen by ffs or old applications.
I first tried to fix this by translating on every ABI/API boundary, but
there are too many boundaries and too many sloppy translations by blind
truncation. So use the old encoding for the low 32 bits so that sloppy
translations work no worse than before provided the high 32 bits are
not set. Add some error checking for when bits are lost. Keep not
doing any error checking for translations for almost everything in
compat/linux.
compat/freebsd32/freebsd32_misc.c:
Optionally check for losing bits after possibly-truncating assignments as
before.
compat/linux/linux_stats.c:
Depend on the representation being compatible with Linux's (or just with
itself for local use) and spell some of the translations as assignments in
a macro that hides the details.
fs/nfsclient/nfs_clcomsubs.c:
Essentially the same fix as in 1996, except there is now no possible
truncation in makedev() itself. Also fix nearby style bugs.
kern/vfs_syscalls.c:
As for freebsd32. Also update the sysctl description to include file
numbers, and change it to describe device ids as device numbers.
sys/types.h:
Use inline functions (wrapped by macros) since the expressions are now
a bit too complicated for plain macros. Describe the encoding and
some of the reasons for it. 16-bit compatibility didn't leave many
reasonable choices for the 32-bit encoding, and 32-bit compatibility
doesn't leave many reasonable choices for the 64-bit encoding. My
choice is to put the 8 new minor bits in the low 8 bits of the top 32
bits. This minimizes discontiguities.
Reviewed by: kib (except for rewrite of the comment in linux_stats.c)
of 64-bit dev_t's (but not ones involving dev_t's).
st_size was supposed to be clamped in cvtstat() and linux's copy_stat(),
but the clamping code wasn't aware that st_size is signed, and also had
an obfuscated off-by-1 value for the unsigned limit, so its effect was
to produce a bizarre negative size instead of clamping.
Change freebsd32's copy_ostat() to be no worse than cvtstat(). It was
missing clamping and bzero()ing of padding.
Reviewed by: kib (except a final fix of the clamp to the signed maximum)
- non NULL controlp is not an error, returning EINVAL
would cause X forwarding to fail
- MSG_PEEK and MSG_WAITALL are fairly exceptional, but we still
want to handle them - punt to soreceive_generic
callbacks to perform additional cleanup actions at the time a socket is
closed.
Michio Honda presented a use for this at BSDCan 2018.
(See https://www.bsdcan.org/2018/schedule/events/965.en.html .)
Submitted by: Michio Honda <micchie at sfc.wide.ad.jp> (previous version)
Reviewed by: lstewart (previous version)
Differential Revision: https://reviews.freebsd.org/D15706
Turns out there is code which ends up passing M_ZERO to counters.
Since counters zero unconditionally on their own, just ignore drop the
flag in that place.
pmc_process_interrupt takes 5 arguments when only 3 are needed.
cpu is always available in curcpu and inuserspace can always be
derived from the passed trapframe.
While facially a reasonable cleanup this change was motivated
by the need to workaround a compiler bug.
core2_intr(cpu, tf) ->
pmc_process_interrupt(cpu, ring, pmc, tf, inuserspace) ->
pmc_add_sample(cpu, ring, pm, tf, inuserspace)
In the process of optimizing the tail call the tf pointer was getting
clobbered:
(kgdb) up
at /storage/mmacy/devel/freebsd/sys/dev/hwpmc/hwpmc_mod.c:4709
4709 pmc_save_kernel_callchain(ps->ps_pc,
(kgdb) up
1205 error = pmc_process_interrupt(cpu, PMC_HR, pm, tf,
resulting in a crash in pmc_save_kernel_callchain.
Rack includes the following features:
- A different SACK processing scheme (the old sack structures are not used).
- RACK (Recent acknowledgment) where counting dup-acks is no longer done
instead time is used to knwo when to retransmit. (see the I-D)
- TLP (Tail Loss Probe) where we will probe for tail-losses to attempt
to try not to take a retransmit time-out. (see the I-D)
- Burst mitigation using TCPHTPS
- PRR (partial rate reduction) see the RFC.
Once built into your kernel, you can select this stack by either
socket option with the name of the stack is "rack" or by setting
the global sysctl so the default is rack.
Note that any connection that does not support SACK will be kicked
back to the "default" base FreeBSD stack (currently known as "default").
To build this into your kernel you will need to enable in your
kernel:
makeoptions WITH_EXTRA_TCP_STACKS=1
options TCPHPTS
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D15525
controller that tries to handle early invocations of the controller,
in other words, invocations before the expected end of the interval.
However, there were some calculation errors in this early invocation
case. Notably, if an early invocation occurred while the error was
negative, the derivative term was off by a large amount. One visible
effect of this error was that processes were being killed by the
virtual memory system's OOM killer when in fact there was plentiful
free memory.
Correct a couple minor errors in the sysctl descriptions, and apply
some style fixes.
Reviewed by: jeff, markj
This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple
programs or threads to bind to the same port and incoming connections will be
load balanced using a hash function.
Most of the code was copied from a similar patch for DragonflyBSD.
However, in DragonflyBSD, load balancing is a global on/off setting and can not
be set per socket. This patch allows for simultaneous use of both the current
SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system.
Required changes to structures:
Globally change so_options from 16 to 32 bit value to allow for more options.
Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets.
Limitations:
As DragonflyBSD, a load balance group is limited to 256 pcbs (256 programs or
threads sharing the same socket).
This is a substantially different contribution as compared to its original
incarnation at svn r332894 and reverted at svn r332967. Thanks to rwatson@
for the substantive feedback that is included in this commit.
Submitted by: Johannes Lundberg <johalun0@gmail.com>
Obtained from: DragonflyBSD
Relnotes: Yes
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D11003