With Flower (CloudABI's network connection daemon) becoming more
complete, there is no longer any need for creating any unconnected
sockets. Socket pairs in combination with file descriptor passing is all
that is necessary, as that is what is used by Flower to pass network
connections from the public internet to listening processes.
Remove all of the kernel bits that were used to implement socket(),
listen(), bindat() and connectat(). In principle, accept() and
SO_ACCEPTCONN may also be removed, but there are still some consumers
left.
Obtained from: https://github.com/NuxiNL/cloudabi
MFC after: 1 month
- map the hard-coded frame buffer address above KERNBASE. Using the
physical address only worked because of larger mapping bugs.
The hard-coded frame buffer address only works on x86. Use messy ifdefs
to try to avoid warnings about unused code for other arches.
- remove the sysctl for reading and writing the table kernel console
attributes. Writing only worked for emergency output since normal
output uses unalterd copies.
- fix the test for the emergency console being usable
- explain why a hard-coded attribute is used very early. Emergency output
works on x86 even before the pcpu pointer is initialized.
vnode lock.
Caller of softdep_count_dependencies() may own a buffer lock, which
might conflict with the lock order.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 10 days
Advertise this by changing the defaults to mostly red. If you don't like
this, change them (almost) back using:
vidcontrol -c charcolors,base=7,height=0
vidcontrol -c mousecolors,base=0[,height=15]
The (graphics mode only) mouse cursor colors were hard-coded to a black
border and lightwhite interior. Black for the border is the worst
possible default, since it is the same as the default black background
and not good for any dark background. Reversing this gives the better
default of X Windows. Coloring everything works better still. Now
the coloring defaults to a lightwhite border and red interior.
Coloring for the character cursor is more complicated and mode
dependent. The new coloring doesn't apply for hardware cursors. For
non-block cursors, it only applies in graphics mode. In text mode,
the cursor color was usually a hard-coded (dull)white for the background
only, unless the foreground was white when it was a hard-coded black
for the background only, unless the foreground was white and the
background was black it was reverse video. In graphics mode, it was
always reverse video for the block cursor. Reverse video is worse,
especially over cutmarking regions, since cutmarking still uses simple
reverse video (nothing better is possible in text mode) and double
reverse video for the cursor gives normal video. Now, graphics mode
uses the same algorithm as the best case for text mode in all cases
for graphics mode. The hard-coded sequence { white, black, } for the
background is now { red, white, blue, } where the first 2 colors can
be configured. The blue color at the end is a sentinel which prevents
reverse video being used in most cases but breaks the compatibility
setting for white on black and black on white characters. This will
be fixed later. The compatibility setting is most needed for mono modes.
The previous commit to syscons.c changed sc_cnterm() to be more careful.
It followed null pointers in some cases. But sc_cnterm() has been
unreachable for 15+ years since changes for multiple consoles turned
off calls to the the cnterm destructor for all console drivers. Before
them, it was only called at boot time. So no driver with an attached
console has ever been unloadable and not even the non-console destructors
have been tested much.
These files are compiled in userland too, so we can't use sys/systm.h
and rely on CTASSERT. Switch to using _Static_assert instead.
MFC After: 3 days
Sponsored by: Netflix
card has to do PCIe transactions to complete the reset process, but
can't do them, per the PCIe spec, unless bus mastering is enabled.
Submitted by: Kinjal Patel
PR: 22166
terminal state for kernel console output.
r56043 in 2000 added many complications to support dynamic selection
of the terminal emulator using modules and the ioctl CONS_SETTERM.
This was never completed. There are still no modules, but it is easy
to restore the scterm and dumb emulators at compile time. Then
boot-time configuration for the preferred one doesn't work right, but
CONS_SETTERM almost works after fixing this bug. CONS_SETTERM only
switches the emulator for the user state, leaving the kernel state(s)
still using the boot-time emulator. The fix is especially important
when switching from sc to scteken, since the scteken state has pointers
in it.
Rename kernel_console_ts to sc_kts.
The previous update to the driver to 3.2.12-k changed the VF's API version
to 1.2, but did not let the VF fall back to 1.1 or 1.0 versions. So, this
patch tries 1.2 first, then the older versions in succession if that fails.
This should allow the VF driver to negotiate 1.1 and work with older PF
drivers, such as the one used in Amazon's EC2 service.
PR: 220872
Submitted by: Jeb Cramer <jeb.j.cramer@intel.com>
MFC after: 1 week
Sponsored by: Intel Corporation
extreme outliers from dodgy drives. Adjust comments to reflect this,
and make sure that the number of latency buckets match in the two
places where it matters.
o Allow I/O scheduler to gather times on 32-bit systems. We do this by shifting
the sbintime_t over by 8 bits and truncating to 32-bits. This gives us 8.24
time. This is sufficient both in range (256 seconds is about 128x what current
users need) and precision (60ns easily meets the 1ms smallest bucket size
measurements). 64-bit systems are unchanged. Centralize all the time math so
it's easy to tweak tha range / precision tradeoffs in the future.
o While I'm here, the I/O scheduler should be using periph_data rather than
sim_data since it is operating on behalf of the periph.
Differential Review: https://reviews.freebsd.org/D12119
via soisdisconnected(), and in the earlier unlock earlier to avoid lock
recursion.
This fixes a situation when a socket on accept queue is reset before being
accepted.
Reported by: Jason Eggleston <jeggleston llnw.com>
A follow-up to r322836.
Warnings for the unused declaration were breaking some second tier
architectures, but did not show up in Clang on x86.
Reported by: markj (ddb.4), emaste (declaration)
Sponsored by: Dell EMC Isilon
When debugging a deadlock, it is useful to follow the full chain of locks as
far as possible.
Reviewed by: jhb
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D12115
Not enabling FSGSBASE in %cr4 does not prevent reporting of the
feature by the CPUID instruction (blame Int*l). As result, kernels
which were run under monitors pretended that usermode cannot modify
TLS base without the syscall, while libc noted right combination of
capable CPU and the new kernel version, trying to use the WRFSBASE
instruction.
Really old hypervisors that cannot handle enablement of these features
in %cr4 would require the manual configuration, by setting the loader
tunable hw.cpu_stdext_disable=0x81
Reported by: lwhsu, mjoras
Sponsored by: The FreeBSD Foundation
MFC after: 18 days
The check for timestamps are too early to handle SYN-ACK correctly.
So move it down after the corresponing processing has been done.
PR: 216832
Obtained from: antonfb@hesiod.org
MFC after: 1 week
Remote DMA over Converged Ethernet, RoCE, for the ConnectX-4 series of
PCI express network cards.
There is currently no user-space support and this driver only supports
kernel side non-routable RoCE V1. The krping kernel module can be used
to test this driver. Full user-space support including RoCE V2 will be
added as part of the ongoing upgrade to ibcore from Linux 4.9. Otherwise
this driver is feature equivalent to mlx4ib(4). The mlx5ib(4) kernel
module will only be built when WITH_OFED=YES is specified.
MFC after: 2 weeks
Sponsored by: Mellanox Technologies
newvers.sh looks for a .vcs subdirectory (e.g. .git, .svn) to determine
which vcs info tool to run (e.g., git rev-parse, svn info).
(As of r308789 if a .vcs subdirectory is not found at ${TOPDIR} then
newvers.sh walks up successive parent directories, testing for the .vcs
subdirectory at each step. This is done in case the FreeBSD source is
built in a subdirectory as part of some larger project, but either way
newvers.sh still tests for the .vcs subdirectory.)
However, when using git worktree there is no .git subdirectory but
rather a plain text .git file which contains a reference to the main
working tree.
Change findvcs() to test that the .vcs entry exists, regardless of type.
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Rather than repeatedly nesting loops, separate concerns with a single loop
per call stack level. Use a table to drive the recursive routine. Handle
missing topology layers more gracefully (infer a single unit).
Analyze some additional optional layers which may be present on e.g. AMD Zen
systems (groups, aka dies, per package; and cachegroups, aka CCXes, per
group).
Display that additional information in the boot-time topology information,
when it is relevent (non-one).
Reviewed by: markj@, mjoras@ (earlier version)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D12019
sysctls to display stats, stats polled every 2 seconds
Modify QLA_LOCK()/QLA_UNLOCK() to not sleep after acquiring mtx_lock
Add support to turn OFF/ON error recovery following heartbeat failure for
debug purposes.
Set default max values to 32 Tx/Rx/SDS rings
MFC after:5 days
The full system memory barrier around a TLB invalidation is stricter than
required. It needs to wait on accesses to main memory, with just the weaker
store variant before the invalidate. As such use the dsb istst, tlbi, dlb
ish sequence already used in pmap.
The tlbi instruction in this sequence is also unnecessarily using a
broadcast invalidate when it just needs to invalidate the local CPUs TLB.
Switch to a non-broadcast variant of this instruction.
Sponsored by: DARPA, AFRL
Right now, we enable the CR4.FSGSBASE bit on CPUs which support the
facility (Ivy and later), to allow usermode to read fs and gs bases
without syscalls. This bit also controls the write access to bases
from userspace, but WRFSBASE and WRGSBASE instructions currently
cannot be used, because return path from both exceptions or interrupts
overrides bases with the values from pcb.
Supporting the instructions is useful because this means that usermode
can implement green-threads completely in userspace without issuing
syscalls to change all of the machine context.
Support is implemented by saving the fs base and user gs base when
PCB_FULL_IRET flag is set. The flag is set on the context switch,
which potentially causes clobber of the bases due to activation of
another context, and when explicit modification of the user context by
a syscall or exception handler is performed. In particular, the patch
moves setting of the flag before syscalls change context.
The changes to doreti_exit and PUSH_FRAME to clear PCB_FULL_IRET on
entry from userspace can be considered a bug fixes on its own.
Reviewed by: jhb (previous version)
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
Differential revision: https://reviews.freebsd.org/D12023
softdep_count_dependencies().
Buffer's b_dep list is protected by the SU mount lock. Owning the
buffer lock is not enough to guarantee the stability of the list.
Calculation of the UFS mount owning the workitems from the buffer must
be much more careful to not dereference the work item which might be
freed meantime. To get to ump, use the pointers chain which does not
involve workitems at all.
Reported and tested by: pho
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
When a security policy should match TCP connection with specific ports,
the SYN+ACK segment send by syncache_respond() is considered as forwarded
packet, because at this moment TCP connection does not have PCB structure,
and ip_output() is called without inpcb pointer. In this case SPIDX filled
for SP lookup will not contain TCP ports and security policy will not
be found. This can lead to unencrypted SYN+ACK on the wire.
This patch restores the old behavior, when ports will not be filled only
for forwarded packets.
Reported by: Dewayne Geraghty <dewayne.geraghty at heuristicsystems.com.au>
MFC after: 1 week
Deadlock condition:
The return value of TDQ_LOCKPTR(td) is the same for two threads.
1) The first thread signals a wakeup while keeping the rcu_read_lock().
This invokes sched_add() which in turn will try to lock TDQ_LOCK().
2) The second thread is calling synchronize_rcu() calling mi_switch() over
and over again trying to yield(). This prevents the first thread from running
and releasing the RCU reader lock.
Solution:
Release the thread lock while yielding to allow other threads to acquire the
lock pointed to by TDQ_LOCKPTR(td).
Found by: KrishnamRaju ErapaRaju <Krishna2@chelsio.com>
MFC after: 1 week
Sponsored by: Mellanox Technologies
Currently several paths in the NFS client upgrade the shared vnode
lock to exclusive, which might cause temporal dropping of the lock.
This action appears to be fatal for nullfs mounts over NFS. If the
operation is performed over nullfs vnode, then bypassed down to NFS
VOP, and the lock is dropped, other thread might reclaim the upper
nullfs vnode. Since on reclaim the nullfs vnode lock and NFS vnode
lock are split, the original lock state of the nullfs vnode is not
restored. As result, VFS operations receive not locked vnode after a
VOP call.
Stop upgrading the vnode lock when we check the consistency or flush
buffers as result of detected inconsistency. Instead, allocate a new
lockmgr lock for each NFS node, which is locked exclusive instead of
the vnode lock upgrade. In other words, the other parallel
modification of the vnode are excluded by either vnode lock conflict
or exclusivity of the new lock when the vnode lock is shared.
Also revert r316529 because now the vnode cannot be reclaimed during
ncl_vinvalbuf().
In collaboration with: pho
Reviewed by: rmacklem
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D12083
This mode allows other clean buffers to arrive while we flush the buf
lists for the vnode, which is fine for the targeted use. We only need
that all buffers existed at the time of the function start were
flushed. In fact, only one assert has to be relaxed.
In collaboration with: pho
Reviewed by: rmacklem
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
X-Differential revision: https://reviews.freebsd.org/D12083
- Use more relevant name 'signo' instead of 'i' for the local variable
which contains a signal number to send for the current exception.
- Eliminate two labels 'userout' and 'out' which point to the very end
of the trap() function. Instead use return directly.
- Re-indent the prot_fault_translation block by reducing if() nesting.
- Some more monor style changes.
Requested and reviewed by: bde
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
This information is normally available via acpi_perf, but in case it is not,
add support for fetching the information via MSRs on AMD family 17h (Zen)
processors. Zen uses a slightly different formula than previous generation
AMD CPUs.
This was inspired by, but does not fix, PR 221621.
Reported by: Sean P. R. <seanpr AT swbell.net>
Reviewed by: mjoras@
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D12082
There was already a per-vty defaults field, but it was useless since it was
only initialized when propagating the global settings and thus no different
from the current global settings and not per-vty. The global defaults field
was also invariant after boot time, but not quite so useless.
Fix this by adding a second selection bit the the control flags of the
relevant ioctl(). vidcontrol doesn't support this yet. Setting either
default propagates the change to the current setting for the same level
and then to all lower levels.
Improve the 3-way escape sequence used by termcap to control the cursor.
The "normal" (ve) case has always used reset, so the user could set
it to anything, but since the reset is to a global value this is not
very useful, especially since the "very visible" (vs) case doesn't
reset but inconsistently forces to a blinking block. Change vs to
first reset and then XOR the blinking bit so that it is predictably
different from ve.
attribute field is curs_attr. The base field holds user data translated
in a reversible way and is needed because current field holds this in
an irreversible way for efficiency.
Factor out some common code for the reversible translation. This is
slightly simpler now, and much easier to expand.
Translate the magic flags value -1 to a single control flag internally
up front so other flags can be trusted later. This can be used for the
relevant ioctl() too.
Remove CONS_CURSOR_FLAGS which contained all the control flags. It was
unused and not useful. After adding more flags, there will be tests on
a couple at a time but never on them all. This API should have used this
to disallow unknown flags.
Make sure that %eflags.D flag is cleared for hook.
Improve comments.
When #UD dtrace code checks for a registered hook before checking that
the exception was raised from kernel mode, we might run with the user
%ds, trapping on access. Exception entry from userspace automatically
load valid %ss, which we can use there instead.
Noted and reviewed by: bde
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
redundant initializations.
Hard-code base = 0, height = (approx. 1/8 of the boot-time font height)
in all cases, and remove the BIOS/MD support for setting these values.
This asks for an underline cursor sized for the boot-time font instead
of various less hard-coded but worse values. I used that think that
the x86 BIOS always gave the same values as the above hard-coding, but
on 1 of my systems it gives the wrong value of base = 1.
The remaining BIOS fields are shift_state and bell_pitch. These are now
consistently not explicitly reinitialized to 0. All sc_get_bios_value()
functions except x86's are now empty, and the only useful thing that x86
returns is shift_state. This really belongs in atkbdc, but heavier
use of the BIOS to read the more useful typematic rate has been removed
there. fb still makes much heavier use of the BIOS.
Using latest U-Boot for RPI 1 or 2 the DTB loaded by the firmware is discarded.
The DTB was previously patched by the firmware to contain the DMA channel mask.
DTB provided by the rpi firmware or DTS in the Linux tree contain the raw value
directly. Do the same for our DTS as we cannot switch to the upstream ones yet.
Not having the DMA channel mask setup properly cause mmc not to be detected
(and probably other problems on driver using DMA).
Also, add links for rpi dtb to the name used by u-boot. This way the dtb can be
loaded by ubldr using the U-Boot env variable fdtfile.
Tested On: RPI B Rev2, RPI Zero, RPI 2 v1.1 RPI 2 v1.2
Thanks to Sylvain Garrigues <sylvain@sylvaingarrigues.com> for the help.
PR: 218344
Drop the EARLY_AP_STARTUP gtaskqueue code, as gtaskqueues are now
initialized before APs are started.
Reviewed by: hselasky@, jhb@
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D12054
exception did not happen in vm86 mode. A vm86 userland process could
have a %cs that matches GSEL_KPL, while dtrace cannot hook it.
Submitted by: Maxime Villard <max@m00nbsd.net>
MFC after: 3 days
was aliased to a vt sequence, causing and fixing various bugs.
For syscons, this restores support for arg 2 which sets blinking block
too forcefully, and restores bugs for arg 0 and 1. Arg 2 is used for
vs in the cons25 entry in termcap, but I've never noticed an application
that uses this. The bugs involve replacing local settings by global
ones and need better handling of defaults to fix.
For vt, this requires moving the aliasing code from teken to vt where
it belongs. This sequences is very important for cons25 compatibility
in vt since it is used by the cons25 termcap entries for ve, vi and
vs. vt can't properly support vs for either cons25 or xterm since it
doesn't support blinking. For xterm, the termcap entry for vs asks
for something different using 12;25h instead of 25h.
Rename C25CURS for this to C25LCT and change its description to be closer
to echoing the old comment about it. CURS is too generic.
Fix missing syscons escape sequence for setting the global cursor shape
(and type). Only support this in syscons since vt can't emulate anything
in it.
In a recent commit, I forgot to expand an X to an abbreviation of "BORDER".
Fix this and some nearby bad names.
The descriptions were copied from comments in scterm-sc.c, but some
of these are bad. The border [color] was inconsistently described as
a property of the "display", but I had changed this to "adapter" to
match the descriptions for other color settings. All colors supported
by the cons25 sequences are actually properties of the current vty and
that should not be described. But the other colors are defaults.
Change "adapter" to "default" for them and remove "adapter" for the
border. Reduce the verbosity of the abbreviation from AD to D.
It should toggle between 2 states, but it used a cut-down version of
support for a related 3-state syscons escape sequence and inherited
bugs from that. The usual misbehaviour was that hiding and showing
the cursor reset it to a global default.
Support for the 3-state sequence remains broken by aliasing to the 2-state
sequence. This works better but incompatibly for the 2 cases that it
supports.
Make sure that the flags INP_IPV4 and INP_IPV6 are consistently set
for inpcbs used for TCP sockets, no matter if the setting is derived
from the net.inet6.ip6.v6only sysctl or the IPV6_V6ONLY socket option.
For UDP this was already done right.
PR: 221385
MFC after: 1 week
Right now we only need to pad when writing kernel dump headers, so
flatten three related subroutines into one. The encrypted kernel dump
code already writes out its key in a dumper.blocksize-sized block.
No functional change intended.
Reviewed by: cem, def
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D11647
This helps simplify the code in kern_shutdown.c and reduces the number
of globally visible functions.
No functional change intended.
Reviewed by: cem, def
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D11603
dump_start() and dump_finish() are responsible for writing kernel dump
headers, optionally writing the key when encryption is enabled, and
initializing the initial offset into the dump device.
Also remove the unused dump_pad(), and make some functions static now that
they're only called from kern_shutdown.c.
No functional change intended.
Reviewed by: cem, def
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D11584
sbuf is filled to capacity by vsnprintf(), the loop exits without error, and
the sbuf is not marked as auto-extendable.
SBUF_HASROOM() evaluates true if there is room for one or more non-NULL
characters, but in the case that the sbuf was filled exactly to capacity,
SBUF_HASROOM() evaluates false. Consequently, sbuf_vprintf() incorrectly
assigns an ENOMEM error to the sbuf when in fact everything is fine, in turn
poisoning the buffer for all subsequent operations.
Correct by moving the ENOMEM assignment into the loop where it can be made
unambiguously.
As a related safety net change, explicitly check for the zero bytes drained
case in sbuf_drain() and set EDEADLK as the error. This avoids an infinite loop
in sbuf_vprintf() if a drain function were to inadvertently return a value of
zero to sbuf_drain().
Reviewed by: cem, jtl, gallatin
MFC after: 2 weeks
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D8535
The Nodes per Processor topology information determines how many bits of the
APIC ID represent the Node (Zeppelin die, on Zen systems) ID. Documented in
Ryzen and Epyc Processor Programming Reference (PPR).
Correct topology information enables the scheduler to make better decisions
on this hardware.
Reviewed by: kib@
Tested by: jeff@ (earlier version)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D11801
during drain operations. When an sbuf is configured to use this feature by way
of the SBUF_DRAINTOEOR sbuf_new() flag, top-level sections started with
sbuf_start_section() create a record boundary marker that is used to avoid
flushing partial records.
Reviewed by: cem,imp,wblock
MFC after: 2 weeks
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D8536
Fallout from r322588. I'm not sure why !SMP is a knob we have, but, we have
it.
Reported by: Michael Butler <imb AT protected-networks.net>
Sponsored by: Dell EMC Isilon
When built with -fno-inline-functions zfs.ko contains undefined references
to these functions if they are only marked inline.
Reviewed by: avg (earlier version)
MFC after: 1 week
Sponsored by: Chelsio Communications
This fixes a regression accidentally introduced in r322588, due to an
interaction with EARLY_AP_STARTUP.
Reviewed by: bdrewery@, jhb@
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D12053
Cleaning up a bpf_if is a two stage process. We first move it to the
bpf_freelist (in bpfdetach()) and only later do we actually free it (in
bpf_ifdetach()).
We cannot set the ifp->if_bpf to NULL from bpf_ifdetach() because it's
possible that the ifnet has already gone away, or that it has been assigned
a new bpf_if.
This can lead to a struct ifnet which is up, but has if_bpf set to NULL,
which will panic when we try to send the next packet.
Keep track of the pointer to the bpf_if (because it's not always
ifp->if_bpf), and NULL it immediately in bpfdetach().
PR: 213896
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D11782
Add an option to dynamically rebalance interrupts across cores
(hw.intrbalance); off by default.
The goal is to minimize preemption. By placing interrupt sources on distinct
CPUs, ithreads get preferentially scheduled on distinct CPUs. Overall
preemption is reduced and latency is reduced. In our workflow it reduced
"fighting" between two high-frequency interrupt sources. Reduced latency
was proven by, e.g., SPEC2008.
Submitted by: jeff@ (earlier version)
Reviewed by: kib@
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D10435
but it was actually extended then and it is still used (just once) in
/usr/src by its primary user (vidcontrol), while its replacement is
still not used in /usr/src.
yokota became inactive soon after deprecating CONS_CURSORTYPE (this
was part of a large change to make cursor attributes per-vty).
vidcontrol has incomplete support even for the old ioctl. I will
update it soon. Then there are many broken escape sequences to fix.
This is just to prepare for setting cursor colors using vidcontrol.
Intel SGX allows to manage isolated compartments "Enclaves" in user VA
space. Enclaves memory is part of processor reserved memory (PRM) and
always encrypted. This allows to protect user application code and data
from upper privilege levels including OS kernel.
This includes SGX driver and optional linux ioctl compatibility layer.
Intel SGX SDK for FreeBSD is also available.
Note this requires support from hardware (available since late Intel
Skylake CPUs).
Many thanks to Robert Watson for support and Konstantin Belousov
for code review.
Project wiki: https://wiki.freebsd.org/Intel_SGX.
Reviewed by: kib
Relnotes: yes
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D11113
Setting this flag allows us to skip pages removal from VM object queue
during object termination and to leave that for cdev_pg_dtor function.
Move pages removal code to separate function vm_object_terminate_pages()
as comments does not survive indentation.
This will be required for Intel SGX support where we will have to remove
pages from VM object manually.
Reviewed by: kib, alc
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D11688
Previously the locking of vlan(4) interfaces was not very comprehensive.
Particularly there was very little protection against the destruction of
active vlan(4) interfaces or concurrent modification of a vlan(4)
interface. The former readily produced several different panics.
The changes can be summarized as using two global vlan locks (an
rmlock(9) and an sx(9)) to protect accesses to the if_vlantrunk field of
struct ifnet, in addition to other places where global exclusive access
is required. vlan(4) should now be much more resilient to the destruction
of active interfaces and concurrent calls into the configuration path.
PR: 220980
Reviewed by: ae, markj, mav, rstone
Approved by: rstone (mentor)
MFC after: 4 weeks
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D11370
This is a variant of vm_page_alloc() which accepts an additional parameter:
the page in the object with largest index that is smaller than the requested
index. vm_page_alloc() finds this page using a lookup in the object's radix
tree, but in some cases its identity is already known, allowing the lookup
to be elided.
Modify kmem_back() and vm_page_grab_pages() to use vm_page_alloc_after().
vm_page_alloc() is converted into a trivial wrapper of
vm_page_alloc_after().
Suggested by: alc
Reviewed by: alc, kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D11984
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
Be more careful about the use of provider names vs vdev names in
ZFS_LOG statements.
MFC after: 3 weeks
Sponsored by: Spectra Logic Corp
It is quite useful when double fault is not caused by a stack overflow.
Tested by: pho (as part of the larger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Do this even for non-transparent mode VF. Better safe than sorry.
MFC after: 3 days
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D11981
- Update hn(4)'s stats properly for non-transparent mode VF.
- Allow BPF tapping to hn(4) for non-transparent mode VF.
- Don't setup mbuf hash, if 'options RSS' is set.
In Azure, when VF is activated, TCP SYN and SYN|ACK go through hn(4)
while the rest of segments and ACKs belonging to the same TCP 4-tuple
go through the VF. So don't setup mbuf hash, if a VF is activated
and 'options RSS' is not enabled. hn(4) and the VF may use neither
the same RSS hash key nor the same RSS hash function, so the hash
value for packets belonging to the same flow could be different!
- Disable LRO.
hn(4) will only receive broadcast packets, multicast packets, TCP
SYN and SYN|ACK (in Azure), LRO is useless for these packet types.
For non-transparent, we definitely _cannot_ enable LRO at all, since
the LRO flush will use hn(4) as the receiving interface; i.e.
hn_ifp->if_input(hn_ifp, m).
While I'm here, remove unapplied comment and minor style change.
MFC after: 3 days
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D11978
While, I'm here add comment about why updating VF's imcast stat is
not necessary.
MFC after: 3 days
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D11948
setting up the timer fails, because on some types of chips that's the
first attempt to access the device. If the chip is missing/non-responsive
then you'd get a driver that attached and didn't register the rtc, with
no clue about why. On other chip types there are inits that come before
timer setup, and they already print messages about errors.
- Add FDT probe code.
- Do i2c transfers with exclusive bus ownership.
- Use config_intrhook_oneshot() to defer chip setup because some i2c
busses can't do transfers without interrupts.
- Add a detach() routine.
- Add to module build.
This driver supports only basic timekeeping functionality. It completely
replaces the ds133x driver. It can also replace the ds1374 driver, but that
will take a few other changes in MIPS code and config, and will be committed
separately. It does NOT replace the existing ds1307 driver, which provides
access to some of the extended features on the 1307 chip, such as controlling
the square wave output signal. If both ds1307 and ds13rtc drivers are
present, the ds1307 driver will outbid and win control of the device.
This driver can be configured with FDT data, or by using hints on non-FDT
systems. In addition to the standard hints for i2c devices, it requires
a "chiptype" string of the form "dallas,ds13xx" where 'xx' is the chip id
(i.e., the same format as FDT compat strings).
and sc_irq_length to the softc to handle the base number of IRQs available,
make gicv3_get_nirqs return the number of available interrupt IDs, and
limit which CPUs we send interrupts to based on the numa domain.
The last point is only strictly needed on a dual socket ThunderX where we
are unable to send MSI/MSI-X interrupts between sockets.
Sponsored by: DARPA, AFRL
it automatically after it runs.
The config_intrhook mechanism allows a driver to stall the boot process
until device(s) required for booting are available, by not allowing system
inits to proceed until all intrhook functions have been unregistered.
Virtually all existing code simply unregisters from within the hook function
when it gets called.
This new function makes that common usage more convenient. Instead of
allocating and filling in a struct, passing it to a function that might (in
theory) fail, and checking the return code, now a driver can simply call
this cannot-fail routine, passing just the intrhook function and its arg.
Differential Revision: https://reviews.freebsd.org/D11963
the g_journal level needs to check whether it is holding a newer
copy of the block than that which exists on the disk. If so, it
needs to return its copy. If not, it should pass the request down
to the disk to fulfill. It currently considers six queues:
0) delayed queue,
1) unsent (current queue),
2) in-flight to the journal (flush queue),
3) active journal (active queue),
4) inactive journal (inactive queue), and
5) inflight to the disk (copy queue).
Checking on two of these queues is unnecessary:
0) The delayed requests should not be used for reads because they
have not yet been entered into the journal, so their value should
reflect the disk contents, not the future contents that are not
yet committed.
2) Because all the bio's in the flush queue are also found on the
active queue, there is no need to inspect the flush queue for
reads since they will be found when searching the active queue.
Submitted by: Dr. Andreas Longwitz <longwitz@incore.de>
Discussed with: kib
MFC after: 1 week
another parameter that identifies a starting point in the memory address
block. Radix is a power of two, blk is a multiple of radix, and the
starting point is in the range [blk, blk+radix), so that blk can always be
computed from the other two. This change drops the blk parameter from the
meta functions and computes it instead. (On amd64, for example, this
change reduces subr_blist.o's text size by 7%.)
It also makes the radix parameters unsigned to address concerns that the
calculation of '-radix' might overflow without the -fwrapv option. (See
https://reviews.freebsd.org/D11819.)
Submitted by: Doug Moore <dougm@rice.edu>
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D11964
to being called through the newbus DEVICE_SHUTDOWN() path. This ensures that
the NVME controller gets shut down before the device and bus disappear
and prevents data corruption on shutdown on at least Samsung EVO 960 SSDs.
PR: kern/211852
Reviewed by: imp
MFC after: 2 weeks
Previously, debug exceptions were only enabled on the boot CPU if
DDB was enabled in the dbg_monitor_init() function. APs also called
this function, but since mp_machdep.c doesn't include opt_ddb.h, the
APs ended up calling an empty stub defined in <machine/debug_monitor.h>
instead of the real function. Also, if DDB was not enabled in the kernel,
the boot CPU would not enable debug exceptions.
Fix this by adding a new dbg_init() function that always clears the OS
lock to enable debug exceptions which the boot CPU and the APs call.
This function also calls dbg_monitor_init() to enable hardware breakpoints
from DDB on all CPUs if DDB is enabled. Eventually base support for
hardware breakpoints/watchpoints will need to move out of the DDB-only
debug_monitor.c for use by userland debuggers.
Reviewed by: andrew
Differential Revision: https://reviews.freebsd.org/D12001
Only fetch the VFP state from the CPU if the thread whose registers are
being requested is the current thread. If a stopped thread's registers
are being fetched by a debugger, the saved state in the PCB is already
valid.
Reviewed by: andrew
MFC after: 1 week
generic driver with minimal feature support for a large number of chips.
More featureful per-chip drivers might exist (especially out-of-tree) and
those should win the bidding even if they use BUS_PROBE_DEFAULT.
the current state to determine whether to generate a link-state change
notification. This fixes a bug introduced in r321063 that caused the
driver to sometimes skip these notifications.
Reported by: Jason Eggleston @ LLNW
MFC after: 3 days
Sponsored by: Chelsio Communications
Added in r238189, ARM_MANY_BOARD adds support for multiple ARM boards in
a single kernel. Include it for LINT builds to avoid duplicate symbol
errors when linking with lld.
Sponsored by: The FreeBSD Foundation
The encrypted kernel dump code writes data in blocks of this size. A buffer
size of 4096 allows encrypted dumps to work with 4Kn drives.
Reviewed by: cem
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D11870
removes the only reference to atrtc_set() from outside of atrtc.c, so make
it static.
The xen timer driver registers as a realtime clock with 1us resolution. In
the past that resulted in only the xen timer's clock_settime() getting
called, so it would call atrtc_set() to set the hardware clock as well. As
of r32090, the clock_settime() method of all registered realtime clocks gets
called, so the xen driver no longer needs to chain-call the lower-resolution
driver.
Thanks to royger@ for talking me through the xen stuff, and for testing.
M_PMC is defined in sys/dev/hwpmc/hwpmc_mod.c, and the LINT kernel build
fails when linking with lld due to a duplicate symbol error.
Sponsored by: The FreeBSD Foundation
TCP connections (order of tens of thousands), with predominantly Transmits.
Choice to perform receive operations either in IThread or Taskqueue Thread.
Submitted by:Vaishali.Kulkarni@cavium.com
MFC after:5 days
cpumask it probably means we are unable to sent interrupts to CPUs outside
the map. As such only return the current CPU when it's within the mask
otherwise return the first valid CPU.
This is needed on ThunderX as, in a dual socket configuration, we are
unable to send MSI/MSI-X interrupts between sockets.
Reviewed by: mmel
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D11957
in the VM area structure in the LinuxKPI when doing mmap() and that
unsupported bits are masked away.
While at it fix some redundant use of parenthesing inside some related
macros.
Found by: KrishnamRaju ErapaRaju <Krishna2@chelsio.com>
MFC after: 1 week
Sponsored by: Mellanox Technologies
Such drivers attach to a vgapci bus rather than directly to a pci bus. For
the rest of the LinuxKPI to work correctly in this case, we override the
vgapci bus' ivars with those of the grandparent.
Reviewed by: hselasky
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D11932
We can remove some unnecessary object radix tree lookups by using the
object memq to iterate over pages in the specified range. This does not,
however, eliminate the lookup needed in vm_page_free_toq() to remove each
tree entry.
Reviewed by: alc, kib (previous revision)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D11945