It seems that second call does not add any useful state change for all
implemented timecounters.
Discussed with: bde
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
If there's no data to read from xenstore short-circuit
xctrl_on_watch_event to return early, there's no reason to continue
since the lack of data would prevent matching against any known event
type.
Sponsored by: Citrix Systems R&D
MFC with: r352925
MFC after: 1 week
The correct type to use to represent disk sectors is blkif_sector_t
(which is an uint64_t underneath). This avoid truncation of the disk
size calculation when resizing on i386, as otherwise the calculation
of d_mediasize in xbd_connect is truncated to the size of unsigned
long, which is 32bits on i386.
Note this issue didn't affect amd64, because the size of unsigned long
is 64bits there.
Sponsored by: Citrix Systems R&D
MFC after: 1 week
Fix returning from xenstore device with locks held, which triggers the
following panic:
# cat /dev/xen/xenstore
^C
userret: returning with the following locks held:
exclusive sx evtchn_ringc_sx (evtchn_ringc_sx) r = 0 (0xfffff8000650be40) locked @ /usr/src/sys/dev/xen/evtchn/evtchn_dev.c:262
Note this is not a security issue since access to the device is
limited to root by default.
Sponsored by: Citrix Systems R&D
MFC after: 1 week
A later change, currently being iterated on in D24459, will in-fact change
the lock type to an sx so that TTY drivers can sleep on it if they need to.
Committing this ahead of time to make the review in question a little more
palatable.
tty_lock_assert() is unfortunately still needed for now in two places to
make sure that the tty lock has not been recursed upon, for those scenarios
where it's supplied by the TTY driver and possibly a mutex that is allowed
to recurse.
Suggested by: markj
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.
This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.
Mark all obvious cases as MPSAFE.
Reviewed by: royger
Approved by: kib (mentor, blanket)
Differential Revision: https://reviews.freebsd.org/D23638
BIO_READ and BIO_WRITE, we've handled this expanded syntax poorly in
drivers when the driver doesn't support a particular command. Do a
sweep and fix that.
Reported by: imp
Currently the Xen console is always attached with priority CN_REMOTE
(highest), which means that when booting with a single console the Xen
console will take preference over the VGA for example, and that's not
intended unless the user has also selected to use a serial console.
Fix this by lowering the priority of the Xen console to NORMAL unless
the user has selected to use a serial console. This keeps the usual
FreeBSD behavior of outputting to the internal consoles (ie: VGA) when
booted as a Xen dom0.
MFC after: 3 days
Sponsored by: Citrix Systems R&D
supposedly may call into ether_input() without network epoch.
They all need to be reviewed before 13.0-RELEASE. Some may need
be fixed. The flag is not planned to be used in the kernel for
a long time.
Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D21427
removed from objects including calls to free. Pages must not be xbusy
when freed and not on an object. Strengthen assertions to match these
expectations. In practice very little code had to change busy handling
to meet these rules but we can now make stronger guarantees to busy
holders and avoid conditionally dropping busy in free.
Refine vm_page_remove() and vm_page_replace() semantics now that we have
stronger guarantees about busy state. This removes redundant and
potentially problematic code that has proliferated.
Discussed with: markj
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D22822
The current vnode layout is not smp-friendly by having frequently read data
avoidably sharing cachelines with very frequently modified fields. In
particular v_iflag inspected for VI_DOOMED can be found in the same line with
v_usecount. Instead make it available in the same cacheline as the v_op, v_data
and v_type which all get read all the time.
v_type is avoidably 4 bytes while the necessary data will easily fit in 1.
Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new
flag field with a new value: VIRF_DOOMED.
Reviewed by: kib, jeff
Differential Revision: https://reviews.freebsd.org/D22715
Atomics are used for page busy and valid state when the shared busy is
held. The details of the locking protocol and valid and dirty
synchronization are in the updated vm_page.h comments.
Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21594
This is the first in a series of patches that promotes the page busy field
to a first class lock that no longer requires the object lock for
consistency.
Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21548
Currently only suspend requests are acknowledged by writing an empty
string back to the xenstore control node, but poweroff or reboot
requests are not acknowledged and FreeBSD simply proceeds to perform
the desired action.
Fix this by acknowledging all requests, and remove the suspend specific
ack done in the handler.
Sponsored by: Citrix Systems R&D
MFC after: 3 days
- VM_ALLOC_NOCREAT will grab without creating a page.
- vm_page_grab_valid() will grab and page in if necessary.
- vm_page_busy_acquire() automates some busy acquire loops.
Discussed with: alc, kib, markj
Tested by: pho (part of larger branch)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21546
There are several mechanisms by which a vm_page reference is held,
preventing the page from being freed back to the page allocator. In
particular, holding the page's object lock is sufficient to prevent the
page from being freed; holding the busy lock or a wiring is sufficent as
well. These references are protected by the page lock, which must
therefore be acquired for many per-page operations. This results in
false sharing since the page locks are external to the vm_page
structures themselves and each lock protects multiple structures.
Transition to using an atomically updated per-page reference counter.
The object's reference is counted using a flag bit in the counter. A
second flag bit is used to atomically block new references via
pmap_extract_and_hold() while removing managed mappings of a page.
Thus, the reference count of a page is guaranteed not to increase if the
page is unbusied, unmapped, and the object's write lock is held. As
a consequence of this, the page lock no longer protects a page's
identity; operations which move pages between objects are now
synchronized solely by the objects' locks.
The vm_page_wire() and vm_page_unwire() KPIs are changed. The former
requires that either the object lock or the busy lock is held. The
latter no longer has a return value and may free the page if it releases
the last reference to that page. vm_page_unwire_noq() behaves the same
as before; the caller is responsible for checking its return value and
freeing or enqueuing the page as appropriate. vm_page_wire_mapped() is
introduced for use in pmap_extract_and_hold(). It fails if the page is
concurrently being unmapped, typically triggering a fallback to the
fault handler. vm_page_wire() no longer requires the page lock and
vm_page_unwire() now internally acquires the page lock when releasing
the last wiring of a page (since the page lock still protects a page's
queue state). In particular, synchronization details are no longer
leaked into the caller.
The change excises the page lock from several frequently executed code
paths. In particular, vm_object_terminate() no longer bounces between
page locks as it releases an object's pages, and direct I/O and
sendfile(SF_NOCACHE) completions no longer require the page lock. In
these latter cases we now get linear scalability in the common scenario
where different threads are operating on different files.
__FreeBSD_version is bumped. The DRM ports have been updated to
accomodate the KPI changes.
Reviewed by: jeff (earlier version)
Tested by: gallatin (earlier version), pho
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20486
These calls are not the same in general: the former will dequeue the
page if it is enqueued, while the latter will just leave it alone. But,
all existing uses of the former apply to unmanaged pages, which are
never enqueued in the first place. No functional change intended.
Reviewed by: kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20470
This allows replacing "sys/eventfilter.h" includes with "sys/_eventfilter.h"
in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header
pollution substantially.
EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c
files into appropriate headers (e.g., sys/proc.h, powernv/opal.h).
As a side effect of reduced header pollution, many .c files and headers no
longer contain needed definitions. The remainder of the patch addresses
adding appropriate includes to fix those files.
LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by
sys/mutex.h since r326106 (but silently protected by header pollution prior
to this change).
No functional change (intended). Of course, any out of tree modules that
relied on header pollution for sys/eventhandler.h, sys/lock.h, or
sys/mutex.h inclusion need to be fixed. __FreeBSD_version has been bumped.
The type represents byte offset in the vm_object_t data space, which
does not span negative offsets in FreeBSD VM. The change matches byte
offset signess with the unsignedness of the vm_pindex_t which
represents the type of the page indexes in the objects.
This allows to remove the UOFF_TO_IDX() macro which was used when we
have to forcibly interpret the type as unsigned anyway. Also it fixes
a lot of implicit bugs in the device drivers d_mmap methods.
Reviewed by: alc, markj (previous version)
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
The Xen page-table walker used to resolve the virtual addresses in the
hypercalls will refuse to access user-space pages when SMAP is enabled
unless the AC flag in EFLAGS is set (just like normal hardware with
SMAP support would do).
Since privcmd allows forwarding hypercalls (and buffers) from
user-space into Xen make sure SMAP is temporary disabled for the
duration of the hypercall from user-space.
Approved by: re (gjb)
Sponsored by: Citrix Systems R&D
netfront_backend_changed() is called from the xenwatch_thread(), which means
that the curvnet is not set. We have to set it before we can call things like
arp_ifinit().
PR: 230845
The timespecadd(3) family of macros were imported from NetBSD back in
r35029. However, they were initially guarded by #ifdef _KERNEL. In the
meantime, we have grown at least 28 syscalls that use timespecs in some
way, leading many programs both inside and outside of the base system to
redefine those macros. It's better just to make the definitions public.
Our kernel currently defines two-argument versions of timespecadd and
timespecsub. NetBSD, OpenBSD, and FreeDesktop.org's libbsd, however, define
three-argument versions. Solaris also defines a three-argument version, but
only in its kernel. This revision changes our definition to match the
common three-argument version.
Bump _FreeBSD_version due to the breaking KPI change.
Discussed with: cem, jilles, ian, bde
Differential Revision: https://reviews.freebsd.org/D14725
This fixes the panic caused by deadlocking when grant-table free
callbacks are used.
The cause of the recursion is: check_free_callbacks() is always called
with the lock gnttab_list_lock held. In turn the callback function is
also called with the lock held. Then when the client uses any of the grant
reference methods which also attempt the lock the gnttab_list_lock
mutex from within the free callback a deadlock happens.
Fix this by making the gnttab_list_lock recursive.
Submitted by: Pratyush Yadav <pratyush@freebsd.org>
Differential Revision: https://reviews.freebsd.org/D16505
If gnttab_grant_foreign_access() fails for any of the indirection
pages, the code breaks out of both the loops without freeing the local
variable indirectpages, causing a memory leak.
Submitted by: Pratyush Yadav <pratyush@freebsd.org>
Differential Review: https://reviews.freebsd.org/D16136
Length is an unsigned integer, so checking against < 0 doesn't make
sense. While there also make clear that a length of 0 always succeeds.
Submitted by: Pratyush Yadav <pratyush@freebsd.org>
Differential Review: https://reviews.freebsd.org/D16045
When booted as PVHv2, there's no ACPI CPU object, so attach the PV CPU
device in order to take it's place.
This is required in case some device or driver tries to poke at the
PCPU device field.
Sponsored by: Citrix Systems R&D
HYPERVISOR_start_info is only available to PV and PVHv1 guests, HVM
and PVHv2 guests get this data from HVM parameters that are fetched
using a hypercall.
Instead provide a set of helper functions that should be used to fetch
this data. The helper functions have different implementations
depending on whether FreeBSD is running as PVHv1 or HVM/PVHv2 guest
type.
This helps to cleanup generic Xen code by removing quite a lot of
xen_pv_domain and xen_hvm_domain macro usages.
Sponsored by: Citrix Systems R&D
lock order reversal: (sleepable after non-sleepable)
1st 0xfffffe00357ff538 xnb_softc (xen netback softc lock) @ /usr/src/sys/dev/xen/netback/netback.c:1069
2nd 0xffffffff81fdccb0 intrsrc (intrsrc) @ /usr/src/sys/x86/x86/intr_machdep.c:224
There's no need to hold the lock since the cleaning of the interrupt
cannot happen in parallel due to the XNBF_IN_SHUTDOWN flag being set.
Note that the locking in netback needs some improvement or
clarification.
While there also remove a double newline.
Sponsored by: Citrix Systems R&D
Without a call to check_free_callbacks() clients waiting for grant
references would not be woken up even when there are sufficient grant
references available.
The check was likely left out as a mistake when the function was first
added.
Note that other functions used to free grant references already call
check_free_callbacks.
Submitted by: pratyush
Reviewed by: royger
Differential review: https://reviews.freebsd.org/D15899
Remove the device from the list before unbinding it. Doing it in this
order allows calling xen_intr_unbind without holding the bind_mutex
lock.
Sponsored by: Citrix Systems R&D
There's no need to perform the interrupt unbind while holding the
blkback lock, and doing so leads to the following LOR:
lock order reversal: (sleepable after non-sleepable)
1st 0xfffff8000802fe90 xbbd1 (xbbd1) @ /usr/src/sys/dev/xen/blkback/blkback.c:3423
2nd 0xffffffff81fdf890 intrsrc (intrsrc) @ /usr/src/sys/x86/x86/intr_machdep.c:224
stack backtrace:
#0 0xffffffff80bdd993 at witness_debugger+0x73
#1 0xffffffff80bdd814 at witness_checkorder+0xe34
#2 0xffffffff80b7d798 at _sx_xlock+0x68
#3 0xffffffff811b3913 at intr_remove_handler+0x43
#4 0xffffffff811c63ef at xen_intr_unbind+0x10f
#5 0xffffffff80a12ecf at xbb_disconnect+0x2f
#6 0xffffffff80a12e54 at xbb_shutdown+0x1e4
#7 0xffffffff80a10be4 at xbb_frontend_changed+0x54
#8 0xffffffff80ed66a4 at xenbusb_back_otherend_changed+0x14
#9 0xffffffff80a2a382 at xenwatch_thread+0x182
#10 0xffffffff80b34164 at fork_exit+0x84
#11 0xffffffff8101ec9e at fork_trampoline+0xe
Reported by: Nathan Friess <nathan.friess@gmail.com>
Sponsored by: Citrix Systems R&D
The user-space xenstore device is currently lacking a check to make
sure that the caller is only using transaction ids currently assigned
to it. This allows users of the xenstore device to hijack transactions
not started by them, although the scope is limited to transactions
started by the same domain.
Tested by: Nathan Friess <nathan.friess@gmail.com>
Sponsored by: Citrix Systems R&D
Allow user-space applications to register watches using the xenstore
device. This is needed in order to run toolstack operations on
domains different than the one where xenstore is running (in which
case the device is not used, since the connection to xenstore is done
using a plain socket).
Tested by: Nathan Friess <nathan.friess@gmail.com>
Sponsored by: Citrix Systems R&D
Due to the current synchronous xenstore implementation in FreeBSD, we
cannot return from xs_read_reply without reading a reply, or else the
ring gets out of sync and the next request will read the previous
reply and crash due to the type mismatch. A proper solution involves
making use of the req_id field in the message and allowing multiple
in-flight messages at the same time on the ring.
Remove the PCATCH flag so that signals don't interrupt the wait.
Tested by: Nathan Friess <nathan.friess@gmail.com>
Sponsored by: Citrix Systems R&D
There's no need to prevent suspend while doing xenstore transactions,
callers of transactions are supposed to be prepared for a transaction
to fail.
This fixes a bug that could be triggered from the xenstore user-space
device, since starting a transaction from user-space would result in
returning there with a sx lock held, that causes a WITNESS check to
trigger.
Tested by: Nathan Friess <nathan.friess@gmail.com>
Sponsored by: Citrix Systems R&D
Linux will not connect to a backend that's in state 3
(XenbusStateInitialised), it needs to be in state 2
(XenbusStateInitWait) for Linux to attempt to connect to the backend.
The protocol seems to suggest that the backend should indeed wait in
state 2 for the frontend to connect, which makes state 3 unusable for
disk backends.
Also make sure blkback will connect to the frontend if the frontend
reaches state 3 (XenbusStateInitialised) before blkback has processed
the results from the hotplug script (Submitted by Nathan Friess).
MFC after: 1 week
Current interface to the gntdev in FreeBSD is wrong, and mostly worked
out of luck before the PTI FreeBSD fixes, when kernel and user-space
where sharing the same page tables.
On FreeBSD ioctls have the size of the passed struct encoded in the
ioctl number, because the generic ioctl handler in the OS takes care
of copying the data from user-space to kernel space, and then calls
the device specific ioctl handler. Thus using ioctl structs with
variable sizes is not possible.
The fix is to turn the array of structs at the end of
ioctl_gntdev_alloc_gref and ioctl_gntdev_map_grant_ref into pointers,
that can be properly accessed from the kernel gntdev driver using the
copyin/copyout functions. Note that this is exactly how it's done for
the privcmd driver.
Sponsored by: Citrix Systems R&D
Uses of mallocarray(9).
The use of mallocarray(9) has rocketed the required swap to build FreeBSD.
This is likely caused by the allocation size attributes which put extra pressure
on the compiler.
Given that most of these checks are superfluous we have to choose better
where to use mallocarray(9). We still have more uses of mallocarray(9) but
hopefully this is enough to bring swap usage to a reasonable level.
Reported by: wosch
PR: 225197
Focus on code where we are doing multiplications within malloc(9). None of
these is likely to overflow, however the change is still useful as some
static checkers can benefit from the allocation attributes we use for
mallocarray.
This initial sweep only covers malloc(9) calls with M_NOWAIT. No good
reason but I started doing the changes before r327796 and at that time it
was convenient to make sure the sorrounding code could handle NULL values.
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.