PAT_WRITE_BACK is x86-only, whereas sys/dev/xen could be shared
between multiple architectures.
Reviewed by: royger
Differential Revision: https://reviews.freebsd.org/D28831
Multi page rings are mapped using a single hypercall that gets passed
an array of grants to map. One of the grants in the array failing to
map would lead to the failure of the whole ring setup operation, but
there was no cleanup of the rest of the grant maps in the array that
could have likely been created as a result of the hypercall.
Add proper cleanup on the failure path during ring setup to unmap any
grants that could have been created.
This is part of XSA-361.
Sponsored by: Citrix Systems R&D
FreeBSD when running as a dom0 under Xen is not supposed to access the
run time services directly, and instead should proxy the calls through
Xen using an hypercall interface that exposes access to selected run
time services.
Implement the efirt interface on top of the Xen provided hypercalls.
Sponsored by: Citrix Systems R&D
Reviewed by: kib
Differential revision: https://reviews.freebsd.org/D28621
Use an interface compatible with the Linux one so that the user-space
libraries already using the Linux interface can be used without much
modifications.
This allows an open privcmd instance to limit against which domains it
can act upon.
Sponsored by: Citrix Systems R&D
Use an interface compatible with the Linux one so that the user-space
libraries already using the Linux interface can be used without much
modifications.
This allows user-space to make use of the dm_op family of hypercalls,
which are used by device models.
Sponsored by: Citrix Systems R&D
The interface is mostly the same as the Linux ioctl, so that we don't
need to modify the user-space libraries that make use of it.
The ioctl is just a proxy for the XENMEM_acquire_resource hypercall.
Sponsored by: Citrix Systems R&D
Xenstore watches received are queued in a list and processed in a
deferred thread. Such queuing was done without any checking, so a
guest could potentially trigger a resource starvation against the
FreeBSD kernel if such kernel is watching any user-controlled xenstore
path.
Allowing limiting the amount of pending events a watch can accumulate
to prevent a remote guest from triggering this resource starvation
issue.
For the PV device backends and frontends this limitation is only
applied to the other end /state node, which is limited to 1 pending
event, the rest of the watched paths can still have unlimited pending
watches because they are either local or controlled by a privileged
domain.
The xenstore user-space device gets special treatment as it's not
possible for the kernel to know whether the paths being watched by
user-space processes are controlled by a guest domain. For this reason
watches set by the xenstore user-space device are limited to 1000
pending events. Note this can be modified using the
max_pending_watch_events sysctl of the device.
This is XSA-349.
Sponsored by: Citrix Systems R&D
MFC after: 3 days
Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys.
Make b_pages[] array in struct buf flexible. Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
The +1 for pbufs allow several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*). Overall, we save significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to desired high value.
Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except a place which initialize maxphys. Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight. Some drivers, which use MAXPHYS to size embeded structures,
get private MAXPHYS-like constant; their convertion is out of scope
for this work.
Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs,
dev/siis, where either submitted by, or based on changes by mav.
Suggested by: mav (*)
Reviewed by: imp, mav, imp, mckusick, scottl (intermediate versions)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27225
This ensures that no writes are pending in memory, either metadata or
user data, but not including dirty pages not yet converted to fs writes.
Only filesystems declared local are suspended.
Note that this does not guarantee absence of the metadata errors or
leaks if resume is not done: for instance, on UFS unlinked but opened
inodes are leaked and require fsck to gc.
Reviewed by: markj
Discussed with: imp
Tested by: imp (previous version), pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D27054
Future changes would require additional initialization of OBJT_PHYS
objects, and vm_object_allocate() is not suitable for it.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D24652
It seems that second call does not add any useful state change for all
implemented timecounters.
Discussed with: bde
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
If there's no data to read from xenstore short-circuit
xctrl_on_watch_event to return early, there's no reason to continue
since the lack of data would prevent matching against any known event
type.
Sponsored by: Citrix Systems R&D
MFC with: r352925
MFC after: 1 week
The correct type to use to represent disk sectors is blkif_sector_t
(which is an uint64_t underneath). This avoid truncation of the disk
size calculation when resizing on i386, as otherwise the calculation
of d_mediasize in xbd_connect is truncated to the size of unsigned
long, which is 32bits on i386.
Note this issue didn't affect amd64, because the size of unsigned long
is 64bits there.
Sponsored by: Citrix Systems R&D
MFC after: 1 week
Fix returning from xenstore device with locks held, which triggers the
following panic:
# cat /dev/xen/xenstore
^C
userret: returning with the following locks held:
exclusive sx evtchn_ringc_sx (evtchn_ringc_sx) r = 0 (0xfffff8000650be40) locked @ /usr/src/sys/dev/xen/evtchn/evtchn_dev.c:262
Note this is not a security issue since access to the device is
limited to root by default.
Sponsored by: Citrix Systems R&D
MFC after: 1 week
A later change, currently being iterated on in D24459, will in-fact change
the lock type to an sx so that TTY drivers can sleep on it if they need to.
Committing this ahead of time to make the review in question a little more
palatable.
tty_lock_assert() is unfortunately still needed for now in two places to
make sure that the tty lock has not been recursed upon, for those scenarios
where it's supplied by the TTY driver and possibly a mutex that is allowed
to recurse.
Suggested by: markj
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.
This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.
Mark all obvious cases as MPSAFE.
Reviewed by: royger
Approved by: kib (mentor, blanket)
Differential Revision: https://reviews.freebsd.org/D23638
BIO_READ and BIO_WRITE, we've handled this expanded syntax poorly in
drivers when the driver doesn't support a particular command. Do a
sweep and fix that.
Reported by: imp
Currently the Xen console is always attached with priority CN_REMOTE
(highest), which means that when booting with a single console the Xen
console will take preference over the VGA for example, and that's not
intended unless the user has also selected to use a serial console.
Fix this by lowering the priority of the Xen console to NORMAL unless
the user has selected to use a serial console. This keeps the usual
FreeBSD behavior of outputting to the internal consoles (ie: VGA) when
booted as a Xen dom0.
MFC after: 3 days
Sponsored by: Citrix Systems R&D
supposedly may call into ether_input() without network epoch.
They all need to be reviewed before 13.0-RELEASE. Some may need
be fixed. The flag is not planned to be used in the kernel for
a long time.
Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D21427
removed from objects including calls to free. Pages must not be xbusy
when freed and not on an object. Strengthen assertions to match these
expectations. In practice very little code had to change busy handling
to meet these rules but we can now make stronger guarantees to busy
holders and avoid conditionally dropping busy in free.
Refine vm_page_remove() and vm_page_replace() semantics now that we have
stronger guarantees about busy state. This removes redundant and
potentially problematic code that has proliferated.
Discussed with: markj
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D22822
The current vnode layout is not smp-friendly by having frequently read data
avoidably sharing cachelines with very frequently modified fields. In
particular v_iflag inspected for VI_DOOMED can be found in the same line with
v_usecount. Instead make it available in the same cacheline as the v_op, v_data
and v_type which all get read all the time.
v_type is avoidably 4 bytes while the necessary data will easily fit in 1.
Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new
flag field with a new value: VIRF_DOOMED.
Reviewed by: kib, jeff
Differential Revision: https://reviews.freebsd.org/D22715
Atomics are used for page busy and valid state when the shared busy is
held. The details of the locking protocol and valid and dirty
synchronization are in the updated vm_page.h comments.
Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21594
This is the first in a series of patches that promotes the page busy field
to a first class lock that no longer requires the object lock for
consistency.
Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21548
Currently only suspend requests are acknowledged by writing an empty
string back to the xenstore control node, but poweroff or reboot
requests are not acknowledged and FreeBSD simply proceeds to perform
the desired action.
Fix this by acknowledging all requests, and remove the suspend specific
ack done in the handler.
Sponsored by: Citrix Systems R&D
MFC after: 3 days
- VM_ALLOC_NOCREAT will grab without creating a page.
- vm_page_grab_valid() will grab and page in if necessary.
- vm_page_busy_acquire() automates some busy acquire loops.
Discussed with: alc, kib, markj
Tested by: pho (part of larger branch)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21546
There are several mechanisms by which a vm_page reference is held,
preventing the page from being freed back to the page allocator. In
particular, holding the page's object lock is sufficient to prevent the
page from being freed; holding the busy lock or a wiring is sufficent as
well. These references are protected by the page lock, which must
therefore be acquired for many per-page operations. This results in
false sharing since the page locks are external to the vm_page
structures themselves and each lock protects multiple structures.
Transition to using an atomically updated per-page reference counter.
The object's reference is counted using a flag bit in the counter. A
second flag bit is used to atomically block new references via
pmap_extract_and_hold() while removing managed mappings of a page.
Thus, the reference count of a page is guaranteed not to increase if the
page is unbusied, unmapped, and the object's write lock is held. As
a consequence of this, the page lock no longer protects a page's
identity; operations which move pages between objects are now
synchronized solely by the objects' locks.
The vm_page_wire() and vm_page_unwire() KPIs are changed. The former
requires that either the object lock or the busy lock is held. The
latter no longer has a return value and may free the page if it releases
the last reference to that page. vm_page_unwire_noq() behaves the same
as before; the caller is responsible for checking its return value and
freeing or enqueuing the page as appropriate. vm_page_wire_mapped() is
introduced for use in pmap_extract_and_hold(). It fails if the page is
concurrently being unmapped, typically triggering a fallback to the
fault handler. vm_page_wire() no longer requires the page lock and
vm_page_unwire() now internally acquires the page lock when releasing
the last wiring of a page (since the page lock still protects a page's
queue state). In particular, synchronization details are no longer
leaked into the caller.
The change excises the page lock from several frequently executed code
paths. In particular, vm_object_terminate() no longer bounces between
page locks as it releases an object's pages, and direct I/O and
sendfile(SF_NOCACHE) completions no longer require the page lock. In
these latter cases we now get linear scalability in the common scenario
where different threads are operating on different files.
__FreeBSD_version is bumped. The DRM ports have been updated to
accomodate the KPI changes.
Reviewed by: jeff (earlier version)
Tested by: gallatin (earlier version), pho
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20486
These calls are not the same in general: the former will dequeue the
page if it is enqueued, while the latter will just leave it alone. But,
all existing uses of the former apply to unmanaged pages, which are
never enqueued in the first place. No functional change intended.
Reviewed by: kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20470
This allows replacing "sys/eventfilter.h" includes with "sys/_eventfilter.h"
in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header
pollution substantially.
EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c
files into appropriate headers (e.g., sys/proc.h, powernv/opal.h).
As a side effect of reduced header pollution, many .c files and headers no
longer contain needed definitions. The remainder of the patch addresses
adding appropriate includes to fix those files.
LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by
sys/mutex.h since r326106 (but silently protected by header pollution prior
to this change).
No functional change (intended). Of course, any out of tree modules that
relied on header pollution for sys/eventhandler.h, sys/lock.h, or
sys/mutex.h inclusion need to be fixed. __FreeBSD_version has been bumped.
The type represents byte offset in the vm_object_t data space, which
does not span negative offsets in FreeBSD VM. The change matches byte
offset signess with the unsignedness of the vm_pindex_t which
represents the type of the page indexes in the objects.
This allows to remove the UOFF_TO_IDX() macro which was used when we
have to forcibly interpret the type as unsigned anyway. Also it fixes
a lot of implicit bugs in the device drivers d_mmap methods.
Reviewed by: alc, markj (previous version)
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
The Xen page-table walker used to resolve the virtual addresses in the
hypercalls will refuse to access user-space pages when SMAP is enabled
unless the AC flag in EFLAGS is set (just like normal hardware with
SMAP support would do).
Since privcmd allows forwarding hypercalls (and buffers) from
user-space into Xen make sure SMAP is temporary disabled for the
duration of the hypercall from user-space.
Approved by: re (gjb)
Sponsored by: Citrix Systems R&D
netfront_backend_changed() is called from the xenwatch_thread(), which means
that the curvnet is not set. We have to set it before we can call things like
arp_ifinit().
PR: 230845
The timespecadd(3) family of macros were imported from NetBSD back in
r35029. However, they were initially guarded by #ifdef _KERNEL. In the
meantime, we have grown at least 28 syscalls that use timespecs in some
way, leading many programs both inside and outside of the base system to
redefine those macros. It's better just to make the definitions public.
Our kernel currently defines two-argument versions of timespecadd and
timespecsub. NetBSD, OpenBSD, and FreeDesktop.org's libbsd, however, define
three-argument versions. Solaris also defines a three-argument version, but
only in its kernel. This revision changes our definition to match the
common three-argument version.
Bump _FreeBSD_version due to the breaking KPI change.
Discussed with: cem, jilles, ian, bde
Differential Revision: https://reviews.freebsd.org/D14725
This fixes the panic caused by deadlocking when grant-table free
callbacks are used.
The cause of the recursion is: check_free_callbacks() is always called
with the lock gnttab_list_lock held. In turn the callback function is
also called with the lock held. Then when the client uses any of the grant
reference methods which also attempt the lock the gnttab_list_lock
mutex from within the free callback a deadlock happens.
Fix this by making the gnttab_list_lock recursive.
Submitted by: Pratyush Yadav <pratyush@freebsd.org>
Differential Revision: https://reviews.freebsd.org/D16505
If gnttab_grant_foreign_access() fails for any of the indirection
pages, the code breaks out of both the loops without freeing the local
variable indirectpages, causing a memory leak.
Submitted by: Pratyush Yadav <pratyush@freebsd.org>
Differential Review: https://reviews.freebsd.org/D16136
Length is an unsigned integer, so checking against < 0 doesn't make
sense. While there also make clear that a length of 0 always succeeds.
Submitted by: Pratyush Yadav <pratyush@freebsd.org>
Differential Review: https://reviews.freebsd.org/D16045
When booted as PVHv2, there's no ACPI CPU object, so attach the PV CPU
device in order to take it's place.
This is required in case some device or driver tries to poke at the
PCPU device field.
Sponsored by: Citrix Systems R&D
HYPERVISOR_start_info is only available to PV and PVHv1 guests, HVM
and PVHv2 guests get this data from HVM parameters that are fetched
using a hypercall.
Instead provide a set of helper functions that should be used to fetch
this data. The helper functions have different implementations
depending on whether FreeBSD is running as PVHv1 or HVM/PVHv2 guest
type.
This helps to cleanup generic Xen code by removing quite a lot of
xen_pv_domain and xen_hvm_domain macro usages.
Sponsored by: Citrix Systems R&D
lock order reversal: (sleepable after non-sleepable)
1st 0xfffffe00357ff538 xnb_softc (xen netback softc lock) @ /usr/src/sys/dev/xen/netback/netback.c:1069
2nd 0xffffffff81fdccb0 intrsrc (intrsrc) @ /usr/src/sys/x86/x86/intr_machdep.c:224
There's no need to hold the lock since the cleaning of the interrupt
cannot happen in parallel due to the XNBF_IN_SHUTDOWN flag being set.
Note that the locking in netback needs some improvement or
clarification.
While there also remove a double newline.
Sponsored by: Citrix Systems R&D
Without a call to check_free_callbacks() clients waiting for grant
references would not be woken up even when there are sufficient grant
references available.
The check was likely left out as a mistake when the function was first
added.
Note that other functions used to free grant references already call
check_free_callbacks.
Submitted by: pratyush
Reviewed by: royger
Differential review: https://reviews.freebsd.org/D15899
Remove the device from the list before unbinding it. Doing it in this
order allows calling xen_intr_unbind without holding the bind_mutex
lock.
Sponsored by: Citrix Systems R&D