in the future in a backward compatible (API and ABI) way.
The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides room for 114 rights (50 more than the previous
cap_rights_t), and it can grow to hold at least 285 rights; it can be made
even larger if 285 rights turn out not to be enough.
The structure definition looks like this:
struct cap_rights {
uint64_t cr_rights[CAP_RIGHTS_VERSION + 2];
};
The initial CAP_RIGHTS_VERSION is 0.
The top two bits in the first element of the cr_rights[] array contain the
total number of elements in the array minus 2. This means that if those two
bits are equal to 0, we have 2 array elements.
The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain the array index. Only one bit
is set, and its position within this five-bit range defines the array index.
This means there can be at most five array elements in the future.
To define a new right, the CAPRIGHT() macro must be used. The macro takes two
arguments, an array index and a bit to set, e.g.:
#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL)
We still support aliases that combine a few rights, but the rights have to
belong to the same array element, e.g.:
#define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL)
#define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL)
#define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP)
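The layout described above can be sketched as follows. This is only an illustration under stated assumptions: the SK_/sk_ names are hypothetical, and the sketch assumes that the lowest bit of the five-bit index field (bit 57) stands for array element 0. With two bits for the element count and five for the index, each element keeps 64 - 2 - 5 = 57 bits for rights, which gives 2 * 57 = 114 rights today and 5 * 57 = 285 with five elements.

```c
#include <stdint.h>

/* Hypothetical macro: set the one-hot index bit and the right's own bit. */
#define SK_CAPRIGHT(idx, bit)	((1ULL << (57 + (idx))) | (bit))

/* Top two bits: number of array elements minus 2. */
static inline int
sk_capver(uint64_t right)
{
	return ((int)(right >> 62));
}

/* Next five bits (57..61): one-hot encoded array index. */
static inline int
sk_capidxbit(uint64_t right)
{
	return ((int)((right >> 57) & 0x1f));
}

/* Decode the one-hot field; -1 means zero or several index bits are set. */
static inline int
sk_right_to_index(uint64_t right)
{
	switch (sk_capidxbit(right)) {
	case 0x01: return (0);
	case 0x02: return (1);
	case 0x04: return (2);
	case 0x08: return (3);
	case 0x10: return (4);
	default:   return (-1);
	}
}

/* The examples from the text, rebuilt with the sketch macro. */
#define SK_CAP_LOOKUP	SK_CAPRIGHT(0, 0x0000000000000400ULL)
#define SK_CAP_FCHMOD	SK_CAPRIGHT(0, 0x0000000000002000ULL)
#define SK_CAP_FCHMODAT	(SK_CAP_FCHMOD | SK_CAP_LOOKUP)
#define SK_CAP_PDKILL	SK_CAPRIGHT(1, 0x0000000000000800ULL)
```

An alias such as SK_CAP_FCHMODAT still decodes to a single index because both of its members carry the same index bit.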
There is new API to manage the new cap_rights_t structure:
cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
void cap_rights_set(cap_rights_t *rights, ...);
void cap_rights_clear(cap_rights_t *rights, ...);
bool cap_rights_is_set(const cap_rights_t *rights, ...);
bool cap_rights_is_valid(const cap_rights_t *rights);
void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);
Capability rights are passed to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions as a comma-separated
list, e.g.:
cap_rights_t rights;
cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);
There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, e.g.:
#define cap_rights_set(rights, ...) \
__cap_rights_set((rights), __VA_ARGS__, 0ULL)
void __cap_rights_set(cap_rights_t *rights, ...);
Because a single bit encodes the array index, those functions can assert that
no two rights belonging to different array elements are provided together.
For example, this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:
cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);
Providing several rights that belong to the same array element this way is
correct, but is not advised. It should only be used for alias definitions.
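A minimal userland sketch of how a terminated argument list and the index check might fit together. All sk_ names are hypothetical, bit 57 is assumed to stand for element 0, and the sketch returns false where the text says the real functions assert.

```c
#include <stdarg.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_NELEMS	2	/* version 0: two array elements */
#define SK_CAPRIGHT(idx, bit)	((1ULL << (57 + (idx))) | (bit))

struct sk_rights {
	uint64_t cr_rights[SK_NELEMS];
};

/* Map the one-hot index field (bits 57..61) to an array index, or -1. */
static int
sk_right_to_index(unsigned long long right)
{
	unsigned long long idxbits = (right >> 57) & 0x1f;
	int i;

	for (i = 0; i < 5; i++)
		if (idxbits == (1ULL << i))
			return (i);
	return (-1);
}

/* Walk a 0ULL-terminated list; reject rights that span array elements. */
static bool
sk_rights_vset(struct sk_rights *rights, ...)
{
	va_list ap;
	unsigned long long right;
	bool ok = true;
	int i;

	va_start(ap, rights);
	while ((right = va_arg(ap, unsigned long long)) != 0) {
		i = sk_right_to_index(right);
		if (i < 0 || i >= SK_NELEMS) {
			ok = false;	/* mixed or unknown element */
			break;
		}
		rights->cr_rights[i] |= right;
	}
	va_end(ap);
	return (ok);
}

/* The macro appends the terminator so callers never have to. */
#define sk_rights_set(rights, ...) \
	sk_rights_vset((rights), __VA_ARGS__, 0ULL)
```

Passing two rights from different elements as separate arguments is accepted, while OR-ing them into one argument sets two index bits and is rejected.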
This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.
Sponsored by: The FreeBSD Foundation
Properly round-trip the "operation code" for client requests.
sys/dev/xen/blkback/blkback.c:
In xbb_dispatch_dev() when processing a flush request,
correctly set bio->bio_caller1 to the request list (not
bare request) for the operation, as is expected by the
completion handler xbb_bio_done().
In xbb_get_resources(), initialize "operation" in the
driver's internal request object from the client's "ring
request", so it is correct when used to populate the reply
when this operation completes.
Submitted by: Roger Pau Monné
Sponsored by: Citrix Systems R&D
Reviewed by: gibbs
It turns out that synaptics_support was turned off by default
because its probing method is too intrusive, not because it was unstable.
Once this is fixed it should be enabled once again.
Reported by: delphij, jkim
Rework the timeout code to use actual time rather than a DELAY() loop and
to use both typical and maximum timeout values to allow logging of timeout
failures. Also correct the erase timeout: it is specified in milliseconds,
not microseconds like the other timeouts. Do not invoke DELAY() between
status queries, as this adds significant latency, which in turn reduces
write performance substantially.
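The polling approach can be illustrated with a userland sketch. This is not the driver code: the sk_ names are hypothetical, clock_gettime() stands in for the kernel time source, and treating "slower than typical" as a separate, loggable outcome is one plausible reading of the description.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

enum sk_wait_result { SK_WAIT_OK, SK_WAIT_SLOW, SK_WAIT_TIMEOUT };

/* Current monotonic time in microseconds. */
static int64_t
sk_now_us(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ((int64_t)ts.tv_sec * 1000000 + ts.tv_nsec / 1000);
}

/* Erase timeouts are given in milliseconds; convert before use. */
static int64_t
sk_ms_to_us(int64_t ms)
{
	return (ms * 1000);
}

/*
 * Query ready() back to back, with no sleep in between, until it reports
 * completion or max_us of real time has elapsed.
 */
static enum sk_wait_result
sk_wait_ready(bool (*ready)(void *), void *arg, int64_t typ_us,
    int64_t max_us)
{
	int64_t start = sk_now_us();

	for (;;) {
		if (ready(arg))
			return (sk_now_us() - start > typ_us ?
			    SK_WAIT_SLOW : SK_WAIT_OK);
		if (sk_now_us() - start > max_us)
			return (SK_WAIT_TIMEOUT);
	}
}

/*
 * Stand-in for a status register read: becomes ready after *arg more
 * queries; a negative count never becomes ready.
 */
static bool
sk_countdown_ready(void *arg)
{
	int *left = arg;

	if (*left > 0)
		(*left)--;
	return (*left == 0);
}
```

A caller would log SK_WAIT_SLOW and SK_WAIT_TIMEOUT results, and pass erase limits through sk_ms_to_us() so that all limits share one unit.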
Sanity check timeout values from the hardware.
Implement support for buffered writes (only enabled on Intel/Sharp parts
for now). This yields an order of magnitude speedup on the 64MB Intel
StrataFlash parts we use.
When making a copy of the block to modify, also keep a clean copy around
until we are ready to commit the block and use it to avoid unnecessary
erases. In the non-buffer write case, also use it to avoid
unnecessary writes when the block has not been erased. This yields a
significant speedup when doing things like zeroing a block.
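The clean-copy comparison might look like the sketch below. It assumes NOR flash semantics, where programming can only change bits from 1 to 0 and only an erase sets them back to 1; the sk_ names are hypothetical and the driver's actual logic may differ.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum sk_commit_action {
	SK_COMMIT_NONE,		/* contents unchanged */
	SK_COMMIT_WRITE,	/* only 1 -> 0 transitions: program in place */
	SK_COMMIT_ERASE_WRITE	/* some bit must go 0 -> 1: erase first */
};

/* Compare the clean copy of a block against the modified copy. */
static enum sk_commit_action
sk_commit_plan(const uint8_t *clean, const uint8_t *dirty, size_t len)
{
	bool differs = false;
	size_t i;

	for (i = 0; i < len; i++) {
		/* A bit set in dirty but clear in clean needs an erase. */
		if ((dirty[i] & ~clean[i]) != 0)
			return (SK_COMMIT_ERASE_WRITE);
		if (dirty[i] != clean[i])
			differs = true;
	}
	return (differs ? SK_COMMIT_WRITE : SK_COMMIT_NONE);
}
```

Zeroing a block only clears bits, so under these assumptions the plan is SK_COMMIT_WRITE and the erase is skipped.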
Sponsored by: DARPA, AFRL
Reviewed by: imp (previous version)
set to 15 to indicate that the peer did not send a window scale option
with its SYN. Do not send a window scale option in the SYN|ACK reply
in that case.
Initialize the request id for requests in xbb_get_resources()
instead of its previous location in xbb_dispatch_io(). This
guarantees that all request types (e.g. BLKIF_OP_FLUSH_DISKCACHE)
have the front-end specified id recorded.
Submitted by: Roger Pau Monné
Sponsored by: Citrix Systems R&D
This is a significant rewrite of much of the previous driver; lots of
misc. cleanup was also performed, and support for a few other minor
features was also added.
Partial support for the EVENT_IDX feature was added a while ago,
but this commit adds an interface for the device driver to hint
how long (in terms of descriptors) the next interrupt should be
delayed.
The first user of this will be VirtIO net, to reduce its Tx
completion interrupts.
priority. If the write is requested by a system daemon, sleeping
there would starve resources and cause deadlock.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
A warning is emitted again if the temperature became briefly valid
meanwhile. This avoids spamming the user when the sensor is broken.
Other values (i.e. not _TMP) always raise a warning.
is not giving us a 100% success rate. Bump the delay to 200ms as
that seems to do the trick.
Note that during testing the delay was added to uart_bus_attach()
in uart_core.c. While having the delay in a different place can
change the behaviour, this was not expected. Having to bump the
delay by another 50ms could therefore be an indication that
the problem cannot be solved with delays.
Reported by: kevlo@
Tested by: kevlo@
- Allow the Rx/Tx queue sizes to be configured by tunables
- Bail out earlier if the Tx queue is unlikely to have enough free
descriptors to hold the frame
- Clean up some of the offloading capabilities handling
value. Some hosts do not publish "extended" disk IDs via virtual-device in
an attempt to avoid confusing old blkfront drivers, and without this change
we failed to attach such disks.
In particular, this commit allows all 24 ephemeral disks on EC2 hs1.8xlarge
instances to be used, instead of only the first 15.
MFC after: 3 days
cards.
This is a T4 and T5 chip feature which lets the chip deliver multiple
Ethernet frames in a single buffer. This is more efficient within the
chip, in the driver, and reduces wastage of space in rx buffers.
- Always allocate rx buffers from the jumbop zone, no matter what the
MTU is. Do not use the normal cluster refcounting mechanism.
- Reserve space for an mbuf and a refcount in the cluster itself and let
the chip DMA multiple frames in the rest.
- Use the embedded mbuf for the first frame and allocate mbufs on the
fly for any additional frames delivered in the cluster. Each of these
mbufs has a reference on the underlying cluster.
Use this new driver for both PV and HVM instances.
This driver requires a Xen hypervisor that supports vector callbacks,
VCPUOP hypercalls, and reports that it has a "safe PV clock".
New timer driver:
Submitted by: will
Sponsored by: Spectra Logic Corporation
PV port to new driver, and bug fixes:
Submitted by: Roger Pau Monné
Sponsored by: Citrix Systems R&D
sys/dev/xen/timer/timer.c:
- Register a PV timer device driver which (currently)
implements device_{identify,probe,attach} and stubs
device_detach. The detach routine requires functionality
not provided by timecounters(4). The suspend and resume
routines need additional work (due to Xen requiring that
the hypercalls be executed on the target VCPU), and aren't
needed for our purposes.
- Make sure there can only be one device instance of this
driver, and that it only registers one eventtimers(4) and
one timecounters(4) device interface. Make both interfaces
use PCPU data as needed.
- Match, with a few style cleanups & API differences, the
Xen versions of the "fetch time" functions.
- Document the magic scale_delta() better for the i386 version.
- When registering the event timer, bind a separate event
channel for the timer VIRQ to the device's event timer
interrupt handler for each active VCPU. Describe each
interrupt as "xen_et:c%d", so they can be identified per
CPU in "vmstat -i" or "show intrcnt" in KDB.
- When scheduling a timer into the hypervisor, try up to
60 times if the hypervisor rejects the time as being in
the past. In the common case, this retry shouldn't happen,
and if it does, it should only happen once. This is
because the event timer advertises a minimum period of
100usec, which is only less than the usual hypercall round
trip time about 1 out of every 100 tries. (Unlike other
similar drivers, this one actually checks whether the
hypervisor accepted the singleshot timer set hypercall.)
- Implement an RTC PV clock based on the hypervisor wallclock.
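The bounded retry described in the list above can be sketched in userland as follows. The fake hypervisor and every sk_ name are hypothetical stand-ins for the real clock read and hypercall, and the sketch returns an error after the last attempt without claiming what the driver does at that point.

```c
#include <stdint.h>

#define SK_MAX_TRIES	60

/* Hypothetical stand-in for the hypervisor, for illustration only. */
struct sk_fake_hv {
	uint64_t now_ns;	/* current "hypervisor" time */
	int rejects_left;	/* number of set calls still to reject */
};

static uint64_t
sk_fake_now(struct sk_fake_hv *hv)
{
	return (hv->now_ns);
}

/* Returns 0 if accepted, -1 if rejected as being in the past. */
static int
sk_fake_set_timer(struct sk_fake_hv *hv, uint64_t deadline_ns)
{
	if (hv->rejects_left > 0) {
		hv->rejects_left--;
		return (-1);
	}
	return (deadline_ns > hv->now_ns ? 0 : -1);
}

/*
 * Re-read the clock and recompute the deadline on every attempt, and
 * check whether the set call was actually accepted.
 */
static int
sk_timer_start(struct sk_fake_hv *hv, uint64_t period_ns, int *tries_out)
{
	int error;
	int tries;

	for (tries = 1; ; tries++) {
		error = sk_fake_set_timer(hv, sk_fake_now(hv) + period_ns);
		if (error == 0 || tries == SK_MAX_TRIES)
			break;
	}
	*tries_out = tries;
	return (error);
}
```

Recomputing the deadline from a fresh clock read on each attempt is what lets a retry succeed after the first deadline has already passed.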
sys/conf/files:
- Add dev/xen/timer/timer.c if the kernel configuration
includes either the XEN or XENHVM options.
sys/conf/files.i386:
sys/i386/include/xen/xen_clock_util.h:
sys/i386/xen/clock.c:
sys/i386/xen/xen_clock_util.c:
sys/i386/xen/mp_machdep.c:
sys/i386/xen/xen_rtc.c:
- Remove the previous PV timer used in i386 XEN PV kernels; the
new timer introduced in this change is used instead (so
we share the same code between PVHVM and PV).
MFC after: 2 weeks
Re-structure Xen HVM support so that:
- Xen is detected and hypercalls can be performed very
early in system startup.
- Xen interrupt services are implemented using FreeBSD's native
interrupt delivery infrastructure.
- the Xen interrupt service implementation is shared between PV
and HVM guests.
- Xen interrupt handlers can optionally use a filter handler
in order to avoid the overhead of dispatch to an interrupt
thread.
- interrupt load can be distributed among all available CPUs.
- the overhead of accessing the emulated local and I/O apics
on HVM is removed for event channel port events.
- a similar optimization can eventually, and fairly easily,
be used to optimize MSI.
Early Xen detection, HVM refactoring, PVHVM interrupt infrastructure,
and misc Xen cleanups:
Sponsored by: Spectra Logic Corporation
Unification of PV & HVM interrupt infrastructure, bug fixes,
and misc Xen cleanups:
Submitted by: Roger Pau Monné
Sponsored by: Citrix Systems R&D
sys/x86/x86/local_apic.c:
sys/amd64/include/apicvar.h:
sys/i386/include/apicvar.h:
sys/amd64/amd64/apic_vector.S:
sys/i386/i386/apic_vector.s:
sys/amd64/amd64/machdep.c:
sys/i386/i386/machdep.c:
sys/i386/xen/exception.s:
sys/x86/include/segments.h:
Reserve IDT vector 0x93 for the Xen event channel upcall
interrupt handler. On Hypervisors that support the direct
vector callback feature, we can request that this vector be
called directly by an injected HVM interrupt event, instead
of a simulated PCI interrupt on the Xen platform PCI device.
This avoids all of the overhead of dealing with the emulated
I/O APIC and local APIC. It also means that the Hypervisor
can inject these events on any CPU, allowing upcalls for
different ports to be handled in parallel.
sys/amd64/amd64/mp_machdep.c:
sys/i386/i386/mp_machdep.c:
Map Xen per-vcpu area during AP startup.
sys/amd64/include/intr_machdep.h:
sys/i386/include/intr_machdep.h:
Increase the FreeBSD IRQ vector table to include space
for event channel interrupt sources.
sys/amd64/include/pcpu.h:
sys/i386/include/pcpu.h:
Remove Xen HVM per-cpu variable data. These fields are now
allocated via the dynamic per-cpu scheme. See xen_intr.c
for details.
sys/amd64/include/xen/hypercall.h:
sys/dev/xen/blkback/blkback.c:
sys/i386/include/xen/xenvar.h:
sys/i386/xen/clock.c:
sys/i386/xen/xen_machdep.c:
sys/xen/gnttab.c:
Prefer FreeBSD primitives to Linux ones in Xen support code.
sys/amd64/include/xen/xen-os.h:
sys/i386/include/xen/xen-os.h:
sys/xen/xen-os.h:
sys/dev/xen/balloon/balloon.c:
sys/dev/xen/blkback/blkback.c:
sys/dev/xen/blkfront/blkfront.c:
sys/dev/xen/console/xencons_ring.c:
sys/dev/xen/control/control.c:
sys/dev/xen/netback/netback.c:
sys/dev/xen/netfront/netfront.c:
sys/dev/xen/xenpci/xenpci.c:
sys/i386/i386/machdep.c:
sys/i386/include/pmap.h:
sys/i386/include/xen/xenfunc.h:
sys/i386/isa/npx.c:
sys/i386/xen/clock.c:
sys/i386/xen/mp_machdep.c:
sys/i386/xen/mptable.c:
sys/i386/xen/xen_clock_util.c:
sys/i386/xen/xen_machdep.c:
sys/i386/xen/xen_rtc.c:
sys/xen/evtchn/evtchn_dev.c:
sys/xen/features.c:
sys/xen/gnttab.c:
sys/xen/gnttab.h:
sys/xen/hvm.h:
sys/xen/xenbus/xenbus.c:
sys/xen/xenbus/xenbus_if.m:
sys/xen/xenbus/xenbusb_front.c:
sys/xen/xenbus/xenbusvar.h:
sys/xen/xenstore/xenstore.c:
sys/xen/xenstore/xenstore_dev.c:
sys/xen/xenstore/xenstorevar.h:
Pull common Xen OS support functions/settings into xen/xen-os.h.
sys/amd64/include/xen/xen-os.h:
sys/i386/include/xen/xen-os.h:
sys/xen/xen-os.h:
Remove constants, macros, and functions unused in FreeBSD's Xen
support.
sys/xen/xen-os.h:
sys/i386/xen/xen_machdep.c:
sys/x86/xen/hvm.c:
Introduce new functions xen_domain(), xen_pv_domain(), and
xen_hvm_domain(). These are used in favor of #ifdefs so that
FreeBSD can dynamically detect and adapt to the presence of
a hypervisor. The goal is to have an HVM optimized GENERIC,
but more is necessary before this is possible.
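A minimal sketch of run-time detection replacing compile-time #ifdefs. The sk_ names, the enum, and the static variable are assumptions made for illustration; the real functions may be defined differently.

```c
#include <stdbool.h>

/* Hypothetical domain types; set once during early startup. */
enum sk_xen_domain_type {
	SK_XEN_NATIVE,		/* not running under Xen */
	SK_XEN_PV_DOMAIN,	/* paravirtualized guest */
	SK_XEN_HVM_DOMAIN	/* hardware virtualized guest */
};

static enum sk_xen_domain_type sk_xen_domain_type = SK_XEN_NATIVE;

/* True for any kind of Xen guest. */
static inline bool
sk_xen_domain(void)
{
	return (sk_xen_domain_type != SK_XEN_NATIVE);
}

static inline bool
sk_xen_pv_domain(void)
{
	return (sk_xen_domain_type == SK_XEN_PV_DOMAIN);
}

static inline bool
sk_xen_hvm_domain(void)
{
	return (sk_xen_domain_type == SK_XEN_HVM_DOMAIN);
}
```

Code that used to sit inside an #ifdef can then test sk_xen_hvm_domain() at run time, so one kernel binary can serve both native and virtualized boots.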
sys/amd64/amd64/machdep.c:
sys/dev/xen/xenpci/xenpcivar.h:
sys/dev/xen/xenpci/xenpci.c:
sys/x86/xen/hvm.c:
sys/sys/kernel.h:
Refactor magic ioport, Hypercall table and Hypervisor shared
information page setup, and move it to a dedicated HVM support
module.
HVM mode initialization is now triggered during the
SI_SUB_HYPERVISOR phase of system startup. This currently
occurs just after the kernel VM is fully set up, which is
just enough infrastructure to allow the hypercall table
and shared info page to be properly mapped.
sys/xen/hvm.h:
sys/x86/xen/hvm.c:
Add definitions and a method for configuring Hypervisor event
delivery via a direct vector callback.
sys/amd64/include/xen/xen-os.h:
sys/x86/xen/hvm.c:
sys/conf/files:
sys/conf/files.amd64:
sys/conf/files.i386:
Adjust kernel build to reflect the refactoring of early
Xen startup code and Xen interrupt services.
sys/dev/xen/blkback/blkback.c:
sys/dev/xen/blkfront/blkfront.c:
sys/dev/xen/blkfront/block.h:
sys/dev/xen/control/control.c:
sys/dev/xen/evtchn/evtchn_dev.c:
sys/dev/xen/netback/netback.c:
sys/dev/xen/netfront/netfront.c:
sys/xen/xenstore/xenstore.c:
sys/xen/evtchn/evtchn_dev.c:
sys/dev/xen/console/console.c:
sys/dev/xen/console/xencons_ring.c:
Adjust drivers to use new xen_intr_*() API.
sys/dev/xen/blkback/blkback.c:
Since blkback defers all event handling to a taskqueue,
convert this task queue to a "fast" taskqueue, and schedule
it via an interrupt filter. This avoids an unnecessary
ithread context switch.
sys/xen/xenstore/xenstore.c:
The xenstore driver is MPSAFE. Indicate as much when
registering its interrupt handler.
sys/xen/xenbus/xenbus.c:
sys/xen/xenbus/xenbusvar.h:
Remove unused event channel APIs.
sys/xen/evtchn.h:
Remove all kernel Xen interrupt service API definitions
from this file. It is now only used for structure and
ioctl definitions related to the event channel userland
device driver.
Update the definitions in this file to match those from
NetBSD. Implementing this interface will be necessary for
Dom0 support.
sys/xen/evtchn/evtchnvar.h:
Add a header file for implementation internal APIs related
to managing event channel event delivery. This is used
to allow, for example, the event channel userland device
driver to access low-level routines that typical kernel
consumers of event channel services should never access.
sys/xen/interface/event_channel.h:
sys/xen/xen_intr.h:
Standardize on the evtchn_port_t type for referring to
an event channel port id. In order to prevent low-level
event channel APIs from leaking to kernel consumers who
should not have access to this data, the type is defined
twice: Once in the Xen provided event_channel.h, and again
in xen/xen_intr.h. The double declaration is protected by
__XEN_EVTCHN_PORT_DEFINED__ to ensure it is never declared
twice within a given compilation unit.
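The guard pattern can be sketched as below, with both copies placed in one file to mimic a compilation unit that includes both headers. The underlying uint32_t type is an assumption made for the example.

```c
#include <stdint.h>

/* First copy, as it might appear in the Xen-provided event_channel.h. */
#ifndef __XEN_EVTCHN_PORT_DEFINED__
typedef uint32_t evtchn_port_t;
#define __XEN_EVTCHN_PORT_DEFINED__ 1
#endif

/* Second copy, as it might appear in xen/xen_intr.h; skipped here. */
#ifndef __XEN_EVTCHN_PORT_DEFINED__
typedef uint32_t evtchn_port_t;
#define __XEN_EVTCHN_PORT_DEFINED__ 1
#endif

/* A consumer sees the type without needing the low-level header. */
static inline evtchn_port_t
sk_port_from_raw(uint32_t raw)
{
	return ((evtchn_port_t)raw);
}
```

Whichever header is included first defines the type and the guard macro, and the other copy is then compiled out.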
sys/xen/xen_intr.h:
sys/xen/evtchn/evtchn.c:
sys/x86/xen/xen_intr.c:
sys/dev/xen/xenpci/evtchn.c:
sys/dev/xen/xenpci/xenpcivar.h:
New implementation of Xen interrupt services. This is
similar in many respects to the i386 PV implementation with
the exception that events bound to event channel ports
(i.e. not IPI, virtual IRQ, or physical IRQ) are further
optimized to avoid mask/unmask operations that aren't
necessary for these edge triggered events.
Stubs exist for supporting physical IRQ binding, but will
need additional work before this implementation can be
fully shared between PV and HVM.
sys/amd64/amd64/mp_machdep.c:
sys/i386/i386/mp_machdep.c:
sys/i386/xen/mp_machdep.c:
sys/x86/xen/hvm.c:
Add support for placing vcpu_info into an arbitrary memory
page instead of using HYPERVISOR_shared_info->vcpu_info.
This allows the creation of domains with more than 32 vcpus.
sys/i386/i386/machdep.c:
sys/i386/xen/clock.c:
sys/i386/xen/xen_machdep.c:
sys/i386/xen/exception.s:
Add support for new event channel implementation.