Replace the existing ad-hoc configuration via various global variables
with a small database of key-value pairs. The database supports
hierarchical keys using a MIB-like syntax to name the path to a given
key. Values are always stored as strings. The API used to manage
configuration values includes wrappers for handling boolean values.
Values of other non-string types must be parsed by consumers.
The configuration values are stored in a tree using nvlists. Leaf
nodes hold string values. Configuration values are permitted to
reference other configuration values using '%(name)'. This permits
constructing template configurations.
All existing command line arguments now set configuration values. For
devices, the "-s" option parses its option argument to generate a list
of key-value pairs for the given device.
A new '-o' command line option permits setting an individual
configuration variable. The key name is always given as a full path
of dot-separated components.
A new '-k' command line option parses a simple configuration file.
This configuration file holds a flat list of 'key=value' lines where
the 'key' is the full path of a configuration variable. Lines
starting with a '#' are comments.
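As a rough illustration of the file format described above, one line
could be split as follows. This is a minimal sketch, not bhyve's actual
parser: the function name parse_config_line and its return codes are
hypothetical, and skipping empty lines is an assumption.

```c
#include <string.h>

/*
 * Hypothetical helper, not bhyve's actual parser: split one line of a
 * flat 'key=value' file in place.
 * Returns 1 if a pair was found, 0 for a comment or an empty line
 * (skipping empty lines is an assumption), and -1 if the line has no
 * '=' separator or an empty key.
 */
int
parse_config_line(char *line, char **key, char **value)
{
	char *eq;

	/* Drop a trailing newline, if any. */
	line[strcspn(line, "\r\n")] = '\0';

	/* Lines starting with '#' are comments. */
	if (line[0] == '#' || line[0] == '\0')
		return (0);

	/* Split at the first '=' so a value may itself contain '='. */
	eq = strchr(line, '=');
	if (eq == NULL || eq == line)
		return (-1);
	*eq = '\0';
	*key = line;
	*value = eq + 1;
	return (1);
}
```

Given the line "pci.0.3.0.device=nvme", the sketch yields the key
"pci.0.3.0.device" and the value "nvme".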
In general, bhyve starts by parsing command line options in sequence
and applying those settings to configuration values. Once this is
complete, bhyve then begins initializing its state based on the
configuration values. This means that subsequent configuration
options or files may override or supplement previously given settings.
A special 'config.dump' configuration value can be set to true to help
debug configuration issues. When this value is set, bhyve will print
out the configuration variables as a flat list of 'key=value' lines.
Most command line arguments map to a single configuration variable,
e.g. '-w' sets the 'x86.strictmsr' value to false. A few command
line arguments have less obvious effects:
- Multiple '-p' options append their values (as a comma-separated
list) to "vcpu.N.cpuset" values (where N is a decimal vcpu number).
- For '-s' options, a pci.<bus>.<slot>.<function> node is created.
The first argument to '-s' (the device type) is used as the value of
a "device" variable. Additional comma-separated arguments are then
parsed into 'key=value' pairs and used to set additional variables
under the device node. A PCI device emulation driver can provide
its own hook to override the parsing of the additional '-s' arguments
after the device type.
After the configuration phase has completed, the init_pci hook
then walks the "pci.<bus>.<slot>.<func>" nodes. It uses the
"device" value to find the device model to use. The device
model's init routine is passed a reference to its nvlist node
in the configuration tree which it can query for specific
variables.
The result is that a lot of the string parsing is removed from
the device models and centralized. In addition, adding a new
variable just requires teaching the model to look for the new
variable.
- For '-l' options, a similar model is used where the string is
parsed into values that are later read during initialization.
One key note here is that the serial ports use the commonly
used lowercase names from existing documentation and examples
(e.g. "lpc.com1") instead of the uppercase names previously
used internally in bhyve.
Reviewed by: grehan
MFC after: 3 months
Differential Revision: https://reviews.freebsd.org/D26035
The NVMe emulation code did not explicitly initialize queue head and
tail pointers on queue creation. As these pointers are part of
calloc()'ed memory, this only becomes a problem if the queues are
deleted and then recreated.
This error can manifest with messages about completions not matching a
command.
Some operating systems believe bhyve's emulated NVMe drive is failing
based on certain values in the SMART / Health Information log page being
zero. Fix is to set the reported temperature and available spare values
to reasonable defaults.
Submitted by: wanpengqian@gmail.com
Reviewed by: grehan
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D24202
The NVMe specification requires unused entries in the Identify, Active
Namespace ID data to be zero. Fix is to bzero the provided page, similar
to what is done for the Namespace Descriptors list.
Fixes UNH Tests 2.6 and 2.9
Tested by: Jason Tubnor
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D24901
Dataset Management range specifications may have a zero length (a.k.a.
an empty range definition). Handle the case of all ranges being empty by
completing with Success (DSM commands are advisory only). For
Deallocate, skip empty range definitions when sending TRIMs to the
backing storage.
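The handling of empty ranges can be sketched as below. The struct
follows the 16 byte range definition layout, but the type and function
names are hypothetical and this is not the code from the commit.

```c
#include <stddef.h>
#include <stdint.h>

/* One 16 byte Dataset Management range definition. */
struct dsm_range {
	uint32_t attributes;	/* context attributes */
	uint32_t length;	/* length in logical blocks; 0 means empty */
	uint64_t starting_lba;
};

/*
 * Hypothetical helper: count the ranges that would result in a TRIM.
 * A result of 0 means every range is empty, so the command can complete
 * with Success without touching the backing storage.
 */
size_t
dsm_count_nonempty(const struct dsm_range *ranges, size_t nranges)
{
	size_t i, count = 0;

	for (i = 0; i < nranges; i++) {
		if (ranges[i].length != 0)
			count++;
	}
	return (count);
}
```

A caller would issue a delete to the backing storage only for the
ranges with a non-zero length.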
Fixes UNH Test 2.2.4
Reviewed by: imp
Tested by: Jason Tubnor
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D24900
If the Predictable Latency Mode is not supported, NVMe Controllers must
return Invalid Field in Command status for the Get Features command
with IDs:
- Predictable Latency Mode Config
- Predictable Latency Mode Window
Fixes UNH Tests 3.6
Tested by: Jason Tubnor
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D24899
This adds support for NVMe Get Features, Interrupt Vector Config
parameter error checking done by the UNH compliance tests.
Fixes UNH Tests 1.6.8 and 5.5.6
Tested by: Jason Tubnor
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D24898
This commit updates the Identify Controller data to advertise the
Controller supports a single firmware slot and that firmware slot 1 is
read-only. Additionally, it returns an "Invalid Firmware Slot" error
when the host issues any Firmware Commit command (a.k.a. Firmware
Activate).
Fixes UNH Test 5.5.3
Tested by: Jason Tubnor
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D24897
This adds support to bhyve's NVMe device emulation for processing Async
Event Requests but not returning them (i.e. Async Event Notifications).
Fixes UNH Test 5.5.2
Tested by: Jason Tubnor
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D24896
Add checks that the combination of Starting LBA and Number of Logical
Blocks in a command will not exceed the range of the underlying storage.
Note that because NVMe specifies the Starting LBA as a uint64_t, care
must be taken when converting it and the block count to avoid an integer
overflow.
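One overflow-safe way to write such a check is sketched here. The
function name lba_range_ok is hypothetical, and the sketch assumes the
block count has already been converted from its 0's based encoding, so
nblocks is at least 1.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: return true if the range starting at 'slba' and
 * spanning 'nblocks' logical blocks lies within a backing store of
 * 'capacity' blocks. Comparing against capacity - slba avoids computing
 * slba + nblocks, which could wrap around for a large 64 bit slba.
 */
bool
lba_range_ok(uint64_t slba, uint64_t nblocks, uint64_t capacity)
{
	if (slba >= capacity)
		return (false);
	return (nblocks <= capacity - slba);
}
```

A naive test such as slba + nblocks > capacity would accept a Starting
LBA near UINT64_MAX, because the sum wraps to a small number.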
Fixes UNH Tests 2.2.3, 2.3.2, and 2.4.2
Tested by: Jason Tubnor
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D24895
SMART data in NVMe includes statistics for number of read and write
commands issued as well as the number of "data units" read and written.
NVMe defines "data unit" as thousands of 512 byte blocks (e.g. 1 data
unit is 1-1,000 512 byte blocks, 3 data units are 2,001-3,000 512 byte
blocks).
This patch implements counters for:
- Data Units Read
- Data Units Written
- Host Read Commands
- Host Write Commands
and exposes the values when the guest reads the SMART/Health Log Page.
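The "data unit" rounding described above can be sketched as a simple
round-up division. The function name is hypothetical, and the actual
patch may track the counters differently.

```c
#include <stdint.h>

/*
 * Hypothetical helper: convert a count of 512 byte blocks into NVMe
 * "data units", rounding up so that 1-1,000 blocks report 1 unit and
 * 2,001-3,000 blocks report 3 units.
 */
uint64_t
blocks_to_data_units(uint64_t blocks512)
{
	return (blocks512 / 1000 + (blocks512 % 1000 != 0));
}
```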
Fixes UNH Test 1.3.8
Tested by: Jason Tubnor
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D24894
For NVMe emulation, validate the Data Set Management LBA ranges do not
exceed the capacity of the backing storage. If they do, return an "LBA
Out of Range" error.
Fixes UNH Test 2.2.3
Tested by: Jason Tubnor
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D24893
NVMe controllers advertise their Max Data Transfer Size (MDTS) to limit
the number of page descriptors in an I/O request. Take advantage of this
and size the struct pci_nvme_ioreq accordingly.
Ensuring these values match both future-proofs the code and allows
removing some complexity which only exists to handle this possibility.
Tested by: Jason Tubnor
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D24891
Split the NVM I/O function (i.e. nvme_opc_write_read) into separate
functions - one for RAM based backing-store and another for disk based
backing-store for easier maintenance. No functional changes.
Tested by: Jason Tubnor
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D24890
The Format NVM command mainly allows the host to specify the block size
and protection information used for the Namespace. As the bhyve
implementation simply maps the capabilities of the backing storage
through to the guest, there isn't anything to implement. But a side
effect of the format is the NVMe Controller shall not return any data
previously written (i.e. erase previously written data). This patch
implements this latter behavior to provide a compliant implementation.
Fixes UNH Test 1.6
Tested by: Jason Tubnor
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D24889
Create a generic Get/Set Features by saving off the contents of CDW11
from the Set command and returning the saved value in the completion of
the Get command. The implementation allows optional feature specific
handlers for both Set and Get.
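The save-and-return scheme might look roughly like the following. This
is a sketch under assumptions: the struct and function names are
hypothetical and do not reflect the data structures in the commit.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-feature state for a generic Get/Set Features. */
struct feature_entry {
	uint32_t cdw11;	/* value saved by the most recent Set Features */
	/* Optional feature specific handlers; NULL selects the default. */
	void (*set)(struct feature_entry *fe, uint32_t cdw11);
	uint32_t (*get)(const struct feature_entry *fe);
};

/* Set Features: call the handler if present, else save CDW11. */
void
feature_set(struct feature_entry *fe, uint32_t cdw11)
{
	if (fe->set != NULL)
		fe->set(fe, cdw11);
	else
		fe->cdw11 = cdw11;
}

/* Get Features: return the value to place in the completion. */
uint32_t
feature_get(const struct feature_entry *fe)
{
	return (fe->get != NULL ? fe->get(fe) : fe->cdw11);
}
```

A feature with no special behavior needs only a zeroed entry, while a
feature with its own rules supplies one or both handlers.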
Add infrastructure to determine which feature IDs are namespace
specific and flag violations of this requirement as errors.
Also adds the feature specific behavior of Set Features, Number of
Queues to only allow this command once per Controller reset.
Fixes UNH Tests 1.2, 5.4, and 5.5.6
Tested by: Jason Tubnor
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D24887
Fix the logic in nvme_opc_get_log_page to calculate the number of DWORDS
(uint32_t) instead of WORDS (uint16_t) for the byte length. And only
return the allowed number of Log Page bytes as determined by the user
request and actual size of the requested log page.
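The length calculation can be sketched as below, assuming the 0's based
dword count has already been extracted from the command. The function
name log_page_xfer_bytes is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of Log Page bytes to return. 'numd' is
 * the 0's based count of dwords (4 bytes each) requested by the host,
 * and 'log_size' is the actual size of the requested log page.
 */
size_t
log_page_xfer_bytes(uint32_t numd, size_t log_size)
{
	uint64_t requested = ((uint64_t)numd + 1) * sizeof(uint32_t);

	return (requested < log_size ? (size_t)requested : log_size);
}
```

Multiplying by 2 (words) instead of 4 (dwords) would return only half
of what the host asked for.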
Fixes UNH Test 1.3
Tested by: Jason Tubnor
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D24885
Consolidate the code which writes Completion Queue entries and updates
the CQ doorbell value. While in the neighborhood, convert the "toggle CQ
phase bit" code to use an XOR operation instead of an "if/else" branch.
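The two ways of toggling the phase bit compare as follows. This sketch
assumes the Phase Tag is bit 0 of the 16 bit status field of a
completion entry; the macro and function names are hypothetical.

```c
#include <stdint.h>

/* Phase Tag: bit 0 of the 16 bit status field in a completion entry. */
#define CQE_STATUS_PHASE	0x0001u

/* Branching form: test the bit, then clear or set it. */
uint16_t
toggle_phase_branch(uint16_t status)
{
	if (status & CQE_STATUS_PHASE)
		status &= (uint16_t)~CQE_STATUS_PHASE;
	else
		status |= CQE_STATUS_PHASE;
	return (status);
}

/* XOR form: flips the same bit without a branch. */
uint16_t
toggle_phase_xor(uint16_t status)
{
	return ((uint16_t)(status ^ CQE_STATUS_PHASE));
}
```

Both forms return the same value for every input; the XOR form is
simply shorter.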
Tested by: Jason Tubnor
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D24882
The NVMe code attempted to ensure thread safety through a combination of
using atomics and a "busy" flag. But this approach leads to unavoidable
race conditions.
Fix is to use per-queue mutex locks to ensure thread safety within the
queue processing code. While in the neighborhood, move all the queue
initialization code to a common function.
Tested by: Jason Tubnor
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D19841
This adds support for the NVMe I/O command Flush. For block-based
devices, submit a DIOCGFLUSH to the backing storage. Otherwise, command
is treated like a NOP and completes with a Successful status.
Tested by: Jason Tubnor
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D24880
This refactors the NVMe I/O command processing function to make adding
new commands easier. The main change is to move command specific
processing (i.e. Read/Write) to separate functions for each NVMe I/O
command and leave the common per-command processing in the existing
pci_nvme_handle_io_cmd() function.
While here, add checks for some common errors (invalid Namespace ID,
invalid opcode, LBA out of range).
Add myself to the Copyright holders
Reviewed by: imp
Tested by: Jason Tubnor
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D24879
Convert the debug and warning logging macros to be parameterized and
correctly use bhyve's PRINTLN macro.
Reviewed by: imp
Tested by: Jason Tubnor
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D24878
The SQHD field of a Completion Queue entry indicates the current
Submission Queue head pointer value. The head pointer represents the
next entry to be consumed and is updated after consuming the current
entry.
In the Admin queue processing, the current code updates the head pointer
after reporting the value to the host via the SQHD. This gives the
impression that the Controller is perpetually one command behind in its
processing of the Admin SQ. And while this doesn't appear to bother some
initiators, it is wrong.
Fix is to update the SQ head pointer prior to writing the SQHD value in
the completion.
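The corrected ordering can be sketched as below. The struct and
function names are hypothetical; the point is only that the head is
advanced before its value is reported.

```c
#include <stdint.h>

/* Hypothetical submission queue state. */
struct sq_state {
	uint16_t head;	/* next entry to be consumed */
	uint16_t size;	/* number of entries in the queue */
};

/*
 * Consume one submission queue entry and return the SQHD value to
 * report in its completion. The head is advanced first, so SQHD names
 * the next entry to be consumed rather than the one just processed.
 */
uint16_t
sq_consume_one(struct sq_state *sq)
{
	sq->head = (uint16_t)((sq->head + 1) % sq->size);
	return (sq->head);
}
```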
While here, fix missed update of dword 0 (cdw0) in the completion
message.
Reported by: khng300
Reviewed by: jhb, imp
Approved by: jhb (maintainer)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D24083
The bhyve NVMe emulation has a race in the logic which generates command
completion interrupts. On FreeBSD guests, this manifests as kernel log
messages similar to:
nvme0: Missing interrupt
The NVMe emulation code sets a per-submission queue "busy" flag while
processing the submission queue, and only generates an interrupt when
the submission queue is not busy.
Aside from being counter to the NVMe design (i.e. interrupt properties
are tied to the completion queue) and adding complexity (e.g. exceptions
to not generating an interrupt when "busy"), it causes a race condition
under the following conditions:
- guest OS has no outstanding interrupts
- guest OS submits a single NVMe IO command
- bhyve emulation processes the SQ and sets the "busy" flag
- bhyve emulation submits the asynchronous IO to the backing storage
- IO request to the backing storage completes before the SQ processing
loop exits and doesn't generate an interrupt because the SQ is "busy"
- bhyve emulation finishes processing the SQ and clears the "busy" flag
Fix is to remove the "busy" flag and generate an interrupt when the CQ
head and tail pointers do not match.
Reported by: khng300
Reviewed by: jhb, imp
Approved by: jhb (maintainer)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D24082
This adds support for the Dataset Management (DSM) command to the NVMe
emulation in general, and more specifically, for the deallocate
attribute (a.k.a. trim in the ATA protocol). If the backing storage for
the namespace supports delete (i.e. deallocate), setting the deallocate
attribute in a DSM will trim/delete the requested LBA ranges in the
underlying storage.
Reviewed by: jhb, araujo, imp
Approved by: jhb (maintainer)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D21839
Pass the struct pci_nvme_blockstore pointer for this namespace to the
namespace initialization function instead of only the desired eui64
value.
Minor functional change in that the code updates the eui64 value in the
blockstore.
Reviewed by: jhb, araujo
Approved by: jhb (maintainer)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D21838
Add a "copy direction" parameter to nvme_prp_memcpy such that data can
be copied to the memory specified by the PRP entries (current behavior)
or copied from the PRP entries (new behavior). The upcoming deallocate
functionality will use the copy from capability.
Reviewed by: jhb, araujo
Approved by: jhb (maintainer)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D21837
Add printf() wrapper to use CR/CRLF terminators depending on whether
stdio is mapped to a tty open in raw mode.
Try to use the wrapper everywhere.
For now we leave the custom DPRINTF/WPRINTF defined by device
models, but we may remove them in the future.
Reviewed by: grehan, jhb
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D22657
Some of the printf statements only use LF to get a newline. However, a
CR character is also required for the serial console to print debug
logs in a nice way.
Fix those code locations that only use LF, by adding a CR character.
Reviewed by: markj, aleksandr.fedorov@itglobal.com
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D22552
Instead of skipping the NVMe Completion Queue update based on the
opcode, define a synthetic status value which indicates the completion
queue entry is invalid. This will also allow deferred completion queue
updates for other commands.
Also returns the correct status for unrecognized opcodes ("invalid
opcode").
Reviewed by: imp, jhb, araujo
Approved by: imp (mentor), jhb (maintainer)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D20945
Accept an IEEE Extended Unique Identifier (EUI-64) from the command
line for each NVMe namespace. If one isn't provided, it will create one
based on the CRC16 of:
- the FreeBSD IEEE OUI
- PCI bus, device/slot, function values
- Namespace ID
Reviewed by: imp, araujo, jhb, rgrimes
Approved by: imp (mentor), jhb (maintainer)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D19905
Follow-up work to improve the handling of unsupported/invalid opcodes
is being developed by chuck@.
Coverity CID: 1398928
Reviewed by: chuck
Approved by: araujo, imp
Differential Revision: https://reviews.freebsd.org/D20914
The NVMe CAM driver reports the PCIe Link Capability and Status for
devices. For emulated bhyve NVMe devices, this looks like:
nda0: nvme version 1.3 x63 (max x63) lanes PCIe Gen15 (max Gen15) link
The driver outputs this because the emulated device doesn't include the
PCIe Capability structure. The NVMe specification requires these
registers, so the fix is to add this set of capability registers to the
emulated device.
Note that PCI Express devices that are integrated into the Root Complex
(i.e. Bus 0x0) do not have to support the Link Capability or Status
registers. Windows will fail to start (i.e. Code 10) devices that appear
to be part of the Root Complex but report being a PCI Express Endpoint.
So also add a check to pci_emul_add_pciecap() to check if the device is
integrated and change the device type.
Reviewed by: imp, ken, araujo, jhb, rgrimes
Approved by: imp (mentor), ken (mentor), jhb (maintainer)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D19904
bhyve's NVMe emulation was transferring Identify data back to the guest
incorrectly causing memory corruptions. These corruptions resulted in
core dumps and other system level errors in the guest.
In their simplest form, NVMe Physical Region Page (PRP) values in
commands indicate which physical pages to use for data transfer. The
first PRP value is not required to be page aligned but does not cross a
page boundary. The second PRP value must be page aligned, does not cross
a page boundary, and need not be contiguous with PRP1.
The code was copying Identify data past the end of PRP1. This happens to
work if PRP1 and PRP2 are physically contiguous but will corrupt guest
memory in unpredictable ways if they are not.
Fix is to copy the Identify data back to the guest piecewise (i.e. for
each PRP entry). Also fix a similar problem when copying back Log page
data.
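A piecewise copy for the two-entry case can be sketched as follows.
This is not the code from the commit: the function name is
hypothetical, a 4 KiB page size is assumed, and a flat byte array
stands in for guest physical memory.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GUEST_PAGE_SIZE	4096u	/* assumed memory page size */

/*
 * Hypothetical sketch: copy 'len' bytes from 'src' into guest memory
 * described by two PRP entries. 'guest' stands in for the guest's
 * physical memory, indexed by guest physical address.
 * Returns 0 on success, or -1 if the data spills past PRP1 and PRP2 is
 * not page aligned or the remainder exceeds one page.
 */
int
prp_copy_to_guest(uint8_t *guest, uint64_t prp1, uint64_t prp2,
    const uint8_t *src, size_t len)
{
	/* Bytes available from PRP1 to the end of its page. */
	size_t first = GUEST_PAGE_SIZE - (size_t)(prp1 % GUEST_PAGE_SIZE);
	size_t rest;

	if (first > len)
		first = len;
	rest = len - first;
	if (rest > 0 &&
	    (prp2 % GUEST_PAGE_SIZE != 0 || rest > GUEST_PAGE_SIZE))
		return (-1);

	/* Never write past the end of the PRP1 page. */
	memcpy(guest + prp1, src, first);
	/* The remainder goes to PRP2, which need not follow PRP1. */
	if (rest > 0)
		memcpy(guest + prp2, src + first, rest);
	return (0);
}
```

With PRP1 at offset 1024 into a page, a 4096 byte Identify structure
places 3072 bytes at PRP1 and the remaining 1024 bytes at PRP2.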
Reviewed by: imp (mentor), araujo, jhb, rgrimes, bhyve
Approved by: imp (mentor), bhyve (jhb)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D19695
The NVMe specification defines bits 13:4 of BAR0 as Reserved (i.e. 0x0).
Most drivers do not enforce this, but the Windows NVMe driver does and
will refuse to start the device (i.e. error 10) if any of these bits are
set.
The current BAR size calculation tries to minimize the amount of memory
the device reserves by scaling the BAR size by the maximum number of
queues supported by the device. But unless the device supports a large
number of queue pairs (over 1536), it will reserve too little memory.
The fix is to allocate a minimum of 16K bytes for BAR0.
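The arithmetic behind the 1536 figure can be sketched as below. It
assumes the doorbell registers start at offset 0x1000 with the default
stride, giving two 4 byte doorbells per queue pair; the macro and
function names are hypothetical.

```c
#include <stdint.h>

#define NVME_DOORBELL_OFFSET	0x1000u		/* first doorbell register */
#define NVME_BAR0_MIN_SIZE	(16u * 1024u)	/* keeps BAR0 bits 13:4 zero */

/*
 * Hypothetical helper: bytes of BAR0 needed for 'nqueue_pairs' queue
 * pairs, never less than 16K.
 */
uint32_t
nvme_bar0_size(uint32_t nqueue_pairs)
{
	/* One 4 byte SQ tail and one 4 byte CQ head doorbell per pair. */
	uint32_t size = NVME_DOORBELL_OFFSET + nqueue_pairs * 8u;

	return (size < NVME_BAR0_MIN_SIZE ? NVME_BAR0_MIN_SIZE : size);
}
```

Since (16384 - 4096) / 8 = 1536, only a device with more than 1536
queue pairs exceeds the minimum on its own. A real BAR size would also
be rounded up to a power of two; that step is omitted here.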
Tested on Windows Server 2016 and 2019
Reviewed by: imp (mentor), araujo, jhb, bhyve
Approved by: imp (mentor), bhyve (jhb)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D19676
The NVMe Identify Namespace data structure's Number of LBA Formats
(NLBAF) field is a 0's based value (i.e. 0x0 means 1). Since the
emulation only supports a single format, set NLBAF to 0x0, not 1.
Reviewed by: imp, araujo, rgrimes
Approved by: imp (mentor)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D19579
The function which processes Admin commands was not returning the
Command Specific value in Completion Queue Entry, Dword 0 (CDW0). This
affects commands such as Set Features, Number of Queues which returns
the number of queues supported by the device in CDW0. In this case, the
host will only create 1 queue pair (Number of Queues is zero based).
This also masked a bug in the queue counting logic.
Reviewed by: imp, araujo
Approved by: imp (mentor)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D18703
Many size / length parameters in NVMe are "0's based", meaning, a value
of 0x0 represents 1, 0x1 represents 2, etc. While this leads to an
efficient encoding, it can lead to subtle bugs. With respect to queues,
these parameters include:
- Maximum number of queue entries
- Maximum number of queues
- Number of Completion Queues
- Number of Submission Queues
To be consistent, convert all 0's based values from the host to 1's
based values internally. Likewise, convert internal 1's based values to
0's based values when returned to the host. This fixes an off-by-one bug
when creating IO queues and simplifies some of the code. Note that this
bug is masked by another bug.
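The conversion at the host boundary can be sketched with two small
helpers. The names are hypothetical; the sketch assumes 16 bit fields,
as used for queue sizes and queue counts.

```c
#include <stdint.h>

/* Convert a 0's based value from the host to a 1's based count. */
uint32_t
zero_based_to_count(uint16_t zb)
{
	return ((uint32_t)zb + 1);
}

/*
 * Convert an internal 1's based count (1 through 65536) back to the
 * 0's based encoding returned to the host.
 */
uint16_t
count_to_zero_based(uint32_t count)
{
	return ((uint16_t)(count - 1));
}
```

Keeping every internal value 1's based means comparisons such as
"queue ID <= number of queues" need no hidden +1 or -1 adjustments.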
While in the neighborhood,
- fix an erroneous queue ID check (checking CQ count when deleting SQ)
- check for queue ID of 0x0 in a few places where this is illegal
- clean up the Set Features, Number of Queues command and check for
illegal values
Reviewed by: imp, araujo
Approved by: imp (mentor)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D18702
Also switch from int to size_t to keep portability.
Reviewed by: brooks
Sponsored by: iXsystems Inc.
Differential Revision: https://reviews.freebsd.org/D17795
The original NVMe API used bit-fields to represent fields in data
structures defined by the specification (e.g. the op-code in the command
data structure). The implementation targeted x86_64 processors and
defined the bit fields for little endian dwords (i.e. 32 bits).
This approach does not work as-is for big endian architectures and was
changed to use a combination of bit shifts and masks to support PowerPC.
Unfortunately, this changed the NVMe API and forces #ifdef's based on
the OS revision level in user space code.
This change reverts to something that looks like the original API, but
it uses bytes instead of bit-fields inside the packed command structure.
As a bonus, this works as-is for both big and little endian CPU
architectures.
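The idea of byte sized fields plus masks can be sketched for the first
dword of a command. The struct, macro, and function names are
hypothetical, and the packed attribute is the GCC/Clang spelling; this
is not the actual definition in the NVMe headers.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical layout of command dword 0 using whole bytes rather than
 * bit-fields. Byte offsets do not depend on CPU endianness, so the
 * opcode is always the first byte in memory.
 */
struct cmd_dw0 {
	uint8_t		opc;	/* byte 0: opcode */
	uint8_t		fuse;	/* byte 1: fused operation in bits 1:0 */
	uint16_t	cid;	/* bytes 2-3: command identifier */
} __attribute__((packed));

#define CMD_FUSE_MASK	0x03u

/* Extract the fused operation bits with a mask instead of a bit-field. */
uint8_t
cmd_get_fuse(const struct cmd_dw0 *cmd)
{
	return ((uint8_t)(cmd->fuse & CMD_FUSE_MASK));
}
```

Multi-byte fields such as cid still need an explicit byte order
conversion on big endian hosts; only the sub-byte fields are handled by
the mask.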
Bump __FreeBSD_version to 1200081 due to API change
Reviewed by: imp, kbowling, smh, mav
Approved by: imp (mentor)
Differential Revision: https://reviews.freebsd.org/D16404