- VM_ALLOC_NOCREAT will grab without creating a page.
- vm_page_grab_valid() will grab and page in if necessary.
- vm_page_busy_acquire() automates some busy acquire loops.
Discussed with: alc, kib, markj
Tested by: pho (part of larger branch)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21546
races with page busy state. The object lock is still used as an interlock
to ensure that the identity stays valid. Most callers should use
vm_page_sleep_if_busy() to handle the locking particulars.
Reviewed by: alc, kib, markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21255
There are several mechanisms by which a vm_page reference is held,
preventing the page from being freed back to the page allocator. In
particular, holding the page's object lock is sufficient to prevent the
page from being freed; holding the busy lock or a wiring is sufficent as
well. These references are protected by the page lock, which must
therefore be acquired for many per-page operations. This results in
false sharing since the page locks are external to the vm_page
structures themselves and each lock protects multiple structures.
Transition to using an atomically updated per-page reference counter.
The object's reference is counted using a flag bit in the counter. A
second flag bit is used to atomically block new references via
pmap_extract_and_hold() while removing managed mappings of a page.
Thus, the reference count of a page is guaranteed not to increase if the
page is unbusied, unmapped, and the object's write lock is held. As
a consequence of this, the page lock no longer protects a page's
identity; operations which move pages between objects are now
synchronized solely by the objects' locks.
The vm_page_wire() and vm_page_unwire() KPIs are changed. The former
requires that either the object lock or the busy lock is held. The
latter no longer has a return value and may free the page if it releases
the last reference to that page. vm_page_unwire_noq() behaves the same
as before; the caller is responsible for checking its return value and
freeing or enqueuing the page as appropriate. vm_page_wire_mapped() is
introduced for use in pmap_extract_and_hold(). It fails if the page is
concurrently being unmapped, typically triggering a fallback to the
fault handler. vm_page_wire() no longer requires the page lock and
vm_page_unwire() now internally acquires the page lock when releasing
the last wiring of a page (since the page lock still protects a page's
queue state). In particular, synchronization details are no longer
leaked into the caller.
The change excises the page lock from several frequently executed code
paths. In particular, vm_object_terminate() no longer bounces between
page locks as it releases an object's pages, and direct I/O and
sendfile(SF_NOCACHE) completions no longer require the page lock. In
these latter cases we now get linear scalability in the common scenario
where different threads are operating on different files.
__FreeBSD_version is bumped. The DRM ports have been updated to
accomodate the KPI changes.
Reviewed by: jeff (earlier version)
Tested by: gallatin (earlier version), pho
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20486
this to 2k to prevent them from being truncated and ignored. It
appears to be a sanity check only, but bumping it to 2k allows both of
my iic hid devices to be parsed and the second one to work...
The iicdev_writeto() function basically does scatter-gather IO by filling
in a pair of iic_msg structs to write the register address then the data
from different locations but with a single bus START/xfer/STOP sequence.
It turns out several low-level i2c controller drivers do not honor the
IIC_NOSTART flag, so the second piece of the write gets a new START on
the bus, and that confuses the ads111x chips which expect a continuous
write of 3 bytes to set a register.
A proper fix for this is to track down all the misbehaving controllers
drivers and fix them. For now this change makes this driver work again.
Also, disable the comparator by default; it's not used for anything.
The previous logic would start a measurement, and then pause_sbt() for the
averaging time currently configured in the chip. After waiting that long,
the code would blindly read the measurement register and return its value.
The problem is that the chip's idea of averaging time is based on its
internal free-running 1MHz oscillator, which may be running at a wildly
different rate than the kernel clock. If the chip's internal timer was
running slower than the kernel clock, we'd end up grabbing a stale result
from an old measurement.
The driver now still uses pause_sbt() to yield the cpu while waiting for
the measurement to complete, but after sleeping it checks the chip's status
register to ensure the measurement engine is idle. If it's not, the driver
uses a retry loop to wait a bit (5% of the original wait time) then check
again for completion.
The NVMe standard (1.4) states
>>> 8.6 Doorbell Stride for Software Emulation
>>> The doorbell stride,...is useful in software emulation of an NVM
>>> Express controller. ... For hardware implementations of the NVM
>>> Express interface, the expected doorbell stride value is 0h.
However, hardware in the wild exists with a doorbell stride of 1
(meaning 8 byte separation). This change supports that hardware, as
well as software emulators as envisioned in Section 8.6. Since this is
the fast path, care has been taken to make this computation
efficient. The bit of math to compute an offset for each is replaced
by a memory load from cache of a pre-computed value.
MFC After: 3 days
Reviewed by: scottl@
Differential Revision: https://reviews.freebsd.org/D21514
When we suspend, we need to properly shutdown the NVME controller. The
controller may go into D3 state (or may have the power removed), and
to properly flush the metadata to non-volatile RAM, we must complete a
normal shutdown. This consists of deleting the I/O queues and setting
the shutodown bit. We have to do some extra stuff to make sure we
reset the software state of the queues as well.
On resume, we have to reset the card twice, for reasons described in
the attach funcion. Once we've done that, we can restart the card. If
any of this fails, we'll fail the NVMe card, just like we do when a
reset fails.
Set is_resetting for the duration of the suspend / resume. This keeps
the reset taskqueue from running a concurrent reset, and also is
needed to prevent any hw completions from queueing more I/O to the
card. Pass resetting flag to nvme_ctrlr_start. It doesn't need to get
that from the global state of the ctrlr. Wait for any pending reset to
finish. All queued I/O will get sent to the hardware as part of
nvme_ctrlr_start(), though the upper layers shouldn't send any
down. Disabling the qpairs is the other failsafe to ensure all I/O is
queued.
Rename nvme_ctrlr_destory_qpairs to nvme_ctrlr_delete_qpairs to avoid
confusion with all the other destroy functions. It just removes the
queues in hardware, while the other _destroy_ functions tear down
driver data structures.
Split parts of the hardware reset function up so that I can
do part of the reset in suspsend. Split out the software disabling
of the qpairs into nvme_ctrlr_disable_qpairs.
Finally, fix a couple of spelling errors in comments related to
this.
Relnotes: Yes
MFC After: 1 week
Reviewed by: scottl@ (prior version)
Differential Revision: https://reviews.freebsd.org/D21493
polling within a second. Panic if we don't. All the commands that use this
interface should typically complete within a few tens to hundreds of
microseconds. Panic rather than return ETIMEDOUT because if the command somehow
does later complete, it will randomly corrupt memory. Also, it helps to get a
traceback from where the unexpected failure happens, rather than an infinite
loop.
dump support code, move the while loop into an inline function. These aren't
done in the fast path, so if the compiler choses to not inline, any performance
hit is tiny.
polled interface. Normally this would have the potential to corrupt stack memory
because the completion routines would run after we return. In this case,
however, we're doing a dump so it's safe for reasons explained in the comment.
The returned error number may be EINTR or ERESTART depending on
whether or not the signal is supposed to interrupt the system call.
Reported and tested by: pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
This fixes a "timed sleep before timers are working" panic seen
while attaching jedec_dimm(4) instances too early in the boot.
Submitted by: ian
Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D21452
Remove now-redundant items from toepcb and synq_entry and the code to
support them.
Let the driver calculate tx_align, rx_coalesce, and sndbuf by default.
Reviewed by: jhb@
MFC after: 1 week
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21387
descriptor. The per-tid tx credits are in demand during active Tx and
it's best not to use too many just for payload.
Sponsored by: Chelsio Communications
According to ACPI 6.3 specification:
The OS sets this bit to 1 if it supports PCI Segment Groups as defined
by the _SEG object, and access to the configuration space of devices
in PCI Segment Groups as described by this specification. Otherwise,
the OS sets this bit to 0.
As far as I see we support both of those as PCI domains for quite a while.
MFC after: 2 months
According to my tests and errata to several generations of Intel CPUs,
PCIe hot-plug command completion reporting is not very reliable thing.
At least on my Supermicro X11DPi-NT board I never saw it reported.
Before this change timeout code detached devices and tried to disable
the slot, that in my case resulted in hot-plugged device being detached
just a second after it was successfully detected and attached. This
change removes that, so in case of timeout it just prints the error and
continue operation. Linux does the same.
MFC after: 1 week
Sponsored by: iXsystems, Inc.
The netmap_pt.c module has become obsolete after
the refactoring that added netmap_kloop.c.
Remove it and unlink it from the build system.
MFC after: 1 week
While it worked with the kenrel, it wasn't working with the loader.
It failed to handle dependencies correctly. The reason for that is
that we never created a nvme module with the DRIVER_MODULE, but
instead a nvme_pci and nvme_ahci module. Create a real nvme module
that nvd can be dependent on so it can import the nvme symbols it
needs from there.
Arguably, nvd should just be a simple child of nvme, but transitioning
to that (and winning that argument given why it was done this way) is
beyond the scope of this change.
Reviewed by: jhb@
Differential Revision: https://reviews.freebsd.org/D21382
queues, don't try to tear them down in the ctrlr_destroy
path. Otherwise, we dereference queue structures that are NULL and we
trap.
This fix is incomplete: we leak IRQ and MSI resources when this
happens. That's preferable to a crash but still should be fixed.
load and people who pull in nvme/nvd from modules can't load nvd.ko
since it depends on nvme, not nvme_foo. The duplicate doesn't matter
since kldxref properly handles that case.
Turn off bus master after we detach the device (to match the prior
order). Release MSI after we're done detaching and have turned off
all the interrupts. Otherwise this may cause problems as other threads
race nvme_detach. This more closely matches the old order.
Reviewed by: mav@
After r351243 when ALTQ was enabled in the kernel, the inline functions
in ifq.h would not have full type information as if_var.h was not
included.
Given usb_ethernet.h already includes all the various headers (which)
is the cause of the problem here, add if_var.h to it. This fixes the
builds again.
Reported by: CI system, e.g. FreeBSD-head-aarch64-LINT
Intel has created RST and many laptops from vendors like Lenovo and Asus. It's a
mechanism for creating multiple boot devices under windows. It effectively hides
the nvme drive inside of the ahci controller. The details are supposed to be a
trade secret. However, there's a reverse engineered Linux driver, and this
implements similar operations to allow nvme drives to attach. The ahci driver
attaches nvme children that proxy the remapped resources to the child. nvme_ahci
is just like nvme_pci, except it doesn't do the PCI specific things. That's
moved into ahci where appropriate.
When the nvme drive is remapped, MSI-x interrupts aren't forwarded (the linux
driver doesn't know how to use this either). INTx interrupts are used
instead. This is suboptimal, but usually sufficient for the laptops these parts
are in.
This is based loosely on https://www.spinics.net/lists/linux-ide/msg53364.html
submitted, but not accepted by, Linux. It was written by Dan Williams. These
changes were written from scratch by Olivier Houchard.
Submitted by: cognet@ (Olivier Houchard)
Nvme drives can be attached in a number of different ways. Separate out the PCI
attachment so that we can have other attachment types, like ahci and various
types of NVMeoF.
Submitted by: cognet@
If device is unplugged from the system (CSTS register reads return
0xffffffff), it makes no sense to send any more recovery requests or
expect any responses back. If there is a detach call in such state,
just stop all activity and free resources. If there is no detach
call (hot-plug is not supported), rely on normal timeout handling,
but when it trigger controller reset, do not wait for impossible and
quickly report failure.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
This fixes possible double call of fail_fn, for example on hot removal.
It also allows ctrlr_fn to safely return NULL cookie in case of failure
and not get useless ns_fn or fail_fn call with NULL cookie later.
MFC after: 2 weeks