supporting older kernels. However, all supported versions of FreeBSD
have unmapped I/Os (as do several that have gone EOL), remove it. It's
unlikely the driver would work on the older kernels anyway at this
point.
supported in years. A number of changes have been made to the driver
that likely wouldn't work on those older versions that aren't properly
ifdef'd and it's project policy to GC such code once it is stale.
Use recent best practices for Copyright form at the top of
the license:
1. Remove all the All Rights Reserved clauses on our stuff. Where we
piggybacked others, use a separate line to make things clear.
2. Use "Netflix, Inc." everywhere.
3. Use a single line for the copyright for grep friendliness.
4. Use date ranges in all places for our stuff.
Approved by: Netflix Legal (who gave me the form), adrian@ (pmc files)
o In vm_pager_bufferinit() create pbuf_zone and start accounting on how many
pbufs are we going to have set.
In various subsystems that are going to utilize pbufs create private zones
via call to pbuf_zsecond_create(). The latter calls uma_zsecond_create(),
and sets a limit on created zone. After startup preallocate pbufs according
to requirements of all pbuf zones.
Subsystems that used to have a private limit with old allocator now have
private pbuf zones: md(4), fusefs, NFS client, smbfs, VFS cluster, FFS,
swap, vnode pager.
The following subsystems use shared pbuf zone: cam(4), nvme(4), physio(9),
aio(4). They should have their private limits, but changing that is out of
scope of this commit.
o Fetch tunable value of kern.nswbuf from init_param2() and while here move
NSWBUF_MIN to opt_param.h and eliminate opt_swap.h, that was holding only
this option.
Default values aren't touched by this commit, but they probably should be
reviewed wrt to modern hardware.
This change removes a tight bottleneck from sendfile(2) operation, that
uses pbufs in vnode pager. Other pagers also would benefit from faster
allocation.
Together with: gallatin
Tested by: pho
Dell-branded Intel P4600 NVMe drives benefit from NVMe 1.3's NOIOB
feature. Unfortunately just like Intel DC P4500s, they don't advertise
themselves as benefiting from this...
This changes adds P4600s to the existing list of old drives which
benefit from striping.
PR: 233969
Submitted by: David Fugate <dave.fugate@gmail.com>
Reviewed by: imp, mav
Approved by: imp (mentor)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D18772
CAM does not require SIM lock since FreeBSD 10.4, and NVMe code never
required it at all, using per-queue locks instead. This formally allows
parallel request submission in CAM mode as much as single per-device and
per-queue locks of CAM allow.
MFC after: 1 month
In the nda(4) driver, only set DISKFLAG_CANDELETE (a.k.a. can support
BIO_DELETE) if the drive supports Dataset Management. There are reports
that without this check, VMWare Workstation does not work reliably.
Fix is to check the ONCS field in the NVMe Controller Data structure for
support. This check previously existed but did not survive the
big-endian changes.
Reported by: yuripv@yuripv.net
Reviewed by: imp, mav, jimharris
Approved by: imp (mentor)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D18493
At least one NVMe drive has a bug that makeing the Command Time Out
PCIe feature unreliable. The workaround is to disable this
feature. The driver wouldn't deal correctly with a timeout anyway.
Only do this for drives that are known bad.
Sponsored by: Netflix, Inc
Differential Revision: https://reviews.freebsd.org/D17708
The original NVMe API used bit-fields to represent fields in data
structures defined by the specification (e.g. the op-code in the command
data structure). The implementation targeted x86_64 processors and
defined the bit fields for little endian dwords (i.e. 32 bits).
This approach does not work as-is for big endian architectures and was
changed to use a combination of bit shifts and masks to support PowerPC.
Unfortunately, this changed the NVMe API and forces #ifdef's based on
the OS revision level in user space code.
This change reverts to something that looks like the original API, but
it uses bytes instead of bit-fields inside the packed command structure.
As a bonus, this works as-is for both big and little endian CPU
architectures.
Bump __FreeBSD_version to 1200081 due to API change
Reviewed by: imp, kbowling, smh, mav
Approved by: imp (mentor)
Differential Revision: https://reviews.freebsd.org/D16404
Summary:
Some architectures, in this case powerpc64, need explicit synchronization
barriers vs device accesses.
Prior to this change, when running 'make buildworld -j72' on a 18-core
(72-thread) POWER9, I would see controller resets often. With this change, I
don't see these resets messages, though another tester still does, for yet to be
determined reasons, so this may not be a complete fix. Additionally, I see a
~5-10% speed up in buildworld times, likely due to not needing to reset the
controller.
Reviewed By: jimharris
Differential Revision: https://reviews.freebsd.org/D16570
- Remove layering violation, when NVMe SIM code accessed CAM internal
device structures to set pointers on controller and namespace data.
Instead make NVMe XPT probe fetch the data directly from hardware.
- Cleanup NVMe SIM code, fixing support for multiple namespaces per
controller (reporting them as LUNs) and adding controller detach support
and run-time namespace change notifications.
- Add initial support for namespace change async events. So far only
in CAM mode, but it allows run-time namespace arrival and departure.
- Add missing nvme_notify_fail_consumers() call on controller detach.
Together with previous changes this allows NVMe device detach/unplug.
Non-CAM mode still requires a lot of love to stay on par, but at least
CAM mode code should not stay in the way so much, becoming much more
self-sufficient.
Reviewed by: imp
MFC after: 1 month
Sponsored by: iXsystems, Inc.
Admin pass-through requests took controller lock before the queue lock,
but in case of request submission to a failed controller controller lock
was taken after the queue lock. Fix that by reducing the lock scopes and
switching to mtx_pool locks to track pass-through request completion.
Sponsored by: iXsystems, Inc.
This change allows clean device detach on attach failures and driver unload,
while previous code tried to talk to already shut down controller, or even
accessed resources failed to allocate.
Sponsored by: iXsystems, Inc.
dma_tag_payload should not be destroyed before payload_dma_map, and seems
it should be used there instead of dma_tag to match creation.
Sponsored by: iXsystems, Inc.
that cross this boundary, they perform better when this isn't the
case. Intel uses the 3rd byte in the vendor specific area for
this. The DC P3500 was previously listed without any explanation. Add
the DC P3520 and DC P4500 to the list.
There won't be any others drives needing this quirk. Intel has
standardized a field in the namespace data in 1.3 (noiob). A future
patch will use that if it exists, with fallback to this method.
Submitted by: Keith Busch
Reviewed by: jimharris@
latter casts the LBA to a 32-bit number before assigning it to the 64
bit structure entity. This works fine on the first 2TB of TRIMs, but
terrible beyond that due to trucation.
Also, add an assert to make sure we don't end too many DSM TRIM
entries in one request.
Sponsored by: Netflix
On some systems, we're getting timeouts when we use multiple queues on
drives that work perfectly well on other systems. On a hunch, Jim
Harris suggested I poll the completion queue when we get a timeout.
This patch polls the completion queue if no fatal status was
indicated. If it had pending I/O, we complete that request and
return. Otherwise, if aborts are enabled and no fatal status, we abort
the command and return. Otherwise we reset the card.
This may clear up the problem, or we may see it result in lots of
timeouts and a performance problem. Either way, we'll know the next
step. We may also need to pay attention to the fatal status bit
of the controller.
PR: 211713
Suggested by: Jim Harris
Sponsored by: Netflix
The NVME standard has required in section 7.2.6, since at least 1.1,
that a clean shutdown is signalled by deleting the subission and the
completion queues before setting the shutdown bit in CC. The 1.0
standard, apparently, did not and many of the early Intel cards didn't
care. Some newer cards care, at least one whose beta firmware can
scramble the card on an unclean shutdown. Linux has done this for some
time. To make it possible to move forward with an evaluation of this
pre-release card with wonky firmware, delete the queues on the card
when we delete the qpair structures.
Sponsored by: Netflix
We'll need to delete namespaces soon, so go ahead and stop making
these devices eternal. It doesn't help much, and will be getting in
the way soon.
Sponsored by: Netflix
When multiple trims are in the queue, collapse them as much as
possible. At present, this usually results in only a few trims being
collapsed together, but more work on that will make it possible to do
hundreds (up to some configurable max).
Sponsored by: Netflix
Remove bitfields from defined structures as they are not portable.
Instead use shift and mask macros in the driver and nvmecontrol application.
NVMe is now working on powerpc64 host.
Submitted by: Michal Stanek <mst@semihalf.com>
Obtained from: Semihalf
Reviewed by: imp, wma
Sponsored by: IBM, QCM Technologies
Differential revision: https://reviews.freebsd.org/D13916
Now that we're queueing BIO_DELETE requests in the CAM I/O scheduler,
it make sense to try to combine as many as possible into a single
request to send down to hardware. Hopefully, lots of larger requests
like this are better than lots of individual transactions.
Note for future: need to limit based on total size of the trim
request. Should also collapse adjacent ranges where possible to
increase the size of the max payload.
Sponsored by: Netflix
optimize away these loops. Change boolean to int to match what atomic
API supplies. Remove wmb() since the atomic_store_rel() on status.done
ensure the prior writes to status. It also fixes the fact that there
wasn't a rmb() before reading done. This should also be more efficient
since wmb() is fairly heavy weight.
Sponsored by: Netflix
Reviewed by: kib@, jim harris
Differential Revision: https://reviews.freebsd.org/D14053
Uses of mallocarray(9).
The use of mallocarray(9) has rocketed the required swap to build FreeBSD.
This is likely caused by the allocation size attributes which put extra pressure
on the compiler.
Given that most of these checks are superfluous we have to choose better
where to use mallocarray(9). We still have more uses of mallocarray(9) but
hopefully this is enough to bring swap usage to a reasonable level.
Reported by: wosch
PR: 225197
Focus on code where we are doing multiplications within malloc(9). None of
these is likely to overflow, however the change is still useful as some
static checkers can benefit from the allocation attributes we use for
mallocarray.
This initial sweep only covers malloc(9) calls with M_NOWAIT. No good
reason but I started doing the changes before r327796 and at that time it
was convenient to make sure the sorrounding code could handle NULL values.
bug that requires 'hands off' for a period of time (2.3s) before we
check the RDY bit. Sicne this is a very odd quirk for a very limited
selection of drives, do this as a quirk. This prevented a successful
reset of the card when the card wedged.
Also, make sure that we comply with the advice from section 3.1.5 of
the 1.3 spec says that transitioning CC.EN from 0 to 1 when CSTS.RDY
is 1 or transitioning CC.EN from 1 to 0 when CSTS.RDY is 0 "has
undefined results". Short circuit when EN == RDY == desired state.
Finally, fail the reset if the disable fails. This will lead to a
failed device, which is what we want. (note: nda device needs
work for coping with a failed device).
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D13389
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
information for that and XPT_PATH_INQ. Provide macros to encode/decode
major/minor versions. Read the link speed and lane count to compute
the base_transfer_speed for XPT_PATH_INQ.
Sponsored by: Netflix
allocations (for req and ccb, which ultimately contain the
nvme_cmd). As such, we can micro-optimize these routines. Add a
comment to this effect, and bzero the ccb used to make the requests
for the nda dump rotuine so it more closely matches a ccb allocated
with xpt_get_ccb().
Sponsored by: Netflix
more general and doesn't try to access registers that may be undefined
when the card is in MSIX mode.
This change, along with r324630, r324631, r324632, makes nda crash
dumps work again. Previously, they only worked on CPU 0 when the stack
garbage was just so.
Sponsored by: Netflix
Suggested by: scottl@ (who provided earlier version of the patch)
acidentally sending bogus values in these fields, which some drives
may reject with an error or worse (undefined behavior).
This is especially needed for the ndadump routine which allocates the
cmd from stack garbage....
Sponsored by: Netflix