Expose arc_meta_limit, et al via kstats.
Note that as a result, vfs.zfs.arc_meta_used is removed.
The existing vfs.zfs.arc_meta_limit sysctl/tunable is retained
with a SYSCTL_PROC wrapper.
Illumos ZFS issues:
3561 arc_meta_limit should be exposed via kstats
Relnotes: yes
MFC after: 2 weeks
feature is to quisce the system before suspend.
Stop is implemented by reusing the thread_single(9) with the special
mode SINGLE_ALLPROC. SINGLE_ALLPROC differs from the existing
single-threading modes by allowing (requiring) caller to operate on
other process. Interruptible sleeps for !TDF_SBDRY threads are
suspended like SIGSTOP does it, instead of aborting the sleep, like
SINGLE_NO_EXIT, to avoid spurious EINTRs on resume.
Provide debugging sysctl debug.stop_all_proc, which causes total stop
and suspends syncer, while waiting for variable reset for resume. It
is used for debugging; should be removed after the real use of the
interface is added.
In collaboration with: pho
Discussed with: avg
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
filesystem specified VFCF_SBDRY flag, i.e. for NFS.
There are two issues with the sleeps. First, applications may get
unexpected EINTR from the disk i/o syscalls. Second, interruptible
sleep allows the stop of the process, and since mount point is
referenced while thread sleeps, unmount cannot free mount point
structure' memory, blocking unmount indefinitely.
Even for NFS, it is probably only reasonable to enable PCATCH for intr
mounts, but this information is currently not available at VFS level.
Reported and tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
creating delayed write buffers belonging to the reclaimed vnode. Put
the buffer cleanup code after inactivation.
Add asserts that ensure that buffer queues are empty and add BO_DEAD
flag for bufobj to check that no buffers are added after the cleanup.
BO_DEAD is only used by INVARIANTS-enabled kernels.
Reported and tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Verify that the block pointer is structurally valid, before attempting to
read it in. It can only be invalid in the case of a ZFS bug, but this
change will help identify such bugs in a more transparent way, by
panic'ing with a relevant message, rather than indexing off the end of an
array or something.
Illumos issue:
5349 verify that block pointer is plausible before reading
MFC after: 2 weeks
Reduce scrub activities when system there is enough dirty data, namely when
dirty data is more than zfs_vdev_async_write_active_min_dirty_percent (once
we start to increase the number of concurrent async writes).
While there also correct rounding error which would make scrub end up
pausing for (zfs_txg_timeout + 1) seconds instead of the desired
zfs_txg_timeout seconds.
Illumos issue:
5351 scrub goes for an extra second each txg
5352 scrub should pause when there is some dirty data
MFC after: 2 weeks
If zio_checksum_error() returns other than ECKSUM (e.g. EINVAL), it does not
fill in the "zio_bad_cksum_t *info" parameter. Caller should not attempt to
use it in this case.
Illumos issue:
5348 zio_checksum_error() only fills in info if ECKSUM
MFC after: 2 weeks
If a dnode has a spill block and there is an error while accessing
a data block then traverse_dnode() loses information about that error
and returns a status of visiting the spill block.
This issue is discovered by Spectra Logic.
Illumos issue:
5311 traverse_dnode may report success when it should not
Original author: gibbs
MFC after: 2 weeks
for counter mode), and AES-GCM. Both of these modes have been added to
the aesni module.
Included is a set of tests to validate that the software and aesni
module calculate the correct values. These use the NIST KAT test
vectors. To run the test, you will need to install a soon to be
committed port, nist-kat that will install the vectors. Using a port
is necessary as the test vectors are around 25MB.
All the man pages were updated. I have added a new man page, crypto.7,
which includes a description of how to use each mode. All the new modes
and some other AES modes are present. It would be good for someone
else to go through and document the other modes.
A new ioctl was added to support AEAD modes which AES-GCM is one of them.
Without this ioctl, it is not possible to test AEAD modes from userland.
Add a timing safe bcmp for use to compare MACs. Previously we were using
bcmp which could leak timing info and result in the ability to forge
messages.
Add a minor optimization to the aesni module so that single segment
mbufs don't get copied and instead are updated in place. The aesni
module needs to be updated to support blocked IO so segmented mbufs
don't have to be copied.
We require that the IV be specified for all calls for both GCM and ICM.
This is to ensure proper use of these functions.
Obtained from: p4: //depot/projects/opencrypto
Relnotes: yes
Sponsored by: FreeBSD Foundation
Sponsored by: NetGate
ipsec6_in_reject() does the same things, also it counts policy violation
errors.
Do IPSEC check in the ip6_forward() after addresses checks.
Also use ip6_ipsec_fwd() to make code similar to IPv4 implementation.
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
ipsec_getpolicybyaddr()
ipsec4_checkpolicy()
ip_ipsec_output()
ip6_ipsec_output()
The only flag used here was IP_FORWARDING.
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
and make its prototype similar to ipsec6_process_packet.
The flags argument isn't used here, tunalready is always zero.
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
Remove check for presence PACKET_TAG_IPSEC_IN_DONE mbuf tag from
ip_ipsec_fwd(). PACKET_TAG_IPSEC_IN_DONE tag means that packet is
already handled by IPSEC code. This means that before IPSEC processing
it was destined to our address and security policy was checked in
the ip_ipsec_input(). After IPSEC processing packet has new IP
addresses and destination address isn't our own. So, anyway we can't
check security policy from the mbuf tag, because it corresponds
to different addresses.
We should check security policy that corresponds to packet
attributes in both cases - when it has a mbuf tag and when it has not.
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
security policy. The changed block of code in ip*_ipsec_input() is
called when packet has ESP/AH header. Presence of
PACKET_TAG_IPSEC_IN_DONE mbuf tag in the same time means that
packet was already handled by IPSEC and reinjected in the netisr,
and it has another ESP/AH headers (encrypted twice?).
Since it was already processed by IPSEC code, the AH/ESP headers
was already stripped (and probably outer IP header was stripped too)
and security policy from the tdb_ident was applied to those headers.
It is incorrect to apply this security policy to current headers.
Also make ip_ipsec_input() prototype similar to ip6_ipsec_input().
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
PACKET_TAG_IPSEC_OUT_CRYPTO_NEEDED mbuf tags. They aren't used in FreeBSD.
Instead check presence of PACKET_TAG_IPSEC_OUT_DONE mbuf tag. If it
is found, bypass security policy lookup as described in the comment.
PACKET_TAG_IPSEC_OUT_DONE tag added to mbuf when IPSEC code finishes
ESP/AH processing. Since it was already finished, this means the security
policy placed in the tdb_ident was already checked. And there is no reason
to check it again here.
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
By default Xen binds all event channels to vCPU#0, and FreeBSD only shuffles
the interrupt sources once, at the end of the boot process. Since new event
channels might be created after this point (because new devices or backends
are added), try to automatically shuffle them at creation time.
This does not affect VIRQ or IPI event channels, that are already bound to a
specific vCPU as requested by the caller.
Sponsored by: Citrix Systems R&D
Mask the event channel source before trying to bind it to a CPU, this
prevents stray interrupts from firing while assigning them and hitting the
KASSERT in xen_intr_handle_upcall.
Sponsored by: Citrix Systems R&D
This allows the Grant-table code to attach directly to the xenpv bus,
allowing us to remove the grant-table initialization done in xenpv.
Sponsored by: Citrix Systems R&D
Mave the grant table code into the dev/xen folder in preparation for turning
it into a device using the newbus interface. This is just code motion, no
functional changes.
Sponsored by: Citrix Systems R&D
can't do a timeout bigger than 15 seconds. The code wasn't checking for
this and because bitmasking was involved the requested timeout was
basically adjusted modulo-16. That led to things like a 128 second
timeout actually being a 9 second timeout, which accidentally worked fine
until watchdogd was changed to only pet the dog once every 10 seconds.
When running as a Xen PVH Dom0 we need to add custom buses that override
some of the functionality present in the ACPI PCI Bus and the PCI Bus. We
currently override the ACPI PCI Bus, but not the PCI Bus, so add a new
override for the PCI Bus and share the generic functions between them.
Reported by: David P. Discher <dpd@dpdtech.com>
Sponsored by: Citrix Systems R&D
conf/files.amd64:
- Add the new files.
x86/xen/xen_pci_bus.c:
- Generic file that contains the PCI overrides so they can be used by the
several PCI specific buses.
xen/xen_pci.h:
- Prototypes for the generic overried functions.
dev/xen/pci/xen_pci.c:
- Xen specific override for the PCI bus.
dev/xen/pci/xen_acpi_pci.c:
- Xen specific override for the ACPI PCI bus.
o Move similar block/networking methods to common file
o Follow r275640 and correct MMIO registers width
o Pass value to MMIO platform_note method.
Sponsored by: DARPA, AFRL
Overrunning buffer pointed to by (caddr_t)&oip->i_db[0] of 48 bytes by
passing it to a function which accesses it at byte offset 59 using
argument 60UL.
The issue was inherited from an older FFS implementation and
fixed there with by merging UFS2 in r98542. We follow the
FFS fix.
Discussed with: bde
CID: 1007665
MFC after: 3 days
If the SCI is remapped to a non-ISA global interrupt notify the ACPI
subsystem about the override.
Reported by: David P. Discher <dpd@dpdtech.com>
Sponsored by: Citrix Systems R&D
There are two main parts to get it to work, 1) most of the register
accesses need to be word sized, other than the config register which
needs to be byte aligned, and 2) we don't need the platform driver
for this to work on the Foundation Model, allow it to be NULL.
Differential Revision: https://reviews.freebsd.org/D1240
Reviewed by: bryanv
Sponsored by: The FreeBSD Foundation
Since VFS does not/cannot stop writes, sync might run indefinitely, or
be a wrong thing to do at all. E. g. NFS ignores VFS_SYNC() for
forced unmounts, since non-responding server does not allow sync to
finish. On the other hand, filesystems can and do stop writes using
fs-specific facilities, and should already fully flush caches in
VFS_UNMOUNT() due to the race.
Adjust msdosfs tp sync in unmount for forced call, to accomodate the
new behaviour. Note that it is still racy, since writes are not
stopped.
Discussed with: avg, bjk, mckusick
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
to be called before suspension and after resume, correspondingly. The
syncer_suspend() ensures that all filesystems dirty data and metadata
are saved to the permanent storage, and stops kernel threads which
might modify filesystems. The syncer_resume() restores stopped
threads.
For now, only syncer is stopped. This is needed, because each sync
loop causes superblock updates for UFS.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
the vnode owning the buffer is not locked. More, it cannot be locked
safely, since getnewbuf_reuse_bp() is called from newbuf(), and some
other vnode is already locked, for which reused buffer will be
reassigned.
As the consequence, reclamation of the owning vnode could go in
parallel, in particular, the call to vnode_destroy_vobject(), which
deallocates the vm object and zeroes the v_bufobj->bo_object. Note
that the pages wired by the buffer are left wired and can be safely
freed by the vfs_vmio_release() without the need for the vm object
lock. Also, seeing stale pointer to the v_object is safe due to vm
object type stability.
Check for bo_bufobj != NULL and cache the value in local variable to
avoid trying to lock NULL vm object.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
This is not correct at least for the stop requests. Check for stop
conditions and suspend threads if requested.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
preparation for the global stop commit.
Move the code to weed suspended or sleeping threads into the
appropriate state, into the helper weed_inhib(). Current code already
has deep nesting and hard to follow [1].
Add currently useless helper remain_for_mode(), which returns the
count of threads which are allowed to run, according to the
single-threading mode.
In thread_single_end(), do not save curthread into local variable, it
is unused after, except to find curproc.
Remove stray empty line.
Requested by: avg [1]
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
for the suspension.
Currently, the loop performs uninterruptible cv_wait(9) call, which
prevents suspension until child allows further execution of parent.
If child is stopped, suspension or single-threading is delayed
indefinitely.
Create a helper thread_suspend_check_needed() to identify the need for
a call to thread_suspend_check(). It is required since call to the
thread_suspend_check() cannot be safely done while owning the child
(p2) process lock. Only when suspension is needed, drop p2 lock and
call thread_suspend_check(). Perform wait for cv with timeout, in
case suspend is requested after wait started; I do not see a better
way to interrupt the wait.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
multithreaded status of the process.
The stopped state must be cleared before P_WEXIT is set. A stop
signal delivered just before first PROC_LOCK() block in exit1(9) would
put the process into pending stop with P_WEXIT set or assertion
triggered. Also recheck for the suspension after failed
thread_single(9) call, since process lock could be dropped.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Without this fix, a kernel with VIMAGE + Infiniband will panic on bootup.
Certain necessary #include statements require LIST_HEAD.
Add these includes to ofed/include/linux/list.h, because
LIST_HEAD is specifically overridden in this file.
PR: 191468
Differential Revision: D1279
Reviewed by: hselasky
When importing a pool, don't assume that the passed pool configuration
at vdev_load is always vaild. It's possible that a stale configuration
that comes with extra vdevs, where metaslab_init() would fail because
of lower layer returns error.
Change the code to make metaslab_init() handle and return errors from
lower layer and pass it back to upper layer and handle it there.
Illumos issue:
5213 panic in metaslab_init due to space_map_open returning ENXIO
MFC after: 2 weeks
number of races which could cause double frees or use-after-frees when
performing DAD on an address. In particular, an IPv6 address can now only be
marked as a duplicate from the DAD callout.
Differential Revision: https://reviews.freebsd.org/D1258
Reviewed by: ae, hrs
Reported by: rstone
MFC after: 1 month
In the old days callout(9) had 1 tick precision and that was inadequate
for some uses, e.g. DTrace profile module, so we had to emulate cyclic
API and behavior. Now we can directly use callout(9) in the very few
places where cyclic was used.
Differential Revision: https://reviews.freebsd.org/D1161
Reviewed by: gnn, jhb, markj
MFC after: 2 weeks
Technically read requests can be executed in any order or simultaneously
since they are not changing any data. But ZFS prefetcher goes crasy when
it receives consecutive requests from different threads. Since prefetcher
works on level of separate blocks, instead of two consecutive 128K requests
it may receive 32 8K requests in mixed order.
This patch is more workaround then a real fix, and it does not fix all of
prefetcher problems, but it improves sequential read speed by 3-4x times
in some configurations. On the other side it may hurt performance if
some backing store has no prefetch, that is why it is disabled by default
for raw devices.
MFC after: 2 weeks
adapters. Set the pack boundary for T5 cards to be the same as the
PCIe max payload size. The chip likes it this way.
In this revision the driver allocate rx buffers that align on both
boundaries. This is not a strict requirement and a followup commit
will switch the driver to a more relaxed allocation strategy.
MFC after: 2 weeks
far away from a ldr psuedo instruction. With this clang will place the
literal value here where it's close enough to be loaded.
MFC after: 1 week
Sponsored by: ABT Systems Ltd
In dsl_dataset_hold_obj() we used zap_contains(.., DS_FIELD_LARGE_BLOCKS)
to determine whether the extensible (zapifyed) dataset have large blocks.
The code expects the result be either 0 (found) or ENOENT (not found),
however reused the variable 'err' which later code expects to be 0.
Fix this by adopting similar code construct that is used later for
DS_FIELD_BOOKMARK_NAMES, which uses a temporary variable zaperr to catch
errors from zap_* rountines.
Reported by: Peter J. Creath (on FreeNAS; FreeNAS bug #6848)
Illumos issue: 5393 spurious failures from dsl_dataset_hold_obj()
Reviewed by: mahrens
Sponsored by: iXsystems, Inc.
X-MFC with: r274337
Some old libraries may be used even with newer binaries (specifically the
Nvidia driver libraries).
Differential Revision: https://reviews.freebsd.org/D1262
Reviewed by: kib
vp->v_vflag without taking vnode lock and without bypass. We do know
that vp is the lowest level in the stack, since the pointer is
obtained from the object' handle. Stale VV_TEXT flag read can only
happen if parallel execve() is performed and not yet activated the
image, since process takes reference for text mapping. In this case,
the execve() code manages the VV_TEXT flag on its own already.
It was observed that otherwise read-only sendfile(2) requires
exclusive vnode lock and contending on it on some loads for VV_TEXT
handling.
Reported by: glebius, scottl
Tested by: glebius, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
While without UNMAP support there is not much initiator can do about it,
the administrator still better be notified about the storage overflow.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Previously it was supported only for ZVOL-backed LUNs, but now should work
for file-backed LUNs too. Used value in this case is a space occupied by
the backing file, while available value is an available space on file
system. Pool thresholds are still not implemented in this case.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
It is implemented for LUNs backed by ZVOLs in "dev" mode and files.
GEOM has no such API, so for LUNs backed by raw devices all LBAs will
be reported as mapped/unknown.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
re-using a hardware propritary transfer descriptor, PTD, in USB host
mode. If the PTD's are recycled too quickly, it has been observed that
the hardware simply fails to schedule the requested job or resets
completely disconnecting all devices.
After recent optimizations this change is no longer blocked by CTL memory
consumption. Those limits are still not free, but much cheaper now.
MFC after: 1 week
Relnotes: yes
Sponsored by: iXsystems, Inc.
Abusing ability of major UAs cover minor ones we may not account UAs for
inactive ports. Allocate UAs storage for port and start accounting only
after some initiator from that port fetched its first POWER ON OCCURRED.
This reduces per-LUN CTL memory usage from >1MB to less then 100K.
MFC after: 1 month
The .text, .bss, and .data sections claimed 16-byte alignment, but were
only aligned to 8 by the linker script.
Discovered with strip(1) from elftoolchain, which performs validation
absent from the binutils strip(1).
Sponsored by: DARPA, AFRL
In configurations with many ports, like iSCSI, each LUN is typically
accessed only by limited subset of ports. Allocating that memory on
demand allows to reduce CTL memory usage from 5.3MB/LUN to 1.3MB/LUN.
MFC after: 1 month
Always include the card human readable name. We support ~270 cards and
at ~20 bytes each, this bloats things by only ~5k. Retain the
PCMCIA_CARD vs PCMCIA_CARD_D distinction, though, in case this is
intolerable.
WITNESS and INVARIANTS checking, which are known to have significant
performance impact on running systems. When benchmarking new features
this kernel should be used instead of the standard GENERIC.
This kernel configuration should never appear outside of the HEAD
of the FreeBSD tree.
interpreating NULLs as EOLs, but converting them to spaces.
SPC-4 does not tell that T10-based IDs should be NULL-terminated/padded.
And while it tells that it should include only ASCII chars (0x20-0x7F),
there are some USB sticks (SanDisk Ultra Fit), that have NULLs inside
the value. Treating NULLs as EOLs there made those LUN IDs non-unique.
MFC after: 1 week
from the FreeBSD network code. The flag is still kept around in the
"sys/mbuf.h" header file, but does no longer have any users. Instead
the "m_pkthdr.rsstype" field in the mbuf structure is now used to
decide the meaning of the "m_pkthdr.flowid" field. To modify the
"m_pkthdr.rsstype" field please use the existing "M_HASHTYPE_XXX"
macros as defined in the "sys/mbuf.h" header file.
This patch introduces new behaviour in the transmit direction.
Previously network drivers checked if "M_FLOWID" was set in "m_flags"
before using the "m_pkthdr.flowid" field. This check has now now been
replaced by checking if "M_HASHTYPE_GET(m)" is different from
"M_HASHTYPE_NONE". In the future more hashtypes will be added, for
example hashtypes for hardware dedicated flows.
"M_HASHTYPE_OPAQUE" indicates that the "m_pkthdr.flowid" value is
valid and has no particular type. This change removes the need for an
"if" statement in TCP transmit code checking for the presence of a
valid flowid value. The "if" statement mentioned above is now a direct
variable assignment which is then later checked by the respective
network drivers like before.
Additional notes:
- The SCTP code changes will be committed as a separate patch.
- Removal of the "M_FLOWID" flag will also be done separately.
- The FreeBSD version has been bumped.
MFC after: 1 month
Sponsored by: Mellanox Technologies
uma_reclaim(). Reclamation code must not see half-constructed or
destructed zones. Do this by bracing uma_zcreate() and uma_zdestroy()
into a shared-locked sx, and take the sx exclusively in uma_reclaim().
Usually zones are not created/destroyed during the system operation,
but tmpfs mounts do cause zone operations and exposed the bug.
Another solution could be to only expose a new keg on uma_kegs list
after the corresponding zone is fully constructed, and similar
treatment for the destruction. But it probably requires more risky
code rearrangement as well.
Reported and tested by: pho
Discussed with: avg
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
- Add support for GEOM direct completion. Depending on the benchmark,
this tends to give a ~30% improvement w.r.t IOPs and BW.
- Remove an invariants check in the strategy routine. This assertion
is caught later on by an existing panic.
- Rename and resort various related functions to make more sense.
MFC after: 1 month
- Provide pru_ready function for TCP.
- Don't call tcp_output() from tcp_usr_send() if no ready data was put
into the socket buffer.
- In case of dropped connection don't try to m_freem() not ready data.
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
Provide pru_ready for AF_LOCAL sockets. Local sockets sendsdata directly
to the receive buffer of the peer, thus pru_ready also works on the peer
socket.
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
sending not ready data:
o Add new flag to pru_send() flags - PRUS_NOTREADY.
o Add new protocol method pru_ready().
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
o Introduce a notion of "not ready" mbufs in socket buffers. These
mbufs are now being populated by some I/O in background and are
referenced outside. This forces following implications:
- An mbuf which is "not ready" can't be taken out of the buffer.
- An mbuf that is behind a "not ready" in the queue neither.
- If sockbet buffer is flushed, then "not ready" mbufs shouln't be
freed.
o In struct sockbuf the sb_cc field is split into sb_ccc and sb_acc.
The sb_ccc stands for ""claimed character count", or "committed
character count". And the sb_acc is "available character count".
Consumers of socket buffer API shouldn't already access them directly,
but use sbused() and sbavail() respectively.
o Not ready mbufs are marked with M_NOTREADY, and ready but blocked ones
with M_BLOCKED.
o New field sb_fnrdy points to the first not ready mbuf, to avoid linear
search.
o New function sbready() is provided to activate certain amount of mbufs
in a socket buffer.
A special note on SCTP:
SCTP has its own sockbufs. Unfortunately, FreeBSD stack doesn't yet
allow protocol specific sockbufs. Thus, SCTP does some hacks to make
itself compatible with FreeBSD: it manages sockbufs on its own, but keeps
sb_cc updated to inform the stack of amount of data in them. The new
notion of "not ready" data isn't supported by SCTP. Instead, only a
mechanical substitute is done: s/sb_cc/sb_ccc/.
A proper solution would be to take away struct sockbuf from struct
socket and allow protocols to implement their own socket buffers, like
SCTP already does. This was discussed with rrs@.
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
Summary:
Revert the initial FBT-with-KDB changes for trap_subr*.S, and instead use the
db_trap filter function to handle dtrace trap filtering. With this, the MMU is
enabled by the support code, simplifying the codepath altogether.
Test Plan: Tested on my G4 PowerBook
Reviewers: #powerpc, nwhitehorn
Reviewed By: nwhitehorn
Differential Revision: https://reviews.freebsd.org/D1207
MFC after: 3 weeks
crowded as we now are at about 70k. Bump the limit to 1MB instead
which is still quite a reasonable limit and allows for future growth
of this file and possible future expansion to additional data.
MFC After: 2 weeks
recursion on mutex initialization.
The only places where the recursive acquire is performed are read and
write filters, since knlist, which uses the pipe pair mutex as lock,
is locked when filter is called.
The recursion was added in r93296, and consistent locking for
kn_fop->f_event() introduced in r133741.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Call to the driver-specific ioctl used to process ioctl number
that will lead to the out-of-bounds access to the ioctl handler
array.
PR: 193367
Approved by: kib
MFC after: 1 week
Bump the default from 16 to 32, to accommodate kernel flamegraphs.
Bump the maximum from 32 to 128, to accommodate deep user stacks.
Reviewed by: gnn
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D1203
This allows one to make a kernel module to tune the
number of queues before the driver loads.
This is needed so that a module at SI_SUB_CPU can set
tunables for these drivers to take. Otherwise getenv
is called too early by the TUNABLE macros.
Reviewed by: smh
Phabric: https://reviews.freebsd.org/D1149
xform_ipip was used as fallback with low priority for IPIP
encapsulated packets that were decrypted. In some cases
it can decapsulate packets, that it shouldn't. This leads to situations,
when wrong configurations are magically working. Also it can propagate
wrong ingress interface and this can break security.
Now we redesigned the IPSEC code and IPIP encapsulation is called directly
from ipsec_output, and decapsulation is done in the ipsec_input with m_striphdr.
Differential Revision: https://reviews.freebsd.org/D1220
MFC after: 1 month
Sponsored by: Yandex LLC
- Threads lifetime cycle, in particular, counting of the threads in
the process, and interlocking with process mutex and thread lock.
The main reason of this is that turnstile locks are after thread
locks, so you e.g. cannot unlock blockable mutex (think process
mutex) while owning thread lock.
- Virtual and profiling itimers, since the timers activation is done
from the clock interrupt context. Replace the p_slock by p_itimmtx
and PROC_ITIMLOCK().
- Profiling code (profil(2)), for similar reason. Replace the p_slock
by p_profmtx and PROC_PROFLOCK().
- Resource usage accounting. Need for the spinlock there is subtle,
my understanding is that spinlock blocks context switching for the
current thread, which prevents td_runtime and similar fields from
changing (updates are done at the mi_switch()). Replace the p_slock
by p_statmtx and PROC_STATLOCK().
The split is done mostly for code clarity, and should not affect
scalability.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
method needs pre-reset state of the ps_siginfo to correctly construct
signal frame.
Move sigdflt() call after the sv_sendsig() invocation in postsig().
Simultaneously extract common code from trapsignal() and postsig()
into new helper postsig_done().
Submitted by: rea
MFC after: 1 week
Records with target_mode == 1 are allocated from the end of portdb, so it
seems logical to start search from the end not traverse whole array.
MFC after: 1 month
Make CTL core and block backend set success status before initiating last
data move for read commands. Make CAM target and iSCSI frontends detect
such condition and send command status together with data. New I/O flag
allows to skip duplicate status sending on later fe_done() call.
For Fibre Channel this change saves one of three interrupts per read command,
increasing performance from 126K to 160K IOPS. For iSCSI this change saves
one of three PDUs per read command, increasing performance from 1M to 1.2M
IOPS.
MFC after: 1 month
Sponsored by: iXsystems, Inc.
Special thanks to Nicholas Esborn for the loaner router to get this
target bootstrapped.
Review: D777
Reviewed by: adrian
Sponsored by: Nicholas Esborn <nick@desert.net>
- add parentheses around macro parameters for consistent style
- remove redundant parentheses around an expression
- use tab before a line continuation symbol
Differential Revision: https://reviews.freebsd.org/D1161 (partial)
Reviewed by: markj
MFC after: 1 week
a new per-device '%domain' sysctl node that returns the NUMA domain a
device is associated with if it is associated with one.
Note that this API is still a WIP and might change before 11.0 actually
ships.
Differential Revision: https://reviews.freebsd.org/D930
Reviewed by: kib, adrian
This was previously working by accident because BUSDMA_COHERENT_MEMORY has
always been set to strongly-ordered on arm. Now we're moving towards
normal-uncacheable (what might be called write-combining on other platforms)
and using the proper sync ops will be more important. Of course, that
opens the question of just what is the "proper" sync op for shared
concurrent dma access as opposed to accesses where the handoff of control
of the memory has well-defined sequence points that match the available
busdma sync operations.
Old allocator created significant lock congestion protecting its lists
of preallocated I/Os, while UMA provides much better SMP scalability.
The downside of UMA is lack of reliable preallocation, that could guarantee
successful allocation in non-sleepable environments. But careful code
review shown, that only CAM target frontend really has that requirement.
Fix that making that frontend preallocate and statically bind CTL I/O for
every ATIO/INOT it preallocates any way. That allows to avoid allocations
in hot I/O path. Other frontends either may sleep in allocation context
or can properly handle allocation errors.
On 40-core server with 6 ZVOL-backed LUNs and 7 iSCSI client connections
this change increases peak performance from ~700K to >1M IOPS! Yay! :)
MFC after: 1 month
Sponsored by: iXsystems, Inc.
If this feels like deja vu... the last time this was fixed in this file
only ARM_MMU_V6 was fixed, this time it's ARM_ARCH_V6 (and this time I
searched for other occurrances of pj4b in here).
to automatically set the armv6 option when MACHINE_ARCH is armv6. That
allows replacing ever-growing lists of cpu names as options to compile
a given file with the using either "optional armv6" or "optional !armv6".
using the VM_MIN_ADDRESS constant.
HardenedBSD redefines VM_MIN_ADDRESS to be 64K, which results in
bhyve VM startup failing. Guest memory is always assumed to start
at 0 so use the absolute value instead.
Reported by: Shawn Webb, lattera at gmail com
Reviewed by: neel, grehan
Obtained from: Oliver Pinter via HardenedBSD
23bd719ce1
MFC after: 1 week
ath kernel module:
sys/dev/ath/ath_hal/ar5212/ar5212_reset.c:2642:7: error: taking the absolute value of unsigned type 'unsigned int' has no effect [-Werror,-Wabsolute-value]
if (abs(lp[0] * EEP_SCALE - target) < EEP_DELTA) {
^
sys/dev/ath/ah_osdep.h:74:18: note: expanded from macro 'abs'
#define abs(_a) __builtin_abs(_a)
^
sys/dev/ath/ath_hal/ar5212/ar5212_reset.c:2642:7: note: remove the call to '__builtin_abs' since unsigned values cannot be negative
sys/dev/ath/ah_osdep.h:74:18: note: expanded from macro 'abs'
#define abs(_a) __builtin_abs(_a)
^
1 error generated.
This warning occurs because both lp[0] and target are unsigned, so the
subtraction expression is also unsigned, and calling abs() is a no-op.
However, the intention was to look at the absolute difference between
the two unsigned quantities. Introduce a small static function to
clarify what we're doing, and call that instead.
Reviewed by: adrian
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D1212
isochronous endpoint descriptor used for the data transfers, hence the
synchronization feature might not be supposed to be supported [yet].
This makes seamless playback synced with the USB HOST clock work with
the DN32-USB module for Midas audio systems and possibly other similar
products from Klark Teknik.
MFC after: 1 week
o Provide a new VOP_GETPAGES_ASYNC(), which works like VOP_GETPAGES(), but
doesn't sleep. It returns immediately, and will execute the I/O done handler
function that must be supplied as argument.
o Provide VOP_GETPAGES_ASYNC() for the FFS, which uses vnode_pager.
o Extend pagertab to support pgo_getpages_async method, and implement this
method for vnode_pager.
Reviewed by: kib
Tested by: pho
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
SAF1761 OTG driver. Currently the driver logic is very simple and
double buffering the USB transactions is not done. Also you need to
use an external USB high speed USB HUB for reliable FULL speed
outgoing ISOCHRONOUS traffic, because the internal one chokes on
so-called split transfers above 188 bytes.
establishing connection.
This is a workaround for Chelsio TOE driver, that does not update socket
buffer size in hardware after connection established, and unless that is
done beforehand, kernel code will stuck, attempting to send/receive full
PDU at once.
MFC after: 1 week
I've missed that iscsi_outstanding_remove() frees the second pointer,
so it should no longer be used. And in fact we don't really need to.
MFC after: 2 weeks
During heavy reads data copying in icl_pdu_get_data() may consume large
percent of CPU time. Moving it out of the lock significantly reduces
lock hold time and respectively lock congestion on read operations.
MFC after: 2 weeks
the first cacheline if the buffer start address is not on a cacheline
boundary. Normally a buffer which is not cacheline-aligned is bounced,
but a special rule applies for mbufs, which are always misaligned due to
the header. We know the cpu will not write to the header while dma is in
progress (so we've been told anyway), but it may have written to the
header shortly before starting a read, so we need to flush that write out
to memory before invalidating the whole buffer.
In collaboration with Mical Meloun and Svata Kraus.
rename it to ufs_dirhashreclaimpercent, as suggested
by jhb@. As an added bonus this avoids divide-by-zero
errors.
Requested by: jhb, markj
Reviewied by: jhb, markj
commit 6d3c4c0922
Add terasic_mtl vt(4) framebuffer driver
terasic_mtl can be built with syscons(4) and vt(4) attachments, selected
at compile time.
commit 33240259b4
Clear terasic_mtl text buffer on attach
commit d188c2d241
Update terasic vt(4) driver for FreeBSD r269783
commit d1cc54eee8
Safety belt to ensure vt(4) fb parameters are correct
commit 76e6d468ef
Improve terasic_mtl_vt fdt parsing
- Use OF_getencprop to avoid need for explicit endian handling
(submitted by ray@freebsd.org)
- Check for expected length and correct pointer type
commit 3e2524b899
Correct device_printf usage
commit 9e53e3c8e0
Switch framebuffer to match host endianness
Xorg and xf86-video-scfb work much better with a native-endian
framebuffer.
commit 0f49259d59
Switch DE4 to vt(4) and enable kbdmux
commit 5bc96ebc89
Add missing \n in device_printf calls
Submitted by: emaste
Sponsored by: DARPA, AFRL
commit d0c7d235c0
Make the Altera JTAG UART device driver slightly more forgiving of
the foibles of a sub-par hrdware interface by increasing the timeout
for spotting JTAG polling from one to two seconds.
commit 19ed45a188
Update comment.
commit 8edfe803f0
Add a comment about a device-driver race condition that could cause the BERI
pipeline to wedge awaiting JTAG in the event that both the low-level console
and the tty layer decide to write to the JTAG FIFO just before JTAG is
disconnected. Resolving this race is a bit tricky as it looks like there
isn't a way to 'give the character back' to the tty layer when we discover
the race. The easy fix is to drop the character, which we don't yet do, but
perhaps should as that is a better outcome than wedging the pipeline.
commit 2ea26cf579
Add a comment about an inherent race with hardware in the Altera JTAG
UART's low-level console code.
Submitted by: rwatson
MFC after: 1 week
Sponsored by: DARPA, AFRL
Previously, any timeout value for which (timeout * hz) will overflow the
signed integer, will give weird results, since callout(9) routines will
convert negative values of ticks to '1'. For unsigned integer overflow we
will get sufficiently smaller timeout values than expected.
Switch from callout_reset, which requires conversion to int based ticks
to callout_reset_sbt to avoid this.
Also correct isci to correctly resolve ccb timeout.
This was based on the original work done by Eygene Ryabinkin
<rea@freebsd.org> back in 5 Aug 2011 which used a macro to help avoid
the overlow.
Differential Revision: https://reviews.freebsd.org/D1157
Reviewed by: mav, davide
MFC after: 1 month
Sponsored by: Multiplay
- Dump an NT_X86_XSTATE note if XSAVE is in use. This note is designed
to match what Linux does in that 1) it dumps the entire XSAVE area
including the fxsave state, and 2) it stashes a copy of the current
xsave mask in the unused padding between the fxsave state and the
xstate header at the same location used by Linux.
- Teach readelf() to recognize NT_X86_XSTATE notes.
- Change PT_GET/SETXSTATE to take the entire XSAVE state instead of
only the extra portion. This avoids having to always make two
ptrace() calls to get or set the full XSAVE state.
- Add a PT_GET_XSTATE_INFO which returns the length of the current
XSTATE save area (so the size of the buffer needed for PT_GETXSTATE)
and the current XSAVE mask (%xcr0).
Differential Revision: https://reviews.freebsd.org/D1193
Reviewed by: kib
MFC after: 2 weeks
This change saves/restores the callee-saved MIPS floating point
registers as documented by the o32/n32/n64 spec ("MIPSpro N32
ABI Handbook", Table 2-1) for the _setjmp(3), _longjmp(3),
setjmp(3) and longjmp(3) C library functions. This is only
included when the C library is built with hardware floating point
support (or when "SOFTFLOAT" is not defined).
Submitted by: sson
MFC after: 1 month
Sponsored by: DARPA, AFRL