Previously ISID was changed every time, that made impossible correct
persistent reservation, because reconnected session was identified as
completely new one.
Reviewed by: trasz
MFC after: 1 week
prior to starting "/sbin/init" which will run all the "/etc/rc.d/xxx"
scripts. Else there can be a race configuring the interfaces via
"/etc/rc.conf".
MFC after: 4 weeks
Sponsored by: Mellanox Technologies
Instead make ports provide wanted port and target IDs, and LUNs provide
wanted LUN IDs. After that core Device ID VPD code only had to link all
of them together and add relative port and port group numbers.
LUN ID for iSCSI LUNs no longer created by CTL, but by ctld, and passed
to CTL as "scsiname" LUN option. This makes LUNs to report the same set
of IDs, independently from the port through which it is accessed, as
required by SCSI specifications.
Having single port for all iSCSI connections makes problematic implementing
some more advanced SCSI functionality in CTL, that require proper ports
enumeration and identification.
This change extends CTL iSCSI API, making ctld daemon to control list of
iSCSI ports in CTL. When new target is defined in config fine, ctld will
create respective port in CTL. When target is removed -- port will be
also removed after all active commands through that port properly aborted.
This change require ctld to be rebuilt to match the kernel.
As a minor side effect, this allows to have iSCSI targets without LUNs.
While that may look odd and not very useful, that is not incorrect.
6679140 asymmetric alloc/dealloc activity can induce dynamic variable drops
6679193 dtrace_dynvar walker produces flood of dtrace_dynhash_sink
This finishes a set of merges from the older OpenSolaris releases.
Still the FreeBSD port has many differences that are difficult to
account for but that seems normal given that the kernels are different.
MFC after: 1 week
getenv_xxx() functions instead of strtoq(), because the getenv_xxx()
functions include wrappers for various postfixes like G/M/K, which
strtoq() doesn't do.
the reply to ReaddirPlus when the server failed within the loop
that calls VFS_VGET(). This failure is most likely an error
return from VFS_VGET() caused by a bogus d_fileno that was
truncated to 32bits.
This patch fixes the server so that it will return directory postop
attributes for the failure. It does not fix the underlying issue caused
by d_fileno being uint32_t when a file system like ZFS generates
a fileno that is greater than 32bits.
Reported by: jpaetzel
Reviewed by: jpaetzel
MFC after: 1 month
Before iSCSI implementation CTL had no knowledge about frontend drivers,
it had only frontends, which really were ports (alike to LUNs, if comparing
to backends). But iSCSI added there ioctl() method, which does not belong
to frontend as a port, but belongs to a frontend driver.
partitions of types other than "freebsd-boot" (in particular, "efi").
This allows the removal of some nasty hacks for supporting PowerPC systems,
in particular aliasing freebsd-boot to apple-boot on APM and an IBM-specific
code on MBR.
This changes the installer to use the correct names, which also breaks a
degeneracy in the meaning of "freebsd-boot" that allows the addition
of support for some newer IBM systems that can boot from GPT in addition to
MBR. Since I have no idea how to detect which those systems are, leave
the default on IBM PPC systems as MBR for now.
console's ability to enter the debugger.... rwatson forgot to document
this when he changed it back in 2011... There is more docs to write
about this, but at least fix this for now...
Reviewed by: emaste
MFC after: 1 week
camcontrol(8) now supports a new 'persist' subcommand that allows users to
issue SCSI PERSISTENT RESERVE IN / OUT commands.
sbin/camcontrol/Makefile:
Add persist.c.
sbin/camcontrol/persist.c:
New persistent reservation support for camcontrol(8).
We have support for all known operation modes for PERSISTENT RESERVE
IN and PERSISTENT RESERVE OUT.
exceptions noted above.
sbin/camcontrol/camcontrol.8:
Document the new 'persist' subcommand.
In the section on the Transport ID (-I) option, explain what
Transport IDs for each protocol should look like. At some point
some of this information could probably get moved off in a
separate man page, either on Transport IDs alone or a man page
documenting the Transport ID parsing code.
Add a number of examples of persistent reservation commands.
Persistent Reservations are complex enough that the average user
probably won't be able to get the commands exactly right by just
reading the man page. These examples show a few basic and
advanced examples of how to use persistent reservations.
sbin/camcontrol/camcontrol.h:
Move the definition for camcontrol_optret here, so we can use it
for the persistent reservation code.
Add a definition for the new scsipersist() function.
sbin/camcontrol/camcontrol.c:
Add 'persist' to the list of subcommands.
Document 'persist' in the help text.
sys/cam/scsi/scsi_all.c:
Add the scsi_persistent_reserve_in() and
scsi_persistent_reserve_out() CCB building functions.
Add a new function, scsi_transportid_sbuf(). This takes a
SCSI Transport ID (documented in SPC-4), and prints it to
an sbuf(9). There are some transports (like ATA, USB, and
SSA) for which there is no transport defined. We need to
come up with a reasonable thing to do if we're presented
with a Transport ID that claims to be for one of those
protocols.
Add new routines scsi_get_nv() and scsi_nv_to_str().
These functions do a table lookup to go between a string and an
integer. There are lots of table lookups needed in the
persistent reservation code in camcontrol(8).
Add a new function, scsi_parse_transportid(), along with leaf node
functions to parse:
FC, 1394 and SAS (scsi_parse_transportid_64bit())
iSCSI (scsi_parse_transportid_iscsi())
SPI (scsi_parse_transportid_spi())
RDMA (scsi_parse_transportid_rdma())
PCIe (scsi_parse_transportid_sop())
Transport IDs. Given a string with the general form proto,id these
functions create a SCSI Transport ID structure.
sys/cam/scsi/scsi_all.h:
Update the various persistent reservation data structures to
SPC4r36l, but also rename some fields that were previously
obsolete with the proper names from older SCSI specs. This
allows using older, obsolete persistent reservation types when
desired.
Add function prototypes for the new persistent reservation CCB
building functions.
Add a data strucure for the READ FULL STATUS service action
of the PERSISTENT RESERVE IN command.
Add Transport ID structures for all protocols described in SPC-4.
Add a new series of SCSI_PROTO_XXX definitions, and
redefine other defines in terms of these new definitions.
Add a prototype for scsi_transportid_sbuf().
Change a couple of "obsolete" persistent reservation data
structure fields into something more meaningful, based on
what the field was called when it was defined in the spec.
(e.g. SPC, SPC-2, etc.)
Create a new define, SPRI_MAX_LEN, for the maximum allocation
length allowed for the PERSISTENT RESERVE IN command.
Add data structures and enumerations for the new name/value
translation functions.
Add data structures for SCSI over PCIe Routing IDs.
Bring the PERSISTENT RESERVE OUT Register and Move parameter list
structure (struct scsi_per_res_out_parms) up to date with SPC-4.
Add a data structure for the transport IDs that can optionally be
appended to the basic PERSISTENT RESERVE OUT parameter list.
Move SCSI protocol macro definitions out of the VPD page 0x83
definition and combine them with the more up to date protocol
definitions higher in the file.
Add function prototypes for scsi_nv_to_str(), scsi_get_nv(),
scsi_parse_transportid_64bit(), scsi_parse_transportid_spi(),
scsi_parse_transportid_rdma(), scsi_parse_transportid_iscsi(),
scsi_parse_transportid_sop(), and scsi_parse_transportid().
Sponsored by: Spectra Logic Corporation
MFC after: 1 week
handle packets up to 1536 bytes)
This fixes the need to frag that could happen when using vlans on top of
if_arge (which is a common case for the use the switch ports as individual
NICs).
Previously to this commit any vlan setup with if_arge as parent would have
the MTU of the parent interface reduced by the size of dot1q header
(4 bytes).
Tested on TP-Link 1043ND (where the WAN port is just a switch port setup to
tag packets in a different VLAN than the LAN ports).
Reported and tested by: Harm Weites (harm at weites.com)
Update some comments on code, specifying the correct vlans used on switch
setup.
Advertise the proper switch operation mode (the rtl8366rb only support
dot1q vlans).
This fixes the breakage that i introduced on r249752 and make the rtl8366rb
switch works again with etherswitchcfg(8).
Tested on TP-Link 1043ND.
Tested by: me, Harm Weites (harm at weites.com)
6851093 system drops to kmdb with anonymous dtrace probes + kmdb
This has no effect on FreeBSD (code is ifdef'ed) but is useful as
reference for future merges.
MFC after: 1 week
The EFI framebuffer produces corrupted output on certain systems. For
now display the framebuffer parameters (address, dimensions, etc.) on
boot to aid in tracking down these issues.
Sponsored by: The FreeBSD Foundation
Mark cpu_search_lowest/cpu_search_highest/cpu_search_both as noinline,
while cpu_search() gets always_inline. With the attributes set,
cpu_search() is inlined in wrappers, and if()s with constant
conditionals are optimized.
On some tests on many-core machine, the hwpmc reported samples for
cpu_search*() are reduced from 25% total to 9%.
Submitted by: "Rang, Anton" <anton.rang@isilon.com>
MFC after: 1 week
- Don't discard frames if the dropped or error flag is set.
- Don't remove the last 4-bytes of every packet.
- Add extra range check for data position offset when receiving data.
MFC after: 1 day
PR: 191432
The array index for the callchain is getting double-incremented -- both in the
loop and the storing. It should only be incremented in one location.
Also, constrain the stack pointer range check.
MFC after: 2 weeks
requests on the trim_queue, even for the CFA ERASE. This allows us, in
the future, to collapse adjacent requests. Since CFA ERASE is only for
CF cards, and it is so restrictive in what it can do, the collapse
code is not presently here. This also brings the ada driver more in
line with the da driver's treatment of BIO_DELETEs.
Reviewed by: mav@
Don't check ZVOL_WCE flag, used in Solaris to control device "write cache".
It is not applicable on FreeBSD and by default set to "disable".
MFC after: 3 days
Both vt(4) and ofwfb(4) need a lot of love to be usable on sparc64 and even
then the performance of ofwfb(4) would suck compared to hardware accelerated
drivers like creator(4) and machfb(4).
The UEFI framebuffer driver vt_efifb requires vt(4), so add a mechanism
for the startup routine to set the preferred console. This change is
ugly because console init happens very early in the boot, making a
cleaner interface difficult. This change is intended only to facilitate
the sc(4) / vt(4) transition, and can be reverted once vt(4) is the
default.
1. oce_multiq_start(): make sure the buffer is consumed even on ENXIO
2. oce_multiq_transmit(): there is an extra call to drbr_enqueue()
causing the mbuf to be enqueued twice when the NIC's queue is full,
and potential panics
3. oce_multiq_transmit(): same problem fixed recently in ixgbe (r267187)
and other drivers: if the mbuf is enqueued, the proper return value is 0
Submitted by: Stefano Garzarella
MFC after: 3 days
These have no effect on FreeBSD, in fact they are ifdef'ed,
but make easier future merges:
6699767 panic in spec_open()
6718877 crgetzoneid() use can cause problems when forking processes with
USDT providers in a non global zone
MFC after: 3 days
into head. The code is not believed to have any effect
on the semantics of non-NFSv4.1 server behaviour.
It is a rather large merge, but I am hoping that there will
not be any regressions for the NFS server.
MFC after: 1 month
UFS rather than for all but ZFS. This code was assuming that offsets were
monotonically increasing for all file systems except ZFS and that the
cookies from a previous call may have been rewound to a block boundary.
According to mckusick@ only UFS is known to do this, so only requests against
UFS file systems should remove cookies smaller than the given offset. This
fixes serving TMPFS over NFS as it too does not have monotonically increasing
offsets. The comment around the code also indicated it was specific to UFS.
Some of the code using 'not_zfs' is specific to ZFS snapshot handling, so
add a 'is_zfs' variable for those cases.
It's possible that 'is_zfs' check for VFS_VGET() support may not be
specific to ZFS. This needs more research and testing.
After this fix TMPFS and other file systems can be served over NFS.
To test I compared the results of syncing a /usr/src tree into a tmpfs and
serving that over NFS. Before the fix 3589 files were missing on the remote
view. After the fix all files were successfully found.
Reviewed by: rmacklem
Discussed with: mckusick, rmacklem via fs@
Discussed at: http://lists.freebsd.org/pipermail/freebsd-fs/2014-April/019264.html
MFC after: 2 weeks
Sponsored by: EMC / Isilon Storage Division
For every supported command define CDB length and mask of bits that are
allowed to be set. This allows to remove bunch of checks through the code
and still make the validation more strict. To properly do it for commands
supporting multiple service actions, formalize their parsing by adding
subtables for each of such commands.
As visible effect, this change allows to add support for REPORT SUPPORTED
OPERATION CODES command, reporting to client all the data about supported
SCSI commands, except timeouts.
MFC after: 2 weeks
graph. In particular, this solves some issues with (probably leaked)
IPSec-related tags being looped back through netgraph to the inbound
path which then misinterpreted the stale tags.
MFC after: 7 days
rules prevent the USB serial module to be unloaded before any client
modules. This patch ensures that the "ucom_mtx" mutex is destroyed
last when doing a system uninit in a monotolith build aswell.
MFC after: 3 days
It is used only during execve (i.e. singlethreaded), so there is no fear
of returning 'not shared' which soon becomes 'shared'.
While here reorganize the code a little to avoid proc lock/unlock in
shared case.
MFC after: 1 week
Solaris and other OSs have support for \< and \> as word
delimiters in utilities like sed(1). These are useful to
have for general compatiblity with Solaris but should be
avoided for portability with other systems, including the
traditional BSDs.
Bump __FreeBSD_version as this is likely to affect some
userland utilities.
Reference:
https://www.illumos.org/issues/516
PR: bin/153257
Obtained from: Illumos
MFC after: 1 month
The ixgbe(4) hardware is capable of RSS hashing RX packets and doing RSS
queue selection for up to 8 queues.
However, even if multi-queue is enabled for ixgbe(4), the RX path doesn't use
the RSS flowid from the received descriptor. It just uses the MSIX queue id.
This patch does a handful of things if RSS is enabled:
* Instead of using a random key at boot, fetch the RSS key from the RSS code
and program that in to the RSS redirection table.
That whole chunk of code should be double checked for endian correctness.
* Use the RSS queue mapping to CPU ID to figure out where to thread pin
the RX swi thread and the taskqueue threads for each queue.
* The software queue is now really an "RSS bucket".
* When programming the RSS indirection table, use the RSS code to
figure out which RSS bucket each slot in the indirection table maps
to.
* When transmitting, use the flowid RSS mapping if the mbuf has
an RSS aware hash. The existing method wasn't guaranteed to align
correctly with the destination RSS bucket (and thus CPU ID.)
This code warns if the number of RSS buckets isn't the same as the
automatically configured number of hardware queues. The administrator
will have to tweak one of them for better performance.
There's currently no way to re-balance the RSS indirection table after
startup. I'll worry about that later.
Additionally, it may be worthwhile to always use the full 32 bit flowid if
multi-queue is enabled. It'll make things like lagg(4) behave better with
respect to traffic distribution.
The igb(4) hardware is capable of RSS hashing RX packets and doing RSS
queue selection for up to 8 queues. (I believe some hardware is limited
to 4 queues, but I haven't tested on that.)
However, even if multi-queue is enabled for igb(4), the RX path doesn't use
the RSS flowid from the received descriptor. It just uses the MSIX queue id.
This patch does a handful of things if RSS is enabled:
* Instead of using a random key at boot, fetch the RSS key from the RSS code
and program that in to the RSS redirection table.
That whole chunk of code should be double checked for endian correctness.
* Use the RSS queue mapping to CPU ID to figure out where to thread pin
the RX swi thread and the taskqueue threads for each queue.
* The software queue is now really an "RSS bucket".
* When programming the RSS indirection table, use the RSS code to
figure out which RSS bucket each slot in the indirection table maps
to.
* When transmitting, use the flowid RSS mapping if the mbuf has
an RSS aware hash. The existing method wasn't guaranteed to align
correctly with the destination RSS bucket (and thus CPU ID.)
This code warns if the number of RSS buckets isn't the same as the
automatically configured number of hardware queues. The administrator
will have to tweak one of them for better performance.
There's currently no way to re-balance the RSS indirection table after
startup. I'll worry about that later.
Additionally, it may be worthwhile to always use the full 32 bit flowid if
multi-queue is enabled. It'll make things like lagg(4) behave better with
respect to traffic distribution.
reset device task request from the driver. If the drive fails to respond
with a signature FIS, the driver would previously get into an endless retry
loop, stalling all I/O to the drive and keeping user processes stranded.
Instead, fail the i/o and invalidate the device if the task management
command times out. This is controllable with the sysctl and tunable
hw.isci.fail_on_task_timeout
dev.isci.0.fail_on_task_timeout
The default for these is 1.
Reviewed by: jimharris
Obtained from: Netflix, Inc.
MFC after: 2 days
If a controller is set to JBOD, it has no RAID functions turned on.
Populate even more of the firmware specification headers, copied from
cciss_vol_status.
Reviewed by: Benesh, Scott <scott.benesh@hp.com>
MFC after: 2 weeks
If cam_periph_find() doesn't locate the path we requested, bail to error
condition.
Acquire ciss->mtx for this operation.
Reviewed by: "Benesh, Scott" <scott.benesh@hp.com>
MFC after: 2 weeks
when a newly created file has another open done on it that
update the open mode. This patch moves the code that updates
the open mode up into the block where the mutex is held to
ensure this cannot happen. No bug caused by this potential
race has been observed, but this fix is a safety belt to ensure
it cannot happen.
MFC after: 2 weeks
MFV r260708
4427 pid provider rejects probes with valid UTF-8 names
Use of u8_textprep.c broke the build on powerpc.
Reported by: bz, rpaulo and tinderbox.
Pointyhat: me
Remove duplicate "debug_ktr.mask" sysctl definition.
Remove now unused variable from "kern_ktr.c".
This fixes build of "ktr" which was broken by r267961.
Let the default value for "vm_kmem_size_scale" be zero. It is setup
after that the sysctl has been initialized from "getenv()" in the
"kmeminit()" function to equal the "VM_KMEM_SIZE_MAX" value, if
zero. On Sparc64 the "VM_KMEM_SIZE_MAX" macro is not a constant. This
fixes build of Sparc64 which was broken by r267961.
Add a special macro to dynamically create SYSCTL root nodes, because
root nodes have a special parent. This fixes build of existing OFED
module and CANBUS module for pc98 which was broken by r267961.
Add missing "sysctl.h" includes to get the needed sysctl header file
declarations. This is needed after r267961.
MFC after: 2 weeks
Filetable can be shared with other processes. Previous code failed to
clear the pointer for all but the last process getting rid of the table.
This is mostly cosmetics.
Get rid of 'This should happen earlier' comment. Clearing the pointer in
this place is fine as consumers can reliably check for files availability
by inspecting fd_refcnt and vnodes availabity by NULL-checking them.
MFC after: 1 week
These changes prevent sysctl(8) from returning proper output,
such as:
1) no output from sysctl(8)
2) erroneously returning ENOMEM with tools like truss(1)
or uname(1)
truss: can not get etype: Cannot allocate memory
implement options TERMINAL_{KERN,NORM}_ATTR. These are aliased to
SC_{KERNEL_CONS,NORM}_ATTR and like these latter, allow to change the
default colors of normal and kernel text respectively.
Note on the naming: Although affecting the output of vt(4), technically
kern/subr_terminal.c is primarily concerned with changing default colors
so it would be inconsistent to term these options VT_{KERN,NORM}_ATTR.
Actually, if the architecture and abstraction of terminal+teken+vt would
be perfect, dev/vt/* wouldn't be touched by this commit at all.
Reviewed by: emaste
MFC after: 3 days
Sponsored by: Bally Wulff Games & Entertainment GmbH
With this change and previous work from ray@ it will be possible to put
both in GENERIC, and have one enabled by default, but allow the other to
be selected via the loader.
(The previous implementation had separate kern.vt.disable and
hw.syscons.disable tunables, and would panic if both drivers were
compiled in and neither was explicitly disabled.)
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.
Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.
MFC after: 2 weeks
Sponsored by: Mellanox Technologies
If passed cm->cmsg_len was below cmsghdr size the experssion:
datalen = (caddr_t)cm + cm->cmsg_len - (caddr_t)data;
would give negative result. However, in practice it would not
result in a crash because the kernel would try to obtain garbage fds
for given process and would error out with EBADF.
PR: 124908
Submitted by: campbell mumble.net (modified a little)
MFC after: 1 week
Instead of trying to guess size of disk I/O operations (it just won't work
that way for newly added commands, and is equal to data move size for old
ones), account data move traffic. If disk I/Os are that interesting, then
backends have to account and provide that information.
Block backend already exports the information about disk I/Os via devstat,
so having it here too is excessive.
MFC after: 2 weeks
2915 DTrace in a zone should see "cpu", "curpsinfo", et al
2916 DTrace in a zone should be able to access fds[]
2917 DTrace in a zone should have limited provider access
MFC after: 2 weeks
configs. Switch the BERI_NETFPGA_MDROOT to 64bit by default.
Give we have working interrupts also cleanup the extra polling CFLAGS from
the module Makefile.
MFC after: 2 weeks
time by setting NF10BMAC_64BIT and using a REGWTYPE #define to set correct
variable and return value widths.
Adjust comments to indicate the 32 or 64bit register widths.
MFC after: 2 weeks
have to undo it by calling crfree(). This reduces the total number of calls
by vm_map_insert() to crhold() and crfree() by 45% in my tests.
Eliminate an unnecessary variable from vm_map_insert().
Reviewed by: kib
Tested by: pho
This gives some use to 512KB per-LUN buffers, allocated for Copan-specific
processor code and not used. It allows, for example, to test transport
performance and/or correctness without accessing the media, as supported
by Linux version of sg3_utils.
MFC after: 2 weeks
<sys/time.h>. INT64_MAX actually requires __INT64_C() hack to get the
type right on exotic architectures (e.g. on ones with 63-bit ints or long
0x7fffffffffffffff is unsigned int or long). The hardcoded LL suffix is
good enough to avoid these problems for SBT_MAX (it makes the type always
signed long long, without overflow since long long has at least 64 bits).
Many thanks to Bruce Evans for the time spent me to explain this.
Reported by: bde
Reviewed by: bde
map the bucket to an RSS queue, then map the queue to a CPU ID.
This way the bucket->queue and queue->CPU mapping can change
over time.
Introduce IP_RSSBUCKETID - which instead looks up the RSS bucket.
User applications can then map the RSS bucket to a CPU.
There's 128 indirection table entries which correspond to the
low 7 bits of the 32 bit RSS hash. Each value will correspond
to an RSS bucket. (Then each RSS bucket currently will map
to a CPU.)
This is a more explicit way of figuring out which RSS bucket
is in each RSS indirection slot. It can be inferred by the other
methods but I'd rather drivers use something more simplified and
explicit.
This is different than the amount shown for the process e.g. by
/usr/bin/top - that is the mappings faulted in by the mmap'd region
of guest memory.
The values can be fetched with bhyvectl
# bhyvectl --get-stats --vm=myvm
...
Resident memory 413749248
Wired memory 0
...
vmm_stat.[ch] -
Modify the counter code in bhyve to allow direct setting of a counter
as opposed to incrementing, and providing a callback to fetch a
counter's value.
Reviewed by: neel
PCI root bridges except for the one known-valid case on x86 where bridges
claim the I/O port registers used for PCI config space access.
Tested by: Hilko Meyer <hilko.meyer@gmx.de>
MFC after: 1 week
Split global ctl_lock, historically protecting most of CTL context:
- remaining ctl_lock now protects lists of fronends and backends;
- per-LUN lun_lock(s) protect LUN-specific information;
- per-thread queue_lock(s) protect request queues.
This allows to radically reduce congestion on ctl_lock.
Create multiple worker threads, depending on number of CPUs, and assign
each LUN to one of them. This allows to spread load between multiple CPUs,
still avoiging congestion on queues and LUNs locks.
On 40-core server, exporting 5 LUNs, each backed by gstripe of SATA SSDs,
accessed via 6 iSCSI connections, this change improves peak request rate
from 250K to 680K IOPS.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
While for FreeBSD client that is only a minor optimization, VMWare client
doesn't support additional data requests after all data being sent once as
immediate.
MFC after: 1 week
Sponsored by: iXsystems, Inc.
4427 pid provider rejects probes with valid UTF-8 names
This make use of Solaris' u8_validate() which we happen to
use since r185029 for ZFS.
Illumos Revision: 1444d846b126463eb1059a572ff114d51f7562e5
Reference:
https://www.illumos.org/issues/4427
Obtained from: Illumos
MFC after: 2 weeks
Prevent the Xen and VirtIO balloon drivers from marking pages as
wired. This prevents them from increasing the system wired page count,
which can lead to mlock failing because of hitting the limit in
vm.max_wired.
In the Xen case make sure pages are zeroed before giving them back to
the hypervisor, or else we might be leaking data. Also remove the
balloon_{append/retrieve} and link pages directly into the
ballooned_pages queue using the plinks.q field in the page struct.
Sponsored by: Citrix Systems R&D
Reviewed by: kib, bryanv
Approved by: gibbs
dev/virtio/balloon/virtio_balloon.c:
- Don't allocate pages with VM_ALLOC_WIRED.
dev/xen/balloon/balloon.c:
- Don't allocate pages with VM_ALLOC_WIRED.
- Make sure pages are zeroed before giving them back to the
hypervisor.
- Remove the balloon_entry struct and the balloon_{append/retrieve}
functions and use the page plinks.q entry to link the pages
directly into the ballooned_pages queue.
peripheral devices. When transmitting (rx) from slave to master,
sometimes nAKC delays. As a result, some slaves fails their
transmission.
Submitted by: Masanori OZAWA <ozawa@ongs.co.jp>
Reviewed by: brix
MFC after: 1 week
usage from dtrace. The dtrace code already uses cdevpriv(9) since FreeBSD
8, so this change should be quite harmless.
Reviewed by: markj
Approved by: markj
MFC after: never
that it creates (r267645), we can place the check that blocks map entry
coalescing on stack entries in vm_map_simplify_entry() where it properly
belongs.
Reviewed by: kib
the file which is compiled with SSE disabled. The functions set up
the FPU context for kernel, and compiler optimizations which could
lead to use of XMM registers before the fpu_kern_enter(9) is called or
after fpu_kern_leave(9), panic the machine.
Discussed with: jmg
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
context into memory for the kernel threads which called
fpu_kern_thread(9). This allows the fpu_kern_enter() callers to not
check for is_fpu_kern_thread() to get the optimization.
Apply the flag to padlock(4) and aesni(4). In aesni_cipher_process(),
do not leak FPU context state on error.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
numerical values as MAP_STACK_GROWS_*, but the former is for entries'
eflags, while the later for the cow argument of vm_map_insert().
Submitted by: alc
The sound drivers that use own buffer management can use sndbuf_setup
and not do any busdma allocation, so the driver will end up with the
managed buffer but no valid dma map and tag for it. Avoid calling
bus_dmamem_free in such cases.
Reported by: ache
Missed in review by: kan
the 4 byte-aligned dtrace_invop_callsite can be found and that it
immediately follows the call to dtrace_invop(). Secondly, fix some pointer
arithmetic to account for differences between struct i386_frame and illumos'
struct frame. Finally, ensure that dtrace_getarg() isn't inlined. It works
by following a fixed number of frame pointers to the probe site, so inlining
breaks it.
MFC after: 3 weeks
o assert in each one that fdp is not shared
o remove unnecessary NULL checks - all userspace processes have fdtables
and kernel processes cannot execve
o remove comments about the danger of fd_ofiles getting reallocated - fdtable
is not shared and fd_ofiles could be only reallocated if new fd was about to be
added, but if that was possible the code would already be buggy as setugidsafety
work could be undone
MFC after: 1 week
first five for probes entered through a UD fault (i.e. FBT probes).
Specifically, handle the fact that dtrace_invop_callsite must be
16 byte-aligned and thus may not immediately follow the call to
dtrace_invop() in dtrace_invop_start(). Also fetch register arguments and
the stack pointer through a struct trapframe instead of a struct reg.
PR: 191260
Submitted by: luke.tw@gmail.com
MFC after: 3 weeks
We can read refcnt safely and only care if it is equal to 1.
If it could suddenly change from 1 to something bigger the code would be
buggy even in the previous form and transitions from > 1 to 1 are equally racy
and harmless (we copy even though there is no need).
MFC after: 1 week
binding their threads to particular CPU.
Changing ithread cpu mask is now performed by special cpuset_setithread().
It creates additional cpuset root group on first bind invocation.
No objection: jhb
Tested by: hiren
MFC after: 2 weeks
Sponsored by: Yandex LLC
fd_lastfile is guaranteed to be the biggest open fd, so when the intent
is to iterate over active fds or lookup one, there is no point in looking
beyond that limit.
Few places are left unpatched for now.
MFC after: 1 week
defined. This ensures that the sdt:zfs:: probes appear despite the fact
the sdt provider is defined in the kernel rather than in zfs.ko.
Reported by: hiren
Tested by: hiren
MFC after: 2 weeks
reporting IP-addresses to the peer during the handshake, adding
addresses to the host, reporting the addresses via the sysctl
interface (used by netstat, for example) and reporting the
addresses to the application via socket options.
This issue was reported by Bernd Walter.
MFC after: 3 days
separate argument structure with added level_type field for
CPUID_CPUID_COUNT request.
Reviewed by: attilio (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
map entries list, and that it does not overlap with the previous and
next entries.
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Assume the number of description used is reasonable value to
increment this otherwise opaque field by.
While here, reduce a minor difference between the legacy and
multiqueue transmit paths.
MFC after: 1 week
This requires the VMware vmxnet3 device to flip the start of packet
descriptor's generation before the rest of the packet's descriptors
have been loaded into the Rx ring. I've never observed this behavior,
and it seems to make the most sense not to do it this way. But it is
not a lot of work for the driver to handle this situation just in case.
MFC after: 1 week
The sbp_cam_detach_target can be called from sbp_post_explore function
on the first target that is not really attached and it was written with
the corresponding safety check in place to tolerate that. Unfortunately
the recent locking cleanup did add a locking assertion that tries to
dereference the target->sbp pointer unconditionally, which causes less
than desirable outcome. Since the assertion is useful, just initialize
the target sbp pointer once when sbp device is being initialized instead
of when the target is being attached. This makes assertion work in all
cases and fixes the crash on boot.
performing cpuid calls.
Add also a new way to specify the level type to cpucontrol(8) as
reported in the manpage.
Sponsored by: EMC / Isilon storage division
Reviewed by: bdrewery, gcooper
Testerd by: bdrewery
instead of trying to cache it.
Previously, we only trusted the state if we did not have a cached state.
However, once a state was cached, the _STA method was always ignored.
Specifically, once a power resource had been turned on once (e.g.
during resume), the driver assumed it was always on even if _STA said it
was off and never turned it back on. This prevented the power resource
from being turned back on if a laptop was resumed twice, for example.
To fix, just remove the cached state entirely and always use the results
of _STA. The loops already skip any resources where _STA fails.
Submitted by: trasz (initial patch to invoke _ON)
MFC after: 1 week
corresponding flag(s) in the new map entry. Previously, the caller was
responsible for setting them after vm_map_insert() returned.
Pass MAP_STACK_GROWS_DOWN to vm_map_insert() from vm_map_growstack() when
extending the stack in the downward direction.
Together these changes slightly simplify the caller's task when creating a
downward growing stack. In particular, the caller no longer needs to clip
the previous entry, because the new stack entry can't possibly coalesce
with the previous entry.
Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division
From one side it allows to remove CTL_FLAG_TASK_PENDING flag, handling of
which significantly complicates fine-grained locking. From the other side
it reduces task management requests latency even below then that flag could.
As downside, it denies task management code to sleep, but that is not needed
any way now.
Discussed with: ken
This should allow to abort commands doing mostly disk I/O, such as VERIFY
or WRITE SAME. Before this change CTL_FLAG_ABORT was only checked around
data moves, which for these commands may not happen for a very long time.
MFC after: 2 weeks
SPC-4 recommends T10 vendor ID based LUN ID was created by concatenating
product name and serial number (and istgt follows that). But product name
is 16 bytes long by itself, so 16 bytes total length is clearly not enough
to fit both.
To keep compatibility with existing configurations, pad short device IDs
to old length of 16, same as before.
This change probably breaks CTL user-level ABI, so control tools should
be rebuilt after this change.
MFC after: 2 weeks
we're now back to the pre-r228483 level of default verbosity. This in
turn again typically allows for reading information that userland might
have printed on the screen before initiating a halt, but still permits
to debug potential device shutdown problems on system shutdown via
CAM_DEBUG etc.
Reviewed by: mav
MFC after: 3 days
Sponsored by: Bally Wulff Games & Entertainment GmbH
and prevents the request from deleting existing mappings in the
region, failing instead.
Reviewed by: alc
Discussed with: jhb
Tested by: markj, pho (previous version, as part of the bigger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
for any outstanding commands to be properly aborted by CTL.
Without it, in some cases (such as files backing the LUNs
stored on failing disk drives), terminating a busy session
would result in panic.
Reviewed by: mav@ (earlier version)
Sponsored by: The FreeBSD Foundation
Fix the gate in xen_pv_lapic_ipi_vectored to prevent access to element
at position nitems(xen_ipis).
Sponsored by: Citrix Systems R&D
Coverity ID: 1223203
Approved by: gibbs
- Don't compare the DMA map to NULL to determine if bus_dmamap_unload()
should be called when releasing a static allocation. Instead, compare
the bus address against 0.
- Don't assume that the DMA map for static allocations is NULL. Instead,
save the value set by bus_dmamem_alloc() so it can later be passed to
bus_dmamem_free(). Also, add missing calls to bus_dmamap_unload() in
these cases before freeing the buffer.
- Use the bus address from the bus_dma callback instead of calling
vtophys() on the address allocated by bus_dmamem_alloc().
Reviewed by: kan
- Add missing calls to bus_dmamap_unload() in et(4).
- Check the bus address against 0 to decide when to call
bus_dmamap_unload() instead of comparing the bus_dma map against NULL.
- Check the virtual address against NULL to decide when to call
bus_dmamem_free() instead of comparing the bus_dma map against NULL.
- Don't clear bus_dma map pointers to NULL for static allocations.
Instead, treat the value as completely opaque.
- Pass the correct virtual address to bus_dmamem_free() in wpi(4) instead
of trying to free a pointer to the virtual address.
Reviewed by: yongari
The functions' definitions are protected by #ifdef SMP.
Keeping apic_ops.ipi_*() methods NULL would allow to catch the use
on UP machines.
Reviewed by: royger
Sponsored by: The FreeBSD Foundation
permissions test, forgotten in r164033.
Refactor the permission checks for utimes(2) into vnode helper
function vn_utimes_perm(9), and simplify its code comparing with the
UFS origin, by writing the call to VOP_ACCESSX only once. Use the
helper for UFS(5), tmpfs(5), devfs(5) and msdosfs(5).
Reported by: bde
Reviewed by: bde, trasz
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
the queue where to enqueue pages that are going to be unwired.
- Add stronger checks to the enqueue/dequeue for the pagequeues when
adding and removing pages to them.
Of course, for unmanaged pages the queue parameter of vm_page_unwire() will
be ignored, just as the active parameter today.
This makes adding new pagequeues quicker.
This change effectively modifies the KPI. __FreeBSD_version will be,
however, bumped just when the full cache of free pages will be
evicted.
Sponsored by: EMC / Isilon storage division
Reviewed by: alc
Tested by: pho
Make data_submit backends method support not only read and write requests,
but also two new ones: verify and compare. Verify just checks readability
of the data in specified location without transferring them outside.
Compare reads the specified data and compares them to received data,
returning error if they are different.
VERIFY(10/12/16) commands request either verify or compare from backend,
depending on BYTCHK CDB field. COMPARE AND WRITE command executed in two
stages: first it requests compare, and then, if succeesed, requests write.
Atomicity of operation is guarantied by CTL request ordering code.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
This is needed because syscons depends on ISA.
Sponsored by: Citrix Systems R&D
Approved by: gibbs
x86/isa/isa.c:
- Allow the ISA bus to attach to xenpv.
Switch the initialization of gnttab to use an unused physical memory
range for both PVHVM and PVH.
In the past PVHVM was using the xenpci BAR, but there's no reason to
do that, and in fact FreeBSD was probably doing it because it was the
way it was done in Windows, were drivers cannot probably request for
unused physical memory ranges, but it was never enforced in the
hypervisor.
Sponsored by: Citrix Systems R&D
Approved by: gibbs
xen/gnttab.c:
- Allocate contiguous physical memory for grant table frames for both
PVHVM and PVH.
- Since gnttab is not a device, use the xenpv device in order to
request for this allocation.
dev/xen/xenpci/xenpcivar.h:
dev/xen/xenpci/xenpci.c:
- Remove the now unused xenpci_alloc_space and xenpci_alloc_space_int
functions.
xen/gnttab.h:
- Change the prototype of gnttab_init and gnttab_resume, that now
takes a device_t parameter.
dev/xen/control/control.c:
x86/xen/xenpv.c:
- Changes to accomodate the new prototype of gnttab_init and
gnttab_resume.
Currently the grant table is initialized from xenstore, but a better
place to do this would be xenpv, so move grant table initialization
there.
Sponsored by: Citrix Systems R&D
Approved by: gibbs
x86/xen/xenpv.c:
- Add gnttab initialization.
xen/xenstore/xenstore.c:
- Remove gnttab initialization.
For PVH guests the xenstore parameters are fetched from the start_info
struct, just like on PV.
Sponsored by: Citrix Systems R&D
Approved by: gibbs
xen/xenstore/xenstore.c:
- Fetch xenstore event channel port from start_info.
Add the PV shutdown hook to PVH.
Sponsored by: Citrix Systems R&D
Approved by: gibbs
dev/xen/control/control.c:
- Make xen_pv_shutdown_final available on XENHVM builds.
- Register the Xen PV shutdown hook for PVH guests.
Introduce a Xen specific nexus that is going to be used by Xen PV/PVH
guests.
Sponsored by: Citrix Systems R&D
Approved by: gibbs
x86/xen/xen_nexus.c:
- Introduce a Nexus to use on Xen PV(H) guests, this prevents PV(H)
guests from using the legacy Nexus.
conf/files.amd64:
conf/files.i386:
- Add the xen nexus to the build.
Since there's no ACPI on PVH guests, we need to create a dummy CPU
device in order to fill the pcpu->pc_device field.
Sponsored by: Citrix Systems R&D
Approved by: gibbs
dev/xen/pvcpu/pvcpu.c:
- Create a dummy CPU device for PVH guests in order to fill the
per-cpu pc_device field.
conf/files:
- Add the pvcpu device to kernels using XEN or XENHVM options.
Create a dummy bus so top level Xen devices can attach to it (instead
of attaching directly to the nexus). This allows to have all the Xen
related devices grouped under a single bus.
Sponsored by: Citrix Systems R&D
Approved by: gibbs
x86/xen/xenpv.c:
- Attach the xenpv bus when running as a Xen guest.
- Attach the ISA bus if needed, in order to attach syscons.
conf/files.amd6:
conf/files.i386:
- Include the xenpv.c file in the build of i386/amd64 kernels using
XENHVM.
dev/xen/console/console.c:
dev/xen/timer/timer.c:
xen/xenstore/xenstore.c:
- Attach to the xenpv bus instead of the Nexus.
dev/xen/xenpci/xenpci.c:
- Xen specific devices on PVHVM guests are no longer attached to the
xenpci device, they are instead attached to the xenpv bus, remove
the now unused methods.
Create the necessary hooks in order to provide a Xen PV APIC
implementation that can be used on PVH. Most of the lapic ops
shouldn't be called on Xen, since we trap those operations at a higher
layer.
Sponsored by: Citrix Systems R&D
Approved by: gibbs
x86/xen/hvm.c:
x86/xen/xen_apic.c:
- Move IPI related code to xen_apic.c
x86/xen/xen_apic.c:
- Introduce Xen PV APIC implementation, most of the functions of the
lapic interface should never be called when running as PV(H) guest,
so make sure FreeBSD panics when trying to use one of those.
- Define the Xen APIC implementation in xen_apic_ops.
xen/xen_pv.h:
- Extern declaration of the xen_apic struct.
x86/xen/pv.c:
- Use xen_apic_ops as apic_ops when running as PVH guest.
conf/files.amd64:
conf/files.i386:
- Include the xen_apic.c file in the build of i386/amd64 kernels
using XENHVM.
This is needed for Xen PV(H) guests, since there's no hardware lapic
available on this kind of domains. This commit should not change
functionality.
Sponsored by: Citrix Systems R&D
Reviewed by: jhb
Approved by: gibbs
amd64/include/cpu.h:
amd64/amd64/mp_machdep.c:
i386/include/cpu.h:
i386/i386/mp_machdep.c:
- Remove lapic_ipi_vectored hook from cpu_ops, since it's now
implemented in the lapic hooks.
amd64/amd64/mp_machdep.c:
i386/i386/mp_machdep.c:
- Use lapic_ipi_vectored directly, since it's now an inline function
that will call the appropiate hook.
x86/x86/local_apic.c:
- Prefix bare metal public lapic functions with native_ and mark them
as static.
- Define default implementation of apic_ops.
x86/include/apicvar.h:
- Declare the apic_ops structure and create inline functions to
access the hooks, so the change is transparent to existing users of
the lapic_ functions.
x86/xen/hvm.c:
- Switch to use the new apic_ops.
The header structure consists of two 1-byte elements, but it must always
be describable by a single SG entry. Note for consistency, specify the
alignment everywhere, even if the structure has the appropriate natural
alignment since it contains a uint16_t.
Obtained from: DragonFlyBSD
MFC after: 1 week
These defines are applicable to userland too, but virtqueue.h contains
the kernel virtqueue interface, and is therefore not usable in userland.
Note that Linux places these defines in virtio_ring.h, but I don't want
the drivers including this header file to keep the VirtIO ring opaque to
everything but the virtqueue.
MFC after: 1 week
The eventual goal is to share this file with userland, so
remove the macro that is only specific for virtio_pci(4).
Instead, add the VIRTIO_PCI_CONFIG_OFF macro from Linux to
get the config size whether MSIX is enabled or not.
MFC after: 1 week