- clear capability flags when hw times out
- retire comc_started status variable and directly use c_flags to see
if comconsole is selected for use
Reviewed by: jhb
Tested by: Uffe Jakobsen <uffe@uffe.org>,
Olivier Cochard-Labbe <olivier@cochard.me>
MFC after: 26 days
- clarify meaning of console flags
- perform i/o via a console only if both of the following conditions are met:
o console is active (selected by user or config)
o console flags that it can perform the operation
- warn if a chosen console cannot work (the warning may go nowhere without
a working and active console, though)
Reviewed by: jhb
Tested by: Uffe Jakobsen <uffe@uffe.org>,
Olivier Cochard-Labbe <olivier@cochard.me>
MFC after: 26 days
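The i/o rule above (a console is used only when it is both active and capable) can be sketched as a pair of flag tests. This is a minimal illustration, not the loader's code: the flag names follow the C_PRESENT*/C_ACTIVE* convention, but the bit values and the helper names console_can_output()/console_can_input() are assumptions.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits; the values are assumptions. */
#define C_PRESENTIN	0x01	/* console can provide input */
#define C_PRESENTOUT	0x02	/* console can accept output */
#define C_ACTIVEIN	0x04	/* selected for input by user or config */
#define C_ACTIVEOUT	0x08	/* selected for output by user or config */

/* Hypothetical helper: output is allowed only if selected AND capable. */
bool
console_can_output(int c_flags)
{
	const int need = C_PRESENTOUT | C_ACTIVEOUT;

	return ((c_flags & need) == need);
}

/* Hypothetical helper: input is allowed only if selected AND capable. */
bool
console_can_input(int c_flags)
{
	const int need = C_PRESENTIN | C_ACTIVEIN;

	return ((c_flags & need) == need);
}
```

A console that was selected but whose hardware timed out would have its capability bit cleared, so both helpers return false for it.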
... instead of deferring the action until first open.
Unlike upstream this has no benefit on FreeBSD.
We know that as soon as the provider is created it is going to be tasted
and thus opened. Initial mediasize of zero causes tasting failure
and subsequent retasting because of the size change.
MFC after: 14 days
The first discovered pool, whether it covers the whole boot disk or not,
is going to be first in zfs_pools list. So there is no need at all
for spapp parameter.
This commit also fixes a bug where NULL would be assigned to NULL
pointer when probe_drive was called with the spapp parameter of NULL.
MFC after: 21 days
the temporary mappings that are used to implement operations like
pmap_zero_page(). There is no reason for the MIPS pmap to deviate
from that practice.
This should allow a dataset to be mounted as a root filesystem even if
it belongs to a pool that is not described in zpool.cache.
This adds some overhead to the boot process though.
If the root filesystem's pool is found in zpool.cache, then by default
its cached configuration will be used for import.
vfs.zfs.rootpool.prefer_cached_config could be set to zero to force
the config to be retasted.
Discussed with: gibbs, pjd, des
MFC after: 25 days
- only filesystem datasets are supported
- children names are printed to stdout
To do: allow iterating over the list and fetching names programmatically
MFC after: 17 days
.. when deciding whether to continue tracing across suid/sgid exec.
Otherwise if root ktrace-d an unprivileged process and the process
exec-ed a suid program, then tracing didn't continue across exec.
Reviewed by: bde, kib
MFC after: 22 days
having PTE_RO set instead of PTE_D. This avoids some unnecessary failures
by pmap_extract_and_hold() that will have to be handled by a call to
vm_fault_hold(). Testing the PTE for both being non-zero and having PTE_V
set is redundant. The latter suffices.
- All packets in NETISR_IP queue are in net byte order.
- ip_input() is entered in net byte order and converts packet
to host byte order right _after_ processing pfil(9) hooks.
- ip_output() is entered in host byte order and converts packet
to net byte order right _before_ processing pfil(9) hooks.
- ip_fragment() accepts and emits packet in net byte order.
- ip_forward(), ip_mloopback() use host byte order (untouched actually).
- ip_fastforward() no longer modifies packet at all (except ip_ttl).
- Swapping of byte order there and back removed from the following modules:
pf(4), ipfw(4), enc(4), if_bridge(4).
- Swapping of byte order added to ipfilter(4), based on __FreeBSD_version
- __FreeBSD_version bumped.
- pfil(9) manual page updated.
Reviewed by: ray, luigi, eri, melifaro
Tested by: glebius (LE), ray (BE)
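The conversion points described above can be sketched with the standard ntohs()/htons() helpers. This is an illustration only: struct ip_fields is a hypothetical stand-in for the 16-bit header fields that get swapped (assumed here to be ip_len and ip_off), and the function names are made up for the example.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the 16-bit IP header fields being swapped. */
struct ip_fields {
	uint16_t ip_len;
	uint16_t ip_off;
};

/* Input path: convert to host order right after the pfil(9) hooks. */
void
ip_fields_to_host(struct ip_fields *ip)
{
	ip->ip_len = ntohs(ip->ip_len);
	ip->ip_off = ntohs(ip->ip_off);
}

/* Output path: convert to net order right before the pfil(9) hooks. */
void
ip_fields_to_net(struct ip_fields *ip)
{
	ip->ip_len = htons(ip->ip_len);
	ip->ip_off = htons(ip->ip_off);
}
```

With the conversions placed this way, every pfil(9) consumer sees net byte order in both directions and needs no swapping of its own.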
to 32k swamped the controller causing firmware hangs. Instead, round
requests smaller than 64k up to the next power of 2 as a general rule.
To handle the one known special case of a command that accepts a 12k
buffer returning a 24k-ish reply, round requests between 8k and 16k up
to 32k rather than 16k. The result is that commands less than 8k should
now be rounded up to a smaller size (either 4k or 8k) rather than 32k.
PR: kern/155658
Tested by: Andreas Longwitz
MFC after: 1 week
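The rounding rule above can be sketched as follows. This is not the driver's code: round_request_size() is a hypothetical name, and the 4k floor and the exact boundary handling (8k exclusive, 16k inclusive for the special case) are assumptions read off the description.

```c
#include <assert.h>
#include <stddef.h>

#define KB(n)	((size_t)(n) * 1024)

/* Hypothetical helper: pick the buffer size to allocate for a request. */
size_t
round_request_size(size_t len)
{
	size_t size;

	if (len >= KB(64))
		return (len);		/* large requests are left alone */
	if (len > KB(8) && len <= KB(16))
		return (KB(32));	/* special case: 12k command, ~24k reply */
	/* General rule: next power of 2, starting from an assumed 4k floor. */
	for (size = KB(4); size < len; size <<= 1)
		continue;
	return (size);
}
```

Under these assumptions a 6000-byte request gets an 8k buffer instead of 32k, while a 12k request still gets the 32k it needs for its reply.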
AR5416 and AR9280, but leave it disabled by default.
TL;DR: don't enable this code at all unless you go through the process
of getting the NIC re-certified. This is purely to be used as a
reference and NOT a certified solution by any stretch of the imagination.
The background:
The AR5112 RF synth right up to the AR5133 RF synth (used on the AR5416,
derivative is used for the AR9130/AR9160) only implement down to 2.5MHz
channel spacing in 5GHz. Ie, the RF synth is programmed in steps of 2.5MHz
(or 5, 10, 20MHz.) So they can't represent the quarter rate channels
in the 4.9GHz PSB (which end in xxx2MHz and xxx7MHz). They support
fractional spacing in 2GHz (1MHz spacing) (or things wouldn't work,
right?)
So instead of doing this, the RF synth programming for the AR5112 and
later code will round to the nearest available frequency.
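The arithmetic of "round to the nearest available frequency" can be sketched as below. This shows only the rounding, not the synth register programming; synth_round_khz() is a hypothetical name and the use of kHz units is an assumption made to keep the 2.5MHz step an integer.

```c
#include <assert.h>
#include <stdint.h>

#define SYNTH_STEP_KHZ	2500u	/* 2.5MHz channel spacing, in kHz */

/* Hypothetical helper: nearest frequency the synth can represent. */
uint32_t
synth_round_khz(uint32_t freq_khz)
{
	return ((freq_khz + SYNTH_STEP_KHZ / 2) / SYNTH_STEP_KHZ *
	    SYNTH_STEP_KHZ);
}
```

For a quarter rate channel such as 4942MHz this yields 4942.5MHz, that is, a centre frequency 500kHz away from the nominal one, while channels already on the 2.5MHz grid are unchanged.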
If all NICs were RF5112 or later, they'll inter-operate fine - they all
program the same. (And for reference, only the latest revision of the
RF5111 NICs do it, but the driver doesn't yet implement the programming.)
However:
* The AR5416 programming didn't at all implement the fractional synth
work around as above;
* The AR9280 programming actually programmed the accurate centre frequency
and thus wouldn't inter-operate with the legacy NICs.
So this patch:
* Implements the 4.9GHz PSB fractional synth workaround, exactly as the
RF5112 and later code does;
* Adds a very dirty workaround from me to calculate the same channel
centre "fudge" to the AR9280 code when operating on fractional frequencies
in 5GHz.
HOWEVER however:
It is disabled by default. Since the HAL didn't implement this feature,
it's highly unlikely that the AR5416 and AR928x have been tested at these
centre frequencies. There's a lot of regulatory compliance testing required
before a NIC can have this enabled - checking for centre frequency,
for drift, for synth spurs, for distortion and spectral mask compliance.
There's likely a lot of other things that need testing so please don't
treat this as an exhaustive, authoritative list. There's a perfectly good
process out there to get a NIC certified by your regulatory domain, please
go and engage someone to do that for you and pay the relevant fees.
If a company wishes to grab this work and certify existing 802.11n NICs
for work in these bands then please be my guest. The AR9280 works fine
on the correct fractional synth channels (49x2 and 49x7MHz) so you don't
need to get certification for that. But the 500kHz offset hack may have
the above issues (spur, distortion, accuracy, etc) so you will need to
get the NIC recertified.
Please note that it's also CARD dependent. Just because the RF synth
will behave correctly doesn't at all mean that the card design will also
behave correctly. So no, I won't enable this by default if someone
verifies a specific AR5416/AR9280 NIC works. Please don't ask.
Tested:
I used the following NICs to do basic interoperability testing at
half and quarter rates. However, I only did very minimal spectrum
analyser testing (mostly "am I about to blow things up" testing;
not "certification ready" testing):
* AR5212 + AR5112 synth
* AR5413 + AR5413 synth
* AR5416 + AR5133 synth
* AR9280
After further discussion, instead of pretending to use
uid_t and gid_t as upstream Solaris and linux try to, we
are better using u_int, which is in fact what the code
can handle and best approaches the range of values used
by uid and gid.
Discussed with: bde
Reviewed by: bde
net80211 node power save state.
* Add an ATH_NODE_UNLOCK_ASSERT() check
* Add a new node field - an_is_powersave
* Pause/unpause the queue based on the node state
* Attempt to handle net80211 concurrency issues so the queue
doesn't get paused/unpaused more than once at a time from
the net80211 power save code.
Whilst here (and breaking my usual rule), set CLRDMASK when a queue
is unpaused, regardless of whether the queue has some pending traffic.
This means the first frame from that TID (now or later) will have
CLRDMASK set.
Also whilst here, bump the swretrymax counters whenever the
filtered frames code expires a frame. Again, breaking my rule, but
this is just a statistics thing rather than a functional change.
This doesn't fix ps-poll (but it doesn't break it too much worse
than it is at the present) or correcting the TID updates.
That's next on the list.
Tested:
* AR9220 AP (Atheros AP96 reference design)
* Macbook Pro and LG Optimus 1 Android phone, both setting
and clearing power save state (but not using PS-POLL.)
The base netmap pointer and offsets involved are provided by the kernel
side of the netmap interface and will have appropriate alignment.
Sponsored by: ADARA Networks
MFC After: 2 weeks
When performing a non-blocking read(2) on a TTY while no data is
available, we should return EAGAIN. But if there's a modem disconnect,
we should return 0. Right now we only return 0 when doing a blocking
read, which is wrong.
MFC after: 1 month
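The intended behaviour when no data is queued can be written as a small decision function. This is a sketch, not the TTY layer's code; the enum and tty_read_nodata() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcomes of read(2) on a TTY with an empty input queue. */
enum tty_read_action {
	TTY_READ_EOF,		/* read(2) returns 0 */
	TTY_READ_EAGAIN,	/* read(2) fails with EAGAIN */
	TTY_READ_SLEEP		/* blocking read waits for data */
};

enum tty_read_action
tty_read_nodata(bool carrier_gone, bool nonblocking)
{
	/* A disconnect means end-of-file for both kinds of read. */
	if (carrier_gone)
		return (TTY_READ_EOF);
	if (nonblocking)
		return (TTY_READ_EAGAIN);
	return (TTY_READ_SLEEP);
}
```

The bug described above corresponds to testing nonblocking before carrier_gone, which hides the disconnect from non-blocking readers.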
Update some of the comments. In particular, use "sleep" in preference to
"block" where appropriate.
Eliminate some unnecessary casts.
Make a few whitespace changes for consistency.
Reviewed by: kib
MFC after: 3 days
When creating a client with clnt_tli_create, it uses strdup to copy
strings for these fields if nconf is passed in. clnt_dg_destroy frees
these strings already. Make sure clnt_vc_destroy frees them in the same
way.
This change matches the reference (OpenSolaris) implementation.
Tested by: David Wolfskill
Obtained from: Bull GNU/Linux NFSv4 Project (libtirpc)
MFC after: 2 weeks
This turns ieee80211_node_pwrsave(), ieee80211_sta_pwrsave() and
ieee80211_recv_pspoll() into methods.
The intent is to let drivers override these and tie into the power save
management pathway.
For ath(4), this is the beginning of forcing a node software queue to
stop and start as needed, as well as supporting "leaking" single frames
from the software queue to the hardware.
Right now, ieee80211_recv_pspoll() will attempt to transmit a single frame
to the hardware (whether it be a data frame on the power-save queue or
a NULL data frame) but the driver may have hardware/software queued frames
queued up. This initial work is an attempt at providing the hooks required
to implement correct behaviour.
Allowing ieee80211_node_pwrsave() to be overridden allows the ath(4)
driver to pause and unpause the entire software queue for a given node.
It doesn't make sense to transmit anything whilst the node is asleep.
Please note that there are other corner cases to correctly handle -
specifically, setting the MORE data bit correctly on frames to a station,
as well as keeping the TIM updated. Those particular issues can be
addressed later.
the CAM "enc" peripheral (part of ses(4)). Previously the two modules
used the same name, so only one was included in a linked kernel causing
enc0 to not be created if you added IPSEC to GENERIC. The new module
name follows the pattern of other network interfaces (e.g. "if_loop").
MFC after: 1 week
Both functions need to obtain lock on the found PCB, and they can't do
classic inter-lock with the PCB hash lock, due to lock order reversal.
To keep the PCB stable, these functions put a reference on it and after PCB
lock is acquired drop it. If the reference was the last one, this means
we've raced with in_pcbfree() and the PCB is no longer valid.
This approach works okay only if we are acquiring writer-lock on the PCB.
In case of reader-lock, the following scenario can happen:
- 2 threads locate pcb, and do in_pcbref() on it.
- These 2 threads drop the inp hash lock.
- Another thread comes to delete pcb via in_pcbfree(), it obtains hash lock,
does in_pcbremlists(), drops hash lock, and runs in_pcbrele_wlocked(), which
doesn't free the pcb due to two references on it. Then it unlocks the pcb.
- 2 aforementioned threads acquire reader lock on the pcb and run
in_pcbrele_rlocked(). One gets 1 from in_pcbrele_rlocked() and continues,
second gets 0 and considers pcb freed, returns.
- The thread that got 1 continues working with detached pcb, which later
leads to panic in the underlying protocol level.
To plug that problem an additional INPCB flag is introduced - INP_FREED. We
check for that flag in the in_pcbrele_rlocked() and if it is set, we pretend
that that was the last reference.
Discussed with: rwatson, jhb
Reported by: Vladimir Medvedkin <medved rambler-co.ru>
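The effect of the new flag can be modelled without any locking. The sketch below is a single-threaded toy, not the in_pcb code: struct pcb_model and both functions are hypothetical, the bit value of INP_FREED is illustrative, and a return value of 1 is taken to mean "treat the PCB as gone".

```c
#include <assert.h>
#include <stdbool.h>

#define INP_FREED	0x01	/* illustrative bit value */

/* Hypothetical, single-threaded model of a PCB; no real locking. */
struct pcb_model {
	unsigned refcount;
	int	 flags;
	bool	 destroyed;	/* stands in for the memory being released */
};

/* Models in_pcbfree(): mark the PCB freed, drop the owner's reference. */
void
pcb_model_free(struct pcb_model *p)
{
	p->flags |= INP_FREED;
	if (--p->refcount == 0)
		p->destroyed = true;
}

/* Models in_pcbrele_rlocked(): 1 means the caller must not use the PCB. */
int
pcb_model_rele_rlocked(struct pcb_model *p)
{
	assert(p->refcount > 0);
	if (--p->refcount == 0) {
		p->destroyed = true;
		return (1);
	}
	/* Not the last reference, but the PCB is already detached. */
	return ((p->flags & INP_FREED) ? 1 : 0);
}
```

Replaying the scenario above with a refcount of 3 (owner plus two lookups): after pcb_model_free() both readers get 1, so neither continues with a detached PCB.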
w.r.t. a Linux NFS client doing a krb5 NFS mount against the
FreeBSD server. We determined this was a Linux bug:
http://www.spinics.net/lists/linux-nfs/msg32466.html, however
the mount failed to work, because the Destroy operation with a
bogus encrypted checksum destroyed the authenticator handle.
This patch changes the rpcsec_gss code so that it doesn't
Destroy the authenticator handle for this case and, as such,
the Linux mount will work.
Tested by: Attila Bogar and Herbert Poeckl
MFC after: 2 weeks
this some compilers will place a cmp instruction before the atomic operation
and expect to be able to use the result afterwards. By adding "cc" to the
list of used registers we tell the compiler to not do this.
- Write method of a queue now is void; length of item is taken
as queue property.
- Write methods don't need to know about mbuf, supply just buf
to them.
- No need for safe queue iterator in pfsync_sendout().
Obtained from: OpenBSD
disk_open(). Very often this is called several times for one file.
This leads to reading partition table metadata for each call. To
reduce the number of disk I/O we have a simple block cache, but it
is very dumb and more than half of I/O operations related to reading
metadata miss this cache.
Introduce new cache layer to resolve this problem. It is independent
and doesn't need initialization like bcache, and will work by default
for all loaders which use the new DISK API. A successful disk_open()
call to each new disk or partition produces new entry in the cache.
Even more, when disk was already open, now opening of any nested
partitions does not require reading top level partition table.
So, where without this cache partition table metadata was read around
20-50 times during boot, it is now read only once. This affects booting
from GPT and MBR from the UFS.
pf_purge_expired_states().
Now pf purging daemon stores the current hash table index on stack
in pf_purge_thread(), and supplies it to next iteration of
pf_purge_expired_states(). The latter returns new index back.
The important change is that whenever pf_purge_expired_states() wraps
around the array it returns immediately. This makes our knowledge about
status of states expiry run more consistent. Prior to this change it
could happen that n-th run stopped on i-th entry, and returned (1) as
full run complete, then next (n+1) full run stopped on j-th entry, where
j < i, and that broke the mark-and-sweep algorithm that saves referenced
rules. A referenced rule was freed, and this later led to a crash.
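The "return immediately on wrap" behaviour can be sketched as follows. This is a model of the index handling only, under assumed names: purge_scan() and its nrows parameter are hypothetical, and the actual expiry work is elided.

```c
#include <assert.h>

/*
 * Hypothetical model of the scan: visit up to maxcheck rows starting at
 * row i of a table with nrows rows. Returns the row to resume from, or 0
 * as soon as the scan steps past the last row.
 */
unsigned
purge_scan(unsigned i, int maxcheck, unsigned nrows)
{
	assert(i < nrows);
	while (maxcheck > 0) {
		/* ... expire states hashed to row i here ... */
		if (++i >= nrows)
			return (0);	/* wrapped: this full pass is complete */
		maxcheck--;
	}
	return (i);
}
```

In this model the caller keeps the returned index on its stack and passes it to the next call; a return of 0 marks the end of exactly one full pass, so a sweep that relies on "every row was visited since the last mark" is never run early.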
tree used it incorrectly, which led to inaccurate overrated
if_obytes accounting. The drbr(9) used to update ifnet stats on
drbr_enqueue(), which is not accurate since enqueuing doesn't
imply successful processing by driver. Dequeuing doesn't mean
that either. Most drivers also called drbr_stats_update() which did
accounting again, leading to doubled if_obytes statistics. And
in case of severe transmitting, when a packet could be several
times enqueued and dequeued it could have been accounted several
times.
o Thus, make drbr(9) API thinner. Now drbr(9) merely chooses between
ALTQ queueing or buf_ring(9) queueing.
- It doesn't touch the buf_ring stats any more.
- It doesn't touch ifnet stats anymore.
- drbr_stats_update() no longer exists.
o buf_ring(9) handles its stats itself:
- It handles br_drops itself.
- br_prod_bytes stats are dropped. Rationale: no one ever
reads them but update of a common counter on every packet
negatively affects performance due to excessive cache
invalidation.
- buf_ring_enqueue_bytes() reduced to buf_ring_enqueue(), since
we no longer account bytes.
o Drivers handle their stats themselves: if_obytes, if_omcasts.
o mlx4(4), igb(4), em(4), vxge(4), oce(4) and ixv(4) no longer
use drbr_stats_update(), and update ifnet stats themselves.
o bxe(4) was the most correct driver, it didn't call
drbr_stats_update(), thus it was the only driver accurate under
moderate load. Now it also maintains stats itself.
o ixgbe(4) had already taken stats from hardware, so just
- drop software stats updating.
- take multicast packet count from hardware as well.
o mxge(4) just no longer needs NO_SLOW_STATS define.
o cxgb(4), cxgbe(4) need no change, since they obtain stats
from hardware.
Reviewed by: jfv, gnn
bits under #ifdef _KERNEL but leave definitions for various structures
defined by standards ($PIR table, SMAP entries, etc.) available to
userland.
- Consolidate duplicate SMBIOS table structure definitions in ipmi(4)
and smbios(4) in <machine/pc/bios.h> and make them available to
userland.
MFC after: 2 weeks
If you have a binary on a filesystem which is also mounted over by
nullfs, you could execute the binary from the lower filesystem, or
from the nullfs mount. When executed from lower filesystem, the lower
vnode gets VV_TEXT flag set, and the file cannot be modified while the
binary is active. But, if executed as the nullfs alias, only the
nullfs vnode gets VV_TEXT set, and you still can open the lower vnode
for write.
Add a set of VOPs for the VV_TEXT query, set and clear operations,
which are correctly bypassed to lower vnode.
Tested by: pho (previous version)
MFC after: 2 weeks
number is not exactly specified. When the disk has MBR, also try to read
BSD label after ptable_getpart() call. When the disk has GPT, also set
d_partition to 255. Mostly, this is how it worked before.
mutexes held and the topology lock is an sx lock.
The topology lock was there to protect traversing through the list of providers
of disk's geom, but it seems that disk's geom has always exactly one provider.
Change the code to call g_wither_provider() for this one provider, which is
safe to do without holding the topology lock and assert that there is indeed
only one provider.
Discussed with: ken
MFC after: 1 week
ngthread properly set the item's depth to 1. In particular, prior to this
change if ng_snd_item failed to acquire a lock on a node, the item's depth
would not be set at all. This fix ensures that the error code from rcvmsg/
rcvdata is properly passed back to the apply callback. For example, this
fixes a bug where an error from rcvmsg/rcvdata would not previously
propagate back to a libnetgraph consumer when the message was queued.
Reviewed by: mav
MFC after: 1 month
Sponsored by: Sandvine Incorporated
The attempt to merge changes from the linux libtirpc caused
rpc.lockd to exit after startup under unclear conditions.
After many hours of selective experiments and inconsistent results
the conclusion is that it's better to just revert everything and
restart in a future time with a much smaller subset of the
changes.
____
MFC after: 3 days
Reported by: David Wolfskill
Tested by: David Wolfskill
I have to note that POSIX is simply stupid in how it describes O_EXEC/fexecve
and friends. Yes, not only inconsistent, but stupid.
In the open(2) description, O_RDONLY flag is described as:
O_RDONLY Open for reading only.
Taken from:
http://pubs.opengroup.org/onlinepubs/9699919799/functions/open.html
Note "for reading only". Not "for reading or executing"!
In the fexecve(2) description you can find:
The fexecve() function shall fail if:
[EBADF]
The fd argument is not a valid file descriptor open for executing.
Taken from:
http://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html
As you can see the function shall fail if the file was not open with O_EXEC!
And yet, if you look closer you can find this mess in the exec.html:
Since execute permission is checked by fexecve(), the file description
fd need not have been opened with the O_EXEC flag.
Yes, O_EXEC flag doesn't have to be specified after all. You can open a file
with O_RDONLY and you will still be able to fexecve(2) it.
global variables are placed. When a module is loaded by link_elf
linker its variables from "set_vnet" linker set are copied to the
kernel "set_vnet" ("modspace") and all references to these variables
inside the module are relocated accordingly.
The issue is when a module is loaded that has references to global
variables from another, previously loaded module: these references are
not relocated so an invalid address is used when the module tries to
access the variable. The example is V_layer3_chain, defined in ipfw
module and accessed from ipfw_nat.
The same issue is with DPCPU variables, which use "set_pcpu" linker
set.
Fix this making the link_elf linker on a module load recognize
"external" DPCPU/VNET variables defined in the previously loaded
modules and relocate them accordingly. For this set_pcpu_list and
set_vnet_list are used, where the addresses of modules' "set_pcpu" and
"set_vnet" linker sets are stored.
Note, archs that use link_elf_obj (amd64) were not affected by this
issue.
Reviewed by: jhb, julian, zec (initial version)
MFC after: 1 month
of reviewing of r231025.
Unlike other options from this family TCP_KEEPCNT doesn't specify
time interval, but a count, thus parameter supplied doesn't need
to be multiplied by hz.
Reported & tested by: amdmi3
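The distinction can be sketched as a conversion helper. The enum values and keep_opt_value() are hypothetical names for illustration, not the real socket option constants or the stack's code; hz is the number of ticks per second.

```c
#include <assert.h>

/* Hypothetical option identifiers, not the real socket option values. */
enum keep_opt { KEEP_IDLE, KEEP_INTVL, KEEP_INIT, KEEP_CNT };

/* Convert a user-supplied option value to what the stack would store. */
unsigned
keep_opt_value(enum keep_opt opt, unsigned val, unsigned hz)
{
	switch (opt) {
	case KEEP_CNT:
		return (val);		/* a count of probes: no scaling */
	default:
		return (val * hz);	/* seconds converted to ticks */
	}
}
```

With hz = 1000, a count of 8 scaled by mistake would be stored as 8000 probes, which is the kind of error the fix above removes.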
This doesn't specifically fix the issue(s) I'm seeing in this 2GHz
environment (where setting/increasing spur immunity causes OFDM restart
errors to skyrocket through the roof; but leaving it at 0 would leave
the environment cleaner..)
Pointy-hat-to: me, for committing this broken code in the first place.
problematic because some callers to pmap_kextract() expect its
implementation to be lock-less. In particular, uma_dbg_alloc() implicitly
requires this. Otherwise, lock-order reversals occur between pmap locks and
UMA zone locks. So, this change introduces a lock-less implementation of
pmap_kextract().
Disable recursion on the pvh global lock in the new armv6 pmap. While
recursion on this lock occurs in the old arm pmap, it thankfully doesn't
occur in the armv6 pmap.
Tested by: jmg
- Use a dedicated task to handle deferred transmits from the if_transmit
method instead of reusing the existing per-queue interrupt task.
Reusing the per-queue interrupt task could result in both an interrupt
thread and the taskqueue thread trying to handle received packets on a
single queue resulting in out-of-order packet processing and lock
contention.
- Don't define ixgbe_start() at all where if_transmit is used.
Tested by: Vijay Singh
Reviewed by: jfv
MFC after: 2 weeks
Device nodes are in the format /dev/led/isci.busX.portY.locate.
Sponsored by: Intel
Requested by: Paul Maulberger <paul dot maulberger at gmx dot de>
MFC after: 1 week
things like EAPOL frames make it out.
After a whole bunch of hacking/testing, I discovered that they weren't
being early-dropped by the stack (but I should look at ensuring that
later..) but were even making it to the hardware transmit queue.
They were mostly even being received by the remote end. However, the
remote end was completely ignoring them.
This didn't happen under 150-170MBit TCP tests as I'm guessing the TX
queue stayed very busy and the STA didn't do any scanning. However, when
doing 100Mbit/s of TCP traffic, the STA would do background scanning -
which involves it coming in and out of powersave mode with the AP.
Now, this is a total and utter hack around the real problems, which are:
* I need to implement proper power save handling and integrate it into
the filtered frames support, so the driver/stack doesn't send frames
whilst the station is actually in sleep;
* .. but frames were actually making it to the STA (macbook pro) and
the AP did receive an ACK; but a tcpdump on the receiving side showed
the EAPOL frame never made it. So the stack was dropping it for
some reason;
* Importantly - the EAPOL frames are currently going into the non-QoS
TID, which maps to the BE queue and is susceptible to that queue being
busy doing other things, but;
* There's other traffic going on in the non-QoS TID from other contexts
when scanning is going on and it's possible there's some races causing
sequence number/IV issues, but;
* Importantly importantly, I think the interaction with TID 16 multicast
traffic in power save mode is causing issues - since I -believe- the
sequence number space being used by the EAPOL frames on TID 16 overlaps
with the multicast frames that have sequence numbers allocated and
are then stuffed on the cabq. Since with EAPOL frames being in TID 16
and queued to the BE queue, it's going to be waiting to be serviced
with all of the aggregate traffic going on - and if the CABQ gets
emptied beforehand, those TID 16 multicast frames with sequence numbers
will go out beforehand.
Now, there's quite likely a bunch of "stuff happening slightly out of
sequence" going on due to the nature of the TX path (read: lots of
overlapping and concurrent ath_start() and ath_raw_xmit() calls going
on, sigh) but I thought I had caught them all and stuffed each TID TX
behind a lock (that lasted as long as it needed to in order to get
the frame onto the relevant destination queue - thus keeping things
in order.)
Unfortunately the last problem is the big one and I'm going to stare at
it some more. If it _is_
So this is a work around for now to ensure that EAPOL frames actually
make it out before any other stuff in the non-QoS TID and HOPEFULLY
before the CABQ gets active.
I'm now going to spend a little time in the TX path figuring out exactly
why the sender is rejecting things. There's two (well, three if you count
EAPOL contents invalid) possibilities:
* The sequence number is out of order (ie, something else like the multicast
traffic on CABQ) is going out first on TID 16;
* The CCMP IV is out of order (similar to above - but less likely, as the
TX key for multicast traffic is different to unicast traffic);
* EAPOL contents strangely invalid.
AP: Ubiquiti RSPRO, AR9160/AR9220 NICs
STA: Macbook Pro, Broadcom 11n NIC
getmq_read() and getmq_write() respectively, just like sys_kmq_timedreceive()
and sys_kmq_timedsend().
Sponsored by: FreeBSD Foundation
MFC after: 2 weeks
The requirement (implied by the KASSERT in tap_destroy) that the tap is
closed isn't valid; destroy_dev will block in devdrn while other threads
are in d_* functions.
Note: if_tun had the same issue, addressed in SVN revisions r186391,
r186483 and r186497. The use of the condvar there appears to be
redundant with the functionality provided by destroy_dev.
Sponsored by: ADARA Networks
Reviewed by: dwhite
MFC after: 2 weeks
Well, in theory we can pass those two flags, because O_RDONLY is 0,
but we won't be able to read from a descriptor opened with O_EXEC.
Update the comment.
Sponsored by: FreeBSD Foundation
MFC after: 2 weeks
If O_EXEC is provided don't require CAP_READ/CAP_WRITE, as O_EXEC
is mutually exclusive to O_RDONLY/O_WRONLY/O_RDWR.
Without this change CAP_FEXECVE capability right is not enforced.
Sponsored by: FreeBSD Foundation
MFC after: 3 days
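The mapping from open flags to required capability rights can be sketched like this. All constants below are hypothetical bit values chosen for the example (they are not the real <fcntl.h> flags or capability rights), and rights_needed() is not the kernel's function.

```c
#include <assert.h>

/* Hypothetical open-mode bits; not the real <fcntl.h> values. */
#define X_RDONLY	0x0
#define X_WRONLY	0x1
#define X_RDWR		0x2
#define X_ACCMODE	0x3
#define X_EXEC		0x4

/* Hypothetical capability rights bits. */
#define R_READ		0x1
#define R_WRITE		0x2
#define R_FEXECVE	0x4

/* Rights a capability must carry for an open with the given flags. */
unsigned
rights_needed(int oflags)
{
	/* Exec-only opens are exclusive with the read/write modes. */
	if (oflags & X_EXEC)
		return (R_FEXECVE);
	switch (oflags & X_ACCMODE) {
	case X_RDONLY:
		return (R_READ);
	case X_WRONLY:
		return (R_WRITE);
	default:
		return (R_READ | R_WRITE);
	}
}
```

Because the read-only mode is 0, an exec-only open looks like "read-only plus exec" unless the exec bit is tested first, which is why the order of the checks matters.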
lock may be held.
Kim reported that the TID lock wasn't held when ath_tx_update_clrdmask()
was called. Well, the underlying hardware TXQ for that TID.
I'm betting it's the cabq stuff. ath_tx_xmit_normal() can be called
for both real and software cabq. For software cabq, the real destination
txq is different to the txq. So, the lock check will fail.
Reported by: Kim Culhan <w8hdkim@gmail.com>
"genunix". This will require us to modify externally created
DTrace scripts but makes logical sense for FreeBSD.
Requested by: rpaulo
MFC after: 2 weeks
offline in response to an INQUIRY command that does not retrieve vital
product data (I personally have observed the behaviour on an Adaptec 2405
and a 5805). Force the peripheral qualifier to "connected" so that upper
layers correctly recognize that a disk is present.
This bug was uncovered by r216236. Prior to that fix, aac(4) was
accidentally clearing the peripheral qualifier for all inquiry commands.
This fixes an issue where passthrough devices were not created for
disks behind aac(4) controllers suffering from the bug. I have
verified that if a disk is not present we still properly detect
that and do not create the passthrough device.
Sponsored by: Sandvine Incorporated
MFC after: 1 week
as controlled by kern.random.sys.harvest.swi. SWI harvesting feeds into
the interrupt FIFO and each event is estimated as providing a single bit of
entropy.
Reviewed by: markm, obrien
MFC after: 2 weeks
LUNs respectively. This removes a huge number of error messages
from CAM during bus scans.
Copied almost verbatim from mav's commit r237460.
Submitted by: Mike Tancsa <mike at sentex dot net>
MFC after: 3 days
immediately panics on boot with INVARIANTS enabled. The driver already
clearly expects to be able to recurse on this mutex - the main I/O
is always recursing on this lock.
Reported and tested by: Mike Tancsa <mike at sentex dot net>
MFC after: 1 week
This should eventually be unified with ATH_DEBUG() so I can get both
from one macro; that may take some time.
Add some new probes for TX and TX completion.
* use the correct frame status - although the completion descriptor is
the _last_ in the frame/aggregate, the status is currently stored in
the _first_ buffer.
* Print out ath_buf specific fields once, not per descriptor in an ath_buf.
at r230551.
Also while there, make sense polling use reported for each node separately
instead of reporting accumulated total status.
Submitted by: Barbara <barbara.freebsd@gmail.com> (1)
MFC after: 3 days
it's disabled.
The previous commit to enable CLRDMASK setting didn't do it at all
correctly for non-aggregate sessions - so the CLRDMASK bit would be
cleared and never re-set.
* move ath_tx_update_clrdmask() to be called by functions that setup
descriptors and queue frames to the hardware, rather than scattered
everywhere.
* Force CLRDMASK to be set on all non-aggregate session frames being
transmitted.
* Use ath_tx_normal_comp() now on non-aggregate session frames
that are queued via ath_tx_xmit_normal(). That way the TID hwq is
updated and they can trigger (eventual) filter frame queue resets
and software retransmits.
There's still a bit more work to do in this area to reverse the silly
short-sightedness on my part, however it's likely going to be better
to fix this now than just reverting the patch.
Thanks to people on the freebsd-wireless@ mailing list for promptly
pointing this out.
The following change caused rpc.lockd to exit after startup:
____
libtirpc: be sure to free cl_netid and cl_tp
When creating a client with clnt_tli_create, it uses strdup to copy
strings for these fields if nconf is passed in. clnt_dg_destroy frees
these strings already. Make sure clnt_vc_destroy frees them in the
same way.
____
MFC after: 3 days
Reported by: David Wolfskill
Tested by: David Wolfskill
adapter->dropped_pkts instead of if_ierrors because if_ierrors is
overwritten by hw stats collection.
Submitted by: Andrew Boyer <aboyer@averesystems.com>
Reviewed by: Jack F Vogel <jfv@freebsd.org>
MFC after: 2 weeks
doesn't exist on a dataset we are starting from. For example if we
have the following configuration:
tank
tank/foo
tank/foo@snap
tank/bar
tank/bar@snap
We can execute:
# zfs destroy -r tank@snap
even though tank@snap doesn't exist.
Unfortunately it is not possible to do the same with recursive rename:
# zfs rename -r tank@snap tank@pans
cannot open 'tank@snap': dataset does not exist
...until now. This change allows snapshots to be renamed recursively even if
the snapshot doesn't exist on the starting dataset.
Sponsored by: rsync.net
MFC after: 2 weeks
The code builds a map of regions that were freed. On every write the
code consults the map and eventually removes ranges that were freed
before, but are now overwritten.
Freed blocks are not TRIMed immediately. There is a tunable that defines
how many txg we should wait with TRIMming freed blocks (64 by default).
There is a low priority thread that TRIMs ranges when the time comes.
During TRIM we keep in-flight ranges on a list to detect colliding
writes - we have to delay writes that collide with in-flight TRIMs in
case something is reordered and the write reaches the disk before
the TRIM. We don't have to do the same for in-flight writes, as
colliding writes just remove ranges to TRIM.
Sponsored by: multiplay.co.uk
This work includes some important fixes and some improvements obtained
from the zfsonlinux project, including TRIMming entire vdevs on pool
create/add/attach and on pool import for spare and cache vdevs.
Obtained from: zfsonlinux
Submitted by: Etienne Dechamps <etienne.dechamps@ovh.net>
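The collision rule described above (delay a write that overlaps an in-flight TRIM) can be sketched in a few lines of C. This is only an illustration: struct byte_range, ranges_collide() and write_must_wait() are hypothetical names, and a flat array stands in for whatever list the real code keeps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical half-open byte range [start, end); not a ZFS type. */
struct byte_range {
	uint64_t start;
	uint64_t end;
};

/* Two half-open ranges collide when each starts before the other ends. */
bool
ranges_collide(struct byte_range a, struct byte_range b)
{
	assert(a.start <= a.end && b.start <= b.end);
	return (a.start < b.end && b.start < a.end);
}

/*
 * Return true if a write has to be delayed because it overlaps any TRIM
 * that is still in flight.
 */
bool
write_must_wait(const struct byte_range *inflight, size_t n,
    struct byte_range write)
{
	for (size_t i = 0; i < n; i++) {
		if (ranges_collide(inflight[i], write))
			return (true);
	}
	return (false);
}
```

A write that merely touches the end of an in-flight range does not count as a collision here, because the ranges are half-open.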
queues lock is acquired before the page lock is released, there is no
guarantee that the page will still be in that same page queue when
vm_page_requeue() is called.
Reported by: pho
In collaboration with: kib
MFC after: 3 days
gone rule. Optimise use of channels so that when a channel
is not ready another channel is used. Instead of using the SOF interrupt
use the system timer to drive the host statemachine. This might
give lower throughput and higher latency, but reduces the CPU usage
significantly. The DWC OTG host mode support should not be considered
for serious USB host controller applications. Some problems are still
seen with LOW speed USB devices.
Commit changes missed from r237435. Properly calculate the signal
trampoline addresses after the shared page is enabled. Handle FreeBSD
ABIs without shared page support too.
MFi386: revision 238792
Introduce curpcb magic variable.
Do this by checking if spa_namespace_lock is already held and not taking
it again in that case.
Add a comment explaining why that is done and why it is safe.
Reviewed by: pjd
MFC after: 24 days
It is possible that provider is destroyed while we are iterating over the
list.
Reported by: Brian Parkison <parkison@panzura.com>
Discussed with: phk
MFC after: 1 week
slot. This eventually results in exhaustion of the tid space, causing
new threads to get tid -1 as identifier.
The bad effect of having the thread id equal to -1 is that
UMTX_OP_UMUTEX_WAIT returns EFAULT for a lock owned by such thread,
because casuword cannot distinguish between literal value -1 read from
the address and -1 returned as an indication of faulted
access. _thr_umutex_lock() helper from libthr does not check for
errors from _umtx_op_err(2), causing an infinite loop in
mutex_lock_sleep().
We observed the JVM processes hanging and consuming an enormous amount of
system time on machines with approximately 100 days uptime.
Reported by: Mykola Dzham <freebsd levsha org ua>
MFC after: 1 week
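The ambiguity described above, where -1 can be either the value read or the fault indication, can be shown with a toy model. Both helpers below are hypothetical and are not the kernel's casuword(); a fault is modelled simply as a NULL address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy compare-and-swap with an in-band error: returns the old value at
 * *addr, or -1 if the access "faults" (modelled as addr == NULL).
 */
long
toy_cas_inband(long *addr, long expect, long newval)
{
	if (addr == NULL)
		return (-1);
	long old = *addr;
	if (old == expect)
		*addr = newval;
	return (old);
}

/*
 * Variant with an out-of-band status: the old value goes to *oldp and the
 * return value only reports whether the access succeeded.
 */
int
toy_cas_outofband(long *addr, long expect, long newval, long *oldp)
{
	assert(oldp != NULL);
	if (addr == NULL)
		return (-1);
	*oldp = *addr;
	if (*oldp == expect)
		*addr = newval;
	return (0);
}
```

With the in-band form, a lock word holding -1 and a faulted access look identical to the caller; the out-of-band form keeps the two cases apart.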
we are actually editing table, which means editing rules,
thus we need writer access to 'em.
Fix this by offloading the update of table to the same taskqueue,
we already use for flushing. Since the taskqueue's major task is now
overloading, and flushing is optional, do mechanical rename
s/flush/overload/ in the code related to the taskqueue.
Since overloading tasks do unsafe referencing of rules, provide
a bandaid in pf_purge_unlinked_rules(). If the latter sees any
queued tasks, then it skips purging for this run.
In table code:
- Assert any lock in pfr_lookup_addr().
- Assert writer lock in pfr_route_kentry().
there is no need to release and reacquire the pmap and pvh global locks
around calls to uma_zfree(). Recursion into the pmap simply won't occur.
Eliminate the use of M_USE_RESERVE. It is deprecated and, in fact, counter-
productive, meaning that it actually makes the memory allocation request
more likely to fail.
Eliminate the macros pmap_{alloc,free}_l2_dtable(). They are of limited
utility, and pmap_free_l2_dtable() was inconsistently used.
Tidy up pmap_init(). In particular, change the initialization of the PV
zone so that it doesn't span the initialization of the l2 and l2table zones.
Tested by: jmg
On single core devices set_stackptrs is only ever called with cpu = 0 in
initarm and will be identical to the existing function. On SMP this needs
to be implemented for sys/arm/mp_machdep.c, but the implementations are
identical for each SoC.
is performed on the vnode mapping which is wired in other address space.
While there, explicitly assert that the page is unwired and zero the
wire_count instead of subtracting. The condition is rechecked later in
vm_page_free(_toq) already.
Reported and tested by: zont
Reviewed by: alc (previous version)
MFC after: 1 week
frames to occur.
* Create a new function which will set the bf_flags CLRDMASK bit
if required.
* For raw frames, always set CLRDMASK.
* For BAR, ADDBA frames, always set CLRDMASK.
* For everything else, check if CLRDMASK needs to be set before
calling tx_setds() or tx_setds11n().
* When unpausing a queue or drain/resetting it, set tid->clrdmask=1
just to ensure traffic starts flowing.
What I need to do:
* Modify that function to _clear_ the CLRDMASK if it's not required,
or retried frames may have CLRDMASK set when they don't need to.
(Which isn't a huge deal, but..)
Whilst I'm here:
* ath_tx_normal_xmit() should really act like the AMPDU session TX
functions - any incomplete frames will end up being assigned
ath_tx_normal_comp() which will decrement tid->hwq_depth - but that
won't have been incremented.
So whilst I'm here, add a comment to do that.
* Fix the debug print function to be slightly clearer about things;
it's not a good sign when I can't interpret my own debugging output.
I've done some testing on AR9280/AR5416/AR9160 STA and AP modes.
stack.
There are unfortunately quite a few odd cases in BAR TX and BAR TX
retransmission that I haven't yet fully diagnosed. So for now, add
this work-around so the resume() function isn't called too often,
decrementing pause to -1 (and causing things to stay paused.)
and owner_group strings that consist entirely of
digits, interpreting them as the uid/gid number.
This change was needed since new (>= 3.3) Linux
servers reply with these strings by default.
This change is mandated by the rfc3530bis draft.
Reported on freebsd-stable@ under the Subject
heading "Problem with Linux >= 3.3 as NFSv4 server"
by Norbert Aschendorff on Aug. 20, 2012.
Tested by: norbert.aschendorff at yahoo.de
Reviewed by: jhb
MFC after: 2 weeks
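A minimal sketch of the digits-only check described above, assuming a 32-bit id. owner_str_to_id() is a hypothetical helper, not the NFS client's actual routine; a false return means the caller should fall back to its usual name lookup.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * If str consists entirely of decimal digits and fits in 32 bits, store
 * its value in *idp and return true; otherwise return false.
 */
bool
owner_str_to_id(const char *str, uint32_t *idp)
{
	assert(str != NULL && idp != NULL);
	if (*str == '\0')
		return (false);
	for (const char *p = str; *p != '\0'; p++) {
		if (!isdigit((unsigned char)*p))
			return (false);
	}
	errno = 0;
	unsigned long long v = strtoull(str, NULL, 10);
	if (errno == ERANGE || v > UINT32_MAX)
		return (false);
	*idp = (uint32_t)v;
	return (true);
}
```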
a 4kb buffer if a request uses a buffer size of 0. (The Linux ioctl path
already did this.)
PR: kern/155658
Submitted by: Andreas Longwitz
MFC after: 1 week
#defines. This also has the advantage that it makes the names more
compact, and also allows us to correct the non-uniform naming of
the PCIM_LINK_* defines, making them all consistent amongst themselves.
This is a mostly mechanical rename:
s/PCIR_EXPRESS_/PCIER_/g
s/PCIM_EXP_/PCIEM_/g
s/PCIM_LINK_/PCIEM_LINK_/g
When this is MFC'd, #defines will be added for the old names to assist
out-of-tree drivers.
Discussed with: jhb
MFC after: 1 week
is done.
The aggregate path was definitely accessing 'ts' before it was actually
being assigned.
This had the side effect of over-filtering frames, since occasionally that
bit would be '1'.
Whilst here, do the same thing in the non-aggregate completion function -
as calling the filter function may also invalidate bf.
Pointy hat to: adrian, for not noticing this over many, many code reviews.
This fixes issue in nvmecontrol(8), where clang throws a cast-align
warning when casting a __packed structure pointer to a uint32_t
pointer as part of printing raw hex output.
Reported by: dhw
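One common way to avoid this class of warning is to copy each word out with memcpy() instead of casting the structure pointer. The sketch below shows that technique only; it is not necessarily what the commit does, and struct sample_data, load_word() and print_hex_words() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical packed record standing in for a device data structure. */
struct sample_data {
	uint8_t  tag;
	uint32_t value;
	uint8_t  pad[3];
} __attribute__((__packed__));

/*
 * Fetch the idx-th 32-bit word of a buffer without casting to uint32_t *.
 * memcpy() makes no alignment assumption about src.
 */
uint32_t
load_word(const void *src, size_t idx)
{
	uint32_t w;

	memcpy(&w, (const uint8_t *)src + idx * sizeof(w), sizeof(w));
	return (w);
}

/* Print a buffer as raw hex, one 32-bit word per line. */
void
print_hex_words(const void *src, size_t len)
{
	assert(len % sizeof(uint32_t) == 0);
	for (size_t i = 0; i < len / sizeof(uint32_t); i++)
		printf("%03zx: %08x\n", i * sizeof(uint32_t),
		    (unsigned int)load_word(src, i));
}
```

A caller would pass the structure directly, for example print_hex_words(&d, sizeof(d)) for a struct sample_data d.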
'path' argument of ofw_parsedev() if devspec refers to a raw device with no path.
For example, `ls /pci@1f,0/ide@d/disk@0,0:a/` works fine, while
`ls /pci@1f,0/ide@d/disk@0,0:a` panicked before this change.
altq_add() and its descendants. Currently altq(4) in FreeBSD is configured
via pf(4) ioctls, which can't configure altq(4) w/o holding locks.
Fortunately, altq(4) code, in spite of using M_WAITOK, is ready to receive
NULL from malloc(9), so change is mostly mechanical. While here, utilize
M_ZERO instead of bzero().
A large redesign is needed to achieve M_WAITOK usage when configuring altq(4).
Or an alternative (not pf(4)) configuration interface should be implemented.
Reported by: pluknet
This is important to secure a small timeframe at boot time, when
network is already configured, but pf(4) is not yet.
PR: kern/171622
Submitted by: Olivier Cochard-Labbé <olivier cochard.me>
1) Ruleset parser uses a global variable for anchor stack.
2) When processing a wildcard anchor, matching anchors are marked.
To fix the first one:
o Allocate anchor processing stack on stack. To make this allocation
as small as possible, the following measures are taken:
- Maximum stack size reduced from 64 to 32.
- The struct pf_anchor_stackframe trimmed by one pointer - parent.
We can always obtain the parent via the rule pointer.
- When pf_test_rule() calls pf_get_translation(), the former lends
its stack to the latter, to avoid recursive allocation of 32 entries.
The second one appeared more tricky. The code that marks anchors was
added in OpenBSD rev. 1.516 of pf.c. According to commit log, the idea
is to enable the "quick" keyword on an anchor rule. The feature isn't
documented anywhere. The most obscure part of the 1.516 was that code
examines the "match" mark on a just processed child, which couldn't be
put here by current frame. Since this wasn't documented even in the
commit message and functionality of this is not clear to me, I decided
to drop this examination for now. The rest of 1.516 is redone in a
thread safe manner - the mark isn't put on the anchor itself, but on
current stack frame. To avoid growing stack frame, we utilize LSB
from the rule pointer, relying on kernel malloc(9) returning
pointer-aligned addresses.
Discussed with: dhartmei
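The trick of keeping the mark in the least significant bit of the rule pointer can be sketched as follows. All names here (struct toy_rule, TOY_MATCH_BIT and the frame_*() helpers) are hypothetical; the only assumption taken from the text is that the pointer is aligned, so its low bit is always zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a rule structure; only its alignment matters. */
struct toy_rule {
	int nr;
};

#define	TOY_MATCH_BIT	((uintptr_t)1)

/* Store the rule pointer in a frame slot with the mark cleared. */
uintptr_t
frame_set_rule(struct toy_rule *r)
{
	/* Relies on the address being at least 2-byte aligned. */
	assert(((uintptr_t)r & TOY_MATCH_BIT) == 0);
	return ((uintptr_t)r);
}

/* Set the "match" mark without disturbing the pointer bits. */
uintptr_t
frame_mark_match(uintptr_t slot)
{
	return (slot | TOY_MATCH_BIT);
}

bool
frame_is_matched(uintptr_t slot)
{
	return ((slot & TOY_MATCH_BIT) != 0);
}

/* Recover the original pointer by masking the mark off. */
struct toy_rule *
frame_get_rule(uintptr_t slot)
{
	return ((struct toy_rule *)(slot & ~TOY_MATCH_BIT));
}
```

Because the mark lives in the frame's own slot rather than in the shared anchor, two threads walking the same ruleset do not see each other's marks.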
The hardware can optionally "filter" frames if successive transmissions
to a given node (ie, "entry in the keycache") fail. That way the hardware
can implement a kind of early abort of all the other frames queued to
that destination, rather than simply trying to TX each frame to that
destination (and failing.)
The background:
* If a frame comes back as being filtered, the hardware didn't try to
TX it (or it was outside the TX burst opportunity.) So, take it as a hint
that some (but not all, see below) frames to the destination may be
filtered.
* If the CLRDMASK bit is set in a TX descriptor, the "filter to this
destination" bit in the keycache entry is cleared and TX to that host
will be unconditionally retried.
* Right now everything has the CLRDMASK bit set, so filtered frames
tend to be aggregates and frames that fall outside of the WME burst
window. It was a bit worse in the past as I had messed up the TX
flags and CLRDMASK wasn't being set on aggregate frames.
The annoying bits:
* It's easy (ish) to do for aggregate session frames - firstly, they
can be retried in any order as long as they're within the BAW, and
there's already a bunch of infrastructure tracking how many frames
the TID has queued to the hardware (tid->hwq_depth.) However, for
frames that bypassed the software queue, hwq_depth doesn't get
incremented. I'll fix that in a subsequent commit.
* For non-aggregate session frames, the only retries that can occur
are ones for sequence numbers that haven't successfully been TXed yet.
Since there's no re-ordering going on in non-aggregate sessions, if any
subsequent seqno frames make it out, any filtered frames before that
seqno need to be dropped.
Hence why this initially is just for aggregate session frames.
* Since there may be intermediary frames to the destination that
have CLRDMASK set - for example, any directly dispatched management
frames to that destination - it's possible that there will be some
filtered frames followed up by some non filtered frames. Thus,
it can't be assumed that once you see a filtered frame for the given
destination node, all subsequent frames for all TIDs will be filtered.
Ok, with that in mind:
* Create a per-TID filtered frame queue for frames that the hardware
returns as filtered.
* Track filtered frames per-tid, rather than per-node. It just makes
the locking much easier.
* When a filtered frame appears in the completion function, the node
transitions to "filtered", and all subsequent completed error frames
(filtered or otherwise) are put on the filtered frame queue. The TID
is paused once (during the transition from non-filtered to filtered).
* If a filtered frame retry count exceeds SWMAX_RETRIES, a BAR should be
sent.
* Once all the frames queued to the hardware for the given filtered frame
TID have completed, transition back from filtered to non-filtered, which
means pre-pending all the filtered frames onto the head of the software
queue, clearing the filtered frame state and unpausing the TID.
Things get quite hairy around handling completion (aggr, non-aggr, norm,
direct-dispatched frames to a hardware queue); whether it's an "error",
"cleanup" or "BAR" state as well as filtered, which order to do things
in (eg do filtered BEFORE checking for BAR, as the filter completion
may be needed to actually transmit a BAR frame.)
This work has definitely reminded me that I have to tidy up all the locking
and remove some of the ridiculous lock/unlock/lock/unlock going on in the
completion functions.
It's also reminded me that I should really split out TID versus hardware TXQ
locking, even if the underlying locking is still the destination hardware TXQ.
Finally, this is all pre-requisite for working on AP mode power save support
(PS-POLL, uAPSD) as well as improving performance to misbehaving nodes (as
they can transition into filter mode, stopping any TX until everything has
caught up.)
Finally (ish) - this should also be done for non-aggregate sessions as
there are still plenty of laptops and mobile devices that don't speak
802.11n but do wish for stable, useful power save AP support where packets
aren't simply dropped. This requires software retransmission for
non-aggregate sessions to be implemented, which includes the caveats I've
mentioned above.
Finally finally - this doesn't yet do anything about the CLRDMASK bit in the
TX descriptor. That's still unconditionally set to 1. I'll debug the
current work (mostly ensuring I haven't busted up the hairy transitions
between BAR, filtered, error (all frames in an aggregate failing) and
cleanup (when transitioning from aggregation -> non-aggregation.))
Finally finally finally - this is all original work by yours truly, rather
than ported from the Atheros internal driver codebase or Linux ath9k.
Tested:
* AR9280, AR5416 in STA mode
* AR9280, AR9130 in hostap mode
* Lots and lots of iperf testing in very marginal and non-marginal conditions,
complete with inducing filtered frames + BAR TX conditions.
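The per-TID transitions listed under "Ok, with that in mind" can be modelled with plain counters. This is a rough sketch, not the driver code: struct toy_tid and toy_tid_complete() are hypothetical, queues are reduced to lengths, and the retry limit and BAR handling are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-TID bookkeeping; field names are illustrative only. */
struct toy_tid {
	int  hwq_depth;   /* frames currently queued to the hardware */
	int  swq_len;     /* frames waiting on the software queue */
	int  filtq_len;   /* frames parked on the filtered frame queue */
	int  paused;      /* pause reference count */
	bool filtered;    /* true while in the filtered state */
};

/*
 * Account for one frame completing from the hardware queue.
 * hw_filtered: the hardware reported the frame as filtered.
 * failed: the frame completed with some other error.
 */
void
toy_tid_complete(struct toy_tid *t, bool hw_filtered, bool failed)
{
	assert(t->hwq_depth > 0);
	t->hwq_depth--;

	/* First filtered frame: enter the filtered state and pause once. */
	if (hw_filtered && !t->filtered) {
		t->filtered = true;
		t->paused++;
	}

	/* While filtered, every errored frame is parked for a later retry. */
	if (t->filtered && (hw_filtered || failed))
		t->filtq_len++;

	/* Hardware queue drained: requeue parked frames and unpause. */
	if (t->filtered && t->hwq_depth == 0) {
		t->swq_len += t->filtq_len;
		t->filtq_len = 0;
		t->filtered = false;
		t->paused--;
	}
}
```

The pause count goes up exactly once per filtered episode and comes back down once, which is the property the text calls out.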
Use of __builtin_constant_p in a function that is only called via
a pointer is a good example of how out-of-date it was.
Suggested by: bde
MFC after: 1 week
Since all attribute values start at 8-byte aligned boundary, we would
previously incorrectly calculate dn_bonuslen if any attribute but the
last had a variable-length value with length not multiple of 8.
Reported by: Nicolas Rachinsky <fbsd-mas-0@ml.turing-complete.org>
Tested by: Nicolas Rachinsky <fbsd-mas-0@ml.turing-complete.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com> (for upstream)
MFC after: 2 weeks
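The length calculation described above can be illustrated with a small helper. roundup8() and packed_attrs_len() are hypothetical names. The source does not say whether the final value is padded too, so this sketch assumes it is not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	TOY_ALIGN	8u

/* Round len up to the next multiple of 8. */
size_t
roundup8(size_t len)
{
	return ((len + TOY_ALIGN - 1) & ~(size_t)(TOY_ALIGN - 1));
}

/*
 * Total space used by n attribute values laid out back to back, where
 * every value starts on an 8-byte boundary.  Every value except the last
 * is padded to a multiple of 8; the last contributes its raw length
 * (an assumption of this sketch).
 */
size_t
packed_attrs_len(const size_t *lens, size_t n)
{
	size_t total = 0;

	assert(lens != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		total += (i + 1 < n) ? roundup8(lens[i]) : lens[i];
	return (total);
}
```

For lengths 8, 5 and 16, a plain sum gives 29, while the aligned layout needs 32 bytes because the 5-byte value in the middle is padded to 8.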
directly in _rmlock.h and then including it (and its dependencies)
in pcpu.h. This leads to a few _*.h headers being included in pcpu.h
but this is not considered a big deal.
Really pc_rm_queue should be implemented as a dynamic member with
DPCPU interface, but we really want to keep the read acquisition as
fast as possible, so even the further pc_dynamic indirection should be
avoided, and the pollution is dealt like this.
Discussed with: jhb
MFC after: 1 week
support to FreeBSD. A full description of the overall functionality
being added is below. nvmexpress.org defines NVM Express as "an optimized
register interface, command set and feature set for PCI Express (PCIe)-based
Solid-State Drives (SSDs)."
This commit adds nvme(4) and nvd(4) driver source code and Makefiles
to the tree.
Full NVMe functionality description:
Add nvme(4) and nvd(4) drivers and nvmecontrol(8) for NVM Express (NVMe)
device support.
There will continue to be ongoing work on NVM Express support, but there
is more than enough to allow for evaluation of pre-production NVM Express
devices as well as soliciting feedback. Questions and feedback are welcome.
nvme(4) implements NVMe hardware abstraction and is a provider of NVMe
namespaces. The closest equivalent of an NVMe namespace is a SCSI LUN.
nvd(4) is an NVMe consumer, surfacing NVMe namespaces as GEOM disks.
nvmecontrol(8) is used for NVMe configuration and management.
The following are currently supported:
nvme(4)
- full mandatory NVM command set support
- per-CPU IO queues (enabled by default but configurable)
- per-queue sysctls for statistics and full command/completion queue
dumps for debugging
- registration API for NVMe namespace consumers
- I/O error handling (except for timeouts, see below)
- compilation switches for support back to stable-7
nvd(4)
- BIO_DELETE and BIO_FLUSH (if supported by controller)
- proper BIO_ORDERED handling
nvmecontrol(8)
- devlist: list NVMe controllers and their namespaces
- identify: display controller or namespace identify data in
human-readable or hex format
- perftest: quick and dirty performance test to measure raw
performance of NVMe device without userspace/physio/GEOM
overhead
The following are still work in progress and will be completed over the
next 3-6 months in rough priority order:
- complete man pages
- firmware download and activation
- asynchronous error requests
- command timeout error handling
- controller resets
- nvmecontrol(8) log page retrieval
This has been primarily tested on amd64, with light testing on i386. I
would be happy to provide assistance to anyone interested in porting
this to other architectures, but am not currently planning to do this
work myself. Big-endian and dmamap sync for command/completion queues
are the main areas that would need to be addressed.
The nvme(4) driver currently has references to Chatham, which is an
Intel-developed prototype board which is not fully spec compliant.
These references will all be removed over time.
Sponsored by: Intel
Contributions from: Joe Golio/EMC <joseph dot golio at emc dot com>
- Use callout(9) rather than timeout(9).
- Add a mutex as an I/O lock that protects the adapter and is used
for the I/O path.
- Add an sx lock as a configuration lock that protects the relationship
of configured volumes.
- Freeze the request queue when a DMA load is deferred with EINPROGRESS
and unfreeze the queue when the DMA callback is invoked.
- Explicitly poll the hardware while waiting to submit a command to
allow completed commands to free up slots in the command ring.
- Remove driver-wide 'initted' variable from mlx_*_fw_handshake() routines.
That state should be per-controller instead. Add it as an argument
since the first caller knows when it is the first caller.
- Remove explicit bus_space tag/handle and use bus_*() rather than
bus_space_*().
- Move duplicated PCI device ID probing into a mlx_pci_match() routine.
- Don't check for PCIM_CMD_MEMEN (the PCI bus will enable that when
allocating the resource) and use pci_enable_busmaster() rather than
manipulating the register directly.
Tested by: no one despite multiple requests (hope it works)
an NVidia Tegra 2 CPU.
Tegra 2 needs an external patch to pmap for atomic operations to work. Even
with this the kernel only gets to the mount root prompt. As such Tegra
support is considered experimental, however adding the kernel config will
help ensure the Tegra code builds.