When I ported this code from netbsd I was .. slightly mips74k greener.
I used writethrough because (a) it's what netbsd did, and (b) if I used
writethrough then things "didn't work."
Fast-forward a couple years, more MIPS hacking and a whole lot more
understanding of the bus APIs (the last few commits notwithstanding;
it's been a long week, ok?) and I have this working for arge,
argemdio, spi and ath. Hans has it working for USB. The ath barrier
code will come in a later commit.
This gets the routing throughput up from 220mbit -> 337mbit.
I'm sure the bridging throughput will be similarly improved.
Tested:
* QCA955x SoC, routing workload.
* use barriers in a slightly better fashion. You can blame this
glass of whiskey on putting barriers in the wrong spot. Grr adrian.
* steal/rewrite the mdio busy check from ag7100 from openwrt and
refactor the existing code out. This is .. more correct.
This seems to fix the boot-to-boot variation that I've been seeing
and it quietens the switch port status flapping.
Tested:
* QCA9558 SoC (AP135.)
Obtained from: Linux OpenWRT
This driver and the linux ag71xx driver both treat the transmit ring
as a circular linked list of descriptors. There's no "end" pointer
that is ever NULL - instead, it expects the MAC to hit a finished
descriptor (ARGE_DESC_EMPTY) and stop.
Now, since it's a circular buffer, we may end up with the hardware
hitting the beginning of our multi-descriptor frame before we've finished
setting it up. It then DMA's it in, starts sending it, and we finish
writing out the new descriptor. The hardware may then write its
completion for the next descriptor out; then we do, and when we next
read it it'll show up as "not done" and transmit completion stops.
This unfortunately manifests itself as the transmit queue always
being active and a massive TX interrupt storm. We need to actively
ACK packets back from the transmit engine and if we don't (eg because
we think the transmit isn't finished but it is) then the unit will
just keep generating interrupts.
I hit this finally with the below testing setup. This fixed it for me.
Strictly speaking I should put in a sync in between writing out all of
the descriptors and writing out that final descriptor.
Tested:
* QCA9558 SoC (AP135 reference board) w/ arge1 + vlans acting as a
router, and iperf -d (tcp, bidirectional traffic.)
Obtained from: Linux OpenWRT (ag71xx_main.c.)
The MIPS busdma sync operations currently are a big no-op on coherent memory.
This isn't strictly correct behaviour as we need a SYNC in here to ensure that
the writes have finished and are visible in main memory before the MMIO accesses
occur. This will have to be addressed in a later commit.
But, before that happens, let's at least do a flush here to make things
more "correct".
This is required for even remotely sensible behaviour on mips74k with
write-through memory enabled.
The mips74k programmers guide notes that reads can be re-ordered, even
uncached ones, so we need an explicit SYNC between them.
Yes, this is a case of a driver author actively doing a bus barrier
operation.
This ends up being necessary when the mips74k core is run in write-back
mode rather than write-through mode. That's coming in an upcoming
commit.
Tested:
* mips74k, QCA9558 SoC (AP135 reference board), arge<->arge interface
routing traffic tests.
send frames.
This matches the other check for space.
"enough" is a misnomer, for "reasons". The biggest reason is that
the TX ring is actually a circular linked list, with no head/tail pointers.
This is just a bit more headroom between head/tail so we have time to
schedule frames before we hit where the hardware is at.
Ideally this would be tunable and a little larger.
This flushes out the write to the system before anything continues.
The mips74k guide, chapter 3.3.3 (write gathering) notes that writes
can be buffered in FIFOs - even uncached ones - so we can't guarantee
the device has felt its effects. Now, since we're all lazy driver
authors and don't pepper read/write barriers everywhere, fake it here.
tested:
* mips74k - QCA9558 SoC (AP135 reference board)
self-documented, and eases addition of new ops.
For the similar reasons, eliminate UMTX_OP_MAX. nitems() handles the
only use of the symbol.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Fixes race condition observed under following circumstances:
1) I/O split on 128KB boundary with Intel NVMe controller.
Current Intel controllers produce better latency when
I/Os do not span a 128KB boundary - even if the I/O size
itself is less than 128KB.
2) Per-CPU I/O queues are enabled.
3) Child I/Os are submitted on different submission queues.
4) Interrupts for child I/O completions occur almost
simultaneously.
5) ithread for child I/O A increments bio_inbed, then
immediately is preempted (rendezvous IPI, higher priority
interrupt).
6) ithread for child I/O B increments bio_inbed, then completes
parent bio since all children are now completed.
7) parent bio is freed, and immediately reallocated for a VFS
or gpart bio (including setting bio_children to 1 and
clearing bio_driver1).
8) ithread for child I/O A resumes processing. bio_children
for what it thinks is the parent bio is set to 1, so it
thinks it needs to complete the parent bio.
Result is either calling a NULL callback function, or double freeing
the bio to its uma zone.
PR: 203746
Reported by: Drew Gallatin <gallatin@netflix.com>,
Marc Goroff <mgoroff@quorum.net>
Tested by: Drew Gallatin <gallatin@netflix.com>
MFC after: 3 days
Sponsored by: Intel
if they are not required for mounting rootfs. However, it's possible
that some setups try to mount them in mountcritlocal (ie from fstab).
Export the list of current root mount holds using a new sysctl,
vfs.root_mount_hold, and make mountcritlocal retry if "mount -a" fails
and the list is not empty.
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D3709
status registers for every interrupt. Check a common host channel
status interrupt register first, then conditionally read the
individual host channel status registers.
Submitted by: Sebastian Huber <sebastian.huber@embedded-brains.de>
MFC after: 1 week
b_asize can be zero if the block is compressed into an empty block
(ZIO_COMPRESS_EMPTY) and the trim code asserts that meaningless
zero-sized trimming is not attempted.
The logic for calling trim_map_free() is extracted into a new function
l2arc_trim() to minimize code duplication.
PR: 203473
Reported by: Willem Jan Withagen <wjw@digiware.nl>
Tested by: Willem Jan Withagen <wjw@digiware.nl>
MFC after: 11 days
We can't use copyout because destination memory is userland address
in another process but we have reference to respective page so map
the page into kernel address space and copy fragments there
receives video memory address from VideoCore through property mailbox
channel. Older versions of firmware (and the one that is currently part
of sysutils/u-boot-rpi and sysutils/u-boot-rpi2) returned real physical
address, newer one returns VideoCore bus address, so we need to convert
it to actual physical address. this version works with both older and
newer interface.
point, e.g. on RaspberryPi 2 when control is passed from loader to kernel
it contains garbage. So we use cpsr as a base for new cpsr value: if we
have reached this point it means current value is OK
Reviewed by: andrew
When using route-to (or reply-to) pf sends the packet directly to the output
interface. If that interface doesn't support checksum offloading the checksum
has to be calculated in software.
That was already done in the IPv4 case, but not for the IPv6 case. As a result
we'd emit packets with pseudo-header checksums (i.e. incorrect checksums).
This issue was exposed by the changes in r289316 when pf stopped performing full
checksum calculations for all packets.
Submitted by: Luoqi Chen
MFC after: 1 week
pmap_change_attr must change the memory type of both the requested KVA
and the corresponding DMAP mappings (if such mappings exist), to satisfy
an Intel requirement that two or more mappings to the same physical
pages must have the same memory type.
However, not all kernel mapped pages have corresponding DMAP mappings --
for example, 64-bit BARs. Skip fixing up the DMAP for out-of-bounds
addresses.
Submitted by: Steve Wahl <steve_wahl@dell.com>
Reviewed by: alc, jhb
Sponsored by: Dell Compellent
Differential Revision: https://reviews.freebsd.org/D4030
When destroying a character device the si_devsw field is set to NULL
before all references are gone, to indicate the character device is
going away. This can cause a NULL-dereference fault inside physio().
The callers of physio() should own a thread reference on the cdev and
if si_devsw is seen as non-NULL, it is usable during the execution of
the function. Else an ENXIO error code is returned.
Reviewed by: kib
MFC after: 2 weeks
- Move all files related to the LinuxKPI into sys/compat/linuxkpi and
its subfolders.
- Update sys/conf/files and some Makefiles to use new file locations.
- Added description of COMPAT_LINUXKPI to sys/conf/NOTES which in turn
adds the LinuxKPI to all LINT builds.
- The LinuxKPI can be added to the kernel by setting the
COMPAT_LINUXKPI option. The OFED kernel option no longer builds the
LinuxKPI into the kernel. This was done to keep the build rules for
the LinuxKPI in sys/conf/files simple.
- Extend the LinuxKPI module to include support for USB by moving the
Linux USB compat from usb.ko to linuxkpi.ko.
- Bump the FreeBSD_version.
- A universe kernel build has been done.
Reviewed by: np @ (cxgb and cxgbe related changes only)
Sponsored by: Mellanox Technologies
AMD64 pmap assumes ranges will be in the DMAP, which isn't necessarily
true for NTB memory windows (especially 64-bit BARs).
Suggested by: pmap_change_attr_locked -> kassert_panic
Sponsored by: EMC / Isilon Storage Division
Allows DMA from/to arbitrary KVA or physical address. /dev/ioat_test
must be enabled by root and is only R/W root, so this is approximately
as dangerous as /dev/mem and /dev/kmem.
Sponsored by: EMC / Isilon Storage Division
suggested by RFC 6675.
Currently differnt places in the stack tries to guess this in suboptimal ways.
The main problem is that current calculations don't take sacked bytes into
account. Sacked bytes are the bytes receiver acked via SACK option. This is
suboptimal because it assumes that network has more outstanding (unacked) bytes
than the actual value and thus sends less data by setting congestion window
lower than what's possible which in turn may cause slower recovery from losses.
As an example, one of the current calculations looks something like this:
snd_nxt - snd_fack + sackhint.sack_bytes_rexmit
New proposal from RFC 6675 is:
snd_max - snd_una - sackhint.sacked_bytes + sackhint.sack_bytes_rexmit
which takes sacked bytes into account which is a new addition to the sackhint
struct. Only thing we are missing from RFC 6675 is isLost() i.e. segment being
considered lost and thus adjusting pipe based on that which makes this
calculation a bit on conservative side.
The approach is very simple. We already process each ack with sack info in
tcp_sack_doack() and extract sack blocks/holes out of it. We'd now also track
this new variable sacked_bytes which keeps track of total sacked bytes reported.
One downside to this approach is that we may get incorrect count of sacked_bytes
if the other end decides to drop sack info in the ack because of memory pressure
or some other reasons. But in this (not very likely) case also the pipe
calculation would be conservative which is okay as opposed to being aggressive
in sending packets into the network.
Next step is to use this more accurate pipe estimation to drive congestion
window adjustments.
In collaboration with: rrs
Reviewed by: jason_eggnet dot com, rrs
MFC after: 2 weeks
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D3971