The MIPS busdma sync operations currently are a big no-op on coherent memory.
This isn't strictly correct behaviour as we need a SYNC in here to ensure that
the writes have finished and are visible in main memory before the MMIO accesses
occur. This will have to be addressed in a later commit.
But, before that happens, let's at least do a flush here to make things
more "correct".
This is required for even remotely sensible behaviour on mips74k with
write-through memory enabled.
The mips74k programmers guide notes that reads can be re-ordered, even
uncached ones, so we need an explicit SYNC between them.
Yes, this is a case of a driver author actively doing a bus barrier
operation.
This ends up being necessary when the mips74k core is run in write-back
mode rather than write-through mode. That's coming in an upcoming
commit.
Tested:
* mips74k, QCA9558 SoC (AP135 reference board), arge<->arge interface
routing traffic tests.
send frames.
This matches the other check for space.
"enough" is a misnomer, for "reasons". The biggest reason is that
the TX ring is actually a circular linked list, with no head/tail pointers.
This is just a bit more headroom between head/tail so we have time to
schedule frames before we hit where the hardware is at.
Ideally this would be tunable and a little larger.
This flushes out the write to the system before anything continues.
The mips74k guide, chapter 3.3.3 (write gathering) notes that writes
can be buffered in FIFOs - even uncached ones - so we can't guarantee
the device has felt its effects. Now, since we're all lazy driver
authors and don't pepper read/write barriers everywhere, fake it here.
tested:
* mips74k - QCA9558 SoC (AP135 reference board)
This should make it easier to track down interrupt storms from arge.
Tested:
* AP135 (QCA955x) SoC - defaults to ARGE_DEBUG enabled
* Carambola2 (AR9331 SoC) - defaults to ARGE_DEBUG disabled
I couldn't test arge0->arge1 bridging, only arge0 VLAN bridging.
The DIR-825C1 only hooks up arge0 to the switch GMAC0 and so
you need to abuse VLANs to test.
Tested:
* DIR-825C1 (AR9344)
This is an AR9331 part based on the AP121 reference design but with
32MB RAM. Yes, it has 4MB flash and it has no USB, so clever hacks
are required to get it up and working.
But boot/work it does.
This part seems to work bug-free with single byte TX/RX buffer alignment.
This drops the CPU requirement to bridge 100mbit iperf from 100% CPU
to ~ 50% CPU.
Tested:
* AP121 (AR9330) SoC, highly magic netbooted kernel + USB rootfs
due to 4mb flash, 16mb RAM; doing bridging between arge0 and arge1.
Notes:
* Yes, I likely can also turn this on for the AR934x SoC family now.
But since hardware design apparently follows similar branching
strategies to software design, I'll go and make sure all the AR934x's
that made it out into shipping products work before I flip it on.
offset within the buffer to align the L3 headers we know the buffer itself
was allocated and sized on cacheline boundaries and we don't need to
preserve partitial cachelines at the start and end of the buffer when
doing busdma sync operations.
to copying in some code from the armv4 busdma, and adapting a few variable
and flag names to match the surrounding mips code.
Instead of keeping a local cache of prealloced busdma_map structs on a
mutex-protected list, set up an uma zone to cache them.
Instead of all memory allocations using M_DEVBUF, use new categories
M_BUSDMA for allocations of metadata (tags, maps, segment tracking lists),
and M_BOUNCE for bounce pages.
When buffers are allocated out of the busdma_bufalloc zones the alignment
and size of the buffers is known, and the code can skip doing any "partial
cacheline flush" logic to preserve data that may be adjacent to the DMA
buffer but contain non-DMA data.
Reviewed by: adrian, imp
and implement support for VM_MEMATTR_UNCACHEABLE. This will be used in
upcoming changes to support BUS_DMA_COHERENT in bus_dmamem_alloc().
Reviewed by: adrian, imp
This was triggering when using it as an AP bridge rather than an ethernet
bridge.
The code is unclear but it works; I'll fix it to be clearer and test
performance at a later stage.
The existing code meets the "alignment" requirement for the l3 payload
by offsetting the mbuf by uint64_t and then calling an rx fixup routine
to copy the frame backwards by 2 bytes. This DWORD aligns the
L3 payload so tcp, etc doesn't panic on unaligned access.
This is .. slow.
For arge MACs that support 1 byte TX/RX address alignment, we can do
the "other" hack: offset the RX address of the mbuf so the L3 payload
again is hopefully DWORD aligned.
This is much cheaper - since TX/RX is both 1 byte align ready (thanks
to the previous commit) there's no bounce buffering going on and there
is no rx fixup copying.
This gets bridging performance up from 180mbit/sec -> 410mbit/sec.
There's around 10% of CPU cycles spent in _bus_dmamap_sync(); I'll
investigate that later.
Tested:
* QCA955x SoC (AP135 reference board), bridging arge0/arge1
by programming the switch to have two vlangroups in dot1q mode:
# ifconfig bridge0 inet 192.168.2.20/24
# etherswitchcfg config vlan_mode dot1q
# etherswitchcfg vlangroup0 members 0,1,2,3,4
# etherswitchcfg vlangroup1 vlan 2 members 5,6
# etherswitchcfg port5 pvid 2
# etherswitchcfg port6 pvid 2
# ifconfig arge1 up
# ifconfig bridge0 addm arge1
The early ethernet MACs (I think AR71xx and AR913x) require that both
TX and RX require 4-byte alignment for all packets.
The later MACs have started relaxing the requirements.
For now, the 1-byte TX and 1-byte RX alignment requirements are only for
the QCA955x SoCs. I'll add in the relaxed requirements as I review the
datasheets and do testing.
* Add a hardware flags field and 1-byte / 4-byte TX/RX alignment.
* .. defaulting to 4-byte TX and 4-byte RX alignment.
* Only enforce the TX alignment fixup if the hardware requires a 4-byte
TX alignment. This avoids a call to m_defrag().
* Add counters for various situations for further debugging.
* Set the 1-byte and 4-byte busdma alignment requirement when
the tag is created.
This improves the straight bridging performance from 130mbit/sec
to 180mbit/sec, purely by removing the need for TX path bounce buffers.
The main performance issue is the RX alignment requirement and any RX
bounce buffering that's occuring. (In a local test, removing the RX
fixup path and just aligning buffers raises the performance to above
400mbit/sec.
In theory it's a no-op for SoCs before the QCA955x.
Tested:
* QCA9558 SoC in AP135 board, using software bridging between arge0/arge1.
The ERL is a fairly cheap (~$100 USD) and readily available dual core
MIPS64 device so it makes a useful MIPS reference platform.
This is based in part on the kernel config generated by the mkerlimage
script from http://rtfm.net/FreeBSD/ERL/.
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D3884
* Shuffle the kernel to be at the beginning
* Give the kernel 2mb, the rootfs 6mb, and 'mib0' the rest
* put the cfg parition just before the ART calibration data for the
wifi part in the SoC
* .. and make sure ART points to the right 64k region.
I've updated the freebsd-wifi-build wiki the instructions on using this.
If someone has an AP135 with 8MB SPI flash then this won't work; everything
minus the big mib0 partition is just a bit over 8MB. Come see me if this
ever happens (you'll likely just have to shrink the rootfs and the kernel
a little in order to make it fit.)
Tested:
* AP135 reference board.
belong to a vm object, they can't be paged out. Since they can't be paged
out, they are never enqueued in a paging queue. Nonetheless, passing
PQ_INACTIVE to vm_page_unwire() creates the appearance that these pages
are being enqueued in the inactive queue. As of r288122, we can avoid
this false impression by passing PQ_NONE.
Submitted by: kmacy (an earlier version)
Differential Revision: https://reviews.freebsd.org/D1674
linkers no longer raise an error when undefined weak symbols are
found, but relocate as if the symbol value was 0. Note that we do not
repeat the mistake of userspace dynamic linker of making the symbol
lookup prefer non-weak symbol definition over the weak one, if both
are available. In fact, kernel linker uses the first definition
found, and ignores duplicates.
Signature of the elf_lookup() and elf_obj_lookup() functions changed
to split result/error code and the symbol address returned.
Otherwise, it is impossible to return zero address as the symbol
value, to MD relocation code. This explains the mechanical changes in
elf_machdep.c sources.
The powerpc64 R_PPC_JMP_SLOT handler did not checked error from the
lookup() call, the patch leaves the code as is (untested).
Reported by: glebius
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
When the system has more than a single PCI domain, the bus numbers
are not unique, thus they cannot be used for "pci" device numbering.
Change bus numbers to -1 (i.e. to-be-determined automatically)
wherever the code did not care about domains.
Reviewed by: jhb
Obtained from: Semihalf
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D3406
running thread.
It is currently implemented only on amd64 and i386; on these
architectures, it is implemented by raising an NMI on the CPU on which
the target thread is currently running. Unlike stack_save_td(), it may
fail, for example if the thread is running in user mode.
This change also modifies the kern.proc.kstack sysctl to use this function,
so that stacks of running threads are shown in the output of "procstat -kk".
This is handy for debugging threads that are stuck in a busy loop.
Reviewed by: bdrewery, jhb, kib
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D3256
The only operation which is prevented by the hold is the kernel stack
swapout for the faulted thread, which should be fine to allow.
Remove useless checks for NULL curproc or curproc->p_vmspace from the
trap_pfault() wrappers on x86 and powerpc.
Reviewed by: alc (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
nlge(4) is supposed to deprecate rge(4) for Broadcom XLR when it was
introduced 5 years ago.
rge doesn't build on -CURRENT due to MII changes. All the XLR kernel confs
use nlge. Let's get rid of the old driver for FreeBSD 11. We can use
10-STABLE or SVN to go back and look at the old driver if needed.
Differential Revision: https://reviews.freebsd.org/D3339
Submitted by: kevin.bowling@kev009.com
Add a check to preload_search_info to make sure mod is set. Most of the
callers of preload_search_info don't check that the mod parameter is
set, which can cause page faults. While at it, remove some now unnecessary
checks before calling preload_search_info.
Sponsored by: Citrix Systems R&D
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D3440
vm_offset_t pmap_quick_enter_page(vm_page_t m)
void pmap_quick_remove_page(vm_offset_t kva)
These will create and destroy a temporary, CPU-local KVA mapping of a specified page.
Guarantees:
--Will not sleep and will not fail.
--Safe to call under a non-sleepable lock or from an ithread
Restrictions:
--Not guaranteed to be safe to call from an interrupt filter or under a spin mutex on all platforms
--Current implementation does not guarantee more than one page of mapping space across all platforms. MI code should not make nested calls to pmap_quick_enter_page.
--MI code should not perform locking while holding onto a mapping created by pmap_quick_enter_page
The idea is to use this in busdma, for bounce buffer copies as well as virtually-indexed cache maintenance on mips and arm.
NOTE: the non-i386, non-amd64 implementations of these functions still need review and testing.
Reviewed by: kib
Approved by: kib (mentor)
Differential Revision: http://reviews.freebsd.org/D3013
provide a semantic defined by the C11 fences with corresponding
memory_order.
atomic_thread_fence_acq() gives r | r, w, where r and w are read and
write accesses, and | denotes the fence itself.
atomic_thread_fence_rel() is r, w | w.
atomic_thread_fence_acq_rel() is the combination of the acquire and
release in single operation. Note that reads after the acq+rel fence
could be made visible before writes preceeding the fence.
atomic_thread_fence_seq_cst() orders all accesses before/after the
fence, and the fence itself is globally ordered against other
sequentially consistent atomic operations.
Reviewed by: alc
Discussed with: bde
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
and start teaching subsystems about it.
The Atheros MIPS platforms don't guarantee any kind of FIFO consistency
with interrupts in hardware. So software needs to do a flush when it
receives an interrupt and before it calls the interrupt handler.
There are new ones for the QCA934x and QCA955x, so do a few things:
* Get rid of the individual ones (for ethernet and IP2);
* Create a mux and enum listing all the variations on DDR flushes;
* replace the uses of IP2 with the relevant one (which will typically
be "PCI" here);
* call the USB DDR flush before calling the real USB interrupt handlers;
* call the ethernet one upon receiving an interrupt that's for us,
rather than never calling it during operation.
Tested:
* QCA9558 (TP-Link archer c7 v2)
* AR9331 (Carambola 2)
TODO:
* PCI, USB, ethernet, etc need to do a double-check to see if the
interrupt was truely for them before doing the DDR. For now I
prefer "correct" over "fast".