up to now.
The new sendfile is the code that Netflix uses to send their multiple tens
of gigabits of data per second. The new implementation features asynchronous
I/O, when I/O operations are launched, but not awaited to be complete. An
explanation of why such behavior is beneficial compared to old one is
going to be too long for a commit message, so we will skip it here.
Additional features of new syscall are extra flags, which provide an
application more control over data sent. The SF_NOCACHE flag tells
kernel that data shouldn't be cached after it was sent. The SF_READAHEAD()
macro allows to specify readahead size in pages.
The new syscalls is a drop in replacement. No modifications are required
to applications. One can take nginx binary for stable/10 and run it
successfully on head. Although SF_NODISKIO lost its original sense, as now
sendfile doesn't block, and now means something completely different (tm),
using the new sendfile the old way is absolutely safe.
Celebrates: Netflix global launch!
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
Relnotes: yes
socket-buffer implementations, introduce a return value for MCLGET()
(and m_cljget() that underlies it) to allow the caller to avoid testing
M_EXT itself. Update all callers to use the return value.
With this change, very few network device drivers remain aware of
M_EXT; the primary exceptions lie in mbuf-chain pretty printers for
debugging, and in a few cases, custom mbuf and cluster allocation
implementations.
NB: This is a difficult-to-test change as it touches many drivers for
which I don't have physical devices. Instead we've gone for intensive
review, but further post-commit review would definitely be appreciated
to spot errors where changes could not easily be made mechanically,
but were largely mechanical in nature.
Differential Revision: https://reviews.freebsd.org/D1440
Reviewed by: adrian, bz, gnn
Sponsored by: EMC / Isilon Storage Division
- Do not set if_collisions on interrupt, read them in ti_get_counter().
- Add missing bus_dmamap_sync(BUS_DMASYNC_PREREAD) in ti_ioctl2(). [1]
Submitted by: mav [1]
- Add missing calls to bus_dmamap_unload() in et(4).
- Check the bus address against 0 to decide when to call
bus_dmamap_unload() instead of comparing the bus_dma map against NULL.
- Check the virtual address against NULL to decide when to call
bus_dmamem_free() instead of comparing the bus_dma map against NULL.
- Don't clear bus_dma map pointers to NULL for static allocations.
Instead, treat the value as completely opaque.
- Pass the correct virtual address to bus_dmamem_free() in wpi(4) instead
of trying to free a pointer to the virtual address.
Reviewed by: yongari
the queue where to enqueue pages that are going to be unwired.
- Add stronger checks to the enqueue/dequeue for the pagequeues when
adding and removing pages to them.
Of course, for unmanaged pages the queue parameter of vm_page_unwire() will
be ignored, just as the active parameter today.
This makes adding new pagequeues quicker.
This change effectively modifies the KPI. __FreeBSD_version will be,
however, bumped just when the full cache of free pages will be
evicted.
Sponsored by: EMC / Isilon storage division
Reviewed by: alc
Tested by: pho
to this event, adding if_var.h to files that do need it. Also, include
all includes that now are included due to implicit pollution via if_var.h
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
them, please let me know if not). Most of these are of the form:
static const struct bzzt_type {
[...list of members...]
} const bzzt_devs[] = {
[...list of initializers...]
};
The second const is unnecessary, as arrays cannot be modified anyway,
and if the elements are const, the whole thing is const automatically
(e.g. it is placed in .rodata).
I have verified this does not change the binary output of a full kernel
build (except for build timestamps embedded in the object files).
Reviewed by: yongari, marius
MFC after: 1 week
datagrams. Traditionally upper stack fragmented packets without
computing TCP/UDP checksum and these datagrams were passed to
driver. But there are chances that other packets slip into the
interface queue in SMP world. If this happens firmware running on
MIPS 4000 processor in the controller would see mixed packets and
it shall send out corrupted packets.
While I'm here simplify checksum offloading setup.
MFC After: 1 week
- Don't use a single big DMA block for all rings. Create separate
DMA area for each ring instead. Currently the following DMA
areas are created:
Event ring, standard RX ring, jumbo RX ring, RX return ring,
hardware MAC statistics and producer/consumer status area.
For Tigon II, mini RX ring and TX ring are additionally created.
- Added missing bus_dmamap_sync(9) in various TX/RX paths.
- TX ring is no longer created for Tigon 1 such that it saves more
resources on Tigon 1.
- Data sheet is not clear about alignment requirement of each ring
so use 32 bytes alignment for normal DMA area but use 64 bytes
alignment for jumbo RX ring where the extended RX descriptor
size is 64 bytes.
- For each TX/RX buffers use separate DMA tag(e.g. the size of a
DMA segment, total size of DMA segments etc).
- Tigon allows separate DMA area for event producer, RX return
producer and TX consumer which is really cool feature. This
means TX and RX path could be independently run in parallel.
However ti(4) uses a single driver lock so it's meaningless
to have separate DMA area for these producer/consumer such that
this change creates a single status DMA area.
- It seems Tigon has no limits on DMA address space and I also
don't see any problem with that but old comments in driver
indicates there could be issues on descriptors being located in
64bit region. Introduce a tunable, dev.ti.%d.dac, to disable
using 64bit DMA in driver. The default is 0 which means it would
use full 64bit DMA. If there are DMA issues, users can disable
it by setting the tunable to 0.
- Do not increase watchdog timer in ti_txeof(). Previously driver
increased the watchdog timer whenever there are queued TX frames.
- When stat ticks is set to 0, skip processing ti_stats_update(),
avoiding bus_dmamap_sync(9) and updating if_collisions counter.
- MTU does not include FCS bytes, replace it with
ETHER_VLAN_ENCAP_LEN.
With these changes, ti(4) should work on PAE environments.
Many thanks to Jay Borkenhagen for remote hardware access.
have administrators control them. ti(4) provides a character
device to control various other features of driver via ioctls but
users had to write their own code to manipulate these parameters.
It seems some default values for these parameters are not optimal
on today's system but leave it as it was and let administrators
change them. The following parameters could be changed:
dev.ti.%d.rx_coal_ticks
dev.ti.%d.rx_max_coal_bds
dev.ti.%d.tx_coal_ticks
dev.ti.%d.tx_max_coal_bds
dev.ti.%d.tx_buf_ratio
dev.ti.%d.stat_ticks
The interface has to be brought down and up again before a change
takes effect.
ti(4) controller supports hardware MAC counters with additional
DMA statistics. So it's doable to export these counters via
sysctl interface. Unfortunately, these counters are cumulative
such that driver have to either send an explicit clear command to
controller after extracting them or have to maintain internal
counters to get actual changes. Neither look good to me so
counters were not exported via sysctl.
Pre-allocate the memory in device attach time. While I'm here
remove unnecessary reassignment of error variable as it was already
initialized. Also added a missing driver lock in TIIOCSETTRACE
handler.
allocator with UMA backed jumbo allocator by default. Previously
ti(4) used sf_buf(9) interface for jumbo buffers but it was broken
at this moment such that enabling jumbo frame caused instant panic.
Due to the nature of sf_buf(9) it heavily relies on VM changes but
it seems ti(4) was not received much blessing from VM gurus. I
don't understand VM magic and implications used in driver either.
Switching to UMA backed jumbo allocator like other network drivers
will make jumbo frame work on ti(4).
While I'm here, fully allocate all RX buffers. This means ti(4) now
uses 512 RX buffer and 1024 mini RX buffers.
To use sf_buf(9) interface for jumbo buffers, introduce a new
'options TI_SF_BUF_JUMBO'. If it is proven that sf_buf(9) is better
for jumbo buffers, interesting developers can fix the issue in
future.
ti(4) still needs more bus_dma(9) cleanups and should use separate
DMA tag/map for each ring(standard, jumbo, mini, command, event
etc) but it should work on all platforms except PAE.
Special thanks to Jay[1] who provided complete remote debugging
access.
Tested by: Jay Borkenhagen <jayb <> braeburn dot org > [1]
o Do not blindly UP controller when MTU is changed. Reinitialize
controller only if driver is running.
o Remove useless ti_stop() in ti_watchdog() since ti_init_locked()
always invokes ti_stop().
checksum offloading and VLAN hardware tag insertion/stripping from
the currently enabled hardware offloading capabilities.
Previously if_hwassist, which was initialized to TX/RX checksum
offloading, was blindly used to enable both TX and RX checksum
offloading such that disabling either TX or RX checksum offloading
was not possible.
ti(4) controllers support TX/RX checksum offloading with VLAN
tagging so announce TX/RX checksum offloading capability over VLAN
to vlan(4).
Make VLAN hardware tag insertion/stripping honors currently enabled
interface capability instead of blindly enabling VLAN hardware
tagging. This change allows disabling hardware support of VLAN tag.
Because ti(4) supports VLAN oversized frames, make network stack
know the capability by setting if_hdrlen.
While I'm here, rewrite SIOCSIFCAP handler and make sure to
reinitialize controller whenever TX/RX checksum offloading and VLAN
hardware tagging option is changed. The requirement of controller
reinitialization comes from the limitation of Tigon I/II firmware.
Tigon I/II firmware requires all related RCBs should be
reinitialized whenever any of its hardware offloading capabilities
change.
vlan(4) is also notified whenever the parent interface's capability
changes such that it can correctly handle TX/RX checksum offloading
based on parent interface's enabled offloading capabilities.
RX checksum offloading handler was changed to make upper stack use
controller computed partial checksum value. Previously, ti(4) just
set the computed value for any frames(IPv4, IPv6) and the value was
not used in upper stack because driver didn't set CSUM_DATA_VALID
such that upper network stack had to recompute checksum of TCP/UDP
packets. I have no idea how this was not noticed for a long time.
With this change, upper network stack does not have to fully
recompute the checksum such that calculating pseudo checksum based
on partial checksum is sufficient to know whether received packet's
checksum is correct or not. However, I don't know why ti(4) does
not have controller compute pseudo checksum as controller has
ability to do it. I'm just guessing enabling that feature could
trigger a firmware bug or could be slower than computing it on host
side so just leave it as it was.
In order not to produce false positives, ti(4) now checks whether
controller actually computed IP or TCP/UDP checksum by checking
ti_flags field.
state changes. Hide superfluous link up/down message under
bootverbose since if_link_state_change(9) shows that information.
While I'm here, change baudrate with the resolved speed of the
established link instead of blindly setting it 1G. Unfortunately,
it seems there is no way to differentiate 10/100Mbps from
non-gigabit link so just assume we established a 100Mbps link if
current link is not a gigabit link.
This was broken in r175872.
We have a UMA backed jumbo allocator and that is much better
implementation than having a local jumbo buffer allocator in
driver. This local allocator would be removed in near future but
fixing build before removal wouldn't be a bad idea.
if_watchdog and if_timer.
- Fix some issues in detach for sn(4), ste(4), and ti(4). Primarily this
means calling ether_ifdetach() before anything else.
IF_ADDR_UNLOCK() across network device drivers when accessing the
per-interface multicast address list, if_multiaddrs. This will
allow us to change the locking strategy without affecting our driver
programming interface or binary interface.
For two wireless drivers, remove unnecessary locking, since they
don't actually access the multicast address list.
Approved by: re (kib)
MFC after: 6 weeks
free function controlable, instead of passing the KVA of the buffer
storage as the first argument.
Fix all conventional users of the API to pass the KVA of the buffer
as the first argument, to make this a no-op commit.
Likely break the only non-convetional user of the API, after informing
the relevant committer.
Update the mbuf(9) manual page, which was already out of sync on
this point.
Bump __FreeBSD_version to 800016 as there is no way to tell how
many arguments a CPP macro needs any other way.
This paves the way for giving sendfile(9) a way to wait for the
passed storage to have been accessed before returning.
This does not affect the memory layout or size of mbufs.
Parental oversight by: sam and rwatson.
No MFC is anticipated.
If these drivers are setting M_VLANTAG because they are stripping the
layer 2 802.1Q headers, then they need to be re-inserting them so any
bpf(4) peers can properly decode them.
It should be noted that this is compiled tested only.
MFC after: 3 weeks
sparc64 GENERIC and the sound device drivers known working on sparc64
to use bus_get_dma_tag() to obtain the parent DMA tag so we can get rid
of the sparc64_root_dma_tag kludge eventually. Except for ath(4), sk(4),
stge(4) and ti(4) these changes are runtime tested (unless I booted up
the wrong kernels again...).