Add additional safety and overflow checks to clock_ts_to_ct and the
BCD routines while we're here.
Perform a safety check in sys_clock_settime() first to avoid easy local
root panic, without having to propagate an error value back through
dozens of APIs currently lacking error returns.
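As a rough illustration of the kind of BCD safety check mentioned above, the sketch below rejects bytes whose nibbles are not decimal digits before converting them. The names bcd_valid() and bcd_to_bin_checked() are hypothetical and are not the routines touched by this change.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if both nibbles of b are decimal digits. */
bool
bcd_valid(uint8_t b)
{
	return ((b >> 4) <= 9 && (b & 0x0f) <= 9);
}

/*
 * Hypothetical checked conversion: stores the binary value in *out and
 * returns true, or returns false and leaves *out alone on invalid BCD.
 */
bool
bcd_to_bin_checked(uint8_t b, uint8_t *out)
{
	if (!bcd_valid(b))
		return (false);
	*out = (uint8_t)((b >> 4) * 10 + (b & 0x0f));
	return (true);
}
```

Returning a status instead of a silently wrong value lets the caller reject a bad timestamp early, which is the spirit of checking in sys_clock_settime() first.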
PR: 211960, 214300
Submitted by: Justin McOmie <justin.mcomie at gmail.com>, kib@
Reported by: Tim Newsham <tim.newsham at nccgroup.trust>
Reviewed by: kib@
Sponsored by: Dell EMC Isilon, FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D9279
Add internal tracking of smp startup status to reliably figure out
what methods are to be used to get gtaskqueue up and running.
e1000:
Calculating this pointer gives undefined behaviour when (last == -1)
(it is before the buffer). The pointer is always followed. Panics
occurred when it points to an unmapped page. Otherwise, the pointed-to
garbage tends to not have the E1000_TXD_STAT_DD bit set in it, so in the
broken case the loop was usually null and the function just returned, and
this was accidentally correct.
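A minimal sketch of the safe pattern, assuming a simplified descriptor layout: the index is validated before the pointer is formed, so a pointer before the buffer is never computed. struct tx_desc, tx_desc_done() and TXD_STAT_DD are illustrative stand-ins, not the driver's actual code.

```c
#include <stddef.h>
#include <stdint.h>

#define TXD_STAT_DD	0x01	/* stand-in for E1000_TXD_STAT_DD */

struct tx_desc {		/* hypothetical, simplified descriptor */
	uint8_t status;
};

/*
 * Returns nonzero if the descriptor at index 'last' reports completion.
 * 'last' is checked first, so &ring[-1] is never formed or followed.
 */
int
tx_desc_done(const struct tx_desc *ring, int ndesc, int last)
{
	const struct tx_desc *eop;

	if (last < 0 || last >= ndesc)
		return (0);
	eop = &ring[last];
	return ((eop->status & TXD_STAT_DD) != 0);
}
```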
Submitted by: bde
Reported by: Matt Macy <mmacy@nextbsd.org>
Add internal tracking of smp startup status to reliably figure out
what methods are to be used to get gtaskqueue up and running.
e1000:
Calculating this pointer gives undefined behaviour when (last == -1)
(it is before the buffer). The pointer is always followed. Panics
occurred when it points to an unmapped page. Otherwise, the pointed-to
garbage tends to not have the E1000_TXD_STAT_DD bit set in it, so in the
broken case the loop was usually null and the function just returned, and
this was accidentally correct.
Submitted by: bde
Reviewed by: Matt Macy <mmacy@nextbsd.org>
If the "capacity" LU option is set, the ramdisk backend now implements a
featured thin provisioned disk, storing data in malloc(9) allocated memory
blocks of pblocksize bytes (default PAGE_SIZE or 4KB). Additionally ~0.2% of
LU size is used for the indirection tree (a bigger pblocksize reduces the
overhead).
Backend supports all unmap and anchor operations. If configured capacity
is overflowed, proper error conditions are reported.
If "capacity" LU option is not set, the backend operates mostly the same
as before without allocating real storage: writes go to nowhere, reads
return zeroes, reporting that all LBAs are unmapped.
This backend is still mostly oriented on testing and benchmarking (it is
still a volatile RAM disk), but now it should allow running real FS tests,
not only simple dumb dd.
MFC after: 2 weeks
This makes it easier for the userland script to find the related
VF interface.
Reviewed by: sephe
Approved by: sephe (mentor)
MFC after: 2 weeks
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D9101
Hyper-V's NIC SR-IOV implementation needs a Hyper-V synthetic NIC and
a VF NIC to work together (both NICs have the same MAC address), mainly to
support seamless live migration.
When the VF device becomes UP (or DOWN), the synthetic NIC driver needs
to switch the data path from the synthetic NIC to the VF (or the opposite).
Note: multicast/broadcast packets are still received through the synthetic
NIC and we need to inject the packets through the VF interface (if the VF is
UP), even if the synthetic NIC is DOWN (so we need to force the rxfilter
to be NDIS_PACKET_TYPE_PROMISCUOUS, when the VF is UP).
Reviewed by: sephe
Approved by: sephe (mentor)
MFC after: 2 weeks
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D8964
Hyper-V's NIC SR-IOV implementation needs a Hyper-V synthetic NIC and
a VF NIC to work together, mainly to support seamless live migration.
When the VF device becomes UP (or DOWN), the synthetic NIC driver needs
to switch the data path from the synthetic NIC to the VF (or the opposite).
So the synthetic NIC driver needs to know when a VF device is becoming
UP or DOWN and hence the patch is made.
Reviewed by: sephe
Approved by: sephe (mentor)
MFC after: 2 weeks
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D8963
It's unnecessary because the upper network stack does the same checking.
In the case of Hyper-V SR-IOV, we need to remove the checking because
1) multicast/broadcast packets are still received through the synthetic
NIC and we need to inject the packets through the VF interface;
2) we must inject the packets even if the synthetic NIC is down, or has
a different MTU from the VF device.
Reviewed by: sephe
Approved by: sephe (mentor)
MFC after: 2 weeks
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D8962
This will be used by the coming NIC SR-IOV patch.
Reviewed by: sephe
Approved by: sephe (mentor)
MFC after: 2 weeks
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D8909
Variables "fast" and "active" are both constant in lacp_port_create(), but
comments misleadingly suggest that "fast" can be changed via ioctl. The
constant values control the value of "lp->lp_state", so it too is constant,
and the code for assigning a different value to it is essentially dead.
Remove both "fast" and "active", and set "lp->lp_state" unconditionally;
that gets rid of the dead code and misleading comments.
CID: 1305692
CID: 1305734
Reported by: asomers
Reviewed by: asomers
MFC after: 1 week
Sponsored by: Panasas
Differential Revision: https://reviews.freebsd.org/D9302
Remove custom DTS duplicate of tda19988 node and use upstream-provided
one introduced by r295436. This duplication created two tdaX devices
which confused fb driver into using only 640x480 area while setting
display to native resolution.
Reported by: Michael Smith
MFC after: 3 days
* limit cabq to 64 - in practice if this stays at ath_txbuf then
all buffers can be tied up by a very busy broadcast domain (eg ARP
storm, way too much MDNS/NETBIOS). It's been like this in the
freebsd-wifi-build AP project for the longest time.
* Now that I figured out the hilarity inherent in aggregate forming
and AR9380 EDMA work, change the per-node to 64 frames by default.
I'll do some more work to shorten the queue latency introduced when
doing data so TCP isn't so terrible, but it's now no longer /always/
tens of milliseconds of extra latency when doing active iperf tests.
Notes:
The reason for the extra latency is partly tx/rx taskqueue handling and
scheduling, and partly due to a lack of airtime/QoS awareness of per-node
traffic. Ideally we'd have different limits/priorities on the QoS/TID
levels per node so say, voice/video data got a better share of buffer
allocations over best effort/bulk data, but we currently don't implement
that. It's not /hard/ to do, I just need to do it.
Tested:
* AR9380 (STA), AR9580 (hostap) - both with the relevant changes.
TCP is now at around 180mbit with rate control and RTS protection
enabled. UDP stays at 355mbit at MCS23, no HT protection.
This is two fixes, which establishes what I /think/ is pretty close to the
theoretical PHY maximum speed on the AR9380 devices.
* When doing A-MPDU on a TID, don't queue to the hardware directly if
the hardware queue is busy. This gives us time to get more packets
queued up (and the hardware is busy, so there's no point in queuing
more to the hardware right now) to potentially form an A-MPDU.
This fixes up the throughput issue I was seeing where a couple hundred
single frames were being sent a second interspersed between A-MPDU
frames. It just happened that the software queue had exactly one
frame in it at that point. Queuing it until the hardware finishes
transmitting isn't exactly costly.
* When determining whether to dequeue from a software node/TID queue into
the hardware queue, fix up the checks to work right for EDMA chips
(ar9380 and later.) Before it was not dispatching anything until
the FIFO was empty. Now we allow it to dispatch another aggregate
up to the hardware aggregate limit, like I intended with the earlier
work.
This allows a 5GHz HT40, short-GI, "htprotmode off" test at MCS23
to achieve 357 Mbit/sec in a one-way UDP test. The stars have to be
aligned /just right/ so there are no retries but it can happen.
Just don't expect it to work in an OTA test if your 2yo is running
around the room - MCS23 is very very sensitive to channel conditions.
Tested:
* AR9380 STA (test) -> AR9580 hostap
TODO:
* More thorough testing on pre-AR9380 chips (AR5416, AR9160, AR9280)
* (Finally) teach ath_rate_sample about throughput/latency rather than
air time, so I can get good transmit rates with a 2yo running around.
When investigating performance on UDP TX on the AR9380 I found that the
following sequence was occurring:
* INTR
* EINPROGRESS - nothing yet
* INTR
* TXSTATUS - process a TX completion for an aggregate
* INTR, INTR
* TXSTATUS - process a TX completion for an aggregate
* TXD, TXD ... populate frames from the hardware queue and submit
What should be happening is a completed TXSTATUS fires off more packets
that are queued on active TIDs.
What /was/ happening was after that first TXSTATUS the TX hardware queue
was still empty, so it didn't push anything into the FIFO. Only after the
second TXSTATUS did any progress get made.
This is one of two commits - it ensures that the software TX queue scheduler
is called /after/ TX completion, otherwise no frames from the software staging
queues will be processed into the hardware queues.
The second commit will fix it so it populates aggregate frames correctly
when the above occurs - right now ath_txq_sched() is called, but it doesn't
populate anything because its pre-check conditions are wrong.
Whilst here, add/tweak debugging.
Tested:
* AR9380 STA (testing device) -> AR9580 hostap
Building kernel with devel/powerpc64-gcc (6.2.0) yields the following error:
/usr/src/sys/powerpc/powerpc/db_trace.c:299:20: error: calling
'__builtin_frame_address' with a nonzero argument is unsafe
[-Werror=frame-address]
Work around this by dereferencing the frame address manually instead.
PR: 215600
Reported by: Mark Millard <markmi AT dsl-only DOT net>
MFC after: 2 weeks
This ioctl has been considered legacy by upstream since the DTrace code
was first imported, and is unused. The removal also allows some
simplification of dtrace_helper_slurp().
Also remove a bogus copyout in the DTRACEHIOC_ADDDOF handler. Due to a
bug, it would overwrite an in-memory copy of the DOF header rather than
the passed-in DOF helper. Moreover, DTRACEHIOC_ADDDOF already copies the
helper back out automatically since its argument has the IOC_OUT attribute.
critical_exit().
Based on the discussion with: jhb
Reviewed by: imp
Sponsored by: The FreeBSD Foundation
Differential revision: D9276
MFC after: 1 week
Our base binutils sets -many by default anyway, but external gcc may not do
this.
PR: kern/215948
Submitted by: Mark Millard <markmi AT dsl-only DOT net>
Reported by: Mark Millard
MFC after: 2 weeks
This is supposed to only be applied to the first subframe and only if
RTS/CTS is being done. I'm still not yet checking RTS/CTS exchange status
so it's just happening for all subframes on AR9380 and later.
This gets MCS23 throughput up from around 250mbit to 303mbit with RTS/CTS
protection enabled, and around 330mbit with no HT protection enabled.
Now, MCS23 has a PHY rate of 450mbit and we should be seeing closer to
400mbit for a straight one-way UDP test, but this beats the previous
maximum throughput.
Tested:
* AR9380 (STA) -> AR9580 (AP) - STA with the modifications, doing UDP TX
test using iperf.
mappings for armv6 pmap zero and copy operations to the MD PCPU region.
Change sysmap initialization to only allocate KVA pages for CPUs that
are actually present.
While here, collapse CMAP3 into CMAP2 (their use was mutually exclusive
anyway) and "recover" some space in PCPU padding that has always been
available due to 64-byte cacheline padding.
Reviewed by: skra
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D9172
descriptor state will not change anymore). This seems to eliminate the
race where we can miss a stalled queue under high load.
While here remove the unnecessary curly brackets.
Reported by: Konstantin Kormashev <konstantin@netgate.com>
MFC after: 3 days
Sponsored by: Rubicon Communications, LLC (Netgate)
Set both IEEE80211_HTCAP_LDPC and IEEE80211_HTC_TXLDPC capability flags
if LDPC is supported + set 'do_ldpc = 1' only when it is not disabled,
not just supported.
Reviewed by: adrian
Differential Revision: https://reviews.freebsd.org/D9277
- Pad small packets to 60 bytes and not 64 (exclude the CRC bytes);
- Pad the packet using m_append(9), if the packet has enough space for
padding, which is usually true, it will not be necessary to append a newly
allocated mbuf to the chain.
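The 60-byte figure follows from the 64-byte minimum frame length minus the 4 CRC bytes. A small sketch of the arithmetic, with the two constants defined locally for illustration and tx_pad_len() being a hypothetical helper:

```c
#include <stddef.h>

#define ETHER_MIN_LEN	64	/* minimum frame length, including CRC */
#define ETHER_CRC_LEN	4	/* length of the frame check sequence */

/* Bytes of zero padding needed for a frame of 'len' bytes (CRC excluded). */
size_t
tx_pad_len(size_t len)
{
	const size_t min_len = ETHER_MIN_LEN - ETHER_CRC_LEN;	/* 60 */

	return (len < min_len ? min_len - len : 0);
}
```

A driver would then pass that many zero bytes to m_append(9) before handing the packet to the hardware.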
Suggested by: yongari
MFC after: 3 days
Sponsored by: Rubicon Communications, LLC (Netgate)
It is only a first step and not perfect, but better than nothing.
The main blocker is the CAM target frontend, which cannot be unloaded,
since CAM does not currently have a mechanism to unregister a periph driver.
MFC after: 2 weeks
Stop testing for LK_RETRY and error multiple times. Also postpone the
VI_DOOMED until after LK_RETRY was seen as it reads from the vnode.
No functional changes.
A recent change enforced the VAP limit as well as the peer limit.
I now need to actually set iv_ampdu_limit or we don't transmit more
than 8K sized aggregates.
This restores the expected (suboptimal, but still much faster) behaviour.
Tested:
* AR9380, STA mode
SDM states that CLFLUSHOPT instructions can be ordered with other
writes by SFENCE, heavier MFENCE is not required.
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
The length of the scsi_set_timestamp_parameters struct was incorrect. LTO-5
drives don't care, but LTO-7 drives do.
Reviewed by: Sam Klopsch
MFC after: 2 weeks
Sponsored by: Spectra Logic Corp
configtimer().
During normal operation "state->nextcallopt" will always be less than
or equal to "state->nextcall" and checking only "state->nextcallopt"
before calling "callout_process()" is sufficient. However when
"configtimer()" is called a race might happen requiring both of these
binary times to be checked.
Short description of race:
1) A configtimer() call will reset both "state->nextcall" and
"state->nextcallopt" to the same binary time.
2) If a "callout_reset()" call happens between "configtimer()" and the
next "callout_process()" call, "state->nextcallopt" will get updated
and "state->nextcall" will remain at the current time. Refer to logic
inside cpu_new_callout().
3) getnextcpuevent() only respects "state->nextcall" and returns this
value over and over again, even if it is in the past, until "now >=
state->nextcallopt" becomes true. Then these two time variables are
corrected by a "callout_process()" call and the situation goes back to
normal.
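The race can be modelled with two plain integers. The sketch below is a reduced model under stated assumptions: struct pcpu_timer_state and both predicates are hypothetical, and only illustrate why consulting "nextcallopt" alone misses a "nextcall" that is already in the past.

```c
#include <stdbool.h>
#include <stdint.h>

typedef int64_t sbintime_model_t;	/* stand-in for sbintime_t */

struct pcpu_timer_state {		/* hypothetical, reduced state */
	sbintime_model_t nextcall;	/* next callout event */
	sbintime_model_t nextcallopt;	/* next optional callout event */
};

/* Check that only consults nextcallopt, as described for normal operation. */
bool
should_process_old(const struct pcpu_timer_state *s, sbintime_model_t now)
{
	return (now >= s->nextcallopt);
}

/* Check that consults both times, so a stale nextcall is also handled. */
bool
should_process_new(const struct pcpu_timer_state *s, sbintime_model_t now)
{
	return (now >= s->nextcallopt || now >= s->nextcall);
}
```

With nextcall = 100, nextcallopt = 500 and now = 200, the first predicate is false while the timer keeps firing for the expired "nextcall"; the second one lets the state be corrected immediately.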
The problem manifests itself in different ways. The common factor is
the timer process(es) consume all CPU on one or more CPU cores for a
long time, blocking other kernel processes from getting execution
time. This can be seen by very high interrupt counts as displayed by
"vmstat -i | grep timer" right after boot.
When EARLY_AP_STARTUP was enabled in r310177 the likelihood of hitting
this bug apparently increased.
Example output from "vmstat -i" before patch:
cpu0:timer 7591 69
cpu9:timer 39031773 358089
cpu4:timer 9359 85
cpu3:timer 9100 83
cpu2:timer 9620 88
Example output from "vmstat -i" after patch:
cpu0:timer 4242 34
cpu6:timer 5531 44
cpu3:timer 6450 52
cpu1:timer 4545 36
cpu9:timer 7153 58
Before the patch, cpu9 in the example above was spinning in a loop and
reached 39 million interrupts just a few seconds after
bootup. After the patch the timer interrupt counts are more or less
consistent.
Discussed with: mav @
Reported by: several people
MFC after: 1 week
Sponsored by: Mellanox Technologies
When ixgbe receives an interrupt indicating that a new optical module
may have been inserted, it discards all of its current media types
by calling ifmedia_removeall() and then creates a new set of media
types for the supported media on the new module. However,
ifmedia_removeall() was maintaining a pointer to whatever the
current media type was before the call to ifmedia_removeall().
The result of this was that any attempt to read the current media
type of the interface (e.g. via ifconfig) would return potentially
garbage data from free memory (or if one were particularly unlucky
on an architecture that does not malloc() from a direct map, page
fault the kernel).
Fix this by NULL'ing out the current media field in if_media.c,
and have ixgbe update the current media type after recreating
them.
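The dangling-pointer pattern and its fix can be shown with a toy list. This is a sketch only: struct media_list, media_add() and media_removeall() are hypothetical stand-ins for the ifmedia structures, not the code in if_media.c.

```c
#include <stddef.h>
#include <stdlib.h>

struct media_entry {			/* hypothetical list element */
	int word;
	struct media_entry *next;
};

struct media_list {			/* hypothetical, reduced media state */
	struct media_entry *head;
	struct media_entry *cur;	/* currently selected entry */
};

/* Prepends a new entry; returns it, or NULL on allocation failure. */
struct media_entry *
media_add(struct media_list *ml, int word)
{
	struct media_entry *e;

	e = malloc(sizeof(*e));
	if (e == NULL)
		return (NULL);
	e->word = word;
	e->next = ml->head;
	ml->head = e;
	return (e);
}

/* Frees every entry and clears 'cur' so that it cannot dangle. */
void
media_removeall(struct media_list *ml)
{
	struct media_entry *e, *next;

	for (e = ml->head; e != NULL; e = next) {
		next = e->next;
		free(e);
	}
	ml->head = NULL;
	ml->cur = NULL;
}
```

After media_removeall() the caller re-adds entries and sets "cur" again, mirroring what ixgbe does after recreating its media types.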
Submitted by: Matt Joras <matt.joras AT gmail DOT com>
Reviewed by: sbruno, erj
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D9164
For consistency with the qualifiers added in r310977, define a new
qualifier _Null_unspecified which is also defined in clang 3.7+.
Add two new macros:
__NULLABILITY_PRAGMA_PUSH
__NULLABILITY_PRAGMA_POP
These are for use in headers when we want to avoid noisy warnings if
some pointers are left without nullability annotations.
These are added well ahead of their first use so that the GCC
ports headers learn of their existence beforehand.
- Add new sysctl node to control the transmit packet bufring.
- Add optimised version of the transmit routine which output packets
directly to the DMA ring instead of using bufring in case the transmit
lock is congested. This can reduce the number of taskswitches which in
turn influence the overall system CPU usage, depending on the
workload.
- Add " TX" suffix to debug name for transmit mutexes to silence some
witness warnings about acquiring duplicate locks having same name.
MFC after: 1 week
Sponsored by: Mellanox Technologies
Suggested by: gallatin @
6569 large file delete can starve out write ops
illumos/illumos-gate@ff5177ee8b
https://www.illumos.org/issues/6569
The core issue I've found is that there is no throttle for how many
deletes get assigned to one TXG. As a result, when deleting large files
we end up filling consecutive TXGs with deletes/frees, then write
throttling other (more important) ops.
There is an easy test case for this problem. Try deleting several
large files (at least 1/2 TB) while you do write ops on the same
pool. What we've seen is performance of these write ops (let's
call it sideload I/O) would drop to zero.
More specifically the problem is that dmu_free_long_range_impl()
can/will fill up all of the dirty data in the pool "instantly",
before many of the sideload ops can get in. So sideload
performance will be impacted until all the files are freed.
The solution we have tested at Nexenta (with positive results)
creates a relatively simple throttle for how many "free" ops we let
into one TXG.
However this solution exposes other problems that should also be
addressed. If we are to slow down freeing of data that means one
has to wait even longer (assuming vnode ref count of 1) to get shell
back after an rm or for NFS thread to finish the free-ing op.
To avoid this the proposed solution is to call zfs_inactive() async
for "large" files. Async freeing then begs for the reclaimed space
to be accounted for in the zpool's "freeing" prop.
The other issue with having a longer delete is the inability to
export/unmount for a longer period of time. The proposed solution
is to interrupt freeing of blocks when a fs is unmounted.
Author: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Reviewed by: avg
Differential Revision: D9008
It's possible to get EFAULT when writing a segment backed by a file
if the segment extends beyond the file.
The core dump could still be useful if we skip the rest of the segment
and proceed to other segments.
The skipped segment (or a portion of it) will be zero-filled.
While there, use 'const' to signify that core_write() only reads the
buffer and use __DECONST before calling vn_rdwr_inchunks() because it
can be used for both reading and writing.
Before the change:
kernel: Failed to write core file for process mmap_trunc_core (error 14)
kernel: pid 77718 (mmap_trunc_core), uid 1001: exited on signal 6
After the change:
kernel: Failed to fully fault in a core file segment at VA 0x800645000 with size 0x4000 to be written at offset 0x29000 for process mmap_trunc_core
kernel: pid 4901 (mmap_trunc_core), uid 1001: exited on signal 6 (core dumped)
Reviewed by: julian, kib
Obtained from: Panzura (older version of the change)
MFC after: 5 days
Sponsored by: Panzura
Differential Revision: https://reviews.freebsd.org/D9233
The error is:
vmm_dev.c: In function 'alloc_memseg':
vmm_dev.c:261:11: error: null argument where non-null required (argument 1) [-Werror=nonnull]
Apparently, gcc is unable to figure out that if a ternary operator
produced a non-NULL value once, then the operator with exactly the same
operands would produce the same value again.
MFC after: 1 week
Add own state variable to track if a sendqueue is stopped or not.
This will prevent traffic from entering the sendqueue while it is
being destroyed.
Update drain function to wait for traffic to be transmitted before
returning when the link state is active.
Add extra checks in transmit path for stopped SQ's.
While at it:
- Use likely() for a mbuf pointer check.
- Remove redundant IFF_DRV_RUNNING check.
MFC after: 1 week
Sponsored by: Mellanox Technologies
There were several places where references to compression were left
unfinished. Furthermore, KASSERTs contained references to MPPC_INVALID
which is not defined in the tree and therefore were sure to break with
INVARIANTS: comment them out.
Reported by: Eugene Grosbein
PR: 216265
MFC after: 3 days
All of the printing from the tables file now has wrappers so that the
handling is cleaner and it's possible to print something out (say, during
development) without having to fight the global debug flags. This re-org
will also make it easier to have the tables be compiled out at build time
if desired.
Other than fixing some minor bugs, there are no user-visible changes from
this change.
Sponsored by: Netflix, Inc.
Differential Revision: D9238
users that choose not to use EARLY_AP_STARTUP.
There is still an initialization issue/panic with !SMP and !EARLY_AP_STARTUP
that we have yet to resolve.
Submitted by: bde
The option "nonc" disables using of namecache for the created mount,
by default namecache is used. The rationale for the option is that
namecache duplicates the information which is already kept in memory
by tmpfs. Since it is believed that namecache scales better than tmpfs,
or will scale better, do not enable the option by default. On the
other hand, smaller machines may benefit from lesser namecache
pressure.
Discussed with: mjg
Tested by: pho (as part of larger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
For directories, node->tn_spec.tn_dir.tn_parent pointer to the parent
is used. For non-directories, the implementation is naive, all
directory nodes are scanned to find a dirent linking the specified
node. This can be significantly improved by maintaining tn_parent for
all nodes, later.
Tested by: pho (as part of larger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
On dotdot lookup and fhtovp operations, it is possible for the file
represented by tmpfs node to be removed after the thread calculated
the pointer. In this case, tmpfs_alloc_vp() accesses freed memory.
Introduce the reference count on the nodes. The allnodes list from
tmpfs mount owns 1 reference, and threads performing unlocked
operations on the node, add one transient reference. Similarly, since
struct tmpfs_mount maintains the list where nodes are enlisted,
refcount it by one reference from struct mount and one reference from
each node on the list. Both nodes and tmpfs_mounts are removed when
refcount goes to zero.
Note that this means that nodes and tmpfs_mounts might survive some
time after the node is deleted or tmpfs_unmount() finished. The
tmpfs_alloc_vp() in these cases returns error either due to node
removal (tn_nlinks == 0) or because of insmntque1(9) error.
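A minimal single-threaded sketch of the node reference counting described above. struct node and the node_*() helpers are hypothetical; the real code would need locking or atomic operations, which are omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct node {		/* hypothetical, reduced stand-in for a tmpfs node */
	unsigned refs;	/* one for the allnodes list plus transient holders */
	int nlinks;	/* 0 once the file has been removed */
};

/* Allocates a node holding the single reference owned by the list. */
struct node *
node_alloc(void)
{
	struct node *n;

	n = calloc(1, sizeof(*n));
	if (n != NULL) {
		n->refs = 1;
		n->nlinks = 1;
	}
	return (n);
}

/* Takes a transient reference for an unlocked operation. */
void
node_hold(struct node *n)
{
	n->refs++;
}

/* Drops one reference; frees the node and returns true on the last one. */
bool
node_release(struct node *n)
{
	assert(n->refs > 0);
	if (--n->refs > 0)
		return (false);
	free(n);
	return (true);
}
```

A thread that took node_hold() can still inspect nlinks after the file is removed and fail cleanly, instead of touching freed memory.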
Tested by: pho (as part of larger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Commit r270423 fixed a regression in sched_yield() that was introduced
in earlier changes. Unfortunately, at the same time it introduced a
new regression. The problem is that SWT_RELINQUISH (6), like all other
SWT_* constants and unlike SW_* flags, is not a bit flag. So, (flags &
SWT_RELINQUISH) is true in cases where that was not really intended,
for example, with SWT_OWEPREEMPT (2) and SWT_REMOTEPREEMPT (11).
A straightforward fix would be to use (flags & SW_TYPE_MASK) ==
SWT_RELINQUISH, but my impression is that the switch types are designed
mostly for gathering statistics, not for influencing scheduling
decisions.
So, I decided that it would be better to check for SW_PREEMPT flag
instead. That's also the same flag that was checked before r239157.
I double-checked how that flag is used and I am confident that the flag
is set only in the places where we really have the preemption:
- critical_exit + td_owepreempt
- sched_preempt in the ULE scheduler
- sched_preempt in the 4BSD scheduler
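The bit arithmetic behind the regression can be checked directly. In the sketch below the three SWT_* values are the ones quoted above; the SW_TYPE_MASK and SW_PREEMPT values are assumptions chosen for illustration, and the three predicates are hypothetical helpers.

```c
#include <stdbool.h>

/* Values quoted in the text above. */
#define SWT_OWEPREEMPT		2
#define SWT_RELINQUISH		6
#define SWT_REMOTEPREEMPT	11
/* Assumed for illustration: the low byte is the type, flags sit above it. */
#define SW_TYPE_MASK		0xff
#define SW_PREEMPT		0x0400

/* The problematic test: treats a switch type as if it were a bit flag. */
bool
relinquish_bitwise(int flags)
{
	return ((flags & SWT_RELINQUISH) != 0);
}

/* The straightforward fix: compare the masked type for equality. */
bool
relinquish_masked(int flags)
{
	return ((flags & SW_TYPE_MASK) == SWT_RELINQUISH);
}

/* The approach chosen here: test the real bit flag instead. */
bool
is_preempt(int flags)
{
	return ((flags & SW_PREEMPT) != 0);
}
```

Since 2 & 6 and 11 & 6 both equal 2, the bitwise test fires for SWT_OWEPREEMPT and SWT_REMOTEPREEMPT, while the masked comparison and the SW_PREEMPT test do not.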
Reviewed by: kib, mav
MFC after: 4 days
Sponsored by: Panzura
Differential Revision: https://reviews.freebsd.org/D9230
sure the XHCI controller is reset after halting it. The problem is
clearly a BIOS bug as the suspend and resume is failing without
loading the XHCI driver. The same happens when using Linux and the
XHCI driver is not loaded.
Submitted by: Yanko Yankulov <yanko.yankulov@gmail.com>
PR: 216261
MFC after: 1 week
As suggested in r167010, use the structure type and macros to access and
modify UFS2 extended attributes. Add assertions that pointers are
aligned in places where we now access the data through a structure
pointer, instead of character-by-character.
PR: 216127
Reported by: dewayne at heuristicsystems.com.au
Reviewed by: kib@
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D9225
disabled (Hi netmap!).
Only remove the CRC bytes from packets when the hardware tells us to do so.
Fixes the 'discard frame w/o leading ethernet header' issues.
Sponsored by: Rubicon Communications, LLC (Netgate)
Remove TMPFS_ASSERT_ELOCKED(). Its claims are already stated by other
asserts nearby and by VFS guarantees.
Change TMPFS_ASSERT_LOCKED() and one inlined place to use
ASSERT_VOP_(E)LOCKED() instead of hand-rolled imprecise asserts.
Tested by: pho (as part of the larger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Edit comments which explain no longer relevant details, and add
locking annotations to the struct tmpfs_node members.
Tested by: pho (as part of the larger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
The ea_name string is not nul-terminated. Correct the documentation.
Because the subsequent field is padded to 8 bytes, and the padding is
zeroed, the ea_name string will appear to be nul-terminated whenever the
length isn't exactly one (mod eight).
This was introduced in r167010 (2007).
Additionally, mark the length fields as unsigned. This particularly
matters for the single byte ea_namelength field, which can represent
extended attribute names up to 255 bytes long.
No functional change.
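The "one (mod eight)" remark can be verified with a little arithmetic, assuming the name is preceded by a 7-byte header (a 4-byte length plus three single-byte fields). EA_HDR_LEN and both helpers below are hypothetical and only illustrate the reasoning.

```c
#include <stddef.h>
#include <stdint.h>

#define EA_HDR_LEN	7	/* assumed: 4 + 1 + 1 + 1 bytes before ea_name */

/* Zero bytes between the end of the name and the 8-byte-aligned content. */
size_t
ea_name_pad(uint8_t namelength)
{
	size_t used = EA_HDR_LEN + (size_t)namelength;

	return ((8 - used % 8) % 8);
}

/*
 * What a signed single-byte length field would report for the same bits.
 * The conversion is implementation-defined; common compilers wrap.
 */
int
ea_namelength_if_signed(uint8_t namelength)
{
	return ((int8_t)namelength);
}
```

Under that assumption a name of length 1, 9, 17, ... gets no padding at all, so nothing terminates it; any other length is followed by at least one zero byte. The second helper shows why a signed field misreports names longer than 127 bytes.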
PR: 216127
Reported by: dewayne at heuristicsystems.com.au
Reviewed by: kib@
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D9206
- Add RATELIMIT kernel configuration keyword which must be set to
enable the new functionality.
- Add support for hardware driven, Receive Side Scaling, RSS aware, rate
limited sendqueues and expose the functionality through the already
established SO_MAX_PACING_RATE setsockopt(). The API supports rates in
the range from 1 to 4Gbytes/s which are suitable for regular TCP and
UDP streams. The setsockopt(2) manual page has been updated.
- Add rate limit function callback API to "struct ifnet" which supports
the following operations: if_snd_tag_alloc(), if_snd_tag_modify(),
if_snd_tag_query() and if_snd_tag_free().
- Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT
flag, which tells if a network driver supports rate limiting or not.
- This patch also adds support for rate limiting through VLAN and LAGG
intermediate network devices.
- How rate limiting works:
1) The userspace application calls setsockopt() after accepting or
making a new connection to set the rate which is then stored in the
socket structure in the kernel. Later on when packets are transmitted
a check is made in the transmit path for rate changes. A rate change
implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the
destination network interface, which then sets up a custom sendqueue
with the given rate limitation parameter. A "struct m_snd_tag" pointer is
returned which serves as a "snd_tag" hint in the m_pkthdr for the
subsequently transmitted mbufs.
2) When the network driver sees the "m->m_pkthdr.snd_tag" different
from NULL, it will move the packets into a designated rate limited sendqueue
given by the snd_tag pointer. It is up to the individual drivers how the
rate limiting is implemented.
3) Route changes are detected by the NIC drivers in the ifp->if_transmit()
routine when the ifnet pointer in the incoming snd_tag mismatches the
one of the network interface. The network adapter frees the mbuf and
returns EAGAIN which causes the ip_output() to release and clear the send
tag. Upon the next ip_output(), allocation of a new "snd_tag" will be attempted.
4) When the PCB is detached the custom sendqueue will be released by a
non-blocking ifp->if_snd_tag_free() call to the currently bound network
interface.
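Steps 2 and 3 above can be modelled in a few lines. This is a userland sketch with hypothetical stand-in types (ifnet_model, snd_tag_model, pkt_model) and a hypothetical model_transmit(); it is not the driver code, only the decision it describes.

```c
#include <errno.h>
#include <stddef.h>

struct ifnet_model {			/* stand-in for struct ifnet */
	int unit;
};

struct snd_tag_model {			/* stand-in for struct m_snd_tag */
	struct ifnet_model *ifp;	/* interface that allocated the tag */
};

struct pkt_model {			/* stand-in for an mbuf */
	struct snd_tag_model *snd_tag;	/* NULL means not rate limited */
};

/*
 * Returns 0 if the packet may be queued on 'ifp', or EAGAIN if its tag
 * belongs to another interface (a route change). '*ratelimited' reports
 * whether a rate limited sendqueue would be selected.
 */
int
model_transmit(struct ifnet_model *ifp, const struct pkt_model *m,
    int *ratelimited)
{
	*ratelimited = 0;
	if (m->snd_tag == NULL)
		return (0);
	if (m->snd_tag->ifp != ifp)
		return (EAGAIN);
	*ratelimited = 1;
	return (0);
}
```

On EAGAIN the caller would release and clear the tag, so that a fresh one can be allocated against the new interface on the next transmit.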
Reviewed by: wblock (manpages), adrian, gallatin, scottl (network)
Differential Revision: https://reviews.freebsd.org/D3687
Sponsored by: Mellanox Technologies
MFC after: 3 months
the fifth argument to functions being traced, however there was an error
where the userspace stack was being used. This may be invalid leading to
a kernel panic if this address is unmapped.
Submitted by: Graeme Jenkinson <graeme.jenkinson@cl.cam.ac.uk>
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D9229
As the efi_devpath_last_node() and efi_devpath_trim() can return NULL
pointers, the consumers of this API should check for NULL pointers.
Same for efinet_dev_init() using calloc().
Reported by: Robert Mustacchi <rm@joyent.com>
Reviewed by: jhb, allanjude
Approved by: allanjude (mentor)
Differential Revision: https://reviews.freebsd.org/D9203
Clang apparently requires the explicit form of this instruction, and rejects
uses which ignore the optional cmpD register. This was the only use of the
shorthand form of the instruction, so just fix it up to match the others.
PR: kern/215681
Submitted by: Mark Millard
Reported by: Mark Millard <markmi _AT_ dsl-only.net>
MFC after: 2 weeks
Languages like C++17 and Go provide direct support for slice types:
pointer/length pairs. The CloudABI generator now has more complete support for
this, meaning that for the C binding, pointer/length pairs now use an
automatic naming scheme of ${name} and ${name}_len.
Apart from this change and some reformatting, the ABI definitions are
identical. Binary compatibility is preserved entirely.
This field has no practical use and is never read. Initiators already
receive respective residual size from frontends. Removed field had
different semantics, which looks useless, and was never passed through
by any frontend.
While there, fix kern_data_resid field support in case of HA, missed in
r312291.
MFC after: 13 days
This lock was replaced from rwlock in r272840. But unlike rwlock, rmlock
doesn't allow recursion on rm_rlock(), so at this time fix this with
RM_RECURSE flag. Later we need to change ipfw to avoid such recursions.
PR: 216171
Reported by: Eugene Grosbein
MFC after: 1 week
The Zedboard has a hardware bug where initialization of the USB PHY
occasionally fails on boot-up. Fix a regression in -CURRENT where the
kernel panics on such an occasion. The 11-RELEASE branch works fine.
PR: 215862
Submitted by: Thomas Skibo <thoma555-bsd@yahoo.com>
Previously "panic: msleep" could happen for a few different reasons.
Break the KASSERTs out into individual cases to identify the failing
condition. Found during the investigation that resulted in r308288.
Reviewed by: kib, jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D8604
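
The idea of breaking a combined assertion into individual cases can be
sketched in userland as follows. CHECK() is a stand-in for KASSERT() that
returns the message instead of panicking, and the sleep_args structure, the
conditions and the message strings are all hypothetical; they are not the
checks actually made by msleep().

```c
#include <assert.h>
#include <stddef.h>

/*
 * Userland stand-in for KASSERT(): returns the message of the failed
 * check instead of panicking, so the sketch can be exercised.
 */
#define	CHECK(cond, msg) do {			\
	if (!(cond))				\
		return (msg);			\
} while (0)

/* Hypothetical sleep arguments. */
struct sleep_args {
	const void	*chan;		/* wait channel */
	const void	*lock;		/* interlock, may be NULL */
	int		 timo;		/* timeout, 0 means none */
};

/* Before: one combined check, so every failure reports the same text. */
static const char *
check_combined(const struct sleep_args *a)
{
	assert(a != NULL);
	CHECK(a->chan != NULL && (a->timo != 0 || a->lock != NULL),
	    "msleep");
	return (NULL);
}

/* After: one check per condition, each with its own message. */
static const char *
check_split(const struct sleep_args *a)
{
	assert(a != NULL);
	CHECK(a->chan != NULL, "sleeping without a wait channel");
	CHECK(a->timo != 0 || a->lock != NULL,
	    "sleeping without a lock or a timeout");
	return (NULL);
}
```

With the split form, the message alone tells which condition failed, which
is what makes a bare "panic: msleep" easier to diagnose.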
sources to return timestamps when SO_TIMESTAMP is enabled. Two additional
clock sources are:
o nanosecond resolution realtime clock (equivalent of CLOCK_REALTIME);
o nanosecond resolution monotonic clock (equivalent of CLOCK_MONOTONIC).
In addition to this, this option provides a unified interface to get bintime
(equivalent of using SO_BINTIME), except it is also supported with IPv6, where
SO_BINTIME has never been supported. The long term plan is to deprecate
SO_BINTIME and move everything to using SO_TS_CLOCK.
The idea for this enhancement was briefly discussed at the Net session
during the dev summit in Ottawa last June, and the general input was positive.
This change is believed to benefit network benchmarks/profiling as well
as other scenarios where precise time of arrival measurement is necessary.
There are two regression test cases as part of this commit: one extends the
unix domain test code (unix_cmsg) to test the new SCM_XXX types, and another
implements a totally new test case which exchanges UDP packets between two
processes using both conventional methods (i.e. calling clock_gettime(2)
before recv(2) and after send(2)), as well as using setsockopt()+recv() in
the receive path. The resulting delays are checked for sanity for all
supported clock types.
Reviewed by: adrian, gnn
Differential Revision: https://reviews.freebsd.org/D9171
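
A hedged sketch of how a program might select the monotonic clock source
with SO_TS_CLOCK. The helper enable_monotonic_timestamps() is hypothetical,
and the constant name SO_TS_MONOTONIC is an assumption, since the text
above does not spell out the option values. The code is guarded so that it
still compiles where the option is not defined.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Ask for receive timestamps taken from the monotonic clock.  Returns 0
 * on success and -1 with errno set on failure.  Where the headers lack
 * SO_TS_CLOCK, the function fails with ENOPROTOOPT.
 */
static int
enable_monotonic_timestamps(int fd)
{
#if defined(SO_TS_CLOCK) && defined(SO_TS_MONOTONIC)
	int on = 1;
	int clock = SO_TS_MONOTONIC;	/* assumed constant name */

	/* Timestamps are only returned when SO_TIMESTAMP is enabled. */
	if (setsockopt(fd, SOL_SOCKET, SO_TIMESTAMP, &on, sizeof(on)) == -1)
		return (-1);
	return (setsockopt(fd, SOL_SOCKET, SO_TS_CLOCK, &clock,
	    sizeof(clock)));
#else
	(void)fd;
	errno = ENOPROTOOPT;
	return (-1);
#endif
}
```

After a successful call, the timestamp would be read from the control
message returned by recvmsg(2), as with plain SO_TIMESTAMP.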
gtaskqueue bits at SI_SUB_INIT_IF instead of waiting until SI_SUB_SMP
which is far too late.
Add an assertion in taskqgroup_attach() to catch startup initialization
failures in the future.
Reported by: kib, bde
it into pmap-v4.h where they are used. Other than those few lines of
support for different MMU types, nothing in cpuconf.h has been used in our
code for quite a while.
The file existed to set up a variety of symbols to describe the
architecture. Over the past few years we have converted all of our source
to use the new architecture symbols standardized by ARM Inc, and predefined
by both clang and gcc.
PR: 216104
It seems like kern_data_resid was never really implemented. This change
finally does it. Now frontends update this field while transferring data,
and CTL and the backends that receive it can handle the result more flexibly.
At this point behavior should not change significantly, still reporting
errors on write overrun, but that may be changed later, if we decide so.
CAM target frontend still does not properly handle overruns due to CAM API
limitations. We may need to add some fields to struct ccb_accept_tio to
pass information about initiator requested transfer size(s).
MFC after: 2 weeks
This patch adds a driver for a temperature/humidity sensor connected via GPIO.
To compile it into the kernel, add "device gpioths". To activate the driver,
use hints (.at and .pins) for gpiobus. As a result it will provide temperature
and humidity values via sysctl.
DHT11 is a cheap and popular temperature/humidity sensor used via GPIO on ARM
or MIPS devices like the Raspberry Pi or Onion Omega.
Reviewed by: adrian
Approved by: adrian (mentor)
Differential Revision: https://reviews.freebsd.org/D9185
find_currdev() uses the variable "copy" to store the reference to the trimmed
devpath pointer. If for some reason efi_devpath_handle() fails, we will
leak this copy.
Also, we can simplify the code there a bit.
Reviewed by: allanjude
Approved by: allanjude (mentor)
Differential Revision: https://reviews.freebsd.org/D9191
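
The leak described for find_currdev() follows a common pattern: a temporary
copy is made, a later step fails, and the early return skips the free. A
minimal sketch of the leak-free shape, assuming hypothetical helpers
trim_copy(), release_copy(), lookup_handle() and find_device() that stand
in for the loader's functions:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Counts outstanding copies so the sketch can show nothing is leaked. */
static int live_copies;

/* Hypothetical stand-in for trimming a device path: returns a new copy. */
static char *
trim_copy(const char *path)
{
	size_t len = strlen(path);
	char *copy;

	copy = malloc(len + 1);
	if (copy == NULL)
		return (NULL);
	memcpy(copy, path, len + 1);
	live_copies++;
	return (copy);
}

static void
release_copy(char *copy)
{
	if (copy != NULL) {
		free(copy);
		live_copies--;
	}
}

/* Hypothetical stand-in for a handle lookup that may fail. */
static int
lookup_handle(const char *path, int *handle)
{
	if (path[0] == '\0')
		return (-1);
	*handle = (int)strlen(path);
	return (0);
}

/* Releases the copy on every path, including the failure path. */
static int
find_device(const char *path, int *handle)
{
	char *copy;
	int error;

	assert(path != NULL && handle != NULL);
	copy = trim_copy(path);
	if (copy == NULL)
		return (-1);
	error = lookup_handle(copy, handle);
	release_copy(copy);	/* freed whether or not the lookup failed */
	return (error);
}
```

Freeing the copy before inspecting the result keeps a single cleanup point,
which is also one way of simplifying such code.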
Replace archaic "busses" with modern form "buses."
Intentionally excluded:
* Old/random drivers I didn't recognize
* Old hardware in general
* Use of "busses" in code as identifiers
No functional change.
http://grammarist.com/spelling/buses-busses/
PR: 216099
Reported by: bltsrc at mail.ru
Sponsored by: Dell EMC Isilon
arswitch_setled() and a number of _global_setup functions did not acquire the
lock before calling arswitch_modifyreg(). With WITNESS enabled this would
instantly panic.
Discovered on a TPLink-3600:
("panic: mutex arswitch not owned at sys/dev/etherswitch/arswitch/arswitch_reg.c:236")
Reviewed by: adrian, kan
Differential Revision: https://reviews.freebsd.org/D9187
between exp(3) and `exp` var.
The approach taken previously was not ideal for multiple
functional and stylistic reasons.
Extend the existing sed call in the Makefile to replace `exp` with
`exponent` instead.
MFC after: 13 days
Requested by: bde
vm_object_madvise() is frequently used to apply advice to a contiguous
set of pages in an object with no backing object. Optimize this case by
skipping non-resident subranges in constant time, and by iterating over
resident pages using the object memq, thus avoiding radix tree lookups on
each page index in the specified range.
While here, move MADV_WILLNEED handling to vm_page_advise(), and rename the
"advise" parameter to vm_object_madvise() to "advice."
Reviewed by: alc, kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D9098
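
The vm_object_madvise() optimization can be pictured with a small userland
sketch: rather than looking up every page index in the range, walk a list
of resident pages kept sorted by index. The struct page type and
advise_range() below are hypothetical and much simpler than the kernel's
vm_page and memq; in particular, the first page is found here by a linear
scan, which is an assumption of the sketch and not how the kernel does it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical resident page, linked in ascending pindex order. */
struct page {
	unsigned long	 pindex;
	int		 advised;
	struct page	*next;
};

/*
 * Apply advice to the resident pages in [start, end) by following the
 * sorted list.  A gap between two resident pages costs a single step,
 * however many non-resident indices it spans.  Returns the number of
 * pages visited.
 */
static size_t
advise_range(struct page *first, unsigned long start, unsigned long end)
{
	struct page *p;
	size_t visited = 0;

	assert(start <= end);
	/* Skip to the first resident page at or after start. */
	for (p = first; p != NULL && p->pindex < start; p = p->next)
		;
	/* Visit only resident pages; stop at the end of the range. */
	for (; p != NULL && p->pindex < end; p = p->next) {
		p->advised = 1;
		visited++;
	}
	return (visited);
}
```

For a range covering two million indices with two resident pages, the walk
touches two pages instead of performing two million lookups.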