when I reordered events in accept1() to allocate a file descriptor
earlier, I didn't properly update use of goto on exit to unwind for
cases where the file descriptor is now held, but wasn't previously.
The result was that, in the event of accept() on a non-blocking socket,
or in the event of a socket error, a file descriptor would be leaked.
This ended up being non-fatal in many cases, as the file descriptor
would be properly GC'd on process exit, so only showed up for processes
that do a lot of non-blocking accept() calls, and also live for a long
time (such as qmail).
This change updates the use of goto targets to do additional unwinding.
Eyes provided by: Brian Feldman <green@freebsd.org>
Feet, hands provided by: Stefan Ehmann <shoesoft@gmx.net>,
Dimitry Andric <dimitry@andric.com>
Arjan van Leeuwen <avleeuwen@piwebs.com>
is generic to any threading system. This commit does not link this
file to the build yet, nor does it remove these functions from their
current location in kern_thread.c. (that commit coming up after further review)
with an ASUS A7N8X-E motherboard in APIC mode, since storming interrupts
don't repeat immediately. Use DELAY(1) to wait a bit for them to repeat.
This affects all systems. Only delay for the first
(10 * intr_storm_threshold) interrupts (per interrupt handler) so that
this is only a pessimization while warming up. Throttle after calling
the sub-handlers instead of before so that the long delay given by
throttling can be used instead of the DELAY(1) to detect storms after
warming up.
Reduced the throttling period from 1/10 second to 1/hz seconds so that
throttling doesn't destroy performance so much. Interrupts that are
detected as storming are effectively handled by polling at a frequency
of hz Hz. On A7N8X-E's there is another hardware or configuration bug
that makes the throttled frequency closer to 2*hz Hz.
tree, output an empty string instead of "?". This is already what
happened with DEVICE_SYSCTL_LOCATION and DEVICE_SYSCTL_PNPINFO. This
makes the output of "sysctl dev" much nicer (it won't display those
empty sysctls).
Reviewed by: des
size_t and size_t *, respectively. Update callers for the new interface.
This is a better fix for overflows that occurred when dumping segments
larger than 2GB to core files.
called ttyldoptim().
Use this function from all the relevant drivers.
I belive no drivers finger linesw[] directly anymore, paving the way for
locking and refcounting.
class variables in addition to per-device variables. In plain English,
this means that dev.foo0.bar is now called dev.foo.0.bar, and it is
possible to to have dev.foo.bar as well.
double NULL entries signal Witness to stop processing the array of
order entries meaning none of the spin locks are added resulting in
panics on boot.
- Add a missing NULL, NULL terminator to the Slip locks list to keep them
separate from the spin locks.
relationships:
Sockets: filedesc->accept->sellck
Routing: radix node head->rtentry->ifaddr
UDP: udp->udpinp
TCP: tcp->tcpinp
SLIP: slip_mtx->slip sc_mtx
Drop in a place holder section for UNIX domain sockets. Various
sections to be expanded over the next few days.
sched_clock() rather than using callouts. This means we no longer have to
take the load of the callout thread into consideration while balancing and
should make the balancing decisions simpler and more accurate.
Tested on: x86/UP, amd64/SMP
global mutex, accept_mtx, which serializes access to the following
fields across all sockets:
so_qlen so_incqlen so_qstate
so_comp so_incomp so_list
so_head
While providing only coarse granularity, this approach avoids lock
order issues between sockets by avoiding ownership of the fields
by a specific socket and its per-socket mutexes.
While here, rewrite soclose(), sofree(), soaccept(), and
sonewconn() to add assertions, close additional races and address
lock order concerns. In particular:
- Reorganize the optimistic concurrency behavior in accept1() to
always allocate a file descriptor with falloc() so that if we do
find a socket, we don't have to encounter the "Oh, there wasn't
a socket" race that can occur if falloc() sleeps in the current
code, which broke inbound accept() ordering, not to mention
requiring backing out socket state changes in a way that raced
with the protocol level. We may want to add a lockless read of
the queue state if polling of empty queues proves to be important
to optimize.
- In accept1(), soref() the socket while holding the accept lock
so that the socket cannot be free'd in a race with the protocol
layer. Likewise in netgraph equivilents of the accept1() code.
- In sonewconn(), loop waiting for the queue to be small enough to
insert our new socket once we've committed to inserting it, or
races can occur that cause the incomplete socket queue to
overfill. In the previously implementation, it was sufficient
to simply tested once since calling soabort() didn't release
synchronization permitting another thread to insert a socket as
we discard a previous one.
- In soclose()/sofree()/et al, it is the responsibility of the
caller to remove a socket from the incomplete connection queue
before calling soabort(), which prevents soabort() from having
to walk into the accept socket to release the socket from its
queue, and avoids races when releasing the accept mutex to enter
soabort(), permitting soabort() to avoid lock ordering issues
with the caller.
- Generally cluster accept queue related operations together
throughout these functions in order to facilitate locking.
Annotate new locking in socketvar.h.
descriptors out of fdrop_locked() and into vn_closefile(). This
removes all knowledge of vnodes from fdrop_locked(), since the lock
behavior was specific to vnodes. This also removes the specific
requirement for Giant in fdrop_locked(), it's now only required by
code that it calls into.
Add GIANT_REQUIRED to vn_closefile() since VFS requires Giant.
nextpkt within the m_hdr was not being initialized to NULL for
!M_PKTHDR cases. *Maybe* this will fix weird socket buffer
inconsistency panics, but we'll see.
the socket is on an accept queue of a listen socket. This change
renames the flags to SQ_COMP and SQ_INCOMP, and moves them to a new
state field on the socket, so_qstate, as the locking for these flags
is substantially different for the locking on the remainder of the
flags in so_state.
them to behave the same as if the SS_NBIO socket flag had been set
for this call. The SS_NBIO flag for ordinary sockets is set by
fcntl(fd, F_SETFL, O_NONBLOCK).
Pass the MSG_NBIO flag to the soreceive() and sosend() calls in
fifo_read() and fifo_write() instead of frobbing the SS_NBIO flag
on the underlying socket for each I/O operation. The O_NONBLOCK
flag is a property of the descriptor, and unlike ordinary sockets,
fifos may be referenced by multiple descriptors.
mbuma is an Mbuf & Cluster allocator built on top of a number of
extensions to the UMA framework, all included herein.
Extensions to UMA worth noting:
- Better layering between slab <-> zone caches; introduce
Keg structure which splits off slab cache away from the
zone structure and allows multiple zones to be stacked
on top of a single Keg (single type of slab cache);
perhaps we should look into defining a subset API on
top of the Keg for special use by malloc(9),
for example.
- UMA_ZONE_REFCNT zones can now be added, and reference
counters automagically allocated for them within the end
of the associated slab structures. uma_find_refcnt()
does a kextract to fetch the slab struct reference from
the underlying page, and lookup the corresponding refcnt.
mbuma things worth noting:
- integrates mbuf & cluster allocations with extended UMA
and provides caches for commonly-allocated items; defines
several zones (two primary, one secondary) and two kegs.
- change up certain code paths that always used to do:
m_get() + m_clget() to instead just use m_getcl() and
try to take advantage of the newly defined secondary
Packet zone.
- netstat(1) and systat(1) quickly hacked up to do basic
stat reporting but additional stats work needs to be
done once some other details within UMA have been taken
care of and it becomes clearer to how stats will work
within the modified framework.
From the user perspective, one implication is that the
NMBCLUSTERS compile-time option is no longer used. The
maximum number of clusters is still capped off according
to maxusers, but it can be made unlimited by setting
the kern.ipc.nmbclusters boot-time tunable to zero.
Work should be done to write an appropriate sysctl
handler allowing dynamic tuning of kern.ipc.nmbclusters
at runtime.
Additional things worth noting/known issues (READ):
- One report of 'ips' (ServeRAID) driver acting really
slow in conjunction with mbuma. Need more data.
Latest report is that ips is equally sucking with
and without mbuma.
- Giant leak in NFS code sometimes occurs, can't
reproduce but currently analyzing; brueffer is
able to reproduce but THIS IS NOT an mbuma-specific
problem and currently occurs even WITHOUT mbuma.
- Issues in network locking: there is at least one
code path in the rip code where one or more locks
are acquired and we end up in m_prepend() with
M_WAITOK, which causes WITNESS to whine from within
UMA. Current temporary solution: force all UMA
allocations to be M_NOWAIT from within UMA for now
to avoid deadlocks unless WITNESS is defined and we
can determine with certainty that we're not holding
any locks when we're M_WAITOK.
- I've seen at least one weird socketbuffer empty-but-
mbuf-still-attached panic. I don't believe this
to be related to mbuma but please keep your eyes
open, turn on debugging, and capture crash dumps.
This change removes more code than it adds.
A paper is available detailing the change and considering
various performance issues, it was presented at BSDCan2004:
http://www.unixdaemons.com/~bmilekic/netbuf_bmilekic.pdf
Please read the paper for Future Work and implementation
details, as well as credits.
Testing and Debugging:
rwatson,
brueffer,
Ketrien I. Saihr-Kesenchedra,
...
Reviewed by: Lots of people (for different parts)