Typically <foo>list is used for a structure that holds a list head in
FreeBSD, not for members of a list. As such, rename 'struct aiocblist'
to 'struct kaiocb' (the kernel version of 'struct aiocb').
While here, use more consistent variable names for AIO control blocks:
- Use 'job' instead of 'aiocbe', 'cb', 'cbe', or 'iocb' for kernel job
objects.
- Use 'jobn' instead of 'cbn' for use with TAILQ_FOREACH_SAFE().
- Use 'sjob' and 'sjobn' instead of 'scb' and 'scbn' for fsync jobs.
- Use 'ujob' instead of 'aiocbp', 'job', 'uaiocb', or 'uuaiocb' to hold
a user pointer to a 'struct aiocb'.
- Use 'ujobp' instead of 'aiocbp' for a user pointer to a 'struct aiocb *'.
Reviewed by: kib
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D5125
with interpreter name exactly matching one wanted by the binary. If
no such brand exists, return first brand which accepted the binary by
note.
The change fixes a regression after r292749, where e.g. our two ia32
compat brands, ia32_brand_info and ia32_brand_oinfo, only differ by
the interpeter path and binary matches to a brand by linkage order.
Then old binaries which require /usr/libexec/ld-elf.so.1 but matched
against ia32_brand_info with interp_path /libexec/ld-elf.so.1, were
considered requiring non-standard interpreter name, and magic to force
ld-elf32.so.1 did not happen.
Note that it might make sense to apply the same selection of brands
for other matching criteria, SCO EI_OSABI and 3.x string.
Reported and tested by: dwmalone
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
export_args on mount update, bzero() is consistent with
vfs_oexport_conv().
Make the code structure more explicit by using switch.
Return EINVAL if export option layout (deduced from size) is unknown.
Based on the submission by: bde
Sponsored by: The FreeBSD Foundation
fork1 required its callers to pass a pointer to struct proc * which would
be set to the new process (if any). procdesc and racct manipulation also
used said pointer.
However, the process could have exited prior to do_fork return and be
automatically reaped, thus making this a use-after-free.
Fix the problem by letting callers indicate whether they want the pid or
the struct proc, return the process in stopped state for the latter case.
Reviewed by: kib
print until cncheckc returned a non -1, i.e. a character had been entered.
After this change it would print only if cncheckc returned a character.
As this was before each call to db_mach_vtrace the normal outcome was
nothing was printed.
With this change 'show ktr /a' will now keep printing until the user stops
the command with a key press, or there is no more entries to print.
NOTE_CHILD and NOTE_EXIT return something in kevent.data: the parent
pid (ppid) for NOTE_CHILD and the exit status for NOTE_EXIT.
Do not let the two events be combined, since one would overwrite
the other's data.
PR: 180385
Submitted by: David A. Bright <david_a_bright@dell.com>
Reviewed by: jhb
MFC after: 1 month
Sponsored by: Dell Inc.
Differential Revision: https://reviews.freebsd.org/D4900
a buffer pointer in the event of an error (for some errors it would
return a buffer pointer and for other errors it would not return a
buffer pointer). The cluster_read() function was similarly inconsistent.
Clients of these functions were inconsistent in handling errors.
Some would assume that no buffer was returned after an error and
would thus lose buffers under certain error conditions. Others would
assume that brelse() should always be called after an error and
would thus panic the system under certain error conditions.
To correct both of these problems with minimal code churn, bread()
and cluster_write() now always free the buffer when returning an
error thus ensuring that buffers will never be lost. The brelse()
routine checks for being passed a NULL buffer pointer and silently
returns to avoid panics. Thus both approaches to handling error
returns from bread() and cluster_read() will work correctly.
Future code should be written assuming that bread() and cluster_read()
will never return a buffer with an error, so should not attempt to
brelse() the buffer when an error is returned.
Reviewed by: kib
Summary:
Migrate to using the semi-opaque type rman_res_t to specify rman resources. For
now, this is still compatible with u_long.
This is step one in migrating rman to use uintmax_t for resources instead of
u_long.
Going forward, this could feasibly be used to specify architecture-specific
definitions of resource ranges, rather than baking a specific integer type into
the API.
This change has been broken out to facilitate MFC'ing drivers back to 10 without
breaking ABI.
Reviewed By: jhb
Sponsored by: Alex Perez/Inertial Computing
Differential Revision: https://reviews.freebsd.org/D5075
close or assert the bug that it is clear when leaving.
Remove an unrelated rotted comment that was attached to the buggy
clearing.
Since draining is not done in more cases, flushing is needed in more
cases, so start fixing flushing:
- do a full flush in ttydisc_close(). State what POSIX requires more
clearly. This was missing ttydevsw_pktnotify() calls to tell the
devsw layer to flush. Hardware tty drivers don't actually flush
since they don't understand this API.
- fix 2 missing wakeups in tty_flush(). Most of the wakeups here are
unnecessary for last close. But ttydisc_close() did one of the
missing ones.
This flow control bug ameliorated the design bug of requiring
potentially unbounded waits in draining. Software flow control is the
easiest way to get an unbounded wait, and a long wait is sometimes
actually useful. Users can type the xoff character on the receiver
and (if ixon is set on the sender) expect the output to be held until
the user is ready for more.
Hardware flow control can also give the unbounded wait, and this bug
didn't affect hardware flow control. Unbounded waits from hardware
flow control take a more unusual configuration. E.g., a terminal
program that controls the modem status lines, or unplugging the cable
in a configuration where this doesn't break the connection.
The design bug is still ameliorated by a newer bug in draining for
last close -- the 1 second timeout. E.g., if the user types the
xoff character and the sender reaches last close, then output is
not resumed and the wait times out after just 1 second. This is
broken, but preferable to an unbounded wait. Before this change,
the output was resumed immediately and usually completed.
Submitted by: bde
MFC after: 2 weeks
intended behaviour in its man page. Simplify tty_drain() to match.
Don't call ttydevsw methods in tty_flush() if the device is gone
since we now sometimes call it then.
The flushing was supposed to be implemented by passing the FNONBLOCK
flag to VOP_CLOSE() for revoke(). The tty driver is one of the few
that can block in close and was one of the fewer that knew about this.
This almost worked in FreeBSD-1 and similarly in Net/2. These
versions only almost worked because there was and is considerable
confusion between IO_NDELAY and FNONBLOCK (aka O_NONBLOCK). IO_NDELAY
is only valid for VOP_READ() and VOP_WRITE(). For other VOPs it has
the same value as O_SHLOCK. But since vfs_subr.c and tty.c
consistently used the wrong flag and the O_SHLOCK flag is rarely set,
this mostly worked. It also gave the feature than applications could
get the non-blocking close by abusing O_SHLOCK.
This was first broken then fixed in 1995. I changed only the tty
driver to use FNONBLOCK, as a hack to get non-blocking via the normal
flag FNONBLOCK for last closes. I didn't know about revoke()'s use
of IO_NDELAY or change it to be consistent, so revoke() was broken.
Then I changed revoke() to match.
This was next broken in 1997 then fixed in 1998. Importing Lite2 made
the flags inconsistent again by undoing the fix only in vfs_subr.c.
This was next broken in 2008 by replacing everything in tty.c and not
checking any flags in last close. Other bugs in draining limited the
resulting unbounded waits to drain in some cases.
It is now possible to fix this better using the new FREVOKE flag.
Just restore flushing for revoke() for now. Don't restore or undo any
hacks for ordinary last closes yet. But remove dead code in the
1-second relative timeout (r272789). This did extra work to extend
the buggy draining for revoke() for as long as possible. The 1-second
timeout made this not very long by usually flushing after 1 second.
Submitted by: bde
MFC after: 2 weeks
The fail point handler may sleep, but this is not permitted while holding a
rm read lock.
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
open (in disguise as the console device). The only allowed combination
was supposed to be the callin device with the console.
Fix the assertion in ttydev_close() that was meant to detect this (it
only detected all 3 devices being open). Assert this in ttydev_open()
too.
Submitted by: bde
MFC after: 2 weeks
(TF_OPENED_CONS) were broken in r188147 by adding TF_OPENED_CONS
without updating the string. It was especially confusing to display
OPENED_CONS as GONE and BYPASS as ZOMBIE. 2 flags at the end were
not updated in r188487.
Don't print an extra 0x prefix for %p in a ddb command. In the rest
of the kernel there are more than 6000 lines with %p and only about
40 with this bug.
Print a non-extra 0x prefix for %b in a ddb command. In the rest
of the kernel, there are approx. 180 lines with %b and 2/3 of them
have this bug.
Submitted by: bde
MFC after: 2 weeks
There are number of radix consumers in kernel land (pf,ipfw,nfs,route)
with different requirements. In fact, first 3 don't have _any_ requirements
and first 2 does not use radix locking. On the other hand, routing
structure do have these requirements (rnh_gen, multipath, custom
to-be-added control plane functions, different locking).
Additionally, radix should not known anything about its consumers internals.
So, radix code now uses tiny 'struct radix_head' structure along with
internal 'struct radix_mask_head' instead of 'struct radix_node_head'.
Existing consumers still uses the same 'struct radix_node_head' with
slight modifications: they need to pass pointer to (embedded)
'struct radix_head' to all radix callbacks.
Routing code now uses new 'struct rib_head' with different locking macro:
RADIX_NODE_HEAD prefix was renamed to RIB_ (which stands for routing
information base).
New net/route_var.h header was added to hold routing subsystem internal
data. 'struct rib_head' was placed there. 'struct rtentry' will also
be moved there soon.
data. If vnode bypass for devfs file failed, vn_read/vn_write are
called and might try to dereference f_advice. Limit the accesses to
f_advice to VREG vnodes only, which is the type ensured by
posix_fadvise().
The f_advice for regular files is protected by mtxpool lock. Recheck
that f_advice is not NULL after lock is taken.
Reported and tested by: bde
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
separate file. Claim my copyright.
- Provide more comments, better function and structure names.
- Sort out unneeded includes from resulting two files.
No functional changes.
user VM spaces while servicing jobs. Update various comments and data
structures that refer to AIO daemons as threads to refer to processes
instead.
Reviewed by: kib
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D4999
In r55943, a per-process queue of pending socket AIO requests (requests
waiting for the socket to become ready) was added so that they could be
cancelled during process rundown. In r154765, the rundown code was
changed to handle jobs in this state (JOBST_JOBQSOCK) directly removing
the need for the extra queue. However, the per-process queue head and
global lock were never removed.
Reviewed by: kib
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D4997
1. vhold and zap immediately instead of postponing few lines later
2. increment numneg after new entry is added
No functional changes.
No objections: kib
Previously the code would just increment statistics while only holding a
shared lock, in effect losing updates.
Separate tracking for nchstats is removed as values can be obtained from
existing counters. Note that some fields are updated by external
consumers and are left unfixed. This should not be a serious issue as
this structure looks quite obsolete.
No strong objections: kib
to actually wait until the TX FIFOs of UARTs have be drained before
returning. This is done by bringing the equivalent of the TS_BUSY flag
found in the previous implementation back in an ABI-preserving way.
Reported and tested by: Patrick Powell
Most likely, drivers for USB-serial-adapters likewise incorporating
TX FIFOs as well as other terminal devices that buffer output in some
form should also provide implementations of tsw_busy.
MFC after: 3 days
- Pull the vmspace logic out into helper functions and reduce duplication.
Operations on the vmspace are all isolated to vm_map.c, but it now exports
a new 'vmspace_switch_aio' for use by AIO kernel processes.
- When an AIO kernel process wants to exit, break out of the main loop and
perform cleanup after the loop end. This reduces a lot of indentation and
allows cleanup to more closely mirror setup actions before the loop starts.
- Convert a DIAGNOSTIC to KASSERT().
- Replace mycp with more typical 'p'.
Reviewed by: kib
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D4990
This code dates back to the initial AIO support and the commit log does
not explain why it is needed. However, I cannot find anything in the
AIO code or the various file methods (fo_read/fo_write) that would change
behavior due to using a private session instead of proc0's session.
Reviewed by: kib
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D4988
This API has no in-tree consumers at the moment but is useful to at least
one out-of-tree consumer, and naturally complements existing vnode refcount
functions (vholdl(9), vdropl(9)).
Obtained from: kib (sys/ portion)
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D4947
Differential Revision: https://reviews.freebsd.org/D4953
not call VOP_CLOSE() manually. Instead, delegate the close to
fo_close() performed as part of the fdrop() on the file failed to
open. For this, finish constructing file on error, in particular, set
f_vnode and f_ops.
Forcibly resetting f_ops to badfileops disabled additional cleanups
performed by fo_close() for some file types, in this case it was noted
that cdevpriv data was corrupted. Since fo_close() call must be
enabled for some file types, it makes more sense to enable it for all
files opened through vn_open_cred().
In collaboration with: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
It hasn't been used since the AIO code was made MPSAFE 10 years ago.
Reviewed by: kib
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D4946
This taskqueue is not used to handle bio requests. It is only used to
run aio_kick_nowait() to spin up new aio daemon processes.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D4904
and replace crcopysafe by crcopy as crcopysafe is is not intended to be
safe in a threaded environment, it drops PROC_LOCK() in while() that
can lead to unexpected results, such as overwrite kernel memory.
In my POV crcopysafe() needs special attention. For now I do not see
any problems with this function, but who knows.
Submitted by: dchagin
Found by: trinity
Security: SA-16:04.linux
"The problem with the approach taken both in _bus_dmamap_load_pages and
bus_dmamap_load_ma_triv is that they split the request buffer into
arbitrary chunks based on page boundaries, creating segments that no
longer have a size that's a multiple of the sector size. This breaks
drivers like blkfront (and probably other stuff)." [1]
This was most easily triggered by running `fsck /` on a system running
in Xen (e.g. Amazon EC2) but also showed up via growfs(8) and probably
many other userland tools which access the disk directly.
Patch by: royger [1]
"Thinks this should be fine" by: ken
r282971 attempted to fix this problem by decrementing cv_waiters after
waking up from sleeping on a condition variable, but this can result in
a use-after-free if the CV is freed before all woken threads have had a
chance to run. Instead, avoid incrementing cv_waiters past INT_MAX, and
have cv_signal() explicitly check for sleeping threads once cv_waiters has
reached this bound.
Reviewed by: jhb
MFC after: 2 weeks
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D4822
up to now.
The new sendfile is the code that Netflix uses to send their multiple tens
of gigabits of data per second. The new implementation features asynchronous
I/O, when I/O operations are launched, but not awaited to be complete. An
explanation of why such behavior is beneficial compared to old one is
going to be too long for a commit message, so we will skip it here.
Additional features of new syscall are extra flags, which provide an
application more control over data sent. The SF_NOCACHE flag tells
kernel that data shouldn't be cached after it was sent. The SF_READAHEAD()
macro allows to specify readahead size in pages.
The new syscalls is a drop in replacement. No modifications are required
to applications. One can take nginx binary for stable/10 and run it
successfully on head. Although SF_NODISKIO lost its original sense, as now
sendfile doesn't block, and now means something completely different (tm),
using the new sendfile the old way is absolutely safe.
Celebrates: Netflix global launch!
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
Relnotes: yes
sbappendstream() does. Although, M_NOTREADY may appear only on SOCK_STREAM
sockets, due to sendfile(2) supporting only the latter, there is a corner
case of AF_UNIX/SOCK_STREAM socket, that still uses records for the sake
of control data, albeit being stream socket.
Provide private version of m_clrprotoflags(), which understands PRUS_NOTREADY,
similar to m_demote().