It hasn't been used since the AIO code was made MPSAFE 10 years ago.
Reviewed by: kib
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D4946
This taskqueue is not used to handle bio requests. It is only used to
run aio_kick_nowait() to spin up new aio daemon processes.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D4904
and replace crcopysafe by crcopy as crcopysafe is is not intended to be
safe in a threaded environment, it drops PROC_LOCK() in while() that
can lead to unexpected results, such as overwrite kernel memory.
In my POV crcopysafe() needs special attention. For now I do not see
any problems with this function, but who knows.
Submitted by: dchagin
Found by: trinity
Security: SA-16:04.linux
"The problem with the approach taken both in _bus_dmamap_load_pages and
bus_dmamap_load_ma_triv is that they split the request buffer into
arbitrary chunks based on page boundaries, creating segments that no
longer have a size that's a multiple of the sector size. This breaks
drivers like blkfront (and probably other stuff)." [1]
This was most easily triggered by running `fsck /` on a system running
in Xen (e.g. Amazon EC2) but also showed up via growfs(8) and probably
many other userland tools which access the disk directly.
Patch by: royger [1]
"Thinks this should be fine" by: ken
r282971 attempted to fix this problem by decrementing cv_waiters after
waking up from sleeping on a condition variable, but this can result in
a use-after-free if the CV is freed before all woken threads have had a
chance to run. Instead, avoid incrementing cv_waiters past INT_MAX, and
have cv_signal() explicitly check for sleeping threads once cv_waiters has
reached this bound.
Reviewed by: jhb
MFC after: 2 weeks
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D4822
up to now.
The new sendfile is the code that Netflix uses to send their multiple tens
of gigabits of data per second. The new implementation features asynchronous
I/O, when I/O operations are launched, but not awaited to be complete. An
explanation of why such behavior is beneficial compared to old one is
going to be too long for a commit message, so we will skip it here.
Additional features of new syscall are extra flags, which provide an
application more control over data sent. The SF_NOCACHE flag tells
kernel that data shouldn't be cached after it was sent. The SF_READAHEAD()
macro allows to specify readahead size in pages.
The new syscalls is a drop in replacement. No modifications are required
to applications. One can take nginx binary for stable/10 and run it
successfully on head. Although SF_NODISKIO lost its original sense, as now
sendfile doesn't block, and now means something completely different (tm),
using the new sendfile the old way is absolutely safe.
Celebrates: Netflix global launch!
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
Relnotes: yes
sbappendstream() does. Although, M_NOTREADY may appear only on SOCK_STREAM
sockets, due to sendfile(2) supporting only the latter, there is a corner
case of AF_UNIX/SOCK_STREAM socket, that still uses records for the sake
of control data, albeit being stream socket.
Provide private version of m_clrprotoflags(), which understands PRUS_NOTREADY,
similar to m_demote().
Tty.c was untypical in that it handled the si_drv1 issue consistently
and correctly, by always checking for si_drv1 being non-NULL and
sleeping if NULL. The removed code also illustrated unneeded
complications in drivers which are eliminated by the use of new KPI.
Reviewed by: hps, jhb
Discussed with: bde
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
Differential revision: https://reviews.freebsd.org/D4746
Immediate problem fixed by the new KPI is the long-standing race
between device creation and assignments to cdev->si_drv1 and
cdev->si_drv2, which allows the window where cdevsw methods might be
called with si_drv1,2 fields not yet set. Devices typically checked
for NULL and returned spurious errors to usermode, and often left some
methods unchecked.
The new function interface is designed to be extensible, which should
allow to add more features to make_dev_s(9) without inventing yet
another name for function to create devices, while maintaining KPI and
even KBI backward-compatibility.
Reviewed by: hps, jhb
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
Differential revision: https://reviews.freebsd.org/D4746
Advance the logical block number to the lblkno of the found block plus
one, instead of incrementing the block number which was used for
lookup. This change skips sparcely populated buffer ranges, similar
to r292325, instead of doing useless lookups.
Do not restart the bnoreuselist() from the start of the range if
buffer lock cannot be obtained without sleep. Only retry lookup and
lock for the same queue and same logical block number.
Reported by: benno
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
providing compiled-in static environment data that is used instead of any
data passed in from a boot loader.
Previously 'env' worked only on i386 and arm xscale systems, because it
required the MD startup code to examine the global envmode variable and
decide whether to use static_env or an environment obtained from the boot
loader, and set the global kern_envp accordingly. Most startup code wasn't
doing so. Making things even more complex, some mips startup code uses an
alternate scheme that involves calling init_static_kenv() to pass an empty
buffer and its size, then uses a series of kern_setenv() calls to populate
that buffer.
Now all MD startup code calls init_static_kenv(), and that routine provides
a single point where envmode is checked and the decision is made whether to
use the compiled-in static_kenv or the values provided by the MD code.
The routine also continues to serve its original purpose for mips; if a
non-zero buffer size is passed the routine installs the empty buffer ready
to accept kern_setenv() values. Now if the size is zero, the provided buffer
full of existing env data is installed. A NULL pointer can be passed if the
boot loader provides no env data; this allows the static env to be installed
if envmode is set to do so.
Most of the work here is a near-mechanical change to call the init function
instead of directly setting kern_envp. A notable exception is in xen/pv.c;
that code was originally installing a buffer full of preformatted env data
along with its non-zero size (like mips code does), which would have allowed
kern_setenv() calls to wipe out the preformatted data. Now it passes a zero
for the size so that the buffer of data it installs is treated as
non-writeable.
CPU_ISSET(), CPU_SET etc. in sparc64 asm. This approach has the
benefit of not clobbering %y, allowing to revert r222827 and
partially r222828.
- In r222828, CATR() already was changed to use the equivalent of
PCPU_GET(cpuid) instead of the MD module ID for KTR_CPU, so
belatedly also catch up with the C side of ktr(9). Originally,
in r203838 CATR() was moved away from directly reading the
module ID or equivalent as that became impractical with other
CPU types than USI/II supported. With r222828 in place, per-CPU
data generally is set up soon enough, though, that employing
PCPU things in ktr(9) also for use during early stages works.
- Unfortunately, an exception to the latter is the ktr(9) use
in pmap_bootstrap(), which actually is run so early that even
checking for bootverbose being set via the loader doesn't work.
Consequently, replace the ktr(9) use in pmap_bootstrap() with
OF_printf(9) and put it under #ifdef DIAGNOSTIC instead.
MFC after: 3 days
Add two new LWPINFO flags: PL_FLAG_BORN and PL_FLAG_EXITED for reporting
thread creation and destruction. Newly created threads will stop to report
PL_FLAG_BORN before returning to userland and exiting threads will stop to
report PL_FLAG_EXIT before exiting completely. Both of these events are
only enabled and reported if PT_LWP_EVENTS is enabled on a process.
This code is missing the racct_subr() call from kern_thr_exit() and would
require further code duplication in future changes.
Reviewed by: kib
MFC after: 1 week
(CLOCK_REALTIME case) system calls is non negative.
This commit hides a kernel panic in atrtc_settime() as the clock_ts_to_ct()
does not properly convert negative tv_sec.
ps. in my opinion clock_ts_to_ct() should be rewritten to properly handle
negative tv_sec values.
Differential Revision: https://reviews.freebsd.org/D4714
Reviewed by: kib
MFC after: 1 week
different from the interpreter path requested by the binary.
Before this change, it is impossible to activate non-default
interpreter for 32bit image on amd64, when /libexec/ld-elf32.so.1 file
exists.
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
vim overzealously removed some trailing `+' and I didn't check the
diff
MFC after: 1 week
X-MFC with: r292640
Pointyhat to: ngie
Sponsored by: EMC / Isilon Storage Division
exhausted.
It is possible for a bug in the code (or, theoretically, even unusual
network conditions) to exhaust all possible mbufs or mbuf clusters.
When this occurs, things can grind to a halt fairly quickly. However,
we currently do not call mb_reclaim() unless the entire system is
experiencing a low-memory condition.
While it is best to try to prevent exhaustion of one of the mbuf zones,
it would also be useful to have a mechanism to attempt to recover from
these situations by freeing "expendable" mbufs.
This patch makes two changes:
a) The patch adds a generic API to the UMA zone allocator to set a
function that should be called when an allocation fails because the
zone limit has been reached. Because of the way this function can be
called, it really should do minimal work.
b) The patch uses this API to try to free mbufs when an allocation
fails from one of the mbuf zones because the zone limit has been
reached. The function schedules a callout to run mb_reclaim().
Differential Revision: https://reviews.freebsd.org/D3864
Reviewed by: gnn
Comments by: rrs, glebius
MFC after: 2 weeks
Sponsored by: Juniper Networks
During fork p_starcopy - p_endcopy area of a process is populated with bcopy
with only proc lock held. Another forking thread can find such a process and
proceed to access p_pgrp included in said area.
Fix the problem by moving the field outside. It is being properly assigned
later.
Reviewed by: kib
Diagnosed by: kib
Tested by: Fabian Keil <freebsd-listen fabiankeil.de>
MFC after: 10 days
The ci20 port (by kan@) is going to reuse almost all of the intrng code
since the SoC in question looks suspiciously like someone took an ARM
SoC design and replaced the ARM core with a MIPS core.
* migrate out the code;
* rename ARM_ -> INTR_;
* rename arm_ -> intr_;
* move the interrupt flush routine from intr.c / intrng.c into
arm/machdep_intr.c - removing the code duplication and removing
the ARM specific bits from here.
Thanks to the Star Wars: The Force Awakens premiere line for allowing
me a couple hours of quiet time to finish the universe builds.
Tested:
* make universe
TODO:
* The structure definitions in subr_intr.c still includes machine/intr.h
which requires one duplicates all of the intrng definitions in
the platform code (which kan has done, and I think we don't have to.)
Instead I should break out the generic things (function declarations,
common intr structures, etc) into a separate header.
* Kan has requested I make the PIC based IPI stuff optional.
Rather than pushing all eight possible arguments into dtrace_probe()'s
stack frame, make the syscall_args struct for the current syscall available
via the current thread. Using a custom getargval method for the systrace
provider, this allows any syscall argument to be fetched, even in kernels
that have modified the maximum number of system call arguments.
Sponsored by: EMC / Isilon Storage Division
- Use SDT_PROBE<N>() instead of SDT_PROBE(). This has no functional effect
at the moment, but will be needed for some future changes.
- Don't hardcode the module component of the probe identifier. This is
set automatically by the SDT framework.
MFC after: 1 week
o With new KPI consumers can request contiguous ranges of pages, and
unlike before, all pages will be kept busied on return, like it was
done before with the 'reqpage' only. Now the reqpage goes away. With
new interface it is easier to implement code protected from race
conditions.
Such arrayed requests for now should be preceeded by a call to
vm_pager_haspage() to make sure that request is possible. This
could be improved later, making vm_pager_haspage() obsolete.
Strenghtening the promises on the business of the array of pages
allows us to remove such hacks as swp_pager_free_nrpage() and
vm_pager_free_nonreq().
o New KPI accepts two integer pointers that may optionally point at
values for read ahead and read behind, that a pager may do, if it
can. These pages are completely owned by pager, and not controlled
by the caller.
This shifts the UFS-specific readahead logic from vm_fault.c, which
should be file system agnostic, into vnode_pager.c. It also removes
one VOP_BMAP() request per hard fault.
Discussed with: kib, alc, jeff, scottl
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
buffer for each block number in the range with gbincore(), look up the
next instantiated buffer with the logical block number which is
greater or equal to the next lblkno. This significantly speeds up the
iteration for sparce-populated range.
Move the iteration into new helper bnoreuselist(), which is structured
similarly to flushbuflist().
Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
the type stability of the buffers memory. Instead of memoizing
pointer to the next buffer and validating it, remember the next
logical block number in the bo list and re-lookup.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
The intention was to just limit leading zeroes on numeric names. That
check is now improved to also catch the leading spaces and '+' that
strtoul can pass through.
PR: 204897
MFC after: 3 days
A panicking thread always executes with a critical section held, so any
attempt to allocate or free memory while dumping will otherwise cause a
second panic. This can occur, for example, if xpt_polled_action() completes
non-dump I/O that was pending at the time of the panic. The fact that this
can occur is itself a bug, but asserting in this case does little but
reduce the reliability of kernel dumps.
Suggested by: kib
Reported by: pho
These helper functions can be used to read in or write a buffer from or to
an arbitrary process' address space. Without them, this can only be done
using proc_rwmem(), which requires the caller to fill out a uio. This is
onerous and results in code duplication; the new functions provide a simpler
interface which is sufficient for most existing callers of proc_rwmem().
This change also adds a manual page for proc_rwmem() and the new functions.
Reviewed by: jhb, kib
Differential Revision: https://reviews.freebsd.org/D4245
clock_gettime(2) on ARMv7 and ARMv8 systems which have architectural
generic timer hardware. It is similar how the RDTSC timer is used in
userspace on x86.
Fix a permission problem where generic timer access from EL0 (or
userspace on v7) was not properly initialized on APs.
For ARMv7, mark the stack non-executable. The shared page is added for
all arms (including ARMv8 64bit), and the signal trampoline code is
moved to the page.
Reviewed by: andrew
Discussed with: emaste, mmel
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D4209
For completeness add a VNASSERT that there are no threads waiting on a
range lock (this was previously checked on every vnode free).
Reported by; Rick Macklem
Fix from: Mateusz Guzik
PR: 204949
camdd(8) utility.
CCBs may be queued to the driver via the new CAMIOQUEUE ioctl, and
completed CCBs may be retrieved via the CAMIOGET ioctl. User
processes can use poll(2) or kevent(2) to get notification when
I/O has completed.
While the existing CAMIOCOMMAND blocking ioctl interface only
supports user virtual data pointers in a CCB (generally only
one per CCB), the new CAMIOQUEUE ioctl supports user virtual and
physical address pointers, as well as user virtual and physical
scatter/gather lists. This allows user applications to have more
flexibility in their data handling operations.
Kernel memory for data transferred via the queued interface is
allocated from the zone allocator in MAXPHYS sized chunks, and user
data is copied in and out. This is likely faster than the
vmapbuf()/vunmapbuf() method used by the CAMIOCOMMAND ioctl in
configurations with many processors (there are more TLB shootdowns
caused by the mapping/unmapping operation) but may not be as fast
as running with unmapped I/O.
The new memory handling model for user requests also allows
applications to send CCBs with request sizes that are larger than
MAXPHYS. The pass(4) driver now limits queued requests to the I/O
size listed by the SIM driver in the maxio field in the Path
Inquiry (XPT_PATH_INQ) CCB.
There are some things things would be good to add:
1. Come up with a way to do unmapped I/O on multiple buffers.
Currently the unmapped I/O interface operates on a struct bio,
which includes only one address and length. It would be nice
to be able to send an unmapped scatter/gather list down to
busdma. This would allow eliminating the copy we currently do
for data.
2. Add an ioctl to list currently outstanding CCBs in the various
queues.
3. Add an ioctl to cancel a request, or use the XPT_ABORT CCB to do
that.
4. Test physical address support. Virtual pointers and scatter
gather lists have been tested, but I have not yet tested
physical addresses or scatter/gather lists.
5. Investigate multiple queue support. At the moment there is one
queue of commands per pass(4) device. If multiple processes
open the device, they will submit I/O into the same queue and
get events for the same completions. This is probably the right
model for most applications, but it is something that could be
changed later on.
Also, add a new utility, camdd(8) that uses the asynchronous pass(4)
driver interface.
This utility is intended to be a basic data transfer/copy utility,
a simple benchmark utility, and an example of how to use the
asynchronous pass(4) interface.
It can copy data to and from pass(4) devices using any target queue
depth, starting offset and blocksize for the input and ouptut devices.
It currently only supports SCSI devices, but could be easily extended
to support ATA devices.
It can also copy data to and from regular files, block devices, tape
devices, pipes, stdin, and stdout. It does not support queueing
multiple commands to any of those targets, since it uses the standard
read(2)/write(2)/writev(2)/readv(2) system calls.
The I/O is done by two threads, one for the reader and one for the
writer. The reader thread sends completed read requests to the
writer thread in strictly sequential order, even if they complete
out of order. That could be modified later on for random I/O patterns
or slightly out of order I/O.
camdd(8) uses kqueue(2)/kevent(2) to get I/O completion events from
the pass(4) driver and also to send request notifications internally.
For pass(4) devcies, camdd(8) uses a single buffer (CAM_DATA_VADDR)
per CAM CCB on the reading side, and a scatter/gather list
(CAM_DATA_SG) on the writing side. In addition to testing both
interfaces, this makes any potential reblocking of I/O easier. No
data is copied between the reader and the writer, but rather the
reader's buffers are split into multiple I/O requests or combined
into a single I/O request depending on the input and output blocksize.
For the file I/O path, camdd(8) also uses a single buffer (read(2),
write(2), pread(2) or pwrite(2)) on reads, and a scatter/gather list
(readv(2), writev(2), preadv(2), pwritev(2)) on writes.
Things that would be nice to do for camdd(8) eventually:
1. Add support for I/O pattern generation. Patterns like all
zeros, all ones, LBA-based patterns, random patterns, etc. Right
Now you can always use /dev/zero, /dev/random, etc.
2. Add support for a "sink" mode, so we do only reads with no
writes. Right now, you can use /dev/null.
3. Add support for automatic queue depth probing, so that we can
figure out the right queue depth on the input and output side
for maximum throughput. At the moment it defaults to 6.
4. Add support for SATA device passthrough I/O.
5. Add support for random LBAs and/or lengths on the input and
output sides.
6. Track average per-I/O latency and busy time. The busy time
and latency could also feed in to the automatic queue depth
determination.
sys/cam/scsi/scsi_pass.h:
Define two new ioctls, CAMIOQUEUE and CAMIOGET, that queue
and fetch asynchronous CAM CCBs respectively.
Although these ioctls do not have a declared argument, they
both take a union ccb pointer. If we declare a size here,
the ioctl code in sys/kern/sys_generic.c will malloc and free
a buffer for either the CCB or the CCB pointer (depending on
how it is declared). Since we have to keep a copy of the
CCB (which is fairly large) anyway, having the ioctl malloc
and free a CCB for each call is wasteful.
sys/cam/scsi/scsi_pass.c:
Add asynchronous CCB support.
Add two new ioctls, CAMIOQUEUE and CAMIOGET.
CAMIOQUEUE adds a CCB to the incoming queue. The CCB is
executed immediately (and moved to the active queue) if it
is an immediate CCB, but otherwise it will be executed
in passstart() when a CCB is available from the transport layer.
When CCBs are completed (because they are immediate or
passdone() if they are queued), they are put on the done
queue.
If we get the final close on the device before all pending
I/O is complete, all active I/O is moved to the abandoned
queue and we increment the peripheral reference count so
that the peripheral driver instance doesn't go away before
all pending I/O is done.
The new passcreatezone() function is called on the first
call to the CAMIOQUEUE ioctl on a given device to allocate
the UMA zones for I/O requests and S/G list buffers. This
may be good to move off to a taskqueue at some point.
The new passmemsetup() function allocates memory and
scatter/gather lists to hold the user's data, and copies
in any data that needs to be written. For virtual pointers
(CAM_DATA_VADDR), the kernel buffer is malloced from the
new pass(4) driver malloc bucket. For virtual
scatter/gather lists (CAM_DATA_SG), buffers are allocated
from a new per-pass(9) UMA zone in MAXPHYS-sized chunks.
Physical pointers are passed in unchanged. We have support
for up to 16 scatter/gather segments (for the user and
kernel S/G lists) in the default struct pass_io_req, so
requests with longer S/G lists require an extra kernel malloc.
The new passcopysglist() function copies a user scatter/gather
list to a kernel scatter/gather list. The number of elements
in each list may be different, but (obviously) the amount of data
stored has to be identical.
The new passmemdone() function copies data out for the
CAM_DATA_VADDR and CAM_DATA_SG cases.
The new passiocleanup() function restores data pointers in
user CCBs and frees memory.
Add new functions to support kqueue(2)/kevent(2):
passreadfilt() tells kevent whether or not the done
queue is empty.
passkqfilter() adds a knote to our list.
passreadfiltdetach() removes a knote from our list.
Add a new function, passpoll(), for poll(2)/select(2)
to use.
Add devstat(9) support for the queued CCB path.
sys/cam/ata/ata_da.c:
Add support for the BIO_VLIST bio type.
sys/cam/cam_ccb.h:
Add a new enumeration for the xflags field in the CCB header.
(This doesn't change the CCB header, just adds an enumeration to
use.)
sys/cam/cam_xpt.c:
Add a new function, xpt_setup_ccb_flags(), that allows specifying
CCB flags.
sys/cam/cam_xpt.h:
Add a prototype for xpt_setup_ccb_flags().
sys/cam/scsi/scsi_da.c:
Add support for BIO_VLIST.
sys/dev/md/md.c:
Add BIO_VLIST support to md(4).
sys/geom/geom_disk.c:
Add BIO_VLIST support to the GEOM disk class. Re-factor the I/O size
limiting code in g_disk_start() a bit.
sys/kern/subr_bus_dma.c:
Change _bus_dmamap_load_vlist() to take a starting offset and
length.
Add a new function, _bus_dmamap_load_pages(), that will load a list
of physical pages starting at an offset.
Update _bus_dmamap_load_bio() to allow loading BIO_VLIST bios.
Allow unmapped I/O to start at an offset.
sys/kern/subr_uio.c:
Add two new functions, physcopyin_vlist() and physcopyout_vlist().
sys/pc98/include/bus.h:
Guard kernel-only parts of the pc98 machine/bus.h header with
#ifdef _KERNEL.
This allows userland programs to include <machine/bus.h> to get the
definition of bus_addr_t and bus_size_t.
sys/sys/bio.h:
Add a new bio flag, BIO_VLIST.
sys/sys/uio.h:
Add prototypes for physcopyin_vlist() and physcopyout_vlist().
share/man/man4/pass.4:
Document the CAMIOQUEUE and CAMIOGET ioctls.
usr.sbin/Makefile:
Add camdd.
usr.sbin/camdd/Makefile:
Add a makefile for camdd(8).
usr.sbin/camdd/camdd.8:
Man page for camdd(8).
usr.sbin/camdd/camdd.c:
The new camdd(8) utility.
Sponsored by: Spectra Logic
MFC after: 1 week
on every allocation and fully releases them on every free. These
are not trivial costs: it starts by zeroing a large structure then
initializes a mutex, a lock manager lock, an rw lock, four lists,
and six pointers. And looking at vfs.vnodes_created, these operations
are being done millions of times an hour on a busy machine.
As a performance optimization, this code update uses the uma_init
and uma_fini routines to do these initializations and cleanups only
as the vnodes enter and leave the vnode_zone. With this change the
initializations are only done kern.maxvnodes times at system startup
and then only rarely again. The frees are done only if the vnode_zone
shrinks which never happens in practice. For those curious about the
avoided work, look at the vnode_init() and vnode_fini() functions in
kern/vfs_subr.c to see the code that has been removed from the main
vnode allocation/free path.
Reviewed by: kib
Tested by: Peter Holm