14597 Commits

Author SHA1 Message Date
Mateusz Guzik
6b53d1bc6f cache: ansify functions and fix some style issues
No functional changes.
2016-01-07 02:04:17 +00:00
Konstantin Belousov
1041e09089 Two fixes for excessive iterations after r292326.
Advance the logical block number to the lblkno of the found block plus
one, instead of incrementing the block number which was used for
lookup.  This change skips sparsely populated buffer ranges, similar
to r292325, instead of doing useless lookups.

Do not restart the bnoreuselist() from the start of the range if
buffer lock cannot be obtained without sleep.  Only retry lookup and
lock for the same queue and same logical block number.

Reported by:	benno
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2016-01-05 14:48:40 +00:00
Ian Lepore
69dcb7e771 Make the 'env' directive described in config(5) work on all architectures,
providing compiled-in static environment data that is used instead of any
data passed in from a boot loader.

Previously 'env' worked only on i386 and arm xscale systems, because it
required the MD startup code to examine the global envmode variable and
decide whether to use static_env or an environment obtained from the boot
loader, and set the global kern_envp accordingly.  Most startup code wasn't
doing so.  Making things even more complex, some mips startup code uses an
alternate scheme that involves calling init_static_kenv() to pass an empty
buffer and its size, then uses a series of kern_setenv() calls to populate
that buffer.

Now all MD startup code calls init_static_kenv(), and that routine provides
a single point where envmode is checked and the decision is made whether to
use the compiled-in static_kenv or the values provided by the MD code.

The routine also continues to serve its original purpose for mips; if a
non-zero buffer size is passed the routine installs the empty buffer ready
to accept kern_setenv() values.  Now if the size is zero, the provided buffer
full of existing env data is installed.  A NULL pointer can be passed if the
boot loader provides no env data; this allows the static env to be installed
if envmode is set to do so.

Most of the work here is a near-mechanical change to call the init function
instead of directly setting kern_envp.  A notable exception is in xen/pv.c;
that code was originally installing a buffer full of preformatted env data
along with its non-zero size (like mips code does), which would have allowed
kern_setenv() calls to wipe out the preformatted data.  Now it passes a zero
for the size so that the buffer of data it installs is treated as
non-writeable.
2016-01-02 02:53:48 +00:00
Marius Strobl
45005f7907 - (Ab)use udivx for dividing the u_int pc_cpuid when implementing
CPU_ISSET(), CPU_SET etc. in sparc64 asm. This approach has the
  benefit of not clobbering %y, which allows reverting r222827 and,
  partially, r222828.
- In r222828, CATR() already was changed to use the equivalent of
  PCPU_GET(cpuid) instead of the MD module ID for KTR_CPU, so
  belatedly also catch up with the C side of ktr(9). Originally,
  in r203838 CATR() was moved away from directly reading the
  module ID or equivalent as that became impractical with other
  CPU types than USI/II supported. With r222828 in place, per-CPU
  data generally is set up soon enough, though, that employing
  PCPU things in ktr(9) also for use during early stages works.
- Unfortunately, an exception to the latter is the ktr(9) use
  in pmap_bootstrap(), which actually is run so early that even
  checking for bootverbose being set via the loader doesn't work.
  Consequently, replace the ktr(9) use in pmap_bootstrap() with
  OF_printf(9) and put it under #ifdef DIAGNOSTIC instead.

MFC after:	3 days
2015-12-30 13:49:20 +00:00
John Baldwin
5fcfab6e32 Add ptrace(2) reporting for LWP events.
Add two new LWPINFO flags: PL_FLAG_BORN and PL_FLAG_EXITED for reporting
thread creation and destruction. Newly created threads will stop to report
PL_FLAG_BORN before returning to userland and exiting threads will stop to
report PL_FLAG_EXITED before exiting completely. Both of these events are
only enabled and reported if PT_LWP_EVENTS is enabled on a process.
2015-12-29 23:25:26 +00:00
John Baldwin
d1e7a4a553 Call kern_thr_exit() instead of duplicating it.
This code is missing the racct_subr() call from kern_thr_exit() and would
require further code duplication in future changes.

Reviewed by:	kib
MFC after:	1 week
2015-12-29 23:16:20 +00:00
Dmitry Chagin
3e18d701de Verify that tv_sec value specified in settimeofday() and clock_settime()
(CLOCK_REALTIME case) system calls is non-negative.
This commit hides a kernel panic in atrtc_settime() as the clock_ts_to_ct()
does not properly convert negative tv_sec.

P.S. In my opinion, clock_ts_to_ct() should be rewritten to properly handle
negative tv_sec values.

Differential Revision:	https://reviews.freebsd.org/D4714
Reviewed by:		kib

MFC after:	1 week
2015-12-27 15:37:07 +00:00
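The validation described in 3e18d701de can be sketched in userland C as follows. This is only an illustration: `check_settime_ts()` is a hypothetical helper name, not the kernel function, and the tv_nsec range check is a conventional addition that the commit message does not mention.

```c
#include <errno.h>
#include <time.h>

/*
 * Hypothetical stand-in for the kernel-side check: reject a timespec
 * before it reaches any calendar conversion code.
 */
static int
check_settime_ts(const struct timespec *ts)
{
	/* The rule from the commit message: tv_sec must be non-negative. */
	if (ts->tv_sec < 0)
		return (EINVAL);
	/* Assumption, not from the commit: also bound the nanoseconds. */
	if (ts->tv_nsec < 0 || ts->tv_nsec >= 1000000000L)
		return (EINVAL);
	return (0);
}
```

Rejecting the value at the system call boundary keeps the negative tv_sec away from the conversion routine, which is why the commit message describes the change as hiding the panic rather than fixing the conversion.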
Konstantin Belousov
1899507706 Do not substitute interpreter if the brand interpreter path is
different from the interpreter path requested by the binary.

Before this change, it was impossible to activate a non-default
interpreter for a 32-bit image on amd64 if the /libexec/ld-elf32.so.1 file
existed.

Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-12-26 15:40:12 +00:00
Jonathan T. Looney
d3ee0a15a4 Only allow one PT_INTERP ELF program header. This also fixes a potential
memory leak for interp_buf.

Differential Revision:	https://reviews.freebsd.org/D4692
Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	Juniper Networks
2015-12-24 00:58:11 +00:00
Enji Cooper
853a17ad6f Fix r292640
vim overzealously removed some trailing `+' and I didn't check the
diff

MFC after: 1 week
X-MFC with: r292640
Pointyhat to: ngie
Sponsored by: EMC / Isilon Storage Division
2015-12-23 03:34:43 +00:00
Enji Cooper
905b145f0a Clean up trailing whitespace; no functional change
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
2015-12-23 03:29:37 +00:00
Enji Cooper
20b4c1d2cb Fold lim_shared into lim_copy to mute a -Wunused compiler warning from
clang when the kernel is compiled without INVARIANTS

Differential Revision: https://reviews.freebsd.org/D4683
Reviewed by: kib, jhb
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
2015-12-22 21:07:33 +00:00
Konstantin Belousov
d943fa35ad If we annoy the user with terminal output due to a failed load of
the interpreter, also show the actual error code instead of some
interpretation.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-12-22 20:12:52 +00:00
Jonathan T. Looney
54503a13d8 Add a safety net to reclaim mbufs when one of the mbuf zones becomes
exhausted.

It is possible for a bug in the code (or, theoretically, even unusual
network conditions) to exhaust all possible mbufs or mbuf clusters.
When this occurs, things can grind to a halt fairly quickly. However,
we currently do not call mb_reclaim() unless the entire system is
experiencing a low-memory condition.

While it is best to try to prevent exhaustion of one of the mbuf zones,
it would also be useful to have a mechanism to attempt to recover from
these situations by freeing "expendable" mbufs.

This patch makes two changes:

a) The patch adds a generic API to the UMA zone allocator to set a
function that should be called when an allocation fails because the
zone limit has been reached. Because of the way this function can be
called, it really should do minimal work.

b) The patch uses this API to try to free mbufs when an allocation
fails from one of the mbuf zones because the zone limit has been
reached. The function schedules a callout to run mb_reclaim().

Differential Revision:	https://reviews.freebsd.org/D3864
Reviewed by:	gnn
Comments by:	rrs, glebius
MFC after:	2 weeks
Sponsored by:	Juniper Networks
2015-12-20 02:05:33 +00:00
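The two-part design in 54503a13d8 (a per-zone hook fired when the limit is hit, and a hook that does minimal work by only scheduling the reclaim) might look roughly like this in a userland sketch. All names here (`struct pool`, `pool_get()`, `schedule_reclaim()`) are hypothetical and are not the UMA API; the flag stands in for scheduling a callout.

```c
#include <stddef.h>

struct pool;
typedef void (*limit_fn)(struct pool *);

/* Hypothetical bounded pool with an optional "limit reached" hook. */
struct pool {
	size_t in_use;
	size_t limit;
	limit_fn on_limit;
	int reclaim_scheduled;
};

/* The hook does minimal work: it only records that reclaim is wanted. */
static void
schedule_reclaim(struct pool *p)
{
	p->reclaim_scheduled = 1;
}

/* Take one item; on exhaustion, fire the hook and fail the request. */
static int
pool_get(struct pool *p)
{
	if (p->in_use >= p->limit) {
		if (p->on_limit != NULL)
			p->on_limit(p);
		return (-1);
	}
	p->in_use++;
	return (0);
}
```

Deferring the actual reclaim keeps the allocation failure path short, which matches the commit's remark that the function "really should do minimal work".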
Mateusz Guzik
aa0241d623 proc: fix a race which could result in dereference of bad p_pgrp pointer on fork
During fork, the p_startcopy - p_endcopy area of a process is populated with bcopy
with only proc lock held. Another forking thread can find such a process and
proceed to access p_pgrp included in said area.

Fix the problem by moving the field outside. It is being properly assigned
later.

Reviewed by:	kib
Diagnosed by:	kib
Tested by:	Fabian Keil <freebsd-listen fabiankeil.de>
MFC after:	10 days
2015-12-18 16:33:15 +00:00
Adrian Chadd
2b3ad18853 [intrng] Migrate the intrng code from sys/arm/arm to sys/kern/subr_intr.c.
The ci20 port (by kan@) is going to reuse almost all of the intrng code
since the SoC in question looks suspiciously like someone took an ARM
SoC design and replaced the ARM core with a MIPS core.

* migrate out the code;
* rename ARM_ -> INTR_;
* rename arm_ -> intr_;
* move the interrupt flush routine from intr.c / intrng.c into
  arm/machdep_intr.c - removing the code duplication and removing
  the ARM specific bits from here.

Thanks to the Star Wars: The Force Awakens premiere line for allowing
me a couple hours of quiet time to finish the universe builds.

Tested:

* make universe

TODO:

* The structure definitions in subr_intr.c still include machine/intr.h,
  which requires one to duplicate all of the intrng definitions in
  the platform code (which kan has done, and I think we don't have to).

  Instead I should break out the generic things (function declarations,
  common intr structures, etc) into a separate header.

* Kan has requested I make the PIC based IPI stuff optional.
2015-12-18 05:43:59 +00:00
Mark Johnston
8ff6d9dd22 Support an arbitrary number of arguments to DTrace syscall probes.
Rather than pushing all eight possible arguments into dtrace_probe()'s
stack frame, make the syscall_args struct for the current syscall available
via the current thread. Using a custom getargval method for the systrace
provider, this allows any syscall argument to be fetched, even in kernels
that have modified the maximum number of system call arguments.

Sponsored by:	EMC / Isilon Storage Division
2015-12-17 00:00:27 +00:00
Mark Johnston
3616095801 Fix style issues around existing SDT probes.
- Use SDT_PROBE<N>() instead of SDT_PROBE(). This has no functional effect
  at the moment, but will be needed for some future changes.
- Don't hardcode the module component of the probe identifier. This is
  set automatically by the SDT framework.

MFC after:	1 week
2015-12-16 23:39:27 +00:00
Gleb Smirnoff
b0cd20172d A change to KPI of vm_pager_get_pages() and underlying VOP_GETPAGES().
o With new KPI consumers can request contiguous ranges of pages, and
  unlike before, all pages will be kept busied on return, like it was
  done before with the 'reqpage' only. Now the reqpage goes away. With
  new interface it is easier to implement code protected from race
  conditions.

  Such arrayed requests for now should be preceded by a call to
  vm_pager_haspage() to make sure that request is possible. This
  could be improved later, making vm_pager_haspage() obsolete.

  Strengthening the promises on the busy state of the array of pages
  allows us to remove such hacks as swp_pager_free_nrpage() and
  vm_pager_free_nonreq().

o New KPI accepts two integer pointers that may optionally point at
  values for read ahead and read behind, that a pager may do, if it
  can. These pages are completely owned by pager, and not controlled
  by the caller.

  This shifts the UFS-specific readahead logic from vm_fault.c, which
  should be file system agnostic, into vnode_pager.c. It also removes
  one VOP_BMAP() request per hard fault.

Discussed with:	kib, alc, jeff, scottl
Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2015-12-16 21:30:45 +00:00
Konstantin Belousov
106ebb761a Optimize vop_stdadvise(POSIX_FADV_DONTNEED). Instead of looking up a
buffer for each block number in the range with gbincore(), look up the
next instantiated buffer with the logical block number which is
greater or equal to the next lblkno.  This significantly speeds up the
iteration for sparsely populated ranges.

Move the iteration into new helper bnoreuselist(), which is structured
similarly to flushbuflist().

Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
2015-12-16 08:48:37 +00:00
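The iteration strategy from 106ebb761a can be illustrated with a small sketch. It assumes a sorted array of instantiated block numbers and a binary search as a stand-in for the kernel's buffer lookup; `next_ge()` and `visit_range()` are hypothetical names. Each step looks up the next block at or after lblkno, then continues from the found block plus one.

```c
#include <stddef.h>

/* Index of the first element >= key, or n when there is none. */
static size_t
next_ge(const long *blks, size_t n, long key)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (blks[mid] < key)
			lo = mid + 1;
		else
			hi = mid;
	}
	return (lo);
}

/*
 * Visit the instantiated blocks in [start, end).  Returns the number of
 * lookups performed and stores the number of blocks visited in *visited.
 */
static size_t
visit_range(const long *blks, size_t n, long start, long end,
    size_t *visited)
{
	size_t lookups = 0;
	long lblkno = start;

	*visited = 0;
	while (lblkno < end) {
		size_t i = next_ge(blks, n, lblkno);

		lookups++;
		if (i == n || blks[i] >= end)
			break;
		(*visited)++;
		lblkno = blks[i] + 1;	/* found block plus one */
	}
	return (lookups);
}
```

With three instantiated blocks spread over a range of a million block numbers, this loop performs four lookups, where a per-block probe would perform a million.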
Konstantin Belousov
8549b4b9fe Simplify the loop step in the flushbuflist() and make it independent of
the type stability of the buffers' memory.  Instead of memoizing
the pointer to the next buffer and validating it, remember the next
logical block number in the bo list and re-lookup.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2015-12-16 08:39:51 +00:00
Adrian Chadd
45130aa12e Don't call wakeup if we're just returning reserved space; just
return the reservation and wait for more space to appear.

Submitted by:	jeff
Reviewed by:	kib
2015-12-16 00:13:16 +00:00
Jamie Gritton
6ab6058ec4 Fix jail name checking that disallowed anything that starts with '0'.
The intention was to just limit leading zeroes on numeric names.  That
check is now improved to also catch the leading spaces and '+' that
strtoul can pass through.

PR:		204897
MFC after:	3 days
2015-12-15 17:25:00 +00:00
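The strtoul behaviour mentioned in 6ab6058ec4 is easy to overlook: the function skips leading whitespace and accepts an optional sign before the digits. A minimal sketch of a stricter check follows; `noncanonical_numeric_name()` is a hypothetical helper written for illustration, not the kernel's code.

```c
#include <ctype.h>
#include <stdlib.h>

/*
 * Hypothetical helper: returns 1 when strtoul() would accept the whole
 * string as a number but the string is not in canonical form (leading
 * zero, leading whitespace, or a sign).  Returns 0 otherwise.
 */
static int
noncanonical_numeric_name(const char *name)
{
	char *end;

	if (name[0] == '\0')
		return (0);
	(void)strtoul(name, &end, 10);
	if (end == name || *end != '\0')
		return (0);	/* not purely numeric, e.g. "0abc" */
	if (!isdigit((unsigned char)name[0]))
		return (1);	/* leading space, '+', or '-' */
	if (name[0] == '0' && name[1] != '\0')
		return (1);	/* leading zero, e.g. "007" */
	return (0);
}
```

Under this sketch, "0abc" and "0" are accepted, while "007", " 7", and "+7" are flagged, which mirrors the intent stated in the commit message.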
Edward Tomasz Napierala
af1a7b2526 Tweak comments.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2015-12-13 11:30:36 +00:00
Edward Tomasz Napierala
a8e0740e85 Actually make the 'amount' argument to racct_adjust_resource() signed,
as it was always supposed to be.

MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2015-12-13 11:21:13 +00:00
Edward Tomasz Napierala
57e2ebdc08 Avoid useless relocking.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2015-12-13 11:08:29 +00:00
Mark Johnston
d9e2e68d38 Don't make assertions about td_critnest when the scheduler is stopped.
A panicking thread always executes with a critical section held, so any
attempt to allocate or free memory while dumping will otherwise cause a
second panic. This can occur, for example, if xpt_polled_action() completes
non-dump I/O that was pending at the time of the panic. The fact that this
can occur is itself a bug, but asserting in this case does little but
reduce the reliability of kernel dumps.

Suggested by:	kib
Reported by:	pho
2015-12-11 20:05:07 +00:00
Warner Losh
da82615ae2 Create the MDT_PNP_INFO metadata record to communicate PNP info about
modules. External agents may use this data to automatically load those
modules.

Differential Review: https://reviews.freebsd.org/D3461
2015-12-11 05:27:53 +00:00
Steven Hartland
f1b13a89b0 Don't use 0 for pointer comparison
Use NULL instead of 0 for comparison with panicstr.

MFC after:	1 week
Sponsored by:	Multiplay
2015-12-08 18:38:33 +00:00
Mark Johnston
711fbd17ec Add helper functions proc_readmem() and proc_writemem().
These helper functions can be used to read in or write a buffer from or to
an arbitrary process' address space. Without them, this can only be done
using proc_rwmem(), which requires the caller to fill out a uio. This is
onerous and results in code duplication; the new functions provide a simpler
interface which is sufficient for most existing callers of proc_rwmem().

This change also adds a manual page for proc_rwmem() and the new functions.

Reviewed by:	jhb, kib
Differential Revision:	https://reviews.freebsd.org/D4245
2015-12-07 21:33:15 +00:00
Ed Maste
4c22b4686b Replace magic value ELF note type with NT_FREEBSD_ABI_TAG
As of r291909 elf_common.h provides a definition.

Suggested by:	kib
Sponsored by:	The FreeBSD Foundation
2015-12-07 18:43:27 +00:00
Konstantin Belousov
4d22d07a07 Add support for usermode (vdso-like) gettimeofday(2) and
clock_gettime(2) on ARMv7 and ARMv8 systems which have architectural
generic timer hardware. This is similar to how the RDTSC timer is used in
userspace on x86.

Fix a permission problem where generic timer access from EL0 (or
userspace on v7) was not properly initialized on APs.

For ARMv7, mark the stack non-executable. The shared page is added for
all arms (including ARMv8 64bit), and the signal trampoline code is
moved to the page.

Reviewed by:	andrew
Discussed with:	emaste, mmel
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D4209
2015-12-07 12:20:26 +00:00
Kirk McKusick
d9ea698c75 We need to zero out the clustering variables in a freed vnode structure.
For completeness add a VNASSERT that there are no threads waiting on a
range lock (this was previously checked on every vnode free).

Reported by: Rick Macklem
Fix from:    Mateusz Guzik
PR:          204949
2015-12-04 03:54:18 +00:00
Kenneth D. Merry
a9934668aa Add asynchronous command support to the pass(4) driver, and the new
camdd(8) utility.

CCBs may be queued to the driver via the new CAMIOQUEUE ioctl, and
completed CCBs may be retrieved via the CAMIOGET ioctl.  User
processes can use poll(2) or kevent(2) to get notification when
I/O has completed.

While the existing CAMIOCOMMAND blocking ioctl interface only
supports user virtual data pointers in a CCB (generally only
one per CCB), the new CAMIOQUEUE ioctl supports user virtual and
physical address pointers, as well as user virtual and physical
scatter/gather lists.  This allows user applications to have more
flexibility in their data handling operations.

Kernel memory for data transferred via the queued interface is
allocated from the zone allocator in MAXPHYS sized chunks, and user
data is copied in and out.  This is likely faster than the
vmapbuf()/vunmapbuf() method used by the CAMIOCOMMAND ioctl in
configurations with many processors (there are more TLB shootdowns
caused by the mapping/unmapping operation) but may not be as fast
as running with unmapped I/O.

The new memory handling model for user requests also allows
applications to send CCBs with request sizes that are larger than
MAXPHYS.  The pass(4) driver now limits queued requests to the I/O
size listed by the SIM driver in the maxio field in the Path
Inquiry (XPT_PATH_INQ) CCB.

There are some things that would be good to add:

1. Come up with a way to do unmapped I/O on multiple buffers.
   Currently the unmapped I/O interface operates on a struct bio,
   which includes only one address and length.  It would be nice
   to be able to send an unmapped scatter/gather list down to
   busdma.  This would allow eliminating the copy we currently do
   for data.

2. Add an ioctl to list currently outstanding CCBs in the various
   queues.

3. Add an ioctl to cancel a request, or use the XPT_ABORT CCB to do
   that.

4. Test physical address support.  Virtual pointers and scatter/gather
   lists have been tested, but I have not yet tested
   physical addresses or physical scatter/gather lists.

5. Investigate multiple queue support.  At the moment there is one
   queue of commands per pass(4) device.  If multiple processes
   open the device, they will submit I/O into the same queue and
   get events for the same completions.  This is probably the right
   model for most applications, but it is something that could be
   changed later on.

Also, add a new utility, camdd(8) that uses the asynchronous pass(4)
driver interface.

This utility is intended to be a basic data transfer/copy utility,
a simple benchmark utility, and an example of how to use the
asynchronous pass(4) interface.

It can copy data to and from pass(4) devices using any target queue
depth, starting offset and blocksize for the input and output devices.
It currently only supports SCSI devices, but could be easily extended
to support ATA devices.

It can also copy data to and from regular files, block devices, tape
devices, pipes, stdin, and stdout.  It does not support queueing
multiple commands to any of those targets, since it uses the standard
read(2)/write(2)/writev(2)/readv(2) system calls.

The I/O is done by two threads, one for the reader and one for the
writer.  The reader thread sends completed read requests to the
writer thread in strictly sequential order, even if they complete
out of order.  That could be modified later on for random I/O patterns
or slightly out of order I/O.

camdd(8) uses kqueue(2)/kevent(2) to get I/O completion events from
the pass(4) driver and also to send request notifications internally.

For pass(4) devices, camdd(8) uses a single buffer (CAM_DATA_VADDR)
per CAM CCB on the reading side, and a scatter/gather list
(CAM_DATA_SG) on the writing side.  In addition to testing both
interfaces, this makes any potential reblocking of I/O easier.  No
data is copied between the reader and the writer, but rather the
reader's buffers are split into multiple I/O requests or combined
into a single I/O request depending on the input and output blocksize.

For the file I/O path, camdd(8) also uses a single buffer (read(2),
write(2), pread(2) or pwrite(2)) on reads, and a scatter/gather list
(readv(2), writev(2), preadv(2), pwritev(2)) on writes.

Things that would be nice to do for camdd(8) eventually:

1.  Add support for I/O pattern generation.  Patterns like all
    zeros, all ones, LBA-based patterns, random patterns, etc. Right
    now you can always use /dev/zero, /dev/random, etc.

2.  Add support for a "sink" mode, so we do only reads with no
    writes.  Right now, you can use /dev/null.

3.  Add support for automatic queue depth probing, so that we can
    figure out the right queue depth on the input and output side
    for maximum throughput.  At the moment it defaults to 6.

4.  Add support for SATA device passthrough I/O.

5.  Add support for random LBAs and/or lengths on the input and
    output sides.

6.  Track average per-I/O latency and busy time.  The busy time
    and latency could also feed in to the automatic queue depth
    determination.

sys/cam/scsi/scsi_pass.h:
	Define two new ioctls, CAMIOQUEUE and CAMIOGET, that queue
	and fetch asynchronous CAM CCBs respectively.

	Although these ioctls do not have a declared argument, they
	both take a union ccb pointer.  If we declare a size here,
	the ioctl code in sys/kern/sys_generic.c will malloc and free
	a buffer for either the CCB or the CCB pointer (depending on
	how it is declared).  Since we have to keep a copy of the
	CCB (which is fairly large) anyway, having the ioctl malloc
	and free a CCB for each call is wasteful.

sys/cam/scsi/scsi_pass.c:
	Add asynchronous CCB support.

	Add two new ioctls, CAMIOQUEUE and CAMIOGET.

	CAMIOQUEUE adds a CCB to the incoming queue.  The CCB is
	executed immediately (and moved to the active queue) if it
	is an immediate CCB, but otherwise it will be executed
	in passstart() when a CCB is available from the transport layer.

	When CCBs are completed (because they are immediate or
	passdone() if they are queued), they are put on the done
	queue.

	If we get the final close on the device before all pending
	I/O is complete, all active I/O is moved to the abandoned
	queue and we increment the peripheral reference count so
	that the peripheral driver instance doesn't go away before
	all pending I/O is done.

	The new passcreatezone() function is called on the first
	call to the CAMIOQUEUE ioctl on a given device to allocate
	the UMA zones for I/O requests and S/G list buffers.  This
	may be good to move off to a taskqueue at some point.

	The new passmemsetup() function allocates memory and
	scatter/gather lists to hold the user's data, and copies
	in any data that needs to be written.  For virtual pointers
	(CAM_DATA_VADDR), the kernel buffer is malloced from the
	new pass(4) driver malloc bucket.  For virtual
	scatter/gather lists (CAM_DATA_SG), buffers are allocated
	from a new per-pass(4) UMA zone in MAXPHYS-sized chunks.
	Physical pointers are passed in unchanged.  We have support
	for up to 16 scatter/gather segments (for the user and
	kernel S/G lists) in the default struct pass_io_req, so
	requests with longer S/G lists require an extra kernel malloc.

	The new passcopysglist() function copies a user scatter/gather
	list to a kernel scatter/gather list.  The number of elements
	in each list may be different, but (obviously) the amount of data
	stored has to be identical.

	The new passmemdone() function copies data out for the
	CAM_DATA_VADDR and CAM_DATA_SG cases.

	The new passiocleanup() function restores data pointers in
	user CCBs and frees memory.

	Add new functions to support kqueue(2)/kevent(2):

	passreadfilt() tells kevent whether or not the done
	queue is empty.

	passkqfilter() adds a knote to our list.

	passreadfiltdetach() removes a knote from our list.

	Add a new function, passpoll(), for poll(2)/select(2)
	to use.

	Add devstat(9) support for the queued CCB path.

sys/cam/ata/ata_da.c:
	Add support for the BIO_VLIST bio type.

sys/cam/cam_ccb.h:
	Add a new enumeration for the xflags field in the CCB header.
	(This doesn't change the CCB header, just adds an enumeration to
	use.)

sys/cam/cam_xpt.c:
	Add a new function, xpt_setup_ccb_flags(), that allows specifying
	CCB flags.

sys/cam/cam_xpt.h:
	Add a prototype for xpt_setup_ccb_flags().

sys/cam/scsi/scsi_da.c:
	Add support for BIO_VLIST.

sys/dev/md/md.c:
	Add BIO_VLIST support to md(4).

sys/geom/geom_disk.c:
	Add BIO_VLIST support to the GEOM disk class.  Re-factor the I/O size
	limiting code in g_disk_start() a bit.

sys/kern/subr_bus_dma.c:
	Change _bus_dmamap_load_vlist() to take a starting offset and
	length.

	Add a new function, _bus_dmamap_load_pages(), that will load a list
	of physical pages starting at an offset.

	Update _bus_dmamap_load_bio() to allow loading BIO_VLIST bios.
	Allow unmapped I/O to start at an offset.

sys/kern/subr_uio.c:
	Add two new functions, physcopyin_vlist() and physcopyout_vlist().

sys/pc98/include/bus.h:
	Guard kernel-only parts of the pc98 machine/bus.h header with
	#ifdef _KERNEL.

	This allows userland programs to include <machine/bus.h> to get the
	definition of bus_addr_t and bus_size_t.

sys/sys/bio.h:
	Add a new bio flag, BIO_VLIST.

sys/sys/uio.h:
	Add prototypes for physcopyin_vlist() and physcopyout_vlist().

share/man/man4/pass.4:
	Document the CAMIOQUEUE and CAMIOGET ioctls.

usr.sbin/Makefile:
	Add camdd.

usr.sbin/camdd/Makefile:
	Add a makefile for camdd(8).

usr.sbin/camdd/camdd.8:
	Man page for camdd(8).

usr.sbin/camdd/camdd.c:
	The new camdd(8) utility.

Sponsored by:	Spectra Logic
MFC after:	1 week
2015-12-03 20:54:55 +00:00
Kirk McKusick
003a7c2b01 We need to zero out the union of pointers in a freed vnode structure.
PR:        204949
Fix from:  Mateusz Guzik
Tested by: Jason Unovitch
2015-12-03 02:04:22 +00:00
Nathan Whitehorn
f19d421ac6 Missed header_supported call from r291020: make really, really sure the brand
likes the executable.
2015-12-01 17:00:31 +00:00
Mateusz Guzik
d6767a972c capsicum: plug spurious memset in __cap_rights_init
Reviewed by:	pjd
2015-12-01 02:48:42 +00:00
Kirk McKusick
41d4f10391 As the kernel allocates and frees vnodes, it fully initializes them
on every allocation and fully releases them on every free.  These
are not trivial costs: it starts by zeroing a large structure then
initializes a mutex, a lock manager lock, an rw lock, four lists,
and six pointers. And looking at vfs.vnodes_created, these operations
are being done millions of times an hour on a busy machine.

As a performance optimization, this code update uses the uma_init
and uma_fini routines to do these initializations and cleanups only
as the vnodes enter and leave the vnode_zone. With this change the
initializations are only done kern.maxvnodes times at system startup
and then only rarely again. The frees are done only if the vnode_zone
shrinks which never happens in practice. For those curious about the
avoided work, look at the vnode_init() and vnode_fini() functions in
kern/vfs_subr.c to see the code that has been removed from the main
vnode allocation/free path.

Reviewed by: kib
Tested by:   Peter Holm
2015-11-29 21:42:26 +00:00
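The optimization in 41d4f10391 relies on paying the setup cost once, when an object first enters the cache, rather than on every allocation. A toy userland sketch of that idea is below. The names (`struct node`, `struct cache`, `cache_alloc()`, `cache_free()`) are hypothetical, and the `lock_ready` field merely stands in for the lock and list initialization the commit describes; it is not the UMA interface.

```c
#include <stdlib.h>

struct node {
	struct node *next;
	int lock_ready;		/* stands in for costly one-time setup */
};

struct cache {
	struct node *free_list;
	unsigned init_calls;	/* how often the costly setup ran */
};

/* Reuse a cached object when possible; initialize only fresh ones. */
static struct node *
cache_alloc(struct cache *c)
{
	struct node *n = c->free_list;

	if (n != NULL) {
		c->free_list = n->next;
		n->next = NULL;
		return (n);	/* still initialized from its first use */
	}
	n = malloc(sizeof(*n));
	if (n == NULL)
		return (NULL);
	n->next = NULL;
	n->lock_ready = 1;
	c->init_calls++;
	return (n);
}

/* Return the object to the cache without tearing down its state. */
static void
cache_free(struct cache *c, struct node *n)
{
	n->next = c->free_list;
	c->free_list = n;
}
```

An allocate, free, allocate sequence on an empty cache runs the setup once, since the second allocation returns the already-initialized object.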
Konstantin Belousov
724f4b62b0 Remove sv_prepsyscall, sv_sigsize and sv_sigtbl members of the struct
sysent.

sv_prepsyscall is unused.

sv_sigsize and sv_sigtbl translate signal number from the FreeBSD
namespace into the ABI domain.  It is only utilized on i386 for iBCS2
binaries.  The issue with this approach is that signals for iBCS2 were
delivered with the FreeBSD signal frame layout, which does not follow
iBCS2.  The same note is true for any other potential user if
sv_sigtbl.  In other words, if ABI needs signal number translation, it
really needs custom sv_sendsig method instead.

Sponsored by:	The FreeBSD Foundation
2015-11-28 08:49:07 +00:00
Konstantin Belousov
f186a80d8a Remove VI_AGE vnode iflag, it is unused.
Noted by:	bde
Sponsored by:	The FreeBSD Foundation
2015-11-27 01:45:40 +00:00
Konstantin Belousov
b3162b45e1 Move the comment about resident pages preventing vnode from leaving
active list, into the header comment for vdrop(), which is the
function that decides whether to leave the vnode on the list.  Note
that dirty page write-out in vinactive() is asynchronous.

Discussed with:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-11-27 01:16:35 +00:00
Andrey V. Elsukov
0991fe0117 Check that hhk_helper pointer isn't NULL before access.
It isn't forbidden to use a NULL pointer for hook_helper in the hookinfo
structure when hhook_add_hook() adds a new helper hook.
2015-11-25 07:14:58 +00:00
Konstantin Belousov
547831b6fd Rework the vnode cache recycling to meet free and unused vnodes
targets.  See the comment above wantfreevnodes variable for the
description of the algorithm.

The vfs.vlru_alloc_cache_src sysctl is removed.  New code frees
namecache sources as the last chance to satisfy the highest watermark,
instead of selecting the source vnodes randomly. This provides good
enough behaviour to keep vn_fullpath() working in most situations.
The filesystem layout with deep trees, where the removed knob was
required, is thus handled automatically.

Submitted by:	bde
Discussed with:	mckusick
Tested by:	pho
MFC after:	1 month
2015-11-24 09:45:36 +00:00
Mark Johnston
1155462e3a The buffer passed to an sbuf drain callback is not necessarily
null-terminated, so don't assume that it is.

Reported by:	pho
X-MFC-With:	r291059
2015-11-23 18:45:35 +00:00
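The rule in 1155462e3a can be shown with a drain-style callback that trusts only the length argument. The callback shape (argument pointer, data pointer, length) follows the sbuf drain convention as I understand it; `struct sink` and `sink_drain()` are hypothetical, and the return convention (bytes consumed, or a negative value on error) is an assumption of this sketch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical destination that accumulates drained bytes. */
struct sink {
	char buf[64];
	size_t used;
};

/*
 * Copy exactly `len` bytes, clipped to the space left in the sink.
 * The input is never passed to strlen() or other string functions,
 * because it may lack a terminating NUL.
 */
static int
sink_drain(void *arg, const char *data, int len)
{
	struct sink *s = arg;
	size_t n, room;

	if (len < 0)
		return (-1);
	n = (size_t)len;
	room = sizeof(s->buf) - 1 - s->used;
	if (n > room)
		n = room;
	memcpy(s->buf + s->used, data, n);
	s->used += n;
	s->buf[s->used] = '\0';	/* terminate our own copy only */
	return ((int)n);
}
```

The sink terminates its own buffer after copying, so later consumers of `buf` can treat it as a C string even though the input could not be.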
Konstantin Belousov
5e27d79314 Split kernel timekeep ABI structure vdso_sv_tk out of the struct
sysentvec.  This allows the timekeep data to be shared between similar
ABIs which cannot share sysentvec.

Make the timekeep_push_vdso() tick callback operate on the timekeep
structures instead of sysentvecs.  Without this change, if several
sysentvecs shared the vdso_sv_tk structure, the userspace data would be
updated several times on each tick.

Only allocate vdso_sv_tk in the exec_sysvec_init() sysinit when
sysentvec is marked with the new SV_TIMEKEEP flag.  This saves
allocation and update of unneeded vdso_sv_tk for ABIs which do not
provide userspace gettimeofday yet, which are the PowerPC arches right
now.

Make vdso_sv_tk allocator public, namely split out and export
alloc_sv_tk() and alloc_sv_tk_compat32().  ABIs which share timekeep
data now can allocate it manually and share as appropriate.

Requested by:	nwhitehorn
Tested by:	nwhitehorn, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-11-23 07:09:35 +00:00
Gleb Smirnoff
09c837b897 Remove remnants of the old NFS from vnode pager.
Reviewed by:	kib
Sponsored by:	Netflix
2015-11-20 23:52:27 +00:00
Edward Tomasz Napierala
a8723fb881 The freebsd4_getfsstat() was broken in r281551 to always return 0 on success.
All versions of getfsstat(3) are supposed to return the number of [o]statfs
structs in the array that was copied out.

Also fix missing bounds checking and signed comparison of unsigned types.

Submitted by:	bde@
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2015-11-20 14:08:12 +00:00
Jonathan T. Looney
1067a2ba68 Consistently enforce the restriction against calling malloc/free when in a
critical section.

uma_zalloc_arg()/uma_zfree_arg() may acquire a sleepable lock on the
zone. The malloc() family of functions may call uma_zalloc_arg() or
uma_zfree_arg().

The malloc(9) man page currently claims that free() will never sleep.
It also implies that the malloc() family of functions will not sleep
when called with M_NOWAIT. However, it is more correct to say that
these functions will not sleep indefinitely. Indeed, they may acquire
a sleepable lock. However, a developer may overlook this restriction
because the WITNESS check that catches attempts to call the malloc()
family of functions within a critical section is inconsistently
applied.

This change revises the language of the malloc(9) man page to clarify
the restriction against calling the malloc() family of functions
while in a critical section or holding a spin lock. It also adds
KASSERTs at appropriate points to make the enforcement of this
restriction more consistent.

PR:		204633
Differential Revision:	https://reviews.freebsd.org/D4197
Reviewed by:	markj
Approved by:	gnn (mentor)
Sponsored by:	Juniper Networks
2015-11-19 14:04:53 +00:00
Mark Johnston
150b709793 Remove a commented-out debug print.
MFC after:	1 week
2015-11-19 05:58:51 +00:00
Mark Johnston
1b9254f885 Add support for a configurable output channel to witness(4).
This is useful in environments where system configuration is performed by
automated interaction with the system console, since unexpected witness
output makes such automation difficult. With this change, the new
debug.witness.output_channel sysctl allows one to specify that witness
output is to be printed to the kernel log (using log(9)) rather than the
console.

Reviewed by:	cem, jhb
MFC after:	2 weeks
Relnotes:	yes
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D4183
2015-11-19 05:56:59 +00:00