Commit Graph

14776 Commits

Author SHA1 Message Date
Edward Tomasz Napierala
ac3c9819ab Refactor; no functional changes.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-03-31 17:32:28 +00:00
John Baldwin
4d805eacfa Tidy up the unmapped I/O code in qphysio.
- Move some blocks around to reduce the number of 'if (unmap)' checks.
- Use 'pbuf == NULL' instead of 'unmap'.
- Use nitems.
- Pull an assignment out of an if expression.

Reviewed by:	kib
Sponsored by:	Chelsio Communications
2016-03-31 17:27:30 +00:00
Edward Tomasz Napierala
b450d4479d Fix overflows, making it impossible to add negative amounts using rctl(8).
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-03-31 17:00:47 +00:00
Jamie Gritton
320d842101 Add osd_reserve() and osd_set_reserved(), which allow M_WAITOK allocation
of an OSD array,
2016-03-30 16:57:28 +00:00
Gleb Smirnoff
9c64cfe56c The sendfile(2) allows to send extra data from userspace before the file
data (headers).  Historically the size of the headers was not checked
against the socket buffer space.  Application could easily overcommit the
socket buffer space.

With the new sendfile (r293439) the problem remained, but a KASSERT was
inserted that checked that amount of data written to the socket matches
its space.  In case when size of headers is bigger that socket space,
KASSERT fires.  Without INVARIANTS the new sendfile won't panic, but
would report incorrect amount of bytes sent.

o With this change, the headers copyin is moved down into the cycle, after
  the sbspace() check.  The uio size is trimmed by socket space there,
  which fixes the overcommit problem and its consequences.
o The compatibility handling for FreeBSD 4 sendfile headers API is pushed
  up the stack to syscall wrappers.  This required a copy and paste of the
  code, but in turn this allowed to remove extra stack carried parameter
  from fo_sendfile_t, and embrace entire compat code into #ifdef.  If in
  future we got more fo_sendfile_t function, the copy and paste level would
  even reduce.

Reviewed by:	emax, gallatin, Maxim Dounin <mdounin mdounin.ru>
Tested by:	Vitalij Satanivskij <satan ukr.net>
Sponsored by:	Netflix
2016-03-29 19:57:11 +00:00
Edward Tomasz Napierala
35030a5dd4 Remove some NULL checks for M_WAITOK allocations.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-03-29 13:56:59 +00:00
Jamie Gritton
859f4d0611 Move the various per-type arrays of OSD data into a single structure array. 2016-03-28 22:18:37 +00:00
Warner Losh
cb49b65481 Move pccard_safe_quote() up to subr_bus.c and rename to
devctl_safe_quote() so it can be used more generally.
2016-03-28 20:16:29 +00:00
Navdeep Parhar
fddd4f6273 Plug leak in m_unshare.
m_unshare passes on the source mbuf's flags as-is to m_getcl and this
results in a leak if the flags include M_NOFREE.  The fix is to clear
the bits not listed in M_COPYALL before calling m_getcl.  M_RDONLY
should probably be filtered out too but that's outside the scope of this
fix.

Add assertions in the zone_mbuf and zone_pack ctors to catch similar
bugs.

Update netmap_get_mbuf to not pass M_NOFREE to m_getcl.  It's not clear
what the original code was trying to do but it's likely incorrect.
Updated code is no different functionally but it avoids the newly added
assertions.

Reviewed by:	gnn@
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D5698
2016-03-26 23:39:53 +00:00
Conrad Meyer
c975a5d359 Add td_swinvoltick to track last involuntary context switch
Expose in DDB via "show thread."

Reviewed by:	markj
Sponsored by:	EMC / Isilon Storage Division
2016-03-25 19:35:29 +00:00
Gleb Smirnoff
c8f59118be Space and style(9) corrections for recent mbuf changes. 2016-03-24 20:06:52 +00:00
Svatopluk Kraus
61c8fde5d6 Generalize IPI support for ARM intrng and use it for interrupt
controller IPI provider.

New struct intr_ipi is defined which keeps all info about an IPI:
its name, counter, send and dispatch methods. Generic intr_ipi_setup(),
intr_ipi_send() and intr_ipi_dispatch() functions are implemented.

An IPI provider must implement two functions:
(1) an intr_ipi_send_t function which is able to send an IPI,
(2) a setup function which initializes itself for an IPI and
    calls intr_ipi_setup() with appropriate arguments.

Differential Revision:	https://reviews.freebsd.org/D5700
2016-03-24 09:55:11 +00:00
George V. Neville-Neil
dcd070d80a Move mbuf provider under SDT to indicate that it is FreeBSD specific
and not a stable interface.

Reviewed by:	markj
MFC after:	2 weeks
Sponsored by:	Rubicon Communications (Netgate)
Differential Revision:	https://reviews.freebsd.org/D5716
2016-03-24 08:26:06 +00:00
Bryan Drewery
1c9b29f95f Pass the expected struct radix_node_head * to vfs_free_netcred.
No functional change.

struct radix_node_head's first element is rh so this was already
referring to the same address.  It was likely an unintended
s/rnh/&rnh->rh/ change from r294706 as all other rnh_walktree() callers
pass the expected struct radix_node_head * rather than obscurely passing
the address of their first element.

Sponsored by:	EMC / Isilon Storage Division
2016-03-24 04:40:07 +00:00
Bryan Drewery
57a8e34141 Fix M_RTABLE memory leak from r274118 (11/2014).
Replace free(M_RTABLE) with rn_detachhead() to match rn_inithead().

This would trigger when reloading NFS exports and was similar to
problems with pf reload [1].

PR:		194078 [1]
Sponsored by:	EMC / Isilon Storage Division
2016-03-24 03:08:39 +00:00
Edward Tomasz Napierala
68d35798b9 Wait for root mount tokens before showing the root mount prompt.
This restores the pre-r290196 behaviour, eliminating the need to manually
press '.' a couple of times to get USB to finish probing.

Note that there's still something wrong with the console (character
echoing doesn't quite work), and there's also a reported problem with
BHyVe, but those two don't seem related to the problem above.

MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-03-22 13:46:01 +00:00
George V. Neville-Neil
480f4e946d Add an mbuf provider to DTrace.
The mbuf provider is made up of a set of Statically Defined Tracepoints
which help us look into mbufs as they are allocated and freed.  This can be
used to inspect the buffers or for a simplified mbuf leak detector.

New tracepoints are:

mbuf:::m-init
mbuf:::m-gethdr
mbuf:::m-get
mbuf:::m-getcl
mbuf:::m-clget
mbuf:::m-cljget
mbuf:::m-cljset
mbuf:::m-free
mbuf:::m-freem

There is also a translator for mbufs which gives some visibility into the structure,
see mbuf.d for more details.

Reviewed by:	bz, markj
MFC after:	2 weeks
Sponsored by:	Rubicon Communications (Netgate)
Differential Revision:	https://reviews.freebsd.org/D5682
2016-03-22 13:16:52 +00:00
John Baldwin
70f52fd69f Regen. 2016-03-21 21:38:35 +00:00
John Baldwin
bb430bc740 Fully handle size_t lengths in AIO requests.
First, update the return types of aio_return() and aio_waitcomplete() to
ssize_t.

POSIX requires aio_return() to return a ssize_t so that it can represent
all return values from read() and write().  aio_waitcomplete() should use
ssize_t for the same reason.

aio_return() has used ssize_t in <aio.h> since r31620 but the manpage and
system call entry were not updated.  aio_waitcomplete() has always
returned int.

Note that this does not require new system call stubs as this is
effectively only an API change in how the compiler interprets the return
value.

Second, allow aio_nbytes values up to IOSIZE_MAX instead of just INT_MAX.

aio_read/write should now honor the same length limits as normal read/write.

Third, use longs instead of ints in the aio_return() and aio_waitcomplete()
system call functions so that the 64-bit size_t in the in-kernel aiocb
isn't truncated to 32-bits before being copied out to userland or
being returned.

Finally, a simple test has been added to verify the bounds checking on the
maximum read size from a file.
2016-03-21 21:37:33 +00:00
Maxim Konovalov
ab77175063 o "avaliable" -> "available".
PR:		208141
Submitted by:	Tyler Littlefield
2016-03-21 08:03:50 +00:00
Pedro F. Giffuni
5166fdde7f aio_qphysio(): Avoid uninitialized pointer read on error.
For the !unmap case it may happen that pbuf gets called unreferenced
when vm_fault_quick_hold_pages() fails.
Initialize it so it doesn't cause trouble.

CID:		1352776
Reviewed by:	jhb
MFC after:	1 week
2016-03-18 19:04:01 +00:00
Justin Hibbits
da1b038af9 Use uintmax_t (typedef'd to rman_res_t type) for rman ranges.
On some architectures, u_long isn't large enough for resource definitions.
Particularly, powerpc and arm allow 36-bit (or larger) physical addresses, but
type `long' is only 32-bit.  This extends rman's resources to uintmax_t.  With
this change, any resource can feasibly be placed anywhere in physical memory
(within the constraints of the driver).

Why uintmax_t and not something machine dependent, or uint64_t?  Though it's
possible for uintmax_t to grow, it's highly unlikely it will become 128-bit on
32-bit architectures.  64-bit architectures should have plenty of RAM to absorb
the increase on resource sizes if and when this occurs, and the number of
resources on memory-constrained systems should be sufficiently small as to not
pose a drastic overhead.  That being said, uintmax_t was chosen for source
clarity.  If it's specified as uint64_t, all printf()-like calls would either
need casts to uintmax_t, or be littered with PRI*64 macros.  Casts to uintmax_t
aren't horrible, but it would also bake into the API for
resource_list_print_type() either a hidden assumption that entries get cast to
uintmax_t for printing, or these calls would need the PRI*64 macros.  Since
source code is meant to be read more often than written, I chose the clearest
path of simply using uintmax_t.

Tested on a PowerPC p5020-based board, which places all device resources in
0xfxxxxxxxx, and has 8GB RAM.
Regression tested on qemu-system-i386
Regression tested on qemu-system-mips (malta profile)

Tested PAE and devinfo on virtualbox (live CD)

Special thanks to bz for his testing on ARM.

Reviewed By: bz, jhb (previous)
Relnotes:	Yes
Sponsored by:	Alex Perez/Inertial Computing
Differential Revision: https://reviews.freebsd.org/D4544
2016-03-18 01:28:41 +00:00
Conrad Meyer
82e17adcb4 fail(9): Only gather/print stacks if STACK is enabled
This is a follow-up fix to the earlier r296927.

Reported by:	bz
Sponsored by:	EMC / Isilon Storage Division
2016-03-17 01:05:53 +00:00
Conrad Meyer
70e20d4e1a fail(9): Upstreaming some fail point enhancements
This is several year's worth of fail point upgrades done at EMC Isilon. They
are interdependent enough that it makes sense to put a single diff up for them.
Primarily, we added:

- Changing all mainline execution paths to be lockless, which lets us use fail
  points in more sleep-sensitive areas, and allows more parallel execution
- A number of additional commands, including 'pause' that lets us do some
  interesting deterministic repros of race conditions
- The ability to dump the stacks of all threads sleeping on a fail point
- A number of other API changes to allow marking up the fail point's context in
  the code, and firing callbacks before and after execution
- A man page update

Submitted by:	Matthew Bryan <matthew.bryan@isilon.com>
Reviewed by:	cem (earlier version), jhb, kib, pho
With feedback from:	bdrewery
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D5427
2016-03-16 04:22:32 +00:00
Gleb Smirnoff
1d52250154 Free the temporary buffer in sysctl_handle_counter_u64_array().
Submitted by:	mjg
2016-03-15 00:21:32 +00:00
Gleb Smirnoff
b5b7b142a7 Provide sysctl(9) macro to deal with array of counter(9). 2016-03-15 00:05:00 +00:00
Justin T. Gibbs
5405e7e2ee Provide high precision conversion from ns,us,ms -> sbintime in kevent
In timer2sbintime(), calculate the second and fractional second portions of
the sbintime separately. When calculating the the fractional second portion,
use a 64bit multiply to prevent excess truncation. This avoids the ~7% error
in the original conversion for ns, and smaller errors of the same type for us
and ms.

PR: 198139
Reviewed by: jhb
MFC after: 1 week
Differential Revision:    https://reviews.freebsd.org/D5397
2016-03-12 23:02:53 +00:00
John Baldwin
f479b2ac89 Do not include system call wrappers in libc for old FreeBSD system calls.
The base system libc is only used to run binaries built on FreeBSD 7.0 and
later.  It does not need to include system call wrappers for system calls
only used by FreeBSD binaries built on versions older than 7.0.  This was
already true for "COMPAT" system calls, but now wrappers for system calls
used on FreeBSD 4 and 6 are excluded as well.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D5597
2016-03-12 22:53:46 +00:00
Edward Tomasz Napierala
2a5a08cb38 Refactor the way we restore cn_lkflags; no functional changes.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-03-12 09:05:43 +00:00
Edward Tomasz Napierala
f69db55151 Remove cn_consume from 'struct componentname'. It was never set to anything
other than 0.

Reviewed by:	kib@
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D5611
2016-03-12 08:50:38 +00:00
Edward Tomasz Napierala
213ed83855 Fix autofs triggering problem. Assume you have an NFS server,
192.168.1.1, with share "share". This commit fixes a problem
where "mkdir /net/192.168.1.1/share/meh" would return spurious
error instead of creating the directory if the target filesystem
wasn't mounted yet; subsequent attempts would work correctly.

The failure scenario is kind of complicated to explain, but it all
boils down to calling VOP_MKDIR() for the target filesystem (NFS)
with wrong dvp - the autofs vnode instead of the filesystem root
mounted over it.

Reviewed by:	kib@
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D5442
2016-03-12 07:54:42 +00:00
John Baldwin
47cedcbd72 Use SI_SUB_LAST instead of SI_SUB_SMP as the "catch-all" subsystem.
Reviewed by:	kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D5515
2016-03-11 23:18:06 +00:00
John Baldwin
8d91aced32 Regen. 2016-03-09 19:06:46 +00:00
John Baldwin
399e8c1773 Simplify AIO initialization now that it is standard.
- Mark AIO system calls as STD and remove the helpers to dynamically
  register them.
- Use COMPAT6 for the old system calls with the older sigevent instead of
  an 'o' prefix.
- Simplify the POSIX configuration to note that AIO is always available.
- Handle AIO in the default VOP_PATHCONF instead of special casing it in
  the pathconf() system call.  fpathconf() is still hackish.
- Remove freebsd32_aio_cancel() as it just called the native one directly.

Reviewed by:	kib
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D5589
2016-03-09 19:05:11 +00:00
Konstantin Belousov
3cfce8e4df Convert all panics from the link_elf_obj kernel linker for object
files format into printfs and errors to caller.  Some leaks of
resources are there, but the same leaks are present in other error
pathes.  With the change, the kernel at least boots even when module
with unexpected or corrupted ELF structure is preloaded.

Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-03-07 18:44:06 +00:00
Konstantin Belousov
13f28d969a In the link_elf_obj.c, handle sections of type SHT_AMD64_UNWIND same
as SHT_PROGBITS.  This is needed after the clang 3.8 import, which
generates that type for .eh_frame section, which had SHT_PROGBITS type
before.

Reported by:	 Nikolai Lifanov <lifanov@mail.lifanov.com>
PR:	207729
Tested by:	dim (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-03-06 00:31:11 +00:00
Justin Hibbits
534ccd7bbf Replace all resource occurrences of '0UL/~0UL' with '0/~0'.
Summary:
The idea behind this is '~0ul' is well-defined, and casting to uintmax_t, on a
32-bit platform, will leave the upper 32 bits as 0.  The maximum range of a
resource is 0xFFF.... (all bits of the full type set).  By dropping the 'ul'
suffix, C type promotion rules apply, and the sign extension of ~0 on 32 bit
platforms gets it to a type-independent 'unsigned max'.

Reviewed By: cem
Sponsored by:	Alex Perez/Inertial Computing
Differential Revision: https://reviews.freebsd.org/D5255
2016-03-03 05:07:35 +00:00
Konstantin Belousov
5db9ed8062 If callout_stop_safe() noted that the callout is currently executing,
but next invocation is cancelled while migrating,
sleepq_check_timeout() needs to be informed that the callout is
stopped.  Otherwise the thread switches off CPU and never become
runnable, since running callout could have already raced with us,
while the migrating and cancelled callout could be one which is
expected to set TDP_TIMOFAIL flag for us.  This contradicts with the
expected behaviour of callout_stop() for other callers, which
e.g. decrement references from the callout callbacks.

Add a new flag CS_MIGRBLOCK requesting report of the situation as
'successfully stopped'.

Reviewed by:	jhb (previous version)
Tested by:	cognet, pho
PR:	200992
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D5221
2016-03-02 18:46:17 +00:00
Gleb Smirnoff
0ea37a86ba Fix regression in r296242 affecting several drivers. For EXT_NET_DRV,
EXT_MOD_TYPE, EXT_DISPOSABLE types we should first execute the free
callback, then free the mbuf, otherwise we will derefernce memory that
was just freed.

Reported and tested:	jhibbits
2016-03-02 02:12:01 +00:00
Bryan Drewery
3b77ac5f9e Correct a comment. 2016-03-01 23:58:53 +00:00
John Baldwin
7b8cfe26a0 Use SCHEDULER_STOPPED() in cv_*wait*() instead of checking panicstr.
Reviewed by:	kib
MFC after:	1 month
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D5516
2016-03-01 22:51:44 +00:00
John Baldwin
f3215338ef Refactor the AIO subsystem to permit file-type-specific handling and
improve cancellation robustness.

Introduce a new file operation, fo_aio_queue, which is responsible for
queueing and completing an asynchronous I/O request for a given file.
The AIO subystem now exports library of routines to manipulate AIO
requests as well as the ability to run a handler function in the
"default" pool of AIO daemons to service a request.

A default implementation for file types which do not include an
fo_aio_queue method queues requests to the "default" pool invoking the
fo_read or fo_write methods as before.

The AIO subsystem permits file types to install a private "cancel"
routine when a request is queued to permit safe dequeueing and cleanup
of cancelled requests.

Sockets now use their own pool of AIO daemons and service per-socket
requests in FIFO order.  Socket requests will not block indefinitely
permitting timely cancellation of all requests.

Due to the now-tight coupling of the AIO subsystem with file types,
the AIO subsystem is now a standard part of all kernels.  The VFS_AIO
kernel option and aio.ko module are gone.

Many file types may block indefinitely in their fo_read or fo_write
callbacks resulting in a hung AIO daemon.  This can result in hung
user processes (when processes attempt to cancel all outstanding
requests during exit) or a hung system.  To protect against this, AIO
requests are only permitted for known "safe" files by default.  AIO
requests for all file types can be enabled by setting the new
vfs.aio.enable_usafe sysctl to a non-zero value.  The AIO tests have
been updated to skip operations on unsafe file types if the sysctl is
zero.

Currently, AIO requests on sockets and raw disks are considered safe
and are enabled by default.  aio_mlock() is also enabled by default.

Reviewed by:	cem, jilles
Discussed with:	kib (earlier version)
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D5289
2016-03-01 18:12:14 +00:00
John Baldwin
cbc4d2db75 Remove taskqueue_enqueue_fast().
taskqueue_enqueue() was changed to support both fast and non-fast
taskqueues 10 years ago in r154167.  It has been a compat shim ever
since.  It's time for the compat shim to go.

Submitted by:	Howard Su <howard0su@gmail.com>
Reviewed by:	sephe
Differential Revision:	https://reviews.freebsd.org/D5131
2016-03-01 17:47:32 +00:00
Svatopluk Kraus
32d001e479 Remove an alternative way for dealing with root interrupt controller
which is not complete. Likely, it was forgotten and not removed before
committing.
2016-03-01 11:27:58 +00:00
Svatopluk Kraus
169e6abd8f Mark other parts of interrupt framework as INTR_SOLO option specific.
Note that isrc_arg member of struct intr_irqsrc is used only for
INTR_SOLO and IPI filter. This should be remembered if IPI filters
and their arguments will be stored on another place.

This option could be unusable very soon, if interrupt controllers
implementations will not be implemented considering it.
2016-03-01 10:57:29 +00:00
Gleb Smirnoff
56a5f52e80 New way to manage reference counting of mbuf external storage.
The m_ext.ext_cnt pointer becomes a union. It can now hold the refcount
value itself. To tell that m_ext.ext_flags flag EXT_FLAG_EMBREF is used.
The first mbuf to attach a cluster stores the refcount. The further mbufs
to reference the cluster point at refcount in the first mbuf. The first
mbuf is freed only when the last reference is freed.

The benefit over refcounts stored in separate slabs is that now refcounts
of different, unrelated mbufs do not share a cache line.

For EXT_EXTREF mbufs the zone_ext_refcnt is no longer needed, and m_extadd()
becomes void, making widely used M_EXTADD macro safe.

For EXT_SFBUF mbufs the sf_ext_ref() is removed, which was an optimization
exactly against the cache aliasing problem with regular refcounting.

Discussed with:		rrs, rwatson, gnn, hiren, sbruno, np
Reviewed by:		rrs
Differential Revision:	https://reviews.freebsd.org/D5396
Sponsored by:		Netflix
2016-03-01 00:17:14 +00:00
Konstantin Belousov
1bdbd70599 Implement process-shared locks support for libthr.so.3, without
breaking the ABI.  Special value is stored in the lock pointer to
indicate shared lock, and offline page in the shared memory is
allocated to store the actual lock.

Reviewed by:	vangyzen (previous version)
Discussed with:	deischen, emaste, jhb, rwatson,
	Martin Simmons <martin@lispworks.com>
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2016-02-28 17:52:33 +00:00
Svatopluk Kraus
5b70c08cdf Move IPI related parts back to (ARM) machine specific file now, when
the interrupt framework is also going to be used by another (MIPS)
architecture. IPI implementations may vary much across different
architectures.

An IPI implementation should still define INTR_IPI_COUNT and use
intr_ipi_setup_counters() to setup IPI counters which are inside of
intrcnt[] and intrnames[] arrays. Those are used for sysctl and ddb.
Then, intr_ipi_increment_count() should be used to increment obtained
counter.

Reviewed by:	imp
Differential Revision:	https://reviews.freebsd.org/D5459
2016-02-27 12:03:07 +00:00
Ed Schouten
afc055d90f Remove the errno argument from unp_drop().
While there, add a comment to clarify that ECONNRESET should always be
returned for POSIX conformance.

Suggested by:	Steven Hartland
2016-02-26 12:46:34 +00:00
Mark Johnston
0acf5d0bfd Improve error handling for posix_fallocate(2) and posix_fadvise(2).
- Set td_errno so that ktrace and dtrace can obtain the syscall error
  number in the usual way.
- Pass negative error numbers directly to the syscall layer, as they're
  not intended to be returned to userland.

Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D5425
2016-02-25 19:58:23 +00:00
Ed Schouten
72c8072ee5 Make asynchronous connection failures on UNIX sockets fail with ECONNRESET.
While making CloudABI work well on Linux, I discovered that I had a
FreeBSD-ism in one of my unit tests. The test did the following:

- Create UNIX socket 1, bind it, make it listen.
- Create UNIX socket 2, connect it to UNIX socket 1.
- Close UNIX socket 1.
- Obtain SO_ERROR from socket 2.

On FreeBSD this returns ECONNABORTED, while on Linux it returns
ECONNRESET. I dug through some of the relevant specifications[1] and it
looks like Linux is all right here. ECONNABORTED should only be returned
when the local connection (socket 2) is aborted; not the peer (socket 1).

It is of course slightly misleading: the function in which we set this
error is called uipc_abort(), but keep in mind that we're aborting the
peer, thus resetting the local socket.

[1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/connect.html

Reviewed by:	cem
Sponsored by:	Nuxi, the Netherlands
Differential Revision:	https://reviews.freebsd.org/D5419
2016-02-24 17:10:32 +00:00
Konstantin Belousov
0791e0c0e7 Provide more correct sizing of the KVA consumed by a vnode, used by
the virtvnodes calculation.  Include the size of fs-specific v_data as
the nfs nclnode inline, the NFS nclnode is bigger than either ZFS
znode or UFS inode.  Include the size of namecache_ts and short cache
path element, multiplied by the name cache population factor, again
inline.

Inline defines are used to avoid pollution of the vnode.h with the
subsystem-private objects.  Non-significant unsynchronized changes of
the definitions are fine, we do not care about that precision, and
e.g. ZFS consumes much malloced memory per vnode for reasons
unaccounted in the formula.

Lower the partition of kmem dedicated to vnodes, from 1/7 to 1/10.

The measures reduce vnode cache pressure on kmem and bring the vnode
cache memory use below some apparent thresholds that were exceeded by
r291244 due to more robust vnode reuse.

Reported and tested by:	marius (i386, previous version)
Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-02-24 15:15:46 +00:00
Bryan Drewery
200241b504 Fix build after r295934. 2016-02-23 23:37:10 +00:00
Mariusz Zaborski
94e6fdd806 According to the sys/kern/capabilities.conf, gethostid(3) should be allowed.
Pointed out by:	Milosz Kaniewski <m.kaniewski@wheelsystems.com>
Approved by:	pjd (mentor)
MFC after:	3 days
Sponsored by:	Wheel Systems, http://wheelsystems.com
2016-02-23 22:02:25 +00:00
Ian Lepore
85143dd18d Allow a dynamic env to override a compiled-in static env by passing in the
override indication in the env data.

Submitted by:	bde
2016-02-21 18:35:01 +00:00
Justin Hibbits
7915adb560 Introduce a RMAN_IS_DEFAULT_RANGE() macro, and use it.
This simplifies checking for default resource range for bus_alloc_resource(),
and improves readability.

This is part of, and related to, the migration of rman_res_t from u_long to
uintmax_t.

Discussed with:	jhb
Suggested by:	marcel
2016-02-20 01:32:58 +00:00
Mark Johnston
88c2beac9c Ensure that we test the event condition when a disabled kevent is enabled.
r274560 modified kqueue_register() to only test the event condition if the
corresponding knote is not disabled. However, this check takes place before
the EV_ENABLE flag is used to clear the KN_DISABLED flag on the knote, so
enabling a previously-disabled kevent would not result in a notification for
a triggered event. This change fixes the problem by testing for EV_ENABLED
before possibly checking the event condition.

This change also updates a kqueue regression test to exercise this case.

PR:		206368
Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D5307
2016-02-19 01:49:33 +00:00
Mark Johnston
fe169828c3 Return an error if both EV_ENABLE and EV_DISABLE are specified for a kevent.
Currently, this combination results in EV_DISABLE being ignored.

Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D5307
2016-02-19 01:35:01 +00:00
Zbigniew Bodek
910905c74f Fix build for i386 and arm64 after r295755
- Take bus_space_tag_t type into consideration when returning
  default, zero value.
- Include missing rman.h required by ofw_pci.h
2016-02-18 15:44:45 +00:00
Zbigniew Bodek
b998c9656b Introduce bus_get_bus_tag() method
Provide bus_get_bus_tag() for sparc64, powerpc, arm, arm64 and mips
nexus and its children in order to return a platform specific default tag.

This is required to ensure generic correctness of the bus_space tag.
It is especially needed for arches where child bus tag does not match
the parent bus tag. This solves the problem with ppc architecture
where the PCI bus tag differs from parent bus tag which is big-endian.

This commit is a part of the following patch:
https://reviews.freebsd.org/D4879

Submitted by:  Marcin Mazurek <mma@semihalf.com>
Obtained from: Semihalf
Sponsored by:  Annapurna Labs
Reviewed by:   jhibbits, mmel
Differential Revision: https://reviews.freebsd.org/D4879
2016-02-18 13:00:04 +00:00
Konstantin Belousov
fa48f413ef In bnoreuselist(), check both ends of the specified logical block
numbers range.

This effectively skips indirect and extdata blocks on the buffer
queue.  Since their logical block numbers are negative, bnoreuselist()
could loop infinitely.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-02-17 19:39:57 +00:00
Warner Losh
c55f57071a Create an API to reset a struct bio (g_reset_bio). This is mandatory
for all struct bio you get back from g_{new,alloc}_bio. Temporary
bios that you create on the stack or elsewhere should use this before
first use of the bio, and between uses of the bio. At the moment, it
is nothing more than a wrapper around bzero, but that may change in
the future. The wrapper also removes one place where we encode the
size of struct bio in the KBI.
2016-02-17 17:16:02 +00:00
Andrew Turner
96922be6b5 Remove an unused FDT header, fdt_common.h should only be needed in a few
places, mostly in sys/dev/fdt and legacy code.

Sponsored by:	ABT Systems Ltd
2016-02-15 17:05:03 +00:00
Adrian Chadd
0cc5515a85 Allow MIPS INTRNG code to be built without FDT support.
This patch allows the newly imported INTRNG code to be built without necessarily
having FDT support in the kernel.  This may be useful for some MIPS platforms
that wish to move to INTRNG, but not to FDT at the same time.

Basically all the code is already within ifdef's where FDT is concerned,
it's just the headers that aren't.

Submitted by:	Stanislav Galabov <sgalabov@gmail.com>
Differential Revision:	https://reviews.freebsd.org/D5249
2016-02-15 14:34:35 +00:00
Gleb Smirnoff
5e4bc63b7c o Gather all mbuf(9) allocation functions into kern_mbuf.c, and all
mbuf(9) manipulation functions into uipc_mbuf.c.  This looks like
  the initial intent, but had diffused in the last decade.

o Gather all declarations in mbuf.h in one place and sort them.

o Uninline m_clget() and m_cljget().

There are no functional changes in this patch.

The patch comes from a larger version, where all mbuf(9) allocation was
uninlined, which allowed to make mbuf(9) UMA zones private to kern_mbuf.c.
The performance impact of the total uninlining is still unclear, so we
are holding on now with larger version.

Together with:	melifaro, olivier
2016-02-11 21:32:23 +00:00
Konstantin Belousov
0be1e0e879 Remove useless checks for NULL before calling free(9), in the kernel
elf linkers.

Found by:	Related PVS-Studio diagnostic
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-02-10 21:35:00 +00:00
Konstantin Belousov
86a448c3a4 Finish r173600. There is no need to test a condition if both cases
result in the same value.

Found by:	PVS-Studio
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-02-10 21:16:37 +00:00
Gleb Smirnoff
b4b12e52fb Garbage collect unused arguments of m_init(). 2016-02-10 18:54:18 +00:00
Gleb Smirnoff
b28cc462ad Include sys/_task.h into uma_int.h, so that taskqueue.h isn't a
requirement for uma_int.h.

Suggested by:	jhb
2016-02-09 20:22:35 +00:00
Konstantin Belousov
db57c70a5b Rename P_KTHREAD struct proc p_flag to P_KPROC.
I left as is an apparent bug in ntoskrnl_var.h:AT_PASSIVE_LEVEL()
definition.

Suggested by:	jhb
Sponsored by:	The FreeBSD Foundation
2016-02-09 16:30:16 +00:00
John Baldwin
fb1f4582ff Call kthread_exit() rather than kproc_exit() for a premature kthread exit.
Kernel threads (and processes) are supposed to call kthread_exit() (or
kproc_exit()) to terminate.  However, the kernel includes a fallback in
fork_exit() to force a kthread exit if a kernel thread's "main" routine
returns.  This fallback was added back when the kernel only had processes
and was not updated to call kthread_exit() instead of kproc_exit() when
threads were added to the kernel.

This mistake was particular exciting when the errant thread belonged to
proc0.  Due to the missing P_KTHREAD flag the fallback did not kick in
and instead tried to return to userland via whatever garbage was in the
trapframe.  With P_KTHREAD set it tried to terminate proc0 resulting in
other amusements.

PR:		204999
MFC after:	1 week
2016-02-08 23:11:23 +00:00
John Baldwin
6270fa5f72 Mark proc0 as a kernel process via the P_KTHREAD flag.
All other kernel processes have this flag set and all threads in proc0
(including thread0) have the similar TDP_KTHREAD flag set.

PR:		204999
Submitted by:	Oliver Pinter @ HardenedBSD
Reviewed by:	kib
MFC after:	1 week
2016-02-08 23:06:27 +00:00
Konstantin Belousov
2dab579bfc Remove the assert which outlived its usefulness, and, by default,
disable compilation of the code which made it possible to call
stop_all_proc() from usermode at all.

Move the comment to the preamble of stop_all_proc() and reword it to
give overview of the function intent.

proc0 has P_HADTHREADS flag set due to kthread_add(), but no
P_KTHREAD, which triggered the assert, which does not serve a purpose
now.

Reported by:	Oliver Pinter
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-02-08 10:54:27 +00:00
Jilles Tjoelker
f00fb5457e semget(): Check for [EEXIST] error first.
Although POSIX literally permits failing with [EINVAL] if IPC_CREAT and
IPC_EXCL were both passed, the semaphore set already exists and has fewer
semaphores than nsems, this does not allow an application to retry safely:
if the [EINVAL] is actually because of the semmsl limit, an infinite loop
would result.

PR:		206927
2016-02-07 22:12:39 +00:00
Pedro F. Giffuni
7559912039 Minor grammar fix in comment. 2016-02-07 16:18:12 +00:00
Kirk McKusick
b00b459084 Clarify a comment in kern_openat() about the use of falloc_noinstall().
Suggested by: Steve Jacobson
2016-02-07 01:04:47 +00:00
Mateusz Guzik
0c829a301d fork: ansify sys_pdfork
No functional changes.
2016-02-06 09:01:03 +00:00
John Baldwin
5652770d8f Rename aiocblist to kaiocb and use consistent variable names.
Typically <foo>list is used for a structure that holds a list head in
FreeBSD, not for members of a list.  As such, rename 'struct aiocblist'
to 'struct kaiocb' (the kernel version of 'struct aiocb').

While here, use more consistent variable names for AIO control blocks:

- Use 'job' instead of 'aiocbe', 'cb', 'cbe', or 'iocb' for kernel job
  objects.
- Use 'jobn' instead of 'cbn' for use with TAILQ_FOREACH_SAFE().
- Use 'sjob' and 'sjobn' instead of 'scb' and 'scbn' for fsync jobs.
- Use 'ujob' instead of 'aiocbp', 'job', 'uaiocb', or 'uuaiocb' to hold
  a user pointer to a 'struct aiocb'.
- Use 'ujobp' instead of 'aiocbp' for a user pointer to a 'struct aiocb *'.

Reviewed by:	kib
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D5125
2016-02-05 20:38:09 +00:00
Konstantin Belousov
af582aaed1 When matching brand to the ELF binary by notes, try to find a brand
with interpreter name exactly matching one wanted by the binary.  If
no such brand exists, return first brand which accepted the binary by
note.

The change fixes a regression after r292749, where e.g. our two ia32
compat brands, ia32_brand_info and ia32_brand_oinfo, only differ by
the interpeter path and binary matches to a brand by linkage order.
Then old binaries which require /usr/libexec/ld-elf.so.1 but matched
against ia32_brand_info with interp_path /libexec/ld-elf.so.1, were
considered requiring non-standard interpreter name, and magic to force
ld-elf32.so.1 did not happen.

Note that it might make sense to apply the same selection of brands
for other matching criteria, SCO EI_OSABI and 3.x string.

Reported and tested by:	dwmalone
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2016-02-04 20:55:49 +00:00
Konstantin Belousov
76c404fce5 Do not copy by field when converting struct oexport_args to struct
export_args on mount update, bzero() is consistent with
vfs_oexport_conv().
Make the code structure more explicit by using switch.
Return EINVAL if export option layout (deduced from size) is unknown.

Based on the submission by:	bde
Sponsored by:	The FreeBSD Foundation
2016-02-04 16:32:21 +00:00
Konstantin Belousov
4732ae438a Guard against runnable td2 exiting and than being reused for unrelated
process when the parent sleeps waiting for the debugger attach on
fork.

Diagnosed and reviewed by:	mjg
Sponsored by:	The FreeBSD Foundation
2016-02-04 10:49:34 +00:00
Mateusz Guzik
813361c140 fork: plug a use after free of the returned process
fork1 required its callers to pass a pointer to struct proc * which would
be set to the new process (if any). procdesc and racct manipulation also
used said pointer.

However, the process could have exited prior to do_fork return and be
automatically reaped, thus making this a use-after-free.

Fix the problem by letting callers indicate whether they want the pid or
the struct proc, return the process in stopped state for the latter case.

Reviewed by:	kib
2016-02-04 04:25:30 +00:00
Mateusz Guzik
33fd9b9a2b fork: pass arguments to fork1 in a dedicated structure
Suggested by:	kib
2016-02-04 04:22:18 +00:00
Gleb Smirnoff
e60b2fcbeb Redo r292484. Embed task(9) into zone, so that uz_maxaction is called
in a context that can sleep, allowing consumers of the KPI to run their
drain routines without any extra measures.

Discussed with:	jtl
2016-02-03 23:30:17 +00:00
Gleb Smirnoff
9542ea7b80 Move uma_dbg_alloc() and uma_dbg_free() into uma_core.c, which allows
to make uma_dbg.h not depend on uma_int.h, which allows to uninclude
uma_int.h from the mbuf(9) allocator.
2016-02-03 22:02:36 +00:00
Alfred Perlstein
7325dfbb59 Increase max allowed backlog for listen sockets
from short to int.

PR: 203922
Submitted by: White Knight <white_knight@2ch.net>
MFC After: 4 weeks
2016-02-02 05:57:59 +00:00
Gleb Smirnoff
8ec07310fa These files were getting sys/malloc.h and vm/uma.h with header pollution
via sys/mbuf.h
2016-02-01 17:41:21 +00:00
Andrew Turner
f9c2ec64be Fix the logic in the ddb command 'show ktr /a'. Prior to r118269 it would
print until cncheckc returned a non -1, i.e. a character had been entered.
After this change it would print only if cncheckc returned a character.
As this was before each call to db_mach_vtrace the normal outcome was
nothing was printed.

With this change 'show ktr /a' will now keep printing until the user stops
the command with a key press, or there is no more entries to print.
2016-01-31 17:32:20 +00:00
Eric van Gyzen
0e3d6ed44e kqueue EVFILT_PROC: avoid collision between NOTE_CHILD and NOTE_EXIT
NOTE_CHILD and NOTE_EXIT return something in kevent.data: the parent
pid (ppid) for NOTE_CHILD and the exit status for NOTE_EXIT.
Do not let the two events be combined, since one would overwrite
the other's data.

PR:		180385
Submitted by:	David A. Bright <david_a_bright@dell.com>
Reviewed by:	jhb
MFC after:	1 month
Sponsored by:	Dell Inc.
Differential Revision:	https://reviews.freebsd.org/D4900
2016-01-28 20:24:15 +00:00
Kirk McKusick
d2a28cb080 The bread() function was inconsistent about whether it would return
a buffer pointer in the event of an error (for some errors it would
return a buffer pointer and for other errors it would not return a
buffer pointer). The cluster_read() function was similarly inconsistent.

Clients of these functions were inconsistent in handling errors.
Some would assume that no buffer was returned after an error and
would thus lose buffers under certain error conditions. Others would
assume that brelse() should always be called after an error and
would thus panic the system under certain error conditions.

To correct both of these problems with minimal code churn, bread()
and cluster_write() now always free the buffer when returning an
error thus ensuring that buffers will never be lost. The brelse()
routine checks for being passed a NULL buffer pointer and silently
returns to avoid panics. Thus both approaches to handling error
returns from bread() and cluster_read() will work correctly.

Future code should be written assuming that bread() and cluster_read()
will never return a buffer with an error, so should not attempt to
brelse() the buffer when an error is returned.

Reviewed by: kib
2016-01-27 21:23:01 +00:00
Mateusz Guzik
4dd3a21fb3 ktrace: tidy up ktrstruct
- minor style fixes
- avoid doing strlen twice [1]

PR:		206648
Submitted by:	C Turt <ecturt gmail.com> (original version) [1]
2016-01-27 19:55:02 +00:00
Justin Hibbits
2dd1bdf183 Convert rman to use rman_res_t instead of u_long
Summary:
Migrate to using the semi-opaque type rman_res_t to specify rman resources.  For
now, this is still compatible with u_long.

This is step one in migrating rman to use uintmax_t for resources instead of
u_long.

Going forward, this could feasibly be used to specify architecture-specific
definitions of resource ranges, rather than baking a specific integer type into
the API.

This change has been broken out to facilitate MFC'ing drivers back to 10 without
breaking ABI.

Reviewed By: jhb
Sponsored by:	Alex Perez/Inertial Computing
Differential Revision: https://reviews.freebsd.org/D5075
2016-01-27 02:23:54 +00:00
John Baldwin
0dd6c0352b Various style fixes.
- Wrap long lines.
- Fix indentation.
- Remove excessive parens.
- Whitespace fixes in struct definitions.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D5025
2016-01-26 21:24:49 +00:00
Konstantin Belousov
2cd5358a66 Don't clear the software flow control flag before draining for last
close or assert the bug that it is clear when leaving.

Remove an unrelated rotted comment that was attached to the buggy
clearing.

Since draining is not done in more cases, flushing is needed in more
cases, so start fixing flushing:
- do a full flush in ttydisc_close().  State what POSIX requires more
  clearly.  This was missing ttydevsw_pktnotify() calls to tell the
  devsw layer to flush.  Hardware tty drivers don't actually flush
  since they don't understand this API.
- fix 2 missing wakeups in tty_flush().  Most of the wakeups here are
  unnecessary for last close.  But ttydisc_close() did one of the
  missing ones.

This flow control bug ameliorated the design bug of requiring
potentially unbounded waits in draining.  Software flow control is the
easiest way to get an unbounded wait, and a long wait is sometimes
actually useful.  Users can type the xoff character on the receiver
and (if ixon is set on the sender) expect the output to be held until
the user is ready for more.

Hardware flow control can also give the unbounded wait, and this bug
didn't affect hardware flow control.  Unbounded waits from hardware
flow control take a more unusual configuration.  E.g., a terminal
program that controls the modem status lines, or unplugging the cable
in a configuration where this doesn't break the connection.

The design bug is still ameliorated by a newer bug in draining for
last close -- the 1 second timeout.  E.g., if the user types the
xoff character and the sender reaches last close, then output is
not resumed and the wait times out after just 1 second.  This is
broken, but preferable to an unbounded wait.  Before this change,
the output was resumed immediately and usually completed.

Submitted by:	bde
MFC after:	2 weeks
2016-01-26 14:46:39 +00:00
Edward Tomasz Napierala
ed81020097 Fix the way RCTL handles rules' rrl_exceeded on credenials change.
Because of what this variable does, it was probably harmless - but
still incorrect.

MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-01-26 11:28:55 +00:00
Konstantin Belousov
88d74d64d7 Restore flushing of output for revoke(2) again. Document revoke()'s
intended behaviour in its man page.  Simplify tty_drain() to match.
Don't call ttydevsw methods in tty_flush() if the device is gone
since we now sometimes call it then.

The flushing was supposed to be implemented by passing the FNONBLOCK
flag to VOP_CLOSE() for revoke().  The tty driver is one of the few
that can block in close and was one of the fewer that knew about this.

This almost worked in FreeBSD-1 and similarly in Net/2.  These
versions only almost worked because there was and is considerable
confusion between IO_NDELAY and FNONBLOCK (aka O_NONBLOCK).  IO_NDELAY
is only valid for VOP_READ() and VOP_WRITE().  For other VOPs it has
the same value as O_SHLOCK.  But since vfs_subr.c and tty.c
consistently used the wrong flag and the O_SHLOCK flag is rarely set,
this mostly worked.  It also gave the feature than applications could
get the non-blocking close by abusing O_SHLOCK.

This was first broken then fixed in 1995.  I changed only the tty
driver to use FNONBLOCK, as a hack to get non-blocking via the normal
flag FNONBLOCK for last closes.  I didn't know about revoke()'s use
of IO_NDELAY or change it to be consistent, so revoke() was broken.
Then I changed revoke() to match.

This was next broken in 1997 then fixed in 1998.  Importing Lite2 made
the flags inconsistent again by undoing the fix only in vfs_subr.c.

This was next broken in 2008 by replacing everything in tty.c and not
checking any flags in last close.  Other bugs in draining limited the
resulting unbounded waits to drain in some cases.

It is now possible to fix this better using the new FREVOKE flag.
Just restore flushing for revoke() for now.  Don't restore or undo any
hacks for ordinary last closes yet.  But remove dead code in the
1-second relative timeout (r272789).  This did extra work to extend
the buggy draining for revoke() for as long as possible.  The 1-second
timeout made this not very long by usually flushing after 1 second.

Submitted by:	bde
MFC after:	2 weeks
2016-01-26 07:57:44 +00:00
Mark Johnston
87b87bb853 Evaluate the sysctl_running fail point before taking the sysctl lock.
The fail point handler may sleep, but this is not permitted while holding a
rm read lock.

MFC after:	1 week
Sponsored by:	EMC / Isilon Storage Division
2016-01-26 01:15:18 +00:00
Marius Strobl
57169cea64 - Make the code consistent with itself style-wise and bring it closer
to style(9).
- Mark unused arguments as such.
- Make the ttystates table const.
2016-01-25 22:58:06 +00:00
Konstantin Belousov
2e77021e4c Don't allow opening the callout device when the callin device is already
open (in disguise as the console device).  The only allowed combination
was supposed to be the callin device with the console.

Fix the assertion in ttydev_close() that was meant to detect this (it
only detected all 3 devices being open).  Assert this in ttydev_open()
too.

Submitted by:	bde
MFC after:	2 weeks
2016-01-25 16:47:20 +00:00
Konstantin Belousov
3593a18a91 Fix the %b flags string for ddb. All bits above the 5th
(TF_OPENED_CONS) were broken in r188147 by adding TF_OPENED_CONS
without updating the string.  It was especially confusing to display
OPENED_CONS as GONE and BYPASS as ZOMBIE.  2 flags at the end were
not updated in r188487.

Don't print an extra 0x prefix for %p in a ddb command.  In the rest
of the kernel there are more than 6000 lines with %p and only about
40 with this bug.

Print a non-extra 0x prefix for %b in a ddb command.  In the rest
of the kernel, there are approx. 180 lines with %b and 2/3 of them
have this bug.

Submitted by:	bde
MFC after:	2 weeks
2016-01-25 15:37:01 +00:00