Commit Graph

14846 Commits

Author SHA1 Message Date
kib
03d3355a71 Arm and arm64 both have fueword() implemented for some time. Correct
the comment.

Sponsored by:	The FreeBSD Foundation
2016-04-20 17:28:21 +00:00
pfg
32dcf3933a Indentation issues.
Contract some lines leftover from r298310.

Mea culpa.
2016-04-20 16:19:44 +00:00
cem
81f4ce8db7 kern_rctl: Fix resource leak in error path
Ordinarily, rctl_write_outbuf frees 'sb'.  However, if we are in low memory
conditions we skip past the rctl_write_outbuf.  In that case, free 'sb'.

Reported by:	Coverity
CID:		1338539
Sponsored by:	EMC / Isilon Storage Division
2016-04-20 02:09:38 +00:00
pfg
a7d40a88c9 kernel: use our nitems() macro when it is available through param.h.
No functional change, only trivial cases are done in this sweep,

Discussed in:	freebsd-current
2016-04-19 23:48:27 +00:00
trasz
8298725669 Fix debugging printf.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-04-19 13:36:31 +00:00
kib
b69824cf4d Fix umtx lock/trylock for compat32.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2016-04-19 11:37:43 +00:00
markj
5af0054863 Use a loop instead of a goto in sysctl_kern_proc_kstack().
MFC after:	3 days
2016-04-17 23:22:32 +00:00
kib
f16910a47e The struct thread td_estcpu member is only used by the 4BSD scheduler.
Move it to the struct td_sched for 4BSD, removing always present
field, otherwise unused for ULE.

New scheduler method sched_estcpu() returns the estimation for
kinfo_proc consumption.  As before, it always returns 0 for ULE.

Remove sched_tick() scheduler method, unused both by 4BSD and ULE.

Update locking comment for the 4BSD struct td_sched, copying it from
the same comment for ULE.

Spell MAXPRI as PRI_MAX_TIMESHARE in the 4BSD comment.

Based on some notes from, and reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
2016-04-17 11:04:27 +00:00
cem
98188ed5c2 Add 4Kn kernel dump support
(And 4Kn minidump support, but only for amd64.)

Make sure all I/O to the dump device is of the native sector size.  To
that end, we keep a native sector sized buffer associated with dump
devices (di->blockbuf) and use it to pad smaller objects as needed (e.g.
kerneldumpheader).

Add dump_write_pad() as a convenience API to dump smaller objects with
zero padding.  (Rather than pull in NPM leftpad, we wrote our own.)

Savecore(1) has been updated to deal with these dumps.  The format for
512-byte sector dumps should remain backwards compatible.

Minidumps for other architectures are left as an exercise for the
reader.

PR:		194279
Submitted by:	ambrisko@
Reviewed by:	cem (earlier version), rpokala
Tested by:	rpokala (4Kn/512 except 512 fulldump), cem (512 fulldump)
Relnotes:	yes
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D5848
2016-04-15 17:45:12 +00:00
pfg
5b3421712d kern: for pointers replace 0 with NULL.
These are mostly cosmetical, no functional change.

Found with devel/coccinelle.
2016-04-15 16:10:11 +00:00
trasz
d4ed08909e Allocate RACCT/RCTL zones without UMA_ZONE_NOFREE; no idea why it was there
in the first place.

MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-04-15 13:34:59 +00:00
trasz
d690bf0a4e Sort variable declarations.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-04-15 11:55:29 +00:00
imp
edc06f5852 Create wrappers for uint64_t and int64_t for the tunables. While not
strictly necessary, it is more convenient.
2016-04-15 03:09:55 +00:00
jamie
e3a9ee4ccf Clean up some style(9) violations. 2016-04-14 17:07:26 +00:00
jamie
3c6ae3fb05 Separate POSIX mqueue objects in jails; actually, separate them by the
jail's root, so jails that don't have their own filesystem directory
also won't have their own mqueue namespace.

PR:		208082
2016-04-13 20:15:49 +00:00
jamie
384be5b5ff Separate POSIX sem/shm objects in jails, by prepending the jail's path
name to the object's "path".  While the objects don't have real path
names, it's a filesystem-like namespace, which allows jails to be
kept to their own space, but still allows the system / jail parent to
access a jail's IPC.

PR:		208082
2016-04-13 20:14:13 +00:00
trasz
611644daf2 Fix overflow checking.
There are some other potential problems related to overflowing racct
counters; I'll revisit those later.

Submitted by:	Pieter de Goeje (earlier version)
Reviewed by:	emaste@
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-04-12 18:13:24 +00:00
pfg
b63211eed5 Cleanup unnecessary semicolons from the kernel.
Found with devel/coccinelle.
2016-04-10 23:07:00 +00:00
jhb
454f6ff2fd Add a function to lookup a device_t object by name.
This just walks the global list of devices looking for one with the
requested name.  The one use case outside of devctl2's implementation
is for DDB commands that wish to lookup devices by name.
2016-04-10 05:05:02 +00:00
jhb
6beb82443a Add more fine-grained kernel options for NUMA support.
VM_NUMA_ALLOC is used to enable use of domain-aware memory allocation in
the virtual memory system.  DEVICE_NUMA is used to enable affinity
reporting for devices such as bus_get_domain().

MAXMEMDOM must still be set to a value greater than for any NUMA support
to be effective.  Note that 'cpuset -gd' always works if MAXMEMDOM is
enabled and the system supports NUMA.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D5782
2016-04-09 13:58:04 +00:00
bz
9c35ad4130 Make the KASSERT message in hash destroy more informative.
While the pointer might not be too helpful, the malloc type might at
least give a good hint about which hashtbl we are talking.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Reviewed by:	gnn, emaste
Differential Revision:	https://reviews.freebsd.org/D5802
2016-04-09 09:24:05 +00:00
trasz
fd767bd07b Make it possible to tweak RCTL throttling sysctls at runtime.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-04-08 18:15:31 +00:00
avg
fd12934507 topo_set_pu_id: turn a check into an assertion
The new id must not be present in any cpu set in any topology element.

MFC after:	30 days
2016-04-08 11:59:11 +00:00
kib
7c6ea895ee Use the ABI-prescribed name for SHT_X86_64_UNWIND in the loader and
kernel linker, after the r297686.

Sponsored by:	The FreeBSD Foundation
2016-04-08 10:23:48 +00:00
skra
b67819482f Fix intr_irq_shuffle(). After r297539, ISRCs doing IPI may be also
registered into global interrupt table. Thus, they must be filtered out
like per-cpu interrupts. Fortunately, it does not influence anything
on interrupt controllers which already use INTRNG.
2016-04-07 15:16:33 +00:00
skra
1e6a6a2cd5 Implement intr_isrc_init_on_cpu() and use it to replace very same
code implemented in every interrupt controller driver running SMP.
This function returns true, if provided ISRC should be enabled on
given cpu.
2016-04-07 15:00:25 +00:00
trasz
825d80e01c Add four new RCTL resources - readbps, readiops, writebps and writeiops,
for limiting disk (actually filesystem) IO.

Note that in some cases these limits are not quite precise. It's ok,
as long as it's within some reasonable bounds.

Testing - and review of the code, in particular the VFS and VM parts - is
very welcome.

MFC after:	1 month
Relnotes:	yes
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D5080
2016-04-07 04:23:25 +00:00
skra
b96ba003d0 Fix PIC lookup by device and xref. There was not taken into account
the situation that someone has a pointer to device but not its xref.
This situation is regular now, after r297539.
2016-04-06 12:48:45 +00:00
trasz
cd0a56084a Use proper locking macros in RACCT in RCTL.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-04-05 11:30:52 +00:00
avg
b2f81dcbd7 x86 topo: add some comments, descriptions and references to documentation
Plus a minor cosmetic change.

MFC after:	1 month
2016-04-05 10:36:40 +00:00
avg
14341afcfc new x86 smp topology detection code
Previously, the code determined a topology of processing units
(hardware threads, cores, packages) and then deduced a cache topology
using certain assumptions.  The new code builds a topology that
includes both processing units and caches using the information
provided by the hardware.

At the moment, the discovered full topology is used only to creeate
a scheduling topology for SCHED_ULE.
There is no KPI for other kernel uses.

Summary:
- based on APIC ID derivation rules for Intel and AMD CPUs
- can handle non-uniform topologies
- requires homogeneous APIC ID assignment (same bit widths for ID
  components)
- topology for dual-node AMD CPUs may not be optimal
- topology for latest AMD CPU models may not be optimal as the code is
  several years old
- supports only thread/package/core/cache nodes

Todo:
  - AMD dual-node processors
  - latest AMD processors
  - NUMA nodes
  - checking for homogeneity of the APIC ID assignment across packages
  - more flexible cache placement within topology
  - expose topology to userland, e.g., via sysctl nodes

Long term todo:
  - KPI for CPU sharing and affinity with respect to various resources
    (e.g., two logical processors may share the same FPU, etc)

Reviewed by:	mav
Tested by:	mav
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D2728
2016-04-04 16:09:29 +00:00
andrew
813a9f775b Include sys/rman.h directly rather than relying on header pollution.
Obtained from:	ABT Systems Ltd
Sponsored by:	The FreeBSD Foundation
2016-04-04 10:52:43 +00:00
skra
11cdd44a03 Remove FDT specific parts from INTRNG. Change its interface to make it
universal.

(1) New struct intr_map_data is defined as a container for arbitrary
description of an interrupt used by a device. Typically, an interrupt
number and configuration relevant to an interrupt controller is encoded
in such description. However, any additional information may be encoded
too like a set of cpus on which an interrupt should be enabled or vendor
specific data needed for setup of an interrupt in controller. The struct
intr_map_data itself is meant to be opaque for INTRNG.

(2) An intr_map_irq() function is created which takes an interrupt
controller identification and struct intr_map_data as arguments and
returns global interrupt number which identifies an interrupt.

(3) A set of functions to be used by bus drivers is created as well as
a corresponding set of methods for interrupt controller drivers. These
sets take both struct resource and struct intr_map_data as one of the
arguments. There is a goal to keep struct intr_map_data in struct
resource, however, this way a final solution is not limited to that.

(4) Other small changes are done to reflect new situation.

This is only first step aiming to create stable interface for interrupt
controller drivers. Thus, some temporary solution is taken. Interrupt
descriptions for devices are stored in INTRNG and two specific mapping
function are created to be temporary used by bus drivers. That's why
the struct intr_map_data is not opaque for INTRNG now. This temporary
solution will be replaced by final one in next step.

Differential Revision:	https://reviews.freebsd.org/D5730
2016-04-04 09:15:25 +00:00
trasz
e0e88029f1 Add configurable rate limit for "log" and "devctl" actions.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-04-02 09:11:52 +00:00
trasz
530be67ef4 Fix mismerge.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-04-01 18:45:04 +00:00
trasz
a36f880ad3 Drop the 'resource' argument to racct_decay(); it wouldn't make sense
to iterate separately for each resource.

MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-04-01 18:36:10 +00:00
jhb
a0ef1d0d15 Cap IOSIZE_MAX to INT_MAX for 32-bit processes.
Previously, freebsd32 binaries could submit read/write requests with lengths
greater than INT_MAX that a native kernel would have rejected.

Reviewed by:	kib
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D5788
2016-04-01 18:29:38 +00:00
trasz
02eee7b047 Call rctl_enforce() in all cases the resource usage goes up, even when called
from racct_*_force() functions.  It makes the "log" and "devctl" actions work
in those cases.

MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-04-01 17:28:55 +00:00
trasz
4f9a63f201 Reorder the functions; no functional changes.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-04-01 17:21:55 +00:00
trasz
4aa469be24 Reduce code duplication.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-04-01 17:17:32 +00:00
trasz
5f827ef346 Reduce code duplication. There should be no (intended) functional changes.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-04-01 17:05:46 +00:00
sbruno
e9ae7cf74d Repair a overflow condition where a user could submit a string that was
not getting a proper bounds check.

Thanks to CTurt for pointing at this with a big red blinking neon sign.

PR:		206761
Submitted by:	sson
Reviewed by:	cturt@hardenedbsd.org
MFC after:	3 days
2016-04-01 16:16:26 +00:00
jhb
b4f65d818d Rework handling of thread sleeps before timers are working.
Previously, calls to *sleep() and cv_*wait*() immediately returned during
early boot.  Instead, permit threads that request a sleep without a
timeout to sleep as wakeup() works during early boot.  Sleeps with
timeouts are harder to emulate without working timers, so just punt and
panic explicitly if any thread tries to use those before timers are
working.  Any threads that depend on timeouts should either wait until
SI_SUB_KICK_SCHEDULER to start or they should use DELAY() until timers
are available.

Until APs are started earlier this should be a no-op as other kthreads
shouldn't get a chance to start running until after timers are working
regardless of when they were created.

Reviewed by:	kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D5724
2016-03-31 18:10:29 +00:00
trasz
fe839a29fe Refactor; no functional changes.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-03-31 17:32:28 +00:00
jhb
df609e0cf4 Tidy up the unmapped I/O code in qphysio.
- Move some blocks around to reduce the number of 'if (unmap)' checks.
- Use 'pbuf == NULL' instead of 'unmap'.
- Use nitems.
- Pull an assignment out of an if expression.

Reviewed by:	kib
Sponsored by:	Chelsio Communications
2016-03-31 17:27:30 +00:00
trasz
1191e72322 Fix overflows, making it impossible to add negative amounts using rctl(8).
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-03-31 17:00:47 +00:00
jamie
3461615130 Add osd_reserve() and osd_set_reserved(), which allow M_WAITOK allocation
of an OSD array,
2016-03-30 16:57:28 +00:00
glebius
e05176a63d The sendfile(2) allows to send extra data from userspace before the file
data (headers).  Historically the size of the headers was not checked
against the socket buffer space.  Application could easily overcommit the
socket buffer space.

With the new sendfile (r293439) the problem remained, but a KASSERT was
inserted that checked that amount of data written to the socket matches
its space.  In case when size of headers is bigger that socket space,
KASSERT fires.  Without INVARIANTS the new sendfile won't panic, but
would report incorrect amount of bytes sent.

o With this change, the headers copyin is moved down into the cycle, after
  the sbspace() check.  The uio size is trimmed by socket space there,
  which fixes the overcommit problem and its consequences.
o The compatibility handling for FreeBSD 4 sendfile headers API is pushed
  up the stack to syscall wrappers.  This required a copy and paste of the
  code, but in turn this allowed to remove extra stack carried parameter
  from fo_sendfile_t, and embrace entire compat code into #ifdef.  If in
  future we got more fo_sendfile_t function, the copy and paste level would
  even reduce.

Reviewed by:	emax, gallatin, Maxim Dounin <mdounin mdounin.ru>
Tested by:	Vitalij Satanivskij <satan ukr.net>
Sponsored by:	Netflix
2016-03-29 19:57:11 +00:00
trasz
ca92bb3067 Remove some NULL checks for M_WAITOK allocations.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-03-29 13:56:59 +00:00
jamie
63095618a2 Move the various per-type arrays of OSD data into a single structure array. 2016-03-28 22:18:37 +00:00
imp
6441be832c Move pccard_safe_quote() up to subr_bus.c and rename to
devctl_safe_quote() so it can be used more generally.
2016-03-28 20:16:29 +00:00
np
0b3b29f07b Plug leak in m_unshare.
m_unshare passes on the source mbuf's flags as-is to m_getcl and this
results in a leak if the flags include M_NOFREE.  The fix is to clear
the bits not listed in M_COPYALL before calling m_getcl.  M_RDONLY
should probably be filtered out too but that's outside the scope of this
fix.

Add assertions in the zone_mbuf and zone_pack ctors to catch similar
bugs.

Update netmap_get_mbuf to not pass M_NOFREE to m_getcl.  It's not clear
what the original code was trying to do but it's likely incorrect.
Updated code is no different functionally but it avoids the newly added
assertions.

Reviewed by:	gnn@
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D5698
2016-03-26 23:39:53 +00:00
cem
96eb07ae2b Add td_swinvoltick to track last involuntary context switch
Expose in DDB via "show thread."

Reviewed by:	markj
Sponsored by:	EMC / Isilon Storage Division
2016-03-25 19:35:29 +00:00
glebius
fa5cb70a2d Space and style(9) corrections for recent mbuf changes. 2016-03-24 20:06:52 +00:00
skra
2683d49bfb Generalize IPI support for ARM intrng and use it for interrupt
controller IPI provider.

New struct intr_ipi is defined which keeps all info about an IPI:
its name, counter, send and dispatch methods. Generic intr_ipi_setup(),
intr_ipi_send() and intr_ipi_dispatch() functions are implemented.

An IPI provider must implement two functions:
(1) an intr_ipi_send_t function which is able to send an IPI,
(2) a setup function which initializes itself for an IPI and
    calls intr_ipi_setup() with appropriate arguments.

Differential Revision:	https://reviews.freebsd.org/D5700
2016-03-24 09:55:11 +00:00
gnn
77a2ccb155 Move mbuf provider under SDT to indicate that it is FreeBSD specific
and not a stable interface.

Reviewed by:	markj
MFC after:	2 weeks
Sponsored by:	Rubicon Communications (Netgate)
Differential Revision:	https://reviews.freebsd.org/D5716
2016-03-24 08:26:06 +00:00
bdrewery
a9f48f4d56 Pass the expected struct radix_node_head * to vfs_free_netcred.
No functional change.

struct radix_node_head's first element is rh so this was already
referring to the same address.  It was likely an unintended
s/rnh/&rnh->rh/ change from r294706 as all other rnh_walktree() callers
pass the expected struct radix_node_head * rather than obscurely passing
the address of their first element.

Sponsored by:	EMC / Isilon Storage Division
2016-03-24 04:40:07 +00:00
bdrewery
91bc17507f Fix M_RTABLE memory leak from r274118 (11/2014).
Replace free(M_RTABLE) with rn_detachhead() to match rn_inithead().

This would trigger when reloading NFS exports and was similar to
problems with pf reload [1].

PR:		194078 [1]
Sponsored by:	EMC / Isilon Storage Division
2016-03-24 03:08:39 +00:00
trasz
1c2e36026b Wait for root mount tokens before showing the root mount prompt.
This restores the pre-r290196 behaviour, eliminating the need to manually
press '.' a couple of times to get USB to finish probing.

Note that there's still something wrong with the console (character
echoing doesn't quite work), and there's also a reported problem with
BHyVe, but those two don't seem related to the problem above.

MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-03-22 13:46:01 +00:00
gnn
e4786b4992 Add an mbuf provider to DTrace.
The mbuf provider is made up of a set of Statically Defined Tracepoints
which help us look into mbufs as they are allocated and freed.  This can be
used to inspect the buffers or for a simplified mbuf leak detector.

New tracepoints are:

mbuf:::m-init
mbuf:::m-gethdr
mbuf:::m-get
mbuf:::m-getcl
mbuf:::m-clget
mbuf:::m-cljget
mbuf:::m-cljset
mbuf:::m-free
mbuf:::m-freem

There is also a translator for mbufs which gives some visibility into the structure,
see mbuf.d for more details.

Reviewed by:	bz, markj
MFC after:	2 weeks
Sponsored by:	Rubicon Communications (Netgate)
Differential Revision:	https://reviews.freebsd.org/D5682
2016-03-22 13:16:52 +00:00
jhb
ccef35d926 Regen. 2016-03-21 21:38:35 +00:00
jhb
6f8f2fe586 Fully handle size_t lengths in AIO requests.
First, update the return types of aio_return() and aio_waitcomplete() to
ssize_t.

POSIX requires aio_return() to return a ssize_t so that it can represent
all return values from read() and write().  aio_waitcomplete() should use
ssize_t for the same reason.

aio_return() has used ssize_t in <aio.h> since r31620 but the manpage and
system call entry were not updated.  aio_waitcomplete() has always
returned int.

Note that this does not require new system call stubs as this is
effectively only an API change in how the compiler interprets the return
value.

Second, allow aio_nbytes values up to IOSIZE_MAX instead of just INT_MAX.

aio_read/write should now honor the same length limits as normal read/write.

Third, use longs instead of ints in the aio_return() and aio_waitcomplete()
system call functions so that the 64-bit size_t in the in-kernel aiocb
isn't truncated to 32-bits before being copied out to userland or
being returned.

Finally, a simple test has been added to verify the bounds checking on the
maximum read size from a file.
2016-03-21 21:37:33 +00:00
maxim
86d336784d o "avaliable" -> "available".
PR:		208141
Submitted by:	Tyler Littlefield
2016-03-21 08:03:50 +00:00
pfg
eea894a1e2 aio_qphysio(): Avoid uninitialized pointer read on error.
For the !unmap case it may happen that pbuf gets called unreferenced
when vm_fault_quick_hold_pages() fails.
Initialize it so it doesn't cause trouble.

CID:		1352776
Reviewed by:	jhb
MFC after:	1 week
2016-03-18 19:04:01 +00:00
jhibbits
720f47c9ed Use uintmax_t (typedef'd to rman_res_t type) for rman ranges.
On some architectures, u_long isn't large enough for resource definitions.
Particularly, powerpc and arm allow 36-bit (or larger) physical addresses, but
type `long' is only 32-bit.  This extends rman's resources to uintmax_t.  With
this change, any resource can feasibly be placed anywhere in physical memory
(within the constraints of the driver).

Why uintmax_t and not something machine dependent, or uint64_t?  Though it's
possible for uintmax_t to grow, it's highly unlikely it will become 128-bit on
32-bit architectures.  64-bit architectures should have plenty of RAM to absorb
the increase on resource sizes if and when this occurs, and the number of
resources on memory-constrained systems should be sufficiently small as to not
pose a drastic overhead.  That being said, uintmax_t was chosen for source
clarity.  If it's specified as uint64_t, all printf()-like calls would either
need casts to uintmax_t, or be littered with PRI*64 macros.  Casts to uintmax_t
aren't horrible, but it would also bake into the API for
resource_list_print_type() either a hidden assumption that entries get cast to
uintmax_t for printing, or these calls would need the PRI*64 macros.  Since
source code is meant to be read more often than written, I chose the clearest
path of simply using uintmax_t.

Tested on a PowerPC p5020-based board, which places all device resources in
0xfxxxxxxxx, and has 8GB RAM.
Regression tested on qemu-system-i386
Regression tested on qemu-system-mips (malta profile)

Tested PAE and devinfo on virtualbox (live CD)

Special thanks to bz for his testing on ARM.

Reviewed By: bz, jhb (previous)
Relnotes:	Yes
Sponsored by:	Alex Perez/Inertial Computing
Differential Revision: https://reviews.freebsd.org/D4544
2016-03-18 01:28:41 +00:00
cem
6b40e40026 fail(9): Only gather/print stacks if STACK is enabled
This is a follow-up fix to the earlier r296927.

Reported by:	bz
Sponsored by:	EMC / Isilon Storage Division
2016-03-17 01:05:53 +00:00
cem
1cab282ecb fail(9): Upstreaming some fail point enhancements
This is several year's worth of fail point upgrades done at EMC Isilon. They
are interdependent enough that it makes sense to put a single diff up for them.
Primarily, we added:

- Changing all mainline execution paths to be lockless, which lets us use fail
  points in more sleep-sensitive areas, and allows more parallel execution
- A number of additional commands, including 'pause' that lets us do some
  interesting deterministic repros of race conditions
- The ability to dump the stacks of all threads sleeping on a fail point
- A number of other API changes to allow marking up the fail point's context in
  the code, and firing callbacks before and after execution
- A man page update

Submitted by:	Matthew Bryan <matthew.bryan@isilon.com>
Reviewed by:	cem (earlier version), jhb, kib, pho
With feedback from:	bdrewery
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D5427
2016-03-16 04:22:32 +00:00
glebius
743cf42b4f Free the temporary buffer in sysctl_handle_counter_u64_array().
Submitted by:	mjg
2016-03-15 00:21:32 +00:00
glebius
0cafb74055 Provide sysctl(9) macro to deal with array of counter(9). 2016-03-15 00:05:00 +00:00
gibbs
453fdfd69a Provide high precision conversion from ns,us,ms -> sbintime in kevent
In timer2sbintime(), calculate the second and fractional second portions of
the sbintime separately. When calculating the the fractional second portion,
use a 64bit multiply to prevent excess truncation. This avoids the ~7% error
in the original conversion for ns, and smaller errors of the same type for us
and ms.

PR: 198139
Reviewed by: jhb
MFC after: 1 week
Differential Revision:    https://reviews.freebsd.org/D5397
2016-03-12 23:02:53 +00:00
jhb
419b17235a Do not include system call wrappers in libc for old FreeBSD system calls.
The base system libc is only used to run binaries built on FreeBSD 7.0 and
later.  It does not need to include system call wrappers for system calls
only used by FreeBSD binaries built on versions older than 7.0.  This was
already true for "COMPAT" system calls, but now wrappers for system calls
used on FreeBSD 4 and 6 are excluded as well.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D5597
2016-03-12 22:53:46 +00:00
trasz
8804916675 Refactor the way we restore cn_lkflags; no functional changes.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-03-12 09:05:43 +00:00
trasz
beb648d9cc Remove cn_consume from 'struct componentname'. It was never set to anything
other than 0.

Reviewed by:	kib@
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D5611
2016-03-12 08:50:38 +00:00
trasz
faec271eeb Fix autofs triggering problem. Assume you have an NFS server,
192.168.1.1, with share "share". This commit fixes a problem
where "mkdir /net/192.168.1.1/share/meh" would return spurious
error instead of creating the directory if the target filesystem
wasn't mounted yet; subsequent attempts would work correctly.

The failure scenario is kind of complicated to explain, but it all
boils down to calling VOP_MKDIR() for the target filesystem (NFS)
with wrong dvp - the autofs vnode instead of the filesystem root
mounted over it.

Reviewed by:	kib@
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D5442
2016-03-12 07:54:42 +00:00
jhb
374ffbde40 Use SI_SUB_LAST instead of SI_SUB_SMP as the "catch-all" subsystem.
Reviewed by:	kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D5515
2016-03-11 23:18:06 +00:00
jhb
96e88fd872 Regen. 2016-03-09 19:06:46 +00:00
jhb
1b87e4306e Simplify AIO initialization now that it is standard.
- Mark AIO system calls as STD and remove the helpers to dynamically
  register them.
- Use COMPAT6 for the old system calls with the older sigevent instead of
  an 'o' prefix.
- Simplify the POSIX configuration to note that AIO is always available.
- Handle AIO in the default VOP_PATHCONF instead of special casing it in
  the pathconf() system call.  fpathconf() is still hackish.
- Remove freebsd32_aio_cancel() as it just called the native one directly.

Reviewed by:	kib
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D5589
2016-03-09 19:05:11 +00:00
kib
1e048a7127 Convert all panics from the link_elf_obj kernel linker for object
files format into printfs and errors to caller.  Some leaks of
resources are there, but the same leaks are present in other error
pathes.  With the change, the kernel at least boots even when module
with unexpected or corrupted ELF structure is preloaded.

Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-03-07 18:44:06 +00:00
kib
ffbdd975f0 In the link_elf_obj.c, handle sections of type SHT_AMD64_UNWIND same
as SHT_PROGBITS.  This is needed after the clang 3.8 import, which
generates that type for .eh_frame section, which had SHT_PROGBITS type
before.

Reported by:	 Nikolai Lifanov <lifanov@mail.lifanov.com>
PR:	207729
Tested by:	dim (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-03-06 00:31:11 +00:00
jhibbits
70aaabfeac Replace all resource occurrences of '0UL/~0UL' with '0/~0'.
Summary:
The idea behind this is '~0ul' is well-defined, and casting to uintmax_t, on a
32-bit platform, will leave the upper 32 bits as 0.  The maximum range of a
resource is 0xFFF.... (all bits of the full type set).  By dropping the 'ul'
suffix, C type promotion rules apply, and the sign extension of ~0 on 32 bit
platforms gets it to a type-independent 'unsigned max'.

Reviewed By: cem
Sponsored by:	Alex Perez/Inertial Computing
Differential Revision: https://reviews.freebsd.org/D5255
2016-03-03 05:07:35 +00:00
kib
486320aac4 If callout_stop_safe() noted that the callout is currently executing,
but next invocation is cancelled while migrating,
sleepq_check_timeout() needs to be informed that the callout is
stopped.  Otherwise the thread switches off CPU and never become
runnable, since running callout could have already raced with us,
while the migrating and cancelled callout could be one which is
expected to set TDP_TIMOFAIL flag for us.  This contradicts with the
expected behaviour of callout_stop() for other callers, which
e.g. decrement references from the callout callbacks.

Add a new flag CS_MIGRBLOCK requesting report of the situation as
'successfully stopped'.

Reviewed by:	jhb (previous version)
Tested by:	cognet, pho
PR:	200992
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D5221
2016-03-02 18:46:17 +00:00
glebius
a6bfbefef5 Fix regression in r296242 affecting several drivers. For EXT_NET_DRV,
EXT_MOD_TYPE, EXT_DISPOSABLE types we should first execute the free
callback, then free the mbuf, otherwise we will derefernce memory that
was just freed.

Reported and tested:	jhibbits
2016-03-02 02:12:01 +00:00
bdrewery
199dda9a01 Correct a comment. 2016-03-01 23:58:53 +00:00
jhb
58823d0b49 Use SCHEDULER_STOPPED() in cv_*wait*() instead of checking panicstr.
Reviewed by:	kib
MFC after:	1 month
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D5516
2016-03-01 22:51:44 +00:00
jhb
be47bc68fb Refactor the AIO subsystem to permit file-type-specific handling and
improve cancellation robustness.

Introduce a new file operation, fo_aio_queue, which is responsible for
queueing and completing an asynchronous I/O request for a given file.
The AIO subystem now exports library of routines to manipulate AIO
requests as well as the ability to run a handler function in the
"default" pool of AIO daemons to service a request.

A default implementation for file types which do not include an
fo_aio_queue method queues requests to the "default" pool invoking the
fo_read or fo_write methods as before.

The AIO subsystem permits file types to install a private "cancel"
routine when a request is queued to permit safe dequeueing and cleanup
of cancelled requests.

Sockets now use their own pool of AIO daemons and service per-socket
requests in FIFO order.  Socket requests will not block indefinitely
permitting timely cancellation of all requests.

Due to the now-tight coupling of the AIO subsystem with file types,
the AIO subsystem is now a standard part of all kernels.  The VFS_AIO
kernel option and aio.ko module are gone.

Many file types may block indefinitely in their fo_read or fo_write
callbacks resulting in a hung AIO daemon.  This can result in hung
user processes (when processes attempt to cancel all outstanding
requests during exit) or a hung system.  To protect against this, AIO
requests are only permitted for known "safe" files by default.  AIO
requests for all file types can be enabled by setting the new
vfs.aio.enable_usafe sysctl to a non-zero value.  The AIO tests have
been updated to skip operations on unsafe file types if the sysctl is
zero.

Currently, AIO requests on sockets and raw disks are considered safe
and are enabled by default.  aio_mlock() is also enabled by default.

Reviewed by:	cem, jilles
Discussed with:	kib (earlier version)
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D5289
2016-03-01 18:12:14 +00:00
jhb
15b2caff0f Remove taskqueue_enqueue_fast().
taskqueue_enqueue() was changed to support both fast and non-fast
taskqueues 10 years ago in r154167.  It has been a compat shim ever
since.  It's time for the compat shim to go.

Submitted by:	Howard Su <howard0su@gmail.com>
Reviewed by:	sephe
Differential Revision:	https://reviews.freebsd.org/D5131
2016-03-01 17:47:32 +00:00
skra
1cb2a75f09 Remove an alternative way for dealing with root interrupt controller
which is not complete. Likely, it was forgotten and not removed before
committing.
2016-03-01 11:27:58 +00:00
skra
caf6d91084 Mark other parts of interrupt framework as INTR_SOLO option specific.
Note that isrc_arg member of struct intr_irqsrc is used only for
INTR_SOLO and IPI filter. This should be remembered if IPI filters
and their arguments will be stored on another place.

This option could be unusable very soon, if interrupt controllers
implementations will not be implemented considering it.
2016-03-01 10:57:29 +00:00
glebius
163857deb4 New way to manage reference counting of mbuf external storage.
The m_ext.ext_cnt pointer becomes a union. It can now hold the refcount
value itself. To tell that m_ext.ext_flags flag EXT_FLAG_EMBREF is used.
The first mbuf to attach a cluster stores the refcount. The further mbufs
to reference the cluster point at refcount in the first mbuf. The first
mbuf is freed only when the last reference is freed.

The benefit over refcounts stored in separate slabs is that now refcounts
of different, unrelated mbufs do not share a cache line.

For EXT_EXTREF mbufs the zone_ext_refcnt is no longer needed, and m_extadd()
becomes void, making widely used M_EXTADD macro safe.

For EXT_SFBUF mbufs the sf_ext_ref() is removed, which was an optimization
exactly against the cache aliasing problem with regular refcounting.

Discussed with:		rrs, rwatson, gnn, hiren, sbruno, np
Reviewed by:		rrs
Differential Revision:	https://reviews.freebsd.org/D5396
Sponsored by:		Netflix
2016-03-01 00:17:14 +00:00
kib
e76eb4255b Implement process-shared locks support for libthr.so.3, without
breaking the ABI.  Special value is stored in the lock pointer to
indicate shared lock, and offline page in the shared memory is
allocated to store the actual lock.

Reviewed by:	vangyzen (previous version)
Discussed with:	deischen, emaste, jhb, rwatson,
	Martin Simmons <martin@lispworks.com>
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2016-02-28 17:52:33 +00:00
skra
27bb203f7c Move IPI related parts back to (ARM) machine specific file now, when
the interrupt framework is also going to be used by another (MIPS)
architecture. IPI implementations may vary much across different
architectures.

An IPI implementation should still define INTR_IPI_COUNT and use
intr_ipi_setup_counters() to setup IPI counters which are inside of
intrcnt[] and intrnames[] arrays. Those are used for sysctl and ddb.
Then, intr_ipi_increment_count() should be used to increment obtained
counter.

Reviewed by:	imp
Differential Revision:	https://reviews.freebsd.org/D5459
2016-02-27 12:03:07 +00:00
ed
4a923c8cd0 Remove the errno argument from unp_drop().
While there, add a comment to clarify that ECONNRESET should always be
returned for POSIX conformance.

Suggested by:	Steven Hartland
2016-02-26 12:46:34 +00:00
markj
9abb1836d9 Improve error handling for posix_fallocate(2) and posix_fadvise(2).
- Set td_errno so that ktrace and dtrace can obtain the syscall error
  number in the usual way.
- Pass negative error numbers directly to the syscall layer, as they're
  not intended to be returned to userland.

Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D5425
2016-02-25 19:58:23 +00:00
ed
0151167359 Make asynchronous connection failures on UNIX sockets fail with ECONNRESET.
While making CloudABI work well on Linux, I discovered that I had a
FreeBSD-ism in one of my unit tests. The test did the following:

- Create UNIX socket 1, bind it, make it listen.
- Create UNIX socket 2, connect it to UNIX socket 1.
- Close UNIX socket 1.
- Obtain SO_ERROR from socket 2.

On FreeBSD this returns ECONNABORTED, while on Linux it returns
ECONNRESET. I dug through some of the relevant specifications[1] and it
looks like Linux is all right here. ECONNABORTED should only be returned
when the local connection (socket 2) is aborted; not the peer (socket 1).

It is of course slightly misleading: the function in which we set this
error is called uipc_abort(), but keep in mind that we're aborting the
peer, thus resetting the local socket.

[1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/connect.html

Reviewed by:	cem
Sponsored by:	Nuxi, the Netherlands
Differential Revision:	https://reviews.freebsd.org/D5419
2016-02-24 17:10:32 +00:00
kib
392fea70cc Provide more correct sizing of the KVA consumed by a vnode, used by
the virtvnodes calculation.  Include the size of fs-specific v_data as
the nfs nclnode inline, the NFS nclnode is bigger than either ZFS
znode or UFS inode.  Include the size of namecache_ts and short cache
path element, multiplied by the name cache population factor, again
inline.

Inline defines are used to avoid pollution of the vnode.h with the
subsystem-private objects.  Non-significant unsynchronized changes of
the definitions are fine, we do not care about that precision, and
e.g. ZFS consumes much malloced memory per vnode for reasons
unaccounted in the formula.

Lower the partition of kmem dedicated to vnodes, from 1/7 to 1/10.

The measures reduce vnode cache pressure on kmem and bring the vnode
cache memory use below some apparent thresholds that were exceeded by
r291244 due to more robust vnode reuse.

Reported and tested by:	marius (i386, previous version)
Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-02-24 15:15:46 +00:00
bdrewery
8a44a735c3 Fix build after r295934. 2016-02-23 23:37:10 +00:00
oshogbo
5256441af8 According to the sys/kern/capabilities.conf, gethostid(3) should be allowed.
Pointed out by:	Milosz Kaniewski <m.kaniewski@wheelsystems.com>
Approved by:	pjd (mentor)
MFC after:	3 days
Sponsored by:	Wheel Systems, http://wheelsystems.com
2016-02-23 22:02:25 +00:00
ian
c764d7edd2 Allow a dynamic env to override a compiled-in static env by passing in the
override indication in the env data.

Submitted by:	bde
2016-02-21 18:35:01 +00:00
jhibbits
f8385663ee Introduce a RMAN_IS_DEFAULT_RANGE() macro, and use it.
This simplifies checking for default resource range for bus_alloc_resource(),
and improves readability.

This is part of, and related to, the migration of rman_res_t from u_long to
uintmax_t.

Discussed with:	jhb
Suggested by:	marcel
2016-02-20 01:32:58 +00:00
markj
32d1c3375a Ensure that we test the event condition when a disabled kevent is enabled.
r274560 modified kqueue_register() to only test the event condition if the
corresponding knote is not disabled. However, this check takes place before
the EV_ENABLE flag is used to clear the KN_DISABLED flag on the knote, so
enabling a previously-disabled kevent would not result in a notification for
a triggered event. This change fixes the problem by testing for EV_ENABLED
before possibly checking the event condition.

This change also updates a kqueue regression test to exercise this case.

PR:		206368
Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D5307
2016-02-19 01:49:33 +00:00