14204 Commits

Author SHA1 Message Date
Konstantin Belousov
fe8a824ca6 Handle incorrect ELF images specifying size for PT_GNU_STACK not being
multiple of page size.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2015-04-23 11:27:21 +00:00
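The commit above deals with a PT_GNU_STACK size that is not a multiple of the page size. The log does not say whether the kernel rounds the value up, rounds it down, or rejects the image, so the following is only a minimal sketch of the two rounding options. The 4096-byte page size and the `sketch_` helper names are assumptions for illustration, not the kernel's code.

```c
#include <stdint.h>

/* Assumed page size for illustration only; real code queries the platform. */
#define SKETCH_PAGE_SIZE ((uintptr_t)4096)

/* Round size down to a multiple of the (power-of-two) page size. */
static uintptr_t
sketch_trunc_page(uintptr_t size)
{
    return (size & ~(SKETCH_PAGE_SIZE - 1));
}

/*
 * Round size up to a multiple of the page size.
 * Values within one page of UINTPTR_MAX would wrap; a real
 * implementation would have to reject them first.
 */
static uintptr_t
sketch_round_page(uintptr_t size)
{
    return ((size + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1));
}
```

With these helpers a requested stack size of 10000 bytes becomes 8192 when truncated and 12288 when rounded up, while an already aligned size is left unchanged.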
Alexander Motin
f743d981f3 Make AIO not allocate pbufs for unmapped I/O, like r281825.
While there, make a few more performance optimizations.

On a 40-core system doing many 512-byte AIO reads from an array of raw SSDs,
this change removes lock congestion inside the pbuf allocator and devfs,
and the bottleneck on the single AIO completion taskqueue thread.  It improves
peak AIO performance from ~600K to ~1.3M IOPS.

MFC after:	2 weeks
2015-04-22 18:11:34 +00:00
Craig Rodrigues
d9db52256e Move zlib.c from net to libkern.
It is not network-specific code and would
be better as part of libkern instead.
Move zlib.h and zutil.h from net/ to sys/
Update includes to use sys/zlib.h and sys/zutil.h instead of net/

Submitted by:		Steve Kiernan <stevek@juniper.net>
Obtained from:		Juniper Networks, Inc.
GitHub Pull Request:	https://github.com/freebsd/freebsd/pull/28
Relnotes:		yes
2015-04-22 14:38:58 +00:00
Craig Rodrigues
d5fec48956 Support file verification in MAC.
* Add VCREAT flag to indicate when a new file is being created
* Add VVERIFY to indicate verification is required
* Both VCREAT and VVERIFY are only passed on the MAC method vnode_check_open
  and are removed from the accmode after
* Add O_VERIFY flag to rtld open of objects
* Add 'v' flag to __sflags to set O_VERIFY flag.

Submitted by:		Steve Kiernan <stevek@juniper.net>
Obtained from:		Juniper Networks, Inc.
GitHub Pull Request:	https://github.com/freebsd/freebsd/pull/27
Relnotes:		yes
2015-04-22 01:54:25 +00:00
Edward Tomasz Napierala
6289b482ec Modify kern___getcwd() to take max pathlen limit as an additional
argument.  This will be used for the Linux emulation layer - for Linux,
PATH_MAX is 4096 and not 1024.

Differential Revision:	https://reviews.freebsd.org/D2335
Reviewed by:	kib@
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2015-04-21 13:55:24 +00:00
Alexander Motin
869fd29a7b Rewrite physio() to not allocate pbufs for unmapped I/O.
pbufs are a limited resource, and their allocator is not SMP-scalable.
So instead of always allocating a pbuf only to immediately convert it to a bio,
allocate the bio directly.  If the buffer needs a kernel mapping, then a pbuf is
still allocated, but used only as a source of KVA and storage for a list
of held pages.

On a 40-core system doing many 512-byte reads from user level to an array of
raw SSDs, this change removes huge lock congestion inside the pbuf allocator.
It improves peak performance from ~300K to ~1.2M IOPS.  On my previous
24-core system this problem also existed, but was less serious.

Reviewed by:	kib
MFC after:	2 weeks
2015-04-21 10:55:53 +00:00
Eric van Gyzen
c207ff9319 Always send log(9) messages to the message buffer.
It is truer to the semantics of logging for messages to *always*
go to the message buffer, where they can eventually be collected
and, in fact, be put into a log file.

This restores the behavior prior to r70239, which seems to have
changed it inadvertently.

Submitted by:	Eric Badger <eric@badgerio.us>
Reviewed by:	jhb
Approved by:	kib (mentor)
Obtained from:	Dell Inc.
MFC after:	1 week
2015-04-20 20:03:26 +00:00
Konstantin Belousov
8103a8f608 Regen.
2015-04-18 21:50:53 +00:00
Konstantin Belousov
0538aafc41 The lseek(2), mmap(2), truncate(2), ftruncate(2), pread(2), and
pwrite(2) syscalls are wrapped to provide compatibility with pre-7.x
kernels which required padding before the off_t parameter.  The
fcntl(2) contains compatibility code to handle kernels before the
struct flock was changed during the 8.x CURRENT development.  The
shims were reasonable to allow easier revert to the older kernel at
that time.

Now, two or three major releases later, shims do not serve any
purpose.  Such old kernels cannot handle current libc, so revert the
compatibility code.

Make padded syscalls support conditional under the COMPAT6 config
option.  For COMPAT32, the syscalls were under COMPAT6 already.

Remove the WITHOUT_SYSCALL_COMPAT build option, whose only purpose was to
(partially) disable the removed shims.

Reviewed by:	jhb, imp (previous versions)
Discussed with:	peter
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-04-18 21:50:13 +00:00
Mark Johnston
38563f7c92 Remove unimplemented sched provider probes.
They were added for compatibility with the sched provider in Solaris and
illumos, but our sched provider is already incompatible since it uses native
types, so there isn't much point in keeping them around.

Differential Revision:	https://reviews.freebsd.org/D2167
Reviewed by:	rpaulo
2015-04-18 20:36:58 +00:00
Konstantin Belousov
ad8b1d857d Initialize td_sel in the thread_init(). Struct thread is not zeroed
on the initial allocation, but seltdinit() assumes that td_sel is NULL
or a valid pointer.  Note that thread_fini()/seltdfini() also relies
on this, but correctly resets td_sel to NULL.

Submitted by:	luke.tw@gmail.com
PR:	199518
MFC after:	1 week
2015-04-18 17:21:12 +00:00
Kirk McKusick
f351915514 More accurately collect name-cache statistics in sysctl functions
sysctl_debug_hashstat_nchash() and sysctl_debug_hashstat_rawnchash().
These changes are in preparation for allowing changes in the size
of the vnode hash tables driven by increases and decreases in the
maximum number of vnodes in the system.

Reviewed by: kib@
Phabric:     D2265
2015-04-18 00:59:03 +00:00
Devin Teske
43d4f8c4c6 Add "GELI Passphrase:" prompt to boot loader.
A new loader.conf(5) option of geom_eli_passphrase_prompt="YES" will now
allow you to enter your geli(8) root-mount credentials prior to invoking
the kernel.

See check-password.4th(8) for details.

Differential Revision:	https://reviews.freebsd.org/D2105
Reviewed by:	imp, kmoore
Discussed on:	-current
MFC after:	3 days
X-MFC-to:	stable/10
Relnotes:	yes
2015-04-16 20:53:15 +00:00
Rick Macklem
dda11d4ab9 File systems that do not use the buffer cache (such as ZFS) must
use VOP_FSYNC() to perform the NFS server's Commit operation.
This patch adds a mnt_kern_flag called MNTK_USES_BCACHE which
is set by file systems that use the buffer cache. If this flag
is not set, the NFS server always does a VOP_FSYNC().
This should be ok for old file system modules that do not set
MNTK_USES_BCACHE, since calling VOP_FSYNC() is correct, although
it might not be optimal for file systems that use the buffer cache.

Reviewed by:	kib
MFC after:	2 weeks
2015-04-15 20:16:31 +00:00
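The flag test described in the commit above can be pictured as a small decision function. This is a hedged sketch rather than the NFS server's code: the flag value, the enum, and the function name are hypothetical, and the only behavior taken from the commit message is that a mount without the flag always gets a VOP_FSYNC().

```c
/* Hypothetical bit value; the real MNTK_USES_BCACHE value is not given here. */
#define SKETCH_MNTK_USES_BCACHE 0x1u

enum sketch_commit_method {
    SKETCH_COMMIT_FSYNC,        /* always sync through VOP_FSYNC() */
    SKETCH_COMMIT_MAY_USE_BCACHE /* buffer-cache based path is permitted */
};

/*
 * Pick how a Commit would be carried out for a mount with the given
 * kernel flags.  Old modules that never set the flag fall into the
 * VOP_FSYNC() case, which is correct although possibly not optimal.
 */
static enum sketch_commit_method
sketch_choose_commit(unsigned int mnt_kern_flag)
{
    if ((mnt_kern_flag & SKETCH_MNTK_USES_BCACHE) == 0)
        return (SKETCH_COMMIT_FSYNC);
    return (SKETCH_COMMIT_MAY_USE_BCACHE);
}
```

Defaulting to the VOP_FSYNC() path when the flag is absent is what keeps unmodified file system modules safe.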
Neel Natu
1a688aa53e Fix handling of BUS_PROBE_NOWILDCARD in 'device_probe_child()'.
Device probe value of BUS_PROBE_NOWILDCARD should be treated specially only
if the device has a fixed devclass. Otherwise it should be interpreted just
as if the driver doesn't want to claim the device.

Prior to this change a device that was not claimed explicitly by its driver
would remain "attached" to the driver that returned BUS_PROBE_NOWILDCARD.
This would bump up the reference on 'driver->refs' and its 'dev->ops' would
point to the 'driver->ops'. When the driver is subsequently unloaded the
'dev->ops->cls' is left pointing to freed memory.

This fixes an easily reproducible #GP fault caused by loading and unloading
vmm.ko multiple times.

Differential Revision:	https://reviews.freebsd.org/D2294
Reviewed by:	imp, jhb
Discussed with:	rstone
Reported by:	Leon Dang (ldang@nahannisys.com)
MFC after:	2 weeks
2015-04-15 16:22:05 +00:00
Edward Tomasz Napierala
1c73bcab8e Rewrite linprocfs_domtab() as a wrapper around kern_getfsstat(). This
adds missing jail and MAC checks.

Differential Revision:	https://reviews.freebsd.org/D2193
Reviewed by:	kib@
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2015-04-15 09:13:11 +00:00
Konstantin Belousov
316b384343 Implement support for a binary to request a specific stack size for the
initial thread.  It is read by the ELF image activator as the virtual
size of the PT_GNU_STACK program header entry, and can be specified by
the linker option -z stack-size in newer binutils.

The soft RLIMIT_STACK is auto-increased if possible, to satisfy the
binary's request.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-04-15 08:13:53 +00:00
George V. Neville-Neil
63bf240482 When a kernel has DEVICE_POLLING turned on but no drivers have
the capability, do not try to take the mutex at all.

Replaces misbegotten attempt from reverted commit 281276

Pointed out by: glebius
Sponsored by: Rubicon Communications (Netgate)
Differential Revision:	https://reviews.freebsd.org/D2262
2015-04-14 14:22:34 +00:00
Randall Stewart
b132edb56f Fix my stupid restoral of old code.. must be c_iflags now.
Thanks jhb for catching my stupidity...
MFC after:	3 days
2015-04-14 00:02:39 +00:00
Randall Stewart
07a2df5d83 Restore the two lines accidentally deleted that allow CALLOUT_DIRECT to be
specified in the flags.

Thanks Mark Johnston for noticing this ;-o

MFC after:	3 days
2015-04-13 23:06:13 +00:00
Will Andrews
6311d7aaf0 uiomove_object_page(): Avoid instantiating pages in sparse regions on reads.
Check whether the page being requested is either resident or on swap.  If
not, read from the zero_region instead of instantiating an unnecessary page.

This avoids consuming memory for sparse files on tmpfs, when they are read
by applications that do not use SEEK_HOLE/SEEK_DATA (which is most of them).

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	Spectra Logic
2015-04-11 18:51:41 +00:00
Mateusz Guzik
2574218578 Replace struct filedesc argument in getsock_cap with struct thread
This is a step towards removal of spurious arguments.
2015-04-11 16:00:33 +00:00
Mateusz Guzik
90f54cbfeb fd: remove filedesc argument from fdclose
Just accept a thread instead. This makes it consistent with fdalloc.

No functional changes.
2015-04-11 15:40:28 +00:00
Alexander Motin
cdd09fea28 Add vmem locking to r281026.
While races there are not fatal, they cause the result to be underestimated,
which causes unneeded ARC reclaims.

MFC after:	1 month
2015-04-05 14:17:26 +00:00
Konstantin Belousov
4cfc037c30 Restore proper error from oshmctl(2), used by COMPAT_43, when the
segment cannot be found.  Broken by r280323.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2015-04-04 23:56:38 +00:00
Jilles Tjoelker
78d75aba77 utimensat: Correct Capsicum required capability rights.
2015-04-04 21:47:54 +00:00
Konstantin Belousov
0122d251bf Remove useless initialization.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2015-04-04 08:44:20 +00:00
Alexander Motin
2e9ccb32a1 Make ZFS ARC track both KVA usage and fragmentation.
Even on Illumos, with its much larger KVA, ZFS ARC steps back if KVA usage
reaches certain threshold (3/4 on i386 or 16/17 otherwise).  FreeBSD has
even less KVA, but had no such limit on archs with direct map as amd64.
As result, on machines with a lot of RAM, during load with very small user-
space memory pressure, such as `zfs send`, it was possible to reach state,
when there is enough both physical RAM and KVA (I've seen up to 25-30%),
but no continuous KVA range to allocate even single 128KB I/O request.

Address this situation from two sides:
 - restore KVA usage limitations in a way the most close to Illumos;
 - introduce new requirement for KVA fragmentation, specifying that we
should have at least one sequential KVA range of zfs_max_recordsize bytes.

Experiments show that first limitation done alone is not sufficient.  On
machine with 64GB of RAM it is sometimes needed to drop up to half of ARC
size to get at least one 1MB KVA chunk.  Statically limiting ARC to half
of KVA/RAM is too strict, so second limitation makes it to work in cycles:
accumulate trash up to certain critical mass, do massive spring-cleaning,
and then start littering again. :)

MFC after:	1 month
2015-04-03 14:45:48 +00:00
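The usage thresholds quoted in the commit above (3/4 of KVA on i386, 16/17 otherwise) can be written as a small check. This is a rough sketch under stated assumptions: the function name and parameters are hypothetical, the fragmentation requirement is not modeled, and integer division makes the threshold slightly conservative when the total is not a multiple of 4 or 17.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true when KVA usage has reached the limit at which the ARC
 * would step back: 3/4 of the total on i386, 16/17 elsewhere.
 * Dividing before multiplying avoids overflow for large totals.
 */
static bool
sketch_kva_limit_reached(uint64_t kva_used, uint64_t kva_total, bool is_i386)
{
    uint64_t limit;

    if (is_i386)
        limit = kva_total / 4 * 3;
    else
        limit = kva_total / 17 * 16;
    return (kva_used >= limit);
}
```

The commit reports that this limit alone was not sufficient in experiments, which is why it also requires at least one sequential KVA range of zfs_max_recordsize bytes.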
Konstantin Belousov
2832cd544f Speed up symbol lookup for the amd64 kernel modules.
Amd64 uses relocatable object files as the modules format.  It is good
WRT not having unneeded overhead for PIC code, in particular, due to
absence of useless GOT and PLT.  But the cost is that the module
linking process cannot use hash to speed up the symbol lookup, and
that each reference to the symbol requires a relocation, instead of a
single-place relocation in the GOT.

Cache the successful symbol lookup results in the module symbol
table, using the newly allocated SHN_FBSD_CACHED value from
SHN_LOOS-HIOS range as an indicator.  The SHN_FBSD_CACHED together
with the non-existent definition of the found symbol are reverted
after successful relocations, which is done under kld_sx lock, so it
should not be visible to other consumers of the symbol table.

Submitted by:	Conrad Meyer
Differential Revision:  https://reviews.freebsd.org/D1718
MFC after:	3 weeks
2015-04-02 20:14:51 +00:00
Ryan Stone
f2c2231e0c Fix integer truncation bug in malloc(9)
A couple of internal functions used by malloc(9) and uma truncated
a size_t down to an int.  This could cause any number of issues
(e.g. indefinite sleeps, memory corruption) if any kernel
subsystem tried to allocate 2GB or more through malloc.  zfs would
attempt such an allocation when run on a system with 2TB or more
of RAM.

Note to self: When this is MFCed, sparc64 needs the same fix.

Differential revision:	https://reviews.freebsd.org/D2106
Reviewed by:	kib
Reported by:	Michael Fuckner <michael@fuckner.net>
Tested by:	Michael Fuckner <michael@fuckner.net>
MFC after:	2 weeks
2015-04-01 12:42:26 +00:00
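The truncation bug described in the commit above comes from narrowing a size_t to an int. A minimal sketch of the kind of guard that makes the problem visible follows; the helper name is hypothetical and this is not the fix that was committed, which the log describes only as removing the truncation.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true when size can be stored in an int without loss.
 * A 2GB request (INT_MAX + 1 bytes where int is 32 bits wide) fails
 * this test, which is the range where the narrowed value went wrong.
 */
static bool
sketch_size_fits_in_int(size_t size)
{
    return (size <= (size_t)INT_MAX);
}
```

Carrying the size as size_t end to end removes the need for such a guard, since no narrowing conversion takes place.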
Randall Stewart
403df7a672 Adopt jhb's suggested changes, updated comments and callout_migration() moving
to kern/kern_timeout.c

This does *not* address his -1 -> NOCPU comment.

Sponsored by:	Netflix Inc.
2015-03-31 00:18:00 +00:00
Gleb Smirnoff
f6d6b5e262 Catch up on r271387 and remove unused parameter from
VOP_GETPAGES_ASYNC().
2015-03-30 22:49:26 +00:00
Alexander Motin
43329ffcc8 Periodically wake up threads waiting for vmem(9) resources, so they could
ask for resource reclamation again.

This is kind of a dirty hack, but as a last resort it is better than being stuck
indefinitely because of KVA fragmentation, waiting until some random event
frees something sufficient.  OpenSolaris also has this hack in its vmem(9).

MFC after:	2 weeks
2015-03-30 13:30:53 +00:00
Alexander Motin
b308aaed27 Add four new DDB commands to display vmem(9) statistics.
In particular, the following DDB commands were added:
        show vmem <addr>
        show all vmem
        show vmemdump <addr>
        show all vmemdump

One possible use is to inspect KVA usage and fragmentation.
2015-03-29 10:02:29 +00:00
Konstantin Belousov
eeb697c8e9 Make debug.vmem_check a tunable. It is useful to set it early.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-03-28 23:30:51 +00:00
Eric van Gyzen
e858b027db Clean up some cosmetic nits in kern_umtx.c, found during recent work
in this area and by the Clang static analyzer.

Remove some dead assignments.

Fix a typo in a panic string.

Use umtx_pi_disown() instead of duplicate code.

Use an existing variable instead of curthread.

Approved by:	kib (mentor)
MFC after:	3 days
Sponsored by:	Dell Inc
2015-03-28 21:21:40 +00:00
Bjoern A. Zeeb
a04d412295 Try to unbreak !SMP kernels broken in r280785 by using the proper macros
to access cc_cpu.
2015-03-28 15:07:19 +00:00
Randall Stewart
15b1eb142c Change the callout to supply -1 to indicate we are not changing
CPU, also add protection against invalid CPUs as well as
split c_flags and c_iflags so that if a user plays with the active
flag (the one expected to be played with by callers in MPSAFE) without
a lock, it won't adversely affect the callout system by causing a corrupt
list. This also means that all callers need to use the macros and *not*
play with the flags directly (like netgraph used to).

Differential Revision: https://reviews.freebsd.org/D1894
Reviewed by: .. timed out but looked at by jhb, imp, adrian hselasky
             tested by hiren and netflix.
Sponsored by:	Netflix Inc.
2015-03-28 12:50:24 +00:00
Hans Petter Selasky
38668c6044 Implement a simple OID number garbage collector. Given the increasing
number of dynamically created and destroyed SYSCTLs during runtime it
is very likely that the current new OID number limit of 0x7fffffff can
be reached. Especially if dynamic OID creation and destruction results
from automatic tests. Additional changes:

- Optimize the typical use case by decrementing the next automatic OID
sequence number instead of incrementing it. This saves searching time
when inserting new OIDs into a fresh parent OID node.

- Add simple check for duplicate non-automatic OID numbers.

MFC after:  1 week
2015-03-25 08:55:34 +00:00
Hans Petter Selasky
502702c644 Make sure tunable sysctls are only fetched once. The existing code can
re-register sysctls when destroying sysctl contexts or when moving
sysctls from one tree to another.
2015-03-24 17:42:53 +00:00
Gleb Smirnoff
a2d4a7e456 Do not include if_var.h and in6_var.h into kern_jail.c. It is now possible
after r280444.

Sponsored by:	Nginx, Inc.
2015-03-24 16:46:40 +00:00
Hans Petter Selasky
ab91c9a743 Correct string pointer offset for error printout.
2015-03-24 16:37:19 +00:00
Rui Paulo
0da9e11b7e Disable coredump_devctl because it could lead to leaking paths to
jails.
2015-03-24 02:17:17 +00:00
Mateusz Guzik
ea926658ff filedesc: microoptimize fget_unlocked by getting rid of fd < 0 branch
Casting fd to an unsigned type simplifies the fd range comparison to merely
checking if the result is bigger than the table.
2015-03-24 00:10:11 +00:00
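The unsigned-cast trick in the commit above can be shown outside the kernel. In this sketch the function names and the `nfiles` parameter are hypothetical, and `nfiles` is assumed to be non-negative. A negative fd converts to a very large unsigned value, so a single comparison covers both bounds.

```c
#include <stdbool.h>

/* Two-branch form: reject negative descriptors, then check the upper bound. */
static bool
sketch_fd_in_range_signed(int fd, int nfiles)
{
    return (fd >= 0 && fd < nfiles);
}

/*
 * One-branch form: a negative fd wraps to a value above INT_MAX when
 * converted to unsigned, so it can never be below a non-negative nfiles.
 */
static bool
sketch_fd_in_range_unsigned(int fd, int nfiles)
{
    return ((unsigned int)fd < (unsigned int)nfiles);
}
```

Both forms give the same answer for every fd as long as the table size is non-negative; the second simply drops the separate fd < 0 branch.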
Ian Lepore
296f235de0 The sysctls that return process argv and envv return binary data, so clear
the SBUF_INCLUDENUL flag.

Pointed out by:	    tijl@
2015-03-22 21:18:44 +00:00
Hans Petter Selasky
2793ea13aa Fix for out of order device destruction notifications when using the
delist_dev() function. In addition to this change:
- add a proper description of this function
- add a proper witness assert inside this function
- switch a nearby line to use the "cdp" pointer instead of cdev2priv()

MFC after:	3 days
2015-03-22 13:11:56 +00:00
Mateusz Guzik
f97af9706b proc: use MTX_NEW flag in proc_init
This allows us to get rid of bzero which was added specifically to make
mtx_init on p_mtx reliable.

This also fixes a potential problem where mtx_init on other mutexes
could trip over uninitialized memory and fire an assertion.

Reviewed by:	kib
2015-03-21 20:25:34 +00:00
Mateusz Guzik
ffb34484ee cred: add proc_set_cred_init helper
proc_set_cred_init can be used to set first credentials of a new
process.

Update proc_set_cred assertions so that it only expects already used
processes.

This fixes panics where p_ucred of a new process happens to be non-NULL.

Reviewed by:	kib
2015-03-21 20:24:54 +00:00
Mateusz Guzik
12cec311e6 fork: assign refed credentials earlier
Prior to this change the kernel would take p1's credentials and assign
them temporarily to p2. But p1 could change credentials at that time
and in effect give us a use-after-free.

No objections from: kib
2015-03-21 20:24:03 +00:00
Alan Cox
3d653db063 Introduce vm_object_color() and use it in mmap(2) to set the color of
named objects to zero before the virtual address is selected.  Previously,
the color setting was delayed until after the virtual address was
selected.  In rtld, this delay effectively prevented the mapping of a
shared library's code section using superpages.  Now, for example, we see
the first 1 MB of libc's code on armv6 mapped by a superpage after we've
gotten through the initial cold misses that bring the first 1 MB of code
into memory.  (With the page clustering that we perform on read faults,
this happens quickly.)

Differential Revision:	https://reviews.freebsd.org/D2013
Reviewed by:	jhb, kib
Tested by:	Svatopluk Kraus (armv6)
MFC after:	6 weeks
2015-03-21 17:56:55 +00:00