Commit Graph

14251 Commits

Author SHA1 Message Date
Mateusz Guzik
a99500a912 Extend struct ucred with group table.
This saves one malloc + free with typical cases and better utilizes
memory.

Submitted by:	Tiwei Bie <btw mail.ustc.edu.cn> (slightly modified)
X-Additional:	JuniorJobs project
2014-11-05 02:08:37 +00:00
Alexander V. Chernikov
9f25cbe45e Remove old hack abusing domattach from NFS code.
According to IANA RPC uaddr registry, there are no AFs
except IPv4 and IPv6, so it's not worth being too abstract here.

Remove ne_rtable[AF_MAX+1] and use explicit per-AF radix tries.
Use own initialization without relying on domattach code.

While I admit that this was one of the rare places in kernel
networking code which really was capable of doing multi-AF
without any AF-dependent code, it is not possible anymore to
rely on dom* code.

While here, change the terrifying "Invalid radix node head, rn:" message
to a different one, "netcred already exists for given addr/mask", which is
still hard to understand but less terrifying. Since we know that rn_addaddr()
returns NULL if the same record already exists, we should provide a more
friendly error.

MFC after:	1 month
2014-11-05 00:58:01 +00:00
Dag-Erling Smørgrav
bccb6d5aa1 [SA-14:25] Fix kernel stack disclosure in setlogin(2) / getlogin(2).
[SA-14:26] Fix remote command execution in ftp(1).

Approved by:	so (des)
2014-11-04 23:29:29 +00:00
John Baldwin
2cba8dd301 Add a new thread state "spinning" to schedgraph and add tracepoints at the
start and stop of spinning waits in lock primitives.
2014-11-04 16:35:56 +00:00
Hans Petter Selasky
0ecd606b24 Simplify logic a bit. Ensure data buffer is properly aligned,
especially for platforms where unaligned access is not allowed. Make
it possible to override the small buffer size.

A simple continuous read string test using libusb showed a reduction
in CPU usage from roughly 10% to less than 1% using a dual-core GHz
CPU, when the malloc() operation was skipped for small buffers.

MFC after:	2 weeks
2014-11-04 11:29:49 +00:00
Jean-Sébastien Pédron
2d6f6d6373 Enable vt(4) by default
vt(4) is a new console driver which brings features such as:
    o  Support for Unicode and double-width characters
    o  Integration with the KMS kernel video drivers
    o  Support for UEFI

You may need to update your console settings in /etc/rc.conf, most
probably the keymap. During boot, /etc/rc.d/syscons will indicate what
you need to do.

vt(4) still has issues and lacks some features compared to syscons(4).
See the wiki for up-to-date information:
    https://wiki.freebsd.org/Newcons

If you want to keep using syscons(4), you can do so by adding the
following line to /boot/loader.conf:
    kern.vty=sc

Differential Revision:	https://reviews.freebsd.org/D1005
Discussed with:	emaste@, nwhitehorn@, ray@
Relnotes:	yes
2014-11-04 10:18:03 +00:00
Konstantin Belousov
74d5b4af82 Clean up confusing comment. Move it next to the code it
describes.  Explain where the mentioned trampoline is located
(usermode), and the fact that an attempt to exit the last thread is denied
in the kernel (by delegating the work to usermode).

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-11-03 11:29:08 +00:00
Konstantin Belousov
ab57474c83 When the other end of the pipe is closed during the write, but some bytes
were written, return short write instead of EPIPE.

Update comment.

Discussed with:	bde (long time ago)
MFC after:	2 weeks
2014-11-03 10:01:56 +00:00
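The short-write rule described in ab57474c83 can be pictured with a toy userspace model. This is only a sketch: `struct toy_pipe` and `toy_pipe_write()` are hypothetical names, and the "reader closes after `space` bytes" behaviour is an assumption made to keep the example small, not the kernel's pipe code.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical pipe model: the reader goes away after accepting 'space' bytes. */
struct toy_pipe {
	size_t	space;
};

/*
 * Returns the number of bytes written, or -1 with *error set.
 * If the reader disappears after some bytes were accepted, report the
 * partial count (a short write) instead of failing with EPIPE.
 */
long
toy_pipe_write(struct toy_pipe *p, size_t len, int *error)
{
	size_t done = len < p->space ? len : p->space;

	p->space -= done;
	*error = 0;
	if (done == len || done > 0)
		return ((long)done);	/* full or short write */
	*error = EPIPE;			/* nothing written at all */
	return (-1);
}
```

In this model a caller that asks for 10 bytes when only 3 can be accepted gets 3 back, and only the following write reports EPIPE.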
Mateusz Guzik
5cbf44bf89 Provide an on-stack temporary buffer for small ioctl requests. 2014-11-03 07:46:51 +00:00
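The idea behind 5cbf44bf89 (use a stack buffer for small requests, fall back to the allocator for large ones) can be sketched in userspace as follows. `SMALL_REQ_MAX` and `process_request()` are hypothetical names, and the threshold is an arbitrary assumption rather than the value the kernel uses.

```c
#include <stdlib.h>
#include <string.h>

#define SMALL_REQ_MAX	128	/* hypothetical threshold, not the kernel's */

/*
 * Copy a request into scratch space and sum its bytes.
 * Returns 0 if the on-stack buffer was used, 1 if malloc() was needed,
 * and -1 if the allocation failed.
 */
int
process_request(const void *req, size_t len, unsigned *sum_out)
{
	unsigned char stackbuf[SMALL_REQ_MAX];
	unsigned char *buf = stackbuf;
	unsigned sum = 0;
	int used_heap = 0;

	if (len > sizeof(stackbuf)) {
		buf = malloc(len);
		if (buf == NULL)
			return (-1);
		used_heap = 1;
	}
	memcpy(buf, req, len);
	for (size_t i = 0; i < len; i++)
		sum += buf[i];
	if (used_heap)
		free(buf);
	*sum_out = sum;
	return (used_heap);
}
```

The common small case then costs no allocation at all, which is the saving the commit message points at.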
Mateusz Guzik
324a7026f1 filedesc: plug sys/kdb.h include which crept in with r274007 2014-11-03 06:24:43 +00:00
Mateusz Guzik
1d29258ac2 filedesc: plug unnecessary fdp NULL checks in fdescfree and fdcopy
Anything reaching these functions has fd table.
2014-11-03 05:12:17 +00:00
Mateusz Guzik
32417098f0 filedesc: create a dedicated zone for struct filedesc0
Currently sizeof(struct filedesc0) is 1096 bytes, which means allocations from
malloc use 2048 bytes.

There is no easy way to shrink the structure to <= 1024 and it is likely to grow in
the future.
2014-11-03 04:16:04 +00:00
Konstantin Belousov
cc24666735 Followup to r273966. Fix the build with ADAPTIVE_LOCKMGRS kernel option.
Note that the option is currently not used in any in-tree kernel
configs, including LINTs.

Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-11-02 19:51:33 +00:00
Mateusz Guzik
3dca54ab98 filedesc: move freeing old tables to fdescfree
They cannot be accessed by anyone and hold count only protects the structure
from being freed.
2014-11-02 14:12:03 +00:00
Mateusz Guzik
3dc85312b2 filedesc: factor out some code out of fdescfree
Previously it had a huge self-contained chunk dedicated to dealing with shared
tables.

No functional changes.
2014-11-02 13:43:04 +00:00
Konstantin Belousov
72ba3c0822 Fix two issues with lockmgr(9) LK_CAN_SHARE() test, which determines
whether the shared request for already shared-locked lock could be
granted.  Both problems result in the exclusive locker starvation.

The concurrent exclusive request is indicated by either
LK_EXCLUSIVE_WAITERS or LK_EXCLUSIVE_SPINNERS flags.  The reverse
condition, i.e. no exclusive waiters, must check that both flags are
cleared.

Add a flag LK_NODDLKTREAT for shared lock request to indicate that
current thread guarantees that it does not own the lock in shared
mode.  This turns back the exclusive lock starvation avoidance code;
see man page update for detailed description.

Use LK_NODDLKTREAT when doing lookup(9).

Reported and tested by:	pho
No objections from:	attilio
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-11-02 13:10:31 +00:00
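The flag test described in 72ba3c0822 ("no exclusive waiters" means both flags are clear) can be illustrated with a small sketch. The bit values and function names below are hypothetical, and the "buggy" form is only one plausible way to get the test wrong; it is not quoted from the tree.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values; the real ones are defined by lockmgr(9) headers. */
#define TOY_EXCLUSIVE_WAITERS	0x04u
#define TOY_EXCLUSIVE_SPINNERS	0x08u
#define TOY_EXCL_MASK	(TOY_EXCLUSIVE_WAITERS | TOY_EXCLUSIVE_SPINNERS)

/* Plausible incorrect test: true as soon as either flag is clear. */
bool
toy_no_excl_buggy(uintptr_t x)
{
	return ((x & TOY_EXCL_MASK) != TOY_EXCL_MASK);
}

/* Correct test: true only when both flags are clear. */
bool
toy_no_excl(uintptr_t x)
{
	return ((x & TOY_EXCL_MASK) == 0);
}
```

With only the waiters flag set, the incorrect form still reports "no exclusive request", which is exactly the situation that lets shared lockers starve an exclusive one.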
Mateusz Guzik
080fdefc28 filedesc: tidy up fdcheckstd
No functional changes.
2014-11-02 02:32:33 +00:00
Mateusz Guzik
d3f3e12a4f filedesc: lock filedesc lock in fdcloseexec only when needed 2014-11-02 01:13:11 +00:00
Mateusz Guzik
cdcf242896 Fix up module unload for syscall_module_handler consumers.
After r273707 it was registering syscalls as static.

This fixes hwpmc module unload.

Reported by: markj
2014-11-01 22:36:40 +00:00
Jean-Sébastien Pédron
da49f6bcc3 vt(4): Adjust the cursor position after changing the window size
A new terminal_set_cursor() is added: it wraps the existing
teken_set_cursor() function.

In vtbuf_grow(), the cursor position is adjusted at the end of the
function. In vt_change_font(), we call terminal_set_cursor() just after
terminal_set_winsize_blank(), while the terminal is mute.

This fixes a bug where, after loading a kernel video driver which
increases the terminal window size, the cursor remains at its old
position, in other words, in the middle of the display content.

PR:		194421
MFC after:	1 week
2014-11-01 17:05:15 +00:00
Konstantin Belousov
2361c6d135 Add type qualifier volatile to the base (userspace) address argument
of fuword(9) and suword(9).  This makes the functions type-compatible
with volatile objects and does not require devolatile force, e.g. in
kern_umtx.c.

Requested by:	bde
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	3 weeks
2014-10-31 17:43:21 +00:00
Mateusz Guzik
2534d8eeb6 filedesc: drop retval argument from do_dup
It was almost always td_retval anyway.

For the one case where it is not, preserve the old value across the call.
2014-10-31 10:35:01 +00:00
Mateusz Guzik
8a5177cca3 filedesc: fix missed comments about fdsetugidsafety
While here just note that both fdsetugidsafety and fdcheckstd take sleepable
locks.
2014-10-31 09:56:00 +00:00
Mateusz Guzik
f652d856ab filedesc: make fdinit return with source filedesc locked and new one sized
appropriately

Assert FILEDESC_XLOCK_ASSERT only for already used tables in fdgrowtable.
We don't have to call it with the lock held if we are just creating new
filedesc.

As a side note, strictly speaking processes can have fdtables with
fd_lastfile = -1, but then they cannot enter fdgrowtable. Very first file
descriptor they get will be 0 and the only syscall allowing to choose fd number
requires an active file descriptor. Should this ever change, we can add an 'init'
(or similar) parameter to fdgrowtable.
2014-10-31 09:25:28 +00:00
Mateusz Guzik
ffeb890592 filedesc: iterate over fd table only once in fdcopy
While here add 'fdused_init' which does not perform unnecessary work.

Drop FILEDESC_LOCK_ASSERT from fdisused and rely on callers to hold
it when appropriate. This function is only used with INVARIANTS.

No functional changes intended.
2014-10-31 09:19:46 +00:00
Mateusz Guzik
1a0c80a3df filedesc: tidy up fdfree
Implement fdefree_last variant and get rid of 'last' parameter.

No functional changes.
2014-10-31 09:15:59 +00:00
Mateusz Guzik
b97a758ffc filedesc: tidy up fdcopy a little bit
Test for file availability by fde_file != NULL instead of fdisused, this is
consistent with similar checks later.

Drop badfileops check. badfileops don't have DFLAG_PASSABLE set, so it was never
reached in practice.

fdisused is now only used in some KASSERTs, so ifdef it under INVARIANTS.

No functional changes.
2014-10-31 05:41:27 +00:00
Mark Murray
10cb24248a This is the much-discussed major upgrade to the random(4) device, known to you all as /dev/random.
This code has had an extensive rewrite and a good series of reviews, both by the author and other parties. This means a lot of code has been simplified. Pluggable structures for high-rate entropy generators are available, and it is most definitely not the case that /dev/random can be driven by only a hardware source any more. This has been designed out of the device. Hardware sources are stirred into the CSPRNG (Yarrow, Fortuna) like any other entropy source. Pluggable modules may be written by third parties for additional sources.

The harvesting structures and consequently the locking have been simplified. Entropy harvesting is done in a more general way (the documentation for this will follow). There is some GREAT entropy to be had in the UMA allocator, but it is disabled for now as messing with that is likely to annoy many people.

The venerable (but effective) Yarrow algorithm, which is no longer supported by its authors, now has an alternative, Fortuna. For now, Yarrow is retained as the default algorithm, but this may be changed using a kernel option. It is intended to make Fortuna the default algorithm for 11.0. Interested parties are encouraged to read ISBN 978-0-470-47424-2 "Cryptography Engineering" by Ferguson, Schneier and Kohno for Fortuna's gory details. Heck, read it anyway.

Many thanks to Arthur Mesh who did early grunt work, and who got caught in the crossfire rather more than he deserved to.

My thanks also to folks who helped me thresh this out on whiteboards and in the odd "Hallway track", or otherwise.

My Nomex pants are on. Let the feedback commence!

Reviewed by:	trasz,des(partial),imp(partial?),rwatson(partial?)
Approved by:	so(des)
2014-10-30 21:21:53 +00:00
Mateusz Guzik
f55cf4b0d1 filedesc: make sure to force table reload in fget_unlocked when count == 0
This is a fixup to r273843.
2014-10-30 07:21:38 +00:00
Mateusz Guzik
29c85772bb filedesc: microoptimize fget_unlocked by retrying obtaining reference count
without restarting whole lookup

Restart is only needed when fp was closed by current process, which is a much
rarer event than ref/deref by some other thread.
2014-10-30 05:21:12 +00:00
Mateusz Guzik
aa77d52800 filedesc: get rid of atomic_load_acq_int from fget_unlocked
A read barrier was necessary because fd table pointer and table size were
updated separately, opening a window where fget_unlocked could read new size
and old pointer.

This patch puts both these fields into one dedicated structure, pointer to which
is later atomically updated. As such, fget_unlocked only needs a data dependency
barrier which is a noop on all supported architectures.

Reviewed by:	kib (previous version)
MFC after:	2 weeks
2014-10-30 05:10:33 +00:00
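The layout change in aa77d52800 (size and slot array published through a single pointer) can be sketched in userspace with C11 atomics. All `toy_` names are hypothetical, and `memory_order_consume` is used here only as the closest C11 spelling of a data dependency barrier; this is an illustration under those assumptions, not the kernel code.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_fdtable {
	int	nfiles;		/* number of slots in files[] */
	void	*files[];	/* NULL means the slot is unused */
};

struct toy_filedesc {
	_Atomic(struct toy_fdtable *) table;	/* size and slots travel together */
};

/* Writer: build the larger table completely, then publish it with one store. */
int
toy_grow(struct toy_filedesc *fdp, int nfiles)
{
	struct toy_fdtable *ot, *nt;

	ot = atomic_load_explicit(&fdp->table, memory_order_relaxed);
	nt = calloc(1, sizeof(*nt) + (size_t)nfiles * sizeof(nt->files[0]));
	if (nt == NULL)
		return (-1);
	nt->nfiles = nfiles;
	for (int i = 0; ot != NULL && i < ot->nfiles && i < nfiles; i++)
		nt->files[i] = ot->files[i];
	atomic_store_explicit(&fdp->table, nt, memory_order_release);
	/* The old table is deliberately kept: lockless readers may still hold it. */
	return (0);
}

/* Reader: a single load yields a size and an array that always match. */
void *
toy_fget(struct toy_filedesc *fdp, int fd)
{
	struct toy_fdtable *t;

	t = atomic_load_explicit(&fdp->table, memory_order_consume);
	if (t == NULL || fd < 0 || fd >= t->nfiles)
		return (NULL);
	return (t->files[fd]);
}
```

Because the reader never combines a size from one table with a pointer from another, the window described in the commit message cannot occur in this arrangement.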
John Baldwin
01e1933dcc Rework virtual machine hypervisor detection.
- Move the existing code to x86/x86/identcpu.c since it is x86-specific.
- If the CPUID2_HV flag is set, assume a hypervisor is present and query
  the 0x40000000 leaf to determine the hypervisor vendor ID.  Export the
  vendor ID and the highest supported hypervisor CPUID leaf via
  hv_vendor[] and hv_high variables, respectively.  The hv_vendor[]
  array is also exported via the hw.hv_vendor sysctl.
- Merge the VMWare detection code from tsc.c into the new probe in
  identcpu.c.  Add a VM_GUEST_VMWARE to identify vmware and use that in
  the TSC code to identify VMWare.

Differential Revision:	https://reviews.freebsd.org/D1010
Reviewed by:	delphij, jkim, neel
2014-10-28 19:17:44 +00:00
Konstantin Belousov
f7e91c288a Convert kern_umtx.c to use fueword() and casueword().
Also fix some mishandling of suword(9) errors as errno, which resulted
in spurious ERESTART.

Sponsored by:	The FreeBSD Foundation
Tested by:	pho
MFC after:	3 weeks
2014-10-28 15:30:33 +00:00
Konstantin Belousov
0a2c94b86e Replace some calls to fuword() by fueword() with proper error checking.
Sponsored by:	The FreeBSD Foundation
Tested by:	pho
MFC after:	3 weeks
2014-10-28 15:28:20 +00:00
Konstantin Belousov
4f3dc90023 Add fueword(9) and casueword(9) functions. They are like fuword(9)
and casuword(9), but do not mix value read and indication of fault.

I know (or remember) enough assembly to handle x86 and powerpc.  For
arm, mips and sparc64, implement fueword() and casueword() as wrappers
around fuword() and casuword(), which means that the functions cannot
distinguish between -1 and fault.

On architectures where fueword() and casueword() are native, implement
fuword() and casuword() using fueword() and casueword(), to reduce
assembly code duplication.

Sponsored by:	The FreeBSD Foundation
Tested by:	pho
MFC after:	2 weeks (ia64 needs treating)
2014-10-28 15:22:13 +00:00
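The ambiguity that 4f3dc90023 removes can be shown with a toy pair of functions. These are hypothetical userspace stand-ins (a NULL pointer plays the role of a faulting user address); they only mirror the general shape of the interfaces and are not the kernel implementations.

```c
#include <stddef.h>

/* Old style: the value read and the fault indication share one return value. */
long
toy_fuword(const long *uaddr)
{
	if (uaddr == NULL)
		return (-1);	/* "fault", indistinguishable from a stored -1 */
	return (*uaddr);
}

/* New style: the return value reports the fault, the data goes through *val. */
int
toy_fueword(const long *uaddr, long *val)
{
	if (uaddr == NULL)
		return (-1);
	*val = *uaddr;
	return (0);
}
```

A stored value of -1 and a fault look the same through `toy_fuword()`, while `toy_fueword()` keeps the two cases apart.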
Hans Petter Selasky
0e1152fcc2 The SYSCTL data pointers can come from userspace and must not be
directly accessed. Although this will work on some platforms, it can
throw an exception if the pointer is invalid and then panic the kernel.

Add a missing SYSCTL_IN() of "SCTP_BASE_STATS" structure.

MFC after:	3 days
Sponsored by:	Mellanox Technologies
2014-10-28 12:00:39 +00:00
Mateusz Guzik
a7963b9758 Simplify sys_getloginclass.
Just use current thread credentials as they have the same accuracy as the
ones obtained from proc.
2014-10-28 04:59:33 +00:00
Mateusz Guzik
b720dc9749 Change loginclass mutex to an rwlock.
While here reduce nesting in loginclass_free.

Submitted by:	Tiwei Bie <btw mail.ustc.edu.cn>
X-Additional:	JuniorJobs project
MFC after:	2 weeks
2014-10-28 04:33:57 +00:00
Mateusz Guzik
8d866b10fc Tidy up functions related to uidinfo management.
- reference found uidinfo in uilookup
- reduce nesting by handling shorter cases first
2014-10-27 20:20:05 +00:00
Mateusz Guzik
8101958c4f De-k&r-ify function definitions in kern/kern_resource.c
No functional changes.
2014-10-27 20:18:30 +00:00
Mateusz Guzik
e015b1ab0a Avoid dynamic syscall overhead for statically compiled modules.
The kernel tracks syscall users so that modules can safely unregister them.

But if the module is not unloadable or was compiled into the kernel, there is
no need to do this.

Achieve this by adding SY_THR_STATIC_KLD macro which expands to SY_THR_STATIC
during kernel build and 0 otherwise.

Reviewed by:	kib (previous version)
MFC after:	2 weeks
2014-10-26 19:42:44 +00:00
Mateusz Guzik
b90638866e Fix up an assertion in kern_setgroups, it should compare with ngroups_max + 1
Bug introduced in r273685.

Noted by: Tiwei Bie <btw mail.ustc.edu.cn>
2014-10-26 14:25:42 +00:00
Mateusz Guzik
7e9a456a53 Tidy up sys_setgroups and kern_setgroups.
- 'groups' initialization to NULL is always overwritten before use, so plug it
- get rid of 'goto out'
- kern_setgroups's callers already validate ngrp, so only assert the condition
- ngrp is a u_int, so 'ngrp < 1' is more readable as 'ngrp == 0'

No functional changes.
2014-10-26 06:04:09 +00:00
Mateusz Guzik
92b064f43d Use a temporary buffer in sys_setgroups for requests with <= XU_NGROUPS groups.
Submitted by:	Tiwei Bie <btw mail.ustc.edu.cn>
X-Additional: JuniorJobs project
MFC after:	2 weeks
2014-10-26 05:39:42 +00:00
Mateusz Guzik
f84f8f9468 Now that sysctl_root is only called with sysctl lock in shared mode, update
its assertion to require that.

Update comment missed in r273400: sysctl_xlock/unlock -> sysctl_xlock/xunlock

Noted by: jhb
2014-10-26 01:47:55 +00:00
John Baldwin
1bc9ea1caa Use correct type in __DEVOLATILE(). 2014-10-25 20:42:47 +00:00
Alexander Motin
ccf8a5688a Revert the somewhat hackish geom_disk optimization committed as part of
r256880, along with the follow-up r273143 commit, a quite innocent-looking
change that was supposed to work around the issue that the optimization
introduced.

While there is no clear understanding of why, r273143 is accused of causing
data corruption in some environments with high I/O load.  I personally don't
see any problem in that commit, and possibly it is just a trigger for some
other bug somewhere, but better safe than sorry for now.

Requested by:	scottl@
MFC after:	3 days
2014-10-25 15:16:19 +00:00
Mateusz Guzik
675c3507d4 rlimit: plug duplicate assertion
counter sanity is already checked by refcount_release.
2014-10-25 05:56:21 +00:00
Xin LI
6362e06b42 Fix build. 2014-10-25 00:16:36 +00:00
John Baldwin
53e1ffbbce The current POSIX semaphore implementation stores the _has_waiters flag
in a separate word from the _count.  This does not permit both items to
be updated atomically in a portable manner.  As a result, sem_post()
must always perform a system call to safely clear _has_waiters.

This change removes the _has_waiters field and instead uses the high bit
of _count as the _has_waiters flag.  A new umtx object type (_usem2) and
two new umtx operations are added (SEM_WAIT2 and SEM_WAKE2) to implement
these semantics.  The older operations are still supported under the
COMPAT_FREEBSD9/10 options.  The POSIX semaphore API in libc has
been updated to use the new implementation.  Note that the new
implementation is not compatible with the previous implementation.
However, this only affects static binaries (which cannot be helped by
symbol versioning).  Binaries using a dynamic libc will continue to work
fine.  SEM_MAGIC has been bumped so that mismatched binaries will error
rather than corrupting a shared semaphore.  In addition, a padding field
has been added to sem_t so that it remains the same size.

Differential Revision:	https://reviews.freebsd.org/D961
Reported by:	adrian
Reviewed by:	kib, jilles (earlier version)
Sponsored by:	Norse
2014-10-24 20:02:44 +00:00
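The single-word encoding from 53e1ffbbce can be sketched with C11 atomics. The `toy_` names and the exact bit layout are assumptions for illustration, there is no sleeping or waking here, and a real implementation would also guard against the count overflowing into the flag bit.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_SEM_HAS_WAITERS	0x80000000u	/* high bit of the count word */
#define TOY_SEM_COUNT(c)	((c) & 0x7fffffffu)

struct toy_sem {
	_Atomic uint32_t count;
};

/* Post: one atomic add; returns true only if a wakeup call would be needed. */
bool
toy_sem_post(struct toy_sem *s)
{
	uint32_t old;

	old = atomic_fetch_add_explicit(&s->count, 1, memory_order_release);
	return ((old & TOY_SEM_HAS_WAITERS) != 0);
}

/* Try to take one unit; the compare-and-swap leaves the waiters bit intact. */
bool
toy_sem_trywait(struct toy_sem *s)
{
	uint32_t old = atomic_load_explicit(&s->count, memory_order_relaxed);

	while (TOY_SEM_COUNT(old) > 0) {
		if (atomic_compare_exchange_weak_explicit(&s->count, &old,
		    old - 1, memory_order_acquire, memory_order_relaxed))
			return (true);
	}
	return (false);
}

/* A thread that is about to sleep sets the flag in the same word. */
void
toy_sem_mark_waiter(struct toy_sem *s)
{
	atomic_fetch_or_explicit(&s->count, TOY_SEM_HAS_WAITERS,
	    memory_order_relaxed);
}
```

Since the flag and the count change in one atomic operation, a post with no sleepers can skip the system call entirely in this model.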
Dag-Erling Smørgrav
b0d69dfad9 In all cases except CTLTYPE_STRING, penv is NULL here, so passing it
indiscriminately to printf() and freeenv() is incorrect.  Add a NULL
check before freeenv(); as for printf(), we could use req.newptr
instead, but we'd have to select the correct format string based on
the type, and that's too much work for an error message, so just
remove it.
2014-10-23 22:42:56 +00:00
Mateusz Guzik
ffc5ce7b75 In selfdfree re-evaluate sf_si after taking the lock.
Otherwise we can race with doselwakeup.

This is a fixup to r273549

Reviewed by:	jhb
Reported by:	everyone and their dog
2014-10-23 19:06:08 +00:00
Xin LI
2735a91d93 Test if 'env' is NULL before doing memset() and strlen(),
the caller may pass NULL to freeenv().
2014-10-23 18:23:50 +00:00
Mateusz Guzik
73f2e5f759 Avoid taking the lock in selfdfree when not needed. 2014-10-23 15:35:47 +00:00
Colin Percival
b9f6af45a5 Avoid leaking data from the kernel environment: When we convert the
initial static environment to a dynamic one, zero the static environment
buffer, and zero individual values when kern_unsetenv and freeenv are
called.

Tested by:	kmoore (VM memory dump + grep)
Tested by:	cperciva (kernel panic dump + grep)
2014-10-22 23:35:32 +00:00
Mateusz Guzik
58a3dcb229 filedesc: assert that table size is at least 3 in fdsetugidsafety
Requested by: kib
2014-10-22 08:56:57 +00:00
Mateusz Guzik
4bc68ed7bc Plug unnecessary PRS_NEW check in kern_procctl.
pfind does not return processes in such state.
2014-10-22 04:16:09 +00:00
Mateusz Guzik
a39d200bb9 Reduce nesting in vn_access.
No functional changes.
2014-10-22 01:53:00 +00:00
Mateusz Guzik
eac9678110 Avoid crdup when possible in kern_accessat.
While here tidy up a little.
2014-10-22 01:09:07 +00:00
Mateusz Guzik
11888da8d9 filedesc: cleanup setugidsafety a little
Rename it to fdsetugidsafety for consistency with other functions.

There is no need to take filedesc lock if not closing any files.

The loop has to verify each file and we are guaranteed fdtable has space
for at least 20 fds. As such there is no need to check fd_lastfile.

While here tidy up is_unsafe.
2014-10-22 00:23:43 +00:00
Mateusz Guzik
07b384cbe2 Eliminate unnecessary memory allocation in sys_getgroups and its ibcs2 counterpart. 2014-10-21 23:08:46 +00:00
Mateusz Guzik
2afec8edfc Take the lock shared in linker_search_symbol_name.
This helps sysctl kern.proc.stack.
2014-10-21 21:29:20 +00:00
Mateusz Guzik
fca7732078 Mark some more sysctl stuff shared-locked and MPSAFE. 2014-10-21 21:08:45 +00:00
Mateusz Guzik
b564c5d6aa Make sysctl name2oid shared-locked as well.
This is a follow-up to r273401.
2014-10-21 19:45:08 +00:00
Mateusz Guzik
efe0abddf5 Implement shared locking for sysctl. 2014-10-21 19:05:44 +00:00
Mateusz Guzik
580a011762 Rename sysctl_lock and _unlock to sysctl_xlock and _xunlock. 2014-10-21 19:02:26 +00:00
Hans Petter Selasky
f0188618f2 Fix multiple incorrect SYSCTL arguments in the kernel:
- Wrong integer type was specified.

- Wrong or missing "access" specifier. The "access" specifier
sometimes included the SYSCTL type, which it should not, except for
procedural SYSCTL nodes.

- Logical OR where binary OR was expected.

- Properly assert the "access" argument passed to all SYSCTL macros,
using the CTASSERT macro. This applies to both static- and dynamically
created SYSCTLs.

- Properly assert the data type for both static and dynamic
SYSCTLs. In the case of static SYSCTLs we only assert that the data
pointed to by the SYSCTL data pointer has the correct size, since
there is no easy way to assert types in the C language outside a
C-function.

- Rewrote some code which doesn't pass a constant "access" specifier
when creating dynamic SYSCTL nodes, which is now a requirement.

- Updated "EXAMPLES" section in SYSCTL manual page.

MFC after:	3 days
Sponsored by:	Mellanox Technologies
2014-10-21 07:31:21 +00:00
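The "logical OR where binary OR was expected" item in f0188618f2 is easy to demonstrate. The flag values and function names below are hypothetical and chosen only for illustration.

```c
/* Hypothetical access flag values, for illustration only. */
#define TOY_CTLFLAG_RD	0x80000000u
#define TOY_CTLFLAG_WR	0x40000000u

/* Logical OR collapses any non-zero operands to 1, losing both flags. */
unsigned
toy_combine_logical(unsigned a, unsigned b)
{
	return (a || b);
}

/* Bitwise OR keeps every flag bit from both operands. */
unsigned
toy_combine_bitwise(unsigned a, unsigned b)
{
	return (a | b);
}
```

Passing the two flags through the logical form yields 1, which matches neither flag; the bitwise form yields the intended combined mask.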
Mateusz Guzik
5c37b305fd Plug unnecessary binvp NULL initialization and test.
Reported by: Coverity
CID: 1018889
2014-10-20 22:52:15 +00:00
Mateusz Guzik
966ee9f25f filedesc: plug 2 write-only variables
Reported by: Coverity
CID: 1245745, 1245746
2014-10-20 21:57:24 +00:00
Mark Johnston
4fd6ca7275 Fix a typo from r189544, which replaced unp_global_rwlock with unp_list_lock
and unp_link_rwlock.

MFC after:	3 days
2014-10-20 20:21:40 +00:00
Mateusz Guzik
4fce16e4c9 Provide vfs suspension support only for filesystems which need it, take
two.

nullfs and unionfs need to request suspension if underlying filesystem(s)
use it. Utilize mnt_kern_flag for this purpose.

This is a fixup for 273271.

No strong objections from: kib
Pointy hat to: mjg
MFC after:	2 weeks
2014-10-20 18:00:50 +00:00
Marcel Moolenaar
0067051fe7 Fully support constructors for the purpose of code coverage analysis.
This involves:
1.  Have the loader pass the start and size of the .ctors section to the
    kernel in 2 new metadata elements.
2.  Have the linker backends look for and record the start and size of
    the .ctors section in dynamically loaded modules.
3.  Have the linker backends call the constructors as part of the final
    work of initializing preloaded or dynamically loaded modules.

Note that LLVM appends the priority of the constructors to the name of
the .ctors section. Not so when compiling with GCC. The code currently
works for GCC and not for LLVM.

Submitted by:	Dmitry Mikulin <dmitrym@juniper.net>
Obtained from:	Juniper Networks, Inc.
2014-10-20 17:04:03 +00:00
Mateusz Guzik
020b8f17a0 Provide vfs suspension support only for filesystems which need it.
Need is expressed by providing vfs_susp_clean function in vfsops.

Differential Revision:	D952
Reviewed by:	kib (previous version)
MFC after:	2 weeks
2014-10-19 06:59:33 +00:00
Adrian Chadd
3fe93b946f Convert a missed u_char cpu -> int cpu.
This was caught by a gcc build.

Reported by:	luigi
Sponsored by:	Norse Corp, Inc.
2014-10-19 04:38:02 +00:00
Adrian Chadd
e77f9fed15 Update the ULE scheduler + thread and kinfo structs to use int for cpuid
rather than u_char.

To try and play nice with the ABI, the u_char CPU ID values are clamped
at 254.  The new fields now contain the full CPU ID, or -1 for no cpu.

Differential Revision:	D955
Reviewed by:	jhb, kib
Sponsored by:	Norse Corp, Inc.
2014-10-18 19:36:11 +00:00
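The clamping described in e77f9fed15 might look roughly like the sketch below. The 254 limit comes from the commit message; the function name and the use of 255 as the legacy "no CPU" marker are assumptions made for this example.

```c
#define TOY_CPU_OLD_MAX	254	/* clamp value named in the commit message */
#define TOY_NOCPU_OLD	255	/* assumed legacy "no CPU" marker */

/* Map a full int CPU ID (or -1 for "no CPU") onto the legacy u_char field. */
unsigned char
toy_cpuid_to_old(int cpu)
{
	if (cpu < 0)
		return (TOY_NOCPU_OLD);
	if (cpu > TOY_CPU_OLD_MAX)
		return (TOY_CPU_OLD_MAX);
	return ((unsigned char)cpu);
}
```

Old consumers keep seeing a value that fits in a byte, while new fields carry the exact ID.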
Davide Italiano
2be111bf7d Follow up to r225617. In order to maximize the re-usability of kernel code
in userland rename in-kernel getenv()/setenv() to kern_getenv()/kern_setenv().
This fixes a namespace collision with libc symbols.

Submitted by:   kmacy
Tested by:      make universe
2014-10-16 18:04:43 +00:00
Alexander Motin
99b9076c21 Remove setting BIO_DONE flag for BIOs that have done() method.
This fixes use-after-free, caused by geom_disk, completing same BIO twice
to save extra allocation, and getting BIO_DONE set after the first.

MFC after:	1 week
2014-10-15 18:36:34 +00:00
Konstantin Belousov
f821fad417 Implement FIODTYPE for master ptys.
Requested and reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-10-15 12:38:26 +00:00
Mateusz Guzik
32e7f8e4d5 Don't take devmtx unnecessarily in vn_isdisk.
MFC after:	1 week
2014-10-15 05:17:36 +00:00
Mateusz Guzik
55056be254 filedesc: plug 2 assignments to M_ZERO-ed pointers in falloc_noinstall
No functional changes.
2014-10-15 01:16:11 +00:00
Marcel Moolenaar
ddbe5b951f Fix nits in previous commit:
1.  Remove initializer for badstack_sbuf_size; it gets set unconditionally.
2.  Remove meaningless comment.
3.  Group witness_count and its sysctl together.
4.  Fix spacing in for statements (space after for and within condition).
5.  Change *all* M_NOWAIT usages in witness_initialize() to M_WAITOK; not
    just those that were newly introduced -- the allocation is assumed to
    succeed for all allocations.
6.  Avoid using uint8_t as the base type in sizeof() expressions; Use the
    variable name (w_rmatrix) as much as possible.

Pointed out by: jhb@ (thanks!)
2014-10-11 16:34:01 +00:00
Marcel Moolenaar
90a5222f14 Turn WITNESS_COUNT into a tunable and sysctl. This allows adjusting
the value without recompiling the kernel.  This is useful when
recompiling is not possible as an immediate solution. When we run out
of witness objects, witness is completely disabled. Not having an
immediate solution can therefore be problematic.

Submitted by:	Sreekanth Rupavatharam <rupavath@juniper.net>
Obtained from:	Juniper Networks, Inc.
2014-10-11 02:02:58 +00:00
Marcel Moolenaar
2e7634503e Regenerate after r272823:
Move the SCTP syscalls to netinet with the rest of the SCTP code.

Submitted by:	Steve Kiernan <stevek@juniper.net>
Reviewed by:	tuexen, rrs
Obtained from:	Juniper Networks, Inc.
2014-10-09 15:19:35 +00:00
Marcel Moolenaar
80b47aefa1 Move the SCTP syscalls to netinet with the rest of the SCTP code. The
syscalls themselves are tightly coupled with the network stack and
therefore should not be in the generic socket code.

The following four syscalls have been marked as NOSTD so they can be
dynamically registered in sctp_syscalls_init() function:
  sys_sctp_peeloff
  sys_sctp_generic_sendmsg
  sys_sctp_generic_sendmsg_iov
  sys_sctp_generic_recvmsg

The syscalls are also set up to be dynamically registered when COMPAT32
option is configured.

As a side effect of moving the SCTP syscalls, getsock_cap needs to be
made available outside of the uipc_syscalls.c source file.  A proper
prototype has been added to the sys/socketvar.h header file.

API tests from the SCTP reference implementation have been run to ensure
compatibility. (http://code.google.com/p/sctp-refimpl/source/checkout)

Submitted by:	Steve Kiernan <stevek@juniper.net>
Reviewed by:	tuexen, rrs
Obtained from:	Juniper Networks, Inc.
2014-10-09 15:16:52 +00:00
Adrian Chadd
ffcf962dab Add a bus method to fetch the VM domain for the given device/bus.
* Add a bus_if.m method - get_domain() - returning the VM domain or
  ENOENT if the device isn't in a VM domain;
* Add bus methods to print out the domain of the device if appropriate;
* Add code in srat.c to save the PXM -> VM domain mapping that's done and
  expose a function to translate VM domain -> PXM;
* Add ACPI and ACPI PCI methods to check if the bus has a _PXM attribute
  and if so map it to the VM domain;
* (.. yes, this works recursively.)
* Have the pci bus glue print out the device VM domain if present.

Note: this is just the plumbing to start enumerating information -
it doesn't at all modify behaviour.

Differential Revision:	D906
Reviewed by:	jhb
Sponsored by:	Norse Corp
2014-10-09 05:33:25 +00:00
Marcel Moolenaar
383f423be1 Fix draining in ttydev_leave():
1.  ERESTART is not only returned when the revoke count changed. It
    is also returned when a signal is received. While a change in
    the revoke count should be ignored, a signal should not.
2.  Waiting until the output queue is entirely drained can cause a
    hang when the underlying device is stuck or broken.

Have tty_drain() take care of this by telling it when we're leaving.
When leaving, tty_drain() will use a timed wait to address point 2
above and it will check the revoke count to handle point 1 above.
The timeout is set to 1 second, which is arbitrary and long enough
to expect a change in the output queue.

Discussed with: jilles@
Reported by: Yamagi Burmeister <lists@yamagi.org>
2014-10-09 02:30:38 +00:00
Marcel Moolenaar
75c2b79df8 Apply r269126 to tty_timedwait():
Don't return ERESTART when the device is gone.
2014-10-09 01:59:25 +00:00
John Baldwin
232e8b52b0 Add schedgraph traces for callout handlers. Specifically, a callwheel logs
a running event each time it executes a callout function.  The event
includes the function pointer, argument, and whether or not it was run from
hardware interrupt context.  The callwheel is marked idle when each handler
completes.  This effectively logs the duration of each callout routine in
the graph.
2014-10-08 16:22:59 +00:00
Jung-uk Kim
37417245bf Make kern.nswbuf tunable from loader.
MFC after:	1 week
2014-10-07 20:13:47 +00:00
Mateusz Guzik
dd2390be68 Convert racct stubs to inline functions.
This saves some symbols and function calls for kernel without RACCT.

MFC after:	1 week
2014-10-06 02:31:33 +00:00
Mateusz Guzik
2b4a2528d7 filedesc: fix up breakage introduced in 272505
Include sequence counter support unconditionally [1]. This fixes reported build
problems with e.g. nvidia driver due to missing opt_capsicum.h.

Replace fishy looking sizeof with offsetof. Make fde_seq the last member in
order to simplify calculations.

Suggested by:	kib [1]
X-MFC:		with 272505
2014-10-05 19:40:29 +00:00
Konstantin Belousov
57c2505e65 On error, sbuf_bcat() returns -1. Some callers returned this -1 to
the upper layers, which interpret it as errno value, which happens to
be ERESTART.  The result was spurious restarts of the sysctls in loop,
e.g. kern.proc.proc, instead of returning ENOMEM to caller.

Convert -1 from sbuf_bcat() to ENOMEM, when returning to the callers
expecting errno.

In collaboration with:	pho
Sponsored by:	The FreeBSD Foundation (kib)
MFC after:	1 week
2014-10-05 17:35:59 +00:00
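The conversion described in 57c2505e65 amounts to never letting a raw -1 escape to a caller that expects an errno value. A minimal sketch, with hypothetical `toy_` helpers standing in for the real functions:

```c
#include <errno.h>

/* Hypothetical helper that reports failure as -1, as sbuf_bcat() is described. */
int
toy_append(int fail)
{
	return (fail ? -1 : 0);
}

/* Handler whose callers expect 0 or an errno value. */
int
toy_sysctl_handler(int fail)
{
	if (toy_append(fail) != 0)
		return (ENOMEM);	/* translate; do not pass -1 upward */
	return (0);
}
```

Since the upper layers read -1 as ERESTART according to the commit message, the translation is what stops the spurious restart loop.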
Mateusz Guzik
bad2520a2b Avoid unnecessary ppeers_lock acquisition in exit1.
MFC after:	1 week
2014-10-05 07:21:41 +00:00
Mateusz Guzik
25108069ec Get rid of crshared. 2014-10-05 02:16:53 +00:00
Konstantin Belousov
4142462eeb Slightly reword comment. Move code, which is described by the
comment, after it.

Discussed with:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-10-04 18:51:55 +00:00
Konstantin Belousov
b76278407d Add kernel option KSTACK_USAGE_PROF to sample the stack depth on
interrupts and report the largest value seen as sysctl
debug.max_kstack_used.  Useful to estimate how close the kernel stack
size is to overflow.

In collaboration with:	Larry Baird <lab@gta.com>
Sponsored by:	The FreeBSD Foundation (kib)
MFC after:	1 week
2014-10-04 18:38:14 +00:00
Konstantin Belousov
539c9eef12 Fixes for i/o during coredumping:
- Do not dump into system files.
- Do not acquire write reference to the mount point where img.core is
  written, in the coredump().  The vn_rdwr() calls from ELF imgact
  request the write ref from vn_rdwr().  Recursive acquisition of the
  write ref deadlocks with the unmount.
- Instead, take the range lock for the whole core file.  This prevents
  parallel dumping from two processes executing the same image,
  converting the useless interleaved dump into sequential dumping,
  with second core overwriting the first.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-10-04 18:35:00 +00:00
Konstantin Belousov
e3d6feceb1 Add IO_RANGELOCKED flag for vn_rdwr(9), which specifies that vnode is
not locked, but range is.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-10-04 18:28:27 +00:00
Ian Lepore
41e8f7efbe Make kevent(2) periodic timer events more reliably periodic. The event
callout is now scheduled using the C_ABSOLUTE flag, and the absolute time
of each event is calculated as the time the previous event was scheduled
for plus the interval.  This ensures that latency in processing a given
event doesn't perturb the arrival time of any subsequent events.

Reviewed by:	jhb
2014-10-04 15:59:15 +00:00
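
The scheduling rule in the commit above (next deadline = previous scheduled time + interval) can be sketched as follows. The structure and function names are hypothetical and the time unit is arbitrary; this only illustrates why basing the next deadline on the scheduled time, rather than on the time the event was actually processed, avoids accumulated drift.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical timer state; times are in arbitrary ticks. */
struct periodic_timer {
	int64_t scheduled;	/* absolute time the current event was scheduled for */
	int64_t interval;	/* period between events */
};

/*
 * Compute the next absolute deadline from the previous *scheduled* time.
 * 'now' (the time the event was actually handled) is deliberately ignored;
 * a relative scheme would have used now + interval and drifted.
 */
static int64_t
periodic_timer_rearm(struct periodic_timer *t, int64_t now)
{
	(void)now;
	t->scheduled += t->interval;
	return (t->scheduled);
}
```

With an interval of 50 and a first deadline of 100, handling the event late at 107 still yields 150 as the next deadline, not 157.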
Mateusz Guzik
ee3fd7bbb1 Plug capability races.
fp and appropriate capability lookups were not atomic, which could result in
improper capabilities being checked.

This could result either in protection bypass or in a spurious ENOTCAPABLE.

Make fp + capability check atomic with the help of sequence counters.

Reviewed by:	kib
MFC after:	3 weeks
2014-10-04 08:08:56 +00:00
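
The idea of making two related lookups atomic with a sequence counter can be sketched with C11 atomics. This is a generic single-writer illustration under stated assumptions, not the kernel's implementation: `struct slot`, `slot_update()` and `slot_read()` are hypothetical names, and all operations use the default sequentially consistent ordering for simplicity.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical table slot holding a file reference and its rights. */
struct slot {
	atomic_uint seq;	/* odd while an update is in progress */
	atomic_int fp;
	atomic_int rights;
};

/* Single writer assumed: bump seq to odd, update both fields, bump to even. */
static void
slot_update(struct slot *s, int fp, int rights)
{
	atomic_fetch_add(&s->seq, 1);
	atomic_store(&s->fp, fp);
	atomic_store(&s->rights, rights);
	atomic_fetch_add(&s->seq, 1);
}

/* Reader retries until it sees the same even sequence before and after. */
static void
slot_read(struct slot *s, int *fp, int *rights)
{
	unsigned int before, after;

	do {
		before = atomic_load(&s->seq);
		*fp = atomic_load(&s->fp);
		*rights = atomic_load(&s->rights);
		after = atomic_load(&s->seq);
	} while ((before & 1u) != 0 || before != after);
}
```

A reader that overlaps an update sees either an odd counter or a changed counter and retries, so it never pairs a new `fp` with stale `rights` or the reverse.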
John Baldwin
7aa1071e48 Require p_cansched() for changing a process' protection status via
procctl() rather than p_cansee().

Submitted by:	rwatson
MFC after:	3 days
2014-10-02 21:18:16 +00:00
Will Andrews
9832a24d27 In the syncer, drop the sync mutex while patting the watchdog.
Some watchdog drivers (like ipmi) need to sleep while patting the watchdog.
See sys/dev/ipmi/ipmi.c:ipmi_wd_event(), which calls malloc(M_WAITOK).

Submitted by:	asomers
MFC after:	1 month
Sponsored by:	Spectra Logic
MFSpectraBSD:	637548 on 2012/10/04
2014-10-01 15:32:28 +00:00
Navdeep Parhar
a9fa76f27e Test for absence of M_NOFREE before attempting to purge the mbuf's tags.
This will leave more state intact should the assertion go off.

MFC after:	1 month
2014-09-30 23:16:26 +00:00
Mateusz Guzik
8e572983d3 Use bzero instead of explicitly zeroing stuff in do_execve.
While strictly speaking this is not correct since some fields are pointers,
it makes no difference on all supported archs and we already rely on it doing
the right thing in other places.

No functional changes.
2014-09-29 23:59:19 +00:00
Neel Natu
fbe602fb61 tty_rel_free() can be called more than once for the same tty so make sure
that the tty is dequeued from 'tty_list' only the first time.

The panic below was seen when a revoke(2) was issued on an nmdm device.
In this case there was also a thread that was blocked on a read(2) on the
device. The revoke(2) woke up the blocked thread which would typically
return an error to userspace. In this case the reader also held the last
reference on the file descriptor so fdrop() ended up calling tty_rel_free()
via ttydev_close().

tty_rel_free() then tried to dequeue 'tp' again which led to the panic.

panic: Bad link elm 0xfffff80042602400 prev->next != elm
cpuid = 1
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00f9c90460
kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe00f9c90510
vpanic() at vpanic+0x189/frame 0xfffffe00f9c90590
panic() at panic+0x43/frame 0xfffffe00f9c905f0
tty_rel_free() at tty_rel_free+0x29b/frame 0xfffffe00f9c90640
ttydev_close() at ttydev_close+0x1f9/frame 0xfffffe00f9c90690
devfs_close() at devfs_close+0x298/frame 0xfffffe00f9c90720
VOP_CLOSE_APV() at VOP_CLOSE_APV+0x13c/frame 0xfffffe00f9c90770
vn_close() at vn_close+0x194/frame 0xfffffe00f9c90810
vn_closefile() at vn_closefile+0x48/frame 0xfffffe00f9c90890
devfs_close_f() at devfs_close_f+0x2c/frame 0xfffffe00f9c908c0
_fdrop() at _fdrop+0x29/frame 0xfffffe00f9c908e0
sys_read() at sys_read+0x63/frame 0xfffffe00f9c90980
amd64_syscall() at amd64_syscall+0x2b3/frame 0xfffffe00f9c90ab0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe00f9c90ab0
--- syscall (3, FreeBSD ELF64, sys_read), rip = 0x800b78d8a, rsp = 0x7fffffbfdaf8, rbp = 0x7fffffbfdb30 ---

CR:		https://reviews.freebsd.org/D851
Reviewed by:	glebius, ed
Reported by:	Leon Dang
Sponsored by:	Nahanni Systems
MFC after:	1 week
2014-09-28 21:12:23 +00:00
Gleb Smirnoff
bd071d4d19 - Remove empty wrappers ether_poll_[de]register_drv(). [1]
- Move polling(9) declarations out of ifq.h back to if_var.h
  they are absolutely unrelated to queues.

Submitted by:	Mikhail <mp lenta.ru> [1]
2014-09-28 14:05:18 +00:00
Mateusz Guzik
0c4a09a378 Make do_dup() static and move relevant macros to kern_descrip.c
No functional changes.
2014-09-26 19:48:47 +00:00
John Baldwin
c1d67516d9 Don't panic if a resource is allocated twice. Instead, print a warning and
fail the allocation request.  Allocations of "reserved" resources such as
PCI BARs already fail the request instead of panic'ing in this case.

MFC after:	1 week
2014-09-26 18:37:49 +00:00
Konstantin Belousov
f69261f2f9 Fix fcntl(2) compat32 after r270691. The copyin and copyout of the
struct flock are done in the sys_fcntl(), which means that compat32 used
direct access to userland pointers.

Move code from sys_fcntl() to new wrapper, kern_fcntl_freebsd(), which
performs necessary userland memory accesses, and use it from both
native and compat32 fcntl syscalls.

Reported by:	jhibbits
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2014-09-25 21:07:19 +00:00
Konstantin Belousov
2c6fbcbed6 In kern_linkat() and kern_renameat(), do not call namei(9) while
holding a write reference on the filesystem.  Try to get write
reference in unblocked way after all vnodes are resolved; if failed,
drop all locks and retry after waiting for suspension end.

The VFS_UNMOUNT() methods for UFS and tmpfs try to establish
suspension on unmount, while covered vnode is locked by VFS, which
prevents namei() from stepping over the mount point.  The thread doing
namei() sleeps on the covered vnode lock, owning the write ref.

Reported by:	bdrewery
Tested by:	bdrewery (previous version), pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-09-25 20:42:25 +00:00
Justin Hibbits
a1c1634858 Stage one of multipass suspend/resume
Summary:
Add the beginnings of multipass suspend/resume, by introducing
BUS_SUSPEND_CHILD/BUS_RESUME_CHILD, and move the PCI driver to this.

Reviewers: jhb

Reviewed By: jhb

Differential Revision: https://reviews.freebsd.org/D590
2014-09-23 02:56:40 +00:00
John Baldwin
9696feebe2 Add a new fo_fill_kinfo fileops method to add type-specific information to
struct kinfo_file.
- Move the various fill_*_info() methods out of kern_descrip.c and into the
  various file type implementations.
- Rework the support for kinfo_ofile to generate a suitable kinfo_file object
  for each file and then convert that to a kinfo_ofile structure rather than
  keeping a second, different set of code that directly manipulates
  type-specific file information.
- Remove the shm_path() and ksem_info() layering violations.

Differential Revision:	https://reviews.freebsd.org/D775
Reviewed by:	kib, glebius (earlier version)
2014-09-22 16:20:47 +00:00
John Baldwin
fadf3fb98c Convert from timeout(9) to callout(9).
2014-09-22 14:27:26 +00:00
Hans Petter Selasky
9fd573c39d Improve transmit sending offload, TSO, algorithm in general.
The current TSO limitation feature only takes the total number of
bytes in an mbuf chain into account and does not limit by the number
of mbufs in a chain. Some kinds of hardware are limited by two
factors. One is the fragment length and the second is the fragment
count. Both of these limits need to be taken into account when doing
TSO. Otherwise some kinds of hardware might have to drop completely valid
mbuf chains because they cannot be loaded into the given hardware's DMA
engine. The new way of doing TSO limitation has been made backwards
compatible, following input from other FreeBSD developers, and will use
defaults for values not set.

Reviewed by:	adrian, rmacklem
Sponsored by:	Mellanox Technologies
MFC after:	1 week
2014-09-22 08:27:27 +00:00
Sean Bruno
7c51714e0a svn revisions r269964 and r269963 seem to have impaired small memory
footprint systems (32M/64M) and didn't leave enough free memory to load modules
when setting up page tables for sizes that are never used on
these smallish boards.

Set kmem_zmax to PAGE_SIZE on these smaller systems (< 128M) to keep this
from happening. Verified on mips32 h/w.

PR:             193465
Submitted by:   delphij
Reviewed by:    adrian
2014-09-22 05:07:22 +00:00
Alexander Motin
ae9e9b4fda Rephrase r271616 comments.
Submitted by:	alc
MFC after:	1 month
2014-09-17 17:43:32 +00:00
Adrian Chadd
066da8050b Migrate ie->ie_assign_cpu and associated code to use an int for CPU rather
than u_char.

Migrate post_filter to use an int for a CPU rather than u_char.

Change intr_event_bind() to use an int for CPU rather than u_char.

It touches the ppc, sparc64, arm and mips machdep code but it should
(hah!) be a no-op.

Tested:

* i386, AMD64 laptops

Reviewed by:	jhb
2014-09-17 17:33:22 +00:00
Adrian Chadd
7f7528fc79 Modify cpuset_setithread() to take a CPU ID as an integer, not a char.
We're going to end up having > 254 CPUs at some point.
2014-09-16 01:21:47 +00:00
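
Why a char-sized CPU ID stops working on large machines can be shown with a small sketch. The helper name is hypothetical; the only assumption is that `unsigned char` is 8 bits wide, as it is on all platforms FreeBSD supports.

```c
#include <assert.h>

/*
 * Hypothetical helper: what a CPU ID looks like after being squeezed
 * through an 8-bit unsigned char. Values above 255 wrap modulo 256.
 */
static int
cpu_id_through_uchar(int cpu)
{
	return ((unsigned char)cpu);
}
```

CPU 200 survives the round trip, while CPU 300 silently becomes CPU 44, which is why the interface takes an int instead.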
Enji Cooper
257597a434 Validate the mode argument in access, eaccess, and faccessat for optional
POSIX compliance and to improve compatibility with Linux and NetBSD

The issue was identified with lib/libc/sys/t_access:access_inval from
NetBSD

Update the manpage accordingly

PR: 181155
Reviewed by: jilles (code), jmmv (code), wblock (manpage), wollman (code)
MFC after: 4 weeks
Phabric: D678 (code), D786 (manpage)
Sponsored by: EMC / Isilon Storage Division
2014-09-16 00:56:47 +00:00
Alexander Motin
7965496958 Add comments describing r271604 change.
MFC after:	3 days
2014-09-15 11:17:36 +00:00
Alexander Motin
7e9b58eaaa Add a couple of memory barriers to serialize tdq_cpu_idle and tdq_load accesses.
This change fixes transient performance drops in some of my benchmarks,
vanishing as soon as I am trying to collect any stats from the scheduler.
It looks like reordered access to those variables sometimes caused loss of
IPI_PREEMPT, that delayed thread execution until some later interrupt.

MFC after:	3 days
2014-09-14 22:13:19 +00:00
Alexander V. Chernikov
c1d9ecf2be Fix error handling in cpuset_setithread() introduced in r267716.
Noted by:	kib
MFC after:	1 week
2014-09-13 13:46:16 +00:00
John Baldwin
2d69d0dcc2 Fix various issues with invalid file operations:
- Add invfo_rdwr() (for read and write), invfo_ioctl(), invfo_poll(),
  and invfo_kqfilter() for use by file types that do not support the
  respective operations.  Home-grown versions of invfo_poll() were
  universally broken (they returned an errno value, invfo_poll()
  uses poll_no_poll() to return an appropriate event mask).  Home-grown
  ioctl routines also tended to return an incorrect errno (invfo_ioctl
  returns ENOTTY).
- Use the invfo_*() functions instead of local versions for
  unsupported file operations.
- Reorder fileops members to match the order in the structure definition
  to make it easier to spot missing members.
- Add several missing methods to linuxfileops used by the OFED shim
  layer: fo_write(), fo_truncate(), fo_kqfilter(), and fo_stat().  Most
  of these used invfo_*(), but a dummy fo_stat() implementation was
  added.
2014-09-12 21:29:10 +00:00
John Baldwin
cd550b9b52 Tweak pipe_truncate() to more closely match pipe_chown() and pipe_chmod()
by checking PIPE_NAMED and using invfo_truncate() for unnamed pipes.
2014-09-12 21:20:36 +00:00
John Baldwin
0ed667f6e5 Simplify vntype_to_kinfo() by returning when the desired value is found
instead of breaking out of the loop and then immediately checking the loop
index so that if it was broken out of the proper value can be returned.

While here, use nitems().
2014-09-12 20:56:09 +00:00
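
The simplification described above, returning from inside the loop rather than breaking out and re-checking the index, might look like the following. The table contents and the `translate()` name are hypothetical; `nitems()` is defined locally here in the usual way in case the system headers do not provide it.

```c
#include <assert.h>
#include <stddef.h>

#ifndef nitems
#define nitems(x)	(sizeof((x)) / sizeof((x)[0]))
#endif

/* Hypothetical translation table. */
static const struct {
	int from;
	int to;
} map[] = {
	{ 1, 10 },
	{ 2, 20 },
	{ 3, 30 },
};

/* Return as soon as a match is found; fall through to a default. */
static int
translate(int from)
{
	size_t i;

	for (i = 0; i < nitems(map); i++)
		if (map[i].from == from)
			return (map[i].to);
	return (-1);
}
```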
Gleb Smirnoff
27ad26d8c7 Remove unused arguments for VOP_GETPAGES(), VOP_PUTPAGES().
2014-09-10 12:36:41 +00:00
Edward Tomasz Napierala
f514b97b7d Avoid unlocking unlocked mutex in RCTL jail code. Specific test case
is attached to PR.

PR:		193457
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2014-09-09 16:05:33 +00:00
Hiroki Sato
714266373c - Make hhook_run_socket() vnet-aware instead of adding CURVNET_SET() around
the function calls.
- Fix a memory leak and stats in the case that hhook_run_socket() fails
  in soalloc().

PR:	193265
2014-09-08 09:04:22 +00:00
Jean-Sébastien Pédron
ffea80b445 pause_sbt(): Take the cold path (ie. use DELAY()) if KDB is active
This fixes a panic in the i915 driver when one uses debug.kdb.enter=1
under vt(4).

PR:		193269
Reported by:	emaste@
Submitted by:	avg@
MFC after:	3 days
2014-09-08 08:44:50 +00:00
Gleb Smirnoff
9e739a5a05 Fix for r271182.
Submitted by:	mjg
Pointy hat to:	me, submitter and everyone who urged me to commit
2014-09-07 05:44:14 +00:00
Mateusz Guzik
64196a9996 Plug unnecessary fp assignments in kern_fcntl.
No functional changes.
2014-09-05 23:56:25 +00:00
Gleb Smirnoff
d9257d8b57 Set vnet context before accessing V_socket_hhh[].
Submitted by:	"Hiroo Ono (小野寛生)" <hiroo.ono+freebsd gmail.com>
2014-09-05 19:50:18 +00:00
Sean Bruno
65f20a89f1 Allow multiple image activators to run on the same execution by changing
imgp->interpreted to a bitmask instead of, functionally, a bool. Each
imgactivator now requires its own flag in interpreted to indicate whether
or not it has already examined argv[0].

Change imgp->interpreted to an unsigned char to add one extra bit for
future use.

With this change, one can execute a shell script from a 64bit host native
make and still get the binmisc image activator to fire for the script
interpreter.  Prior to this, execution would fail.

Phabric:	https://reviews.freebsd.org/D696
Reviewed by:	jhb@
MFC after:	4 weeks
2014-09-04 21:31:25 +00:00
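
The move from a boolean to a per-activator bitmask can be sketched as below. The flag names, `struct image`, and `activator_claim()` are hypothetical; the sketch only illustrates the idea that each activator owns one bit and can therefore run once per execution without blocking the others.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag names; one bit per image activator. */
#define ACT_SHELL	0x01u
#define ACT_BINMISC	0x02u

struct image {
	unsigned char interpreted;	/* bitmask of activators that already ran */
};

/* Returns true the first time a given activator claims the image. */
static bool
activator_claim(struct image *img, unsigned char flag)
{
	if ((img->interpreted & flag) != 0)
		return (false);
	img->interpreted |= flag;
	return (true);
}
```

With a plain boolean, the second activator would have been refused once the first had run; with the bitmask, only a repeat by the same activator is refused.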
Gleb Smirnoff
7ee2d05890 Change a very strange code in m_demote() to simple assertion.
Sponsored by:	Nginx, Inc.
2014-09-04 19:27:30 +00:00
Gleb Smirnoff
1967edba02 Provide m_catpkt(), a wrapper around m_cat() that deals with M_PKTHDR mbufs.
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-09-04 09:07:14 +00:00
Mateusz Guzik
2570cdd605 Plug a hypothetical use after free in sysctl kern.proc.groups.
MFC after:	1 week
2014-09-04 01:21:33 +00:00
Benno Rice
c079e1c018 Add KASSERTs to catch the case where a developer may have forgotten to
set bo_bsize on a bufobj.

This is a slight modification of the patch provided.

PR:		193146
Submitted by:	Conrad Meyer <conrad.meyer@isilon.com>
Sponsored by:	EMC Isilon Storage Division
2014-09-04 00:10:06 +00:00
Konstantin Belousov
624bf9e134 Style.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-09-03 08:40:16 +00:00
Konstantin Belousov
8626a0ddc6 Retire thread_unthread(), it has only one caller. Update comment in
the block of code before the previous call to thread_unthread().

Discussed with:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-09-03 08:35:42 +00:00
Konstantin Belousov
fd229b5b75 Right now, thread_single(SINGLE_EXIT) returns after the p_numthreads
reaches 1. The p_numthreads counter is decremented in thread_exit() by
a call to thread_unlink(). This means that the exiting threads may
still execute on other CPUs when thread_single(SINGLE_EXIT) returns.
As result, vmspace could be destroyed while paging structures are
still used on other CPUs by exiting threads.

Delay the return from thread_single(SINGLE_EXIT) until all threads are
really destroyed by thread_stash() after the last switch out. The
p_exitthreads counter already provides the required mechanism, move
the wait from the thread_wait() (which is called from wait(2) code)
into thread_single().

Reported by:	many (as "panic: pmap active <addr>")
Reviewed by:	alc, jhb
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-09-03 08:18:07 +00:00
Gleb Smirnoff
5b5477d762 Fix dereference after NULL check.
CID:		1234607
Sponsored by:	Nginx, Inc.
2014-09-03 08:14:07 +00:00
Mateusz Guzik
9152087ea7 Fix up proc_realparent to always return correct process.
Prior to the change it would always return initproc for non-traced processes.

This fixes ps apparently always returning 1 as ppid.

Pointy hat:	mjg
Reported by:	many
MFC after:	1 week
2014-09-03 06:25:34 +00:00
Alan Cox
01a8fb7db5 Automatically prefault a limited number of mappings to resident pages in
shmat(2), just like mmap(2).

MFC after:	5 days
Sponsored by:	EMC / Isilon Storage Division
2014-08-31 17:38:41 +00:00
Mateusz Guzik
6662ce5aab Add missing proctree locking to fill_kinfo_proc consumers.
This fixes r270444.

Pointy hat:	mjg
Reported by:	many
MFC after:	1 week
2014-08-30 03:10:55 +00:00
Andreas Tobler
5be725d7e8 Rename shm_dict_init to shm_init to fix a compiler warning.
Reviewed by:	jhb
2014-08-29 21:50:32 +00:00
John Baldwin
610a2b3c45 Use a unit number allocator to provide suitable st_dev and st_ino values
for POSIX shared memory descriptors.  The implementation is similar to
that used for pipes.

MFC after:	1 week
2014-08-29 18:18:29 +00:00
Konstantin Belousov
575e02d94f Add function and wrapper to switch lockmgr and vnode lock back to
auto-promotion of shared to exclusive.

Tested by:	hrs, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-08-29 09:02:01 +00:00
Mateusz Guzik
8b04bbef31 Return real parent pid in kinfo (used by e.g. ps)
Add a separate field which exports tracer pid and add a new keyword
("tracer") for ps to display it.

This is a follow up to r270444.

Reviewed by:	kib
MFC after:	1 week
Relnotes:	yes
2014-08-28 08:41:11 +00:00
Jean-Sébastien Pédron
3e206539a1 vt(4): Add cngrab() and cnungrab() callbacks
They are used when a panic occurs or when entering a DDB session for
instance.

cngrab() forces a vt-switch to the console window, no matter if the
original window is another terminal or an X session. However, cnungrab()
doesn't vt-switch back to the original window currently.

MFC after:	1 week
2014-08-27 10:04:10 +00:00
Gleb Smirnoff
e86447ca44 - Remove socket file operations declaration from sys/file.h.
- Make them static in sys_socket.c.
- Provide generic invfo_truncate() instead of soo_truncate().

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-08-26 14:44:08 +00:00
Mateusz Guzik
037755fd15 Fix up races with f_seqcount handling.
It was possible that the kernel would overwrite user-supplied hint.

Abuse vnode lock for this purpose.

In collaboration with: kib
MFC after:	1 week
2014-08-26 08:17:22 +00:00
Konstantin Belousov
c83655f334 Revert the handling of all siginfo sa_flags except SA_SIGINFO to the
pre-r270321.  Namely, the flags are preserved for SIG_DFL and SIG_IGN
dispositions.

Requested and reviewed by:	jilles
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-08-24 16:37:50 +00:00
Mateusz Guzik
8cc11167fb Plug a memory leak in case of failed lookups in capability mode.
Put common cnp cleanup into one function and use it for this purpose.

MFC after:	1 week
2014-08-24 12:51:12 +00:00
Mateusz Guzik
ce8daaadbd Use refcount_init in sigacts_alloc.
This change is a no-op, but fixes up an inconsistency introduced with
r268634.

MFC after:	3 days
2014-08-24 09:24:37 +00:00
Mateusz Guzik
abd386bafe Fix getppid for traced processes.
Traced processes always have the tracer set as the parent.
Utilize proc_realparent to obtain the right process when needed.

Reviewed by:	kib
MFC after:	1 week
2014-08-24 09:04:09 +00:00
Mateusz Guzik
a661bebe26 Properly reparent traced processes when the tracer dies.
Previously they were unconditionally reparented to init. In effect
it was possible that tracee was never returned to original parent.

Reviewed by:	kib
MFC after:	1 week
2014-08-24 09:02:16 +00:00
Alexander Motin
2e7d7bb294 Restore pre-r239157 handling of sched_yield(), when thread time slice was
aborted, allowing other threads to run.  Without this change thread is just
rescheduled again, that was illustrated by provided test tool.

PR:		192926
Submitted by:	eric@vangyzen.net
MFC after:	2 weeks
2014-08-23 17:31:56 +00:00
Konstantin Belousov
fbb6eca60f In do_lock_pi(), do not override error from umtxq_sleep_pi() when
doing suspend check.  This restores the pre-r251684 behaviour, to
retry once after the signal is detected.

PR:	kern/192918
Submitted by:	Elliott Rabe, Dell Inc., Eric van Gyzen <eric@vangyzen.net>
Obtained from:	Dell Inc.
MFC after:	1 week
2014-08-22 18:42:14 +00:00
Konstantin Belousov
350ae56373 Ensure that sigaction flags for signal, which disposition is reset to
ignored or default, are not leaking.  Apparently, there exists code
which relies on SA_SIGINFO not reported for SIG_DFL or SIG_IGN.

In kern_sigaction, ignore flags when resetting.  Encapsulate the flag
and disposition testing into helper sigact_flag_test().

On exec, and when delivering signal with SA_RESETHAND flag set,
signals are reset automatically.  Use new helper sigdflt(), which
removes duplicated code and corrects all flag bits for the signal.

For proc0, set sigintr bit for all ignored signals.  Ignored signals
are consumed in tdsendsignal() and not delivered to the victim thread
at all.

Reported and tested by:	royger
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-08-22 08:19:08 +00:00
Konstantin Belousov
2d86417410 Check the validity of struct sigaction sa_flags value, reject unknown
flags.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-08-22 07:52:47 +00:00
Hiroki Sato
ed063112f4 Fix a panic which occurs in a VIMAGE-enabled kernel after r270158, and
separate socket_hhook_register() part and put it into VNET_SYS{,UN}INIT()
handler.

Discussed with:	marcel
2014-08-22 05:03:30 +00:00
Marcel Moolenaar
4ec7371233 For vendors like Juniper, extensibility for sockets is important. A
good example is socket options that aren't necessarily generic.  To
this end, OSD is added to the socket structure and hooks are defined
for key operations on sockets.  These are:
o   soalloc() and sodealloc()
o   Get and set socket options
o   Socket related kevent filters.

One aspect about hhook that appears to be not fully baked is the return
semantics (the return value from the hook is ignored in hhook_run_hooks()
at the time of commit).  To support return values, the socket_hhook_data
structure contains a 'status' field to hold return values.

Submitted by:	Anuranjan Shukla <anshukla@juniper.net>
Obtained from:	Juniper Networks, Inc.
2014-08-18 23:45:40 +00:00
Warner Losh
817dc00433 Expand the elf brandelf infrastructure to give access to the whole ELF
header (Elf_Ehdr) to determine if a particular interpreter wants to
accept it or not. Use this mechanism to filter EABI arm on OABI arm
kernels, and vice versa. This method could also be used to implement
OABI on EABI arm kernels, if desired, or to allow a single mips kernel
to run o32, n32 and n64 binaries.

Differential Revision: https://reviews.freebsd.org/D609
2014-08-18 02:44:56 +00:00
Edward Tomasz Napierala
3914ddf8a7 Bring in the new automounter, similar to what's provided in most other
UNIX systems, eg. MacOS X and Solaris.  It uses Sun-compatible map format,
has proper kernel support, and LDAP integration.

There are still a few outstanding problems; they will be fixed shortly.

Reviewed by:	allanjude@, emaste@, kib@, wblock@ (earlier versions)
Phabric:	D523
MFC after:	2 weeks
Relnotes:	yes
Sponsored by:	The FreeBSD Foundation
2014-08-17 09:44:42 +00:00
Mark Johnston
ba78d6b7a1 Correct the order of arguments passed to LIST_INSERT_AFTER().
Reviewed by:	kib
X-MFC-With:	r269656
2014-08-15 15:42:58 +00:00
Xin LI
7001d850bb Add a new loader tunable, vm.kmem_zmax which allows a system administrator
to limit the maximum allocation size that malloc(9) would consider using
the UMA cache allocator as backend.

Suggested by:	alfred
MFC after:	2 weeks
2014-08-14 05:31:39 +00:00
Xin LI
bda06553fd Re-instate UMA cached backend for 4K - 64K allocations. New consumers
like geli(4) use malloc(9) to allocate temporary buffers that get
freed shortly, causing frequent TLB shootdowns as observed in hwpmc
supported flame graph.

Discussed with:	jeff, alfred
MFC after:	1 week
2014-08-14 05:13:24 +00:00
Konstantin Belousov
70978c93b8 If vm_page_grab() allocates a new page, the page is not inserted into
page queue even when the allocation is not wired.  It is
responsibility of the vm_page_grab() caller to ensure that the page
does not end on the vm_object queue but not on the pagedaemon queue,
which would effectively create unpageable unwired page.

In exec_map_first_page() and vm_imgact_hold_page(), activate the page
immediately after unbusying it, to avoid leak.

In the uiomove_object_page(), deactivate page before the object is
unlocked.  There is no leak, since the page is deactivated after
uiomove_fromphys() finished.  But allowing non-queued non-wired page
in the unlocked object queue makes it impossible to assert that leak
does not happen in other places.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-08-13 05:44:08 +00:00
Gleb Smirnoff
cd1692fa5d Move KASSERT into locked region.
Submitted by:	kib
2014-08-11 15:06:07 +00:00
Gleb Smirnoff
eaf78ad3f7 Use M_WAITOK in sf_buf_init().
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-08-11 13:12:18 +00:00
Gleb Smirnoff
818d40d033 Provide sf_buf_ref() to optimize refcounting of already allocated
sendfile(2) buffers.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-08-11 12:59:55 +00:00
Bjoern A. Zeeb
e346f8c452 Split up sys_ktimer_getoverrun() into a sys_ and a kern_ variant
and export the kern_ version needed by an upcoming linuxolator change.

MFC after:	3 days
Sponsored by:	DARPA,AFRL
2014-08-07 16:49:50 +00:00
Andrey V. Elsukov
3e40097976 Temporarily revert r269661, it looks like the patch isn't complete.
2014-08-07 14:32:28 +00:00
Andrey V. Elsukov
c60e497af9 Use cpuset_setithread() to apply cpu mask to taskq threads.
Sponsored by:	Yandex LLC
2014-08-07 10:23:50 +00:00
Konstantin Belousov
d735998057 Correct the problems with the ptrace(2) making the debuggee an orphan.
One problem is inferior(9) looping due to the process tree becoming a
graph instead of tree if the parent is traced by child. Another issue
is due to the use of p_oppid to restore the original parent/child
relationship, because real parent could already exited and its pid
reused (noted by mjg).

Add the function proc_realparent(9), which calculates the parent for
given process. It uses the flag P_TREE_FIRST_ORPHAN to detect the head
element of the p_orphan list and then steps back to its container
to find the parent process. If the parent has already exited, the
init(8) is returned.

Move the P_ORPHAN and the new helper flag from the p_flag* to new
p_treeflag field of struct proc, which is protected by proctree lock
instead of proc lock, since the orphans relationship is managed under
the proctree_lock already.

The remaining uses of p_oppid in ptrace(PT_DETACH) and process
reaping are replaced by proc_realparent(9).

Phabric:	D417
Reviewed by:	jhb
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-08-07 05:47:53 +00:00
Gleb Smirnoff
c8d2ffd6a7 Merge all MD sf_buf allocators into one MI, residing in kern/subr_sfbuf.c
The MD allocators were very common, however there were some minor
differences. These differences were all consolidated in the MI allocator,
under ifdefs. The defines from machine/vmparam.h turn on features required
for a particular machine. For details look in the comment in sys/sf_buf.h.

As result no MD code left in sys/*/*/vm_machdep.c. Some arches still have
machine/sf_buf.h, which is usually quite small.

Tested by:	glebius (i386), tuexen (arm32), kevlo (arm32)
Reviewed by:	kib
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-08-05 09:44:10 +00:00
Kirk McKusick
5f9500c358 Add support for multi-threading of soft updates.
Replace a single soft updates thread with a thread per FFS-filesystem
mount point. The threads are associated with the bufdaemon process.

Reviewed by:  kib
Tested by:    Peter Holm and Scott Long
MFC after:    2 weeks
Sponsored by: Netflix
2014-08-04 22:03:58 +00:00
Davide Italiano
4295aa9240 Fix an overflow in getsockopt(). optval isn't big enough to hold
sbintime_t.
Re-introduce r255030 behaviour capping socket timeouts to INT_32
if they're too large.

CR:	https://phabric.freebsd.org/D433
Reported by:	demon
Reviewed by:	bde [1], jhb [2]
MFC after:	2 weeks
2014-08-04 05:40:51 +00:00
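
The capping behaviour mentioned above can be sketched as a simple clamp. This is an illustration only: `clamp_timeout()` is a hypothetical name, the 64-bit input merely stands in for a wider time type, and the sketch assumes non-negative timeouts.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: reduce a wide timeout value to what fits in a
 * 32-bit signed field, capping instead of overflowing.
 */
static int32_t
clamp_timeout(int64_t timeout)
{
	if (timeout > INT32_MAX)
		return (INT32_MAX);
	return ((int32_t)timeout);
}
```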
Peter Wemm
6dde7ecb5d Partial revert of r262867.
r262867 was described as fixing socket buffer checks for SOCK_SEQPACKET,
but also changed one of the SOCK_DGRAM code paths to use the new
sbappendaddr_nospacecheck_locked() function.  This led to SOCK_DGRAM
bypassing socket buffer limits.
2014-08-03 22:37:21 +00:00
Sergey Kandaurov
bcdd3bceb6 vn_path_to_global_path: update comment.
2014-08-03 07:59:19 +00:00
Warner Losh
146cbf6fa2 Make the witness lock limit an option.
2014-08-03 05:00:43 +00:00
Konstantin Belousov
168f4ee0a8 Remove Giant acquisition from the mount and unmount paths.
It could be claimed that two things were reasonably protected by
Giant.  One is vfsconf list links, which is converted to the new
dedicated sx vfsconf_sx.  Another is vfsconf.vfc_refcount, which is
now updated with atomics.

Note that vfc_refcount still has the same races now as it has under
the Giant, the unload of filesystem modules can happen while the
module is still in use.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-08-03 03:27:54 +00:00
Rui Paulo
551a78956c In the shm_open() and shm_unlink() syscalls, export the path to KTR.
MFC after:	1 week
2014-08-01 23:29:04 +00:00
Konstantin Belousov
634012b917 Remove one-time use macros which check for the vnode lifecycle. More,
some parts of the checks are in fact redundant in the surrounding
code, and it is more clear what the conditions are by direct testing
of the flags.  Two of the three macros were only used in assertions.

In vnlru_free(), all relevant parts of vholdl() were already inlined,
except the increment of v_holdcnt itself.  Do not call vholdl() to do
the increment as well, this allows to make assertions in
vholdl()/vhold() more strict.

In v_incr_usecount(), call vholdl() before incrementing other ref
counters.  The change is no-op, but it makes less surprising to see
the vnode state in debugger if interrupted inside v_incr_usecount().

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-07-29 16:42:34 +00:00
Konstantin Belousov
d3a3b8b038 Simplify the expression, by removing redundant calculation.
Noted by:	"O'Connor, Daniel" <Daniel.O'Connor@emc.com>
MFC after:	3 days
2014-07-29 01:46:31 +00:00
Konstantin Belousov
5d9b4508fd For md(4), posix shm(3) and tmpfs(5), free swap space used by paged in
dirty page, which is written by the process.

Reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-07-28 14:27:05 +00:00
Pietro Cerutti
adecd05bf0 Unbreak the ABI by reverting r268494 until the compat shims are provided
2014-07-28 07:20:22 +00:00
Marcel Moolenaar
1e0a021e3d The accept filter code is not specific to the FreeBSD IPv4 network stack,
so it really should not be under "optional inet". The fact that uipc_accf.c
lives under kern/ lends some weight to making it a "standard" file.

Moving kern/uipc_accf.c from "optional inet" to "standard" eliminates the
need for #ifdef INET in kern/uipc_socket.c.

Also, this meant the net.inet.accf.unloadable sysctl needed to move, as
net.inet does not exist without networking compiled in (as it lives in
netinet/in_proto.c.) The new sysctl has been named net.accf.unloadable.

In order to support existing accept filter sysctls, the net.inet.accf node
has been added to netinet/in_proto.c.

Submitted by:	Steve Kiernan <stevek@juniper.net>
Obtained from:	Juniper Networks, Inc.
2014-07-26 19:27:34 +00:00
Marcel Moolenaar
be836fab6c Don't return ERESTART when the device is gone. In ttydev_leave() ERESTART
is the indication that draining got interrupted due to a revoke(2) and
that tty_drain() is to be called again for draining to complete. If the
device is flagged as gone, then waiting/draining is not possible. Only
return ERESTART when waiting is still possible.

Obtained from:	Juniper Networks, Inc.
2014-07-26 15:46:41 +00:00
Gavin Atkinson
f6b4f5ca21 Add error return to dumpsys(), and use it in doadump().
This commit does not add error returns to minidumpsys() or
textdump_dumpsys(); those can also be added later.

Submitted by:	Conrad Meyer (EMC / Isilon storage division)
2014-07-25 23:52:53 +00:00
Daniel Eischen
66d8df9dfc Insert new threads at the end of the thread list in the process
instead of at the beginning.  This allows an intra process signal
to be sent to the oldest thread with the signal unmasked - which,
if it still exists, is the main thread.  This mimics behavior
found in Linux and Solaris.
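
The ordering argument can be sketched in userland with the queue(3)
TAILQ macros.  This is only an illustration: struct thr_sketch,
thr_list_add() and thr_list_pick() are hypothetical names, not the
kernel's thread structures or signal delivery code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for a kernel thread; not the real struct thread. */
struct thr_sketch {
	int			id;
	bool			sig_masked;	/* true if the signal is blocked */
	TAILQ_ENTRY(thr_sketch)	link;
};
TAILQ_HEAD(thr_list, thr_sketch);

/* New threads go to the tail, so the list stays ordered oldest-first. */
static void
thr_list_add(struct thr_list *list, struct thr_sketch *td)
{
	TAILQ_INSERT_TAIL(list, td, link);
}

/* Return the oldest thread that has the signal unmasked, or NULL. */
static struct thr_sketch *
thr_list_pick(struct thr_list *list)
{
	struct thr_sketch *td;

	TAILQ_FOREACH(td, list, link) {
		if (!td->sig_masked)
			return (td);
	}
	return (NULL);
}
```

With head insertion the same forward scan would find the newest
eligible thread first; tail insertion makes it find the oldest one.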
2014-07-25 20:21:02 +00:00
Mateusz Guzik
a1bf811596 Prepare fget_unlocked for reading fd table only once.
Some capsicum functions accept fdp + fd and lookup fde based on that.
Add variants which accept fde.

Reviewed by:	pjd
MFC after:	1 week
2014-07-23 19:33:49 +00:00
Mateusz Guzik
6a1cf96b4a Cosmetic changes to unp_internalize
Don't throw away the result of fget_unlocked.
Move fdp increment to for loop to make it consistent with similar code
elsewhere.

MFC after:	1 week
2014-07-23 18:04:52 +00:00
Gleb Smirnoff
c71b4037ff Use assignment instead of bcopy.
Submitted by:	jmg
2014-07-18 14:59:35 +00:00
Baptiste Daroussin
42e62eca52 Extend kqueue's EVFILT_TIMER by adding precision unit flags support
Define the precision macros as bit sets to conform with XNU equivalent.
Test fflags passed for EVFILT_TIMER and return EINVAL in case an invalid flag
is passed.

Phabric:	https://phabric.freebsd.org/D421
Reviewed by:	kib
2014-07-18 14:27:04 +00:00
Kevin Lo
c29a33213b Deprecate m_act. Use m_nextpkt always. 2014-07-17 05:21:16 +00:00
Don Lewis
d3a6879421 Nuke the never-used RF_TIMESHARE feature, reducing the complexity of the
code.  The consensus on arch@ is that this feature might have been useful
in the distant past, but is now just unnecessary bloat.

The int_rman_activate_resource() and int_rman_deactivate_resource()
functions become trivial, so manually inline them.

The special deferred handling of RF_ACTIVE is no longer needed in
reserve_resource_bound(), so eliminate the associated code at the
end of the function.

These changes reduce the object file size by more than 500 bytes on i386.

Update the rman.9 man page to reflect the removal of the RF_TIMESHARE
feature.

MFC after:	2 weeks
2014-07-16 22:18:19 +00:00
Konstantin Belousov
65589a29f4 Check for the cross-device cross-link attempt in the VFS, instead of
forcing filesystem VOP_LINK() methods to repeat the code.  In
tmpfs_link(), remove the redundant check for the type of the source,
already done by VFS.

Note that NFS server already performs this check before calling
VOP_LINK().

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-07-16 14:04:46 +00:00
Konstantin Belousov
a62eb1398a Followup to r268466.
- Move the code to calculate resident count into separate function.
  It reduces the indent level and makes the operation of
  vmmap_skip_res_cnt tunable more clear.
- Optimize the calculation of the resident page count for map entry.
  Skip directly to the next lowest available index and page among the
  whole shadow chain.
- Restore the use of pmap_incore(9), only to verify that current
  mapping is indeed superpage.
- Note the issue with the invalid pages.

Suggested and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-07-15 19:57:03 +00:00
Konstantin Belousov
3760e341ca Change the calculation of the kinfo_vmentry field kve_private_resident
to reflect its name.

Noted and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-07-15 19:49:00 +00:00
Mateusz Guzik
965d08605f Plug p_pptr null test in do_execve. It is always true. 2014-07-14 22:40:46 +00:00
Mateusz Guzik
c959c23740 Manage struct sigacts refcnt with atomics instead of a mutex.
MFC after:	1 week
2014-07-14 21:12:59 +00:00
Konstantin Belousov
895b3782c6 Extract the code to put a filesystem into the suspended state (at the
unmount time) in the helper vfs_write_suspend_umnt().  Use it instead
of two inline copies in FFS.

Fix a bug in the FFS unmount: when suspension failed, the ufs
extattrs were not reinitialized.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-07-14 09:10:00 +00:00
Konstantin Belousov
57ef02ff0f In kern_linkat(), avoid passing doomed vnode to the VOP.
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-07-14 08:41:13 +00:00
Konstantin Belousov
a69452162a Generalize vn_get_ino() to allow filesystems to use custom vnode
producer, instead of hard-coding VFS_VGET().  New function, which
takes callback, is called vn_get_ino_gen(), standard callback for
vn_get_ino() is provided.

Convert inline copies of vn_get_ino() in msdosfs and cd9660 into the
uses of vn_get_ino_gen().

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-07-14 08:34:54 +00:00
Kevin Lo
cb7df69b7e Make bind(2) and connect(2) return EAFNOSUPPORT for AF_UNIX on wrong
address family.

See https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=191586 for the
original discussion.

Reviewed by:	terry
2014-07-14 06:00:01 +00:00
Mateusz Guzik
8bedd5d782 Clear nonblock and async on devctl close instead of open.
This is a purely cosmetic change.
2014-07-12 15:35:04 +00:00
Gleb Smirnoff
1fbe6a82f4 Improve reference counting of EXT_SFBUF pages attached to mbufs.
o Do not use UMA refcount zone. The problem with this zone is that
  several refcounting words (16 on amd64) share the same cache line,
  and issuing atomic(9) updates on them creates cache line contention.
  Also, allocating and freeing them costs extra CPU cycles.
  Instead, refcount the page directly via vm_page_wire() and the sfbuf
  via sf_buf_alloc(sf_buf_page(sf)) [1].

o Call refcounting/freeing function for EXT_SFBUF via direct function
  call, instead of function pointer. This removes barrier for CPU
  branch predictor.

o Do not cleanup the mbuf to be freed in mb_free_ext(), merely to
  satisfy assertion in mb_dtor_mbuf(). Remove the assertion from
  mb_dtor_mbuf(). Use bcopy() instead of manual assignments to
  copy m_ext in mb_dupcl().

[1] This has some problems for now. Using sf_buf_alloc() merely to
    increase refcount is expensive, and is broken on sparc64. To be
    fixed.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-07-11 19:40:50 +00:00
Gleb Smirnoff
fcc34a238c Fix style bug: rename the refcount field of m_ext to ext_cnt, to match
other members.

Sponsored by:	Nginx, Inc.
2014-07-11 14:34:29 +00:00
Gleb Smirnoff
15c28f87b8 All mbuf external free functions never fail, so let them be void.
Sponsored by:	Nginx, Inc.
2014-07-11 13:58:48 +00:00
Mateusz Guzik
88f98985aa Eliminate plim and vtmp local vars in exit1.
No functional changes.

MFC after:	1 week
2014-07-10 22:54:38 +00:00
Mateusz Guzik
30d58d6b39 Don't make a temporary copy of fixed sysctl strings. 2014-07-10 21:46:57 +00:00
Mateusz Guzik
b23c40d7b1 Don't zero fd_nfiles during fdp destruction.
Code trying to take a look has to check fd_refcnt and it is 0 by that time.

This is a follow up to r268505, without this the code would leak memory for
tables bigger than the default.

MFC after:	1 week
2014-07-10 21:05:45 +00:00
Mateusz Guzik
e518baf8f9 Avoid relocking filedesc lock when closing fds during fdp destruction.
Don't call bzero nor fdunused from fdfree for such cases. It would do
unnecessary work and complain that the lock is not taken.

MFC after:	1 week
2014-07-10 20:59:54 +00:00
Pietro Cerutti
7150b86bfe Implement Short/Small String Optimization in SBUF(9) and change lengths and
positions in the API from ssize_t and int to size_t.

CR:		D388
Approved by:	des, bapt
2014-07-10 13:08:51 +00:00
Konstantin Belousov
479fcb4e32 Unconditionally initialize addr to handle the case of changed map
timestamp while the map is unlocked.

Reported by:	bz
Sponsored by:	The FreeBSD Foundation
MFC after:	6 days
2014-07-10 11:20:24 +00:00
Konstantin Belousov
a91831a261 Current code in sysctl proc.vmmap, whose intent is to calculate the
amount of resident pages, in fact calculates the amount of installed
pte entries in the region.  Resident pages which were not soft-faulted
yet are not counted.

Calculate the amount of resident pages by looking in the objects chain
backing the region.

Add a knob to disable the residency calculation altogether.  For large
sparse regions, either the previous or the updated algorithm runs for too
long, while several introspection tools do not need the (advisory) RSS
value at all.

PR:	kern/188911
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-07-09 19:11:57 +00:00
Xin LI
2827952eb4 Don't leave the padding between the msg header and the cmsg data,
and the padding after the cmsg data un-initialized.

Submitted by:	tuexen
Security:	CVE-2014-3952
Security:	FreeBSD-SA-14:17.kmem
2014-07-08 21:54:23 +00:00
Konstantin Belousov
3bcc218f46 Correct the problem reported by test16 from
tools/regression/file/flock/flock.c, which completes the fix in
r192685.  When the lock was stolen from us, retry the whole lock
sequence in kernel, instead of returning EINTR to usermode and hoping
that application would handle it correctly by restarting the lock
acquire.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-07-08 08:10:15 +00:00
Don Lewis
626a79752f Declaration whitespace changes for style(9).
MFC after:	1 week
2014-07-07 22:02:39 +00:00
Mateusz Guzik
5e2554b7f8 Don't call crdup nor uifind under vnode lock.
A locked vnode can get in the way of satisfying malloc with M_WAITOK.

This is a fixup to r268087.

Suggested by:	kib
MFC after:	1 week
2014-07-07 14:03:30 +00:00
Marcel Moolenaar
e7d939bda2 Remove ia64.
This includes:
o   All directories named *ia64*
o   All files named *ia64*
o   All ia64-specific code guarded by __ia64__
o   All ia64-specific makefile logic
o   Mention of ia64 in comments and documentation

This excludes:
o   Everything under contrib/
o   Everything under crypto/
o   sys/xen/interface
o   sys/sys/elf_common.h

Discussed at: BSDcan
2014-07-07 00:27:09 +00:00
Hans Petter Selasky
604bf9d37e When getting the initial value of numeric tunables use the
getenv_xxx() functions instead of strtoq(), because the getenv_xxx()
functions include wrappers for various postfixes like G/M/K, which
strtoq() doesn't do.
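
A userland sketch of the kind of suffix handling meant here.  The
helper parse_scaled() is hypothetical and only illustrates the idea;
it is not the kernel's getenv_xxx() implementation, and the set of
accepted suffixes (K/M/G/T as powers of 1024) is an assumption.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not a kernel API): parse a number followed by
 * an optional K/M/G/T suffix, each a power of 1024.  Returns 0 and
 * stores the value in *out on success, -1 on malformed input.
 * Overflow of the scaled result is not checked in this sketch.
 */
static int
parse_scaled(const char *s, int64_t *out)
{
	char *end;
	long long v;
	int shift = 0;

	errno = 0;
	v = strtoll(s, &end, 0);
	if (end == s || errno == ERANGE)
		return (-1);

	switch (*end) {
	case 't': case 'T': shift = 40; break;
	case 'g': case 'G': shift = 30; break;
	case 'm': case 'M': shift = 20; break;
	case 'k': case 'K': shift = 10; break;
	case '\0': break;
	default: return (-1);
	}
	/* Nothing may follow the suffix character. */
	if (shift != 0 && end[1] != '\0')
		return (-1);

	*out = (int64_t)v * ((int64_t)1 << shift);
	return (0);
}
```

Plain strtoll()/strtoq() would stop at the suffix and leave the
scaling to every caller, which is what the change avoids.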
2014-07-05 06:12:48 +00:00
Konstantin Belousov
2499a5ccef Micro-manage clang to get the expected inlining for cpu_search().
Mark cpu_search_lowest/cpu_search_highest/cpu_search_both as noinline,
while cpu_search() gets always_inline.  With the attributes set,
cpu_search() is inlined in wrappers, and if()s with constant
conditionals are optimized.

On some tests on many-core machine, the hwpmc reported samples for
cpu_search*() are reduced from 25% total to 9%.

Submitted by:	"Rang, Anton" <anton.rang@isilon.com>
MFC after:	1 week
2014-07-03 11:06:27 +00:00
Marcel Moolenaar
054b57a740 Drop KTR records when we're in the debugger so that the debugger isn't
changing or overwriting the trace buffer. When KTR is enabled for things
like traps or pmap functions, the amount of logging can be substantial.
2014-07-02 22:13:07 +00:00
Ed Maste
969d3cc28b Fix typos in VTY constant names from r268158 2014-07-02 14:47:48 +00:00
Ed Maste
018147eef9 Prefer vt(4) for UEFI boot
The UEFI framebuffer driver vt_efifb requires vt(4), so add a mechanism
for the startup routine to set the preferred console.  This change is
ugly because console init happens very early in the boot, making a
cleaner interface difficult.  This change is intended only to facilitate
the sc(4) / vt(4) transition, and can be reverted once vt(4) is the
default.
2014-07-02 13:24:21 +00:00
Mateusz Guzik
a6bad85e8e Plug gcc warning after r268074 about uninitialized newsigacts
Reported by:	Gary Jennejohn <gljennjohn gmail.com>
2014-07-02 05:45:40 +00:00
Mateusz Guzik
350d51816e Don't call crcopysafe or uifind unnecessarily in execve.
MFC after:	1 week
2014-07-01 09:21:32 +00:00
Mateusz Guzik
d00c8ea429 Perform a lockless check in sigacts_shared.
It is used only during execve (i.e. singlethreaded), so there is no fear
of returning 'not shared' which soon becomes 'shared'.

While here reorganize the code a little to avoid proc lock/unlock in
shared case.

MFC after:	1 week
2014-07-01 06:29:15 +00:00
Adrian Chadd
c445c3c7f6 If we're doing RSS then ensure that the callwheel swi's are CPU pinned. 2014-06-30 04:25:51 +00:00
Hans Petter Selasky
4813ad54f8 Compile fixes:
Remove duplicate "debug_ktr.mask" sysctl definition.
Remove now unused variable from "kern_ktr.c".
This fixes build of "ktr" which was broken by r267961.

Let the default value for "vm_kmem_size_scale" be zero. If it is still
zero after the sysctl has been initialized from "getenv()", the
"kmeminit()" function sets it to the "VM_KMEM_SIZE_MAX" value.
On Sparc64 the "VM_KMEM_SIZE_MAX" macro is not a constant. This
fixes build of Sparc64 which was broken by r267961.

Add a special macro to dynamically create SYSCTL root nodes, because
root nodes have a special parent. This fixes build of existing OFED
module and CANBUS module for pc98 which was broken by r267961.

Add missing "sysctl.h" includes to get the needed sysctl header file
declarations. This is needed after r267961.

MFC after:	2 weeks
2014-06-28 17:36:18 +00:00
Mateusz Guzik
b0bc0cadbe Call fdcloseexec right after fdunshare.
No functional changes.

MFC after:	1 week
2014-06-28 05:51:45 +00:00
Mateusz Guzik
b9d32c36fa Make fdunshare accept only td parameter.
Proc had to match the thread anyway and 2 parameters were inconsistent
with the rest.

MFC after:	1 week
2014-06-28 05:41:53 +00:00
Mateusz Guzik
35778d7aa9 Make sure to always clear p_fd for process getting rid of its filetable.
Filetable can be shared with other processes. Previous code failed to
clear the pointer for all but the last process getting rid of the table.
This is mostly cosmetics.

Get rid of 'This should happen earlier' comment. Clearing the pointer in
this place is fine as consumers can reliably check for files availability
by inspecting fd_refcnt and vnodes availability by NULL-checking them.

MFC after:	1 week
2014-06-28 05:18:03 +00:00
Hans Petter Selasky
6a3287f889 Fix regression issue after r267961. Handle special string case for
SYSCTLs like previously.

MFC after:	2 weeks
Reported by:	several people
2014-06-28 03:59:04 +00:00
Hans Petter Selasky
af3b2549c4 Pull in r267961 and r267973 again. Fix for issues reported will follow. 2014-06-28 03:56:17 +00:00
Glen Barber
37a107a407 Revert r267961, r267973:
These changes prevent sysctl(8) from returning proper output,
such as:

 1) no output from sysctl(8)
 2) erroneously returning ENOMEM with tools like truss(1)
    or uname(1)
 truss: can not get etype: Cannot allocate memory
2014-06-27 22:05:21 +00:00
Marius Strobl
7344ee184b In order to get vt(4) a bit closer to the feature set provided by sc(4),
implement options TERMINAL_{KERN,NORM}_ATTR. These are aliased to
SC_{KERNEL_CONS,NORM}_ATTR and, like the latter, allow changing the
default colors of normal and kernel text respectively.
Note on the naming: Although affecting the output of vt(4), technically
kern/subr_terminal.c is primarily concerned with changing default colors
so it would be inconsistent to term these options VT_{KERN,NORM}_ATTR.
Actually, if the architecture and abstraction of terminal+teken+vt would
be perfect, dev/vt/* wouldn't be touched by this commit at all.

Reviewed by:	emaste
MFC after:	3 days
Sponsored by:	Bally Wulff Games & Entertainment GmbH
2014-06-27 19:57:57 +00:00
Ed Maste
6ac6c9d5f4 Add CTLFLAG_NOFETCH flag; console vty code runs before tunable fetch
Also remove redundant "" assignment for string in BSS.

Submitted by:	hselasky@
2014-06-27 19:07:35 +00:00
Ed Maste
59644098f8 Use a common tunable to choose between vt(4)/sc(4)
With this change and previous work from ray@ it will be possible to put
both in GENERIC, and have one enabled by default, but allow the other to
be selected via the loader.

(The previous implementation had separate kern.vt.disable and
hw.syscons.disable tunables, and would panic if both drivers were
compiled in and neither was explicitly disabled.)

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2014-06-27 17:50:33 +00:00
Hans Petter Selasky
3da1cf1e88 Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating the children of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, since some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading kludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, since malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after:	2 weeks
Sponsored by:	Mellanox Technologies
2014-06-27 16:33:43 +00:00
Mateusz Guzik
de966666a2 Check lower bound of cmsg_len.
If the passed cm->cmsg_len was below the cmsghdr size, the expression:
datalen = (caddr_t)cm + cm->cmsg_len - (caddr_t)data;

would give negative result. However, in practice it would not
result in a crash because the kernel would try to obtain garbage fds
for given process and would error out with EBADF.
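
A userland sketch of such a lower-bound check.  The helper
cmsg_payload_len() and its buflen parameter are hypothetical; the
sketch assumes the payload starts CMSG_LEN(0) bytes into the control
message and is not the code that was committed.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: return the payload length of a control message,
 * or -1 if cmsg_len is too small to cover the header or larger than
 * the buffer holding the message.
 */
static long
cmsg_payload_len(const struct cmsghdr *cm, size_t buflen)
{
	/* Lower bound: the length must at least cover the header. */
	if (cm->cmsg_len < CMSG_LEN(0))
		return (-1);
	/* Upper bound: the message must fit in the supplied buffer. */
	if (cm->cmsg_len > buflen)
		return (-1);
	return ((long)(cm->cmsg_len - CMSG_LEN(0)));
}
```

Without the first test, the subtraction would go negative (or wrap,
for an unsigned length) whenever cmsg_len is smaller than the header.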

PR:		124908
Submitted by:	campbell mumble.net (modified a little)
MFC after:	1 week
2014-06-27 05:04:36 +00:00
Pawel Jakub Dawidek
e16406c7ba Remove duplicated includes.
Submitted by:	Mariusz Zaborski <oshogbo@FreeBSD.org>
2014-06-26 13:57:44 +00:00
Attilio Rao
e989086b1d sysctl subsystem uses sxlocks so avoid setting up dynamic sysctl nodes
before sleepinit() has been fully executed in the SLEEPQUEUE_PROFILING
case.

Sponsored by:	EMC / Isilon storage division
2014-06-24 15:16:55 +00:00
Mateusz Guzik
450570a55e Tidy up fd-related functions called by do_execve
o assert in each one that fdp is not shared
o remove unnecessary NULL checks - all userspace processes have fdtables
and kernel processes cannot execve
o remove comments about the danger of fd_ofiles getting reallocated - fdtable
is not shared and fd_ofiles could be only reallocated if new fd was about to be
added, but if that was possible the code would already be buggy as setugidsafety
work could be undone

MFC after:	1 week
2014-06-23 01:28:18 +00:00
Mateusz Guzik
158627616c Don't take filedesc lock in fdunshare().
We can read refcnt safely and only care if it is equal to 1.

If it could suddenly change from 1 to something bigger the code would be
buggy even in the previous form and transitions from > 1 to 1 are equally racy
and harmless (we copy even though there is no need).
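
The reasoning can be sketched with C11 atomics.  The names
struct fdtable_sketch and fd_table_is_shared() are hypothetical and
the sketch ignores everything else fdunshare() does.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the file descriptor table. */
struct fdtable_sketch {
	atomic_int	refcnt;
};

/*
 * Lockless check: the caller holds one reference itself, so a count
 * of exactly 1 means nobody else can take a new reference through
 * another holder.  Any other value is treated as "shared"; a stale
 * value greater than 1 only costs an unnecessary copy.
 */
static bool
fd_table_is_shared(struct fdtable_sketch *t)
{
	return (atomic_load_explicit(&t->refcnt,
	    memory_order_relaxed) != 1);
}
```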

MFC after:	1 week
2014-06-22 21:37:27 +00:00
Alexander V. Chernikov
811985398d Permit changing cpu mask for cpu set 1 in presence of drivers
binding their threads to particular CPU.

Changing ithread cpu mask is now performed by special cpuset_setithread().
It creates additional cpuset root group on first bind invocation.

No objection:	jhb
Tested by:	hiren
MFC after:	2 weeks
Sponsored by:	Yandex LLC
2014-06-22 11:32:23 +00:00
Mateusz Guzik
adf87ab01c fd: replace fd_nfiles with fd_lastfile where appropriate
fd_lastfile is guaranteed to be the biggest open fd, so when the intent
is to iterate over active fds or lookup one, there is no point in looking
beyond that limit.

A few places are left unpatched for now.

MFC after:	1 week
2014-06-22 01:31:55 +00:00
Mateusz Guzik
0f0b852c73 do_dup: plug redundant adjustment of fd_lastfile
By that time it was already set by fdalloc, or was there in the first place
if fd is replaced.

MFC after:	1 week
2014-06-22 00:53:33 +00:00
Konstantin Belousov
7b81a399a4 In msdosfs_setattr(), add a check for result of the utimes(2)
permissions test, forgotten in r164033.

Refactor the permission checks for utimes(2) into vnode helper
function vn_utimes_perm(9), and simplify its code comparing with the
UFS origin, by writing the call to VOP_ACCESSX only once.  Use the
helper for UFS(5), tmpfs(5), devfs(5) and msdosfs(5).

Reported by:	bde
Reviewed by:	bde, trasz
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-06-17 07:11:00 +00:00
Dmitry Chagin
2dedc1281a Revert r266925 as it can lead to instant panic at fexecve():
To allow to run the interpreter itself add a new ELF branding type.

Pointed out by:	kib, mjg
2014-06-17 05:29:18 +00:00
Attilio Rao
3ae10f7477 - Modify vm_page_unwire() and vm_page_enqueue() to directly accept
the queue where to enqueue pages that are going to be unwired.
- Add stronger checks to the enqueue/dequeue for the pagequeues when
  adding and removing pages to them.

Of course, for unmanaged pages the queue parameter of vm_page_unwire() will
be ignored, just as the active parameter today.
This makes adding new pagequeues quicker.

This change effectively modifies the KPI.  __FreeBSD_version will be,
however, bumped just when the full cache of free pages will be
evicted.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
Tested by:	pho
2014-06-16 18:15:27 +00:00
Konstantin Belousov
2e501b0a9e Use vn_io_fault for the writes from core dumping code. Recursing into
VM due to copyin(9) faulting while VFS locks are held is
deadlock-prone there in the same way as for the write(2) syscall.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-06-15 04:51:53 +00:00
Alexander Motin
781c93d405 Implement simple direct-mapped cache for popular filesystem identifiers to
avoid congestion on global mountlist_mtx mutex in vfs_busyfs(), while
traversing through the list of mount points.

This change significantly improves NFS server scalability, since it had
to do this translation for every request, and the global lock becomes quite
congested.

This code is more optimized for relatively small number of mount points.
On systems with hundreds of active mount points this simple cache may have
many collisions.  But the original traversal code in that case should also
behave much worse, so we are not losing much.
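
A minimal sketch of a direct-mapped cache keyed by filesystem
identifier.  All names, the table size and the hash are assumptions
made for illustration, and the sketch omits the locking and
invalidation on unmount that real code needs.

```c
#include <stddef.h>
#include <stdint.h>

#define	FSID_CACHE_SIZE	256	/* assumed size; must be a power of two */

/* Hypothetical stand-in for struct mount, carrying only the fsid. */
struct mount_sketch {
	uint32_t	fsid[2];
};

static struct mount_sketch *fsid_cache[FSID_CACHE_SIZE];

/* Fold the two fsid words into a slot index. */
static unsigned int
fsid_hash(const uint32_t fsid[2])
{
	uint32_t h = fsid[0] ^ fsid[1];

	h ^= h >> 16;
	h ^= h >> 8;
	return (h & (FSID_CACHE_SIZE - 1));
}

/* Each fsid maps to exactly one slot; a new entry evicts the old one. */
static void
fsid_cache_insert(struct mount_sketch *mp)
{
	fsid_cache[fsid_hash(mp->fsid)] = mp;
}

/* Return the cached mount, or NULL so the caller can walk the list. */
static struct mount_sketch *
fsid_cache_lookup(const uint32_t fsid[2])
{
	struct mount_sketch *mp = fsid_cache[fsid_hash(fsid)];

	if (mp != NULL && mp->fsid[0] == fsid[0] && mp->fsid[1] == fsid[1])
		return (mp);
	return (NULL);
}
```

Two identifiers that hash to the same slot keep evicting each other,
which is the collision cost mentioned above for systems with many
mount points.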

Reviewed by:	attilio
MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2014-06-12 12:43:48 +00:00
Alexander Motin
4f655310bf Remove unneeded mountlist_mtx acquisition from sync_fsync().
All struct mount fields accessed by sync_fsync() are protected by MNT_MTX.
2014-06-11 12:56:49 +00:00
Alexander Motin
eb6d6216c4 Move root_mount_hold() functionality to separate mutex.
It has nothing to share with mutex protecting list of mounted file systems.
2014-06-11 08:14:08 +00:00
Konstantin Belousov
a19c5d3716 Devolatile as needed.
Sponsored by:	The FreeBSD Foundation
MFC after:	13 days
2014-06-09 09:10:31 +00:00
Konstantin Belousov
7f82c6c17f Change the nblock mutex, protecting the needsbuffer buffer deficit
flags, to rwlock.  Lock it in read mode when used from subroutines
called from buffer release code paths.

The needsbuffer is now updated using atomics, while read lock of
nblock prevents losing the wakeups from bufspacewakeup() and
bufcountadd() in getnewbuf_bufd_help().

In several interesting loads, needsbuffer flags are never set, while
buffers are reused quickly.  This causes brelse() and bqrelse() from
different threads to contend on the nblock.  Now they take nblock in
read mode, together with needsbuffer not needing an update, allowing
higher parallelism.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-06-09 03:38:03 +00:00
Alan Cox
78960940fe Refresh a comment. The VM_STACK option was eliminated in r43209.
Sponsored by:	EMC / Isilon Storage Division
2014-06-09 00:15:16 +00:00
Alexander Motin
3345d73ca8 Remove extra branching from r267232.
MFC after:	2 weeks
2014-06-08 19:01:37 +00:00
Alexander Motin
590d636321 Use atomics to modify numvnodes variable.
This mostly avoids lock usage in getnewvnode_[drop_]reserve(),
which reduces the number of global vnode_free_list_mtx mutex acquisitions
from 4 to 2 per NFS request on ZFS, improving SMP scalability.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2014-06-08 15:38:40 +00:00
Konstantin Belousov
a288c757d4 Remove write-only local variable.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-06-08 10:56:25 +00:00
Konstantin Belousov
23f6698fbd Initialize the pbuf counter for directio using SYSINIT, instead of
using a direct hook called from kern_vfs_bio_buffer_alloc().
Mark ffs_rawread.c as requiring both ffs and directio options to be
compiled into the kernel.  Add ffs_rawread.c to the list of ufs.ko
module's sources.

In addition to removing the layering violation, it also
allows linking the kernel when FFS is configured as module and DIRECTIO is
enabled.

One consequence of the change is that ffs_rawread.o is always linked
into the module regardless of the DIRECTIO option.  This is similar to
the option QUOTA and ufs_quota.c.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-06-08 10:55:06 +00:00
Jilles Tjoelker
093e059c7d ktrace: Use designated initializers for the data_lengths array.
In the .o file, this only changes some line numbers (head amd64) because
element 0 is no longer explicitly initialized.

This should make bugs like FreeBSD-SA-14:12.ktrace less likely.
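
A sketch of the technique with made-up record types.  The enum, the
lengths and rec_length() are hypothetical and do not reproduce the
ktrace table.

```c
#include <stddef.h>

/* Hypothetical record types; type 0 is deliberately unused. */
enum rec_type_sketch {
	REC_SYSCALL = 1,
	REC_SYSRET = 2,
	REC_NAMEI = 3,
	REC_TYPE_MAX
};

/*
 * Each length is tied to its index by name, so reordering or adding
 * a type cannot silently shift the other entries.  Indices without an
 * initializer (here, element 0) are zero.
 */
static const size_t rec_lengths[REC_TYPE_MAX] = {
	[REC_SYSCALL] = 16,
	[REC_SYSRET] = 24,
	[REC_NAMEI] = 0,
};

/* Look up a length, rejecting out-of-range types. */
static int
rec_length(int type, size_t *len)
{
	if (type <= 0 || type >= REC_TYPE_MAX)
		return (-1);
	*len = rec_lengths[type];
	return (0);
}
```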

Discussed with:	des
MFC after:	1 week
2014-06-06 14:49:00 +00:00
Davide Italiano
e392e44c27 Convert functions to the new-style format.
Submitted by:	Vijay Singh <vijju.singh@gmail.com> via -hackers
2014-06-05 03:46:46 +00:00
Marcel Moolenaar
62d76917b8 Introduce a procedural interface to the ifnet structure. The new
interface allows the ifnet structure to be defined as an opaque
type in NIC drivers.  This then allows the ifnet structure to be
changed without a need to change or recompile NIC drivers.

Put differently, NIC drivers can be written and compiled once and
be used with different network stack implementations, provided of
course that those network stack implementations have an API and
ABI compatible interface.

This commit introduces the 'if_t' type to replace 'struct ifnet *'
as the type of a network interface. The 'if_t' type is defined as
'void *' to enable the compiler to perform type conversion to
'struct ifnet *' and vice versa where needed and without warnings.
The functions that implement the API are the only functions that
need to have an explicit cast.
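
A small sketch of the pattern.  struct ifnet_sketch and the two
accessors with the _sketch suffix are hypothetical and are not the
functions added by this commit.

```c
#include <stddef.h>

/* Drivers see only this opaque handle. */
typedef void *if_t;

/* Hypothetical stack-private layout; drivers never include this. */
struct ifnet_sketch {
	unsigned int	if_mtu;
};

/* Only the accessor functions cast the handle back to the real type. */
static unsigned int
if_getmtu_sketch(if_t ifp)
{
	return (((struct ifnet_sketch *)ifp)->if_mtu);
}

static void
if_setmtu_sketch(if_t ifp, unsigned int mtu)
{
	((struct ifnet_sketch *)ifp)->if_mtu = mtu;
}
```

Because 'void *' converts implicitly to and from any object pointer
in C, unconverted code that still passes a structure pointer compiles
against these accessors without warnings.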

The MII code has been converted to use the driver API to avoid
unnecessary code churn. Code churn comes from having to work with
both converted and unconverted drivers in correlation with having
callback functions that take an interface. By converting the MII
code first, the callback functions can be defined so that the
compiler will perform the typecasts automatically.

As soon as all drivers have been converted, the if_t type can be
redefined as needed and the API functions can be fixed to not need
an explicit cast.

The immediate beneficiaries of this change are:
1.  Juniper Networks - The network stack implementation in Junos
    is entirely different from FreeBSD's one and this change
    allows Juniper to build "stock" NIC drivers that can be used
    in combination with both the FreeBSD and Junos stacks.
2.  FreeBSD - This change opens the door towards changing ifnet
    and implementing new features and optimizations in the network
    stack without it requiring a change in the many NIC drivers
    FreeBSD has.

Submitted by:	Anuranjan Shukla <anshukla@juniper.net>
Reviewed by:	glebius@
Obtained from:	Juniper Networks, Inc.
2014-06-02 17:54:39 +00:00
Adrian Chadd
924aaf69ff Pin the right thread.
This _was_ right, a last minute suggestion and not enough testing makes
Adrian a bad boy.

Tested:

* igb(4) with RSS patches, by hand verifying each igb(4) taskqueue
  tid from procstat -ka using cpuset -g -t <tid>.
2014-06-01 04:11:05 +00:00
Dmitry Chagin
5f56da1891 To allow to run the interpreter itself add a new ELF branding type.
Allow Linux ABI to run ELF interpreter.

MFC after:	3 days
2014-05-31 15:01:51 +00:00
Gleb Smirnoff
c46713e636 Whitespace only. 2014-05-30 08:22:58 +00:00
Mark Johnston
f2789bd5c7 Commit the rest of the changes that were intended to be part of r266826.
X-MFC-with:	r266826
2014-05-29 01:42:22 +00:00
Don Lewis
5b892e7363 Initialize r_flags the same way in all cases using a sanitized copy of
flags that has several bits cleared. The RF_WANTED and RF_FIRSTSHARE
bits are invalid in this context, and we want to defer setting RF_ACTIVE
in r_flags until later.  This should make rman_get_flags() return
the correct answer in all cases.

Add a KASSERT() to catch callers which incorrectly pass the RF_WANTED
or RF_FIRSTSHARE flags.

Do a strict equality check on the share type bits of flags.  In
particular, do an equality check on RF_PREFETCHABLE.  The previous
code would allow one type of mismatch of RF_PREFETCHABLE but disallow
the other type of mismatch.  Also, ignore the RF_ALIGNMENT_MASK
bits since alignment validity should be handled by the amask check.
This field contains an integer value, but previous code did a strange
bitwise comparison on it.

Leave the original value of flags unmolested as a minor debug aid.

Change the start+amask overflow check to a KASSERT() since it is just
meant to catch a highly unlikely programming error in the caller.

Reviewed by:	jhb
MFC after:	1 month
2014-05-28 16:57:17 +00:00
Adrian Chadd
5a6f0eee47 Add a new taskqueue setup method that takes a cpuid to pin the
taskqueue worker thread(s) to.

For now it isn't a taskqueue/taskthread error to fail to pin
to the given cpuid.

Thanks to rpaulo@, kib@ and jhb@ for feedback.

Tested:

* igb(4), with local RSS patches to pin taskqueues.

TODO:

* ask the doc team for help in documenting the new API call.
* add a taskqueue_start_threads_cpuset() method which takes
  a cpuset_t - but this may require a bunch of surgery to
  bring cpuset_t into scope.
2014-05-24 20:37:15 +00:00
Benjamin Kaduk
bf09eca2cb Check for mismatched vref()/vdrop()
Assert that the hold count has not fallen below the use count, a situation
that would only happen when a vref() (or similar) is erroneously paired
with a vdrop().  This situation has not been observed in the wild, but
could be helpful for someone implementing a new filesystem.
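
The invariant can be illustrated with a small userland model. This is only a sketch: the struct and the model_* helpers are hypothetical names, and the assumption that vref() takes both a use and a hold reference while vdrop() releases only a hold is a simplification of the real vnode code.

```c
#include <assert.h>

/* Hypothetical stand-in for the two vnode reference counts. */
struct vnode_counts {
    int holdcnt;    /* keeps the vnode structure from being recycled */
    int usecount;   /* active users; each one is assumed to also hold */
};

/* The invariant being asserted: holds never fall below uses. */
int
counts_consistent(const struct vnode_counts *vc)
{
    return (vc->holdcnt >= vc->usecount);
}

/* Model of vref(): take a use reference and the hold that backs it. */
void
model_vref(struct vnode_counts *vc)
{
    vc->usecount++;
    vc->holdcnt++;
}

/* Model of vrele(): the correct partner of vref(), drops both counts. */
void
model_vrele(struct vnode_counts *vc)
{
    assert(vc->usecount > 0 && vc->holdcnt > 0);
    vc->usecount--;
    vc->holdcnt--;
}

/* Model of vdrop(): releases only a hold reference. */
void
model_vdrop(struct vnode_counts *vc)
{
    vc->holdcnt--;
}
```

In this model, model_vref() followed by model_vdrop() leaves usecount above holdcnt, which is the mismatch the new assertion is meant to catch.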

Reviewed by:	kib
Approved by:	hrs (mentor)
2014-05-21 03:11:27 +00:00
Konstantin Belousov
7032434e98 When exec_new_vmspace() decides that current vmspace cannot be reused
on execve(2), it calls vmspace_exec(), which frees the current
vmspace.  The thread executing an exec syscall gets new vmspace
assigned, and old vmspace is freed if only referenced by the current
process.  The free operation includes pmap_release(), which
de-constructs the paging structures used by hardware.

If the calling process is multithreaded, other threads are suspended
in the thread_suspend_check(), and need to be unsuspended and run to
be able to exit on successful exec.  Now, since the old vmspace is
destroyed, paging structures are invalid, threads are resumed on the
non-existent pmaps (page tables), which leads to triple fault on x86.

To fix, postpone the free of old vmspace until the threads are resumed
and exited.  To avoid modifications to all image activators all of
which use exec_new_vmspace(), memoize the current (old) vmspace in
kern_execve(), and notify it about the need to call vmspace_free()
with a thread-private flag TDP_EXECVMSPC.

http://bugs.debian.org/743141

Reported by:	Ivo De Decker <ivo.dedecker@ugent.be> through secteam
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2014-05-20 09:19:35 +00:00
Don Lewis
c201b03fc3 Slightly restructure the final loop in rman_reserve_resource_bound().
Replace the existing loop termination test with a similar
condition from the nested "if" that may terminate the loop a bit
sooner, but still not too early.   This condition can then be removed
from the nested "if".  Relocate an operator to be style(9) compliant.

MFC after:	3 days
2014-05-19 04:44:27 +00:00
Edward Tomasz Napierala
fbaadda60b Initialize loginclass mutex using MTX_SYSINIT instead of using SI_SUB_CPU.
Suggested by:	rwatson@
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2014-05-14 09:03:02 +00:00
Don Lewis
11e104c50f Be even more paranoid about overflow.
Requested by:	ache
2014-05-12 20:22:42 +00:00
Don Lewis
11ada7013a Nuke a couple of unnecessary assignments. Nothing uses the values of rstart
and rend after this point.

MFC after:	1 week
2014-05-12 17:56:52 +00:00
Jilles Tjoelker
857ce8a246 accept(),accept4(): Don't set *addrlen = 0 on [ECONNABORTED].
If the underlying protocol reported an error (e.g. because a connection was
closed while waiting in the queue), this error was also indicated by
returning a zero-length address. For all other kinds of errors (e.g.
[EAGAIN], [ENFILE], [EMFILE]), *addrlen is unmodified and there are
successful cases where a zero-length address is returned (e.g. a connection
from an unbound Unix-domain socket), so this error indication is not
reliable.

As reported in Austin Group bug #836, modifying *addrlen on error may cause
subtle bugs if applications retry the call without resetting *addrlen.
2014-05-11 21:21:14 +00:00
Colin Percival
760f4dec67 In cf_get_method, when we don't already know what clock speed the CPU is
running at, guess the nearest value instead of looking for a value within
25 MHz of the observed frequency.

Prior to this change, if a system booted with Intel Turbo Boost enabled,
the dev.cpu.0.freq sysctl was nonfunctional, since the ACPI-reported
frequency for Turbo Boost states does not match the actual clock frequency
(and thus no levels are within 25 MHz of the observed frequency) and the
current performance level is read before a new level is set.
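
The nearest-value guess can be sketched as follows. This is not the cf_get_method code; nearest_level() and the plain int array of levels are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical sketch: return the index of the level whose reported
 * frequency is closest to the observed one.  Unlike a "within 25 MHz"
 * test, this always yields a level, even when the observed Turbo Boost
 * frequency is far from every reported value.
 */
size_t
nearest_level(const int *levels_mhz, size_t n, int observed_mhz)
{
    size_t best = 0;

    assert(n > 0);
    for (size_t i = 1; i < n; i++) {
        if (abs(levels_mhz[i] - observed_mhz) <
            abs(levels_mhz[best] - observed_mhz))
            best = i;
    }
    return (best);
}
```

With levels of 2401, 2400, 1800 and 1200 MHz and an observed 3100 MHz, no level is within 25 MHz, but the nearest one (2401) is still selected.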

MFC after:	3 days
Relnotes:	Bug fix in power management on CPUs with Intel Turbo Boost
2014-05-11 10:32:58 +00:00
Adrian Chadd
ac75ee9fa3 Add in support to optionally pin the swi threads.
Under enough load, the swi's can actually be preempted and migrated
to other currently free cores.  When doing RSS experiments, this led
to the per-CPU TCP timers not lining up any more with the RX CPU said
flows were ending up on, leading to increased lock contention.

Since there was a little pushback on flipping them on by default,
I've left the default at "don't pin."

The other less obvious problem here is that the default swi
is also the same as the destination swi for CPU #0.  So if one
pins the swi on CPU #0, there's no default floating swi.

A nice future project would be to create a separate swi for
the "default" floating swi, as well as per-CPU swis that are
(optionally) pinned.

Tested:

* parallel TCP tests (2 x 1g unfortunately for now);
  CPU: Intel(R) Xeon(R) CPU E5-2650

Note:

This is based on some initial investigation into RSS/TCP stack lock
contention on FreeBSD-HEAD whilst at Netflix in January 2014.
2014-05-10 00:53:36 +00:00
Don Lewis
1237b6d9ed Avoid unsigned integer overflow which can cause
rman_reserve_resource_bound() to return incorrect results.

Continue the initial search until the first viable region is found.

Add a comment to explain the search termination test.
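
The kind of overflow involved can be shown with a generic range check. This is a sketch of the general technique, not the rman code itself; both function names and the use of uint64_t are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Naive check: start + count - 1 can wrap around and look small. */
int
range_fits_naive(uint64_t start, uint64_t end, uint64_t count)
{
    return (start + count - 1 <= end);
}

/*
 * Overflow-safe check: compare sizes instead of end addresses, so no
 * intermediate value can exceed the type's range.
 */
int
range_fits(uint64_t start, uint64_t end, uint64_t count)
{
    assert(count > 0);
    if (start > end)
        return (0);
    return (count - 1 <= end - start);
}
```

For start = UINT64_MAX - 1, end = UINT64_MAX and count = 4, the naive form wraps and accepts the range, while the safe form rejects it.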

PR:		kern/188534
Reviewed by:	jhb (previous version)
MFC after:	1 week
2014-05-05 15:59:31 +00:00
Mateusz Guzik
f2b1eaec33 Request a non-exiting process in sysctl_kern_proc_{o,}filedesc
This fixes a race with exit1 freeing p_textvp.

Suggested by:	kib
MFC after:	1 week
2014-05-02 21:55:09 +00:00
Christian Brueffer
ed472910ba Free resources in an error case.
CID:		1018947
Found with:	Coverity Prevent(tm)
MFC after:	1 week
2014-05-02 21:34:17 +00:00
Robert Watson
a2496f6e01 Garbage collect mtxpool_lockbuilder, the mutex pool historically used
for lockmgr and sx interlocks, but unused since optimised versions of
those sleep locks were introduced.  This will save a (quite) small
amount of memory in all kernel configurations.  The sleep mutex pool is
retained as it is used for 'struct bio' and several other consumers.

Discussed with:	jhb
MFC after:	3 days
2014-05-02 07:57:40 +00:00
Mateusz Guzik
183870cf75 Ignore the error from pipespace_new when creating a pipe.
It can fail if the pipe map is exhausted (as a result of too many pipes being
created), but it is not fatal and could be provoked by unprivileged users. The
only consequence is worse performance with the given pipe.

Reported by:	ivoras
Suggested by:	kib
MFC after:	1 week
2014-05-02 00:52:13 +00:00
Brooks Davis
ee9bc59982 Fix a 2038 bug.
If time_t is 64-bit (i.e. isn't 32-bit) allow any value of year, not
just years less than 2038.

Don't bother fixing the underflow in the case of years before 1903.
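
The shape of the check can be sketched like this. year_representable() is a hypothetical name, and the sketch only models the upper bound described above; it does not attempt the pre-1903 case.

```c
#include <assert.h>
#include <time.h>

/*
 * Hypothetical sketch: only reject years past 2037 when time_t is too
 * narrow to represent them.
 */
int
year_representable(int year)
{
    assert(sizeof(time_t) >= 4);
    if (sizeof(time_t) > 4)
        return (1);
    return (year < 2038);
}
```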

MFC after:	1 week
Sponsored by:	DARPA, AFRL
2014-05-01 22:28:14 +00:00
Marius Strobl
0d13d5fce2 Given that as of r258002 the last external user is gone, make sched_lock
static.
2014-04-29 20:51:57 +00:00
Peter Grehan
d6cd193e5e Bump WITNESS_PENDLIST by MAXCPU to account for the
pmap pvlist locks which are scaled by MAXCPU.

This allows an amd64 system to boot with MAXCPU set
to 256, which is currently FreeBSD's hard limit without
x2apic support.

Compile-tested for other arch's.

PR:	185831
Discussed with:		jhb
MFC after:	3 weeks
2014-04-29 17:22:29 +00:00
Brooks Davis
a3fe2bc59e Revert r263754, re-adding support for hw.bus.devctl_disable. Breaking
old devd's and thus hosts that get IP addresses from DHCP was too much
of a POLA violation.

The sysctl may be removed again after r263758 has been merged to at
least stable/9 and stable/10, and releases have been cut from those
branches.

Discussed with:	mjg
Reported by:	theraven, rwatson
2014-04-28 20:38:08 +00:00
Scott Long
60ad8150c7 Retire smp_active. It was racy and caused demonstrated problems with
the cpufreq code.  Replace its use with smp_started.  There's at least
one userland tool that still looks at the kern.smp.active sysctl, so
preserve it but point it to smp_started as well.

Discussed with: peter, jhb
MFC after: 3 days
Obtained from: Netflix
2014-04-26 20:27:54 +00:00
Bryan Drewery
2809a6dfa4 Fix grammar error and trailing newline.
Submitted by:	danfe
MFC after:	3 days
2014-04-23 02:21:17 +00:00
Ian Lepore
6afc723819 Fix a comment typo; conversion tables are for leap years, not leap seconds. 2014-04-20 13:37:22 +00:00
Konstantin Belousov
beb4f781a5 Fix typo.
MFC after:	3 days
2014-04-17 18:13:23 +00:00
Navdeep Parhar
c7a3775adf Do not set M_BESTFIT if a strategy has already been provided. This
fixes problems when using M_FIRSTFIT.
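
A sketch of the defaulting logic, assuming the strategy is carried in two flag bits. The SK_ constants and their values are hypothetical and do not match the kernel's definitions.

```c
#include <assert.h>

/* Hypothetical flag values, for illustration only. */
#define SK_M_FIRSTFIT 0x1000
#define SK_M_BESTFIT  0x2000

/* Apply the default strategy only when the caller chose none. */
int
default_strategy(int flags)
{
    if ((flags & (SK_M_BESTFIT | SK_M_FIRSTFIT)) == 0)
        flags |= SK_M_BESTFIT;
    assert((flags & (SK_M_BESTFIT | SK_M_FIRSTFIT)) != 0);
    return (flags);
}
```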

Reviewed by:	jeff@
MFC after:	1 week
2014-04-16 21:39:43 +00:00
Alexander Motin
d10a1df8d7 Fix VIRTUAL and PROF interval timers for short intervals, broken at r247903.
Due to the way those timers are implemented, we can't handle very short
intervals.  In addition, the mentioned patch caused math overflows
for short intervals.  To avoid that, round those intervals to 1 tick.

PR:		kern/187668
MFC after:	1 week
2014-04-16 18:37:46 +00:00
Christian Brueffer
83a396ce95 Refine r264422: set buf to NULL only when we don't allocate memory,
and free buf unconditionally.

Requested by:	kib
MFC after:	1 week
2014-04-14 21:02:20 +00:00
Christian Brueffer
a1761d7335 Free buf after usage.
CID:		1199377
Found with:	Coverity Prevent(tm)
MFC after:	1 week
2014-04-13 21:23:15 +00:00
Davide Italiano
4bc38a5ab0 Hide internal details of sbintime_t implementation wrapping INT64_MAX into
SBT_MAX, to make it more robust in case internal type representation will
change in the future. All the consumers were migrated to SBT_MAX and
every new consumer (if any) should from now use this interface.
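
A consumer written against the named limit might look like the sketch below. The sk_ prefixed type, macro and function are hypothetical stand-ins so that the example compiles in userland; only the idea of comparing against a named maximum rather than INT64_MAX comes from the change.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userland stand-ins for sbintime_t and SBT_MAX. */
typedef int64_t sk_sbintime_t;
#define SK_SBT_MAX INT64_MAX

/*
 * Saturating add: the caller never mentions the underlying integer
 * type, so a change of representation only touches the definitions.
 */
sk_sbintime_t
sk_sbt_add_sat(sk_sbintime_t a, sk_sbintime_t b)
{
    assert(a >= 0 && b >= 0);
    if (b > SK_SBT_MAX - a)
        return (SK_SBT_MAX);
    return (a + b);
}
```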

Requested by:	bapt, jmg, Ryan Lortie (implicitly)
Reviewed by:	mav, bde
2014-04-12 23:29:29 +00:00
Bryan Drewery
97c0df733f Use proper MFSNAMELEN for fs type.
MFC after:	2 weeks
Reviewed by:	rodrigc
Also spotted by:	ambrisko
2014-04-12 21:39:17 +00:00
David Xu
7d62aec6fe Add kqueue support for devctl.
Reviewed by:	kib,mjg
2014-04-10 02:30:51 +00:00
Sean Bruno
b888dae4c8 sys/kern/imgact_binmisc.c -- free the right pointer mask vs magic
sys/sys/imgact_binmisc.h -- cleanup white space tabs vs spaces
                          -- remove stray " in comment

Submitted by:	jmallett@
2014-04-08 22:12:01 +00:00
Sean Bruno
6d75644981 Add Stacey Son's binary activation patches that allow remapping of
execution to an emulation program via parsing of ELF header information.

With this kernel module and userland tool, poudriere is able to build
ports packages via the QEMU userland tools (or another emulator program)
in a different architecture chroot, e.g. TARGET=mips TARGET_ARCH=mips

I'm not connecting this to GENERIC for obvious reasons, but this should
allow the kernel module to be built by default and enable the building
of the userland tool (which automatically loads the kernel module).

Submitted by:	sson@
Reviewed by:	jhb@
2014-04-08 20:10:22 +00:00
Aleksandr Rybalko
19fbe1ea90 Do not fill screen, while muted.
Sponsored by:	The FreeBSD Foundation
2014-04-07 22:37:13 +00:00
Ed Schouten
8f5b107b84 Thinko: don't forget to apply 'howto' in case init(8) isn't running. 2014-04-07 21:18:12 +00:00
Ed Schouten
912d59378b Clean up shutdown_nice(). Just send the right signal to init(8).
Right now, init(8) cannot distinguish between an ACPI power button press
or a Ctrl+Alt+Del sequence on the keyboard. This is because
shutdown_nice() sends SIGINT to init(8) unconditionally, but later
modifies the arguments to reboot(2) to force a certain behaviour.

Instead of doing this, patch up the code to just forward the appropriate
signal to userspace. SIGUSR1 and SIGUSR2 can already be used to halt the
system.
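
The forwarding can be sketched as a mapping from the requested action to a signal. The SK_ flag values are hypothetical, and the assignment of SIGUSR1 to halt and SIGUSR2 to power-off is an assumption about init(8) rather than something stated in this message.

```c
#include <assert.h>
#include <signal.h>

/* Hypothetical flag values standing in for the reboot(2) howto bits. */
#define SK_RB_HALT     0x0008
#define SK_RB_POWEROFF 0x4000

/* Pick the signal to send to init(8) for a given shutdown request. */
int
signal_for_howto(int howto)
{
    assert(howto >= 0);
    if (howto & SK_RB_POWEROFF)
        return (SIGUSR2);   /* assumed: halt and power off */
    if (howto & SK_RB_HALT)
        return (SIGUSR1);   /* assumed: halt */
    return (SIGINT);        /* previous unconditional behaviour */
}
```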

While there, move waittime to the function where it's used: kern_reboot().
2014-04-07 21:11:29 +00:00
Ed Schouten
38219d6acd Implement kqueue(2) for procdesc(4).
kqueue(2) already supports EVFILT_PROC. Add an EVFILT_PROCDESC that
behaves the same, but operates on a procdesc(4) instead. Only implement
NOTE_EXIT for now. The nice thing about NOTE_EXIT is that it also
returns the exit status of the process, meaning that we can now obtain
this value, even if pdwait4(2) is still unimplemented.

Notes:

- Simply reuse EVFILT_NETDEV for EVFILT_PROCDESC. As both of these will
  be used on totally different descriptor types, this should not clash.

- Let procdesc_kqops_event() reuse the same structure as filt_proc().
  The only difference is that procdesc_kqops_event() should also be able
  to deal with the case where the process was already terminated after
  registration. Simply test this when hint == 0.

- Fix some style(9) issues in filt_proc() to keep it consistent with the
  newly added procdesc_kqops_event().

- Save the exit status of the process in pd->pd_xstat, as we cannot pick
  up the proctree_lock from within procdesc_kqops_event().

Discussed on:	arch@
Reviewed by:	kib@
2014-04-07 18:10:49 +00:00
Ed Schouten
d7a39436e5 Fix a typo. The function name is pdfork; not pfork. 2014-04-06 20:20:07 +00:00
Ed Schouten
a90feb39a2 Nit: fix locking of p->p_state in procdesc_close().
According to <sys/proc.h>, this field needs to be locked with either the
p_mtx or the p_slock. In this case the damage was quite small. Instead
of being reaped, the process would just be reparented to init, so it
could be reaped from there.
2014-04-06 20:00:42 +00:00
Konstantin Belousov
14fcb4b4f8 Use realloc(9) instead of doing the reallocation inline.
Submitted by:	bde
MFC after:	1 week
2014-04-05 20:44:52 +00:00
Dmitry Chagin
6b57eff4c0 Prevent alq from panicking when an invalid alq_file path is specified.
MFC after:	1 week
2014-04-05 16:54:47 +00:00
Konstantin Belousov
1a5edcf8ea When KN_INFLUX is set on the knote due to kqueue_register() or
kqueue_scan() unlocking the kqueue to call f_event, knote() or
knote_fork() should not skip the knote.  The knote is not going to
disappear during the influx time, and the mutual exclusion between
scan and knote() is ensured by both code paths taking knlist lock.
The race appears since knlist lock is before kq lock, so KN_INFLUX
must be set, kq lock must be dropped and only then knlist lock can be
taken.  The window between kq unlock and knlist lock causes lost
events.

Add a flag KN_SCAN to indicate that KN_INFLUX is set in a manner safe
for the knote(), and check for it to ignore KN_INFLUX in the knote*()
as needed.  Also, in knote(), remove the lockless check for the
KN_INFLUX flag, which could also result in the lost notification.

Reported and tested by:	Kohji Okuno <okuno.kohji@jp.panasonic.com>
Discussed with:	jmg
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-04-05 14:09:16 +00:00
Ed Maste
b7bd677fe1 Initialise m_pkthdr via bzero instead of explicitly zeroing each member
Sponsored by:	The FreeBSD Foundation
2014-04-04 21:09:06 +00:00
David Xu
5055c92801 Fix SIGIO delivery. Use fsetown() to handle file descriptor owner
ioctl and use pgsigio() to send SIGIO.

Submitted by:	truckman
Reviewed by:	mjg
2014-04-04 12:31:13 +00:00
Mateusz Guzik
210a5d1689 Garbage collect fdavail.
It rarely returns an error and fdallocn handles the failure of fdalloc
just fine.
2014-04-04 05:07:36 +00:00
Ian Lepore
9e24f23880 Fix build breakage. Apparently all ARM configs build kern_et.c, but only a
few of them also build kern_clocksource.c.  That strikes me as insane, but
maybe there's a good reason for it.  Until I figure that out, un-break
the build by not referencing functions in kern_clocksource if NO_EVENTTIMERS
is defined.
2014-04-02 17:34:17 +00:00
Ian Lepore
cfc4b56b57 Add support for event timers whose clock frequency can change while running. 2014-04-02 15:56:11 +00:00
Mateusz Guzik
0ab7a1f396 Document a known problem with handling the process intended to receive
SIGIO in /dev/devctl.

Suggested by:	adrian
MFC after:	6 days
2014-03-25 23:30:35 +00:00
Mateusz Guzik
88b7c833d2 Remove long obsolete sysctl hw.bus.devctl_disable.
Suggested by:	imp
Relnotes:	yes
2014-03-25 23:19:45 +00:00
Mateusz Guzik
6abaea7d58 Remove lockless check in devopen; while correct, it does not make much sense.
Suggested by:	imp
MFC after:	6 days
2014-03-25 23:13:46 +00:00
Mateusz Guzik
37dbba2a44 Make /dev/devctl mpsafe.
MFC after:	1 week
2014-03-25 03:28:58 +00:00
Maksim Yevmenkin
b646225a13 Change default permissions on /dev/devstat. While I'm here, remove
D_NEEDGIANT flag

Submitted by:	jhb
Reviewed by:	jhb, scottl, rwatson, delphij, phk
MFC after:	1 week
2014-03-24 18:13:41 +00:00
Neel Natu
d6543c678c Don't lose track of the KTR entries copied from 'ktr_buf_init[]' to the
dynamically allocated 'ktr_buf[]'.

The memcpy arranges 'ktr_buf[]' such that the latest KTR entry is at
'KTR_BOOT_ENTRIES - 1'.
2014-03-22 22:35:57 +00:00
Bryan Drewery
44f1c91610 Rename global cnt to vm_cnt to avoid shadowing.
To reduce the diff, the struct pcpu.cnt field was not renamed, so
PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in
kvm(3) and vmstat(8). The goal was to not affect externally used KPI.

Bump __FreeBSD_version in case some out-of-tree module/code relies on the
global cnt variable.

Exp-run revealed no ports using it directly.

No objection from:	arch@
Sponsored by:	EMC / Isilon Storage Division
2014-03-22 10:26:09 +00:00
Mateusz Guzik
f804336026 Mark the following sysctls as MPSAFE:
kern.file
kern.proc.filedesc
kern.proc.ofiledesc

MFC after:	7 days
2014-03-21 19:12:05 +00:00
Konstantin Belousov
52f3c44efe Fix two issues with /dev/mem access on amd64, both causing kernel page
faults.

First, accesses to the direct map region should check the limit up to
which the direct map is instantiated.

Second, for accesses to the kernel map, success returned from
kernacc(9) does not guarantee that a subsequent attempt to read or write
to the checked address succeeds, since another thread might invalidate
the address in the meantime.  Add a new thread private flag TDP_DEVMEMIO,
which instructs vm_fault() to return an error when a fault happens on a
MAP_ENTRY_NOFAULT entry, instead of panicking.  The trap handler would
then see a page fault from the access, and recover in the normal way, making
/dev/mem access safer.

Remove GIANT_REQUIRED from the amd64 memrw(), since it is not needed
and having Giant locked does not solve issues for amd64.

Note that at least the second issue exists on other architectures, and
requires similar patching for md code.

Reported and tested by:	clusteradm (gjb, sbruno)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-03-21 14:25:09 +00:00
Mateusz Guzik
4c73e705a5 Take filedesc lock only for reading when allocating new fdtable.
Code populating the table does this already.

MFC after:	1 week
2014-03-21 01:34:19 +00:00
Attilio Rao
3198603edd Fix comments.
Sponsored by:	EMC / Isilon Storage Division
2014-03-19 12:45:40 +00:00
Konstantin Belousov
88b124cede Make the array pointed to by AT_PAGESIZES auxv properly aligned.
Also, remove the expression which calculated the location of the
strings for a new image and had grown over time to be
incomprehensible.  Instead, calculate the offsets in steps, which
also makes fixing the alignments much cleaner.

Reported and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-03-19 12:35:04 +00:00
Attilio Rao
c149e542a5 Fix GENERIC build. 2014-03-19 00:38:27 +00:00
Attilio Rao
4f11a684ff Regen per r263318.
Sponsored by:	EMC / Isilon storage division
2014-03-18 21:34:11 +00:00
Attilio Rao
ce42e79310 Remove dead code from umtx support:
- Retire long time unused (basically always unused) sys__umtx_lock()
  and sys__umtx_unlock() syscalls
- struct umtx and their supporting definitions
- UMUTEX_ERROR_CHECK flag
- Retire UMTX_OP_LOCK/UMTX_OP_UNLOCK from _umtx_op() syscall

__FreeBSD_version is not bumped yet because it is expected that further
breakages to the umtx interface will follow up in the next days.
However there will be a final bump when necessary.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	jhb
2014-03-18 21:32:03 +00:00
Robert Watson
4a14441044 Update kernel inclusions of capability.h to use capsicum.h instead; some
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.

MFC after:	3 weeks
2014-03-16 10:55:57 +00:00
John-Mark Gurney
6f2b769cac change td_retval into a union w/ off_t, with defines to mask the
change...  This eliminates a cast, and also forces td_retval
(often 2 32-bit registers) to be aligned so that off_t's can be
stored there on arches with strict alignment requirements like
armeb (AVILA)...  On i386, this doesn't change alignment, and on
amd64 it doesn't either, as register_t is already 64bits...

This will also prevent future breakage due to people adding additional
fields to the struct...
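
A userland sketch of the layout is shown below. sk_register_t, struct sk_thread and sk_set_off() are hypothetical stand-ins, and the member names are illustrative rather than a copy of the kernel definitions.

```c
#include <assert.h>
#include <stdalign.h>
#include <sys/types.h>

/* Hypothetical stand-in for the machine register type. */
typedef long sk_register_t;

struct sk_thread {
    union {
        sk_register_t tdu_retval[2];    /* the usual two return values */
        off_t         tdu_off;          /* forces off_t alignment */
    } td_uretoff;
};

/* The define masks the change: existing td_retval users still compile. */
#define td_retval td_uretoff.tdu_retval

/* The union makes the struct at least as aligned as off_t. */
static_assert(alignof(struct sk_thread) >= alignof(off_t),
    "return value storage must be aligned for off_t");

/* Store a 64-bit offset without casting through a register pointer. */
void
sk_set_off(struct sk_thread *td, off_t off)
{
    td->td_uretoff.tdu_off = off;
}
```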

This gets AVILA booting a bit farther...

Reviewed by:	bde
2014-03-16 00:53:40 +00:00
Gleb Smirnoff
45c203fce2 Remove AppleTalk support.
AppleTalk was a network transport protocol for Apple Macintosh devices
in the 80s and 90s. Starting with Mac OS X in 2000, AppleTalk became
a legacy protocol and TCP/IP the primary networking protocol. The last
Mac OS X release to support AppleTalk came out in 2009. The same year
routing equipment vendors (namely Cisco) ended their support.

Thus, AppleTalk won't be supported in FreeBSD 11.0-RELEASE.
2014-03-14 06:29:43 +00:00
Gleb Smirnoff
2c284d9395 Remove IPX support.
IPX was a network transport protocol in Novell's NetWare network operating
system from the late 80s through the 90s. NetWare itself switched to TCP/IP
as its default transport in 1998. Later, in this century, Novell Open
Enterprise Server became the successor of Novell NetWare. The last release
that claimed to still support IPX was OES 2 in 2007. Routing equipment
vendors (e.g. Cisco) discontinued support for IPX in 2011.

Thus, IPX won't be supported in FreeBSD 11.0-RELEASE.
2014-03-14 02:58:48 +00:00
Bryan Drewery
ae8959dd57 Combine similar code from vprintf(9) and log(9).
MFC after:	2 weeks
2014-03-14 01:17:11 +00:00
Alan Somers
c2090e73d7 Replace 4.4BSD Lite's unix domain socket backpressure hack with a cleaner
mechanism, based on the new SB_STOP sockbuf flag.  The old hack dynamically
changed the sending sockbuf's high water mark whenever adding or removing
data from the receiving sockbuf.  It worked for stream sockets, but it never
worked for SOCK_SEQPACKET sockets because of their atomic nature.  If the
sockbuf was partially full, it might return EMSGSIZE instead of blocking.

The new solution is based on DragonFlyBSD's fix from commit
3a6117bbe0ed6a87605c1e43e12a1438d8844380 on 2008-05-27.  It adds an SB_STOP
flag to sockbufs.  Whenever uipc_send surpasses the socket's size limit, it
sets SB_STOP on the sending sockbuf.  sbspace() will then return 0 for that
sockbuf, causing sosend_generic and friends to block.  uipc_rcvd will
likewise clear SB_STOP.  There are two fringe benefits: uipc_{send,rcvd} no
longer need to call chgsbsize() on every send and receive because they don't
change the sockbuf's high water mark.  Also, uipc_sense no longer needs to
acquire the UIPC linkage lock, because it's simpler to compute the
st_blksizes.

There is one drawback: since sbspace() will only ever return 0 or the
maximum, sosend_generic will allow the sockbuf to exceed its nominal maximum
size by at most one packet of size less than the max.  I don't think that's
a serious problem.  In fact, I'm not even positive that FreeBSD guarantees a
socket will always stay within its nominal size limit.
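
The mechanism can be modelled in a few lines. Everything prefixed sk_ or SK_ is hypothetical, the flag value is made up, and the struct keeps only the fields needed to show how the stop flag forces the space calculation to zero.

```c
#include <assert.h>
#include <stddef.h>

#define SK_SB_STOP 0x1000   /* hypothetical flag value */

/* Minimal stand-in for a socket buffer. */
struct sk_sockbuf {
    unsigned int hiwat; /* nominal size limit */
    unsigned int cc;    /* bytes currently queued */
    int          flags;
};

/* Space calculation: a stopped buffer reports no room at all. */
long
sk_sbspace(const struct sk_sockbuf *sb)
{
    assert(sb != NULL);
    if (sb->flags & SK_SB_STOP)
        return (0);
    return ((long)sb->hiwat - (long)sb->cc);
}

/* Send side: stop the sender once the receiver is at or over its limit. */
void
sk_after_send(struct sk_sockbuf *snd, const struct sk_sockbuf *peer_rcv)
{
    if (peer_rcv->cc >= peer_rcv->hiwat)
        snd->flags |= SK_SB_STOP;
}

/* Receive side: let the sender continue once the receiver has drained. */
void
sk_after_rcvd(struct sk_sockbuf *snd, const struct sk_sockbuf *peer_rcv)
{
    if (peer_rcv->cc < peer_rcv->hiwat)
        snd->flags &= ~SK_SB_STOP;
}
```

Because the sender's high water mark is never touched in this model, the space it reports is either zero or the full limit, which matches the drawback described above.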

sys/sys/sockbuf.h
	Add the SB_STOP flag and adjust sbspace()

sys/sys/unpcb.h
	Delete the obsolete unp_cc and unp_mbcnt fields from struct unpcb.

sys/kern/uipc_usrreq.c
	Adjust uipc_rcvd, uipc_send, and uipc_sense to use the SB_STOP
	backpressure mechanism.  Removing obsolete unpcb fields from
	db_show_unpcb.

tests/sys/kern/unix_seqpacket_test.c
	Clear expected failures from ATF.

Obtained from:	DragonFly BSD
PR:		kern/185812
Reviewed by:	silence from freebsd-net@ and rwatson@
MFC after:	3 weeks
Sponsored by:	Spectra Logic Corporation
2014-03-13 18:42:12 +00:00
Konstantin Belousov
cee9542d51 Use correct types for sizeof() in the calculations for the malloc(9) sizes [1].
While there, remove unneeded checks for failed allocations with M_WAITOK flag.

Submitted by:	Conrad Meyer <cemeyer@uw.edu> [1]
MFC after:	1 week
2014-03-12 10:25:26 +00:00
Konstantin Belousov
9d2437a6f5 The auio structure is only initialized when the vnode is symlink,
avoid reading from it otherwise.

Submitted by:	Conrad Meyer <cemeyer@uw.edu>
MFC after:	1 week
2014-03-12 10:23:51 +00:00
Jeff Roberson
8bc713f6c5 - Make runq_steal_from more aggressive. Previously it would examine only
   a single priority queue.  If that queue had a thread or threads which
   could not be migrated we would fail to steal load.  This could cause
   starvation in situations where cores are idle.

Submitted by:	Doug Kilpatrick <dkilpatrick@isilon.com>
Tested by:	pho
Reviewed by:	mav
Sponsored by:	EMC / Isilon Storage Division
2014-03-08 00:35:06 +00:00
Alan Somers
74107e870a Partial revert of change 262914. I screwed up subversion syntax with
perforce syntax and committed some unrelated files.  Only devd files
should've been committed.

Reported by: 	imp
Pointy hat to:	asomers
MFC after:	3 weeks
X-MFC-With:	r262914
2014-03-07 23:40:36 +00:00
Alan Somers
6a2ae0eb16 sbin/devd/devd.8
sbin/devd/devd.cc
	Add a -q flag to devd that will suppress syslog logging at
	LOG_NOTICE or below.

Requested by:	ian@ and imp@
MFC after:	3 weeks
Sponsored by:	Spectra Logic Corporation
2014-03-07 23:30:48 +00:00
Alan Somers
8de34a88de Fix PR kern/185813 "SOCK_SEQPACKET AF_UNIX sockets with asymmetrical
buffers drop packets".  It was caused by a check for the space available
in a sockbuf, but it was checking the wrong sockbuf.

sys/sys/sockbuf.h
sys/kern/uipc_sockbuf.c
    Add sbappendaddr_nospacecheck_locked(), which is just like
    sbappendaddr_locked but doesn't validate the receiving socket's
    space.  Factor out common code into sbappendaddr_locked_internal().
    We shouldn't simply make sbappendaddr_locked check the space and
    then call sbappendaddr_nospacecheck_locked, because that would cause
    the O(n) function m_length to be called twice.

sys/kern/uipc_usrreq.c
    Use sbappendaddr_nospacecheck_locked for SOCK_SEQPACKET sockets,
    because the receiving sockbuf's size limit is irrelevant.

tests/sys/kern/unix_seqpacket_test.c
    Now that 185813 is fixed, pipe_128k_8k fails intermittently due to
    185812.  Make it fail every time by adding a usleep after starting
    the writer thread and before starting the reader thread in
    test_pipe.  That gives the writer time to fill up its send buffer.
    Also, clear the expected failure message due to 185813.  It actually
    said "185812", but that was a typo.

PR:		kern/185813
Reviewed by:	silence from freebsd-net@ and rwatson@
MFC after:	3 weeks
Sponsored by:	Spectra Logic Corporation
2014-03-06 20:24:15 +00:00
Dimitry Andric
892620150f Merge from head up to r262415. 2014-02-23 23:33:11 +00:00
Dimitry Andric
f9d498ad60 On sparc64, VM_KMEM_SIZE_SCALE is not a constant expression, so it
cannot be tested in a CTASSERT().
2014-02-23 17:37:24 +00:00
Bryan Drewery
63d8fe5531 Fix style of comment blocks.
Reported by:	peter
Approved by:	bapt (mentor, implicit)
X-MFC with:	r262006
2014-02-22 04:28:49 +00:00
Mark Johnston
9e9ea73715 Print a backtrace if the SDT(9) stub gets called so that there's at least
some hope of figuring out how it happened.

Suggested by:	rstone
MFC after:	1 week
2014-02-22 01:41:45 +00:00
Mateusz Guzik
1f9e8f8ad9 Fix a race between kern_proc_{o,}filedesc_out and fdescfree leading
to use-after-free.

fdescfree proceeds to free file pointers once fd_refcnt reaches 0, but
kern_proc_{o,}filedesc_out only checked for hold count.

MFC after:	3 days
2014-02-21 22:29:09 +00:00