Commit Graph

13737 Commits

Author SHA1 Message Date
hselasky
bd1ed65f0f Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after:	2 weeks
Sponsored by:	Mellanox Technologies
2014-06-27 16:33:43 +00:00
mjg
21b16efdd1 Check lower bound of cmsg_len.
If passed cm->cmsg_len was below cmsghdr size the experssion:
datalen = (caddr_t)cm + cm->cmsg_len - (caddr_t)data;

would give negative result. However, in practice it would not
result in a crash because the kernel would try to obtain garbage fds
for given process and would error out with EBADF.

PR:		124908
Submitted by:	campbell mumble.net (modified a little)
MFC after:	1 week
2014-06-27 05:04:36 +00:00
pjd
64b5b5018d Remove duplicated includes.
Submitted by:	Mariusz Zaborski <oshogbo@FreeBSD.org>
2014-06-26 13:57:44 +00:00
attilio
195f74af68 sysctl subsystem uses sxlocks so avoid to setup dynamic sysctl nodes
before sleepinit() has been fully executed in the SLEEPQUEUE_PROFILING
case.

Sponsored by:	EMC / Isilon storage division
2014-06-24 15:16:55 +00:00
mjg
dc769e3b99 Tidy up fd-related functions called by do_execve
o assert in each one that fdp is not shared
o remove unnecessary NULL checks - all userspace processes have fdtables
and kernel processes cannot execve
o remove comments about the danger of fd_ofiles getting reallocated - fdtable
is not shared and fd_ofiles could be only reallocated if new fd was about to be
added, but if that was possible the code would already be buggy as setugidsafety
work could be undone

MFC after:	1 week
2014-06-23 01:28:18 +00:00
mjg
202339afcf Don't take filedesc lock in fdunshare().
We can read refcnt safely and only care if it is equal to 1.

If it could suddenly change from 1 to something bigger the code would be
buggy even in the previous form and transitions from > 1 to 1 are equally racy
and harmless (we copy even though there is no need).

MFC after:	1 week
2014-06-22 21:37:27 +00:00
melifaro
3f08add0c8 Permit changing cpu mask for cpu set 1 in presence of drivers
binding their threads to particular CPU.

Changing ithread cpu mask is now performed by special cpuset_setithread().
It creates additional cpuset root group on first bind invocation.

No objection:	jhb
Tested by:	hiren
MFC after:	2 weeks
Sponsored by:	Yandex LLC
2014-06-22 11:32:23 +00:00
mjg
d74326bc91 fd: replace fd_nfiles with fd_lastfile where appropriate
fd_lastfile is guaranteed to be the biggest open fd, so when the intent
is to iterate over active fds or lookup one, there is no point in looking
beyond that limit.

Few places are left unpatched for now.

MFC after:	1 week
2014-06-22 01:31:55 +00:00
mjg
38cd838637 do_dup: plug redundant adjustment of fd_lastfile
By that time it was already set by fdalloc, or was there in the first place
if fd is replaced.

MFC after:	1 week
2014-06-22 00:53:33 +00:00
kib
f5cc3af6c8 In msdosfs_setattr(), add a check for result of the utimes(2)
permissions test, forgotten in r164033.

Refactor the permission checks for utimes(2) into vnode helper
function vn_utimes_perm(9), and simplify its code comparing with the
UFS origin, by writing the call to VOP_ACCESSX only once.  Use the
helper for UFS(5), tmpfs(5), devfs(5) and msdosfs(5).

Reported by:	bde
Reviewed by:	bde, trasz
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-06-17 07:11:00 +00:00
dchagin
dd6bed9dd2 Revert r266925 as it can lead to instant panic at fexecve():
To allow to run the interpreter itself add a new ELF branding type.

Pointed out by:	kib, mjg
2014-06-17 05:29:18 +00:00
attilio
2802c525ad - Modify vm_page_unwire() and vm_page_enqueue() to directly accept
the queue where to enqueue pages that are going to be unwired.
- Add stronger checks to the enqueue/dequeue for the pagequeues when
  adding and removing pages to them.

Of course, for unmanaged pages the queue parameter of vm_page_unwire() will
be ignored, just as the active parameter today.
This makes adding new pagequeues quicker.

This change effectively modifies the KPI.  __FreeBSD_version will be,
however, bumped just when the full cache of free pages will be
evicted.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	alc
Tested by:	pho
2014-06-16 18:15:27 +00:00
kib
cde29dbd9f Use vn_io_fault for the writes from core dumping code. Recursing into
VM due to copyin(9) faulting while VFS locks are held is
deadlock-prone there in the same way as for the write(2) syscall.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-06-15 04:51:53 +00:00
mav
9e5655843e Implement simple direct-mapped cache for popular filesystem identifiers to
avoid congestion on global mountlist_mtx mutex in vfs_busyfs(), while
traversing through the list of mount points.

This change significantly improves NFS server scalability, since it had
to do this translation for every request, and the global lock becomes quite
congested.

This code is more optimized for relatively small number of mount points.
On systems with hundreds of active mount points this simple cache may have
many collisions.  But the original traversal code in that case should also
behave much worse, so we are not loosing much.

Reviewed by:	attilio
MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2014-06-12 12:43:48 +00:00
mav
60cbac5944 Remove unneeded mountlist_mtx acquisition from sync_fsync().
All struct mount fields accessed by sync_fsync() are protected by MNT_MTX.
2014-06-11 12:56:49 +00:00
mav
1be587d505 Move root_mount_hold() functionality to separate mutex.
It has nothing to share with mutex protecting list of mounted file systems.
2014-06-11 08:14:08 +00:00
kib
b5eae21087 Devolatile as needed.
Sponsored by:	The FreeBSD Foundation
MFC after:	13 days
2014-06-09 09:10:31 +00:00
kib
e15884d6df Change the nblock mutex, protecting the needsbuffer buffer deficit
flags, to rwlock.  Lock it in read mode when used from subroutines
called from buffer release code paths.

The needsbuffer is now updated using atomics, while read lock of
nblock prevents loosing the wakeups from bufspacewakeup() and
bufcountadd() in getnewbuf_bufd_help().

In several interesting loads, needsbuffer flags are never set, while
buffers are reused quickly.  This causes brelse() and bqrelse() from
different threads to content on the nblock.  Now they take nblock in
read mode, together with needsbuffer not needing an update, allowing
higher parallelism.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-06-09 03:38:03 +00:00
alc
2f6e275d72 Refresh a comment. The VM_STACK option was eliminated in r43209.
Sponsored by:	EMC / Isilon Storage Division
2014-06-09 00:15:16 +00:00
mav
6106e186e6 Remove extra branching from r267232.
MFC after:	2 weeks
2014-06-08 19:01:37 +00:00
mav
14c8389ad0 Use atomics to modify numvnodes variable.
This allows to mostly avoid lock usage in getnewvnode_[drop_]reserve(),
that reduces number of global vnode_free_list_mtx mutex acquisitions
from 4 to 2 per NFS request on ZFS, improving SMP scalability.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2014-06-08 15:38:40 +00:00
kib
00a919ea8f Remove write-only local variable.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-06-08 10:56:25 +00:00
kib
321a65896d Initialize the pbuf counter for directio using SYSINIT, instead of
using a direct hook called from kern_vfs_bio_buffer_alloc().
Mark ffs_rawread.c as requiring both ffs and directio options to be
compiled into the kernel.  Add ffs_rawread.c to the list of ufs.ko
module' sources.

In addition to stopping breaking the layering violation, it also
allows to link kernel when FFS is configured as module and DIRECTIO is
enabled.

One consequence of the change is that ffs_rawread.o is always linked
into the module regardless of the DIRECTIO option.  This is similar to
the option QUOTA and ufs_quota.c.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-06-08 10:55:06 +00:00
jilles
5e8a5eb68c ktrace: Use designated initializers for the data_lengths array.
In the .o file, this only changes some line numbers (head amd64) because
element 0 is no longer explicitly initialized.

This should make bugs like FreeBSD-SA-14:12.ktrace less likely.

Discussed with:	des
MFC after:	1 week
2014-06-06 14:49:00 +00:00
davide
e9bd39eb77 Convert functions to the new-style format.
Submitted by:	Vijay Singh <vijju.singh@gmail.com> via -hackers
2014-06-05 03:46:46 +00:00
marcel
916c7006f5 Introduce a procedural interface to the ifnet structure. The new
interface allows the ifnet structure to be defined as an opaque
type in NIC drivers.  This then allows the ifnet structure to be
changed without a need to change or recompile NIC drivers.

Put differently, NIC drivers can be written and compiled once and
be used with different network stack implementations, provided of
course that those network stack implementations have an API and
ABI compatible interface.

This commit introduces the 'if_t' type to replace 'struct ifnet *'
as the type of a network interface. The 'if_t' type is defined as
'void *' to enable the compiler to perform type conversion to
'struct ifnet *' and vice versa where needed and without warnings.
The functions that implement the API are the only functions that
need to have an explicit cast.

The MII code has been converted to use the driver API to avoid
unnecessary code churn. Code churn comes from having to work with
both converted and unconverted drivers in correlation with having
callback functions that take an interface. By converting the MII
code first, the callback functions can be defined so that the
compiler will perform the typecasts automatically.

As soon as all drivers have been converted, the if_t type can be
redefined as needed and the API functions can be fix to not need
an explicit cast.

The immediate benefactors of this change are:
1.  Juniper Networks - The network stack implementation in Junos
    is entirely different from FreeBSD's one and this change
    allows Juniper to build "stock" NIC drivers that can be used
    in combination with both the FreeBSD and Junos stacks.
2.  FreeBSD - This change opens the door towards changing ifnet
    and implementing new features and optimizations in the network
    stack without it requiring a change in the many NIC drivers
    FreeBSD has.

Submitted by:	Anuranjan Shukla <anshukla@juniper.net>
Reviewed by:	glebius@
Obtained from:	Juniper Networks, Inc.
2014-06-02 17:54:39 +00:00
adrian
e3480f19d0 Pin the right thread.
This _was_ right, a last minute suggestion and not enough testing makes
Adrian a bad boy.

Tested:

* igb(4) with RSS patches, by hand verifying each igb(4) taskqueue
  tid from procstat -ka using cpuset -g -t <tid>.
2014-06-01 04:11:05 +00:00
dchagin
538f396887 To allow to run the interpreter itself add a new ELF branding type.
Allow Linux ABI to run ELF interpreter.

MFC after:	3 days
2014-05-31 15:01:51 +00:00
glebius
999343aa9c Whitespace only. 2014-05-30 08:22:58 +00:00
markj
4c818572b7 Commit the rest of the changes that were intended to be part of r266826.
X-MFC-with:	r266826
2014-05-29 01:42:22 +00:00
truckman
6c648960e7 Initialize r_flags the same way in all cases using a sanitized copy of
flags that has several bits cleared. The RF_WANTED and RF_FIRSTSHARE
bits are invalid in this context, and we want to defer setting RF_ACTIVE
in r_flags until later.  This should make rman_get_flags() return
the correct answer in all cases.

Add a KASSERT() to catch callers which incorrectly pass the RF_WANTED
or RF_FIRSTSHARE flags.

Do a strict equality check on the share type bits of flags.  In
particular, do an equality check on RF_PREFETCHABLE.  The previous
code would allow one type of mismatch of RF_PREFETCHABLE but disallow
the other type of mismatch.  Also, ignore the the RF_ALIGNMENT_MASK
bits since alignment validity should be handled by the amask check.
This field contains an integer value, but previous code did a strange
bitwise comparison on it.

Leave the original value of flags unmolested as a minor debug aid.

Change the start+amask overflow check to a KASSERT() since it is just
meant to catch a highly unlikely programming error in the caller.

Reviewed by:	jhb
MFC after:	1 month
2014-05-28 16:57:17 +00:00
adrian
578ca42efc Add a new taskqueue setup method that takes a cpuid to pin the
taskqueue worker thread(s) to.

For now it isn't a taskqueue/taskthread error to fail to pin
to the given cpuid.

Thanks to rpaulo@, kib@ and jhb@ for feedback.

Tested:

* igb(4), with local RSS patches to pin taskqueues.

TODO:

* ask the doc team for help in documenting the new API call.
* add a taskqueue_start_threads_cpuset() method which takes
  a cpuset_t - but this may require a bunch of surgery to
  bring cpuset_t into scope.
2014-05-24 20:37:15 +00:00
bjk
7280c1da39 Check for mismatched vref()/vdrop()
Assert that the hold count has not fallen below the use count, a situation
that would only happen when a vref() (or similar) is erroneously paired
with a vdrop().  This situation has not been observed in the wild, but
could be helpful for someone implementing a new filesystem.

Reviewed by:	kib
Approved by:	hrs (mentor)
2014-05-21 03:11:27 +00:00
kib
235f4b1083 When exec_new_vmspace() decides that current vmspace cannot be reused
on execve(2), it calls vmspace_exec(), which frees the current
vmspace.  The thread executing an exec syscall gets new vmspace
assigned, and old vmspace is freed if only referenced by the current
process.  The free operation includes pmap_release(), which
de-constructs the paging structures used by hardware.

If the calling process is multithreaded, other threads are suspended
in the thread_suspend_check(), and need to be unsuspended and run to
be able to exit on successfull exec.  Now, since the old vmspace is
destroyed, paging structures are invalid, threads are resumed on the
non-existent pmaps (page tables), which leads to triple fault on x86.

To fix, postpone the free of old vmspace until the threads are resumed
and exited.  To avoid modifications to all image activators all of
which use exec_new_vmspace(), memoize the current (old) vmspace in
kern_execve(), and notify it about the need to call vmspace_free()
with a thread-private flag TDP_EXECVMSPC.

http://bugs.debian.org/743141

Reported by:	Ivo De Decker <ivo.dedecker@ugent.be> through secteam
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2014-05-20 09:19:35 +00:00
truckman
56f5be6dbc Slightly restructure the final loop in rman_reserve_resource_bound().
Replace with the existing loop termination test with a similar
condition from the nested "if" that may terminate the loop a bit
sooner, but still not too early.   This condition can then be removed
from the nested "if".  Relocate an operator to be style(9) compliant.

MFC after:	3 days
2014-05-19 04:44:27 +00:00
trasz
c857f75919 Initialize loginclass mutex using MTX_SYSINIT instead of using SI_SUB_CPU.
Suggested by:	rwatson@
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2014-05-14 09:03:02 +00:00
truckman
c965763705 Be even more paranoid about overflow.
Requested by:	ache
2014-05-12 20:22:42 +00:00
truckman
a5f5868e49 Nuke a couple of unnecessary assigments. Nothing uses the values of rstart
and rend after this point.

MFC after:	1 week
2014-05-12 17:56:52 +00:00
jilles
3009646c96 accept(),accept4(): Don't set *addrlen = 0 on [ECONNABORTED].
If the underlying protocol reported an error (e.g. because a connection was
closed while waiting in the queue), this error was also indicated by
returning a zero-length address. For all other kinds of errors (e.g.
[EAGAIN], [ENFILE], [EMFILE]), *addrlen is unmodified and there are
successful cases where a zero-length address is returned (e.g. a connection
from an unbound Unix-domain socket), so this error indication is not
reliable.

As reported in Austin Group bug #836, modifying *addrlen on error may cause
subtle bugs if applications retry the call without resetting *addrlen.
2014-05-11 21:21:14 +00:00
cperciva
e9e7ccdea5 In cf_get_method, when we don't already know what clock speed the CPU is
running at, guess the nearest value instead of looking for a value within
25 MHz of the observed frequency.

Prior to this change, if a system booted with Intel Turbo Boost enabled,
the dev.cpu.0.freq sysctl is nonfunctional, since the ACPI-reported
frequency for Turbo Boost states does not match the actual clock frequency
(and thus no levels are within 25 MHz of the observed frequency) and the
current performance level is read before a new level is set.

MFC after:	3 days
Relnotes:	Bug fix in power management on CPUs with Intel Turbo Boost
2014-05-11 10:32:58 +00:00
adrian
65fc060c58 Add in support to optionally pin the swi threads.
Under enough load, the swi's can actually be preempted and migrated
to other currently free cores.  When doing RSS experiments, this lead
to the per-CPU TCP timers not lining up any more with the RX CPU said
flows were ending up on, leading to increased lock contention.

Since there was a little pushback on flipping them on by default,
I've left the default at "don't pin."

The other less obvious problem here is that the default swi
is also the same as the destination swi for CPU #0.  So if one
pins the swi on CPU #0, there's no default floating swi.

A nice future project would be to create a separate swi for
the "default" floating swi, as well as per-CPU swis that are
(optionally) pinned.

Tested:

* parallel TCP tests (2 x 1g unfortunately for now);
  CPU: Intel(R) Xeon(R) CPU E5-2650

Note:

This is based on some initial investigation into RSS/TCP stack lock
contention on FreeBSD-HEAD whilst at Netflix in January 2014.
2014-05-10 00:53:36 +00:00
truckman
c163eb4b72 Avoid unsigned integer overflow which can cause
rman_reserve_resource_bound() to return incorrect results.

Continue the initial search until the first viable region is found.

Add a comment to explain the search termination test.

PR:		kern/188534
Reviewed by:	jhb (previous version)
MFC after:	1 week
2014-05-05 15:59:31 +00:00
mjg
1b83ce1524 Request a non-exiting process in sysctl_kern_proc_{o,}filedesc
This fixes a race with exit1 freeing p_textvp.

Suggested by:	kib
MFC after:	1 week
2014-05-02 21:55:09 +00:00
brueffer
09b97bcfaf Free resources in an error case.
CID:		1018947
Found with:	Coverity Prevent(tm)
MFC after:	1 week
2014-05-02 21:34:17 +00:00
rwatson
7ca3cc4389 Garbage collect mtxpool_lockbuilder, the mutex pool historically used
for lockmgr and sx interlocks, but unused since optimised versions of
those sleep locks were introduced.  This will save a (quite) small
amount of memory in all kernel configurations.  The sleep mutex pool is
retained as it is used for 'struct bio' and several other consumers.

Discussed with:	jhb
MFC after:	3 days
2014-05-02 07:57:40 +00:00
mjg
7ec4134d19 Ignore the error from pipespace_new when creating a pipe.
It can fail if pipe map is exhausted (as a result of too many pipes created),
but it is not fatal and could be provoked by unprivileged users. The only
consequence is worse performance with given pipe.

Reported by:	ivoras
Suggested by:	kib
MFC after:	1 week
2014-05-02 00:52:13 +00:00
brooks
f9a10360bc Fix a 2038 bug.
If time_t is 64-bit (i.e. isn't 32-bit) allow any value of year, not
just years less than 2038.

Don't bother fixing the underflow in the case of years before 1903.

MFC after:	1 week
Sponsored by:	DARPA, AFRL
2014-05-01 22:28:14 +00:00
marius
6678ece656 Given that as of r258002 the last external user is gone, make sched_lock
static.
2014-04-29 20:51:57 +00:00
grehan
455d465e40 Bump WITNESS_PENDLIST by MAXCPU to account for the
pmap pvlist locks which are scaled by MAXCPU.

This allows an amd64 system to boot with MAXCPU set
to 256, which is currently FreeBSD's hard limit without
x2apic support.

Compile-tested for other arch's.

PR:	185831
Discussed with:		jhb
MFC after:	3 weeks
2014-04-29 17:22:29 +00:00
brooks
2e62cc3f35 Revert r263754, re-adding support for hw.bus.devctl_disable. Breaking
old devd's and thus hosts that get IP addresses from DHCP was too much
of a POLA violation.

The sysctl may be removed again after r263758 has been merged to at
least stable/9 and stable/10, and releases have been cut from those
branches.

Discussed with:	mjg
Reported by:	theraven, rwatson
2014-04-28 20:38:08 +00:00