appropriately
Assert FILEDESC_XLOCK_ASSERT only for already used tables in fdgrowtable.
We don't have to call it with the lock held if we are just creating new
filedesc.
As a side note, strictly speaking processes can have fdtables with
fd_lastfile = -1, but then they cannot enter fdgrowtable. Very first file
descriptor they get will be 0 and the only syscall allowing to choose fd number
requires an active file descriptor. Should this ever change, we can add an 'init'
(or similar) parameter to fdgrowtable.
While here add 'fdused_init' which does not perform unnecessary work.
Drop FILEDESC_LOCK_ASSERT from fdisused and rely on callers to hold
it when appropriate. This function is only used with INVARIANTS.
No functional changes intended.
Test for file availability by fde_file != NULL instead of fdisused, this is
consistent with similar checks later.
Drop badfileops check. badfileops don't have DFLAG_PASSABLE set, so it was never
reached in practice.
fdiused is now only used in some KASSERTS, so ifdef it under INVARIANTS.
No functional changes.
This code has had an extensive rewrite and a good series of reviews, both by the author and other parties. This means a lot of code has been simplified. Pluggable structures for high-rate entropy generators are available, and it is most definitely not the case that /dev/random can be driven by only a hardware souce any more. This has been designed out of the device. Hardware sources are stirred into the CSPRNG (Yarrow, Fortuna) like any other entropy source. Pluggable modules may be written by third parties for additional sources.
The harvesting structures and consequently the locking have been simplified. Entropy harvesting is done in a more general way (the documentation for this will follow). There is some GREAT entropy to be had in the UMA allocator, but it is disabled for now as messing with that is likely to annoy many people.
The venerable (but effective) Yarrow algorithm, which is no longer supported by its authors now has an alternative, Fortuna. For now, Yarrow is retained as the default algorithm, but this may be changed using a kernel option. It is intended to make Fortuna the default algorithm for 11.0. Interested parties are encouraged to read ISBN 978-0-470-47424-2 "Cryptography Engineering" By Ferguson, Schneier and Kohno for Fortuna's gory details. Heck, read it anyway.
Many thanks to Arthur Mesh who did early grunt work, and who got caught in the crossfire rather more than he deserved to.
My thanks also to folks who helped me thresh this out on whiteboards and in the odd "Hallway track", or otherwise.
My Nomex pants are on. Let the feedback commence!
Reviewed by: trasz,des(partial),imp(partial?),rwatson(partial?)
Approved by: so(des)
without restarting whole lookup
Restart is only needed when fp was closed by current process, which is a much
rarer event than ref/deref by some other thread.
A read barrier was necessary because fd table pointer and table size were
updated separately, opening a window where fget_unlocked could read new size
and old pointer.
This patch puts both these fields into one dedicated structure, pointer to which
is later atomically updated. As such, fget_unlocked only needs data a dependency
barrier which is a noop on all supported architectures.
Reviewed by: kib (previous version)
MFC after: 2 weeks
- Move the existing code to x86/x86/identcpu.c since it is x86-specific.
- If the CPUID2_HV flag is set, assume a hypervisor is present and query
the 0x40000000 leaf to determine the hypervisor vendor ID. Export the
vendor ID and the highest supported hypervisor CPUID leaf via
hv_vendor[] and hv_high variables, respectively. The hv_vendor[]
array is also exported via the hw.hv_vendor sysctl.
- Merge the VMWare detection code from tsc.c into the new probe in
identcpu.c. Add a VM_GUEST_VMWARE to identify vmware and use that in
the TSC code to identify VMWare.
Differential Revision: https://reviews.freebsd.org/D1010
Reviewed by: delphij, jkim, neel
Also fix some mishandling of suword(9) errors as errno, which resulted
in spurious ERESTART.
Sponsored by: The FreeBSD Foundation
Tested by: pho
MFC after: 3 weeks
and casuword(9), but do not mix value read and indication of fault.
I know (or remember) enough assembly to handle x86 and powerpc. For
arm, mips and sparc64, implement fueword() and casueword() as wrappers
around fuword() and casuword(), which means that the functions cannot
distinguish between -1 and fault.
On architectures where fueword() and casueword() are native, implement
fuword() and casuword() using fueword() and casuword(), to reduce
assembly code duplication.
Sponsored by: The FreeBSD Foundation
Tested by: pho
MFC after: 2 weeks (ia64 needs treating)
directly accessed. Although this will work on some platforms, it can
throw an exception if the pointer is invalid and then panic the kernel.
Add a missing SYSCTL_IN() of "SCTP_BASE_STATS" structure.
MFC after: 3 days
Sponsored by: Mellanox Technologies
The kernel tracks syscall users so that modules can safely unregister them.
But if the module is not unloadable or was compiled into the kernel, there is
no need to do this.
Achieve this by adding SY_THR_STATIC_KLD macro which expands to SY_THR_STATIC
during kernel build and 0 otherwise.
Reviewed by: kib (previous version)
MFC after: 2 weeks
- 'groups' initialization to NULL is always ovewrwriten before use, so plug it
- get rid of 'goto out'
- kern_setgroups's callers already validate ngrp, so only assert the condition
- ngrp is an u_int, so 'ngrp < 1' is more readable as 'ngrp == 0'
No functional changes.
and the following r273143 commit, supposed to workaround introduced issue by
quite innocent-looking change.
While there is no clear understanding why, but r273143 is accused in data
corruption in some environments with high I/O load. I personally don't see
any problem in that commit, and possibly it is just a trigger to some other
bug somewhere, but better safe then sorry for now.
Requested by: scottl@
MFC after: 3 days
in a separate word from the _count. This does not permit both items to
be updated atomically in a portable manner. As a result, sem_post()
must always perform a system call to safely clear _has_waiters.
This change removes the _has_waiters field and instead uses the high bit
of _count as the _has_waiters flag. A new umtx object type (_usem2) and
two new umtx operations are added (SEM_WAIT2 and SEM_WAKE2) to implement
these semantics. The older operations are still supported under the
COMPAT_FREEBSD9/10 options. The POSIX semaphore API in libc has
been updated to use the new implementation. Note that the new
implementation is not compatible with the previous implementation.
However, this only affects static binaries (which cannot be helped by
symbol versioning). Binaries using a dynamic libc will continue to work
fine. SEM_MAGIC has been bumped so that mismatched binaries will error
rather than corrupting a shared semaphore. In addition, a padding field
has been added to sem_t so that it remains the same size.
Differential Revision: https://reviews.freebsd.org/D961
Reported by: adrian
Reviewed by: kib, jilles (earlier version)
Sponsored by: Norse
indiscriminately to printf() and freeenv() is incorrect. Add a NULL
check before freeenv(); as for printf(), we could use req.newptr
instead, but we'd have to select the correct format string based on
the type, and that's too much work for an error message, so just
remove it.
initial static environment to a dynamic one, zero the static environment
buffer, and zero individual values when kern_unsetenv and freeenv are
called.
Tested by: kmoore (VM memory dump + grep)
Tested by: cperciva (kernel panic dump + grep)
Rename it to fdsetugidsafety for consistency with other functions.
There is no need to take filedesc lock if not closing any files.
The loop has to verify each file and we are guaranteed fdtable has space
for at least 20 fds. As such there is no need to check fd_lastfile.
While here tidy up is_unsafe.
- Wrong integer type was specified.
- Wrong or missing "access" specifier. The "access" specifier
sometimes included the SYSCTL type, which it should not, except for
procedural SYSCTL nodes.
- Logical OR where binary OR was expected.
- Properly assert the "access" argument passed to all SYSCTL macros,
using the CTASSERT macro. This applies to both static- and dynamically
created SYSCTLs.
- Properly assert the the data type for both static and dynamic
SYSCTLs. In the case of static SYSCTLs we only assert that the data
pointed to by the SYSCTL data pointer has the correct size, hence
there is no easy way to assert types in the C language outside a
C-function.
- Rewrote some code which doesn't pass a constant "access" specifier
when creating dynamic SYSCTL nodes, which is now a requirement.
- Updated "EXAMPLES" section in SYSCTL manual page.
MFC after: 3 days
Sponsored by: Mellanox Technologies
two.
nullfs and unionfs need to request suspension if underlying filesystem(s)
use it. Utilize mnt_kern_flag for this purpose.
This is a fixup for 273271.
No strong objections from: kib
Pointy hat to: mjg
MFC after: 2 weeks
This involves:
1. Have the loader pass the start and size of the .ctors section to the
kernel in 2 new metadata elements.
2. Have the linker backends look for and record the start and size of
the .ctors section in dynamically loaded modules.
3. Have the linker backends call the constructors as part of the final
work of initializing preloaded or dynamically loaded modules.
Note that LLVM appends the priority of the constructors to the name of
the .ctors section. Not so when compiling with GCC. The code currently
works for GCC and not for LLVM.
Submitted by: Dmitry Mikulin <dmitrym@juniper.net>
Obtained from: Juniper Networks, Inc.
rather than u_char.
To try and play nice with the ABI, the u_char CPU ID values are clamped
at 254. The new fields now contain the full CPU ID, or -1 for no cpu.
Differential Revision: D955
Reviewed by: jhb, kib
Sponsored by: Norse Corp, Inc.
in userland rename in-kernel getenv()/setenv() to kern_setenv()/kern_getenv().
This fixes a namespace collision with libc symbols.
Submitted by: kmacy
Tested by: make universe
This fixes use-after-free, caused by geom_disk, completing same BIO twice
to save extra allocation, and getting BIO_DONE set after the first.
MFC after: 1 week
1. Remove initializer for badstack_sbuf_size; it gets set unconditionally.
2. Remove meaningless comment.
3. Group witness_count and its sysctl together.
4. Fix spacing in for statements (space after for and within condition).
5. Change *all* M_NOWAIT usages in witness_initialize() to M_WAITOK; not
just those that were newly introduced -- the allocation is assumed to
succeed for all allocations.
6. Avoid using uint8_t as the base type in sizeof() expressions; Use the
variable name (w_rmatrix) as much as possible.
Pointed out by: jhb@ (thanks!)
the value without recompiling the kernel. This is useful when
recompiling is not possible as an immediate solution. When we run out
of witness objects, witness is completely disabled. Not having an
immediate solution can therefore be problematic.
Submitted by: Sreekanth Rupavatharam <rupavath@juniper.net>
Obtained from: Juniper Networks, Inc.
Move the SCTP syscalls to netinet with the rest of the SCTP code.
Submitted by: Steve Kiernan <stevek@juniper.net>
Reviewed by: tuexen, rrs
Obtained from: Juniper Networks, Inc.
syscalls themselves are tightly coupled with the network stack and
therefore should not be in the generic socket code.
The following four syscalls have been marked as NOSTD so they can be
dynamically registered in sctp_syscalls_init() function:
sys_sctp_peeloff
sys_sctp_generic_sendmsg
sys_sctp_generic_sendmsg_iov
sys_sctp_generic_recvmsg
The syscalls are also set up to be dynamically registered when COMPAT32
option is configured.
As a side effect of moving the SCTP syscalls, getsock_cap needs to be
made available outside of the uipc_syscalls.c source file. A proper
prototype has been added to the sys/socketvar.h header file.
API tests from the SCTP reference implementation have been run to ensure
compatibility. (http://code.google.com/p/sctp-refimpl/source/checkout)
Submitted by: Steve Kiernan <stevek@juniper.net>
Reviewed by: tuexen, rrs
Obtained from: Juniper Networks, Inc.
* Add a bus_if.m method - get_domain() - returning the VM domain or
ENOENT if the device isn't in a VM domain;
* Add bus methods to print out the domain of the device if appropriate;
* Add code in srat.c to save the PXM -> VM domain mapping that's done and
expose a function to translate VM domain -> PXM;
* Add ACPI and ACPI PCI methods to check if the bus has a _PXM attribute
and if so map it to the VM domain;
* (.. yes, this works recursively.)
* Have the pci bus glue print out the device VM domain if present.
Note: this is just the plumbing to start enumerating information -
it doesn't at all modify behaviour.
Differential Revision: D906
Reviewed by: jhb
Sponsored by: Norse Corp
1. ERESTART is not only returned when the revoke count changed. It
is also returned when a signal is received. While a change in
the revoke count should be ignored, a signal should not.
2. Waiting until the output queue is entirely drained can cause a
hang when the underlying device is stuck or broken.
Have tty_drain() take care of this by telling it when we're leaving.
When leaving, tty_drain() will use a timed wait to address point 2
above and it will check the revoke count to handle point 1 above.
The timeout is set to 1 second, which is arbitrary and long enough
to expect a change in the output queue.
Discussed with: jilles@
Reported by: Yamagi Burmeister <lists@yamagi.org>
a running event each time it executes a callout function. The event
includes the function pointer, argument, and whether or not it was run from
hardware interrupt context. The callwheel is marked idle when each handler
completes. This effectively logs the duration of each callout routine in
the graph.
Include sequence counter supports incoditionally [1]. This fixes reprted build
problems with e.g. nvidia driver due to missing opt_capsicum.h.
Replace fishy looking sizeof with offsetof. Make fde_seq the last member in
order to simplify calculations.
Suggested by: kib [1]
X-MFC: with 272505
the upper layers, which interpret it as errno value, which happens to
be ERESTART. The result was spurious restarts of the sysctls in loop,
e.g. kern.proc.proc, instead of returning ENOMEM to caller.
Convert -1 from sbuf_bcat() to ENOMEM, when returning to the callers
expecting errno.
In collaboration with: pho
Sponsored by: The FreeBSD Foundation (kib)
MFC after: 1 week
interrupts and report the largest value seen as sysctl
debug.max_kstack_used. Useful to estimate how close the kernel stack
size is to overflow.
In collaboration with: Larry Baird <lab@gta.com>
Sponsored by: The FreeBSD Foundation (kib)
MFC after: 1 week
- Do not dump into system files.
- Do not acquire write reference to the mount point where img.core is
written, in the coredump(). The vn_rdwr() calls from ELF imgact
request the write ref from vn_rdwr(). Recursive acqusition of the
write ref deadlocks with the unmount.
- Instead, take the range lock for the whole core file. This prevents
parallel dumping from two processes executing the same image,
converting the useless interleaved dump into sequential dumping,
with second core overwriting the first.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
callout is now scheduled using the C_ABSOLUTE flag, and the absolute time
of each event is calculated as the time the previous event was scheduled
for plus the interval. This ensures that latency in processing a given
event doesn't perturb the arrival time of any subsequent events.
Reviewed by: jhb
fp and appropriate capability lookups were not atomic, which could result in
improper capabilities being checked.
This could result either in protection bypass or in a spurious ENOTCAPABLE.
Make fp + capability check atomic with the help of sequence counters.
Reviewed by: kib
MFC after: 3 weeks
Some watchdog drivers (like ipmi) need to sleep while patting the watchdog.
See sys/dev/ipmi/ipmi.c:ipmi_wd_event(), which calls malloc(M_WAITOK).
Submitted by: asomers
MFC after: 1 month
Sponsored by: Spectra Logic
MFSpectraBSD: 637548 on 2012/10/04
While strictly speaking this is not correct since some fields are pointers,
it makes no difference on all supported archs and we already rely on it doing
the right thing in other places.
No functional changes.
that the tty is dequeued from 'tty_list' only the first time.
The panic below was seen when a revoke(2) was issued on an nmdm device.
In this case there was also a thread that was blocked on a read(2) on the
device. The revoke(2) woke up the blocked thread which would typically
return an error to userspace. In this case the reader also held the last
reference on the file descriptor so fdrop() ended up calling tty_rel_free()
via ttydev_close().
tty_rel_free() then tried to dequeue 'tp' again which led to the panic.
panic: Bad link elm 0xfffff80042602400 prev->next != elm
cpuid = 1
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00f9c90460
kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe00f9c90510
vpanic() at vpanic+0x189/frame 0xfffffe00f9c90590
panic() at panic+0x43/frame 0xfffffe00f9c905f0
tty_rel_free() at tty_rel_free+0x29b/frame 0xfffffe00f9c90640
ttydev_close() at ttydev_close+0x1f9/frame 0xfffffe00f9c90690
devfs_close() at devfs_close+0x298/frame 0xfffffe00f9c90720
VOP_CLOSE_APV() at VOP_CLOSE_APV+0x13c/frame 0xfffffe00f9c90770
vn_close() at vn_close+0x194/frame 0xfffffe00f9c90810
vn_closefile() at vn_closefile+0x48/frame 0xfffffe00f9c90890
devfs_close_f() at devfs_close_f+0x2c/frame 0xfffffe00f9c908c0
_fdrop() at _fdrop+0x29/frame 0xfffffe00f9c908e0
sys_read() at sys_read+0x63/frame 0xfffffe00f9c90980
amd64_syscall() at amd64_syscall+0x2b3/frame 0xfffffe00f9c90ab0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe00f9c90ab0
--- syscall (3, FreeBSD ELF64, sys_read), rip = 0x800b78d8a, rsp = 0x7fffffbfdaf8, rbp = 0x7fffffbfdb30 ---
CR: https://reviews.freebsd.org/D851
Reviewed by: glebius, ed
Reported by: Leon Dang
Sponsored by: Nahanni Systems
MFC after: 1 week
fail the allocation request. Allocations of "reserved" resources such as
PCI BARs already fail the request instead of panic'ing in this case.
MFC after: 1 week
struct flock are done in the sys_fcntl(), which mean that compat32 used
direct access to userland pointers.
Move code from sys_fcntl() to new wrapper, kern_fcntl_freebsd(), which
performs neccessary userland memory accesses, and use it from both
native and compat32 fcntl syscalls.
Reported by: jhibbits
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
holding a write reference on the filesystem. Try to get write
reference in unblocked way after all vnodes are resolved; if failed,
drop all locks and retry after waiting for suspension end.
The VFS_UNMOUNT() methods for UFS and tmpfs try to establish
suspension on unmount, while covered vnode is locked by VFS, which
prevents namei() from stepping over the mount point. The thread doing
namei() sleeps on the covered vnode lock, owning the write ref.
Reported by: bdrewery
Tested by: bdrewery (previous version), pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Summary:
Add the beginnings of multipass suspend/resume, by introducing
BUS_SUSPEND_CHILD/BUS_RESUME_CHILD, and move the PCI driver to this.
Reviewers: jhb
Reviewed By: jhb
Differential Revision: https://reviews.freebsd.org/D590
struct kinfo_file.
- Move the various fill_*_info() methods out of kern_descrip.c and into the
various file type implementations.
- Rework the support for kinfo_ofile to generate a suitable kinfo_file object
for each file and then convert that to a kinfo_ofile structure rather than
keeping a second, different set of code that directly manipulates
type-specific file information.
- Remove the shm_path() and ksem_info() layering violations.
Differential Revision: https://reviews.freebsd.org/D775
Reviewed by: kib, glebius (earlier version)
The current TSO limitation feature only takes the total number of
bytes in an mbuf chain into account and does not limit by the number
of mbufs in a chain. Some kinds of hardware is limited by two
factors. One is the fragment length and the second is the fragment
count. Both of these limits need to be taken into account when doing
TSO. Else some kinds of hardware might have to drop completely valid
mbuf chains because they cannot loaded into the given hardware's DMA
engine. The new way of doing TSO limitation has been made backwards
compatible as input from other FreeBSD developers and will use
defaults for values not set.
Reviewed by: adrian, rmacklem
Sponsored by: Mellanox Technologies
MFC after: 1 week
footprint systems(32M/64M) and didn't leave enough free memory to load modules
when it was setting up page tables that for sizes that are never used on
these smallish boards.
Set kmem_zmax to PAGE_SIZE on these smaller systems (< 128M) to keep this
from happening. Verified on mips32 h/w.
PR: 193465
Submitted by: delphij
Reviewed by: adrian
than u_char.
Migrate post_filter to use an int for a CPU rather than u_char.
Change intr_event_bind() to use an int for CPU rather than u_char.
It touches the ppc, sparc64, arm and mips machdep code but it should
(hah!) be a no-op.
Tested:
* i386, AMD64 laptops
Reviewed by: jhb
POSIX compliance and to improve compatibility with Linux and NetBSD
The issue was identified with lib/libc/sys/t_access:access_inval from
NetBSD
Update the manpage accordingly
PR: 181155
Reviewed by: jilles (code), jmmv (code), wblock (manpage), wollman (code)
MFC after: 4 weeks
Phabric: D678 (code), D786 (manpage)
Sponsored by: EMC / Isilon Storage Division
This change fixes transient performance drops in some of my benchmarks,
vanishing as soon as I am trying to collect any stats from the scheduler.
It looks like reordered access to those variables sometimes caused loss of
IPI_PREEMPT, that delayed thread execution until some later interrupt.
MFC after: 3 days
- Add invfo_rdwr() (for read and write), invfo_ioctl(), invfo_poll(),
and invfo_kqfilter() for use by file types that do not support the
respective operations. Home-grown versions of invfo_poll() were
universally broken (they returned an errno value, invfo_poll()
uses poll_no_poll() to return an appropriate event mask). Home-grown
ioctl routines also tended to return an incorrect errno (invfo_ioctl
returns ENOTTY).
- Use the invfo_*() functions instead of local versions for
unsupported file operations.
- Reorder fileops members to match the order in the structure definition
to make it easier to spot missing members.
- Add several missing methods to linuxfileops used by the OFED shim
layer: fo_write(), fo_truncate(), fo_kqfilter(), and fo_stat(). Most
of these used invfo_*(), but a dummy fo_stat() implementation was
added.
instead of breaking out of the loop and then immediately checking the loop
index so that if it was broken out of the proper value can be returned.
While here, use nitems().
This fixes a panic in the i915 driver when one uses debug.kdb.enter=1
under vt(4).
PR: 193269
Reported by: emaste@
Submitted by: avg@
MFC after: 3 days
imgp->interpreted to a bitmask instead of, functionally, a bool. Each
imgactivator now requires its own flag in interpreted to indicate whether
or not it has already examined argv[0].
Change imgp->interpreted to an unsigned char to add one extra bit for
future use.
With this change, one can execute a shell script from a 64bit host native
make and still get the binmisc image activator to fire for the script
interpreter. Prior to this, execution would fail.
Phabric: https://reviews.freebsd.org/D696
Reviewed by: jhb@
MFC after: 4 weeks
set bo_bsize on a bufobj.
This is a slight modification of the patch provided.
PR: 193146
Submitted by: Conrad Meyer <conrad.meyer@isilon.com>
Sponsored by: EMC Isilon Storage Division
reaches 1. The p_numthreads counter is decremented in thread_exit() by
a call to thread_unlink(). This means that the exiting threads may
still execute on other CPUs when thread_single(SINGLE_EXIT) returns.
As result, vmspace could be destroyed while paging structures are
still used on other CPUs by exiting threads.
Delay the return from thread_single(SINGLE_EXIT) until all threads are
really destroyed by thread_stash() after the last switch out. The
p_exitthreads counter already provides the required mechanism, move
the wait from the thread_wait() (which is called from wait(2) code)
into thread_single().
Reported by: many (as "panic: pmap active <addr>")
Reviewed by: alc, jhb
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Prior to the change it would always return initproc for non-traced processes.
This fixes ps apparently always returning 1 as ppid.
Pointy hat: mjg
Reported by: many
MFC after: 1 week
Add a separate field which exports tracer pid and add a new keyword
("tracer") for ps to display it.
This is a follow up to r270444.
Reviewed by: kib
MFC after: 1 week
Relnotes: yes
They are used when a panic occurs or when entering a DDB session for
instance.
cngrab() forces a vt-switch to the console window, no matter if the
original window is another terminal or an X session. However, cnungrab()
doesn't vt-switch back to the original window currently.
MFC after: 1 week
pre-r270321. Namely, the flags are preserved for SIG_DFL and SIG_IGN
dispositions.
Requested and reviewed by: jilles
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Traced processes always have the tracer set as the parent.
Utilize proc_realparent to obtain the right process when needed.
Reviewed by: kib
MFC after: 1 week
Previously they were uncoditionally reparented to init. In effect
it was possible that tracee was never returned to original parent.
Reviewed by: kib
MFC after: 1 week
aborted, allowing other threads to run. Without this change thread is just
rescheduled again, that was illustrated by provided test tool.
PR: 192926
Submitted by: eric@vangyzen.net
MFC after: 2 weeks
doing suspend check. This restores the pre-r251684 behaviour, to
retry once after the signal is detected.
PR: kern/192918
Submitted by: Elliott Rabe, Dell Inc., Eric van Gyzen <eric@vangyzen.net>
Obtained from: Dell Inc.
MFC after: 1 week
ignored or default, are not leaking. Apparently, there exists code
which relies on SA_SIGINFO not reported for SIG_DFL or SIG_IGN.
In kern_sigaction, ignore flags when resetting. Encapsulate the flag
and disposition testing into helper sigact_flag_test().
On exec, and when delivering signal with SA_RESETHAND flag set,
signals are reset automatically. Use new helper sigdflt(), which
removes duplicated code and corrects all flag bits for the signal.
For proc0, set sigintr bit for all ignored signals. Ignored signals
are consumed in tdsendsignal() and not delivered to the victim thread
at all.
Reported and tested by: royger
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
good example is socket options that aren't necessarily generic. To
this end, OSD is added to the socket structure and hooks are defined
for key operations on sockets. These are:
o soalloc() and sodealloc()
o Get and set socket options
o Socket related kevent filters.
One aspect about hhook that appears to be not fully baked is the return
semantics (the return value from the hook is ignored in hhook_run_hooks()
at the time of commit). To support return values, the socket_hhook_data
structure contains a 'status' field to hold return values.
Submitted by: Anuranjan Shukla <anshukla@juniper.net>
Obtained from: Juniper Networks, Inc.
header (Elf_Ehdr) to determine if a particular interpretor wants to
accept it or not. Use this mechanism to filter EABI arm on OABI arm
kernels, and vice versa. This method could also be used to implement
OABI on EABI arm kernels, if desired, or to allow a single mips kernel
to run o32, n32 and n64 binaries.
Differential Revision: https://reviews.freebsd.org/D609
UNIX systems, eg. MacOS X and Solaris. It uses Sun-compatible map format,
has proper kernel support, and LDAP integration.
There are still a few outstanding problems; they will be fixed shortly.
Reviewed by: allanjude@, emaste@, kib@, wblock@ (earlier versions)
Phabric: D523
MFC after: 2 weeks
Relnotes: yes
Sponsored by: The FreeBSD Foundation
page queue even when the allocation is not wired. It is
responsibility of the vm_page_grab() caller to ensure that the page
does not end on the vm_object queue but not on the pagedaemon queue,
which would effectively create unpageable unwired page.
In exec_map_first_page() and vm_imgact_hold_page(), activate the page
immediately after unbusying it, to avoid leak.
In the uiomove_object_page(), deactivate page before the object is
unlocked. There is no leak, since the page is deactivated after
uiomove_fromphys() finished. But allowing non-queued non-wired page
in the unlocked object queue makes it impossible to assert that leak
does not happen in other places.
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
One problem is inferior(9) looping due to the process tree becoming a
graph instead of tree if the parent is traced by child. Another issue
is due to the use of p_oppid to restore the original parent/child
relationship, because real parent could already exited and its pid
reused (noted by mjg).
Add the function proc_realparent(9), which calculates the parent for
given process. It uses the flag P_TREE_FIRST_ORPHAN to detect the head
element of the p_orphan list and than stepping back to its container
to find the parent process. If the parent has already exited, the
init(8) is returned.
Move the P_ORPHAN and the new helper flag from the p_flag* to new
p_treeflag field of struct proc, which is protected by proctree lock
instead of proc lock, since the orphans relationship is managed under
the proctree_lock already.
The remaining uses of p_oppid in ptrace(PT_DETACH) and process
reapping are replaced by proc_realparent(9).
Phabric: D417
Reviewed by: jhb
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
The MD allocators were very common, however there were some minor
differencies. These differencies were all consolidated in the MI allocator,
under ifdefs. The defines from machine/vmparam.h turn on features required
for a particular machine. For details look in the comment in sys/sf_buf.h.
As result no MD code left in sys/*/*/vm_machdep.c. Some arches still have
machine/sf_buf.h, which is usually quite small.
Tested by: glebius (i386), tuexen (arm32), kevlo (arm32)
Reviewed by: kib
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
Replace a single soft updates thread with a thread per FFS-filesystem
mount point. The threads are associated with the bufdaemon process.
Reviewed by: kib
Tested by: Peter Holm and Scott Long
MFC after: 2 weeks
Sponsored by: Netflix
r262867 was described as fixing socket buffer checks for SOCK_SEQPACKET,
but also changed one of the SOCK_DGRAM code paths to use the new
sbappendaddr_nospacecheck_locked() function. This lead to SOCK_DGRAM
bypassing socket buffer limits.
It could be claimed that two things were reasonable protected by
Giant. One is vfsconf list links, which is converted to the new
dedicated sx vfsconf_sx. Another is vfsconf.vfc_refcount, which is
now updated with atomics.
Note that vfc_refcount still has the same races now as it has under
the Giant, the unload of filesystem modules can happen while the
module is still in use.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
some parts of the checks are in fact redundand in the surrounding
code, and it is more clear what the conditions are by direct testing
of the flags. Two of the three macros were only used in assertions.
In vnlru_free(), all relevant parts of vholdl() were already inlined,
except the increment of v_holdcnt itself. Do not call vholdl() to do
the increment as well, this allows to make assertions in
vholdl()/vhold() more strict.
In v_incr_usecount(), call vholdl() before incrementing other ref
counters. The change is no-op, but it makes less surprising to see
the vnode state in debugger if interrupted inside v_incr_usecount().
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
so it really should not be under "optional inet". The fact that uipc_accf.c
lives under kern/ lends some weight to making it a "standard" file.
Moving kern/uipc_accf.c from "optional inet" to "standard" eliminates the
need for #ifdef INET in kern/uipc_socket.c.
Also, this meant the net.inet.accf.unloadable sysctl needed to move, as
net.inet does not exist without networking compiled in (as it lives in
netinet/in_proto.c.) The new sysctl has been named net.accf.unloadable.
In order to support existing accept filter sysctls, the net.inet.accf node
has been added netinet/in_proto.c.
Submitted by: Steve Kiernan <stevek@juniper.net>
Obtained from: Juniper Networks, Inc.
is the indication that draining got interrupted due to a revoke(2) and
that tty_drain() is to be called again for draining to complete. If the
device is flagged as gone, then waiting/draining is not possible. Only
return ERESTART when waiting is still possible.
Obtained from: Juniper Networks, Inc.
This commit does not add error returns to minidumpsys() or
textdump_dumpsys(); those can also be added later.
Submitted by: Conrad Meyer (EMC / Isilon storage division)
instead of at the beginning. This allows an intra process signal
to be sent to the oldest thread with the signal unmasked - which,
if it still exists, is the main thread. This mimics behavior
found in Linux and Solaris.
Define the precision macros as bits sets to conform with XNU equivalent.
Test fflags passed for EVFILT_TIMER and return EINVAL in case an invalid flag
is passed.
Phabric: https://phabric.freebsd.org/D421
Reviewed by: kib
code. The consensus on arch@ is that this feature might have been useful
in the distant past, but is now just unnecessary bloat.
The int_rman_activate_resource() and int_rman_deactivate_resource()
functions become trivial, so manually inline them.
The special deferred handling of RF_ACTIVE is no longer needed in
reserve_resource_bound(), so eliminate the associated code at the
end of the function.
These changes reduce the object file size by more than 500 bytes on i386.
Update the rman.9 man page to reflect the removal of the RF_TIMESHARE
feature.
MFC after: 2 weeks
forcing filesystem VOP_LINK() methods to repeat the code. In
tmpfs_link(), remove redundand check for the type of the source,
already done by VFS.
Note that NFS server already performs this check before calling
VOP_LINK().
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
- Move the code to calculate resident count into separate function.
It reduces the indent level and makes the operation of
vmmap_skip_res_cnt tunable more clear.
- Optimize the calculation of the resident page count for map entry.
Skip directly to the next lowest available index and page among the
whole shadow chain.
- Restore the use of pmap_incore(9), only to verify that current
mapping is indeed superpage.
- Note the issue with the invalid pages.
Suggested and reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
unmount time) in the helper vfs_write_suspend_umnt(). Use it instead
of two inline copies in FFS.
Fix the bug in the FFS unmount, when suspension failed, the ufs
extattrs were not reinitialized.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
producer, instead of hard-coding VFS_VGET(). New function, which
takes callback, is called vn_get_ino_gen(), standard callback for
vn_get_ino() is provided.
Convert inline copies of vn_get_ino() in msdosfs and cd9660 into the
uses of vn_get_ino_gen().
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
o Do not use UMA refcount zone. The problem with this zone is that
several refcounting words (16 on amd64) share the same cache line,
and issueing atomic(9) updates on them creates cache line contention.
Also, allocating and freeing them is extra CPU cycles.
Instead, refcount the page directly via vm_page_wire() and the sfbuf
via sf_buf_alloc(sf_buf_page(sf)) [1].
o Call refcounting/freeing function for EXT_SFBUF via direct function
call, instead of function pointer. This removes barrier for CPU
branch predictor.
o Do not cleanup the mbuf to be freed in mb_free_ext(), merely to
satisfy assertion in mb_dtor_mbuf(). Remove the assertion from
mb_dtor_mbuf(). Use bcopy() instead of manual assignments to
copy m_ext in mb_dupcl().
[1] This has some problems for now. Using sf_buf_alloc() merely to
increase refcount is expensive, and is broken on sparc64. To be
fixed.
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
Code trying to take a look has to check fd_refcnt and it is 0 by that time.
This is a follow up to r268505, without this the code would leak memory for
tables bigger than the default.
MFC after: 1 week
amount of resident pages, in fact calculates the amount of installed
pte entries in the region. Resident pages which were not soft-faulted
yet are not counted.
Calculate the amount of resident pages by looking in the objects chain
backing the region.
Add a knob to disable the residency calculation at all. For large
sparce regions, either previous or updated algorithm runs for too long
time, while several introspection tools do not need the (advisory) RSS
value at all.
PR: kern/188911
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
tools/regression/file/flock/flock.c, which completes the fix in
r192685. When the lock was stolen from us, retry the whole lock
sequence in kernel, instead of returning EINTR to usermode and hoping
that application would handle it correctly by restarting the lock
acquire.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
This includes:
o All directories named *ia64*
o All files named *ia64*
o All ia64-specific code guarded by __ia64__
o All ia64-specific makefile logic
o Mention of ia64 in comments and documentation
This excludes:
o Everything under contrib/
o Everything under crypto/
o sys/xen/interface
o sys/sys/elf_common.h
Discussed at: BSDcan
getenv_xxx() functions instead of strtoq(), because the getenv_xxx()
functions include wrappers for various postfixes like G/M/K, which
strtoq() doesn't do.
Mark cpu_search_lowest/cpu_search_highest/cpu_search_both as noinline,
while cpu_search() gets always_inline. With the attributes set,
cpu_search() is inlined in wrappers, and if()s with constant
conditionals are optimized.
On some tests on many-core machine, the hwpmc reported samples for
cpu_search*() are reduced from 25% total to 9%.
Submitted by: "Rang, Anton" <anton.rang@isilon.com>
MFC after: 1 week
The UEFI framebuffer driver vt_efifb requires vt(4), so add a mechanism
for the startup routine to set the preferred console. This change is
ugly because console init happens very early in the boot, making a
cleaner interface difficult. This change is intended only to facilitate
the sc(4) / vt(4) transition, and can be reverted once vt(4) is the
default.
It is used only during execve (i.e. singlethreaded), so there is no fear
of returning 'not shared' which soon becomes 'shared'.
While here reorganize the code a little to avoid proc lock/unlock in
shared case.
MFC after: 1 week
Remove duplicate "debug_ktr.mask" sysctl definition.
Remove now unused variable from "kern_ktr.c".
This fixes build of "ktr" which was broken by r267961.
Let the default value for "vm_kmem_size_scale" be zero. It is setup
after that the sysctl has been initialized from "getenv()" in the
"kmeminit()" function to equal the "VM_KMEM_SIZE_MAX" value, if
zero. On Sparc64 the "VM_KMEM_SIZE_MAX" macro is not a constant. This
fixes build of Sparc64 which was broken by r267961.
Add a special macro to dynamically create SYSCTL root nodes, because
root nodes have a special parent. This fixes build of existing OFED
module and CANBUS module for pc98 which was broken by r267961.
Add missing "sysctl.h" includes to get the needed sysctl header file
declarations. This is needed after r267961.
MFC after: 2 weeks
Filetable can be shared with other processes. Previous code failed to
clear the pointer for all but the last process getting rid of the table.
This is mostly cosmetics.
Get rid of 'This should happen earlier' comment. Clearing the pointer in
this place is fine as consumers can reliably check for files availability
by inspecting fd_refcnt and vnodes availabity by NULL-checking them.
MFC after: 1 week
These changes prevent sysctl(8) from returning proper output,
such as:
1) no output from sysctl(8)
2) erroneously returning ENOMEM with tools like truss(1)
or uname(1)
truss: can not get etype: Cannot allocate memory
implement options TERMINAL_{KERN,NORM}_ATTR. These are aliased to
SC_{KERNEL_CONS,NORM}_ATTR and like these latter, allow to change the
default colors of normal and kernel text respectively.
Note on the naming: Although affecting the output of vt(4), technically
kern/subr_terminal.c is primarily concerned with changing default colors
so it would be inconsistent to term these options VT_{KERN,NORM}_ATTR.
Actually, if the architecture and abstraction of terminal+teken+vt would
be perfect, dev/vt/* wouldn't be touched by this commit at all.
Reviewed by: emaste
MFC after: 3 days
Sponsored by: Bally Wulff Games & Entertainment GmbH
With this change and previous work from ray@ it will be possible to put
both in GENERIC, and have one enabled by default, but allow the other to
be selected via the loader.
(The previous implementation had separate kern.vt.disable and
hw.syscons.disable tunables, and would panic if both drivers were
compiled in and neither was explicitly disabled.)
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.
Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.
MFC after: 2 weeks
Sponsored by: Mellanox Technologies
If passed cm->cmsg_len was below cmsghdr size the experssion:
datalen = (caddr_t)cm + cm->cmsg_len - (caddr_t)data;
would give negative result. However, in practice it would not
result in a crash because the kernel would try to obtain garbage fds
for given process and would error out with EBADF.
PR: 124908
Submitted by: campbell mumble.net (modified a little)
MFC after: 1 week
o assert in each one that fdp is not shared
o remove unnecessary NULL checks - all userspace processes have fdtables
and kernel processes cannot execve
o remove comments about the danger of fd_ofiles getting reallocated - fdtable
is not shared and fd_ofiles could be only reallocated if new fd was about to be
added, but if that was possible the code would already be buggy as setugidsafety
work could be undone
MFC after: 1 week
We can read refcnt safely and only care if it is equal to 1.
If it could suddenly change from 1 to something bigger the code would be
buggy even in the previous form and transitions from > 1 to 1 are equally racy
and harmless (we copy even though there is no need).
MFC after: 1 week
binding their threads to particular CPU.
Changing ithread cpu mask is now performed by special cpuset_setithread().
It creates additional cpuset root group on first bind invocation.
No objection: jhb
Tested by: hiren
MFC after: 2 weeks
Sponsored by: Yandex LLC
fd_lastfile is guaranteed to be the biggest open fd, so when the intent
is to iterate over active fds or lookup one, there is no point in looking
beyond that limit.
Few places are left unpatched for now.
MFC after: 1 week
permissions test, forgotten in r164033.
Refactor the permission checks for utimes(2) into vnode helper
function vn_utimes_perm(9), and simplify its code comparing with the
UFS origin, by writing the call to VOP_ACCESSX only once. Use the
helper for UFS(5), tmpfs(5), devfs(5) and msdosfs(5).
Reported by: bde
Reviewed by: bde, trasz
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
the queue where to enqueue pages that are going to be unwired.
- Add stronger checks to the enqueue/dequeue for the pagequeues when
adding and removing pages to them.
Of course, for unmanaged pages the queue parameter of vm_page_unwire() will
be ignored, just as the active parameter today.
This makes adding new pagequeues quicker.
This change effectively modifies the KPI. __FreeBSD_version will be,
however, bumped just when the full cache of free pages will be
evicted.
Sponsored by: EMC / Isilon storage division
Reviewed by: alc
Tested by: pho
VM due to copyin(9) faulting while VFS locks are held is
deadlock-prone there in the same way as for the write(2) syscall.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
avoid congestion on global mountlist_mtx mutex in vfs_busyfs(), while
traversing through the list of mount points.
This change significantly improves NFS server scalability, since it had
to do this translation for every request, and the global lock becomes quite
congested.
This code is more optimized for relatively small number of mount points.
On systems with hundreds of active mount points this simple cache may have
many collisions. But the original traversal code in that case should also
behave much worse, so we are not loosing much.
Reviewed by: attilio
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
flags, to rwlock. Lock it in read mode when used from subroutines
called from buffer release code paths.
The needsbuffer is now updated using atomics, while read lock of
nblock prevents loosing the wakeups from bufspacewakeup() and
bufcountadd() in getnewbuf_bufd_help().
In several interesting loads, needsbuffer flags are never set, while
buffers are reused quickly. This causes brelse() and bqrelse() from
different threads to content on the nblock. Now they take nblock in
read mode, together with needsbuffer not needing an update, allowing
higher parallelism.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
This allows to mostly avoid lock usage in getnewvnode_[drop_]reserve(),
that reduces number of global vnode_free_list_mtx mutex acquisitions
from 4 to 2 per NFS request on ZFS, improving SMP scalability.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
using a direct hook called from kern_vfs_bio_buffer_alloc().
Mark ffs_rawread.c as requiring both ffs and directio options to be
compiled into the kernel. Add ffs_rawread.c to the list of ufs.ko
module' sources.
In addition to stopping breaking the layering violation, it also
allows to link kernel when FFS is configured as module and DIRECTIO is
enabled.
One consequence of the change is that ffs_rawread.o is always linked
into the module regardless of the DIRECTIO option. This is similar to
the option QUOTA and ufs_quota.c.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
In the .o file, this only changes some line numbers (head amd64) because
element 0 is no longer explicitly initialized.
This should make bugs like FreeBSD-SA-14:12.ktrace less likely.
Discussed with: des
MFC after: 1 week
interface allows the ifnet structure to be defined as an opaque
type in NIC drivers. This then allows the ifnet structure to be
changed without a need to change or recompile NIC drivers.
Put differently, NIC drivers can be written and compiled once and
be used with different network stack implementations, provided of
course that those network stack implementations have an API and
ABI compatible interface.
This commit introduces the 'if_t' type to replace 'struct ifnet *'
as the type of a network interface. The 'if_t' type is defined as
'void *' to enable the compiler to perform type conversion to
'struct ifnet *' and vice versa where needed and without warnings.
The functions that implement the API are the only functions that
need to have an explicit cast.
The MII code has been converted to use the driver API to avoid
unnecessary code churn. Code churn comes from having to work with
both converted and unconverted drivers in correlation with having
callback functions that take an interface. By converting the MII
code first, the callback functions can be defined so that the
compiler will perform the typecasts automatically.
As soon as all drivers have been converted, the if_t type can be
redefined as needed and the API functions can be fix to not need
an explicit cast.
The immediate benefactors of this change are:
1. Juniper Networks - The network stack implementation in Junos
is entirely different from FreeBSD's one and this change
allows Juniper to build "stock" NIC drivers that can be used
in combination with both the FreeBSD and Junos stacks.
2. FreeBSD - This change opens the door towards changing ifnet
and implementing new features and optimizations in the network
stack without it requiring a change in the many NIC drivers
FreeBSD has.
Submitted by: Anuranjan Shukla <anshukla@juniper.net>
Reviewed by: glebius@
Obtained from: Juniper Networks, Inc.
This _was_ right, a last minute suggestion and not enough testing makes
Adrian a bad boy.
Tested:
* igb(4) with RSS patches, by hand verifying each igb(4) taskqueue
tid from procstat -ka using cpuset -g -t <tid>.
flags that has several bits cleared. The RF_WANTED and RF_FIRSTSHARE
bits are invalid in this context, and we want to defer setting RF_ACTIVE
in r_flags until later. This should make rman_get_flags() return
the correct answer in all cases.
Add a KASSERT() to catch callers which incorrectly pass the RF_WANTED
or RF_FIRSTSHARE flags.
Do a strict equality check on the share type bits of flags. In
particular, do an equality check on RF_PREFETCHABLE. The previous
code would allow one type of mismatch of RF_PREFETCHABLE but disallow
the other type of mismatch. Also, ignore the the RF_ALIGNMENT_MASK
bits since alignment validity should be handled by the amask check.
This field contains an integer value, but previous code did a strange
bitwise comparison on it.
Leave the original value of flags unmolested as a minor debug aid.
Change the start+amask overflow check to a KASSERT() since it is just
meant to catch a highly unlikely programming error in the caller.
Reviewed by: jhb
MFC after: 1 month
taskqueue worker thread(s) to.
For now it isn't a taskqueue/taskthread error to fail to pin
to the given cpuid.
Thanks to rpaulo@, kib@ and jhb@ for feedback.
Tested:
* igb(4), with local RSS patches to pin taskqueues.
TODO:
* ask the doc team for help in documenting the new API call.
* add a taskqueue_start_threads_cpuset() method which takes
a cpuset_t - but this may require a bunch of surgery to
bring cpuset_t into scope.