- Read interrupt properties at bus enumeration time and store
it into global mapping table.
- At bus_activate_resource() time, given mapping entry is resolved and
connected to real interrupt source. A copy of mapping entry is attached
to given resource.
- At bus_setup_intr() time, mapping entry stored in resource is used
for delivery of requested interrupt configuration.
- For MSI/MSIX interrupts, mapping entry is created within
pci_alloc_msi()/pci_alloc_msix() call.
- For legacy PCI interrupts, mapping entry must be created within
pcib_route_interrupt() by pcib driver itself.
Reviewed by: nwhitehorn, andrew
Differential Revision: https://reviews.freebsd.org/D7493
aio_aqueue() calls aio_init_aioinfo() as the first action. There is no
need to duplicate the code in kern_aio_fsync().
Also fix indent for aio_aqueue() definition.
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D7523
Right now, userspace (fast) gettimeofday(2) on x86 only works for
RDTSC. For older machines, like Core2, where RDTSC is not C2/C3
invariant, and which fall to HPET hardware, this means that the call
has both the penalty of the syscall and of the uncached hw behind the
QPI or PCIe connection to the sought bridge. Nothing can me done
against the access latency, but the syscall overhead can be removed.
System already provides mappable /dev/hpetX devices, which gives
straight access to the HPET registers page.
Add yet another algorithm to the x86 'vdso' timehands. Libc is updated
to handle both RDTSC and HPET. For HPET, the index of the hpet device
to mmap is passed from kernel to userspace, index might be changed and
libc invalidates its mapping as needed.
Remove cpu_fill_vdso_timehands() KPI, instead require that
timecounters which can be used from userspace, to provide
tc_fill_vdso_timehands{,32}() methods. Merge i386 and amd64
libc/<arch>/sys/__vdso_gettc.c into one source file in the new
libc/x86/sys location. __vdso_gettc() internal interface is changed
to move timecounter algorithm detection into the MD code.
Measurements show that RDTSC even with the syscall overhead is faster
than userspace HPET access. But still, userspace HPET is three-four
times faster than syscall HPET on several Core2 and SandyBridge
machines.
Tested by: Howard Su <howard0su@gmail.com>
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D7473
it (either async or sync drain).
At this moment the only user of drain is TCP, but TCP wouldn't reschedule a
callout after it has drained it, since it drains only when a tcpcb is closed.
This for now the problem isn't observed.
Submitted by: rrs
Make the kern_fsync() function public, so that it can be used by other
parts of the kernel. Fix up existing consumers to make use of it.
Requested by: kib
If the caller of sem_post() wakes up a thread sleeping via sem_wait()
before it clears the has_waiters flag, the caller of sem_wait() has no way of
knowing when it is safe to destroy the semaphore and reuse the memory. This is
because the caller of sem_post() may be interrupted between the wake step and
the clearing of has_waiters. It will then write into the has_waiters flag in
userspace after being preempted for some unknown amount of time.
Reviewed by: jhb, kib, vangyzen
Approved by: kib (mentor), vangyzen (mentor)
MFC after: 2 weeks
Sponsored by: Dell Inc.
Differential Revision: https://reviews.freebsd.org/D7505
Standard VOP_FSYNC() implementation just syncs data buffers, and due
to this, is the correct and efficient implementation for msdosfs or
any other filesystem which uses bufer cache trivially. Provide
globally visible wrapper vop_stdfdatasync_buf() for future consumption
by other filesystems.
Reviewed by: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D7471
The syscall is a trivial wrapper around new VOP_FDATASYNC(), sharing
code with fsync(2). For all filesystems, this commit provides the
implementation which delegates the work of VOP_FDATASYNC() to
VOP_FSYNC(). This is functionally correct but not efficient.
This is not yet POSIX-compliant implementation, because it does not
ensure that queued AIO requests are completed before returning.
Reviewed by: mckusick
Discussed with: avg (ZFS), jhb (AIO part)
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D7471
It was remarkably hard to trace all current threads. "show pcpu" only
showed the pid, and there was nothing (?) better than searching ps output
to find the tids on CPUs. This change simplifies the search, but you
still have to trace the tid for each CPU manually.
incorrect from the error cases in exec_map_first_page(). They are
unnecessary because we automatically unbusy the page in vm_page_free()
when we remove it from the object. The calls are incorrect because they
happen after the page is freed, so we might actually unbusy the page
after it has been reallocated to a different object. (This error was
introduced in r292373.)
Reviewed by: kib
MFC after: 1 week
- Move group task queue into kern/subr_gtaskqueue.c
- Change intr_enable to return an int so it can be detected if it's not
implemented
- Allow different TX/RX queues per set to be different sizes
- Don't split up TX mbufs before transmit
- Allow a completion queue for TX as well as RX
- Pass the RX budget to isc_rxd_available() to allow an earlier return
and avoid multiple calls
Submitted by: shurd
Reviewed by: gallatin
Approved by: scottl
Differential Revision: https://reviews.freebsd.org/D7393
It was added in r153192 for XFS and doesn't appear to have been used for
anything else. XFS was disconnected in r241607 and removed entirely in
r247631.
Reported by: mlaier
Reviewed by: imp, kib
Differential Revision: https://reviews.freebsd.org/D7468
processes which combine kernel and non-kernel threads, e.g. nfsd. For
such processes, termination of a kthread must recheck signal delivery
among other threads according to masks.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Several files use the internal name of `struct device` instead of
`device_t` which is part of the public API. This patch changes all
`struct device *` to `device_t`.
The remaining occurrences of `struct device` are those referring to the
Linux or OpenBSD version of the structure, or the code is not built on
FreeBSD and it's unclear what to do.
Submitted by: Matthew Macy <mmacy@nextbsd.org> (previous version)
Approved by: emaste, jhibbits, sbruno
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D7447
_prison_check_ip4 renamed to prison_check_ip4_locked
Move IPv6-specific jail functions to new file netinet6/in6_jail.c
_prison_check_ip6 renamed to prison_check_ip6_locked
Add appropriate prototypes to sys/sys/jail.h
Adjust kern_jail.c to call prison_check_ip4_locked and
prison_check_ip6_locked accordingly.
Add netinet/in_jail.c and netinet6/in6_jail.c to the list of files that
need to be built when INET and INET6, respectively, are configured in the
kernel configuration file.
Reviewed by: jtl
Approved by: sjg (mentor)
Sponsored by: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D6799
If the listening socket is closed while sonewconn() is executing, the
nascent child socket is aborted, which results in recursion on the
unp_link lock when the child's pru_detach method is invoked. Fix this
by using a flag to mark such sockets, and skip a part of the socket's
teardown during detach.
Reported by: Raviprakash Darbha <rdarbha@juniper.net>
Tested by: pho
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D7398
r296773 was done to only remove libc symbols for <7. We want to provide
the syscall symbols going forward for 7+.
Discussed with: jhb
MFC after: 3 days
Sponsored by: EMC / Isilon Storage Division
Our mprotect() function seems to take a "const void *" address to the
pages whose permissions need to be adjusted. POSIX uses "void *". Simply
stick to the POSIX one to prevent us from writing unportable code.
PR: 211423 (exp-run)
Tested by: antoine@ (Thanks!)
All current spinning loops retry an atomic op the first chance they get,
which leads to performance degradation under load.
One classic solution to the problem consists of delaying the test to an
extent. This implementation has a trivial linear increment and a random
factor for each attempt.
For simplicity, this first thouch implementation only modifies spinning
loops where the lock owner is running. spin mutexes and thread lock were
not modified.
Current parameters are autotuned on boot based on mp_cpus.
Autotune factors are very conservative and are subject to change later.
Reviewed by: kib, jhb
Tested by: pho
MFC after: 1 week
Both variables are uint64_t, but they only count spins or sleeps.
All reasonable values which we can get here comfortably hit in 32-bit range.
Suggested by: kib
MFC after: 1 week
If a thread is created bound to a cpuset it might already be bound before
it's very first timeslice, and td_lastcpu will be NOCPU in that case.
MFC after: 1 week
- Use correct lock in aio_cancel_sync when dequeueing job.
- Add _locked variants of aio_set/clear_cancel_function and use those
to avoid lock recursion when adding and removing fsync jobs to the
per-process sync queue.
- While here, add a basic test for aio_fsync().
PR: 211390
Reported by: Randy Westlund <rwestlun@gmail.com>
MFC after: 1 week
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D7339
Any sensible workflow will include a revision control system from which
to restore the old files if required. In normal usage, developers just
have to clean up the mess.
Reviewed by: jhb
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D7353
It looks like the msgrcv() system call is already written in such a way
that the size is internally computed as a size_t and written into all of
td_retval[0]. This means that it is effectively already returning
ssize_t. It's just that the userspace prototype doesn't match up.
specifics of callout KPI. Esp., do not depend on the exact interface
of callout_stop(9) return values.
The main change is that instead of requiring precise callouts, code
maintains absolute time to wake up. Callouts now should ensure that a
wake occurs at the requested moment, but we can tolerate both run-away
callout, and callout_stop(9) lying about running callout either way.
As consequence, it removes the constant source of the bugs where
sleepq_check_timeout() causes uninterruptible thread state where the
thread is detached from CPU, see e.g. r234952 and r296320.
Patch also removes dual meaning of the TDF_TIMEOUT flag, making code
(IMO much) simpler to reason about.
Tested by: pho
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D7137
callout_when(9). See the man page update for the description of the
intended use.
Tested by: pho
Reviewed by: jhb, bjk (man page updates)
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
X-Differential revision: https://reviews.freebsd.org/D7137
target. Due to a way issignal() selects the next signal to deliver
and report, if the simultaneous or already pending another signal
exists, that signal might be reported by the next waitpid(2) call.
This causes minor annoyance for debuggers, which must be prepared to
take any signal as the first event, then filter SIGSTOP later.
More importantly, for tools like gcore(1), which attach and then
detach without processing events, SIGSTOP might leak to be delivered
after PT_DETACH. This results in the process being unintentionally
stopped after detach, which is fatal for automatic tools.
The solution is to force SIGSTOP to be the first signal reported after
the attach. Attach code is modified to set P2_PTRACE_FSTP to indicate
that the attaching ritual was not yet finished, and issignal() prefers
SIGSTOP in that condition. Also, the thread which handles
P2_PTRACE_FSTP is made to guarantee to own p_xthread during the first
waitpid(2). All that ensures that SIGSTOP is consumed first.
Additionally, if P2_PTRACE_FSTP is still set on detach, which means
that waitpid(2) was not called at all, SIGSTOP is removed from the
queue, ensuring that the process is resumed on detach.
In issignal(), when acting on STOPing signals, remove the signal from
queue before suspending. Otherwise parallel attach could result in
ptracestop() acting on that STOP as if it was the STOP signal from the
attach. Then SIGSTOP from attach leaks again.
As a minor refactoring, some bits of the common attach code is moved
to new helper proc_set_traced().
Reported by: markj
Reviewed by: jhb, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D7256
- Move cr_canseeinpcb to sys/netinet/in_prot.c in order to separate the
INET and INET6-specific code from the rest of the prot code (It is only
used by the network stack, so it makes sense for it to live with the
other network stack code.)
- Move cr_canseeinpcb prototype from sys/systm.h to netinet/in_systm.h
- Rename cr_seeotheruids to cr_canseeotheruids and cr_seeothergids to
cr_canseeothergids, make them non-static, and add prototypes (so they
can be seen/called by in_prot.c functions.)
- Remove sw_csum variable from ip6_forward in ip6_forward.c, as it is an
unused variable.
Reviewed by: gnn, jtl
Approved by: sjg (mentor)
Sponsored by: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D2901
and there is no other issues with parallel settime(). Remove spl()
vestiges there as well.
Tested by: pho (as part of the whole patch)
Reviewed by: jhb (same)
Discussed wit: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D7302
setclock() and from simultaneous top-level and interrupt. For this,
tc_windup() is protected with a tc_setclock_mtx spinlock, in the try
mode when called from hardclock interrupt. If spinlock cannot be
obtained without spinning from the interrupt context, this means that
top-level executes tc_windup() on other core and our try may be
avoided.
The boottimebin and boottime variables should be adjusted from
tc_windup(). To be correct, they must be part of the timehands and
read using lockless protocol. Remove the globals and reimplement the
getboottime(9)/getboottimebin(9) KPI using the timehands read
protocol.
Tested by: pho (as part of the whole patch)
Reviewed by: jhb (same)
Discussed wit: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
X-Differential revision: https://reviews.freebsd.org/D7302
Change ntpadj_lock to spinlock always, and rename stuff removing
ADJ/adj from the names. ntp_update_second() requires ntp_lock and is
called from the tc_windup(), so ntp_lock must be a spinlock. Add
missed lock to ntp_update_second().
Tested by: pho (as part of the whole patch)
Reviewed by: jhb (same)
Noted by: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
X-Differential revision: https://reviews.freebsd.org/D7302
consumers can now be only one tc_windup() call late.
Use C99 initialization.
Tested by: pho (as part of the whole patch)
Reviewed by: jhb (same)
Discussed with: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
X-Differential revision: https://reviews.freebsd.org/D7302
and getboottimebin(9) KPI. Change consumers of boottime to use the
KPI. The variables were renamed to avoid shadowing issues with local
variables of the same name.
Issue is that boottime* should be adjusted from tc_windup(), which
requires them to be members of the timehands structure. As a
preparation, this commit only introduces the interface.
Some uses of boottime were found doubtful, e.g. NLM uses boottime to
identify the system boot instance. Arguably the identity should not
change on the leap second adjustment, but the commit is about the
timekeeping code and the consumers were kept bug-to-bug compatible.
Tested by: pho (as part of the bigger patch)
Reviewed by: jhb (same)
Discussed with: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
X-Differential revision: https://reviews.freebsd.org/D7302
number of core files allowed by a particular process when using the %I core
file name pattern.
Sanity check at compile time to ensure the value is within the valid range of
0-10.
Reviewed by: jtl, sjg
Approved by: sjg (mentor)
Sponsored by: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D6812
It looks like our "struct shmid_ds::shm_nattch" deviates from the
standard in the sense that it is a signed integer, whereas POSIX
requires that it is unsigned, having a special type shmatt_t.
Patch up our native and 32-bit copies to use a new shmatt_t that is an
unsigned integer. As it's unsigned, we can relax the comparisons that
are performed on it. Leave the Linux, iBCS2, etc. copies of the
structure alone.
Reviewed by: ngie
Differential Revision: https://reviews.freebsd.org/D6655
Devfs' file layer ioctl is now just a thin shim around the vnode layer.
Reviewed by: kib
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D7286
The asynchronous I/O changes made previously result in different
behavior out of the box. Previously all AIO requests failed with
ENOSYS / SIGSYS unless aio.ko was explicitly loaded. Now, some AIO
requests complete and others ("unsafe" requests) fail with EOPNOTSUPP.
Reword the introductory paragraph in aio(4) to add a general
description of AIO before describing the vfs.aio.enable_unsafe sysctl.
Remove the ENOSYS error description from aio_fsync(2), aio_read(2),
and aio_write(2) and replace it with a description of EOPNOTSUPP.
Remove the ENOSYS error description from aio_mlock(2).
Log a message to the system log the first time a process requests an
"unsafe" AIO request that fails with EOPNOTSUPP. This is modeled on
the log message used for processes using the legacy pty devices.
Reviewed by: kib (earlier version)
MFC after: 1 week
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D7151
Two notes:
- I allow AIO on reclaimed vnodes, since it is deterministically terminated
fast.
- devfs mounts are marked as MNT_LOCAL, but device vnodes have type
VCHR, so the slow device io is not allowed.
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D7273
warnings for some kernel events, mostly intended for the use of
obsoleted or otherwise undersired interfaces.
This is an abstracted and race-expelled code from compat pty driver.
Requested and reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D7270
The each_writable_segment routine evaluates segments on a slightly little more
nuanced metric than simply "writable" or not. Rename the function to more
closely match its behavior (each_dumpable_segment).
Suggested by: jhb
Sponsored by: EMC / Isilon Storage Division
The ELF e_phnum field is only 16 bits wide. To support more than 65535 segments
(program headers), Sun's "Linker and Libraries Guide" table 7-7 (or 12-7,
depending on document version) prescribes a special first section header where
sh_info represents the real number of program headers.
Test code to follow, when it is ready.
Reference: http://docs.oracle.com/cd/E18752_01/pdf/817-1984.pdf
Reviewed by: emaste, markj
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D7255
fixes that I think closes up the races Gleb was
looking for. This is running quite nicely in Netflix and
now no longer causes TCP-tcb leaks.
Differential Revision: 7135
When threads were added to the kernel, the pr_pid member of the
NT_PRSTATUS note was repurposed to store LWP IDs instead of process
IDs. However, the process ID was no longer recorded in core dumps.
This change adds a pr_pid field to prpsinfo (NT_PRSINFO). Rather than
bumping the prpsinfo version number, note parsers can use the note's
payload size to determine if pr_pid is present.
Reviewed by: kib, emaste (older version)
MFC after: 2 months
Differential Revision: https://reviews.freebsd.org/D7117
First, PL_FLAG_FORKED events now also set a PL_FLAG_VFORKED flag when
the new child was created via vfork() rather than fork(). Second, a
new PL_FLAG_VFORK_DONE event can now be enabled via the PTRACE_VFORK
event mask. This new stop is reported after the vfork parent resumes
due to the child calling exit or exec. Debuggers can use this stop to
reinsert breakpoints in the vfork parent process before it resumes.
Reviewed by: kib
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D7045
is delivered to vforked child. Issue is that we avoid stopping such
children in issignal() to not block parents. But executed AST, which
ignored stops, leaves the child with the signal pending but no AST
pending.
On first exec after vfork(), call signotify() to handle pending
reenabled signals. Adjust the assert to not check vfork children
until exec.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
that struct kevent member ident has uintptr_t type, which is silently
truncated to int in the call to fget(). Explicitely check for the
valid range.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
object lock.
The vmspace_free() operations might need to lock map, object etc on
last dereference. Postpone the free until object's inspection is
done.
Reported and tested by: will
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
ptrace() now stores a mask of optional events in p_ptevents. Currently
this mask is a single integer, but it can be expanded into an array of
integers in the future.
Two new ptrace requests can be used to manipulate the event mask:
PT_GET_EVENT_MASK fetches the current event mask and PT_SET_EVENT_MASK
sets the current event mask.
The current set of events include:
- PTRACE_EXEC: trace calls to execve().
- PTRACE_SCE: trace system call entries.
- PTRACE_SCX: trace syscam call exits.
- PTRACE_FORK: trace forks and auto-attach to new child processes.
- PTRACE_LWP: trace LWP events.
The S_PT_SCX and S_PT_SCE events in the procfs p_stops flags have
been replaced by PTRACE_SCE and PTRACE_SCX. PTRACE_FORK replaces
P_FOLLOW_FORK and PTRACE_LWP replaces P2_LWP_EVENTS.
The PT_FOLLOW_FORK and PT_LWP_EVENTS ptrace requests remain for
compatibility but now simply toggle corresponding flags in the
event mask.
While here, document that PT_SYSCALL, PT_TO_SCE, and PT_TO_SCX both
modify the event mask and continue the traced process.
Reviewed by: kib
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D7044
As was reported on http://seclists.org/oss-sec/2016/q3/68, tmpfs code
contains assertion that rdev != VNOVAL. On FreeBSD, there is no other
consequences except triggering the assert. To be compatible with
systems where device nodes have some significance, reject mknod(2)
call with dev == VNOVAL at the syscall level.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Fill in pr_psargs in the NT_PRSINFO ELF core dump note with command
line arguments.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D7116
A buf's b_pages and b_npages fields may be inconsistent after a panic.
For instance, vfs_vmio_invalidate() sets b_npages to zero only after all
pages are unwired and their page array entries are cleared.
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
Set vm_guest to a new enum value (VM_GUEST_KVM) when kvm is detected and use
vm_guest in conditionals testing for KVM.
Also, fix a conditional checking if we're running in a VM which caught only
the generic VM case, but not more specific VMs (KVM, VMWare, etc.). (Spotted
by: vangyzen).
Differential revision: https://reviews.freebsd.org/D7172
Sponsored by: Dell Inc.
Approved by: kib (mentor), vangyzen (mentor)
Reviewed by: alc
MFC after: 4 weeks
Despite the implication (process has pending signals -> the current
thread marked for AST and has TDF_NEEDSIGCHK set) is not true due to
other thread might manipulate its signal blocking mask, it should still
hold for the single-threaded processes. Enable check for the condition
for single-threaded case, and replicate it from userret() to ast() as
well, where we check that ast indeed has no signal to deliver.
Note that the check is under DIAGNOSTIC, it is not enabled for INVARIANTS
but !DIAGNOSTIC since it imposes too heavy-weight locking for day-to-day
used debugging kernel.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
AST must not execute with TDF_SBDRY or TDF_SEINTR/TDF_SERESTART thread
flags set, which is asserted in userret(). As the consequence, -1 return
from cursig() must not be possible.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
if vnode is VMIO. For VMIO vnodes, set BO_DEAD in vm_object_terminate().
The vnode_destroy_object(), when calling into vm_object_terminate(),
must be able to flush buffers. BO_DEAD purpose is to quickly destroy
buffers on write when the underlying vnode is not operable any more
(one example is the devfs node after geom is gone). Setting BO_DEAD
for reclaiming vnode before object is terminated is premature, and
results in unability to flush buffers with live SU dependencies from
vinvalbuf() in vm_object_terminate().
Reported by: David Cross <dcrosstech@gmail.com>
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks