8 or more cores to improve utilization. None of my tests on 2xXeon (2x6x2)
system shown any slowdown from mentioned "excess thrashing". Same time in
pbzip2 test with number of threads more then number of CPUs I see up to 10%
speedup with SMT disabled and up 5% with SMT enabled. Thinking about
trashing I was trying to limit that stealing within same last level cache,
but got only worse results. Present code any way prefers to steal threads
from topologically closer cores.
Sponsored by: iXsystems, Inc.
- remove extra dynamic variable initializations;
- restore (4BSD) and implement (ULE) hogticks variable setting;
- make sched_rr_interval() more tolerant to options;
- restore (4BSD) and implement (ULE) kern.sched.quantum sysctl, a more
user-friendly wrapper for sched_slice;
- tune some sysctl descriptions;
- make some style fixes.
allocated softc structure which is returned by
device_get_softc(). This method can be used to
easily implement softc refcounting. This can be
desirable when the softc has memory references
which are controlled by userspace handles for
example.
This solves the problem of blocking the caller
of device_detach() for a non-deterministic time.
Discussed with: kib, ed
MFC after: 2 weeks
the wrong direction. Before it, if preemption and end of time slice happen
same time, thread was put to the head of the queue as for only preemption.
It could cause single thread to run for indefinitely long time. r220198
handles it by not clearing TDF_NEEDRESCHED in case of preemption. But that
causes delayed context switch every time preemption happens, even when not
needed.
Solve problem by introducing scheduler-specifoc thread flag TDF_SLICEEND,
set when thread's time slice is over and it should be put to the tail of
queue. Using SW_PREEMPT flag for that purpose as it was before just not
enough informative to work correctly.
On my tests this by 2-3 times reduces run time deviation (improves fairness)
in cases when several threads share one CPU.
Reviewed by: fabient
MFC after: 2 months
Sponsored by: iXsystems, Inc.
With switchticks variable being reset each time thread preempted (that is
done regularly by interrupt threads) scheduling quantum may never expire.
It was not noticed in time because several other factors still regularly
trigger context switches.
Handle the problem by replacing that mechanism with its equivalent from
SCHED_ULE called time slice. It is effectively the same, just measured in
context of stathz instead of hz. Some unification is probably not bad.
In rare event when fast and ithread interrupts share the same vector
and the fast handler was registered first, we can end up trying to
schedule the ithread that is not created yet. The kernel built with
INVARIANTS then triggers an assertion.
Change the order to create the ithread first and only then add the
handler that needs it to the interrupt event handlers list.
Reviewed by: jhb
to pull vm_param.h was removed. Other big dependency of vm_page.h on
vm_param.h are PA_LOCK* definitions, which are only needed for
in-kernel code, because modules use KBI-safe functions to lock the
pages.
Stop including vm_param.h into vm_page.h. Include vm_param.h
explicitely for the kernel code which needs it.
Suggested and reviewed by: alc
MFC after: 2 weeks
Fix an issue related to old periodic timers. The code in kern_clocksource.c
uses interrupt to keep track of time, and this time may not match with
binuptime(). In order to address such incoherency, switch periodic timers
to binuptime().
Except further calloutng it is needed for already present cyclic subsystem.
lock is obtained before the write count is increased during open() and the
lock is released after the write count is decreased during close().
The first change closes a race where an open() that will block with O_SHLOCK
or O_EXLOCK can increase the write count while it waits. If the process
holding the current lock on the file then tries to call exec() on the file
it has locked, it can fail with ETXTBUSY even though the advisory lock is
preventing other threads from succesfully completeing a writable open().
The second change closes a race where a read-only open() with O_SHLOCK or
O_EXLOCK may return successfully while the write count is non-zero due to
another descriptor that had the advisory lock and was blocking the open()
still being in the process of closing. If the process that completed the
open() then attempts to call exec() on the file it locked, it can fail with
ETXTBUSY even though the other process that held a write lock has closed
the file and released the lock.
Reviewed by: kib
MFC after: 1 month
I found 8.3 is a history BSD version using socket to implement FIFO
pipe, it uses per-file seqcount to compare with writer generation
stored in per-pipe object. The concept is after all writers are gone,
the pipe enters next generation, all old readers have not closed the
pipe should get the indication that the pipe is disconnected, result
is they should get EPIPE, SIGPIPE or get POLLHUP in poll().
But newcomer should not know that previous writters were gone, it
should treat it as a fresh session.
I am trying to bring back FIFO pipe to history behavior. It is still
unclear that if single EOF flag can represent SBS_CANTSENDMORE and
SBS_CANTRCVMORE which socket-based version is using, but I have run
the poll regression test in tool directory, output is same as the one
on 8.3-STABLE now.
I think the output "not ok 18 FIFO state 6b: poll result 0 expected 1.
expected POLLHUP; got 0" might be bogus, because newcomer should not
know that old writers were gone. I got the same behavior on Linux.
Our implementation always return POLLIN for disconnected pipe even it
should return POLLHUP, but I think it is not wise to remove POLLIN for
compatible reason, this is our history behavior.
Regression test: /usr/src/tools/regression/poll
flag but not PIPE_WANTW, but FIFO pipe code does not understand this internal
state, when a FIFO peer reader closes the pipe, it wants to notify the writer,
it checks PIPE_WANTW, if not set, it skips calling wakeup(), so blocked writer
never noticed the case, but in general, the writer should return from the
syscall with EPIPE error code and may get SIGPIPE signal. Setting the
PIPE_WANTW fixed problem, or you can turn off direct write, it should fix the
problem too. This bug is found by PR/170203.
Another bug in FIFO pipe code is when peer closes the pipe, another end which
is being blocked in select() or poll() is not notified, it missed to call
pipeselwakeup().
Third problem is found in poll regression test, the existing code can not
pass 6b,6c,6d tests, but FreeBSD-4 works. This commit does not fix the
problem, I still need to study more to find the cause.
PR: 170203
Tested by: Garrett Copper < yanegomi at gmail dot com >
ktr(4), was constrained to be a power of two. Remove this constraint and
update sys/conf/NOTES accordingly.
Reviewed by: jhb
Approved by: gnn (mentor)
Sponsored by: Google Summer of Code 2012
Append '__' prefix to the tag of struct oflock, and put it under BSD
namespace. Structure is needed both by libc and kernel, thus cannot be
hidden under #ifdef _KERNEL.
Move a set of non-standard F_* and O_* constants into BSD namespace.
SUSv4 explicitely allows implemenation to pollute F_* and O_* names
after fcntl.h is included, but it costs us nothing to adhere
to the specification if exact POSIX compliance level is requested by
user code.
Change some spaces after #define to tabs.
Noted by and discussed with: bde
MFC after: 1 week
ELF parser. Specifically, do not allow note reader and interpreter
path comparision in the brandelf code to read past end of the page.
This may happen if specially crafter ELF image is activated.
Submitted by: Lukasz Wojcik <lukasz.wojcik zoho com>
MFC after: 3 days
VM_KMEM_MAX_SIZE.
The code was not taking into account the size of the kernel_map, which
the kmem_map is allocated from, so it could produce a sub-map size too
large to fit. The simplest solution is to ignore VM_KMEM_MAX entirely
and base the memguard map's size off the kernel_map's size, since this
is always relevant and always smaller.
Found by: Justin Hibbits
adds an extra tick to account for the current partial clock tick. However,
that is not appropriate for a repeating timer when the exact tvtohz() value
should be used for subsequent intervals. Fix repeating callouts for
EVFILT_TIMER by subtracting 1 tick from the tvtohz() result similar to the
fix used in realitexpire() for interval timers.
While here, update a few comments to note that if the EVFILT_TIMER code
were to move out of kern_event.c, it should move to kern_time.c (where the
interval timer code it mimics lives) rather than kern_timeout.c.
MFC after: 1 month
These probes are most useful when looking into the structures
they provide, which are listed in io.d. For example:
dtrace -n 'io:genunix::start { printf("%d\n", args[0]->bio_bcount); }'
Note that the I/O systems in FreeBSD and Solaris/Illumos are sufficiently
different that there is not a 1:1 mapping from scripts that work
with one to the other.
MFC after: 1 month
debugger exited without calling ptrace(PT_DETACH), there is a time window
that the p_xthread may be pointing to non-existing thread, in practical,
this is not a problem because child process soon will be killed by parent
process.
to attach to the process, it is surprising that the process is resumed
without inputting any gdb commands, however ptrace manual said:
The tracing process will see the newly-traced process stop and may
then control it as if it had been traced all along.
But the current code does not work in this way, unless traced process
received a signal later, it will continue to run as a background task.
To fix this problem, just send signal SIGSTOP to the traced process after
we resumed it, this works like that you are attaching to a running process,
it is not perfect but better than nothing.
Pass only FEXEC (instead of FREAD|FEXEC) in fgetvp_exec. _fget has to check for
!FWRITE anyway and may as well know about FREAD.
Make _fget code a bit more readable by converting permission checking from if()
to switch(). Assert that correct permission flags are passed.
In collaboration with: kib
Approved by: trasz (mentor)
MFC after: 6 days
X-MFC: with r238220
While here return EBADF for descriptors opened for writing (previously it was ETXTBSY).
Add fgetvp_exec function which performs appropriate checks.
PR: kern/169651
In collaboration with: kib
Approved by: trasz (mentor)
MFC after: 1 week
now fully encapsulates all accesses to f_offset, and extends f_offset
locking to other consumers that need it, in particular, to lseek() and
variants of getdirentries().
Ensure that on 32bit architectures f_offset, which is 64bit quantity,
always read and written under the mtxpool protection. This fixes
apparently easy to trigger race when parallel lseek()s or lseek() and
read/write could destroy file offset.
The already broken ABI emulations, including iBCS and SysV, are not
converted (yet).
Tested by: pho
No objections from: jhb
MFC after: 3 weeks
the scheduled task from tc_windup(). Do it directly from tc_windup in
interrupt context [1].
Establish the permanent mapping of the shared page into the kernel
address space, avoiding the potential need to sleep waiting for
allocation of sf buffer during vdso_timehands update. As a
consequence, shared_page_write_start() and shared_page_write_end()
functions are not needed anymore.
Guess and memorize the pointers to native host and compat32 sysentvec
during initialization, to avoid the need to get shared_page_alloc_sx
lock during the update.
In tc_fill_vdso_timehands(), do not loop waiting for timehands
generation to stabilize, since vdso_timehands is written in the same
interrupt context which wrote timehands.
Requested by: mav [1]
MFC after: 29 days
usermode, using shared page. The structures and functions have vdso
prefix, to indicate the intended location of the code in some future.
The versioned per-algorithm data is exported in the format of struct
vdso_timehands, which mostly repeats the content of in-kernel struct
timehands. Usermode reading of the structure can be lockless.
Compatibility export for 32bit processes on 64bit host is also
provided. Kernel also provides usermode with indication about
currently used timecounter, so that libc can fall back to syscall if
configured timecounter is unknown to usermode code.
The shared data updates are initiated both from the tc_windup(), where
a fast task is queued to do the update, and from sysctl handlers which
change timecounter. A manual override switch
kern.timecounter.fast_gettime allows to turn off the mechanism.
Only x86 architectures export the real algorithm data, and there, only
for tsc timecounter. HPET counters page could be exported as well, but
I prefer to not further glue the kernel and libc ABI there until
proper vdso-based solution is developed.
Minimal stubs neccessary for non-x86 architectures to still compile
are provided.
Discussed with: bde
Reviewed by: jhb
Tested by: flo
MFC after: 1 month
Do not rely on the busy state of the page from which we allocate the
chunk, to protect allocator state. Use statically allocated sx lock
instead.
Provide more flexible KPI. In particular, allow to allocate chunk
without providing initial data, and allow writes into existing
allocation. Allow to get an sf buf which temporary maps the chunk, to
allow sequential updates to shared page content without unmapping in
between.
Reviewed by: jhb
Tested by: flo
MFC after: 1 month
It seems that intended locking protocol for struct file f_offset field
was as follows: f_offset should always be changed under the vnode lock
(except fcntl(2) and lseek(2) did not followed the rules). Since
read(2) uses shared vnode lock, FOFFSET_LOCKED block is additionally
taken to serialize shared vnode lock owners.
This was broken first by enabling shared lock on writes, then by
fadvise changes, which moved f_offset assigned from under vnode lock,
and last by vn_io_fault() doing chunked i/o. More, due to uio_offset
not yet valid in vn_io_fault(), the range lock for reads was taken on
the wrong region.
Change the locking for f_offset to always use FOFFSET_LOCKED block,
which is placed before rangelocks in the lock order.
Extract foffset_lock() and foffset_unlock() functions which implements
FOFFSET_LOCKED lock, and consistently lock f_offset with it in the
vn_io_fault() both for reads and writes, even if MNTK_NO_IOPF flag is
not set for the vnode mount. Indicate that f_offset is already valid
for vn_read() and vn_write() calls from vn_io_fault() with FOF_OFFSET
flag, and assert that all callers of vn_read() and vn_write() follow
this protocol.
Extract get_advice() function to calculate the POSIX_FADV_XXX value
for the i/o region, and use it were appropriate.
Reviewed by: jhb
Tested by: pho
MFC after: 2 weeks
should be killed or not.
This fixes killing pdfork(2)ed process on last close of the corresponding
process descriptor.
Reviewed by: rwatson
MFC after: 1 month
On success we have to drop one after procdesc_finit() and on failure
we have to close allocated slot with fdclose(), which also drops one
reference for us and drop the remaining reference with fdrop().
Without this change closing process descriptor didn't result in killing
pdfork(2)ed child.
Reviewed by: rwatson
MFC after: 1 month
First, extend the changes in r230782 to better handle the common case
of using NOREUSE with sequential reads. A NOREUSE file descriptor
will now track the last implicit DONTNEED request it made as a result
of a NOREUSE read. If a subsequent NOREUSE read is adjacent to the
previous range, it will apply the DONTNEED request to the entire range
of both the previous read and the current read. The effect is that
each read of a file accessed sequentially will apply the DONTNEED
request to the entire range that has been read. This allows NOREUSE
to properly handle misaligned reads by flushing each buffer to cache
once it has been completely read.
Second, apply the same changes made to read(2) by r230782 and this
change to writes. This provides much better performance in the
sequential write case as it allows writes to still be clustered. It
also provides much better performance for misaligned writes. It does
mean that NOREUSE will be generally ineffective for non-sequential
writes as the current implementation relies on a future NOREUSE
write's implicit DONTNEED request to flush the dirty buffer from the
current write.
MFC after: 2 weeks