Add '__' prefix to the tag of struct oflock, and put it under BSD
namespace. Structure is needed both by libc and kernel, thus cannot be
hidden under #ifdef _KERNEL.
Move a set of non-standard F_* and O_* constants into BSD namespace.
SUSv4 explicitly allows implementation to pollute F_* and O_* names
after fcntl.h is included, but it costs us nothing to adhere
to the specification if exact POSIX compliance level is requested by
user code.
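
The namespace gating described above can be sketched as follows. This is
only an illustration: DEMO_BSD_VISIBLE, DEMO_O_RDONLY and DEMO_O_EXLOCK are
hypothetical stand-ins for the visibility macro and for fcntl.h constants,
and the values are made up.

```c
#include <assert.h>

/*
 * Hypothetical stand-in for the visibility macro. A build requesting
 * strict POSIX compliance would define it as 0.
 */
#ifndef DEMO_BSD_VISIBLE
#define	DEMO_BSD_VISIBLE	1
#endif

/* Standard name: always visible. */
#define	DEMO_O_RDONLY	0x0000

#if DEMO_BSD_VISIBLE
/* Non-standard extension: defined only in the BSD namespace. */
#define	DEMO_O_EXLOCK	0x0020
#endif

/* Return 1 if the extension constant is visible in this unit. */
static int
demo_has_exlock(void)
{
#ifdef DEMO_O_EXLOCK
	return (1);
#else
	return (0);
#endif
}
```

Compiling the same fragment with -DDEMO_BSD_VISIBLE=0 hides the extension
name while leaving the standard one in place.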
Change some spaces after #define to tabs.
Noted by and discussed with: bde
MFC after: 1 week
ELF parser. Specifically, do not allow note reader and interpreter
path comparison in the brandelf code to read past end of the page.
This may happen if a specially crafted ELF image is activated.
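
A minimal sketch of the kind of bounds check involved, not the actual
kernel code. struct demo_note and demo_note_fits() are hypothetical names;
the header layout assumes the usual three 32-bit words of an ELF note and
4-byte alignment of the name and descriptor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified ELF note header. */
struct demo_note {
	uint32_t namesz;
	uint32_t descsz;
	uint32_t type;
};

/* Round up to the 4-byte note alignment. */
static size_t
demo_roundup4(size_t n)
{
	return ((n + 3) & ~(size_t)3);
}

/*
 * Return 1 if the note at offset 'off' lies entirely inside the 'len'
 * bytes of 'buf' (for example, the first page of the image), else 0.
 */
static int
demo_note_fits(const unsigned char *buf, size_t len, size_t off)
{
	struct demo_note hdr;
	size_t remain, namesz, descsz;

	/* The header itself must fit before it is read. */
	if (off > len || len - off < sizeof(hdr))
		return (0);
	memcpy(&hdr, buf + off, sizeof(hdr));
	remain = len - off - sizeof(hdr);

	/* Reject sizes that wrap on rounding or exceed what is left. */
	namesz = demo_roundup4(hdr.namesz);
	if (namesz < hdr.namesz || namesz > remain)
		return (0);
	descsz = demo_roundup4(hdr.descsz);
	if (descsz < hdr.descsz || descsz > remain - namesz)
		return (0);
	return (1);
}
```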
Submitted by: Lukasz Wojcik <lukasz.wojcik zoho com>
MFC after: 3 days
VM_KMEM_MAX_SIZE.
The code was not taking into account the size of the kernel_map, which
the kmem_map is allocated from, so it could produce a sub-map size too
large to fit. The simplest solution is to ignore VM_KMEM_MAX entirely
and base the memguard map's size off the kernel_map's size, since this
is always relevant and always smaller.
Found by: Justin Hibbits
adds an extra tick to account for the current partial clock tick. However,
that is not appropriate for a repeating timer when the exact tvtohz() value
should be used for subsequent intervals. Fix repeating callouts for
EVFILT_TIMER by subtracting 1 tick from the tvtohz() result similar to the
fix used in realitexpire() for interval timers.
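
The arithmetic can be illustrated with a simplified model. demo_tvtohz()
below is a hypothetical approximation of tvtohz() that takes microseconds
and hz directly; it is an assumption for illustration, not the kernel
function.

```c
#include <assert.h>

/*
 * Hypothetical model of tvtohz(): round the interval up to whole ticks,
 * then add one tick for the current partial tick.
 */
static long
demo_tvtohz(long usec, long hz)
{
	long tick_usec = 1000000 / hz;

	return ((usec + tick_usec - 1) / tick_usec + 1);
}

/* First expiry: keep the extra tick for the partial current tick. */
static long
demo_first_ticks(long usec, long hz)
{
	return (demo_tvtohz(usec, hz));
}

/* Subsequent expiries: the callout fires on a tick edge, so drop it. */
static long
demo_repeat_ticks(long usec, long hz)
{
	return (demo_tvtohz(usec, hz) - 1);
}
```

With hz = 1000 and a 10 ms period, the first interval is 11 ticks and each
repeat is 10 ticks, so the period no longer drifts by one tick per expiry.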
While here, update a few comments to note that if the EVFILT_TIMER code
were to move out of kern_event.c, it should move to kern_time.c (where the
interval timer code it mimics lives) rather than kern_timeout.c.
MFC after: 1 month
These probes are most useful when looking into the structures
they provide, which are listed in io.d. For example:
dtrace -n 'io:genunix::start { printf("%d\n", args[0]->bio_bcount); }'
Note that the I/O systems in FreeBSD and Solaris/Illumos are sufficiently
different that there is not a 1:1 mapping from scripts that work
with one to the other.
MFC after: 1 month
debugger exited without calling ptrace(PT_DETACH), there is a time window
in which p_xthread may point to a non-existent thread. In practice,
this is not a problem because the child process will soon be killed by the
parent process.
to attach to the process, it is surprising that the process is resumed
without inputting any gdb commands. However, the ptrace manual says:
The tracing process will see the newly-traced process stop and may
then control it as if it had been traced all along.
But the current code does not work in this way: unless the traced process
receives a signal later, it will continue to run as a background task.
To fix this problem, just send SIGSTOP to the traced process after
we resume it. This works as if you were attaching to a running process;
it is not perfect but better than nothing.
Pass only FEXEC (instead of FREAD|FEXEC) in fgetvp_exec. _fget has to check for
!FWRITE anyway and may as well know about FREAD.
Make _fget code a bit more readable by converting permission checking from if()
to switch(). Assert that correct permission flags are passed.
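
A rough sketch of what a switch()-based permission check of this shape
could look like. The DEMO_* flag values and demo_check_access() are
hypothetical, and the exact conditions are an assumption based on the
description above, not a copy of _fget().

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag values, for illustration only. */
#define	DEMO_FREAD	0x1
#define	DEMO_FWRITE	0x2
#define	DEMO_FEXEC	0x4

/*
 * Check the open mode 'f_flag' of a descriptor against the single
 * access type 'wanted'. Returns 0 or EBADF.
 */
static int
demo_check_access(int f_flag, int wanted)
{
	switch (wanted) {
	case DEMO_FREAD:
	case DEMO_FWRITE:
		if ((f_flag & wanted) == 0)
			return (EBADF);
		break;
	case DEMO_FEXEC:
		/* Needs read or exec access, and must not be writable. */
		if ((f_flag & (DEMO_FREAD | DEMO_FEXEC)) == 0 ||
		    (f_flag & DEMO_FWRITE) != 0)
			return (EBADF);
		break;
	case 0:
		/* No particular access requested. */
		break;
	default:
		/* Callers must pass exactly one known flag or 0. */
		assert(!"demo_check_access: unexpected flags");
	}
	return (0);
}
```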
In collaboration with: kib
Approved by: trasz (mentor)
MFC after: 6 days
X-MFC: with r238220
While here return EBADF for descriptors opened for writing (previously it was ETXTBSY).
Add fgetvp_exec function which performs appropriate checks.
PR: kern/169651
In collaboration with: kib
Approved by: trasz (mentor)
MFC after: 1 week
now fully encapsulates all accesses to f_offset, and extends f_offset
locking to other consumers that need it, in particular, to lseek() and
variants of getdirentries().
Ensure that on 32bit architectures f_offset, which is a 64bit quantity,
is always read and written under the mtxpool protection. This fixes an
apparently easy to trigger race where parallel lseek()s or lseek() and
read/write could destroy the file offset.
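
The idea can be sketched in userland terms. A 64-bit load or store is not
guaranteed to be atomic on a 32-bit CPU, so every access goes through a
mutex. struct demo_file and the helpers are hypothetical, and a pthread
mutex stands in for the kernel mtxpool lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical file object; 'lock' stands in for the mtxpool mutex. */
struct demo_file {
	pthread_mutex_t	lock;
	int64_t		offset;
};

/* Read the offset; both 32-bit halves are read under the lock. */
static int64_t
demo_offset_get(struct demo_file *fp)
{
	int64_t v;

	pthread_mutex_lock(&fp->lock);
	v = fp->offset;
	pthread_mutex_unlock(&fp->lock);
	return (v);
}

/* Relative seek: the read-modify-write is one critical section. */
static int64_t
demo_offset_seek_cur(struct demo_file *fp, int64_t delta)
{
	int64_t v;

	pthread_mutex_lock(&fp->lock);
	fp->offset += delta;
	v = fp->offset;
	pthread_mutex_unlock(&fp->lock);
	return (v);
}
```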
The already broken ABI emulations, including iBCS and SysV, are not
converted (yet).
Tested by: pho
No objections from: jhb
MFC after: 3 weeks
the scheduled task from tc_windup(). Do it directly from tc_windup in
interrupt context [1].
Establish the permanent mapping of the shared page into the kernel
address space, avoiding the potential need to sleep waiting for
allocation of sf buffer during vdso_timehands update. As a
consequence, shared_page_write_start() and shared_page_write_end()
functions are not needed anymore.
Guess and memorize the pointers to native host and compat32 sysentvec
during initialization, to avoid the need to get shared_page_alloc_sx
lock during the update.
In tc_fill_vdso_timehands(), do not loop waiting for timehands
generation to stabilize, since vdso_timehands is written in the same
interrupt context which wrote timehands.
Requested by: mav [1]
MFC after: 29 days
usermode, using shared page. The structures and functions have vdso
prefix, to indicate the intended location of the code in the future.
The versioned per-algorithm data is exported in the format of struct
vdso_timehands, which mostly repeats the content of in-kernel struct
timehands. Usermode reading of the structure can be lockless.
Compatibility export for 32bit processes on 64bit host is also
provided. Kernel also provides usermode with indication about
currently used timecounter, so that libc can fall back to syscall if
configured timecounter is unknown to usermode code.
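
One common way to make such a read lockless is a generation counter that
the writer zeroes during an update. The sketch below assumes that scheme;
struct demo_timehands and its fields are hypothetical and much smaller
than the real struct vdso_timehands.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical shared structure; gen == 0 means "update in progress". */
struct demo_timehands {
	_Atomic uint32_t gen;
	_Atomic uint64_t scale;
	_Atomic uint64_t offset;
};

/* Writer: invalidate, update the payload, publish a new generation. */
static void
demo_th_write(struct demo_timehands *th, uint64_t scale, uint64_t offset)
{
	uint32_t gen = atomic_load_explicit(&th->gen, memory_order_relaxed);

	atomic_store_explicit(&th->gen, 0, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&th->scale, scale, memory_order_relaxed);
	atomic_store_explicit(&th->offset, offset, memory_order_relaxed);
	if (++gen == 0)		/* never publish generation 0 */
		gen = 1;
	atomic_store_explicit(&th->gen, gen, memory_order_release);
}

/* Reader: retry until the generation is non-zero and unchanged. */
static void
demo_th_read(struct demo_timehands *th, uint64_t *scale, uint64_t *offset)
{
	uint32_t gen;

	do {
		gen = atomic_load_explicit(&th->gen, memory_order_acquire);
		*scale = atomic_load_explicit(&th->scale,
		    memory_order_relaxed);
		*offset = atomic_load_explicit(&th->offset,
		    memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
	} while (gen == 0 ||
	    gen != atomic_load_explicit(&th->gen, memory_order_relaxed));
}
```

The reader never blocks the writer; it only retries when it observes an
update in flight.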
The shared data updates are initiated both from the tc_windup(), where
a fast task is queued to do the update, and from sysctl handlers which
change timecounter. A manual override switch
kern.timecounter.fast_gettime allows turning off the mechanism.
Only x86 architectures export the real algorithm data, and there, only
for tsc timecounter. HPET counters page could be exported as well, but
I prefer to not further glue the kernel and libc ABI there until
a proper vdso-based solution is developed.
Minimal stubs necessary for non-x86 architectures to still compile
are provided.
Discussed with: bde
Reviewed by: jhb
Tested by: flo
MFC after: 1 month
Do not rely on the busy state of the page from which we allocate the
chunk, to protect allocator state. Use statically allocated sx lock
instead.
Provide more flexible KPI. In particular, allow allocating a chunk
without providing initial data, and allow writes into an existing
allocation. Allow getting an sf buf which temporarily maps the chunk, to
allow sequential updates to shared page content without unmapping in
between.
Reviewed by: jhb
Tested by: flo
MFC after: 1 month
It seems that intended locking protocol for struct file f_offset field
was as follows: f_offset should always be changed under the vnode lock
(except fcntl(2) and lseek(2) did not follow the rules). Since
read(2) uses shared vnode lock, FOFFSET_LOCKED block is additionally
taken to serialize shared vnode lock owners.
This was broken first by enabling shared lock on writes, then by
fadvise changes, which moved the f_offset assignment out from under the
vnode lock, and last by vn_io_fault() doing chunked i/o. Moreover,
because uio_offset was not yet valid in vn_io_fault(), the range lock
for reads was taken on
the wrong region.
Change the locking for f_offset to always use FOFFSET_LOCKED block,
which is placed before rangelocks in the lock order.
Extract foffset_lock() and foffset_unlock() functions which implement
FOFFSET_LOCKED lock, and consistently lock f_offset with it in the
vn_io_fault() both for reads and writes, even if MNTK_NO_IOPF flag is
not set for the vnode mount. Indicate that f_offset is already valid
for vn_read() and vn_write() calls from vn_io_fault() with FOF_OFFSET
flag, and assert that all callers of vn_read() and vn_write() follow
this protocol.
Extract get_advice() function to calculate the POSIX_FADV_XXX value
for the i/o region, and use it where appropriate.
Reviewed by: jhb
Tested by: pho
MFC after: 2 weeks
should be killed or not.
This fixes killing pdfork(2)ed process on last close of the corresponding
process descriptor.
Reviewed by: rwatson
MFC after: 1 month
On success we have to drop one after procdesc_finit() and on failure
we have to close the allocated slot with fdclose(), which also drops one
reference for us, and drop the remaining reference with fdrop().
Without this change closing process descriptor didn't result in killing
pdfork(2)ed child.
Reviewed by: rwatson
MFC after: 1 month
First, extend the changes in r230782 to better handle the common case
of using NOREUSE with sequential reads. A NOREUSE file descriptor
will now track the last implicit DONTNEED request it made as a result
of a NOREUSE read. If a subsequent NOREUSE read is adjacent to the
previous range, it will apply the DONTNEED request to the entire range
of both the previous read and the current read. The effect is that
each read of a file accessed sequentially will apply the DONTNEED
request to the entire range that has been read. This allows NOREUSE
to properly handle misaligned reads by flushing each buffer to cache
once it has been completely read.
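
The range tracking can be sketched like this. struct demo_fadvise and
demo_noreuse_range() are hypothetical names, ranges are half-open byte
ranges [start, end), and checking adjacency on both sides is an
assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-descriptor state: last implicit DONTNEED range. */
struct demo_fadvise {
	int64_t prev_start;
	int64_t prev_end;	/* exclusive; start == end means "none" */
};

/*
 * Record a NOREUSE read of [start, end) and compute the range that the
 * implicit DONTNEED request should cover.
 */
static void
demo_noreuse_range(struct demo_fadvise *fa, int64_t start, int64_t end,
    int64_t *dn_start, int64_t *dn_end)
{
	if (fa->prev_start != fa->prev_end) {
		if (fa->prev_end == start)
			start = fa->prev_start;	/* extends forward */
		else if (fa->prev_start == end)
			end = fa->prev_end;	/* extends backward */
	}
	/* Remember the merged range for the next request. */
	fa->prev_start = start;
	fa->prev_end = end;
	*dn_start = start;
	*dn_end = end;
}
```

A sequential reader thus keeps growing one range, while a read elsewhere
in the file starts a fresh one.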
Second, apply the same changes made to read(2) by r230782 and this
change to writes. This provides much better performance in the
sequential write case as it allows writes to still be clustered. It
also provides much better performance for misaligned writes. It does
mean that NOREUSE will be generally ineffective for non-sequential
writes as the current implementation relies on a future NOREUSE
write's implicit DONTNEED request to flush the dirty buffer from the
current write.
MFC after: 2 weeks
dev = make_dev_cred();
dev->si_drv1 = tp;
leaves a small window where the newly created device may be opened
and si_drv1 is NULL.
As this is a very rare situation, using a lock to close the window
seems overkill. Instead just wait for the assignment of si_drv1.
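
A userland sketch of "wait for the assignment" instead of locking.
struct demo_cdev and both helpers are hypothetical; sched_yield() stands
in for whatever short sleep the open path would use, and C11 atomics stand
in for the plain pointer store in the example above.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical device node with a driver-private pointer. */
struct demo_cdev {
	_Atomic(void *) drv1;
};

/* Creator side: publish the private pointer after the node exists. */
static void
demo_set_drv1(struct demo_cdev *dev, void *softc)
{
	atomic_store_explicit(&dev->drv1, softc, memory_order_release);
}

/* Open side: wait until the creator has filled in the pointer. */
static void *
demo_wait_drv1(struct demo_cdev *dev)
{
	void *p;

	while ((p = atomic_load_explicit(&dev->drv1,
	    memory_order_acquire)) == NULL)
		sched_yield();
	return (p);
}
```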
Suggested by: kib
MFC after: 1 week
zero but in any case is overwritten by successive copyin(), making the
previous initialization useless. Remove this.
As an added bonus this fixes a style(9) bug.
Discussed with: kib
Approved by: gnn (mentor)
MFC after: 3 days
indx will never be -1 on error, as none of dupfdopen(), finstall() and
kern_capwrap() modifies it on error, but what is more important none of
those functions install and leave file at indx descriptor on error.
Leave an assert to prove my words.
MFC after: 1 month
the caller using finstall().
This saves us the filedesc lock/unlock cycle, fhold()/fdrop() cycle and closes
a race between finstall() and dupfdopen().
MFC after: 1 month