LK_EXCLOTHER. LK_EXCLOTHER is only used to acquire a
usecount on a vnode during NFSv4 recovery from an
expired lease.
Reported and tested by: pho
MFC after: 2 weeks
is starting. This is in line with practice in OpenSolaris.
Note that this change is only in ULE and not in the 4BSD scheduler.
Once this change settles in (MFC timeout has expired) we'll try it out
on 4BSD as well.
PR: 177706
Submitted by: Tiwei Bie
MFC after: 1 month
and kern_proc_vmmap_out() functions to output process kinfo structures
to sbuf, to make the code reusable.
The functions are going to be used in the coredump routine to store
procstat info in the core program header notes.
Reviewed by: kib
MFC after: 3 weeks
function is provided, which is used either to calculate the note size
or output it to sbuf. On the first pass the notes are registered in a
list and the resulting size is found, on the second pass the list is
traversed outputing notes to sbuf. For the sbuf a drain routine is
provided that writes data to a core file.
The main goal of the change is to make coredump to write notes
directly to the core file, without preliminary preparing them all in a
memory buffer. Storing notes in memory is not a problem for the
current, rather small, set of notes we write to the core, but it may
becomes an issue when we start to store procstat notes.
Reviewed by: jhb (initial version), kib
Discussed with: jhb, kib
MFC after: 3 weeks
In some cases, kern_envp is set by the architecture code and env_pos does
not contain the length of the static kernel environment. In these cases
r249408 causes the kernel to discard the environment.
Fix this by updating the check for empty static env to *kern_envp != '\0'
Reported by: np@
In case where there are no static kernel environment entries, the
function init_dynamic_kenv() adds an incorrect entry at position 0 of
the dynamic kernel environment. This in turn causes kenv(1) to print
and empty list even though there are dynamic entries added later.
Fix this by checking env_pos in init_dynamic_kenv() and adding dynamic
entries only if there are static entries.
reside on their own cache line to prevent false sharing with other
nearby structures, especially for those in the .bss segment.
NB: Those mutexes and rwlocks with variables next to them that get
changed on every invocation do not benefit from their own cache line.
Actually it may be net negative because two cache misses would be
incurred in those cases.
Introduce counter(9) API, that implements fast and raceless counters,
provided (but not limited to) for gathering of statistical data.
See http://lists.freebsd.org/pipermail/freebsd-arch/2013-April/014204.html
for more details.
In collaboration with: kib
Reviewed by: luigi
Tested by: ae, ray
Sponsored by: Nginx, Inc.
Although native word padding (i.e. 8-byte on 64bit arch) looks to be
in agreement with standards, other parts of our code and other OSes
use 4-byte alignment.
This is not expected to change alignment for currently generated core
dump notes, as the notes look to consist of structures with sizes
multiple of 8 on 64-bit archs. But there are plans to add additional
notes, where 4-byte vs 8-byte alignment makes difference.
Discussed with: kib
Reviewed by: kib
MFC after: 2 weeks
POSIX mqueue, compatibility ksem and POSIX shm create a file descriptor that
has close-on-exec set. However, they do this incorrectly, leaving a window
where a thread may fork and exec while the flag has not been set yet. The
race is easily reproduced on a multicore system with one thread doing
shm_open and close and another thread doing posix_spawnp and waitpid.
Set UF_EXCLOSE via falloc()'s flags argument instead. This also simplifies
the code.
MFC after: 1 week
- Don't insert BKGRDMARKER bufs into the splay or dirty/clean buf lists.
No consumers need to find them there and it complicates the tree.
These flags are all FFS specific and could be moved out of the buf
cache.
- Use pbgetvp() and pbrelvp() to associate the background and journal
bufs with the vp. Not only is this much cheaper it makes more sense
for these transient bufs.
- Fix the assertions in pbget* and pbrel*. It's not safe to check list
pointers which were never initialized. Use the BX flags instead. We
also check B_PAGING in reassignbuf() so this should cover all cases.
Discussed with: kib, mckusick, attilio
Sponsored by: EMC / Isilon Storage Division
the handler address. Add a mark to distinguish between filter and
handler.
Note that the arguments for both filter and handler are same.
Sponsored by: The FreeBSD Foundation
Reviewed by: jhb
MFC after: 1 week
Allow boothowto and bootverbose to be set via kernel options, which
is useful on architectures that are unable to rely on a boot loader
to pass configuration variables to the kernel.
Submitted by: rwatson
extattr_set_{fd,file,link} is logically a write(2)-like operation and
should return ssize_t, just like extattr_get_*. Also, the user-space
utility was using an int for the return value of extattr_get_* and
extattr_list_*, both of which return an ssize_t.
MFC after: 1 week
vnode could be reclaimed while lock upgrade was performed.
Sponsored by: The FreeBSD Foundation
Reported and tested by: pho
Diagnosed and reviewed by: rmacklem
MFC after: 1 week
clear its pointer to next record, since next record
belongs to the buffer, and shouldn't be leaked.
The ng_ksocket(4) used to clear this pointer itself,
but the correct place is here.
Sponsored by: Nginx, Inc
1. If we wanted to send exactly as many bytes as the socket buffer is
sized for, the inner loop of kern_sendfile() would see that the
socket is full before seeing that it had no more bytes left to send.
This would cause it to return EAGAIN to the caller instead of
success. Fix by changing the order that these conditions are tested.
2. Simplify the calculation for the bytes to send in each iteration of
the inner loop of kern_sendfile()
3. Fix some calls with bogus arguments to sf_buf_ext(). These would
only trigger on mbuf allocation failure, but would be hilariously
bad if they did trigger.
Submitted by: gibbs(3), andre(2)
Reviewed by: emax, andre
Obtained from: Netflix
MFC after: 1 week
the thread reference on the vp->v_rdev and use the returned struct
cdev *dev instead of using vp->v_rdev. Call dev_strategy_csw()
instead of dev_strategy(), since we now own the reference.
Since the csw was already calculated, test d_flags to avoid mapping
the buffer if the driver supports unmapped requests [*].
Suggested by: kan [*]
Reviewed by: kan (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
but assumes that a thread reference was already obtained on the passed
device. Use the function from physio(), to avoid two extra dev_mtx
lock and unlock. Note that physio() is always used as the cdevsw
method, or is called from a cdevsw method, and the caller already owns
the reference.
dev_strategy() is left to keep KPI intact, but now it is implemented
as a wrapper around dev_strategy_csw().
Do some style cleanup in physio().
Requested and reviewed by: kan (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
maxbcache size fixed, the auto-tuned transient map is too small for
real-world load on i386.
Tested by: David Wolfskill
Sponsored by: The FreeBSD Foundation
In physio, check if device can handle unmapped IO and pass an
appropriately mapped buffer to the driver strategy routine. The
only driver in the tree that can handle unmapped buffers is one
exposed by GEOM, so mark it as such with the new flag in the
driver cdevsw structure.
This fixes insta-panics on hosts, running dconschat, as /dev/fwmem
is an example of the driver that makes use of physio routine, but
bypasses the g_down thread, where the buffer gets mapped normally.
Discussed with: kib (earlier version)
for migrating callouts to new CPU. This value is passed to
callout_cc_add() in order to update properly precision field in case of
rescheduling/migration.
Reviewed by: mav
The scope of these callbacks is primarily to support actions that affect the
taskqueue's thread environments. They are entirely optional, and
consequently are introduced as a new API: taskqueue_set_callback().
This interface allows the caller to specify that a taskqueue requires a
callback and optional context pointer for a given callback type.
The callback types included in this commit can be used to register a
constructor and destructor for thread-local storage using osd(9). This
allows a particular taskqueue to define that its threads require a specific
type of TLS, without the need for a specially-orchestrated task-based
mechanism for startup and shutdown in order to accomplish it.
Two callback types are supported at this point:
- TASKQUEUE_CALLBACK_TYPE_INIT, called by every thread when it starts, prior
to processing any tasks.
- TASKQUEUE_CALLBACK_TYPE_SHUTDOWN, called by every thread when it exits,
after it has processed its last task but before the taskqueue is
reclaimed.
While I'm here:
- Add two new macros, TQ_ASSERT_LOCKED and TQ_ASSERT_UNLOCKED, and use them
in appropriate locations.
- Fix taskqueue.9 to mention taskqueue_start_threads(), which is a required
interface for all consumers of taskqueue(9).
Reviewed by: kib (all), eadler (taskqueue.9), brd (taskqueue.9)
Approved by: ken (mentor)
Sponsored by: Spectra Logic
MFC after: 1 month
not every time an intermediate root (including the first devfs) is
mounted.
This is also consistent with waking up via root_mount_complete.
Reviewed by: jhb
MFC after: 13 days
u_long. Before this change it was of type int for syscalls, but prototypes
in sys/stat.h and documentation for chflags(2) and fchflags(2) (but not
for lchflags(2)) stated that it was u_long. Now some related functions
use u_long type for flags (strtofflags(3), fflagstostr(3)).
- Make path argument of type 'const char *' for consistency.
Discussed on: arch
Sponsored by: The FreeBSD Foundation
UMTX_PROFILING should really analyze the distribution of locks as they
index entries in the umtxq_chains hash-table.
However, the current implementation does add/dec the length counters
for *every* thread insert/removal, measuring at all really userland
contention and not the hash distribution.
Fix this by correctly add/dec the length counters in the points where
it is really needed.
Please note that this bug brought us questioning in the past the quality
of the umtx hash table distribution.
To date with all the benchmarks I could try I was not able to reproduce
any issue about the hash distribution on umtx.
Sponsored by: EMC / Isilon storage division
Reviewed by: jeff, davide
MFC after: 2 weeks
SIGSTOP if stop signals are currently deferred. This can occur if a
process is stopped via SIGSTOP while a thread is running or runnable
but before it has set TDF_SBDRY.
Tested by: pho
Reviewed by: kib
MFC after: 1 week
bufobj counter of the writes in progress is incremented. Other thread
inspecting the bufobj would consider it clean.
For the regular vnodes, the vnode lock is typically held both by the
thread performing the bufwrite() and an other thread doing syncing,
which prevents the situation. On the other hand, writes to the VCHR
vnodes are done without holding vnode lock.
Increment the write ref counter for the buffer object before calling
bundirty().
Sponsored by: The FreeBSD Foundation
Tested by: pho
MFC after: 2 weeks
becoming too low, the softdep flush thread processes the workitems,
which frees the space in journal, and then unsuspends the fs. The
softdep_flush() and other workitem processing functions busy the
filesystem before iterating over the worklist, to prevent the parallel
unmount from freeing the mount data. The vfs_busy() is called with
MBF_NOWAIT flag.
Now, if the unmount is already started and the filesystem is suspended
due to low journal space, the journal is never flushed and filesystem
is never unsuspended, because vfs_busy(MBF_NOWAIT) call cannot succeed
for the unmounting fs, and softdep_flush() does not process the
workitems. Unmount needs to write metadata, where it hangs in the
"suspfs" state.
Move the vn_start_write() call in the dounmount() before setting the
MNTK_UNMOUNT flag. This practically ensures that softdep_flush()
processed the pending journal writes by making dounmount() wait for
the lift of the suspension.
Sponsored by: The FreeBSD Foundation
Reported and tested by: pho
MFC after: 2 weeks
we need to call ufs_checkpath() to walk from our new location to
the root of the filesystem to ensure that we do not encounter
ourselves along the way. Until now, we accomplished this by reading
the ".." entries of each directory in our path until we reached
the root (or encountered an error). This change tries to avoid the
I/O of reading the ".." entries by first looking them up in the
name cache and only doing the I/O when the name cache lookup fails.
Reviewed by: kib
Tested by: Peter Holm
MFC after: 4 weeks
This change allows creating file descriptors with close-on-exec set in some
situations. SOCK_CLOEXEC and SOCK_NONBLOCK can be OR'ed in socket() and
socketpair()'s type parameter, and MSG_CMSG_CLOEXEC to recvmsg() makes file
descriptors (SCM_RIGHTS) atomically close-on-exec.
The numerical values for SOCK_CLOEXEC and SOCK_NONBLOCK are as in NetBSD.
MSG_CMSG_CLOEXEC is the first free bit for MSG_*.
The SOCK_* flags are not passed to MAC because this may cause incorrect
failures and can be done later via fcntl() anyway. On the other hand, audit
is expected to cope with the new flags.
For MSG_CMSG_CLOEXEC, unp_externalize() is extended to take a flags
argument.
Reviewed by: kib
buffer, transparently handling mapped or unmapped buffers. Its intent
is to replace the use of bzero(bp->b_data) in cases where the buffer
might be unmapped, to avoid unneeded upgrades.
Sponsored by: The FreeBSD Foundation
Tested by: pho
do not map the b_pages pages into buffer_map KVA. The use of the
unmapped buffers eliminate the need to perform TLB shootdown for
mapping on the buffer creation and reuse, greatly reducing the amount
of IPIs for shootdown on big-SMP machines and eliminating up to 25-30%
of the system time on i/o intensive workloads.
The unmapped buffer should be explicitely requested by the GB_UNMAPPED
flag by the consumer. For unmapped buffer, no KVA reservation is
performed at all. The consumer might request unmapped buffer which
does have a KVA reserve, to manually map it without recursing into
buffer cache and blocking, with the GB_KVAALLOC flag.
When the mapped buffer is requested and unmapped buffer already
exists, the cache performs an upgrade, possibly reusing the KVA
reservation.
Unmapped buffer is translated into unmapped bio in g_vfs_strategy().
Unmapped bio carry a pointer to the vm_page_t array, offset and length
instead of the data pointer. The provider which processes the bio
should explicitely specify a readiness to accept unmapped bio,
otherwise g_down geom thread performs the transient upgrade of the bio
request by mapping the pages into the new bio_transient_map KVA
submap.
The bio_transient_map submap claims up to 10% of the buffer map, and
the total buffer_map + bio_transient_map KVA usage stays the
same. Still, it could be manually tuned by kern.bio_transient_maxcnt
tunable, in the units of the transient mappings. Eventually, the
bio_transient_map could be removed after all geom classes and drivers
can accept unmapped i/o requests.
Unmapped support can be turned off by the vfs.unmapped_buf_allowed
tunable, disabling which makes the buffer (or cluster) creation
requests to ignore GB_UNMAPPED and GB_KVAALLOC flags. Unmapped
buffers are only enabled by default on the architectures where
pmap_copy_page() was implemented and tested.
In the rework, filesystem metadata is not the subject to maxbufspace
limit anymore. Since the metadata buffers are always mapped, the
buffers still have to fit into the buffer map, which provides a
reasonable (but practically unreachable) upper bound on it. The
non-metadata buffer allocations, both mapped and unmapped, is
accounted against maxbufspace, as before. Effectively, this means that
the maxbufspace is forced on mapped and unmapped buffers separately.
The pre-patch bufspace limiting code did not worked, because
buffer_map fragmentation does not allow the limit to be reached.
By Jeff Roberson request, the getnewbuf() function was split into
smaller single-purpose functions.
Sponsored by: The FreeBSD Foundation
Discussed with: jeff (previous version)
Tested by: pho, scottl (previous version), jhb, bf
MFC after: 2 weeks
calls to toggle TDF_SBDRY rather than passing PBDRY to individual sleep
calls.
- Remove the stop_allowed parameters from cursig() and issignal().
issignal() checks TDF_SBDRY directly.
- Remove the PBDRY and SLEEPQ_STOP_ON_BDRY flags.
In other words we don't require CAP_SEEK if either O_APPEND or O_TRUNC flag is
given, because O_APPEND doesn't allow to overwrite existing data and O_TRUNC
requires CAP_FTRUNCATE already.
Sponsored by: The FreeBSD Foundation
will not smash the M_EXT and data pointer, so it is safe to
pass an mbuf with external storage procuded by m_getcl() to
m_move_pkthdr().
Reviewed by: andre
Sponsored by: Nginx, Inc.
the filesystem VOP_READ() and VOP_WRITE() implementations in the same
way as vn_io_fault_uiomove() over the unmapped buffers. Helper
provides the convenient wrapper over the pmap_copy_pages() for struct
uio consumers, taking care of the TDP_UIOHELD situations.
Sponsored by: The FreeBSD Foundation
Tested by: pho
MFC after: 2 weeks
PBDRY (which simply doesn't make any sense) nor PCATCH (which could
be used by a malicious process to work around the PCPU limit).
Submitted by: Rudo Tomori
Reviewed by: kib
cluster_write() and cluster_wbuild() functions. The flags to be
allowed are a subset of the GB_* flags for getblk().
Sponsored by: The FreeBSD Foundation
Tested by: pho
buffers directly, use pmap_zero_page_area(9) for each zeroing page
region instead.
Sponsored by: The FreeBSD Foundation
Tested by: pho
MFC after: 2 weeks
the first page:
1. Cast uint16_t operands in a multiplication to unsigned int because
otherwise the implicit promotion to int results in a signed
multiplication that can overflow and the behaviour on integer
overflow is undefined.
2. Replace (offset + size > PAGE_SIZE) with (size > PAGE_SIZE - offset)
because the sum may overflow.
- Use the same tests to see if the path to the interpreter is on the first
page. There's no overflow here because size is already limited by
MAXPATHLEN, but the compiler optimises the new tests better. Also fix an
off-by-one error.
- Simplify tests to see if an ELF note program header is on the first page.
This also fixes an off-by-one error.
Reviewed by: kib
MFC after: 1 week
and that can drive someone crazy. While m_get2() is young and not
documented yet, change its order of arguments to match m_getm2().
Sorry for churn, but better now than later.
appropriate wait argument and checking return value. Before this change
m_extadd() could fail, and kern_sendfile() ignored that.
Sponsored by: Nginx, Inc.
- Make it return int, not void.
- Add wait parameter.
- Update MEXTADD() macro appropriately, defaults to M_NOWAIT, as
before this change.
Sponsored by: Nginx, Inc.
Change size requested to malloc(9) now that callwheel buckets are
callout_list and not callout_tailq anymore. This change was already
there but it seems it got lost after code churn in r248032.
Reported by: alc, kib
- Use u_int values for length and max_length values
- Add a way to reset the max_length heuristic in order to have the
possibility to reuse the mechanism consecutively without rebooting
the machine
- Add a way to quick display top5 contented buckets in the system for
the max_length value.
This should give a quick overview on the quality of the hash table
distribution.
Sponsored by: EMC / Isilon storage division
Reviewed by: jeff, davide
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.
The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
- VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
- VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
- VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
- VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
(in order to avoid visibility of implementation details)
- The read-mode operations are added:
VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
sys/mutex.h in consumers directly to cater its inlining functions
using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
the compat layer because the name clash between FreeBSD and solaris
versions must be avoided.
At this purpose zfs redefines the vm_object locking functions
directly, isolating the FreeBSD components in specific compat stubs.
The KPI results heavilly broken by this commit. Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).
Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: pjd (ZFS specific review)
Discussed with: alc
Tested by: pho
from being indirectly called via cpu_startup()+vm_ksubmap_init().
The boot order position remains the same at SI_SUB_CPU.
Allocation of the callout array is changed to stardard kernel malloc
from a slightly obscure direct kernel_map allocation.
kern_timeout_callwheel_alloc() is renamed to callout_callwheel_init()
to better describe its purpose.
kern_timeout_callwheel_init() is removed simplifying the per-cpu
initialization.
Reviewed by: davide
kern_timeout_callwheel_alloc() where it is actually used.
This is a mechanical move and no tuning parameters are changed.
The pre-allocated callout array is only used for legacy timeout(9)
calls and is only allocated and active on cpu0. Eventually all
remaining users of timeout(9) should switch to the callout_* API.
Reviewed by: davide
response to an rtprio_thread() call, when the priority is different
than the old priority, and either the old or the new priority class is
not RTP_PRIO_NORMAL (timeshare).
The reasoning for the second half of the test is that if it's a change in
timeshare priority, then the scheduler is going to adjust that priority
in a way that completely wipes out the requested change anyway, so
what's the point? (If that's not true, then allowing a thread to change
its own timeshare priority would subvert the scheduler's adjustments and
let a cpu-bound thread monopolize the cpu; if allowed at all, that
should require priveleges.)
On the other hand, if either the old or new priority class is not
timeshare, then the scheduler doesn't make automatic adjustments, so we
should honor the request and make the priority change right away. The
reason the old class gets caught up in this is the very reason for this
change: when thread A changes the priority of its child thread B from
idle back to timeshare, thread B never actually gets moved to a
timeshare-range run queue unless there are some idle cycles available
to allow it to first get scheduled again as an idle thread.
Reviewed by: jhb@
using callout_reset_sbt() instead of callout_reset(). We can't remove
lower limit completely in this case because of significant processing
overhead, caused by unability to use direct callout execution due to using
process mutex in callout handler for sending SEGALRM signal. With support
of periodic events that would allow unprivileged user to abuse the system.
Reviewed by: davide