instead of kernel_map size to prevent kernel memory exhaustion
by mbufs and a subsequent panic on physical page allocation
failure.
On architectures without a direct map all mbuf memory (except
for jumbo mbufs larger than PAGE_SIZE) comes from kmem_map.
It is the limiting factor hence.
For architectures with a direct map using the size of kmem_map
is a good proxy of available kernel memory as well. If it is
much smaller the mbuf limit may be sub-optimal but remains
reasonable, while avoiding panics under exhaustion.
The overall mbuf memory limit calculation may be reconsidered
again later, however due to the many different mbuf sizes and
different backing KVM maps it is a tricky subject.
Found by: pho's new network stress test
Pointed out by: alc (kmem_map instead of kernel_map)
Tested by: pho
Programs often do not expect an [EINTR] return from sem_wait() and POSIX
only allows it if the signal was installed without SA_RESTART. The timeout
in sem_timedwait() is absolute so it can be restarted normally.
The umtx call can be invoked with a relative timeout and in that case
[ERESTART] must be changed to [EINTR]. However, libc does not do this.
The old POSIX semaphore implementation did this correctly (before r249566),
unlike the new umtx one.
It may be desirable to avoid [EINTR] completely, which matches the pthread
functions and is explicitly permitted by POSIX. However, the kernel must
return [EINTR] at least for signals with SA_RESTART clear, otherwise pthread
cancellation will not abort a semaphore wait. In this commit, only restore
the 8.x behaviour which is also permitted by POSIX.
Discussed with: jhb
MFC after: 1 week
prevents us from creating UMA_ZONE_PCPU zones earlier.
As bandaid shift initialization of counter(9) zone later.
Reviewed by: kib
Reported & tested by: Lytochkin Boris <lytboris gmail.com>
signal.
- Fix the old ksem implementation for POSIX semaphores to not restart
sem_wait() or sem_timedwait() if interrupted by a signal.
MFC after: 1 week
The notes format is a header of sizeof(int), which stores the size of
the corresponding data structure to provide some versioning, and data
in the format as it is returned by a related sysctl call.
The userland tools (procstat(1)) will be taught to extract this data,
providing additional info for postmortem analysis.
PR: kern/173723
Suggested by: jhb
Discussed with: jhb, kib
Reviewed by: jhb (initial version), kib
MFC after: 1 month
LK_EXCLOTHER. LK_EXCLOTHER is only used to acquire a
usecount on a vnode during NFSv4 recovery from an
expired lease.
Reported and tested by: pho
MFC after: 2 weeks
is starting. This is in line with practice in OpenSolaris.
Note that this change is only in ULE and not in the 4BSD scheduler.
Once this change settles in (MFC timeout has expired) we'll try it out
on 4BSD as well.
PR: 177706
Submitted by: Tiwei Bie
MFC after: 1 month
and kern_proc_vmmap_out() functions to output process kinfo structures
to sbuf, to make the code reusable.
The functions are going to be used in the coredump routine to store
procstat info in the core program header notes.
Reviewed by: kib
MFC after: 3 weeks
function is provided, which is used either to calculate the note size
or output it to sbuf. On the first pass the notes are registered in a
list and the resulting size is found, on the second pass the list is
traversed outputing notes to sbuf. For the sbuf a drain routine is
provided that writes data to a core file.
The main goal of the change is to make coredump to write notes
directly to the core file, without preliminary preparing them all in a
memory buffer. Storing notes in memory is not a problem for the
current, rather small, set of notes we write to the core, but it may
becomes an issue when we start to store procstat notes.
Reviewed by: jhb (initial version), kib
Discussed with: jhb, kib
MFC after: 3 weeks
In some cases, kern_envp is set by the architecture code and env_pos does
not contain the length of the static kernel environment. In these cases
r249408 causes the kernel to discard the environment.
Fix this by updating the check for empty static env to *kern_envp != '\0'
Reported by: np@
In case where there are no static kernel environment entries, the
function init_dynamic_kenv() adds an incorrect entry at position 0 of
the dynamic kernel environment. This in turn causes kenv(1) to print
and empty list even though there are dynamic entries added later.
Fix this by checking env_pos in init_dynamic_kenv() and adding dynamic
entries only if there are static entries.
reside on their own cache line to prevent false sharing with other
nearby structures, especially for those in the .bss segment.
NB: Those mutexes and rwlocks with variables next to them that get
changed on every invocation do not benefit from their own cache line.
Actually it may be net negative because two cache misses would be
incurred in those cases.
Introduce counter(9) API, that implements fast and raceless counters,
provided (but not limited to) for gathering of statistical data.
See http://lists.freebsd.org/pipermail/freebsd-arch/2013-April/014204.html
for more details.
In collaboration with: kib
Reviewed by: luigi
Tested by: ae, ray
Sponsored by: Nginx, Inc.
Although native word padding (i.e. 8-byte on 64bit arch) looks to be
in agreement with standards, other parts of our code and other OSes
use 4-byte alignment.
This is not expected to change alignment for currently generated core
dump notes, as the notes look to consist of structures with sizes
multiple of 8 on 64-bit archs. But there are plans to add additional
notes, where 4-byte vs 8-byte alignment makes difference.
Discussed with: kib
Reviewed by: kib
MFC after: 2 weeks
POSIX mqueue, compatibility ksem and POSIX shm create a file descriptor that
has close-on-exec set. However, they do this incorrectly, leaving a window
where a thread may fork and exec while the flag has not been set yet. The
race is easily reproduced on a multicore system with one thread doing
shm_open and close and another thread doing posix_spawnp and waitpid.
Set UF_EXCLOSE via falloc()'s flags argument instead. This also simplifies
the code.
MFC after: 1 week
- Don't insert BKGRDMARKER bufs into the splay or dirty/clean buf lists.
No consumers need to find them there and it complicates the tree.
These flags are all FFS specific and could be moved out of the buf
cache.
- Use pbgetvp() and pbrelvp() to associate the background and journal
bufs with the vp. Not only is this much cheaper it makes more sense
for these transient bufs.
- Fix the assertions in pbget* and pbrel*. It's not safe to check list
pointers which were never initialized. Use the BX flags instead. We
also check B_PAGING in reassignbuf() so this should cover all cases.
Discussed with: kib, mckusick, attilio
Sponsored by: EMC / Isilon Storage Division
the handler address. Add a mark to distinguish between filter and
handler.
Note that the arguments for both filter and handler are same.
Sponsored by: The FreeBSD Foundation
Reviewed by: jhb
MFC after: 1 week
Allow boothowto and bootverbose to be set via kernel options, which
is useful on architectures that are unable to rely on a boot loader
to pass configuration variables to the kernel.
Submitted by: rwatson
extattr_set_{fd,file,link} is logically a write(2)-like operation and
should return ssize_t, just like extattr_get_*. Also, the user-space
utility was using an int for the return value of extattr_get_* and
extattr_list_*, both of which return an ssize_t.
MFC after: 1 week
vnode could be reclaimed while lock upgrade was performed.
Sponsored by: The FreeBSD Foundation
Reported and tested by: pho
Diagnosed and reviewed by: rmacklem
MFC after: 1 week
clear its pointer to next record, since next record
belongs to the buffer, and shouldn't be leaked.
The ng_ksocket(4) used to clear this pointer itself,
but the correct place is here.
Sponsored by: Nginx, Inc
1. If we wanted to send exactly as many bytes as the socket buffer is
sized for, the inner loop of kern_sendfile() would see that the
socket is full before seeing that it had no more bytes left to send.
This would cause it to return EAGAIN to the caller instead of
success. Fix by changing the order that these conditions are tested.
2. Simplify the calculation for the bytes to send in each iteration of
the inner loop of kern_sendfile()
3. Fix some calls with bogus arguments to sf_buf_ext(). These would
only trigger on mbuf allocation failure, but would be hilariously
bad if they did trigger.
Submitted by: gibbs(3), andre(2)
Reviewed by: emax, andre
Obtained from: Netflix
MFC after: 1 week
the thread reference on the vp->v_rdev and use the returned struct
cdev *dev instead of using vp->v_rdev. Call dev_strategy_csw()
instead of dev_strategy(), since we now own the reference.
Since the csw was already calculated, test d_flags to avoid mapping
the buffer if the driver supports unmapped requests [*].
Suggested by: kan [*]
Reviewed by: kan (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
but assumes that a thread reference was already obtained on the passed
device. Use the function from physio(), to avoid two extra dev_mtx
lock and unlock. Note that physio() is always used as the cdevsw
method, or is called from a cdevsw method, and the caller already owns
the reference.
dev_strategy() is left to keep KPI intact, but now it is implemented
as a wrapper around dev_strategy_csw().
Do some style cleanup in physio().
Requested and reviewed by: kan (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
maxbcache size fixed, the auto-tuned transient map is too small for
real-world load on i386.
Tested by: David Wolfskill
Sponsored by: The FreeBSD Foundation