When maxusers was unrestricted and maxfiles was allowed to autotune
much higher, the result was that ncallout, which was based on maxfiles
and maxproc, grew much higher than needed.
To fix this, clip autotuning to the same number we would get with
the old maxusers algorithm, which stopped scaling at 384
maxusers.
Growing ncallout higher is not likely to be needed, since most consumers
of timeout(9) are gone, and any higher value for ncallout causes the
callwheel hashes to be much larger than will ever be needed by
most applications.
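A rough sketch of the intended clamp (the old formula is recalled from
memory and the cap name is a placeholder; only the shape of the change
is meant):

    /* autotuned value, roughly the old 16 + maxproc + maxfiles */
    ncallout = 16 + maxproc + maxfiles;
    /* clip to what the old algorithm produced at 384 maxusers */
    ncallout = imin(ncallout, ncallout_cap_384_maxusers);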
MFC after: 1 month
Reviewed by: mav
Set the v_hash for a new vnode in the getnewvnode() to the value
calculated based on the vnode structure address. Filesystems using
vfs_hash_insert() override the v_hash using the standard formula of
(inode_number + mnt_hashseed). For other filesystems, the
initialization allows vfs_hash_index() to provide a useful hash as well.
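A sketch of the kind of initialization meant here (the shift constant is
a placeholder, not the actual identifier):

    /*
     * Derive a default hash from the vnode address, dropping the low
     * bits that are identical for all vnodes due to allocation
     * alignment (VN_ADDR_SHIFT is hypothetical).
     */
    vp->v_hash = (u_int)((uintptr_t)vp >> VN_ADDR_SHIFT);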
Suggested, reviewed and tested by: peter
Sponsored by: The FreeBSD Foundation
MFC after: 5 days
pre-masked hash for the given vnode. The function assumes that
vp->v_hash is initialized by the filesystem vnode instantiation
function. At the moment, this is only done if the filesystem uses
vfs_hash_insert().
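Given the description above, the helper presumably reduces to something
along these lines:

    u_int
    vfs_hash_index(struct vnode *vp)
    {

            /* pre-masked: the caller applies the hash table mask */
            return (vp->v_hash + vp->v_mount->mnt_hashseed);
    }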
Reviewed by: peter
Tested by: peter, pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 5 days
mount, which means that it must not be called while the snaplock is
owned. The vfs_write_resume(9) does call the function as the
VFS_SUSP_CLEAN() method, which is too early and falls into the region
still protected by snaplock.
Add yet another flag for vfs_write_resume_flags() to avoid calling the
suspension cleanup handler after the suspend is lifted, and use it in
the ffs_snapshot() call to vfs_write_resume().
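The resulting call in the snapshot code should look roughly like this
(flag name as described above, shown for illustration):

    /*
     * Resume writes, but skip the VFS_SUSP_CLEAN() handler because
     * the snaplock is still held at this point.
     */
    vfs_write_resume_flags(mp, VR_NO_SUSPCLR);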
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
the write start, by adding a variation of the vfs_write_resume(9)
which accepts flags.
Use the new function to prevent a deadlock between parallel suspension
and snapshotting a UFS mount. The ffs_snapshot() code performed
vfs_write_resume() followed by vn_start_write() while owning the
snaplock. If the suspension intervened between the resume and
vn_start_write(), the deadlock occurred after the suspending thread
tried to lock the snaplock, most typically during the write in the
ffs_copyonwrite().
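A sketch of the intended replacement in ffs_snapshot() (the flag name is
assumed from the description above):

    /*
     * Resume the suspension and start the write accounting in one
     * step, so another suspension cannot slip in between the two
     * while the snaplock is held.
     */
    vfs_write_resume_flags(mp, VR_START_WRITE);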
Reported and tested by: Andreas Longwitz <longwitz@incore.de>
Reviewed by: mckusick
MFC after: 2 weeks
X-MFC-note: make the vfs_write_resume(9) function a macro after the MFC,
in HEAD
case. There is no point in optimizing the code further and using a TRUE
literal for a path that does heavyweight stuff anyway (like lock
acquisition), at the price of obfuscated code.
Use the appropriate check where necessary and remove a macro.
Sponsored by: EMC / Isilon storage division
MFC after: 3 days
interrupt context can still be the idlethread. At that point, without the
panic condition, it can still happen that the idlethread will then try to
acquire some locks to carry on some operations.
Skip the idlethread check on block/sleep lock operations when KDB is
active.
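The adjusted assertion presumably takes roughly this shape (exact message
and placement vary per lock class):

    KASSERT(kdb_active != 0 || !TD_IS_IDLETHREAD(curthread),
        ("blockable sleep lock acquired by the idle thread"));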
Reported by: jh
Tested by: jh
MFC after: 1 week
When kern_yield() was introduced with the possibility to specify
a new priority, the behaviour changed by not lowering the priority at all
in the consumers, making the yielding mechanism highly ineffective for
high-priority kthreads like bufdaemon, syncer, vlrudaemon, etc.
There is no evidence that the consumers can cope with such a change in
semantics, and this situation could eventually lead to bugs similar to
the ones fixed in r244240.
Re-specify the userland priority for the kthreads involved.
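For the affected kthreads this amounts to passing an explicit priority
again, e.g.:

    if (should_yield())
            kern_yield(PRI_USER);   /* drop to a userland priority */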
Tested by: pho
Reviewed by: kib, mdf
MFC after: 1 week
manage a set of power-of-2 sized buffers for bus_dmamem_alloc().
This allows the caller to provide the back-end uma allocator,
allowing full control of the memory pages backing the pool. For
convenience, it includes an optional builtin allocator that provides pages
allocated with the VM_MEMATTR_UNCACHEABLE attribute, for managing pools of
DMA buffers for BUS_DMA_COHERENT or BUS_DMA_NOCACHE.
This also allows the caller to specify a minimum alignment, and it ensures
that all buffers start on a boundary and have a length that's a multiple of
that value, to avoid using buffers that trigger partial cache line flushes.
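A hypothetical illustration of the sizing rule (not the actual allocator
code): round each request up to a power of two that is also a multiple of
the minimum alignment, so buffers never share a partially flushed cache
line.

    static bus_size_t
    bufpool_roundup(bus_size_t size, bus_size_t min_align)
    {
            bus_size_t n;

            for (n = min_align; n < size; n <<= 1)
                    continue;
            return (n);
    }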
Submitted by: Ian Lepore <freebsd@damnhippie.dyndns.org>
only returns the name, but also the vnode of the corefile to use.
This simplifies the code and closes a few races, especially in %I handling.
Reviewed by: kib
Obtained from: WHEEL Systems
- Use this new format to automatically handle syscalls and VOPs. This
changes the earlier format but is still human readable.
Sponsored by: EMC / Isilon Storage Division
calls and turn it on.
- Do not allow them to be called inside a jail. [1]
Pointed out by: trasz [1]
Reviewed by: avg
Approved by: kib (mentor)
MFC after: 1 week
This fixes a panic where we hold a mutex (the process lock) and try to
obtain a sleepable lock (the vnode lock in expand_name()). The panic
could occur when %I was used in kern.corefile.
Additionally we avoid expand_name() overhead when coredumps are disabled.
Obtained from: WHEEL Systems
yields, specify the user priority for the yield. Otherwise, a
higher-priority (kernel) thread could fall into a priority inversion
with the thread owning the mutex lock.
On single-processor machines or UP kernels, do not loop adaptively
when the next vnode cannot be locked; instead, yield unconditionally.
Restructure the iteration initializer and the iterator to remove code
duplication. Put the code to fetch and lock a vnode next to the
current marker, into the mnt_vnode_next_active() function, and use it
instead of repeating the loop.
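Consumers then iterate along these lines (sketch only; the iterator hands
back the vnode with its interlock held):

    struct vnode *mvp, *vp;

    MNT_VNODE_FOREACH_ACTIVE(vp, mp, mvp) {
            /* ... per-vnode work with the interlock held ... */
            VI_UNLOCK(vp);
    }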
Reported by: hrs, rmacklem
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
was being copied from the wrong place. This patch fixes that.
This could cause access failures for mapped users, when the group
permissions were needed.
PR: 147998
Submitted by: Christopher Key (cjk32 at cam.ac.uk)
MFC after: 2 weeks
This is an ongoing effort to provide runtime debug information
useful in the field that does not panic existing installations.
This gives us the flexibility needed when shipping images to a
potentially large audience with WITNESS enabled without worrying
about formerly non-fatal LORs hurting a release.
Sponsored by: iXsystems
kern_yield() is problematic then.
The owned mutex is the mount interlock, and it is in fact not needed
to guarantee the stability of the mount list of active vnodes, so fix
the issue by only taking the mount interlock for MNT_REF and
MNT_REL operations.
While there, augment the unconditional yield with some amount of
spinning [1].
Reported and tested by: pho
Reviewed by: attilio
Submitted by: attilio [1]
MFC after: 3 days
call. The function indicates a failure by the TRUE return value. To
be extra safe, assert that the return value from the following
vm_map_insert() indicates success.
Fix style issues in the nearby lines, reformulate the comment.
Reviewed by: alc (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
- unp_zone: kern.ipc.maxsockets limit reached
- socket_zone: kern.ipc.maxsockets limit reached
- zone_mbuf: kern.ipc.nmbufs limit reached
- zone_clust: kern.ipc.nmbclusters limit reached
- zone_jumbop: kern.ipc.nmbjumbop limit reached
- zone_jumbo9: kern.ipc.nmbjumbo9 limit reached
- zone_jumbo16: kern.ipc.nmbjumbo16 limit reached
Note that those warnings are printed no more often than every five minutes
and can be globally turned off by setting the sysctl/tunable
vm.zone_warnings to 0.
Discussed on: arch
Obtained from: WHEEL Systems
MFC after: 2 weeks
This is to allow debug images to be used without taking down the
system when non-fatal asserts are hit.
The following sysctls are added:
debug.kassert.warn_only: 1 = log, 0 = panic
debug.kassert.do_ktr: set to a ktr mask for logging via KTR
debug.kassert.do_log: 1 = log, 0 = quiet
debug.kassert.warnings: stats, number of kasserts hit
debug.kassert.log_panic_at:
number of kasserts before we actually panic, 0 = never
debug.kassert.log_pps_limit: pps limit for log messages
debug.kassert.log_mute_at: stop warning after N kasserts, 0 = never stop
debug.kassert.kassert: set this sysctl to trigger a kassert
Discussed with: scottl, gnn, marcel
Sponsored by: iXsystems
EPROTONOSUPPORT if the address family is not supported.
- introduce pffinddomain() to find a domain by family and use it as
appropriate.
Reviewed by: glebius
- As the comment notes, CALLOUT_LOCAL_ALLOC cannot be checked
directly from the callout flags but must be checked via a cached
value. Hence, do so before actually removing the callout, when
needed, in softclock_call_cc().
- In softclock_call_cc() also add a comment in the waiting and deferred
migration case explaining that the dereference should be safe
because of the migration dereference invariants.
Additionally:
- In softclock_call_cc(), for the deferred migration case, move all the
accesses to the callout structure after the comment stating the callout
must not be destroyed.
- For consistency with this last tweak, use cached c_flags for the
KASSERT() in the deferred migration case. It is not strictly necessary
but this way all the callout accesses happen after the above-mentioned
comment, improving consistency.
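The pattern being enforced is roughly the following (sketch; local
variable name illustrative):

    /* cache the flags before the handler can free or reuse the callout */
    c_flags = c->c_flags;

    /* ... handler runs, callout may be freed or re-armed ... */

    if ((c_flags & CALLOUT_LOCAL_ALLOC) != 0)
            callout_cc_del(c, cc);  /* timeout(9)-style callout, free it */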
Pointy hat to: me
Sponsored by: Isilon Systems / EMC Corporation
Reviewed by: kib
MFC after: 2 weeks
X-MFC: 243901
from the callwheel. Calculate the cc->cc_next before removing the
callout, otherwise the code followed invalid tailq links. After
this, make softclock_call_cc() return void, since it always returned
cc->cc_next, which is immediately available to softclock()
anyway. This also allows eliminating a label under #ifdef SMP.
Remove the assignment of cc->cc_next from callout_cc_del(), since the
function is called with the callout already removed from the callwheel.
If cancelling the migration, also clear the CALLOUT_DFRMIGRATION flag.
Postpone the freeing of the timeout(9)-allocated callouts until after the
migration checks are done.
Add some more strict asserts about the state of the callout in
callout_call_cc().
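The fixed removal sequence is, schematically (field and bucket names are
from memory and only illustrate the ordering):

    /* pick up the successor before the links become invalid */
    cc->cc_next = TAILQ_NEXT(c, c_links.tqe);
    TAILQ_REMOVE(&cc->cc_callwheel[bucket], c, c_links.tqe);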
Reviewed by: attilio
Reported and tested by: pho (previous version)
MFC after: 2 weeks
callout is started before kern_setitimer() acquires the process mutex, but
loses a race and kern_setitimer() gets the process mutex before the
callout. Then, assuming that the newly specified struct itimerval has
it_interval zero, but it_value non-zero, the callout, after it starts
executing again, clears p->p_realtimer.it_value, but kern_setitimer()
already rescheduled the callout.
As a result of the race, p_realtimer is zero, and yet the callout
is rescheduled. Then, in exit1(), the exit code sees that it_value
is zero and does not even try to stop the callout. This allows the
struct proc to be reused and eventually the armed callout is
re-initialized. The consequence is the corrupted callwheel tailq.
Use process mutex to interlock the callout start, which fixes the race.
Reported and tested by: pho
Reviewed by: jhb
MFC after: 2 weeks
over the active list. The mount interlock is not enough to guarantee
the validity of the tailq link pointers. The __mnt_vnode_next_active()
and __mnt_vnode_first_active() active list iterator helper functions
did not provide the necessary stability for the list, allowing the
iterators to pick up garbage.
This was uncovered after r243599 made the active list iterators
non-nop.
Since a vnode interlock is ordered before vnode_free_list_mtx, obtain the
vnode interlock in a non-blocking manner when under vnode_free_list_mtx,
and restart the iteration after yielding if the lock attempt failed.
Assert that a vnode found on the list is active, and assert that the
helpers return the vnode with the interlock owned.
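The non-blocking acquisition and restart described above is,
schematically:

    if (VI_TRYLOCK(vp) == 0) {
            /* lock order forbids sleeping for the interlock here */
            mtx_unlock(&vnode_free_list_mtx);
            kern_yield(PRI_USER);
            mtx_lock(&vnode_free_list_mtx);
            goto restart;   /* re-find the marker and continue */
    }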
Reported and tested by: pho
MFC after: 1 week
Fix path handling for *at() syscalls.
Before the change, the directory descriptor was totally ignored,
so the relative path argument was appended to the current working
directory path and not to the path provided by the descriptor; thus
wrong paths were stored in audit logs.
Now that we use the directory descriptor in vfs_lookup(), move the
AUDIT_ARG_UPATH1() and AUDIT_ARG_UPATH2() calls to the place where
we hold the file descriptor table lock, so we are sure paths will
be resolved relative to the same directory in the audit record and
in the actual operation.
Sponsored by: FreeBSD Foundation (auditdistd)
Reviewed by: rwatson
MFC after: 2 weeks
Remove redundant call to AUDIT_ARG_UPATH1().
Path will be remembered by the following NDINIT(AUDITVNODE1) call.
Sponsored by: FreeBSD Foundation (auditdistd)
MFC after: 2 weeks
variable as they may overflow on i386/PAE and i386 with > 2GB RAM.
Use the 64-bit quad_t instead. It has broader kernel infrastructure support,
with TUNABLE_QUAD_FETCH() and qmin()/qmax(), than other available types.
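Typical usage after the conversion (variable and tunable names here are
placeholders):

    quad_t maxfoomem, foolimit;

    TUNABLE_QUAD_FETCH("kern.maxfoomem", &maxfoomem);
    maxfoomem = qmin(maxfoomem, foolimit);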
Pointed out by: alc, bde
kernel memory, whichever is lower. The overall mbuf-related memory
limit must be set so that mbufs (and clusters of various sizes)
can't exhaust physical RAM or KVM.
The limit is set to half of the physical RAM or KVM (whichever is
lower) as the baseline. In any normal scenario we want to leave
at least half of the physmem/kvm for other kernel functions and
userspace to prevent it from swapping too easily. Via a tunable
kern.maxmbufmem the limit can be upped to at most 3/4 of physmem/kvm.
At the same time divorce maxfiles from maxusers and set maxfiles to
physpages / 8 with a floor based on maxusers. This way busy servers
can make use of the significantly increased mbuf limits with a much
larger number of open sockets.
Tidy up ordering in init_param2() and check up on some users of
those values calculated here.
Out of the overall mbuf memory limit, 2K clusters and 4K (page size)
clusters get 1/4 each because these are the most heavily used mbuf
sizes. 2K clusters are used for MTU 1500 ethernet inbound packets.
4K clusters are used whenever possible for sends on sockets and thus
outbound packets. The larger cluster sizes of 9K and 16K are limited
to 1/6 of the overall mbuf memory limit. When jumbo MTUs are used,
these large clusters will end up only on the inbound path. They are
not used on outbound; there it's still 4K. Yes, that will stay that
way because otherwise we run into lots of complications in the
stack. And it really isn't a problem, so don't make a scene.
Normal mbufs (256B) weren't limited at all previously. This was
problematic as there are certain places in the kernel that on
allocation failure of clusters try to piece together their packet
from smaller mbufs.
The mbuf limit is the number of all other mbuf sizes together plus
some more to allow for standalone mbufs (ACK for example) and to
send off a copy of a cluster. Unfortunately there isn't a way to
set an overall limit for all mbuf memory together, as UMA doesn't
support such a limit.
NB: Every cluster also has an mbuf associated with it.
Two examples of the revised mbuf sizing limits:
1GB KVM:
  512MB limit for mbufs
  419,430 mbufs
   65,536 2K mbuf clusters
   32,768 4K mbuf clusters
    9,709 9K mbuf clusters
    5,461 16K mbuf clusters
16GB RAM:
  8GB limit for mbufs
  33,554,432 mbufs
   1,048,576 2K mbuf clusters
     524,288 4K mbuf clusters
     155,344 9K mbuf clusters
      87,381 16K mbuf clusters
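For illustration, the 1GB KVM figures above follow from the fractions
described earlier (taking 2K/4K/9K/16K as 2048/4096/9216/16384 bytes and
the 1/6 share as applying to each of the two jumbo sizes):

    512MB * 1/4 / 2048  =  65,536  2K clusters
    512MB * 1/4 / 4096  =  32,768  4K clusters
    512MB * 1/6 / 9216  ~=  9,709  9K clusters
    512MB * 1/6 / 16384 ~=  5,461  16K clusters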
These defaults should be sufficient for even the most demanding
network loads.
MFC after: 1 month
accept queues, a new socket/connection may be added to the queue
due to a race on the ACCEPT_LOCK.
The submitted patch is slightly changed in comments, teardown,
and locking order, and extended with KASSERTs.
Submitted by: Vijay Singh <vijju.singh-at-gmail-dot-com>
Found by: His team.
MFC after: 1 week
is in capability mode.
- Add a VN_OPEN_NOCAPCHECK flag for vn_open_cred() that will be converted
into the NOCAPCHECK namei flag.
This functionality will be used to enable core dumps for sandboxed processes.
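The coredump path can then open the corefile along these lines (sketch;
surrounding setup omitted):

    /* bypass the capability-mode check for this lookup only */
    error = vn_open_cred(&nd, &flags, cmode, VN_OPEN_NOCAPCHECK, cred, NULL);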
Reviewed by: rwatson
Obtained from: WHEEL Systems
MFC after: 2 weeks
to itself. For example, abort(3) at first tries to do kill(getpid(), SIGABRT),
which was failing in capability mode, so the code was falling back to exit(1).
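The relaxed check presumably becomes something like (sketch, not the
literal diff):

    if (IN_CAPABILITY_MODE(td) && pid != td->td_proc->p_pid)
            return (ECAPMODE);      /* only signals to self are allowed */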
Reviewed by: rwatson
Obtained from: WHEEL Systems
MFC after: 2 weeks
There have not been any complaints about the default behavior, so there
is no need to keep a knob that enables the worse alternative.
Now that the hard-stopping of other CPUs is the only behavior, the panic_cpu
spinlock-like logic can be dropped, because only a single CPU is
supposed to win the stop_cpus_hard(other_cpus) race and proceed past that
call.
MFC after: 1 month
the unix domain sockets to the next tick, coalescing the serial calls
until the collection fires. The thought is that more work for the
collector could arise in the near future, allowing more to be cleaned and
not spending too much CPU on repeated collection when there is no garbage.
Currently the collection task is fired immediately upon unix domain
socket close if there are any rights in flight, which causes excessive
CPU usage and overly long blocking of the threads waiting for
unp_list_lock and unp_link_rwlock in write mode.
Robert noted that it would be nice if we could find some heuristic by
which we decide whether to run GC a bit more quickly. E.g., if the
number of UNIX domain sockets is close to its resource limit, but not
quite.
Reported and tested by: Markus Gebert <markus.gebert@hostpoint.ch>
Reviewed by: rwatson
MFC after: 2 weeks
taskqueue_enqueue_timeout(). Do not rearm the callout if it is
already armed and ticks is negative. Otherwise rearm it to fire
in abs(ticks) ticks in the future.
The intended use is to call taskqueue_enqueue_timeout() for the given
timeout_task with the same negative ticks argument. As a result, the
task is scheduled to execute no later than abs(ticks) ticks in the
future, and subsequent enqueues are coalesced until the already
scheduled task is finished.
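Intended usage sketch (queue and task names are hypothetical):

    /*
     * Run the GC no later than one tick from now; repeated calls while
     * it is pending are coalesced into the already scheduled run.
     */
    taskqueue_enqueue_timeout(taskqueue_thread, &foo_gc_task, -1);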
Reviewed by: rwatson
Tested by: Markus Gebert <markus.gebert@hostpoint.ch>
MFC after: 2 weeks
then:
- assume the lock is held in exclusive mode and remove a moot check
about the lock acquisition.
- in the destructor, remove the !MPSAFE-specific chunk.
Reviewed by: kib
MFC after: 2 weeks
Also remove the checks from vop_lookup_pre and vop_lookup_post, which
are now completely redundant (before this change they were partially
redundant).
Discussed with: kib
MFC after: 10 days