kernel.
When access restrictions are added to a page table entry, we flush the
corresponding virtual address mapping from the TLB. In contrast, when
access restrictions are removed from a page table entry, we do not
flush the virtual address mapping from the TLB. This is exactly as
recommended in AMD's documentation. In effect, when access
restrictions are removed from a page table entry, AMD's MMUs will
transparently refresh a stale TLB entry. In short, this saves us from
having to perform potentially costly TLB flushes. In contrast,
Intel's MMUs are allowed to generate a spurious page fault based upon
the stale TLB entry. Usually, such spurious page faults are handled
by vm_fault() without incident. However, when we are executing
no-fault sections of the kernel, we are not allowed to execute
vm_fault(). This change introduces special-case handling for spurious
page faults that occur in no-fault sections of the kernel.
In collaboration with: kib
Tested by: gibbs (an earlier version)
I would also like to acknowledge Hiroki Sato's assistance in
diagnosing this problem.
MFC after: 1 week
inserted after the priority token thus cleaning up the output.
- Remove the needless double internal do_add_char function.
- Resolve a possible deadlock if interrupts are
disabled and getnanotime is called
Reviewed by: bde kmacy, avg, sbruno (various versions)
Approved by: cperciva
MFC after: 2 weeks
hardclock() tick should be run on every active CPU, or on only one.
On my tests, avoiding extra interrupts because of this on 8-CPU Core i7
system with HZ=10000 saves about 2% of performance. At this moment option
implemented only for global timers, as reprogramming per-CPU timers is
too expensive now to be compensated by this benefit, especially since we
still have to regularly run hardclock() on at least one active CPU to
update system uptime. For global timer it is quite trivial: timer runs
always, but we just skip IPIs to other CPUs when possible.
Option is enabled by default now, keeping previous behavior, as periodic
hardclock() calls are still used at least to implement setitimer(2) with
ITIMER_VIRTUAL and ITIMER_PROF arguments. But since default schedulers don't
depend on it since r232917, we are much more free to experiment with it.
MFC after: 1 month
with HZ rate through the sched_tick() calls from hardclock().
Potentially it can be used to improve precision, but now it is just minus
one more reason to call hardclock() for every HZ tick on every active CPU.
SCHED_4BSD never used sched_tick(), but keep it in place for now, as at
least SCHED_FBFS existing in patches out of the tree depends on it.
MFC after: 1 month
Instead of using 25MHz equality threshold, look for the nearest value when
handling dev.cpu.0.freq sysctl and for exact match when it is expected.
ACPI may report extra level with frequency 1MHz above the nominal to
control Intel Turbo Boost operation. It is not a bug, but feature:
dev.cpu.0.freq_levels: 2934/106000 2933/95000 2800/82000 ...
In this case value 2933 means 2.93GHz, but 2934 means 3.2-3.6GHz.
I've found that my Core i7-870 based system has Intel Turbo Boost disabled
by default and without this change it was absolutely invisible and hard
to control.
MFC after: 2 weeks
- Pass number of events to the statclock() and profclock() functions
same as to hardclock() before to not call them many times in a loop.
- Rename them into statclock_cnt() and profclock_cnt().
- Turn statclock() and profclock() into compatibility wrappers,
still needed for arm.
- Rename hardclock_anycpu() into hardclock_cnt() for unification.
MFC after: 1 week
the cached name used for KTR_SCHED traces when a thread's name changes.
This way KTR_SCHED traces (and thus schedgraph) will notice when a thread's
name changes, most commonly via execve().
MFC after: 2 weeks
fifo_iseof() condition, allowing the v_fifoinfo to be reset and freed
by fifo_cleanup().
Precalculate EOF at the places were fo_wgen is changed, and cache the
state in a new pipe state flag PIPE_SAMEWGEN.
Reported and tested by: bf
Submitted by: gianni
MFC after: 1 week (a backport)
does not fit into registers, declare that we do not support this case
using CTASSERT(), and remove endianess-unsafe code to split return value
into td_retval.
While there, change the style of the sysctl debug.iosize_max_clamp
definition.
Requested by: bde
MFC after: 3 weeks
using the o32 ABI. This mostly follows nwhitehorn's lead in implementing
COMPAT_FREEBSD32 on powerpc64.
o) Add a new type to the freebsd32 compat layer, time32_t, which is time_t in the
32-bit ABI being used. Since the MIPS port is relatively-new, even the 32-bit
ABIs use a 64-bit time_t.
o) Because time{spec,val}32 has the same size and layout as time{spec,val} on MIPS
with 32-bit compatibility, then, disable some code which assumes otherwise
wrongly when built for MIPS. A more general macro to check in this case would
seem like a good idea eventually. If someone adds support for using n32
userland with n64 kernels on MIPS, then they will have to add a variety of
flags related to each piece of the ABI that can vary. That's probably the
right time to generalize further.
o) Add MIPS to the list of architectures which use PAD64_REQUIRED in the
freebsd32 compat code. Probably this should be generalized at some point.
Reviewed by: gonzo
significantly. Upon investigation this was caused by name cache
misses for lookups of "..". For name cache entries for non-".."
directories, the cache entry serves double duty. It maps both the
named directory plus ".." for the parent of the directory. As such,
two ctime values (one for each of the directory and its parent) need
to be saved in the name cache entry.
This patch adds an entry for ctime of the parent directory to the
name cache. It also adds an additional uma zone for large entries
with this time value, in order to minimize memory wastage.
As well, it fixes a couple of cases where the mtime of the parent
directory was being saved instead of ctime for positive name cache
entries. With this patch, Lookup RPC counts return to values similar
to pre-r230394 kernels.
Reported by: bde
Discussed with: kib
Reviewed by: jhb
MFC after: 2 weeks
long for specifying a boundary constraint.
- Change bus_dma tags to use bus_addr_t instead of bus_size_t for boundary
constraints.
These allow boundary constraints to be fully expressed for cases where
sizeof(bus_addr_t) != sizeof(bus_size_t). Specifically, it allows a
driver to properly specify a 4GB boundary in a PAE kernel.
Note that this cannot be safely MFC'd without a lot of compat shims due
to KBI changes, so I do not intend to merge it.
Reviewed by: scottl
snapshots on UFS filesystems running with journaled soft updates.
This is the first of several bugs that need to be fixed before
removing the restriction added in -r230250 to prevent the use
of snapshots on filesystems running with journaled soft updates.
The deadlock occurs when holding the snapshot lock (snaplk)
and then trying to flush an inode via ffs_update(). We become
blocked by another process trying to flush a different inode
contained in the same inode block that we need. It holds the
inode block for which we are waiting locked. When it tries to
write the inode block, it gets blocked waiting for the our
snaplk when it calls ffs_copyonwrite() to see if the inode
block needs to be copied in our snapshot.
The most obvious place that this deadlock arises is in the
ffs_copyonwrite() routine when it updates critical metadata
in a snapshot and tries to write it out before proceeding.
The fix here is to write the data and indirect block pointer
for the snapshot, but to skip the call to ffs_update() to
write the snapshot inode. To ensure that we will never have
to update a pointer in the inode itself, the ffs_snapshot()
routine that creates the snapshot has to ensure that all the
direct blocks are allocated as part of the creation of the
snapshot.
A less obvious place that this deadlock occurs is when we hold
the snaplk because we are deleting a snapshot. In the course of
doing the deletion, we need to allocate various soft update
dependency structures and allocate some journal space. If we
hit a resource limit while doing this we decrease the resources
in use by flushing out an existing dirty file to get it to give
up the soft dependency resources that it holds. The flush can
cause an ffs_update() to be done on the inode for the file that
we have selected to flush resulting in the same deadlock as
described above when the inode that we have chosen to flush
resides in the same inode block as the snapshot inode that we hold.
The fix is to defer cleaning up any time that the inode on which
we are operating is a snapshot.
Help and review by: Jeff Roberson
Tested by: Peter Holm
MFC (to 9 only) after: 2 weeks
operations for setting and accessing vnode's v_socket field.
The operations are necessary to implement proper unix socket handling
on layered file systems like nullfs(5).
This change fixes the long standing issue with nullfs(5) being in that
unix sockets did not work between lower and upper layers: if we bound
to a socket on the lower layer we could connect only to the lower
path; if we bound to the upper layer we could connect only to the
upper path. The new behavior is one can connect to both the lower and
the upper paths regardless what layer path one binds to.
PR: kern/51583, kern/159663
Suggested by: kib
Reviewed by: arch
MFC after: 2 weeks
following clang warning:
sys/kern/sys_pipe.c:1556:10: error: promoted type 'int' of K&R function parameter is not compatible with the parameter type 'mode_t'
(aka 'unsigned short') declared in a previous prototype [-Werror]
mode_t mode;
^
sys/kern/sys_pipe.c:155:19: note: previous declaration is here
static fo_chmod_t pipe_chmod;
^
not get syscall exit notification until the child performed exec of
exit. Swap the order of doing ptracestop() and waiting for P_PPWAIT
clearing, by postponing the wait into syscallret after ptracestop()
notification is done.
Reported, tested and reviewed by: Dmitry Mikulin <dmitrym juniper net>
MFC after: 2 weeks
Descriptions are specific to drivers and we don't change drivers on attached
devices. This fixes a few places where we were not clearing the description
when detaching a driver (e.g. with device_attach() failed). While here, fix
a few other nits:
- Remove spurious call to remove a device's driver from
devclass_driver_deleted(). device_detach() removes it already.
- Fix a typo.
- In sched_pickcpu() be more careful taking previous CPU on SMT systems.
Do it only if all other logical CPUs of that physical one are idle to avoid
extra resource sharing.
- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least loaded
one, where the thread with its priority can run now. If none found, search
just for the least loaded CPU.
- Make cpu_search() compare lowest/highest CPU load when comparing CPU
groups with equal load. That allows to differentiate 1+1 and 2+0 loads.
- Make cpu_search() to prefer specified (previous) CPU or group if load
is equal. This improves cache affinity for more complicated topologies.
- Randomize CPU selection if above factors are equal. Previous code tend
to prefer CPUs with lower IDs, causing unneeded collisions.
- Rework periodic balancer in sched_balance_group(). With cpu_search()
more intelligent now, make balansing process flat, removing recursion
over the topology tree. That fixes double swap problem and makes load
distribution more even and predictable.
All together this gives 10-15% performance improvement in many tests on
CPUs with SMT, such as Core i7, for number of threads is less then number
of logical CPUs. In some tests it also gives positive effect to systems
without SMT.
Reviewed by: jeff
Tested by: flo, hackers@
MFC after: 1 month
Sponsored by: iXsystems, Inc.
Uftdi(4) examines (c_iflag & (IXON|IXOFF)) to control hw XON-XOFF support.
This is obviously no good, if changes to those bits are not communicated
down the stack.