for limiting disk (actually filesystem) IO.
Note that in some cases these limits are not quite precise. It's ok,
as long as it's within some reasonable bounds.
Testing - and review of the code, in particular the VFS and VM parts - is
very welcome.
MFC after: 1 month
Relnotes: yes
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D5080
Previously, the code determined a topology of processing units
(hardware threads, cores, packages) and then deduced a cache topology
using certain assumptions. The new code builds a topology that
includes both processing units and caches using the information
provided by the hardware.
At the moment, the discovered full topology is used only to creeate
a scheduling topology for SCHED_ULE.
There is no KPI for other kernel uses.
Summary:
- based on APIC ID derivation rules for Intel and AMD CPUs
- can handle non-uniform topologies
- requires homogeneous APIC ID assignment (same bit widths for ID
components)
- topology for dual-node AMD CPUs may not be optimal
- topology for latest AMD CPU models may not be optimal as the code is
several years old
- supports only thread/package/core/cache nodes
Todo:
- AMD dual-node processors
- latest AMD processors
- NUMA nodes
- checking for homogeneity of the APIC ID assignment across packages
- more flexible cache placement within topology
- expose topology to userland, e.g., via sysctl nodes
Long term todo:
- KPI for CPU sharing and affinity with respect to various resources
(e.g., two logical processors may share the same FPU, etc)
Reviewed by: mav
Tested by: mav
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D2728
universal.
(1) New struct intr_map_data is defined as a container for arbitrary
description of an interrupt used by a device. Typically, an interrupt
number and configuration relevant to an interrupt controller is encoded
in such description. However, any additional information may be encoded
too like a set of cpus on which an interrupt should be enabled or vendor
specific data needed for setup of an interrupt in controller. The struct
intr_map_data itself is meant to be opaque for INTRNG.
(2) An intr_map_irq() function is created which takes an interrupt
controller identification and struct intr_map_data as arguments and
returns global interrupt number which identifies an interrupt.
(3) A set of functions to be used by bus drivers is created as well as
a corresponding set of methods for interrupt controller drivers. These
sets take both struct resource and struct intr_map_data as one of the
arguments. There is a goal to keep struct intr_map_data in struct
resource, however, this way a final solution is not limited to that.
(4) Other small changes are done to reflect new situation.
This is only first step aiming to create stable interface for interrupt
controller drivers. Thus, some temporary solution is taken. Interrupt
descriptions for devices are stored in INTRNG and two specific mapping
function are created to be temporary used by bus drivers. That's why
the struct intr_map_data is not opaque for INTRNG now. This temporary
solution will be replaced by final one in next step.
Differential Revision: https://reviews.freebsd.org/D5730
Previously, freebsd32 binaries could submit read/write requests with lengths
greater than INT_MAX that a native kernel would have rejected.
Reviewed by: kib
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D5788
not getting a proper bounds check.
Thanks to CTurt for pointing at this with a big red blinking neon sign.
PR: 206761
Submitted by: sson
Reviewed by: cturt@hardenedbsd.org
MFC after: 3 days
Previously, calls to *sleep() and cv_*wait*() immediately returned during
early boot. Instead, permit threads that request a sleep without a
timeout to sleep as wakeup() works during early boot. Sleeps with
timeouts are harder to emulate without working timers, so just punt and
panic explicitly if any thread tries to use those before timers are
working. Any threads that depend on timeouts should either wait until
SI_SUB_KICK_SCHEDULER to start or they should use DELAY() until timers
are available.
Until APs are started earlier this should be a no-op as other kthreads
shouldn't get a chance to start running until after timers are working
regardless of when they were created.
Reviewed by: kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D5724
- Move some blocks around to reduce the number of 'if (unmap)' checks.
- Use 'pbuf == NULL' instead of 'unmap'.
- Use nitems.
- Pull an assignment out of an if expression.
Reviewed by: kib
Sponsored by: Chelsio Communications
data (headers). Historically the size of the headers was not checked
against the socket buffer space. Application could easily overcommit the
socket buffer space.
With the new sendfile (r293439) the problem remained, but a KASSERT was
inserted that checked that amount of data written to the socket matches
its space. In case when size of headers is bigger that socket space,
KASSERT fires. Without INVARIANTS the new sendfile won't panic, but
would report incorrect amount of bytes sent.
o With this change, the headers copyin is moved down into the cycle, after
the sbspace() check. The uio size is trimmed by socket space there,
which fixes the overcommit problem and its consequences.
o The compatibility handling for FreeBSD 4 sendfile headers API is pushed
up the stack to syscall wrappers. This required a copy and paste of the
code, but in turn this allowed to remove extra stack carried parameter
from fo_sendfile_t, and embrace entire compat code into #ifdef. If in
future we got more fo_sendfile_t function, the copy and paste level would
even reduce.
Reviewed by: emax, gallatin, Maxim Dounin <mdounin mdounin.ru>
Tested by: Vitalij Satanivskij <satan ukr.net>
Sponsored by: Netflix
m_unshare passes on the source mbuf's flags as-is to m_getcl and this
results in a leak if the flags include M_NOFREE. The fix is to clear
the bits not listed in M_COPYALL before calling m_getcl. M_RDONLY
should probably be filtered out too but that's outside the scope of this
fix.
Add assertions in the zone_mbuf and zone_pack ctors to catch similar
bugs.
Update netmap_get_mbuf to not pass M_NOFREE to m_getcl. It's not clear
what the original code was trying to do but it's likely incorrect.
Updated code is no different functionally but it avoids the newly added
assertions.
Reviewed by: gnn@
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D5698
controller IPI provider.
New struct intr_ipi is defined which keeps all info about an IPI:
its name, counter, send and dispatch methods. Generic intr_ipi_setup(),
intr_ipi_send() and intr_ipi_dispatch() functions are implemented.
An IPI provider must implement two functions:
(1) an intr_ipi_send_t function which is able to send an IPI,
(2) a setup function which initializes itself for an IPI and
calls intr_ipi_setup() with appropriate arguments.
Differential Revision: https://reviews.freebsd.org/D5700
No functional change.
struct radix_node_head's first element is rh so this was already
referring to the same address. It was likely an unintended
s/rnh/&rnh->rh/ change from r294706 as all other rnh_walktree() callers
pass the expected struct radix_node_head * rather than obscurely passing
the address of their first element.
Sponsored by: EMC / Isilon Storage Division
Replace free(M_RTABLE) with rn_detachhead() to match rn_inithead().
This would trigger when reloading NFS exports and was similar to
problems with pf reload [1].
PR: 194078 [1]
Sponsored by: EMC / Isilon Storage Division
This restores the pre-r290196 behaviour, eliminating the need to manually
press '.' a couple of times to get USB to finish probing.
Note that there's still something wrong with the console (character
echoing doesn't quite work), and there's also a reported problem with
BHyVe, but those two don't seem related to the problem above.
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
The mbuf provider is made up of a set of Statically Defined Tracepoints
which help us look into mbufs as they are allocated and freed. This can be
used to inspect the buffers or for a simplified mbuf leak detector.
New tracepoints are:
mbuf:::m-init
mbuf:::m-gethdr
mbuf:::m-get
mbuf:::m-getcl
mbuf:::m-clget
mbuf:::m-cljget
mbuf:::m-cljset
mbuf:::m-free
mbuf:::m-freem
There is also a translator for mbufs which gives some visibility into the structure,
see mbuf.d for more details.
Reviewed by: bz, markj
MFC after: 2 weeks
Sponsored by: Rubicon Communications (Netgate)
Differential Revision: https://reviews.freebsd.org/D5682
First, update the return types of aio_return() and aio_waitcomplete() to
ssize_t.
POSIX requires aio_return() to return a ssize_t so that it can represent
all return values from read() and write(). aio_waitcomplete() should use
ssize_t for the same reason.
aio_return() has used ssize_t in <aio.h> since r31620 but the manpage and
system call entry were not updated. aio_waitcomplete() has always
returned int.
Note that this does not require new system call stubs as this is
effectively only an API change in how the compiler interprets the return
value.
Second, allow aio_nbytes values up to IOSIZE_MAX instead of just INT_MAX.
aio_read/write should now honor the same length limits as normal read/write.
Third, use longs instead of ints in the aio_return() and aio_waitcomplete()
system call functions so that the 64-bit size_t in the in-kernel aiocb
isn't truncated to 32-bits before being copied out to userland or
being returned.
Finally, a simple test has been added to verify the bounds checking on the
maximum read size from a file.
For the !unmap case it may happen that pbuf gets called unreferenced
when vm_fault_quick_hold_pages() fails.
Initialize it so it doesn't cause trouble.
CID: 1352776
Reviewed by: jhb
MFC after: 1 week
On some architectures, u_long isn't large enough for resource definitions.
Particularly, powerpc and arm allow 36-bit (or larger) physical addresses, but
type `long' is only 32-bit. This extends rman's resources to uintmax_t. With
this change, any resource can feasibly be placed anywhere in physical memory
(within the constraints of the driver).
Why uintmax_t and not something machine dependent, or uint64_t? Though it's
possible for uintmax_t to grow, it's highly unlikely it will become 128-bit on
32-bit architectures. 64-bit architectures should have plenty of RAM to absorb
the increase on resource sizes if and when this occurs, and the number of
resources on memory-constrained systems should be sufficiently small as to not
pose a drastic overhead. That being said, uintmax_t was chosen for source
clarity. If it's specified as uint64_t, all printf()-like calls would either
need casts to uintmax_t, or be littered with PRI*64 macros. Casts to uintmax_t
aren't horrible, but it would also bake into the API for
resource_list_print_type() either a hidden assumption that entries get cast to
uintmax_t for printing, or these calls would need the PRI*64 macros. Since
source code is meant to be read more often than written, I chose the clearest
path of simply using uintmax_t.
Tested on a PowerPC p5020-based board, which places all device resources in
0xfxxxxxxxx, and has 8GB RAM.
Regression tested on qemu-system-i386
Regression tested on qemu-system-mips (malta profile)
Tested PAE and devinfo on virtualbox (live CD)
Special thanks to bz for his testing on ARM.
Reviewed By: bz, jhb (previous)
Relnotes: Yes
Sponsored by: Alex Perez/Inertial Computing
Differential Revision: https://reviews.freebsd.org/D4544
This is several year's worth of fail point upgrades done at EMC Isilon. They
are interdependent enough that it makes sense to put a single diff up for them.
Primarily, we added:
- Changing all mainline execution paths to be lockless, which lets us use fail
points in more sleep-sensitive areas, and allows more parallel execution
- A number of additional commands, including 'pause' that lets us do some
interesting deterministic repros of race conditions
- The ability to dump the stacks of all threads sleeping on a fail point
- A number of other API changes to allow marking up the fail point's context in
the code, and firing callbacks before and after execution
- A man page update
Submitted by: Matthew Bryan <matthew.bryan@isilon.com>
Reviewed by: cem (earlier version), jhb, kib, pho
With feedback from: bdrewery
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D5427
In timer2sbintime(), calculate the second and fractional second portions of
the sbintime separately. When calculating the the fractional second portion,
use a 64bit multiply to prevent excess truncation. This avoids the ~7% error
in the original conversion for ns, and smaller errors of the same type for us
and ms.
PR: 198139
Reviewed by: jhb
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D5397
The base system libc is only used to run binaries built on FreeBSD 7.0 and
later. It does not need to include system call wrappers for system calls
only used by FreeBSD binaries built on versions older than 7.0. This was
already true for "COMPAT" system calls, but now wrappers for system calls
used on FreeBSD 4 and 6 are excluded as well.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D5597
192.168.1.1, with share "share". This commit fixes a problem
where "mkdir /net/192.168.1.1/share/meh" would return spurious
error instead of creating the directory if the target filesystem
wasn't mounted yet; subsequent attempts would work correctly.
The failure scenario is kind of complicated to explain, but it all
boils down to calling VOP_MKDIR() for the target filesystem (NFS)
with wrong dvp - the autofs vnode instead of the filesystem root
mounted over it.
Reviewed by: kib@
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D5442