This is based on work done by jeff@ and jhb@, as well as the numa.diff
patch that has been circulating when someone asks for first-touch NUMA
on -10 or -11.
* Introduce a simple set of VM policy and iterator types.
* tie the policy types into the vm_phys path for now, mirroring how
the initial first-touch allocation work was enabled.
* add syscalls to control changing thread and process defaults.
* add a global NUMA VM domain policy.
* implement a simple cascade policy order - if a thread policy exists, use it;
if a process policy exists, use it; use the default policy.
* processes inherit policies from their parent processes, threads inherit
policies from their parent threads.
* add a simple tool (numactl) to query and modify default thread/process
policities.
* add documentation for the new syscalls, for numa and for numactl.
* re-enable first touch NUMA again by default, as now policies can be
set in a variety of methods.
This is only relevant for very specific workloads.
This doesn't pretend to be a final NUMA solution.
The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by
'sysctl vm.default_policy=rr'.
This is only relevant if MAXMEMDOM is set to something other than 1.
Ie, if you're using GENERIC or a modified kernel with non-NUMA, then
this is a glorified no-op for you.
Thank you to Norse Corp for giving me access to rather large
(for FreeBSD!) NUMA machines in order to develop and verify this.
Thank you to Dell for providing me with dual socket sandybridge
and westmere v3 hardware to do NUMA development with.
Thank you to Scott Long at Netflix for providing me with access
to the two-socket, four-domain haswell v3 hardware.
Thank you to Peter Holm for running the stress testing suite
against the NUMA branch during various stages of development!
Tested:
* MIPS (regression testing; non-NUMA)
* i386 (regression testing; non-NUMA GENERIC)
* amd64 (regression testing; non-NUMA GENERIC)
* westmere, 2 socket (thankyou norse!)
* sandy bridge, 2 socket (thankyou dell!)
* ivy bridge, 2 socket (thankyou norse!)
* westmere-EX, 4 socket / 1TB RAM (thankyou norse!)
* haswell, 2 socket (thankyou norse!)
* haswell v3, 2 socket (thankyou dell)
* haswell v3, 2x18 core (thankyou scott long / netflix!)
* Peter Holm ran a stress test suite on this work and found one
issue, but has not been able to verify it (it doesn't look NUMA
related, and he only saw it once over many testing runs.)
* I've tested bhyve instances running in fixed NUMA domains and cpusets;
all seems to work correctly.
Verified:
* intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different
NUMA policies for processes under test.
Review:
This was reviewed through phabricator (https://reviews.freebsd.org/D2559)
as well as privately and via emails to freebsd-arch@. The git history
with specific attributes is available at https://github.com/erikarn/freebsd/
in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy).
This has been reviewed by a number of people (stas, rpaulo, kib, ngie,
wblock) but not achieved a clear consensus. My hope is that with further
exposure and testing more functionality can be implemented and evaluated.
Notes:
* The VM doesn't handle unbalanced domains very well, and if you have an overly
unbalanced memory setup whilst under high memory pressure, VM page allocation
may fail leading to a kernel panic. This was a problem in the past, but it's
much more easily triggered now with these tools.
* This work only controls the path through vm_phys; it doesn't yet strongly/predictably
affect contigmalloc, KVA placement, UMA, etc. So, driver placement of memory
isn't really guaranteed in any way. That's next on my plate.
Sponsored by: Norse Corp, Inc.; Dell
CloudABI is a pure capability-based runtime environment for UNIX. It
works similar to Capsicum, except that processes already run in
capabilities mode on startup. All functionality that conflicts with this
model has been omitted, making it a compact binary interface that can be
supported by other operating systems without too much effort.
CloudABI is 'secure by default'; the idea is that it should be safe to
run arbitrary third-party binaries without requiring any explicit
hardware virtualization (Bhyve) or namespace virtualization (Jails). The
rights of an application are purely determined by the set of file
descriptors that you grant it on startup.
The datatypes and constants used by CloudABI's C library (cloudlibc) are
defined in separate files called syscalldefs_mi.h (pointer size
independent) and syscalldefs_md.h (pointer size dependent). We import
these files in sys/contrib/cloudabi and wrap around them in
cloudabi*_syscalldefs.h.
We then add stubs for all of the system calls in sys/compat/cloudabi or
sys/compat/cloudabi64, depending on whether the system call depends on
the pointer size. We only have nine system calls that depend on the
pointer size. If we ever want to support 32-bit binaries, we can simply
add sys/compat/cloudabi32 and implement these nine system calls again.
The next step is to send in code reviews for the individual system call
implementations, but also add a sysentvec, to allow CloudABI executabled
to be started through execve().
More information about CloudABI:
- GitHub: https://github.com/NuxiNL/cloudlibc
- Talk at BSDCan: https://www.youtube.com/watch?v=SVdF84x1EdA
Differential Revision: https://reviews.freebsd.org/D2848
Reviewed by: emaste, brooks
Obtained from: https://github.com/NuxiNL/freebsd
directory sys/contrib/libnv.
The goal of this operation is to NOT install header files which shouldn't
be used outside the nvlist library.
Approved by: pjd (mentor)
* GENERAL
- Update copyright.
- Make kernel options for RANDOM_YARROW and RANDOM_DUMMY. Set
neither to ON, which means we want Fortuna
- If there is no 'device random' in the kernel, there will be NO
random(4) device in the kernel, and the KERN_ARND sysctl will
return nothing. With RANDOM_DUMMY there will be a random(4) that
always blocks.
- Repair kern.arandom (KERN_ARND sysctl). The old version went
through arc4random(9) and was a bit weird.
- Adjust arc4random stirring a bit - the existing code looks a little
suspect.
- Fix the nasty pre- and post-read overloading by providing explictit
functions to do these tasks.
- Redo read_random(9) so as to duplicate random(4)'s read internals.
This makes it a first-class citizen rather than a hack.
- Move stuff out of locked regions when it does not need to be
there.
- Trim RANDOM_DEBUG printfs. Some are excess to requirement, some
behind boot verbose.
- Use SYSINIT to sequence the startup.
- Fix init/deinit sysctl stuff.
- Make relevant sysctls also tunables.
- Add different harvesting "styles" to allow for different requirements
(direct, queue, fast).
- Add harvesting of FFS atime events. This needs to be checked for
weighing down the FS code.
- Add harvesting of slab allocator events. This needs to be checked for
weighing down the allocator code.
- Fix the random(9) manpage.
- Loadable modules are not present for now. These will be re-engineered
when the dust settles.
- Use macros for locks.
- Fix comments.
* src/share/man/...
- Update the man pages.
* src/etc/...
- The startup/shutdown work is done in D2924.
* src/UPDATING
- Add UPDATING announcement.
* src/sys/dev/random/build.sh
- Add copyright.
- Add libz for unit tests.
* src/sys/dev/random/dummy.c
- Remove; no longer needed. Functionality incorporated into randomdev.*.
* live_entropy_sources.c live_entropy_sources.h
- Remove; content moved.
- move content to randomdev.[ch] and optimise.
* src/sys/dev/random/random_adaptors.c src/sys/dev/random/random_adaptors.h
- Remove; plugability is no longer used. Compile-time algorithm
selection is the way to go.
* src/sys/dev/random/random_harvestq.c src/sys/dev/random/random_harvestq.h
- Add early (re)boot-time randomness caching.
* src/sys/dev/random/randomdev_soft.c src/sys/dev/random/randomdev_soft.h
- Remove; no longer needed.
* src/sys/dev/random/uint128.h
- Provide a fake uint128_t; if a real one ever arrived, we can use
that instead. All that is needed here is N=0, N++, N==0, and some
localised trickery is used to manufacture a 128-bit 0ULLL.
* src/sys/dev/random/unit_test.c src/sys/dev/random/unit_test.h
- Improve unit tests; previously the testing human needed clairvoyance;
now the test will do a basic check of compressibility. Clairvoyant
talent is still a good idea.
- This is still a long way off a proper unit test.
* src/sys/dev/random/fortuna.c src/sys/dev/random/fortuna.h
- Improve messy union to just uint128_t.
- Remove unneeded 'static struct fortuna_start_cache'.
- Tighten up up arithmetic.
- Provide a method to allow eternal junk to be introduced; harden
it against blatant by compress/hashing.
- Assert that locks are held correctly.
- Fix the nasty pre- and post-read overloading by providing explictit
functions to do these tasks.
- Turn into self-sufficient module (no longer requires randomdev_soft.[ch])
* src/sys/dev/random/yarrow.c src/sys/dev/random/yarrow.h
- Improve messy union to just uint128_t.
- Remove unneeded 'staic struct start_cache'.
- Tighten up up arithmetic.
- Provide a method to allow eternal junk to be introduced; harden
it against blatant by compress/hashing.
- Assert that locks are held correctly.
- Fix the nasty pre- and post-read overloading by providing explictit
functions to do these tasks.
- Turn into self-sufficient module (no longer requires randomdev_soft.[ch])
- Fix some magic numbers elsewhere used as FAST and SLOW.
Differential Revision: https://reviews.freebsd.org/D2025
Reviewed by: vsevolod,delphij,rwatson,trasz,jmg
Approved by: so (delphij)
This will require for AArch64 as we dont have modules yet.
Sponsored by: HEIF5
Sponsored by: ARM Ltd.
Differential Revision: https://reviews.freebsd.org/D1997
Leaf drivers should not import the PCI bus interface to add IOV handling.
Instead, move the IOV client methods to a separate kobj interface.
Differential Revision: https://reviews.freebsd.org/D2584
Reviewed by: rstone
Support 7xxx adapters including firmware-assisted TSO and VLAN tagging:
- Solarflare Flareon Ultra 7000 series 10/40G adapters:
- Solarflare SFN7042Q QSFP+ Server Adapter
- Solarflare SFN7142Q QSFP+ Server Adapter
- Solarflare Flareon Ultra 7000 series 10G adapters:
- Solarflare SFN7022F SFP+ Server Adapter
- Solarflare SFN7122F SFP+ Server Adapter
- Solarflare SFN7322F Precision Time Synchronization Server Adapter
- Solarflare Flareon 7000 series 10G adapters:
- Solarflare SFN7002F SFP+ Server Adapter
Support utilities to configure adapters and update firmware.
The work is done by Solarflare developers
(Andy Moreton, Andrew Lee and many others),
Artem V. Andreev <Artem.Andreev at oktetlabs.ru> and me.
Sponsored by: Solarflare Communications, Inc.
MFC after: 2 weeks
Causually read by: gnn
Differential Revision: https://reviews.freebsd.org/D2618
In order to map memory from other domains when running on Xen FreeBSD uses
unused physical memory regions. Until now this memory has been allocated
using bus_alloc_resource, but this is not completely safe as we can end up
using unreclaimed MMIO or ACPI regions.
Fix this by introducing a new newbus method that can be used by Xen drivers
to request for unused memory regions. On amd64 we make sure this memory
comes from regions above 4GB in order to prevent clashes with MMIO/ACPI
regions. On i386 there's nothing we can do, so just fall back to the
previous mechanism.
Sponsored by: Citrix Systems R&D
Tested by: Gustau Pérez <gperez@entel.upc.edu>
remains. Xen is planning to phase out support for PV upstream since it
is harder to maintain and has more overhead. Modern x86 CPUs include
virtualization extensions that support HVM guests instead of PV guests.
In addition, the PV code was i386 only and not as well maintained recently
as the HVM code.
- Remove the i386-only NATIVE option that was used to disable certain
components for PV kernels. These components are now standard as they
are on amd64.
- Remove !XENHVM bits from PV drivers.
- Remove various shims required for XEN (e.g. PT_UPDATES_FLUSH, LOAD_CR3,
etc.)
- Remove duplicate copy of <xen/features.h>.
- Remove unused, i386-only xenstored.h.
Differential Revision: https://reviews.freebsd.org/D2362
Reviewed by: royger
Tested by: royger (i386/amd64 HVM domU and amd64 PVH dom0)
Relnotes: yes
It is not network-specific code and would
be better as part of libkern instead.
Move zlib.h and zutil.h from net/ to sys/
Update includes to use sys/zlib.h and sys/zutil.h instead of net/
Submitted by: Steve Kiernan stevek@juniper.net
Obtained from: Juniper Networks, Inc.
GitHub Pull Request: https://github.com/freebsd/freebsd/pull/28
Relnotes: yes
The only thing is used from this code is ipip_output() function, that does
IPIP encapsulation. Other parts of XF_IP4 code were removed in r275133.
Also it isn't possible to configure the use of XF_IP4, nor from userland
via setkey(8), nor from the kernel.
Simplify the ipip_output() function and rename it to ipsec_encap().
* move IP_DF handling from ipsec4_process_packet() into ipsec_encap();
* since ipsec_encap() called from ipsec[64]_process_packet(), it
is safe to assume that mbuf is contiguous at least to IP header
for used IP version. Remove all unneeded m_pullup(), m_copydata
and related checks.
* use V_ip_defttl and V_ip6_defhlim for outer headers;
* use V_ip4_ipsec_ecn and V_ip6_ipsec_ecn for outer headers;
* move all diagnostic messages to the ipsec_encap() callers;
* simplify handling of ipsec_encap() results: if it returns non zero
value, print diagnostic message and free mbuf.
* some style(9) fixes.
Differential Revision: https://reviews.freebsd.org/D2303
Reviewed by: glebius
Sponsored by: Yandex LLC
discontinued by its initial authors. In FreeBSD the code was already
slightly edited during the pf(4) SMP project. It is about to be edited
more in the projects/ifnet. Moving out of contrib also allows to remove
several hacks to the make glue.
Reviewed by: net@
function names have changed and comments are reformatted or added, but
there is no functional change.
Claim copyright for me and Adrian.
Sponsored by: Nginx, Inc.
Handle the VIRQ_DEBUG signal and print a stack trace of each vCPU on the Xen
console. This is only used for debug purposes and is triggered by the
administrator of the Xen host.
Sponsored by: Citrix Systems R&D
MFC after: 1 week
Many thanks to ian who gently provided me the DS1307 breakout board.
Tested on: Raspberry pi
Differential Revision: https://reviews.freebsd.org/D2022
Reviewed by: rpaulo
- Split the driver into independent pf and vf loadables. This is
in preparation for SRIOV support which will be following shortly.
This also allows us to keep a seperate revision control over the
two parts, making for easier sustaining.
- Make the TX/RX code a shared/seperated file, in the old code base
the ixv code would miss fixes that went into ixgbe, this model
will eliminate that problem.
- The driver loadables will now match the device names, something that
has been requested for some time.
- Rather than a modules/ixgbe there is now modules/ix and modules/ixv
- It will also be possible to make your static kernel with only one
or the other for streamlined installs, or both.
Enjoy!
Submitted by: jfv and erj
drivers can use it. This avoids some code duplication. Add missing
default case to all switch statements while at it. Also move the
hashing of the IPv6 flow field to layer 4 because the IPv6 flow field
is constant on a per L4 connection basis and not on a per L3 network.
Differential Revision: https://reviews.freebsd.org/D1987
Sponsored by: Mellanox Technologies
MFC after: 1 month
Implement the interace to create SR-IOV Virtual Functions (VFs).
When a driver registers that they support SR-IOV by calling
pci_setup_iov(), the SR-IOV code creates a new node in /dev/iov
for that device. An ioctl can be invoked on that device to
create VFs and have the driver initialize them.
At this point, allocating memory I/O windows (BARs) is not
supported.
Differential Revision: https://reviews.freebsd.org/D76
Reviewed by: jhb
MFC after: 1 month
Sponsored by: Sandvine Inc.
I2C real-time clock (RTC).
The DS3231 has an integrated temperature-compensated crystal oscillator
(TXCO) and crystal.
DS3231 has a temperature sensor, an independent 32kHz output (which can be
turned on and off by the driver) and another output that can be used as
interrupt for alarms or as a second square-wave output, which frequency and
operation mode can be set by driver sysctl(8) knobs.
Differential Revision: https://reviews.freebsd.org/D1016
Reviewed by: ian, rpaulo
Tested on: Raspberry pi model B
this option from all modules that enable it theirselves.
In C mode -fms-extensions option enables anonymous structs and unions,
allowing us to use this C11 feature in kernel. Of course, clang supports
it without any extra options.
Reviewed by: dim
Highlights:
- Multiple verbs API updates
- Support for RoCE, RDMA over ethernet
All hardware drivers depending on the common infiniband stack has been
updated aswell.
Discussed with: np @
Sponsored by: Mellanox Technologies
MFC after: 1 month
has been removed and the driver has been greatly simplified and
optimised for FreeBSD. The driver is currently not built by default.
Requested by: Bruce Simpson <bms@fastmail.net>
bits.
The motivation here is to eventually teach netisr and potentially
other networking subsystems a bit more about how RSS work queues / buckets
are configured so things have a hope of auto-configuring in the future.
* net/rss_config.[ch] takes care of the generic bits for doing
configuration, hash function selection, etc;
* topelitz.[ch] is now in net/ rather than netinet/;
* (and would be in libkern if it didn't directly include RSS_KEYSIZE;
that's a later thing to fix up.)
* netinet/in_rss.[ch] now just contains the IPv4 specific methods;
* and netinet/in6_rss.[ch] now just contains the IPv6 specific methods.
This should have no functional impact on anyone currently using
the RSS support.
Differential Revision: D1383
Reviewed by: gnn, jfv (intel driver bits)