pmap_remove_all(). Not doing the drain allows the pmap_enter() to
proceed in parallel, making the pmap_remove_all() effects void.
The race results in an invalidated page mapped wired by usermode.
Reported and tested by: pho
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
Approved by: re (glebius)
Pass the pointy hat please.
Also unblock the software (Yarrow) generator for now. This will be
reverted; Yarrow needs to block until secure, not this behaviour
of serving as soon as asked.
Folks with specific requiremnts will be able to (can!) unblock this
device with any write, and are encouraged to do so in /etc/rc.d/*
scripting. ("Any" in this case could be "echo '' > /dev/random" as
root).
Changes are to
- update board and network interface detection logic
- fix reading onboard CPLD in little-endian config
- print NAE frequency conrrectly for Bx chips
- update XAUI config to disable Rx/Tx until interface is up
Submitted by: Venkatesh J V <venkatesh.vivekanandan@broadcom.com>
Use a variant of mips libc memcpy for kernel. This implementation uses
64-bit operations when compiled for 64-bit, and is significantly faster
in that case.
Submitted by: Tanmay Jagdale <tanmayj@broadcom.com>
ffsl() implementation, when it is available, instead of homegrown iteration.
On dual-E5645 amd64 system (2x6x2 cores) under heavy I/O load that reduces
time spent inside cpu_search() from 19% to 13%, while IOPS increased by 5%.
the rest by me.
o Namespace cleanup; the Yarrow name is now restricted to where it
really applies; this is in anticipation of being augmented or
replaced by Fortuna in the future. Fortuna is mentioned, but behind
#if logic, and is ignorable for now.
o The harvest queue is pulled out into its own modules.
o Entropy harvesting is emproved, both by being made more conservative,
and by separating (a bit!) the sources. Available entropy crumbs are
marginally improved.
o Selection of sources is made clearer. With recent revelations,
this will receive more work in the weeks and months to come.
Submitted by: Arthur Mesh (partly) <arthurmesh@gmail.com>
dev_ref() in the clone handlers that still use it.
- Don't set SI_CHEAPCLONE flag, it's not used anywhere neither in devfs
(for anything real)
Reviewed by: kib
IMAN register to clear the pending interrupt status bits. This patch
tries to solve problems seen on the MacBook Air, as reported by
Johannes Lundberg <johannes@brilliantservice.co.jp>
MFC after: 1 week
Our code does not consider yet the case of hash collisions. This
is a rather annoying situation where two or more files that
happen to have the same hash value will not appear accessible.
The situation is not difficult to work-around but given that things
will just work without enabling htree we will save possible
embarrassments for the next release.
Reported by: Kevin Lo
shutdown was reporetd via email. The crashes occurred because the
client side NLM would attempt to use its socket after it had been
destroyed. Looking at the code, it would soclose() once the reference
count on the socket handling structure went to 0. Unfortunately,
nlm_host_get_rpc() will simply allocate a new socket handling structure
when none exists and use the now soclose()d socket. Since there doesn't
seem to be a safe way to determine when the socket is no longer needed,
this patch modifies the code so that it never soclose()es the socket.
Since there is only one socket ever created, this does not introduce a
leak when the rpc.lockd is stopped/restarted. The patch also disables
unloading of the nfslockd module, since it is not safe to do so (and
has never been safe to do so, from what I can see).
Reported by: mav
Tested by: mav
MFC after: 2 weeks
IPI implmementations.
Submitted by: Roger Pau Monné
Sponsored by: Citrix Systems R&D
Submitted by: gibbs (misc cleanup, table driven config)
Reviewed by: gibbs
MFC after: 2 weeks
sys/amd64/include/cpufunc.h:
sys/amd64/amd64/pmap.c:
Move invltlb_globpcid() into cpufunc.h so that it can be
used by the Xen HVM version of tlb shootdown IPI handlers.
sys/x86/xen/xen_intr.c:
sys/xen/xen_intr.h:
Rename xen_intr_bind_ipi() to xen_intr_alloc_and_bind_ipi(),
and remove the ipi vector parameter. This api allocates
an event channel port that can be used for ipi services,
but knows nothing of the actual ipi for which that port
will be used. Removing the unused argument and cleaning
up the comments surrounding its declaration helps clarify
its actual role.
sys/amd64/amd64/mp_machdep.c:
sys/amd64/include/cpu.h:
sys/i386/i386/mp_machdep.c:
sys/i386/include/cpu.h:
Implement a generic framework for amd64 and i386 that allows
the implementation of certain CPU management functions to
be selected at runtime. Currently this is only used for
the ipi send function, which we optimize for Xen when running
on a Xen hypervisor, but can easily be expanded to support
more operations.
sys/x86/xen/hvm.c:
Implement Xen PV IPI handlers and operations, replacing native
send IPI.
sys/amd64/include/pcpu.h:
sys/i386/include/pcpu.h:
sys/i386/include/smp.h:
Remove NR_VIRQS and NR_IPIS from FreeBSD headers. NR_VIRQS
is defined already for us in the xen interface files.
NR_IPIS is only needed in one file per Xen platform and is
easily inferred by the IPI vector table that is defined in
those files.
sys/i386/xen/mp_machdep.c:
Restructure to more closely match the HVM implementation by
performing table driven IPI setup.
These were used to control/export dispatch policy but they're not anymore.
This commit cannot be MFC'ed to 9 because old netstat(9) binary relies
on such sysctl to work. On the other hand, there's no real reason to
keep'em around in 10.
To enable them, set WITH_GCC and WITH_GNUCXX in src.conf.
Make clang default to using libc++ on FreeBSD 10.
Bumped __FreeBSD_version for the change.
GCC is still enabled on PC98, because the PC98 bootloader requires GCC to build
(or, at least, hard-codes the use of gcc into its build).
Thanks to everyone who helped make the ports tree ready for this (and bapt
for coordinating them all). Also to imp for reviewing this and working on the
forward-porting of the changes in our gcc so that we're getting to a much
better place with regard to external toolchains.
Sorry to all of the people who helped who I forgot to mention by name.
Reviewed by: bapt, imp, dim, ...
pmap_is_modified() and pmap_is_referenced(), same as it was done for
pmap_ts_referenced().
Consolidate identical code for pmap_is_modified() and
pmap_is_referenced() into helper pmap_page_test_mappings().
Reviewed by: alc
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
add some packet(s) to tx ring and arge_stop() is called before receive the
sent packet interrupt from hardware. Fix arge_stop() to unload the in use
dma tags and free the associated mbuf.
PR: 178319, 163670
Approved by: adrian (mentor)
sf_buf_alloc()/sf_buf_free() inlines, to save two calls to an absolutely
empty functions.
Reviewed by: alc, kib, scottl
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
krpc client side UDP was observed as way out of range and
caused the rpc.lockd daemon to hang trying to do an RPC.
Inspection of the code found two places where the RPC request
is re-queued, but the value of cu_sent was not incremented.
Since cu_sent is always decremented when the RPC request is
dequeued, I think this could have caused cu_sent to go out of
range. This patch adds lines to increment cu_sent for these
two cases.
Reported by: dwhite@ixsystems.com
Discussed with: dwhite@ixsystems.com
MFC after: 2 weeks
making sure they are all misaligned at +8 bytes. This fixes clang builds
of powerpc64 kernels (aside from a required increase in KSTACK_PAGES which
will come later).
This commit from FreeBSD/powerpc64 with a clang-built kernel.
MFC after: 2 weeks
matches the types used when computing hash indices and the type of the
maximum size of mfchashtbl[].
PR: kern/181821
Submitted by: Sven-Thorsten Dietrich <sven@vyatta.com> (IPv4)
MFC after: 1 week
run. After that, the pager put method is called, usually translated
to VOP_WRITE(). For the filesystems which use buffer cache,
bufwrite() sbusies the buffer pages again, waiting for the xbusy state
to drain. The later is done in vfs_drain_busy_pages(), which is
called with the buffer pages already sbusied (by vm_pageout_flush()).
Since vfs_drain_busy_pages() can only wait for one page at the time,
and during the wait, the object lock is dropped, previous pages in the
buffer must be protected from other threads busying them. Up to the
moment, it was done by xbusying the pages, that is incompatible with
the sbusy state in the new implementation of busy. Switch to sbusy.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
VPB_BIT_WAITERS flag were changed between reading of busy_lock and the
cas. The vm_page_sbusy(), which is the only user of
vm_page_trysbusy() in the tree, panics on the failure, which in these
cases is transient and do not mean that the current page state
prevents sbusying.
Retry the operation inside vm_page_trysbusy() if cas failed, only
return a failure when VPB_BIT_SHARED is cleared.
Reported and tested by: pho
Reviewed by: attilio
Sponsored by: The FreeBSD Foundation
in the future in a backward compatible (API and ABI) way.
The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.
The structure definition looks like this:
struct cap_rights {
uint64_t cr_rights[CAP_RIGHTS_VERSION + 2];
};
The initial CAP_RIGHTS_VERSION is 0.
The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.
The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.
To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.
#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL)
We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:
#define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL)
#define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL)
#define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP)
There is new API to manage the new cap_rights_t structure:
cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
void cap_rights_set(cap_rights_t *rights, ...);
void cap_rights_clear(cap_rights_t *rights, ...);
bool cap_rights_is_set(const cap_rights_t *rights, ...);
bool cap_rights_is_valid(const cap_rights_t *rights);
void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);
Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:
cap_rights_t rights;
cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);
There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:
#define cap_rights_set(rights, ...) \
__cap_rights_set((rights), __VA_ARGS__, 0ULL)
void __cap_rights_set(cap_rights_t *rights, ...);
Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:
cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);
Providing several rights that belongs to the same array's element this way is
correct, but is not advised. It should only be used for aliases definition.
This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.
Sponsored by: The FreeBSD Foundation
Properly round-trip the "operation code" for client requests.
sys/dev/xen/blkback/blkback.c:
In xbb_dispatch_dev() when processing a flush request,
correctly set bio->bio_caller1 to the request list (not
bare request) for the operation, as is expected by the
completion handler xbb_bio_done().
In xbb_get_resources(), initialize "operation" in the
driver's internal request object from the client's "ring
request", so it is correct when used to populate the reply
when this operation completes.
Submitted by: Roger Pau Monné
Sponsored by: Citrix Systems R&D
Reviewed by: gibbs
- Restore the pre-PCID TLB shootdown handlers for whole address space
and single page invalidation asm code, and assign the IPI handler to
them when PCID is not supported or disabled. Old handlers have
linear control flow. But, still use the common return sequence.
- Stop using pcpu for INVPCID descriptors in the invlrg handler. It
is enough to allocate descriptors on the stack. As result, two
SWAPGS instructions are shaved off from the code for Haswell+.
- Fix the reverted condition in invlrng for checking of the PCID
support [1], also in invlrng check that pmap is kernel pmap before
performing other tests. For the kernel pmap, which provides global
mappings, the INVLPG must be used for invalidation always.
- Save the pre-computed pmap' %CR3 register in the struct pmap. This
allows to remove several checks for pm_pcid validity when %CR3 is
reloaded [2].
Noted by: gibbs [1]
Discussed with: alc [2]
Tested by: pho, flo
Sponsored by: The FreeBSD Foundation
is being shut down which were caused by the nfscbd_pool being
destroyed before the backchannel is disabled. This patch is
believed to fix the problem, by simply avoiding ever destroying
the nfscbd_pool. Since the NFS client module cannot be unloaded,
this should not cause a memory leak.
MFC after: 2 weeks
It turns out that synaptics_support was turned off by default
because its probing method is too intrusive not because it was unstable.
Once this is fixed it should be enabled once again.
Reported by: delphij, jkim
Rework the timeout code to use actual time rather than a DELAY() loop and
to use both typical and maximum to allow logging of timeout failures.
Also correct the erase timeout, it is specified in milliseconds not
microseconds like the other timeouts. Do not invoke DELAY() between
status queries as this adds significant latency which in turn reduced
write performance substantially.
Sanity check timeout values from the hardware.
Implement support for buffered writes (only enabled on Intel/Sharp parts
for now). This yields an order of magnitude speedup on the 64MB Intel
StrataFlash parts we use.
When making a copy of the block to modify, also keep a clean copy around
until we are ready to commit the block and use it to avoid unnecessary
erases. In the non-buffer write case, also use it to avoid
unnecessary writes when the block has not been erased. This yields a
significant speedup when doing things like zeroing a block.
Sponsored by: DARPA, AFRL
Reviewed by: imp (previous version)
set to 15 to indicate that the peer did not send a window scale option
with its SYN. Do not send a window scale option in the SYN|ACK reply
in that case.
performance... Use SSE2 instructions for calculating the XTS tweek
factor... Let the compiler do more work and handle register allocation
by using intrinsics, now only the key schedule is in assembly...
Replace .byte hard coded instructions w/ the proper instructions now
that both clang and gcc support them...
On my machine, pulling the code to userland I saw performance go from
~150MB/sec to 2GB/sec in XTS mode. GELI on GNOP saw a more modest
increase of about 3x due to other system overhead (geom and
opencrypto)...
These changes allow almost full disk io rate w/ geli...
Reviewed by: -current, -security
Thanks to: Mike Hamburg for the XTS tweek algorithm
Initialize the request id for requests in xbb_get_resources()
instead of its previous location in xbb_dispatch_io(). This
guarantees that all request types (e.g. BLKIF_OP_FLUSH_DISKCACHE)
have the front-end specified id recorded.
Submitted by: Roger Pau Monné
Sponsored by: Citrix Systems R&D
* Remove non working code related to SHA224.
* Remove support for non-standardised HMAC-IDs using SHA384 and SHA512.
* Prefer SHA256 over SHA1.
* Minor cleanup.
MFC after: 2 weeks
No functional changes.
sys/i386/xen/mp_machdep.c:
Remove extra newlines.
Group externs, forward delarations, local types, and pcpu data.
Wrap at 80 columns.
Use parens in return statements.
Tab indent members of array initializers.
MFC after: 2 weeks
always wait for provider close. Old algorithm was reported to cause NULL
dereference panic on attempt to close provider after softc destruction.
If not global workaroung in GEOM, that could even cause destruction with
requests still in flight.
date: 2010/02/04 14:10:12; author: sthen; state: Exp; lines: +24 -19;
pf_get_sport() picks a random port from the port range specified in a
nat rule. It should check to see if it's in-use (i.e. matches an existing
PF state), if it is, it cycles sequentially through other ports until
it finds a free one. However the check was being done with the state
keys the wrong way round so it was never actually finding the state
to be in-use.
- switch the keys to correct this, avoiding random state collisions
with nat. Fixes PR 6300 and problems reported by robert@ and viq.
- check pf_get_sport() return code in pf_test(); if port allocation
fails the packet should be dropped rather than sent out untranslated.
Help/ok claudio@.
Some additional changes to 1.12:
- We also need to bzero() the key to zero padding, otherwise key
won't match.
- Collapse two if blocks into one with ||, since both conditions
lead to the same processing.
- Only naddr changes in the cycle, so move initialization of other
fields above the cycle.
- s/u_intXX_t/uintXX_t/g
PR: kern/181690
Submitted by: Olivier Cochard-Labbé <olivier cochard.me>
Sponsored by: Nginx, Inc.
sys/x86/xen/hvm.c:
Do not rely on implicit conversion to boolean in expressions
(e.g. use "if (rc != 0)" instead of "if (rc)".
Line continuations for functions are indented an additional
4 spaces.
Insert an empty line if the function has no local variables.
Prefer separate initializtion statements to initialzing
local variables in their declaration.
Braces that are not necessary may be left out.
MFC after: 2 weeks
of relying on the tvtohz() workaround. The latter has been introduced
lately by jhb@ (r254699) in order to have a fix that can be backported
to STABLE.
Reported by: Vitja Makarov <vitja.makarov at gmail dot com>
Reviewed by: jhb (earlier version)
waiting for an RPC reply from the server while holding the mount
point busy (mnt_lockref incremented). This happens because dounmount()
msleep()s waiting for mnt_lockref to become 0, before calling
VFS_UNMOUNT(). This patch adds a new VFS operation called VFS_PURGE(),
which the NFS client implements as purging RPCs in progress. Making
this call before checking mnt_lockref fixes the problem, by ensuring
that the VOP_xxx() calls will fail and unbusy the mount point.
Reported by: sbruno
Reviewed by: kib
MFC after: 2 weeks
bintime_* related functions. This commit completes what was already done
by theraven@ for bintime_shift, and just uses a single underscore instead
of two (which is a style bug according to Bruce). See r251855 for reference.
Reported by: theraven
Discussed with: bde
Reviewed by: bde
functional state. While CTL is much more superior target from all points,
there is no reason why this code should not work.
Tested with ahc(4) as target side HBA.
MFC after: 2 weeks
This is a significant rewrite of much of the previous driver; lots of
misc. cleanup was also performed, and support for a few other minor
features was also added.
Partial support for the EVENT_IDX feature was added a while ago,
but this commit adds an interface for the device driver to hint
how long (in terms of descriptors) the next interrupt should be
delayed.
The first user of this will be used to reduce VirtIO net's Tx
completion interrupts.
The consequence of the bug is that fini calls are not done
when a slab is freed by a call-back from the page daemon.
It went unnoticed for two months because fini is little used.
I spotted the bug while reading the code to learn how it works
so I could write it up for the next edition of the Design and
Implementation of FreeBSD book.
No MFC needed as this code exists only in HEAD.
Reviewed by: kib, jeff
Tested by: pho
for ARM.
This is quite ugly, because it has to work around a clang bug that does not
allow built-in functions to be defined, even when they're ones that are
expected to be built as part of a library.
Reviewed by: ed
This is a nice small outdoor/indoor AP from Ubiquity Networks.
The device has:
AR7241 CPU SoC
AR9287 Wifi
8MB flash
32MB RAM
wifi has been tested to work along with leds.
Submitted by: loos
Approved by: sbruno (mentor, implicit)
Tested by: hiren
priority. If the write is requested by a system daemon, sleeping
there would starve resources and cause deadlock.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
A warning is emitted again if the temperature became briefly valid
meanwhile. This avoids spamming the user when the sensor is broken.
Other values (ie. not _TMP) always raise a warning.
is not giving us a 100% success rate. Bump the delay to 200ms as
that seems to do the trick.
Note that during testing the delay was added to uart_bus_attach()
in uart_core.c. While having the delay in a different place can
change the behaviour, it was not expected. Having to bump the
delay with another 50ms could therefore be an indication that
the problem can not be solved with delays.
Reported by: kevlo@
Tested by: kevlo@
Intel CPUs. The feature tags TLB entries with the Id of the address
space and allows to avoid TLB invalidation on the context switch, it
is available only in the long mode. In the microbenchmarks, using the
PCID decreased latency of the context switches by ~30% on SandyBridge
class desktop CPUs, measured with the lat_ctx program from lmbench.
If available, use INVPCID instruction when a TLB entry in non-current
address space needs to be invalidated. The instruction is typically
available on the Haswell.
If needed, the use of PCID can be turned off with the
vm.pmap.pcid_enabled loader tunable set to 0. The state of the
feature is reported by the vm.pmap.pcid_enabled sysctl. The sysctl
vm.pmap.pcid_save_cnt reports the number of context switches which
avoided invalidating the TLB; compare with the total number of context
switches, available as sysctl vm.stats.sys.v_swtch.
Sponsored by: The FreeBSD Foundation
Reviewed by: alc
Tested by: pho, bf
- Allow the Rx/Tx queue sizes to be configured by tunables
- Bail out earlier if the Tx queue unlikely has enough free
descriptors to hold the frame
- Cleanup some of the offloading capabilities handling
value. Some hosts do not publish "extended" disk IDs via virtual-device in
an attempt to avoid confusing old blkfront drivers, and without this change
we failed to attach such disks.
In particular, this commit allows all 24 ephemeral disks on EC2 hs1.8xlarge
instances to be used, instead of only the first 15.
MFC after: 3 days
cards.
This is a T4 and T5 chip feature which lets the chip deliver multiple
Ethernet frames in a single buffer. This is more efficient within the
chip, in the driver, and reduces wastage of space in rx buffers.
- Always allocate rx buffers from the jumbop zone, no matter what the
MTU is. Do not use the normal cluster refcounting mechanism.
- Reserve space for an mbuf and a refcount in the cluster itself and let
the chip DMA multiple frames in the rest.
- Use the embedded mbuf for the first frame and allocate mbufs on the
fly for any additional frames delivered in the cluster. Each of these
mbufs has a reference on the underlying cluster.
refcount. This one is willing to work with buffers that may already be
referenced. MEXTADD/m_extadd are suitable only for the first attachment
to a cluster -- they initialize the refcount to 1.
Use this new driver for both PV and HVM instances.
This driver requires a Xen hypervisor that supports vector callbacks,
VCPUOP hypercalls, and reports that it has a "safe PV clock".
New timer driver:
Submitted by: will
Sponsored by: Spectra Logic Corporation
PV port to new driver, and bug fixes:
Submitted by: Roger Pau Monné
Sponsored by: Citrix Systems R&D
sys/dev/xen/timer/timer.c:
- Register a PV timer device driver which (currently)
implements device_{identify,probe,attach} and stubs
device_detach. The detach routine requires functionality
not provided by timecounters(4). The suspend and resume
routines need additional work (due to Xen requiring that
the hypercalls be executed on the target VCPU), and aren't
needed for our purposes.
- Make sure there can only be one device instance of this
driver, and that it only registers one eventtimers(4) and
one timecounters(4) device interface. Make both interfaces
use PCPU data as needed.
- Match, with a few style cleanups & API differences, the
Xen versions of the "fetch time" functions.
- Document the magic scale_delta() better for the i386 version.
- When registering the event timer, bind a separate event
channel for the timer VIRQ to the device's event timer
interrupt handler for each active VCPU. Describe each
interrupt as "xen_et:c%d", so they can be identified per
CPU in "vmstat -i" or "show intrcnt" in KDB.
- When scheduling a timer into the hypervisor, try up to
60 times if the hypervisor rejects the time as being in
the past. In the common case, this retry shouldn't happen,
and if it does, it should only happen once. This is
because the event timer advertises a minimum period of
100usec, which is only less than the usual hypercall round
trip time about 1 out of every 100 tries. (Unlike other
similar drivers, this one actually checks whether the
hypervisor accepted the singleshot timer set hypercall.)
- Implement a RTC PV clock based on the hypervisor wallclock.
sys/conf/files:
- Add dev/xen/timer/timer.c if the kernel configuration
includes either the XEN or XENHVM options.
sys/conf/files.i386:
sys/i386/include/xen/xen_clock_util.h:
sys/i386/xen/clock.c:
sys/i386/xen/xen_clock_util.c:
sys/i386/xen/mp_machdep.c:
sys/i386/xen/xen_rtc.c:
- Remove previous PV timer used in i386 XEN PV kernels, the
new timer introduced in this change is used instead (so
we share the same code between PVHVM and PV).
MFC after: 2 weeks
to 15 minutes, and 5 minutes for things like READ ELEMENT STATUS.
This is needed to account for the worst case scenarios on at least
some Spectra Logic tape libraries.
Sponsored by: Spectra Logic
MFC after: 3 days
Re-structure Xen HVM support so that:
- Xen is detected and hypercalls can be performed very
early in system startup.
- Xen interrupt services are implemented using FreeBSD's native
interrupt delivery infrastructure.
- the Xen interrupt service implementation is shared between PV
and HVM guests.
- Xen interrupt handlers can optionally use a filter handler
in order to avoid the overhead of dispatch to an interrupt
thread.
- interrupt load can be distributed among all available CPUs.
- the overhead of accessing the emulated local and I/O apics
on HVM is removed for event channel port events.
- a similar optimization can eventually, and fairly easily,
be used to optimize MSI.
Early Xen detection, HVM refactoring, PVHVM interrupt infrastructure,
and misc Xen cleanups:
Sponsored by: Spectra Logic Corporation
Unification of PV & HVM interrupt infrastructure, bug fixes,
and misc Xen cleanups:
Submitted by: Roger Pau Monné
Sponsored by: Citrix Systems R&D
sys/x86/x86/local_apic.c:
sys/amd64/include/apicvar.h:
sys/i386/include/apicvar.h:
sys/amd64/amd64/apic_vector.S:
sys/i386/i386/apic_vector.s:
sys/amd64/amd64/machdep.c:
sys/i386/i386/machdep.c:
sys/i386/xen/exception.s:
sys/x86/include/segments.h:
Reserve IDT vector 0x93 for the Xen event channel upcall
interrupt handler. On Hypervisors that support the direct
vector callback feature, we can request that this vector be
called directly by an injected HVM interrupt event, instead
of a simulated PCI interrupt on the Xen platform PCI device.
This avoids all of the overhead of dealing with the emulated
I/O APIC and local APIC. It also means that the Hypervisor
can inject these events on any CPU, allowing upcalls for
different ports to be handled in parallel.
sys/amd64/amd64/mp_machdep.c:
sys/i386/i386/mp_machdep.c:
Map Xen per-vcpu area during AP startup.
sys/amd64/include/intr_machdep.h:
sys/i386/include/intr_machdep.h:
Increase the FreeBSD IRQ vector table to include space
for event channel interrupt sources.
sys/amd64/include/pcpu.h:
sys/i386/include/pcpu.h:
Remove Xen HVM per-cpu variable data. These fields are now
allocated via the dynamic per-cpu scheme. See xen_intr.c
for details.
sys/amd64/include/xen/hypercall.h:
sys/dev/xen/blkback/blkback.c:
sys/i386/include/xen/xenvar.h:
sys/i386/xen/clock.c:
sys/i386/xen/xen_machdep.c:
sys/xen/gnttab.c:
Prefer FreeBSD primatives to Linux ones in Xen support code.
sys/amd64/include/xen/xen-os.h:
sys/i386/include/xen/xen-os.h:
sys/xen/xen-os.h:
sys/dev/xen/balloon/balloon.c:
sys/dev/xen/blkback/blkback.c:
sys/dev/xen/blkfront/blkfront.c:
sys/dev/xen/console/xencons_ring.c:
sys/dev/xen/control/control.c:
sys/dev/xen/netback/netback.c:
sys/dev/xen/netfront/netfront.c:
sys/dev/xen/xenpci/xenpci.c:
sys/i386/i386/machdep.c:
sys/i386/include/pmap.h:
sys/i386/include/xen/xenfunc.h:
sys/i386/isa/npx.c:
sys/i386/xen/clock.c:
sys/i386/xen/mp_machdep.c:
sys/i386/xen/mptable.c:
sys/i386/xen/xen_clock_util.c:
sys/i386/xen/xen_machdep.c:
sys/i386/xen/xen_rtc.c:
sys/xen/evtchn/evtchn_dev.c:
sys/xen/features.c:
sys/xen/gnttab.c:
sys/xen/gnttab.h:
sys/xen/hvm.h:
sys/xen/xenbus/xenbus.c:
sys/xen/xenbus/xenbus_if.m:
sys/xen/xenbus/xenbusb_front.c:
sys/xen/xenbus/xenbusvar.h:
sys/xen/xenstore/xenstore.c:
sys/xen/xenstore/xenstore_dev.c:
sys/xen/xenstore/xenstorevar.h:
Pull common Xen OS support functions/settings into xen/xen-os.h.
sys/amd64/include/xen/xen-os.h:
sys/i386/include/xen/xen-os.h:
sys/xen/xen-os.h:
Remove constants, macros, and functions unused in FreeBSD's Xen
support.
sys/xen/xen-os.h:
sys/i386/xen/xen_machdep.c:
sys/x86/xen/hvm.c:
Introduce new functions xen_domain(), xen_pv_domain(), and
xen_hvm_domain(). These are used in favor of #ifdefs so that
FreeBSD can dynamically detect and adapt to the presence of
a hypervisor. The goal is to have an HVM optimized GENERIC,
but more is necessary before this is possible.
sys/amd64/amd64/machdep.c:
sys/dev/xen/xenpci/xenpcivar.h:
sys/dev/xen/xenpci/xenpci.c:
sys/x86/xen/hvm.c:
sys/sys/kernel.h:
Refactor magic ioport, Hypercall table and Hypervisor shared
information page setup, and move it to a dedicated HVM support
module.
HVM mode initialization is now triggered during the
SI_SUB_HYPERVISOR phase of system startup. This currently
occurs just after the kernel VM is fully setup which is
just enough infrastructure to allow the hypercall table
and shared info page to be properly mapped.
sys/xen/hvm.h:
sys/x86/xen/hvm.c:
Add definitions and a method for configuring Hypervisor event
delievery via a direct vector callback.
sys/amd64/include/xen/xen-os.h:
sys/x86/xen/hvm.c:
sys/conf/files:
sys/conf/files.amd64:
sys/conf/files.i386:
Adjust kernel build to reflect the refactoring of early
Xen startup code and Xen interrupt services.
sys/dev/xen/blkback/blkback.c:
sys/dev/xen/blkfront/blkfront.c:
sys/dev/xen/blkfront/block.h:
sys/dev/xen/control/control.c:
sys/dev/xen/evtchn/evtchn_dev.c:
sys/dev/xen/netback/netback.c:
sys/dev/xen/netfront/netfront.c:
sys/xen/xenstore/xenstore.c:
sys/xen/evtchn/evtchn_dev.c:
sys/dev/xen/console/console.c:
sys/dev/xen/console/xencons_ring.c
Adjust drivers to use new xen_intr_*() API.
sys/dev/xen/blkback/blkback.c:
Since blkback defers all event handling to a taskqueue,
convert this task queue to a "fast" taskqueue, and schedule
it via an interrupt filter. This avoids an unnecessary
ithread context switch.
sys/xen/xenstore/xenstore.c:
The xenstore driver is MPSAFE. Indicate as much when
registering its interrupt handler.
sys/xen/xenbus/xenbus.c:
sys/xen/xenbus/xenbusvar.h:
Remove unused event channel APIs.
sys/xen/evtchn.h:
Remove all kernel Xen interrupt service API definitions
from this file. It is now only used for structure and
ioctl definitions related to the event channel userland
device driver.
Update the definitions in this file to match those from
NetBSD. Implementing this interface will be necessary for
Dom0 support.
sys/xen/evtchn/evtchnvar.h:
Add a header file for implemenation internal APIs related
to managing event channels event delivery. This is used
to allow, for example, the event channel userland device
driver to access low-level routines that typical kernel
consumers of event channel services should never access.
sys/xen/interface/event_channel.h:
sys/xen/xen_intr.h:
Standardize on the evtchn_port_t type for referring to
an event channel port id. In order to prevent low-level
event channel APIs from leaking to kernel consumers who
should not have access to this data, the type is defined
twice: Once in the Xen provided event_channel.h, and again
in xen/xen_intr.h. The double declaration is protected by
__XEN_EVTCHN_PORT_DEFINED__ to ensure it is never declared
twice within a given compilation unit.
sys/xen/xen_intr.h:
sys/xen/evtchn/evtchn.c:
sys/x86/xen/xen_intr.c:
sys/dev/xen/xenpci/evtchn.c:
sys/dev/xen/xenpci/xenpcivar.h:
New implementation of Xen interrupt services. This is
similar in many respects to the i386 PV implementation with
the exception that events for bound to event channel ports
(i.e. not IPI, virtual IRQ, or physical IRQ) are further
optimized to avoid mask/unmask operations that aren't
necessary for these edge triggered events.
Stubs exist for supporting physical IRQ binding, but will
need additional work before this implementation can be
fully shared between PV and HVM.
sys/amd64/amd64/mp_machdep.c:
sys/i386/i386/mp_machdep.c:
sys/i386/xen/mp_machdep.c
sys/x86/xen/hvm.c:
Add support for placing vcpu_info into an arbritary memory
page instead of using HYPERVISOR_shared_info->vcpu_info.
This allows the creation of domains with more than 32 vcpus.
sys/i386/i386/machdep.c:
sys/i386/xen/clock.c:
sys/i386/xen/xen_machdep.c:
sys/i386/xen/exception.s:
Add support for new event channle implementation.
- Relax atomic_read() and atomic_set() macros. Linux does not require any
memory barrier. Also, these macros may be even reordered or optimized away
according to the API documentation:
https://www.kernel.org/doc/Documentation/atomic_ops.txt
We've been seeing lots of cache line contention (but not lock contention!)
in our workloads between the various TX and RX threads going on.
The write lock is only grabbed when configuration changes are made - which
are infrequent.
With this patch, the contention and cycles spent waiting for updates
disappear.
Sponsored by: Netflix, Inc.
- Remove excessive parenthesis
- Use KNF continuation indentation
- Cut down on excessive continuation lines
- More consistent style in messages
- Use uprintf() instead of printf()
Submitted by: bde
calls ns8250_bus_ipend() almost immediately after ns8250_bus_attach().
As it appears, a line break condition is being signalled for almost
all received characters due to this. A delay of 150ms seems enough
to allow the H/W to settle and to avoid the problem.
More analysis is needed, but for now a regression has been addressed.
Reported by: kevlo@
Tested by: kevlo@
MADV_DONTNEED) and madvise(..., MADV_FREE). Specifically, introduce a new
pmap function, pmap_advise(), that operates on a range of virtual addresses
within the specified pmap, allowing for a more efficient implementation of
MADV_DONTNEED and MADV_FREE. Previously, the implementation of
MADV_DONTNEED and MADV_FREE relied on per-page pmap operations, such as
pmap_clear_reference(). Intuitively, the problem with this implementation
is that the pmap-level locks are acquired and released and the page table
traversed repeatedly, once for each resident page in the range
that was specified to madvise(2). A more subtle flaw with the previous
implementation is that pmap_clear_reference() would clear the reference bit
on all mappings to the specified page, not just the mapping in the range
specified to madvise(2).
Since our malloc(3) makes heavy use of madvise(2), this change can have a
measureable impact. For example, the system time for completing a parallel
"buildworld" on a 6-core amd64 machine was reduced by about 1.5% to 2.0%.
Note: This change only contains pmap_advise() implementations for a subset
of our supported architectures. I will commit implementations for the
remaining architectures after further testing. For now, a stub function is
sufficient because of the advisory nature of pmap_advise().
Discussed with: jeff, jhb, kib
Tested by: pho (i386), marcel (ia64)
Sponsored by: EMC / Isilon Storage Division
(re)start the interface when it is down. This change fix a race with
BOOTP where the response packet is lost because the interface is being
reset by a netmask change right after send the packet.
PR: 178318
Approved by: adrian (mentor)
Add a last-modified timestamp to each LRO entry and provide an interface
to flush all inactive entries. Drivers decide when to flush and what
the inactivity threshold should be.
Network drivers that process an rx queue to completion can enter a
livelock type situation when the rate at which packets are received
reaches equilibrium with the rate at which the rx thread is processing
them. When this happens the final LRO flush (normally when the rx
routine is done) does not occur. Pure ACKs and segments with total
payload < 64K can get stuck in an LRO entry. Symptoms are that TCP
tx-mostly connections' performance falls off a cliff during heavy,
unrelated rx on the interface.
Flushing only inactive LRO entries works better than any of these
alternates that I tried:
- don't LRO pure ACKs
- flush _all_ LRO entries periodically (every 'x' microseconds or every
'y' descriptors)
- stop rx processing in the driver periodically and schedule remaining
work for later.
Reviewed by: andre
UF_SYSTEM, UF_SPARSE, UF_OFFLINE, UF_REPARSE, UF_ARCHIVE, UF_READONLY,
and UF_HIDDEN.
Sort the file flags tmpfs supports alphabetically. tmpfs now
supports the same flags as UFS, with the exception of SF_SNAPSHOT.
Reported by: bdrewery, antoine
Sponsored by: Spectra Logic
- tom_uninit had to be reworked not to hold the adapter lock (a mutex)
around t4_deactivate_uld, which acquires the uld_list_lock.
- the ifc_match for the interface cloner that creates the tracer ifnet
had to be reworked as the kernel calls ifc_match with the global
if_cloners_mtx held.
allocations under low free-space conditions (-r254995), determine
that old block-preference search order used before -r249782 worked
a bit better. This change reverts to that block-preference search order.
MFC after: 2 weeks
I have 25TB Dell PERC 6 RAID5 array. When it becomes almost
full (10-20GB free), processes which write data to it start
eating 100% CPU and write speed drops below 1MB/sec (normally
to gives 400MB/sec). The revision at which it first became
apparent was http://svnweb.freebsd.org/changeset/base/249782.
The offending change reserved an area in each cylinder group to
store metadata. The new algorithm attempts to save this area for
metadata and allows its use for non-metadata only after all the
data areas have been exhausted. The size of the reserved area
defaults to half of minfree, so the filesystem reports full before
the data area can completely fill. However, in this report, the
filesystem has had minfree reduced to 1% thus forcing the metadata
area to be used for data. As the filesystem approached full, it
had only metadata areas left to allocate. The result was that
every block allocation had to scan summary data for 30,000 cylinder
groups before falling back to searching up to 30,000 metadata areas.
The fix is to give up on saving the metadata areas once the free
space reserve drops below 2%. The effect of this change is to use
the old algorithm of just accepting the first available block that
we find. Since most filesystems use the default 5% minfree, this
will have no effect on their operation. For those that want to push
to the limit, they will get their crappy block placements quickly.
Submitted by: Dmitry Sivachenko
Fix Tested by: Dmitry Sivachenko
PR: kern/181226
MFC after: 2 weeks
the passed vnode belongs to the same mount point (v_vfsp or also
known as v_mount in FreeBSD). This check prevents the code from
proceeding further on vnodes that do not belong to ZFS, for
instance, on UFS or NULLFS.
The recent change (merged as r254585) on upstream changes the
check of v_vfsp to instead check the znode's z_zfsvfs. On Illumos
this would work because when the vnode comes from lofs, the
VOP_REALVP() would give the right vnode, this is not true on
FreeBSD where our VOP_REALVP is a no-op, and as such tdvp is
not guaranteed to be a ZFS vnode, and will later trigger a
failed assertion when verifying the vnode.
This changeset modifies our local shims (zfs_freebsd_rename and
zfs_freebsd_link) to check if v_mount matches before proceeding
further.
Reported by: many
Diagnostic work by: avg
problems with the way MLEN, MHLEN, and struct mbuf are set up.
CTASSERT's are provided to detect such issues at compile time in the
future.
The #define MLEN and MHLEN calculation do not take actual compiler-
induced alignment and padding inside the complete struct mbuf into
account. Accordingly appropriate attention is required when changing
members of struct mbuf.
Ideally one would calculate MLEN as (MSIZE - sizeof(((struct mbuf *)0)->m_hdr)
but that doesn't work as the compiler refuses to operate on an as of
yet incomplete structure.
In particular ARM 32bit has more strict alignment requirements which
caused 4 bytes of padding between m_hdr and pkthdr in struct mbuf
because of the 64bit members in pkthdr. This wasn't picked up by MLEN
and MHLEN causing an overflow of the mbuf provided data storage by
overestimating its size.
I386 didn't show this problem because it handles unaligned access just
fine, albeit at a small performance penalty.
On 64bit architectures the struct mbuf layout is 64bit aligned in all
places.
Reported by: Thomas Skibo <ThomasSkibo-at-sbcglobal-dot-net>
Tested by: tuexen, ian, Thomas Skibo (extended patch)
Sponsored by: The FreeBSD Foundation
notify (enable spinup) required", instead of doing the normal
retries, poll for a change in status.
We will poll every half second for a minute for the status to
change.
Hitachi drives (and likely other SAS drives) return that ASC/ASCQ
when they are waiting to spin up. What it means is that they are
waiting for the SAS expander to send them the SAS
NOTIFY (ENABLE SPINUP) primitive.
That primitive is the mechanism expanders/enclosures use to
sequence drive spinup to avoid overloading power supplies.
Sponsored by: Spectra Logic
MFC after: 3 days
The aim of this function is to eventually be the completion entry point
for all 802.11 encapsulated mbufs. All the wifi drivers end up doing
what is in this function so it's an easy win to turn it into a net80211
method and abstract out this code.
Ideally the drivers will all eventually be modified to queue up completed
mbufs and call this function with all the driver locks not held.
This will allow for some much more interesting software queue handling
in the future (like net80211 based A-MSDU, fast-frames, A-MPDU aggregation
and retransmission.)
Tested:
* ath(4), iwn(4)
- Use queue size fields from the Tx/Rx queues in various places
instead of (currently the same values) from the softc.
- Fix potential crash in detach if the attached failed to alloc
queue memory.
- Move the VMXNET3_MAX_RX_SEGS define to a better spot.
- Tweak frame size calculation w.r.t. ETHER_ALIGN. This could be
tweaked some more, or removed since it probably doesn't matter
much for x86 (and the x86 class of machines this driver will
be used on).
- Route PCI interrupt for NIC
- Make "no mapping" warning more user-friendly: add device name and mention
that it's IRQ mapping
- Do not overlap ICUs' IO window with PCI devices' IO windows by starting
IO rman at offset 0x100
for the available pbuf when passed vnode is backing md(4). Other i/o
directed to the same md device might already hold pbufs, and then we
could deadlock since only our progress can free a pbuf needed for
wakeup.
Obtained from: projects/vm6
Reminded and tested by: pho
MFC after: 1 week
ps(1) utility, e.g. "ps -O fib".
bin/ps/keyword.c:
Add the "fib" keyword and default its column name to "FIB".
bin/ps/ps.1:
Add "fib" as a supported keyword.
sys/compat/freebsd32/freebsd32.h:
sys/kern/kern_proc.c:
sys/sys/user.h:
Add the default fib number for a process (p->p_fibnum)
to the user land accessible process data of struct kinfo_proc.
Submitted by: Oliver Fromme <olli@fromme.com>, gibbs
and add support for default underride to $loader_version, acting as a way to
name a release. Release text is not displayed for the aforementioned feature
of alternate display layout (introduced in r254237); however, for all other
layouts (incl. default), the release name is displayed at lower-right.
See version.4th(8) for additional information and/or historical details.
NOTE: Also a minor edit to version.4th(8) while we're here.
It is needed for fdread(1) in order to be able to recover from CRC
errors in the data field of a floppy sector (by returning the sector
data that failed CRC, rather than inventing dummy data).
When closing the device, clear all transient device options.
MFC after: 1 week
1) Clean up namespace; only use "Yarrow" where it is Yarrow-specific
or close enough to the Yarrow algorithm. For the rest use a neutral
name.
2) Tidy up headers; put private stuff in private places. More could
be done here.
3) Streamline the hashing/encryption; no need for a 256-bit counter;
128 bits will last for long enough.
There are bits of debug code lying around; these will be removed
at a later stage.
Promoting base pages to superpages can increase TLB coverage and allow for
efficient use of page table entries. This development provides FreeBSD/ARM
with superpages management mechanism roughly equivalent to what we have for
i386 and amd64 architectures.
1. Add mechanism for automatic promotion of 4KB page mappings to 1MB section
mappings (and demotion when not needed, respectively).
2. Managed and non-kernel mappings are now superpages-aware.
3. The functionality can be enabled by setting "vm.pmap.sp_enabled" tunable to
a non-zero value (either in loader.conf or by modifying "sp_enabled"
variable in pmap-v6.c file). By default, automatic promotion is currently
disabled.
Submitted by: Zbigniew Bodek <zbb@semihalf.com>
Reviewed by: alc
Sponsored by: The FreeBSD Foundation, Semihalf
This allows for enabling and configuring superpages reservation mechanism in
order to allocate and populate 256 4KB base pages (for the purpose of
promotion to a 1MB superpage).
Submitted by: Zbigniew Bodek <zbb@semihalf.com>
Reviewed by: alc
Sponsored by: The FreeBSD Foundation, Semihalf