confusing.
Note there is still no information about 'partcode' being written to disk
(gpart bootcode -p <partcode> <disk>).
Maybe in the future all the messages printed by gpart(8) on success could be
hidden under -v?
PR: bin/150239
Reported by: Roddi <roddi@me.com>
Submitted by: arundel
MFC after: 2 weeks
Fix possible loss of correct error return code in ZFS mount
OpenSolaris revisions and Bug IDs:
11824:53128e5db7cf
6863610 ZFS mount can lose correct error return
12079:13822b941977
6939941 problem with moving files in zfs (142901-12)
Approved by: delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6863610, 6939941)
MFC after: 3 days
- make dflt_lock() always panic,
- add kludge to use contigmalloc() when the alignment is larger than the size
and print a diagnostic when we didn't satisfy the alignment.
unlikely that support for these ever will be implemented on sparc64 as
the IOMMUs are able to translate to up to the maximum physical address
supported by the respective machine, bypassing the IOMMU is affected
by hardware errata and being able to support DMA engines which cannot
do at least 32-bit DMA does not justify the costs.
- The page zeroing in uma_small_alloc() may use the VIS-based block zero
function so take advantage of it.
- D_TRACKCLOSE may be used there as d_close() are expected to match up
d_open() calls
- Replace the hand-crafted counter and flag with the
device_busy()/device_unbusy() proper usage.
Sponsored by: Sandvine Incorporated
Reported by: Mark Johnston <mjohnston at sandvine dot com>
Tested by: Mark Johnston
Reviewed by: emaste
MFC after: 10 days
devfs_delete() now recursively removes empty parent directories unless
the DEVFS_DEL_NORECURSE flag is specified. devfs_delete() can't be
called anymore with a parent directory vnode lock held because the
possible parent directory deletion needs to lock the vnode. Thus we
unlock the parent directory vnode in devfs_remove() before calling
devfs_delete().
Call devfs_populate_vp() from devfs_symlink() and devfs_vptocnp() as now
directories can get removed.
Add a check for DE_DOOMED flag to devfs_populate_vp() because
devfs_delete() drops dm_lock before the VI_DOOMED vnode flag gets set.
This ensures that devfs_populate_vp() returns an error for directories
which are in progress of deletion.
Reviewed by: kib
Discussed on: freebsd-current (mostly silence)
This mirrors code in tmpfs.
This changge shouldn't affect much read path, it may cause unnecessary
vm_page_lookup calls in the case where v_object has no active or inactive
pages but has some cache pages. I believe this situation to be non-essential.
In write path this change should allow us to properly detect the above
case and free a cache page when we write to a range that corresponds to it.
If this situation is undetected then we could have a discrepancy between
data in page cache and in ARC or on disk.
This change allows us to re-enable vn_has_cached_data() check in zfs_write.
NOTE: strictly speaking resident_page_count and cache fields of v_object
should be exmined under VM_OBJECT_LOCK, but for this particular usage
we may get away with it.
Discussed with: alc, kib
Approved by: pjd
Tested with: tools/regression/fsx
MFC after: 3 weeks
artificial power-of-2 rounded number to their real values specified
in RFC879 and RFC2460.
From the history and existing comments it appears that the rounded
numbers were intended to be advantageous for the kernel and mbuf
system. However this hasn't been the case at for at least a long
time. The mbuf clusters used in tcp_output() have enough space
to hold the larger real value for the default MSS for both IPv4 and
IPv6. Note that the default MSS is only used when path MTU discovery
is disabled.
Update and expand related comments.
Reviewed by: lsteward (including some word-smithing)
MFC after: 2 weeks
Otherwise, adding insult to injury, in addition to double-caching of data
we would always copy the data into a vnode's vm object page from backend.
This is specific to sendfile case only (VOP_READ with UIO_NOCOPY).
PR: kern/141305
Reported by: Wiktor Niesiobedzki <bsd@vink.pl>
Reviewed by: alc
Tested by: tools/regression/sockets/sendfile
MFC after: 2 weeks
The workaround was incorrectly documented as having something to do with
set_pcpu section's progbits, but in fact it was for incorrect placement
of __start_set_pcpu because of the bug in ld.
The bug was fixed in r210245, see commit message for details.
A side-effect of the workaround was that a zero-size set_pcpu section was
produced for modules, source code of which included pcpu.h but didn't
actually define any dynamic per-cpu variables.
This commit should remove the side-effect.
The same workaround is present sys/net/vnet.h, has an analogous side-effect
and can be removed as well.
An UPDATING entry that warns about a need for recent ld is following.
MFC after: 1 month
include/mmuvar.h - Change the MMU_DEF macro to also create the class
definition as well as define the DATA_SET. Add a macro, MMU_DEF_INHERIT,
which has an extra parameter specifying the MMU class to inherit methods
from. Update the comments at the start of the header file to describe the
new macros.
booke/pmap.c
aim/mmu_oea.c
aim/mmu_oea64.c - Collapse mmu_def_t declaration into updated MMU_DEF macro
The MMU_DEF_INHERIT macro will be used in the PS3 MMU implementation to
allow it to inherit the stock powerpc64 MMU methods.
Reviewed by: nwhitehorn
of small maxsize and "large" (including BUS_SPACE_UNRESTRICTED) nsegments
parameters. Generally using a presz of 0 (which indeed might indicate the
use of bogus parameters for DMA tag creation) is not fatal, it just means
that no additional DVMA space will be preallocated.
to handle current timecounter wraps. Make kern_clocksource.c to honor that
requirement, scheduling sleeps on first CPU for no more then specified
period. Allow other CPUs to sleep up to 1/4 second (for any case).
Do not explicitly enable interrupts in smp_init_secondary() because it
renders any spinlock protected code after that point to run with
interrupts enabled. This is because the processor is executing in the
context of idlethread whose 'md_spinlock_count' is already set to 1.
Instead just let sched_throw() re-enable interrupts when it releases
the spinlock.
The original powerpc commit log for r212559 is available here:
http://svn.freebsd.org/viewvc/base?view=revision&revision=212559
when the original offset is bigger than size of one page. X86BIOS macros
cannot be used here because it is assumed address is only linear in a page.
Tested by: netchild
unexpected things in copyout(9) and so wiring the user buffer is not
sufficient to perform a copyout(9) while holding a random mutex.
Requested by: nwhitehorn
Sponsored by: The FreeBSD Foundation
> Description of fields to fill in above: 76 columns --|
> PR: If a GNATS PR is affected by the change.
> Submitted by: If someone else sent in the change.
> Reviewed by: If someone else reviewed your modification.
> Approved by: If you needed approval for this commit.
> Obtained from: If the change is from a third party.
> MFC after: N [day[s]|week[s]|month[s]]. Request a reminder email.
> Security: Vulnerability reference (one per line) or description.
> Empty fields above will be automatically removed.
M param.h
any spin locks acquired between the enabling of interrupts in
machdep_ap_bootstrap() and the invocation of the scheduler would fail to
have interrupts disabled due to the fake spinlock already held by the
idle thread. sched_throw(NULL) will enable interrupts by itself when
exiting this spinlock, so just let it do that and don't enable interrupts
here.
understand everything correctly, we don't really need it.
- Provide default numeric value as strings. This allows to simplify
a lot of code.
- Bump version number.
- Remove sync from msgrng_send, sync needs to be called just once before
sending.
- Fix retry logic - don't reload registers when retrying in message_send,
also fix check for send pending fail.
- remove unused message_send_block_fast()
- merge message_receive_fast() to message_receive
- style(9) fixes, and comments
- rge and nlge updated for the sys/mips/rmi/msgring.h changes
ACPI specification sates that if P_LVL2_LAT > 100, then a system doesn't
support C2; if P_LVL3_LAT > 1000, then C3 is not supported.
But there are no such rules for Cx state data returned by _CST. If a
state is not supported it should not be included into the return
package. In other words, any latency value returned by _CST is valid,
it's up to the OS and/or user to decide whether to use it.
Submitted by: nork
Suggested by: mav
MFC after: 1 week
If a kobj method doesn't have any explicitly provided default
implementation, then it is auto-assigned kobj_error_method.
kobj_error_method is proper only for methods that return error code,
because it just returns ENXIO.
So, in the case of unimplemented bus_add_child caller would get
(device_t)ENXIO as a return value, which would cause the mistake to go
unnoticed, because return value is typically checked for NULL.
Thus, a specialized null_add_child is added. It would have sufficied
for correctness to return NULL, but this type of mistake was deemed to
be rare and serious enough to call panic instead.
Watch out for this kind of problem with other kobj methods.
Suggested by: jhb, imp
MFC after: 2 weeks
The main goal of this is to generate timer interrupts only when there is
some work to do. When CPU is busy interrupts are generating at full rate
of hz + stathz to fullfill scheduler and timekeeping requirements. But
when CPU is idle, only minimum set of interrupts (down to 8 interrupts per
second per CPU now), needed to handle scheduled callouts is executed.
This allows significantly increase idle CPU sleep time, increasing effect
of static power-saving technologies. Also it should reduce host CPU load
on virtualized systems, when guest system is idle.
There is set of tunables, also available as writable sysctls, allowing to
control wanted event timer subsystem behavior:
kern.eventtimer.timer - allows to choose event timer hardware to use.
On x86 there is up to 4 different kinds of timers. Depending on whether
chosen timer is per-CPU, behavior of other options slightly differs.
kern.eventtimer.periodic - allows to choose periodic and one-shot
operation mode. In periodic mode, current timer hardware taken as the only
source of time for time events. This mode is quite alike to previous kernel
behavior. One-shot mode instead uses currently selected time counter
hardware to schedule all needed events one by one and program timer to
generate interrupt exactly in specified time. Default value depends of
chosen timer capabilities, but one-shot mode is preferred, until other is
forced by user or hardware.
kern.eventtimer.singlemul - in periodic mode specifies how much times
higher timer frequency should be, to not strictly alias hardclock() and
statclock() events. Default values are 2 and 4, but could be reduced to 1
if extra interrupts are unwanted.
kern.eventtimer.idletick - makes each CPU to receive every timer interrupt
independently of whether they busy or not. By default this options is
disabled. If chosen timer is per-CPU and runs in periodic mode, this option
has no effect - all interrupts are generating.
As soon as this patch modifies cpu_idle() on some platforms, I have also
refactored one on x86. Now it makes use of MONITOR/MWAIT instrunctions
(if supported) under high sleep/wakeup rate, as fast alternative to other
methods. It allows SMP scheduler to wake up sleeping CPUs much faster
without using IPI, significantly increasing performance on some highly
task-switching loads.
Tested by: many (on i386, amd64, sparc64 and powerc)
H/W donated by: Gheorghe Ardelean
Sponsored by: iXsystems, Inc.
vnode lock and several locks needed during fork, like fd lock.
Instead, schedule the task to be executed in the taskqueue context. We
still waiting for the fork to finish, but the context of the thread
executing the task does not make real LORs with our vnode lock.
Submitted by: pluknet at gmail com
Reviewed by: jhb
Tested by: pho
MFC after: 3 weeks
* processes now can't go away while we are inserting probes (fixes a panic)
* if a trap happens, we won't be holding the process lock (fixes a hang)
* fix a LOR between the process lock and the fasttrap bucket list lock
Thanks to kib for pointing some problems.
Sponsored by: The FreeBSD Foundation
when mount and update are executed in parallel.
Encapsulate syncer vnode deallocation into the helper function
vfs_deallocate_syncvnode(), to not externalize sync_mtx from vfs_subr.c.
Found and reviewed by: jh (previous version of the patch)
Tested by: pho
MFC after: 3 weeks
- Teach SCHED_4BSD to inform cpu_idle() about high sleep/wakeup rate to
choose optimized handler. In case of x86 it is MONITOR/MWAIT. Also it
will be needed to bypass forthcoming idle tick skipping logic to not
consume resources on events rescheduling when it won't give any benefits.
- Teach SCHED_4BSD to wake up idle CPUs without using IPI. In case of x86,
when MONITOR/MWAIT is active, it require just single memory write. This
doubles performance on some heavily switching test loads.
zack.kirsch at isilon.com for a race between nfsrv_freeopen()
and nfsrv_getlockfile() in the experimental NFS server that
he found during testing. Although nfsrv_freeopen() holds a
sleep lock on the lock file structure when called with
cansleep != 0, nfsrv_getlockfile() could still search the
list, once it acquired the NFSLOCKSTATE() mutex. I believe
that acquiring the mutex in nfsrv_freeopen() fixes the race.
MFC after: 2 weeks
that it works correctly for ZFS file handles. It is possible to
have two ZFS file handles that differ only in the bytes in the
fid_reserved field of the generic "struct fid" and comparing the
bytes in fid_data didn't catch this case. This patch changes the
macro to compare all bytes of "struct fid".
Tested by: gull at gull.us
MFC after: 2 weeks
Now when one does 'make kernel ; make kernel' the second invocation
only does: `kernel.ko' is up to date.
rather than reproduce all the .fw files and relink the kernel.
code associated with overflow or with the drain function. While this
function is not expected to be used often, it produces more information
in the form of an errno that sbuf_overflowed() did.
Bring in a driver for the LSI Logic MPT2 6Gb SAS controllers.
This driver supports basic I/O, and works with SAS and SATA drives and
expanders.
Basic error recovery works (i.e. timeouts and aborts) as well.
Integrated RAID isn't supported yet, and there are some known bugs.
So this isn't ready for production use, but is certainly ready for
testing and additional development. For the moment, new commits to this
driver should go into the FreeBSD Perforce repository first
(//depot/projects/mps/...) and then get merged into -current once
they've been vetted.
This has only been added to the amd64 GENERIC, since that is the only
architecture I have tested this driver with.
Submitted by: scottl
Discussed with: imp, gibbs, will
Sponsored by: Yahoo, Spectra Logic Corporation
This reflects actual type used to store and compare child device orders.
Change is mostly done via a Coccinelle (soon to be devel/coccinelle)
semantic patch.
Verified by LINT+modules kernel builds.
Followup to: r212213
MFC after: 10 days
malo and mwl use the firmware framework to access firmware images.
Depending on the firmware modules itself is not required and in this
case even wrong because no modules with those names exist.
Pointed out by: brucec
MFC after: 1 week
for interfaces with TSO enabled, otherwise one would see an extra
ICMP unreach, frag needed pre matching packet on lo0.
This syncs pf code to ip_output.c r162084.
PR: kern/144311
Submitted by: yongari via mlaier
Reviewed by: eri
Tested by: kib
MFC after: 8 days
Before the change it wasn't possible and the following error was printed:
ZFS: can only boot from disk, mirror or raidz vdevs
Now if the original vdev (the one we are replacing) is still present we will
read from it, but if it is not present we won't read from the new vdev, as it
might not have enough valid data yet.
MFC after: 2 weeks
handlers, some of which had to do awkward things to get a large enough
FIXEDLEN buffer.
Note that some sysctl handlers were explicitly outputting a trailing NUL
byte. This behaviour was preserved, though it should not be necessary.
Reviewed by: phk
PCI status register to map its current name.
- Use PCIM_* rather than PCIR_* for constants for fields in various AER
registers. I got about half of them right in the previous commit.
MFC after: 1 week
called when the sbuf internal buffer is filled. For kernel sbufs with a
drain, the internal buffer will never be expanded. For userland sbufs
with a drain, the internal buffer may still be expanded by
sbuf_[v]printf(3).
Sbufs now have three basic uses:
1) static string manipulation. Overflow is marked.
2) dynamic string manipulation. Overflow triggers string growth.
3) drained string manipulation. Overflow triggers draining.
In all cases the manipulation is 'safe' in that overflow is detected and
managed.
Reviewed by: phk (the previous version)
- Provide 64 bit implementations for some macros. On n64 and n32,
don't split 64 bit values.
- No need for 32 bit ops for control registers.
- Fix few bugs (write control reg, write_c0_register64).
- Re-write EIRR/EIMR/CPUID operations using read_c0_registerXX, no
need of inline assembly.
- rename control reg access functions to avoid phnx, update callers.
- stlye/whitespace fixes.
NFSv2,3 byte range locking is attempted. A fix that allows the
nlm_advlock() to work with both clients is in progress, but
may take a while. As such, I am doing this commit so that
the kernel doesn't panic in the meantime.
Submitted by: jh
MFC after: 2 weeks
in a range must be checked when calling pmap_remove(). Calling
pmap_remove() from vm_pageout_map_deactivate_pages() with the entire range
of the map could result in attempting to demap an extraordinary number
of pages (> 10^15), so iterate through each map entry and unmap each of
them individually.
MFC after: 6 weeks
K2 SATA controllers. The chip's status register must be read first, and
as a long, for other registers to be correctly updated after a command, and
this includes the command sequence in device detection as well as the
previously handled case after interrupts. While here, clean up some
previous hacks related to this controller.
Reported by: many
Reviewed by: mav
MFC after: 3 weeks
syscall and the same function, but are very different and share almost no code.
To make it easier to read and analyze, split vfs_domount() into
vfs_domount_first() and vfs_domount_update().
Reviewed by: kib
- Correct error paths. The system will be useless on devfs_fixup() failure, so
why bother? Maybe for the same reason why a dead body is washed and dressed
in a nice suit before it is put into a coffin? Maybe system's last will is to
panic without any locks held?
Reviewed by: kib
Seagate FreeAgent DockStar(tm) device. It seems to be a dumb down
version of Marvell SheevaPlug. Device tree source file could use
more tweaking, but at least it wll network boot and run FreeBSD/arm.
PCI-express. I used PCIZ_* for ID constants (plain capability IDs use
PCIY_*).
- Add register definitions for the Advanced Error Reporting, Virtual
Channels, and Device Serial Number extended capabilities.
- Teach pciconf -c to list extended as well as plain capabilities. Adds
more detailed parsing for AER, VC, and device serial numbers.
MFC after: 2 weeks
- Updates for the message ring clean up in r212321.
- Instead of dropping Tx packet on credit fail, retry send until it
succeeds.
- Fix freeing mbufs in case of P2P descriptors:
We cannot free the mbuf when the P2P descriptor freeback is received. The
mbuf may be still in use by the GMAC, since the P2P freeback indicates that
it read the P2D descriptors in the P2P message.
Now we free just the P2P descriptor when the P2P freeback message is
received. Another freeback P2D message has been added to the end of
the packet descriptors, the mbuf will be freed only when we received
this.
The P2P descriptor issue was reported by srgorti at netlogicmicro dot com.
are still bound to BSP. It confuses timer management logic in per-CPU mode
and may cause timer not being reloaded. Check such cases on interrupt
arival and reload timer to give system some more time to manage proper
binding.
wrong element of the VSID bitmap array to be examined after a collision,
leading to reallocation of in-use VSIDs under some circumstances, with
attendant memory corruption. Also add an assert to check for this kind of
problem in the future.
MFC after: 4 days
Fix message ring send path:
- define msgrng_access_enable() which disables local interrupts
and enables message ring access. Also define msgrng_restore() which
restores interrupts
- remove all other msgrng enable/disable macros, no need of critical_enter
and other locking here.
- message_send() fixup: re-read status until pending bit clears
- message_send_retry() fixup: retry only few times with interrupts disabled
- Fix up message_send/message_send_retry callers - call
msgrng_access_enable() and msgrng_restore() correctly so that interrupts
are not disabled for long.
- removed unused and obsolete code from sys/mips/rmi/msgring.h
- some style fixes - more later
rge.c (XLR GMAC driver):
- updated for the message ring changes
- remove unused message_send_block()
- retry on credit failure, this is not a permanent failure when credits
are configured correctly. Add panic if credits are not available to
send for a long time.
PowerPC CPUs running a 32-bit kernel. This bug could cause in-use VSIDs
to be allocated again to another process, causing memory space overlaps
and corruption.
Reported by: linimon
discrepencies from the igb version which was the target.
Change the message when neither MSI or MSIX are enabled
and a fallback to Legacy interrupts happen, the existing
message was confusing.
directories for purposes of validating name cache entries. This
closes races where two updates to a file or directory within the same
second could result in stale entries in the name cache. While here,
remove the 'n_expiry' field as it is no longer used.
Reviewed by: rmacklem
MFC after: 1 week
"uncore" devices (such as the memory controller) in that socket. Stop
hardcoding support for two busses, but instead start probing buses at
domain 0, bus 255 and walk down until a bus probe fails. Also, do not probe
a bus if it has already been enumerated elsewhere (e.g. if ACPI ever
enumerates these buses in the future).
Fix interrupt routing so that the irq returned is correct for XLR and
XLS. This also updates the MSI hack we had earlier - we still don't
really support MSI, but we support some drivers that use MSI, by providing
support for allocating one MSI per pci link - this MSI is directly
mapped to the link IRQ.
lock on the pmc-sx lock. This prevents a deadlock with
pmc_log_process_mappings, which has an exclusive lock on pmc-sx and tries
to get a read lock on a vm_map. Downgrading the vm_map_lock in munmap
allows pmc_log_process_mappings to continue, preventing the deadlock.
Without this change I could cause a deadlock on a multicore 8.1-RELEASE
system by having one thread constantly mmap'ing and then munmap'ing a
PROT_EXEC mapping in a loop while I repeatedly invoked and stopped pmcstat
in system-wide sampling mode.
Reviewed by: fabient
Approved by: emaste (mentor)
MFC after: 2 weeks
using ipproto_{un,}register() and the newly created ip6proto_{un,}register()
so that it can again receive IPPROTO_CARP packets allowing its state machine
to work.
Reviewed by: bz
Approved by: ken (mentor)
- add identify method to create driver's own device_t
- successfully probe only driver's own device_t instead of any device_t
- (ab)use device order to hopefully be probed/attached after acpi_wmi
PR: kern/147858
Tested by: Maciej Suszko <maciej@suszko.eu>
MFC after: 1 week
- Add special check for case when time expires before being programmed.
This fixes interrupt loss and respectively timer death on attempt to
program very short interval. Increase minimal supported period to more
realistic value.
- Add support for hint.hpet.X.allowed_irqs tunable, allowing manually
specify which interrupts driver allowed to use. Unluckily, many BIOSes
program wrong allowed interrupts mask, so driver tries to stay on safe
side by not using unshareable ISA IRQs. This option gives control over
this limitation, allowing more per-CPU timers to be provided, when FSB
interrupts are not supported. Value of this tunable is bitmask.
- Do not use regular interrupts on virtual machines. QEMU and VirtualBox
do not support them properly, that may cause problems. Stay safe by default.
Same time both QEMU and VirtualBox work fine in legacy_route mode.
VirtualBox also works fine if manually specify allowed ISA IRQs with above.
upper layer. Until now, unionfs prevents to use that kind of
file system as upper layer. This time, I changed to allow
that kind of file system as upper layer. By this change, you
can use whiteout not supporting file system (e.g., especially
for tmpfs) as upper layer. It's very useful for combination of
tmpfs as upper layer and read only file system as lower layer.
By difinition, without whiteout support from the file system
backing the upper layer, there is no way that delete and rename
operations on lower layer objects can be done. EOPNOTSUPP is
returned for this kind of operations as generated by VOP_WHITEOUT()
along with any others which would make modifica tions to the
lower layer, such as chmod(1).
This change is suggested by ed.
Submitted by: ed
client to return an error when rabp is not set, so it
behaves the same way as the regular NFS client for this
case. It does not affect NFSv4, since nfs_getcacheblk()
only fails for "intr" mounts and NFSv4 can't use the
"intr" mount option.
MFC after: 2 weeks
Also change int -> u_int for order parameter in device_add_child_ordered.
There should not be any ABI change as struct device is private to subr_bus.c
and the API change should be compatible.
To do: change int -> u_int for order parameter of bus_add_child method
and its implementations. The change should also be API compatible, but
is a bit more churn.
Suggested by: imp, jhb
MFC after: 1 week
Both deadline and current_time are time_seconds (+ utc_offset())
casted to unsigned long long. No need to cast to or print as pointers.
MFC after: 4 days
vm_page_startup uses MSGBUF_SIZE value for adding msgbuf pages to minidump.
If opt_msgbuf.h is not included and MSGBUF_SIZE is overriden in kernel
config, then not all msgbuf pages will be dumped. And most importantly,
struct msgbuf itself will not be included. Thus the dump would look
corrupted/incomplete to tools like kgdb, dmesg, etc that try to access
struct msgbuf as one of the first things they do when working on a crash
dump.
MFC after: 5 days
A make buildkernel -j4 uses ~360% CPU.
- Bracket the AP spinup printf with a mutex to avoid garbled output.
- Enable SMP by default on powerpc64.
Reviewed by: nwhitehorn
Add the BIO_ORDERED flag for struct bio and update bio clients to use it.
The barrier semantics of bioq_insert_tail() were broken in two ways:
o In bioq_disksort(), an added bio could be inserted at the head of
the queue, even when a barrier was present, if the sort key for
the new entry was less than that of the last queued barrier bio.
o The last_offset used to generate the sort key for newly queued bios
did not stay at the position of the barrier until either the
barrier was de-queued, or a new barrier (which updates last_offset)
was queued. When a barrier is in effect, we know that the disk
will pass through the barrier position just before the
"blocked bios" are released, so using the barrier's offset for
last_offset is the optimal choice.
sys/geom/sched/subr_disk.c:
sys/kern/subr_disk.c:
o Update last_offset in bioq_insert_tail().
o Only update last_offset in bioq_remove() if the removed bio is
at the head of the queue (typically due to a call via
bioq_takefirst()) and no barrier is active.
o In bioq_disksort(), if we have a barrier (insert_point is non-NULL),
set prev to the barrier and cur to it's next element. Now that
last_offset is kept at the barrier position, this change isn't
strictly necessary, but since we have to take a decision branch
anyway, it does avoid one, no-op, loop iteration in the while
loop that immediately follows.
o In bioq_disksort(), bypass the normal sort for bios with the
BIO_ORDERED attribute and instead insert them into the queue
with bioq_insert_tail(). bioq_insert_tail() not only gives
the desired command order during insertion, but also provides
barrier semantics so that commands disksorted in the future
cannot pass the just enqueued transaction.
sys/sys/bio.h:
Add BIO_ORDERED as bit 4 of the bio_flags field in struct bio.
sys/cam/ata/ata_da.c:
sys/cam/scsi/scsi_da.c
Use an ordered command for SCSI/ATA-NCQ commands issued in
response to bios with the BIO_ORDERED flag set.
sys/cam/scsi/scsi_da.c
Use an ordered tag when issuing a synchronize cache command.
Wrap some lines to 80 columns.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
sys/geom/geom_io.c
Mark bios with the BIO_FLUSH command as BIO_ORDERED.
Sponsored by: Spectra Logic Corporation
MFC after: 1 month
to pad with 0xFF when it encounter short frames. According to RFC
1042 the pad bytes should be 0x00.
Because manual padding consumes extra CPU cycles, introduce a new
tunable which controls the padding behavior. Turning this tunable
on will have driver pad manually but it's disabled by default. Users
can enable software padding by setting the following tunable to
non-zero value.
dev.sis.%d.manual_pad="1"
PR: kern/35422 (patch not used)
prevented driver from working on big-endian machines. Also rewrite
station address programming to make it work on strict-alignment
architectures. With this change, sis(4) now works on sparc64 and
performance number looks good even though sis(4) have to apply
fixup code to align received frames on 2 bytes boundary on sparc64.
In protosw we define pr_protocol as short, while on the wire
it is an uint8_t. That way we can have "internal" protocols
like DIVERT, SEND or gaps for modules (PROTO_SPACER).
Switch ipproto_{un,}register to accept a short protocol number(*)
and do an upfront check for valid boundries. With this we
also consistently report EPROTONOSUPPORT for out of bounds
protocols, as we did for proto == 0. This allows a caller
to not error for this case, which is especially important
if we want to automatically call these from domain handling.
(*) the functions have been without any in-tree consumer
since the initial introducation, so this is considered save.
Implement ip6proto_{un,}register() similarly to their legacy IP
counter parts to allow modules to hook up dynamically.
Reviewed by: philip, will
MFC after: 1 week
response to DMA activate FIS under certain circumstances. This is
recommended fix from chip datasheet. If triggered, this bug most likely
cause write command timeout.
MFC after: 2 weeks
value 0xff. On hot-plug this value confuses ata_generic_reset() device
presence detection logic. As soon as we already know drive presence from
SATA hard reset, hint ata_generic_reset() to wait for device signature
until success or full timeout.
greater than 65535 bytes then the CDC driver might not work as expected, which
is not likely with the existing USB speeds.
Submitted by: Hans Petter Selasky
the file handle's size and was recently committed to
lib/libstand/nfs.c. This allows pxeboot to use NFSv3 and work
correcty for non-FreeBSD as well as FreeBSD NFS servers.
If built with OLD_NFSV2 defined, the old
code that predated this patch will be used.
Tested by: danny at cs.huji.ac.il
boot.nfsroot.nfshandlelen and set the diskless root fs to
use NFSv3 and this file handle length when it is set. If
this environment variable is not set, the diskless root fs
will use NFSv2 and the same defaults as before. This fixes
the problem where the diskless nfs root fs had to be on a
FreeBSD server for NFSv3 to work, because it did not know
the correct file handle length and assumed the size used
by FreeBSD. Until pxeboot and loader are replaced by ones
built from commits coming soon, boot.nfsroot.nfshandlelen
will not be set by them and the diskless root fs will use
NFSv2 unless the /etc/fstab entry has the "nfsv3" option
specified.
Tested by: danny at cs.huji.ac.il
MFC after: 2 weeks
allmulti is toggled. Controller does not require reinitialization.
This removes unnecessary controller reinitialization whenever
tcpdump is used.
While I'm here remove unnecessary variable reinitialization.
configured TX/RX MACs before getting a valid link. After that, when
link state change callback is called, it called device
initialization again to reconfigure TX/RX MACs depending on
resolved link state. This hack created several bad side effects and
it required more hacks to not collide with sis_tick callback as
well as disabling switching to currently selected media in device
initialization. Also it seems sis(4) was used to be a template
driver for long time so other drivers which was modeled after
sis(4) also should be changed.
TX/RX MACs are now reconfigured after getting a valid link. Fix for
short cable error is also applied after getting a link because it's
only valid when the resolved speed is 100Mbps.
While I'm here slightly reorganize interrupt handler such that
sis(4) always read SIS_ISR register to see whether the interrupt is
ours or not. This change removes another hack and make it possible
to nuke sis_stopped variable in softc.
thread in a racy manner, which can lead to attempting to migrate a
thread that is pinned to a CPU. Instead, have sched_switch() determine
which CPU a thread should run on if the current one is not allowed.
KASSERT in sched_bind() that the thread is not yet pinned to a CPU.
KASSERT in sched_switch() that only migratable threads or those moving
due to a sched_bind() are changing CPUs.
sched_affinity code came from jhb@.
MFC after: 2 weeks
- add rm_try_rlock().
- add RM_SLEEPABLE to use sx(9) as the back-end lock in order to sleep while
holding the write lock.
- change rm_noreadtoken to a cpu bitmask to indicate which CPUs need to go
through the lock/unlock in order to synchronize. As a side effect, this
also avoids IPI to CPUs without any readers during rm_wlock.
Discussed with: ups@, rwatson@ on arch@
Sponsored by: Isilon Systems, Inc.
o Enforce TX/RX descriptor ring alignment. NS data sheet says the
controller needs 4 bytes alignment but use 16 to cover both SiS
and NS controllers. I don't have SiS data sheet so I'm not sure
what is alignment restriction of SiS controller but 16 would be
enough because it's larger than the size of a TX/RX descriptor.
Previously sis(4) ignored the alignment restriction.
o Enforce RX buffer alignment, 4.
Previously sis(4) ignored RX buffer alignment restriction.
o Limit number of TX DMA segment to be used to 16. It seems
controller has no restriction on number of DMA segments but
using more than 16 looks resource waste.
o Collapse long mbuf chains with m_collapse(9) instead of calling
expensive m_defrag(9).
o TX/RX side bus_dmamap_load_mbuf_sg(9) support and remove
unnecessary callbacks.
o Initial endianness support.
o Prefer local alignment fixup code to m_devget(9).
o Pre-allocate TX/RX mbuf DMA maps instead of creating/destroying
these maps in fast TX/RX path. On non-x86 architectures, this is
very expensive operation and there is no need to do that.
o Add missing bus_dmamap_sync(9) in TX/RX path.
o watchdog is now unarmed only when there are no pending frames
on controller. Previously sis(4) blindly unarmed watchdog
without checking the number of queued frames.
o For efficiency, loaded DMA map is reused for error frames.
o DMA map loading failure is now gracefully handled. Previously
sis(4) ignored any DMA map loading errors.
o Nuke unused macros which are not appropriate for endianness
operation.
o Stop embedding driver maintained structures into descriptor
rings. Because TX/RX descriptor structures are shared between
host and controller, frequent bus_dmamap_sync(9) operations are
required in fast path. Embedding driver structures will increase
the size of DMA map which in turn will slow down performance.
- set cache_coherent_dma flag in cpuinfo for XLR, this will make sure that
BUS_DMA_COHERENT flag is handled correctly in busdma_machdep.c
- iodi.c, call device_get_name() just once
- clear RMI specific EIRR while intializing CPUs
- remove debug print in intr_machdep.c
which also avoids NULL pointer arithmetic, as suggested by jhb. The
available space goes from 11 bytes to 7.
Reviewed by: nyan
Approved by: rpaulo (mentor)
programs that refuse to run as root (pgsql) to install probes when their
user is part of the wheel group.
Sponsored by: The FreeBSD Foundation
> Description of fields to fill in above: 76 columns --|
> PR: If a GNATS PR is affected by the change.
> Submitted by: If someone else sent in the change.
> Reviewed by: If someone else reviewed your modification.
> Approved by: If you needed approval for this commit.
> Obtained from: If the change is from a third party.
> MFC after: N [day[s]|week[s]|month[s]]. Request a reminder email.
> Security: Vulnerability reference (one per line) or description.
> Empty fields above will be automatically removed.
M dev/dtrace/dtrace_load.c
not change interrupt vector if it is not pointing the ROM itself. Actually,
we just fail shadowing altogether if that is the case because the shadowed
copy will be useless for sure and POST may not be relocatable or useful.
While I'm here, fix a debugging message under bootverbose, really. r211829
fixed one case but broke another. Mea Culpa.
or not by comparing reported TX consumer index with saved index. So
remove unnecessary check done after freeing transmitted mbufs.
While I'm here nuke unnecessary variable initializations.
tag. All controllers that are not BCM5755 or higher have 4GB
boundary DMA bug. Previously bge(4) used 32bit DMA address to
workaround the bug(r199670). However this caused the use of bounce
buffers such that it resulted in poor performance for systems which
have more than 4GB memory. Because bus_dma(9) honors boundary
restriction requirement of DMA tag for dynamic buffers, having a
separate TX/RX mbuf DMA tag will greatly reduce the possibility of
using bounce buffers. For DMA buffers allocated with
bus_dmamem_alloc(9), now bge(4) explicitly checks whether the
requested memory region crossed the boundary or not.
With this change, only the DMA buffer that crossed the boundary
will use 32bit DMA address. Other DMA buffers are not affected as
separate DMA tag is created for each DMA buffer.
Even if 32bit DMA address space is used for a buffer, the chance to
use bounce buffer is still very low as the size of buffer is small.
This change should eliminate most usage of bounce buffers on
systems that have more than 4GB memory.
More correct fix would be teaching bus_dma(9) to honor boundary
restriction for buffers created with bus_dmamem_alloc(9) but it
seems that is not easy.
While I'm here cleanup bge_dma_map_addr() and remove unnecessary
member variables in bge_dmamap_arg structure.
Tested by: marcel
gnu/lib/libobjc and sys/boot/i386/boot2, so it also works when using
absolute paths and/or options, as in CC="/absolute/path/clang -foo".
Approved by: rpaulo (mentor)
the existing code was very platform specific, and broken for SMP systems
trying to reboot from KDB.
- Add a new PLATFORM_RESET() method to the platform KOBJ interface, and
migrate existing reset functions into platform modules.
- Modify the OF_reboot() routine to submit the request by hand to avoid
the IPIs involved in the regular openfirmware() routine. This fixes
reboot from KDB on SMP machines.
- Move non-KDB reset and poweroff functions on the Powermac platform
into the relevant power control drivers (cuda, pmu, smu), instead of
using them through the Open Firmware backdoor.
- Rename platform_chrp to platform_powermac since it has become
increasingly Powermac specific. When we gain support for IBM systems,
we will grow a new platform_chrp.
signals, because it is managed by debugger, however a normal signal sent to
a interruptibly sleeping thread wakes up the thread so it will handle the
signal when the process leaves the stopped state.
PR: 150138
MFC after: 1 week
of the lower level vnode is incremented to greater than 1 when
the upper level vnode's v_usecount is greater than one. This
is necessary for the NFS clients, so that they will do a silly
rename of the file instead of actually removing it when the
file is still in use. It is "racy", since the v_usecount is
incremented in many places in the kernel with
minimal synchronization, but an extraneous silly rename is
preferred to not doing a silly rename when it is required.
The only other file systems that currently check the value
of v_usecount in their VOP_REMOVE() functions are nwfs and
smbfs. These file systems choose to fail a remove when the
v_usecount is greater than 1 and I believe will function
more correctly with this patch, as well.
Tested by: to.my.trociny at gmail.com
Submitted by: to.my.trociny at gmail.com (earlier version)
Reviewed by: kib
MFC after: 2 weeks
heavy I/O load.
Many thanks to LSI for continuing to support FreeBSD.
PR: kern/149968
Submitted by: LSI (Tom Couch)
Reported by: Kai Kockro <kkockro web de>
Tested by: Kai Kockro, jpaetzel
MFC after: 7 days
least one execute bit set, otherwise execve(2) will return EACCES even
for an user with PRIV_VFS_EXEC privilege.
Add the check also to vaccess(9), vaccess_acl_nfs4(9) and
vaccess_acl_posix1e(9). This makes access(2) to better agree with
execve(2). Because ZFS doesn't use vaccess(9) for VEXEC, add the check
to zfs_freebsd_access() too. There may be other file systems which are
not using vaccess*() functions and need to be handled separately.
PR: kern/125009
Reviewed by: bde, trasz
Approved by: pjd (ZFS part)
unused files.
- remove clock.c and clock.h, these are not used after the new timer
code was added.
- remove duplicated include files, fix header file ordering, remove
some unneeded includes.
- rename mips/rmi/shared_structs.h which contains the RMI boot loader
interface to mips/rmi/rmi_boot_info.h. Remove unused files
mips/rmi/shared_structs_func.h and sys/mips/rmi/shared_structs_offsets.h
- merge mips/rmi/xlrconfig.h and mips/rmi/rmi_mips_exts.h, and remove
duplicated functions.
- nlge - minor change to remove unneeded argument.
- Add FreeBSD svn keyword for headers
pseudo-interface. This leads to a panic due to uninitialized
if_broadcastaddr address. Initialize it and implement ip_output()
method to prevent mbuf leak later.
ipfw pseudo-interface should never send anything therefore call
panic(9) in if_start() method.
PR: kern/149807
Submitted by: Dmitrij Tejblum
MFC after: 2 weeks
PMAP_DIAGNOSTIC was eliminated from amd64/i386, and, in fact, the
non-MIPS parts of the kernel, several years ago. Any of the interesting
checks were turned into KASSERT()s. Basically, the motivation was that
lots of people run with INVARIANTS but no one runs with DIAGNOSTIC.
panic strings needn't and shouldn't have a terminating newline.
Finally, there is one functional change. The sched_pin() in
pmap_remove_pages() is an artifact of the way we temporarily map page
table pages on i386. (The mappings are processor private. We don't do
a system-wide shootdown.) It isn't needed by MIPS.
Tested by: jchandra
Submitted by: alc
nfsd_recalldelegation() function, since this function is called
by nfsd threads when they are handling NFSv2 or NFSv3 RPCs, where
no reference count would have been acquired.
MFC after: 2 weeks
the correct mutex when checking nfsv4root_lock. Although this
could be fixed by adding mutex lock/unlock calls, zack.kirsch at
isilon.com suggested a better fix that uses a non-blocking
acquisition of a reference count on nfsv4root_lock. This fix
allows the weird NFSLOCKSTATE(); NFSUNLOCKSTATE(); synchronization
to be deleted. This patch applies this fix.
Tested by: zack.kirsch at isilon.com
MFC after: 2 weeks
and XAUI 10G interfaces in addition RGMII/SGMII 1G interfaces. This driver
is work in progress.
board.c and board.h expanded to include more info.
Only one of rge and nlge can be enabled at a time, rge will be deprecated
when nlge stabilizes.
Submitted by: Sriram Gorti <srgorti at netlogicmicro com>
Fix the switching on/off of PF and NR-SACKs using sysctl.
Add minor improvement in handling malloc failures.
Improve the address checks when sending.
MFC after: 4 weeks
for socket, when specified POLLIN|POLLOUT in events, you would have one
selfd registered for receiving socket buffer, and one for sending. Now,
if both events are not ready to fire at the time of the initial scan,
but are simultaneously ready after the sleep, pollrescan() would iterate
over the pollfd struct twice. Since both times revents is not zero,
returned value would be off by one.
Fix this by recalculating the return value in pollout().
PR: kern/143029
MFC after: 2 weeks
- better ACL caching and speedup of ACL permission checks
- faster handling of stat()
- lowered mutex contention in the read/writer lock (rrwlock)
- several related bugfixes
Detailed information (OpenSolaris onnv changesets and Bug IDs):
9749:105f407a2680
6802734 Support for Access Based Enumeration (not used on FreeBSD)
6844861 inconsistent xattr readdir behavior with too-small buffer
9866:ddc5f1d8eb4e
6848431 zfs with rstchown=0 or file_chown_self privilege allows user to "take" ownership
9981:b4907297e740
6775100 stat() performance on files on zfs should be improved
6827779 rrwlock is overly protective of its counters
10143:d2d432dfe597
6857433 memory leaks found at: zfs_acl_alloc/zfs_acl_node_alloc
6860318 truncate() on zfsroot succeeds when file has a component of its path set without access permission
10232:f37b85f7e03e
6865875 zfs sometimes incorrectly giving search access to a dir
10250:b179ceb34b62
6867395 zpool_upgrade_007_pos testcase panic'd with BAD TRAP: type=e (#pf Page fault)
10269:2788675568fd
6868276 zfs_rezget() can be hazardous when znode has a cached ACL
10295:f7a18a1e9610
6870564 panic in zfs_getsecattr
Approved by: delphij (mentor)
Obtained from: OpenSolaris (multiple Bug IDs)
MFC after: 2 weeks
This provides a noticeable write speedup, especially on pools with
less than 30% of free space.
Detailed information (OpenSolaris onnv changesets and Bug IDs):
11146:7e58f40bcb1c
6826241 Sync write IOPS drops dramatically during TXG sync
6869229 zfs should switch to shiny new metaslabs more frequently
11728:59fdb3b856f6
6918420 zdb -m has issues printing metaslab statistics
12047:7c1fcc8419ca
6917066 zfs block picking can be improved
Approved by: delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6826241, 6869229, 6918420, 6917066)
MFC after: 2 weeks
Actually it is hard to properly handle such a failure, especially in MNT_UPDATE
case. The only reason for the vfs_allocate_syncvnode() function to fail is
getnewvnode() failure. Fortunately it is impossible for current implementation
of getnewvnode() to fail, so we can assert this and make
vfs_allocate_syncvnode() void. This in turn free us from handling its failures
in the mount code.
Reviewed by: kib
MFC after: 1 month
but because of a bug it was a no-op, so we were still using offsets in native
byte order for the host. Do it properly this time, bump version to 4 and set
the G_ELI_FLAG_NATIVE_BYTE_ORDER flag when version is under 4.
MFC after: 2 weeks
configuration function. For failed memory allocations, em(4)/lem(4)
called panic(9) which is not acceptable on production box.
igb(4)/ixgb(4)/ix(4) allocated the required memory in stack which
consumed 768 bytes of stack memory which looks too big.
To address these issues, allocate multicast array memory in device
attach time and make multicast configuration success under any
conditions. This change also removes the excessive use of memory in
stack.
Reviewed by: jfv
Just showing some buffer allocation error is more appropriate
action for drivers. This should fix occasional panic reported on
em(4) when driver encountered resource shortage.
Reviewed by: jfv
SMP.
We used to route all PIC based interrupts to cpu 0, and used the per-CPU
interrupt mask to enable/disable interrupts. But the interrupt threads can
run on any cpu on SMP, and the interrupt thread will re-enable the interrupts
on the CPU it runs on when it is done, and not on cpu0 where the PIC will
still send interrupts to.
The fix is move the disable/enable for PIC based interrupts to PIC, we will
ack on PIC only when the interrupt thread is done, and we do not use the
per-CPU interrupt mask.
The changes also introduce a way for subsystems to add a function that
will be called to clear the interrupt on the subsystem. Currently This is
used by the PCI/PCIe for doing additional work during the interrupt
handling.
- We are fine by only share-locking the vnode.
- Remove assertion that doesn't hold for ZFS where we cross mount points
boundaries by going into .zfs/snapshot/<name>/.
Reviewed by: rmacklem
MFC after: 1 month
(replay_alloc()) knows how to handle replay_alloc() failure.
- Eliminate 'freed_one' variable, it is not needed - when no entry is found
rce will be NULL.
- Add locking assertions where we expect a rc_lock to be held.
Reviewed by: rmacklem
MFC after: 2 weeks
supported by many BIOSes to improve performance of VESA BIOS calls for real
mode OSes but it is not our intention here. However, this may help some
platforms where the video ROMs are inaccessible after suspend, for example.
Note it may consume up to 64K bytes of contiguous memory depending on video
controller model when it is enabled. This feature can be disabled by
setting zero to 'debug.vesa.shadow_rom' loader tunable via loader(8) or
loader.conf(5). The default is 1 (enabled), for now.