There were a couple of attempts in the past to reduce it since it
took more than 1ms. Because mii_tick() periodically polls link
status, waiting more than 1ms for each GMII register access was
overkill. Unfortunately all previous attempts were failed with
various ways on different controllers.
This time, add additional 20us dealy at the end of GMII register
access which seems to requirement of all RealTek controllers to
issue next GMII register access request. This is the same way what
Linux does.
ensure that grow_amount is a multiple of the page size. Otherwise, the
kernel may crash in swap_reserve_by_uid() on HEAD and FreeBSD 8.x, and
produce a core file with a missing stack on FreeBSD 7.x.
Diagnosed and reported by: jilles
Reviewed by: kib
MFC after: 1 week
actually 16 I/O lines but the other ones are used for system devices and
interrupts.
The IXP4XX platform can set interrupts on these pins for
high/low/rising/falling/transitional but this is not implemented yet.
The Cambria has the same interface but as all the pins are assigned to system
functions the gpio header is toggled via a PLD on the i2c bus and is not
supported by this commit.
function from the timer code to util, rename it appropriately and
also fix a bug in sctp_get_prev_mtu(), where calling it with a
value existing in the MTU table did not return a smaller one.
MFC after: 3 days.
controllers. sk(4) never reprogrammed station address for Yukon
controllers so overriding station address with ifconfig(8) was not
possible.
Fix the bug by reprogramming all registers that control station
address, flow-control and virtual station address. Virtual station
address has no use at this moment since driver does not make use of
fail over feature.
Tested by: "Mikhail T." <mi+thun <> aldan.algebra.com>
MFC after: 1 week
the IEEE80211_C_RATECTL flag set, default to NONE for all drivers. Only if
a driver calls ieee80211_ratectl_init() check if the NONE algo is still
selected and try to use AMRR in that case. Drivers are still free to use
any other algo by calling ieee80211_ratectl_set() prior to the
ieee80211_ratectl_init() call.
After this change it is now safe to assume that a ratectl algo is always
available and selected, which renders the IEEE80211_C_RATECTL flag pretty
much useless. Therefore revert r211314 and 211546.
Reviewed by: rpaulo
MFC after: 2 weeks
what we have. Without the check the kernel could accessing memory that
does not belong to the request struct.
Note that we do not test if the struct equals in size at this time, which
may faciliate forward compatibility with newer binaries.
Reviewed by: pjd at MeetBSD CA '2010
MFC after: 1 week
based devices (QUALCOMMINC 0x2000). He made it use SCSI eject instead of
ZTE STOR eject. This prevented my ZTE MF626 dongle from switching.
- Apply both eject methods for ZTE STOR based devices. Works on my as
well as mav's device.
- Remove the duplicate.
- Sort the usbdevs entries for Qualcomm so this won't happen again.
- Add bootverbose message displaying the fact that we are ejecting (and
how).
Reviewed by: mav
MFC after: 2 weeks
only for !usevget case. If VFS_VGET is working, the vnode shared lock
is obtained recursively and vput() shall be done, not vunref().
Submitted by: rmacklem
Tested by: Josh Carroll <josh.carroll gmail com>
MFC after: 3 days
copied as a template for _SRS, a string pointer for descriptor name is also
copied and it becomes stale as soon as it gets de-allocated[2]. Now _CRS is
used as a template for _SRS as ACPI specification suggests if it is usable.
The template from _PRS is still utilized but only when _CRS is not available
or broken. To avoid use-after-free the problem in this case, however, only
mandatory fields are copied, optional data is removed, and structure length
is adjusted accordingly.
Reported by: hps[1]
Analyzed by: avg[2]
Tested by: hps
useful counters like rl_missed_pkts is 16 bits quantity which is
too small to hold meaningful information happened in a second. This
means driver should frequently read these counters in order not to
lose accuracy and that approach is too inefficient in driver's
view. Moreover it seems there is no way to trigger an interrupt to
detect counter near-full or wraparound event as well as lacking
clearing the MAC counters. Another limitation of reading the
counters from RealTek controllers is lack of interrupt firing at
the end of DMA cycle of MAC counter read request such that driver
have to poll the end of the DMA which is a time consuming process
as well as inefficient. The more severe issue of the MAC counter
read request is it takes too long to complete the DMA. All these
limitation made maintaining MAC counters in driver impractical. For
now, just provide simple sysctl interface to trigger reading the
MAC counters. These counters could be used to track down driver
issues. Users can read MAC counters maintained in controller with
the following command.
#sysctl dev.re.0.stats=1
While I'm here add check for validity of dma map and allocated
memory before unloading/freeing them.
Tested by: rmacklem
information through devd. My E220 now produces the notification (1 line):
+u3g0 at bus=1 hubaddr=1 port=0 devaddr=2 interface=0 \
vendor=0x12d1 product=0x1003 devclass=0x00 devsubclass=0x00 \
sernum="" release=0x0000 intclass=0xff intsubclass=0xff \
ttyname=U0 ttyports=2 on uhub0
Note: serial/ufoma and net/uhso still provide port number and tty name
(uhso only) information through sysctls, which should now be removed.
Reviewed by: hpselasky
work properly with single-stepping in a kernel debugger. Specifically,
these routines have always disabled interrupts before increasing the nesting
count and restored the prior state of interrupts after decreasing the nesting
count to avoid problems with a nested interrupt not disabling interrupts
when acquiring a spin lock. However, trap interrupts for single-stepping
can still occur even when interrupts are disabled. Now the saved state of
interrupts is not saved in the thread until after interrupts have been
disabled and the nesting count has been increased. Similarly, the saved
state from the thread cannot be read once the nesting count has been
decreased to zero. To fix this, use temporary variables to store interrupt
state and shuffle it between the thread's MD area and the appropriate
registers.
In cooperation with: bde
MFC after: 1 month
- Fix the loop count on detach (causing a panic on detaching a serial
dongle).
- Increase a buffer in case some driver want extra long tty device names
(postfixing the purpose of the tty for example, e.g. u3g.ppp).
notification. devd would stop evaluating at 'at' (not '<k>=<v>') and
hence prevent 'port=X' (and 'bus=<"on" string>) from making it into the
environment for the devd action.
Reviewed by: hselasky
MFC after: 2 weeks
It seems the terminfo library on some systems (OS X, Linux) may emit the
sequence \e[x to reset to default attributes. Apart from using the
zero-command, this escape sequence allows many more operations, such as
setting ANSI colors. I don't see this used anywhere, so this should be
sufficient for now.
This deficiency was spotted by the Debian GNU/kFreeBSD. They have their
own patch, which is slightly flawed in my opinion. I don't know why they
never reported this issue to us.
MFC after: 1 week
zones whose objects are larger than a page to use startup_alloc(). This
allows allocation of zone objects during early boot on machines with a large
number of CPUs since the resulting zone objects are larger than a page.
Submitted by: trema
Reviewed by: attilio
MFC after: 1 week
This could lead to a division by zero if hardware is multi-core and/or
multi-threaded, but for some (quite unusual) reason FreeBSD sees only
one logical processor. This could happen, for example, if neither MADT
nor MP Table are presented by BIOS.
Also:
- assert in topo_probe_0x4 that BSP is accounted for
- neither cpu_cores nor cpu_logical should be zero after successful
probing, so either being zero is an indication of failed probing
Reported by: vwe, Dan Allen <danallen46@airwired.net>
Tested by: Dan Allen <danallen46@airwired.net>
MFC after: 3 days
- hw.usb.ucom.cons_unit is now split into
hw.usb.ucom.cons_unit/...cons_subunit.
Note: The tunable/sysctl hw.usb.ucom.cons_unit needs to be reviewed if
a) a console was defined a USB serial devices, and a USB device with
more than 1 subunit is present, and this device is attached before the
device functioning as a console
or
b) a console was defined on a USB device with more than 1 subunit
Reviewed by: hps
MFC after: 2 weeks
only, and should be protected with an ifdef, and the no-execute bit in
32-bit set_user_sr() should be set before the comparison, not after, or
it will never match.
of the EV_DONE flag and use the mutex to protect against losing wakeups
in g_waitfor_event().
Reported by: davidxu
Tested by: davidxu
Discussed on: freebsd-current
set_user_sr() itself caches the user segment VSID, there is no need for
cpu_switch() to do it again. This change also unifies the 32 and 64-bit
code paths for kernel faults on user pages and remaps the user SLB slot
on 64-bit systems when taking a syscall to avoid some unnecessary segment
exception traps.
a shorter message (userland generally only sees the first 6 to 8
characters) when waiting for the allproc lock. Use "-" when idle to math
the behavior of other kthreads.
Reviewed by: attilio
MFC after: 1 week
- Use > 2^32 - 1 instead of >= when checking for memory regions above 4G.
- Skip SMAP entries > 4G on i386 rather than breaking out of the loop
since SMAP entries are not guaranteed to be in order.
- Remove 'i' and loop over 'rid' directly in the dump_avail[] case.
- Only check for 4G regions in the dump_avail[] case on i386 if PAE is
enabled since vm_paddr_t is 32-bit in the !PAE case.
Submitted by: alc
to amd64, i386, and pc98. The headers are installed to /usr/include/x86
during an installworld, and an 'x86' symlink is created for kernel builds
similar to 'machine' so that the headers can be included as <x86/foo.h>.
Reviewed by: imp
about but otherwise ignored. When allowing the master to be set manually via
ifconfig(8) by adding the former to IFM_SUBTYPE_ETHERNET_OPTION_DESCRIPTIONS
(as it should be) it seems to be unfavorable that a machine can be made to
panic with a simple ifconfig(8) invocation.
concurrency bug. Since all SLB/SR entries were invalidated during an
exception, a decrementer exception could cause the user segment to be
invalidated during a copyin()/copyout() without a thread switch that
would cause it to be restored from the PCB, potentially causing the
operation to continue on invalid memory. This is now handled by explicit
restoration of segment 12 from the PCB on 32-bit systems and a check in
the Data Segment Exception handler on 64-bit.
While here, cause copyin()/copyout() to check whether the requested
user segment is already installed, saving some pipeline flushes, and
fix the synchronization primitives around the mtsr and slbmte
instructions to prevent accessing stale segments.
MFC after: 2 weeks
feature_present(3) checks.
This will help to run-time detect and conditionally handle specific
optionas of either feature in user space (i.e. in libipsec).
Descriptions read by: rwatson
MFC after: 2 weeks
not able to trigger the issue with sample boards, some users seems
to suffer from freeze/lockup when system is booted without UTP cable
plugged in. I'm not sure whether this is BIOS issue or controller
bug. This change fixes AR8132 lockup issue seen on EEE PC.
Reported by: kmoore
Tested by: kmoore
need locking as otherwise we may race against the other parts of the
MD code which expects a consistent state of these. While at it move
the resetting of the pmap before entering it in the TSB.
- Spell a 0 as TLB_CTX_KERNEL.
was incorrect as further down the road cons_probe() calls malloc() so the
former can't be called before init_heap() has succeed. Instead just exit
to the firmware in case init_heap() fails like OF_init() does when hitting
a problem as we're then likely running in a very broken environment where
hardly anything can be trusted to work.
- Rename RES_BUS_SPACE_* into BUS_SPACE_* for consistency
- Trim out an unnecessary checking condition
Sponsored by: Sandvine Incorporated
Requested and reviewed by: jhb
NFSv4 client, since the call in ncl_inactive() might be missed
because VOP_INACTIVE() is not guaranteed to be called before
VOP_RECLAIM().
MFC after: 1 week
r198301 itself. It also broke the logic of not sending more than one
ARP request per second, that consequently lead to a potential problem
of flooding network with broadcast packets.
MFC after: 1 week
OF loader on systems where address cells and size cells are both 2 (the
Mambo simulator) and fix an error where cons_probe() was called before
init_heap() but used malloc() to set environment variables.
MFC after: 1 month
when routing interrupts instead of cpu_apic_ids[0] since cpu_apic_ids[]
is only populated for multiple-CPU machines. This also matches what the
code does when SMP is not enabled.
PR: bin/151616
Tested by: "Damian S. Kolodziejczyk" damkol | gmail
Submitted by: avg
MFC after: 1 week
In xbb_detach() only perform cleanup of our taskqueue and
device statistics structures if they have been initialized.
This avoids a panic when xbb_detach() is called on a partially
initialized device instance, due to an early failure in
attach.
Sponsored by: Spectra Logic Corporation
ip and tcp pointers were not reset after some
pullups. In practice this led to an NFS mount
failure when using UDP reported by Kevin Lo,
thanks Kevin. Fix from yongari, thank you!
the dual port BCM5717 and BCM5718 devices which are intended for
mainstream workstation and entry-level server designs and
represents the twelfth generation of NetXtreme Ethernet controllers.
This family is the successor to the BCM5714/BCM5715 family and
supports IPv4/IPv6 checksum offloading, TSO, VLAN hardware tagging,
jumbo frames, MSI/MSIX, IOV, RSS and TSS.
This change set supports all hardware features except IOV and
RSS/TSS. Unlike its predecessors, only extended RX buffer
descriptors can be posted to the jumbo producer ring. Single RX
buffer descriptors for jumbo frame are not supported. RSS requires
a more substantial set of changes and will apply to a larger set
of NetXtreme devices so RSS/TSS multi-queue support will be
implemented in a future releases.
Special thanks to Broadcom who kindly sent a sample board to me
and to davidch who gave provided the initial support code.
Submitted by: davidch (initial version)
HW donated by: Broadcom
physical page mapping should span two or more MTRRs of different types.
Add a pmap function, pmap_demote_DMAP(), by which the MTRR module can
ensure that the direct map region doesn't have such a mapping.
[2] Fix a couple of nearby style errors in amd64_mrset().
[3] Re-enable the use of 1GB page mappings for implementing the direct
map. (See also r197580 and r213897.)
Tested by: kib@ on a Westmere-family processor [3]
MFC after: 3 weeks
delegations are being returned for reasons other than a Recall.
Also, re-organize nfscl_recalldeleg() slightly, so that it leaves
clearing NMODIFIED to the ncl_flush() call and invalidates the
attribute cache after flushing. It is hoped that these changes
might fix the problem others have seen when using the NFSv4
client with delegations enabled, since I can't reliably reproduce
the problem. These changes only affect the client when doing NFSv4
mounts with delegations enabled.
MFC after: 10 days
'hw.acpi.remove_interface'. hw.acpi.install_interface lets you install new
interfaces. Conversely, hw.acpi.remove_interface lets you remove OS
interfaces from the pre-defined list in ACPICA. For example,
hw.acpi.install_interface="FreeBSD"
lets _OSI("FreeBSD") method to return 0xffffffff (or success) and
hw.acpi.remove_interface="Windows 2009"
lets _OSI("Windows 2009") method to return zero (or failure). Both are
comma-separated lists and leading white spaces are ignored. For example,
the following examples are valid:
hw.acpi.install_interface="Linux, FreeBSD"
hw.acpi.remove_interface="Windows 2006, Windows 2006.1"
OpenSolaris onnv-revision: 10209:91f47f0e7728
6830541 zfs_get_data_trips on a verify
6696242 multiple zfs_fillpage() zfs: accessing past end of object panics
6785914 zfs fails to drop dn_struct_rwlock in recovery code path
Approved by: delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6830541, 6696242, 6785914)
MFC after: 2 weeks
- Chasin down bogus watchdogs has led to an improved
design to this handling, the hang decision takes
place in the tx cleanup, with only a simple report
check in local_timer. Our tests have shown no false
watchdogs with this code.
- VLAN fixes from jhb, the shadow vfta should be per
interface, but as global it was not. Thanks John.
- Bug fixes in the support for new PCH2 hardware.
- Thanks for all the help and feedback on the driver,
changes to lem with be coming shortly as well.
autotuned. It is only an upper bound (the memory is not always allocated)
and the system contains a vm_lowmem handler so nothing will crash and burn
if it's tuned too high.
Reviewed by: mckusick
introduce function pointers once set up to the respective implementation
for reading the (S)TICK and writing the (S)STICK_COMPARE registers as a
compromise between duplicating code and selecting between different
implementations during execution over and over again, similar to what is
done elsewhere in the MD in order to support different CPU models that
won't ever change at runtime.
- In the remaining tick interrupt handler further push down disabling of
interrupts to the periodic case as it isn't necessary here in one-shot
mode at all.
This was needed for recover implementation.
Implement the recover command for GPT. Now GPT will marked as
corrupt when any of three types of corruption will be detected:
1. Damaged primary GPT header or table
2. Damaged secondary GPT header or table
3. Secondary header is not located in the last LBA
Marked GPT becomes read-only. Any changes with corrupt table
are prohibited. Only "destroy" and "recover" commands are allowed.
Discussed with: geom@ (mostly silence)
Tested by: Ilya A. Arhipov
Approved by: mav (mentor)
MFC after: 2 weeks
within the first 4 bytes of the EHCI memory space. For controllers that
use big-endian MMIO, reading them with 1- and 2-byte reads would then
return the wrong values. Instead, read the combined register with a 4-byte
read and mask out the interesting quantities.
enhancements (1). Switch to a standard 2-clause BSD license for this (2).
Unfortunately we have to un-static the ifindex_table for this but do not
publicly export it.
Suggested by: rwatson (1) a while back.
Approved by: thompsa (2) for the change from r204279.
MFC after: 6 days
VLAN hardware tagging to make TSO work over VLAN. So if VLAN
hardware tagging is disabled explicitly clear TSO over VLAN. While
I'm here allow disabling VLAN TX checksum offloading.
Tested by: Liudas < liudasb <> centras dot lt >
MFC after: 10 days
existing code caused problems with some SCSI controllers.
A new sysctl kern.cam.ada.spindown_shutdown has been added that controls
whether or not to spin-down disks when shutting down.
Spinning down the disks unloads/parks the heads - this is
much better than removing power when the disk is still
spinning because otherwise an Emergency Unload occurs which may cause damage
to the actuator.
PR: kern/140752
Submitted by: olli
Reviewed by: arundel
Discussed with: mav
MFC after: 2 weeks
converted to use the mii_phy_add_media()/mii_phy_setmedia() pair instead
of mii_add_media()/mii_anar() remove the latter.
- Declare mii_media mii_media_table static as it shouldn't be used outside
of mii_physubr.c.
MFC after: never
interface also has such connectors.
- In tl_attach() unify three different ways of obtaining the device and
vendor IDs and remove the now obsolete tl_dinfo from tl_softc.
- Given that tlphy(4) only handles the integrated PHYs of NICs driven by
tl(4) make it only probe on the latter.
- Switch mlphy(4) and tlphy(4) to use mii_phy_add_media()/mii_phy_setmedia().
- Simplify looking for the respective companion PHY in mlphy(4) and tlphy(4)
by ignoring the native one by just comparing the device_t's directly rather
than the device name.
- Use mii_phy_add_media() instead of mii_add_media(). I'm not sure how
this driver actually managed to work before as mii_add_media() is
intended to be used to gether with mii_anar() while mii_phy_add_media()
is intended to be used with mii_phy_setmedia(), however this driver
mii_add_media() along with mii_phy_setmedia().
to use the generic hash32_buf() function. Although adding the
bytes seemed sufficient for UFS and ZFS, since most of the bytes
are the same for file handles on the same volume, this might not
be sufficient for other file systems. Use of a generic function
also seems preferable to one specific to NFSv4.
Suggested by: gleb.kurtsou at gmail.com
MFC after: 10 days
to BCM6906 A0/A2. This should fix a long standing BCM5906 A2 lockup
issues. Data sheet explicitly mentions BCM5906 A0, A1 and A2 use
de-pipelined mode on these revisions.
Special thanks to Buganini who tried all combinations of
experimental patches for more than 10 days.
Tested by: Buganini <buganini <> gmail dot com >
legacy and IPv6 route destination address.
Previously in case of IPv6, there was a memory overwrite due to not enough
space for the IPv6 address.
PR: kern/122565
MFC After: 2 weeks
the association notification), the included information though always
contains an elem block with an odd number of bytes. We handle the last
byte as if it might contain a whole elem block, this of course is not
true as one byte is not enough to hold a block, we therefore discard the
complete frame. The solution here is to subtract one from the actual
notification length, this is also what the Linux driver does. With this
change the frames ends exactly where the last elem block ends.
This commit also reverts r214160 which is no longer required and now even
wrong.
MFC after: 1 week
information that device is already suspended or that device is using
one-time key and suspend is not supported.
- 'geli suspend -a' silently skips devices that use one-time key, this is fine,
but because we log which device were suspended on the console, log also which
devices were skipped.
server so that it will work better for non-UFS file systems.
The new function simply sums the bytes of the fh_fid field
of fhandle_t.
MFC after: 10 days
auto-negotiation results in half-duplex operation, excess collision
on the ethernet link may cause internal chip delays that may result
in subsequent valid frames being dropped due to insufficient
receive buffer resources. The workaround is to choose de-pipeline
method as a flow control decision for SDI. De-pipeline method
allows only 1 data in TxMbuf at a time such that a request to RDMA
from SDI is made only when TxMbuf is empty. Thanks for david for
providing detailed errata information.
PCI-express or PCI-X capabilities if we are running in a virtual machine.
- Whitelist the Intel 82440 chipset used by QEMU.
Tested by: jfv
MFC after: 1 week
MOD_UNLOAD. This makes it possible to add custom hooks for other module
events.
Return EOPNOTSUPP when there is no callback available.
Pointed out by: jhb
Reviewed by: jhb
MFC after: 1 month
break the loop instead. We want to run the code after the while loop
to set an associd and capinfo. If we don't do this net80211 will drop
frames because it assumes the node has not yet been associated.
MFC after: 1 week
may be left. This fixes a memory leak that can occur when tracing is
disabled on a process via disabling tracing of a specific file (or if
an I/O error occurs with the tracefile) if the process's next system
call is exit(). The trace disabling code clears p_traceflag, so exit1()
doesn't do any KTRACE-related cleanup leading to the leak. I chose to
make the free'ing of pending records synchronous rather than patching
exit1().
- Move KTRACE-specific logic out of kern_(exec|exit|fork).c and into
kern_ktrace.c instead. Make ktrace_mtx private to kern_ktrace.c as a
result.
MFC after: 1 month
r214049 for the regular NFS server, so that it will not do
a VOP_LOOKUP() of ".." when at the root of a file system
when performing a ReaddirPlus RPC.
MFC after: 10 days
ignore BARs that are invalid due to having a size of zero, not to ignore
BARs with an existing base of zero. While here, reorganize the code
slightly to make the intent clearer.
Reported by: avg
MFC after: 1 week
its value as a loop invariant. Currently this is a no-op because
'atomic_cmpset_int()' clobbers all memory on current architectures.
- Use atomic_fetchadd_int() instead of an atomic_cmpset_int() loop to drop
a reference in vmspace_free().
Reviewed by: alc
MFC after: 1 month
- move all the chunks into one file, which allows to hide SIOCGIFCONF32
global definition as well.
- replace __amd64__ with proper COMPAT_FREEBSD32 around.
- handle 32bit capacity before going into the handler itself instead of
doing internal 32bit specific changes within it (e.g. as it's done for
SIOCGDEFIFACE32_IN6).
- use explicitely sized types for ABI compat.
Approved by: kib (mentor)
MFC after: 2 weeks
Specification Rev. 1.2. Rename pp_pcmcsr field of PM capabilities to pp_bse
to avoid further confusions and adjust some comments accordingly. The real
PMCSR (Power Management Control/Status Register) is PCIR_POWER_STATUS and
it is actually BSE (PCI-to-PCI Bridge Support Extensions) register.
Before this change if you wanted to suspend your laptop and be sure that your
encryption keys are safe, you had to stop all processes that use file system
stored on encrypted device, unmount the file system and detach geli provider.
This isn't very handy. If you are a lucky user of a laptop where suspend/resume
actually works with FreeBSD (I'm not!) you most likely want to suspend your
laptop, because you don't want to start everything over again when you turn
your laptop back on.
And this is where geli suspend/resume steps in. When you execute:
# geli suspend -a
geli will wait for all in-flight I/O requests, suspend new I/O requests, remove
all geli sensitive data from the kernel memory (like encryption keys) and will
wait for either 'geli resume' or 'geli detach'.
Now with no keys in memory you can suspend your laptop without stopping any
processes or unmounting any file systems.
When you resume your laptop you have to resume geli devices using 'geli resume'
command. You need to provide your passphrase, etc. again so the keys can be
restored and suspended I/O requests released.
Of course you need to remember that 'geli suspend' won't clear file system
cache and other places where data from your geli-encrypted file system might be
present. But to get rid of those stopping processes and unmounting file system
won't help either - you have to turn your laptop off. Be warned.
Also note, that suspending geli device which contains file system with geli
utility (or anything used by 'geli resume') is not very good idea, as you won't
be able to resume it - when you execute geli(8), the kernel will try to read it
and this read I/O request will be suspended.
of proper value. It caused bunch of "EMPTY CRPB" messages and potentially
may cause premature requests completion, which could cause data corruption.
For most cases it seems enough to just reread register to get proper value.
To protect against worse cases - erase processed queue entries with
impossible values and ignore them if problem still happen.
released a drver lock for shared interrupt case such that it caused
panic. While I'm here check whether driver is still running before
serving TX/RX handler.
Reported by: Jerahmy Pocott < QUAKENET1 <> optusnet dot com dot au >
Tested by: Jerahmy Pocott < QUAKENET1 <> optusnet dot com dot au >
MFC after: 3 days
receive two back-to-back send BDs with less than or equal to 8
total bytes then the device may hang. The two back-to-back send
BDs must be in the same frame for this failure to occur.
Thanks to davidch for detailed errata information.
Reviewed by: davidch
o Add support for backend devices (e.g. blkback)
o Implement extensions to the Xen para-virtualized block API to allow
for larger and more outstanding I/Os.
o Import a completely rewritten block back driver with support for fronting
I/O to both raw devices and files.
o General cleanup and documentation of the XenBus and XenStore support code.
o Robustness and performance updates for the block front driver.
o Fixes to the netfront driver.
Sponsored by: Spectra Logic Corporation
sys/xen/xenbus/init.txt:
Deleted: This file explains the Linux method for XenBus device
enumeration and thus does not apply to FreeBSD's NewBus approach.
sys/xen/xenbus/xenbus_probe_backend.c:
Deleted: Linux version of backend XenBus service routines. It
was never ported to FreeBSD. See xenbusb.c, xenbusb_if.m,
xenbusb_front.c xenbusb_back.c for details of FreeBSD's XenBus
support.
sys/xen/xenbus/xenbusvar.h:
sys/xen/xenbus/xenbus_xs.c:
sys/xen/xenbus/xenbus_comms.c:
sys/xen/xenbus/xenbus_comms.h:
sys/xen/xenstore/xenstorevar.h:
sys/xen/xenstore/xenstore.c:
Split XenStore into its own tree. XenBus is a software layer built
on top of XenStore. The old arrangement and the naming of some
structures and functions blurred these lines making it difficult to
discern what services are provided by which layer and at what times
these services are available (e.g. during system startup and shutdown).
sys/xen/xenbus/xenbus_client.c:
sys/xen/xenbus/xenbus.c:
sys/xen/xenbus/xenbus_probe.c:
sys/xen/xenbus/xenbusb.c:
sys/xen/xenbus/xenbusb.h:
Split up XenBus code into methods available for use by client
drivers (xenbus.c) and code used by the XenBus "bus code" to
enumerate, attach, detach, and service bus drivers.
sys/xen/reboot.c:
sys/dev/xen/control/control.c:
Add a XenBus front driver for handling shutdown, reboot, suspend, and
resume events published in the XenStore. Move all PV suspend/reboot
support from reboot.c into this driver.
sys/xen/blkif.h:
New file from Xen vendor with macros and structures used by
a block back driver to service requests from a VM running a
different ABI (e.g. amd64 back with i386 front).
sys/conf/files:
Adjust kernel build spec for new XenBus/XenStore layout and added
Xen functionality.
sys/dev/xen/balloon/balloon.c:
sys/dev/xen/netfront/netfront.c:
sys/dev/xen/blkfront/blkfront.c:
sys/xen/xenbus/...
sys/xen/xenstore/...
o Rename XenStore APIs and structures from xenbus_* to xs_*.
o Adjust to use of M_XENBUS and M_XENSTORE malloc types for allocation
of objects returned by these APIs.
o Adjust for changes in the bus interface for Xen drivers.
sys/xen/xenbus/...
sys/xen/xenstore/...
Add Doxygen comments for these interfaces and the code that
implements them.
sys/dev/xen/blkback/blkback.c:
o Rewrite the Block Back driver to attach properly via newbus,
operate correctly in both PV and HVM mode regardless of domain
(e.g. can be in a DOM other than 0), and to deal with the latest
metadata available in XenStore for block devices.
o Allow users to specify a file as a backend to blkback, in addition
to character devices. Use the namei lookup of the backend path
to automatically configure, based on file type, the appropriate
backend method.
The current implementation is limited to a single outstanding I/O
at a time to file backed storage.
sys/dev/xen/blkback/blkback.c:
sys/xen/interface/io/blkif.h:
sys/xen/blkif.h:
sys/dev/xen/blkfront/blkfront.c:
sys/dev/xen/blkfront/block.h:
Extend the Xen blkif API: Negotiable request size and number of
requests.
This change extends the information recorded in the XenStore
allowing block front/back devices to negotiate for optimal I/O
parameters. This has been achieved without sacrificing backward
compatibility with drivers that are unaware of these protocol
enhancements. The extensions center around the connection protocol
which now includes these additions:
o The back-end device publishes its maximum supported values for,
request I/O size, the number of page segments that can be
associated with a request, the maximum number of requests that
can be concurrently active, and the maximum number of pages that
can be in the shared request ring. These values are published
before the back-end enters the XenbusStateInitWait state.
o The front-end waits for the back-end to enter either the InitWait
or Initialize state. At this point, the front end limits it's
own capabilities to the lesser of the values it finds published
by the backend, it's own maximums, or, should any back-end data
be missing in the store, the values supported by the original
protocol. It then initializes it's internal data structures
including allocation of the shared ring, publishes its maximum
capabilities to the XenStore and transitions to the Initialized
state.
o The back-end waits for the front-end to enter the Initalized
state. At this point, the back end limits it's own capabilities
to the lesser of the values it finds published by the frontend,
it's own maximums, or, should any front-end data be missing in
the store, the values supported by the original protocol. It
then initializes it's internal data structures, attaches to the
shared ring and transitions to the Connected state.
o The front-end waits for the back-end to enter the Connnected
state, transitions itself to the connected state, and can
commence I/O.
Although an updated front-end driver must be aware of the back-end's
InitWait state, the back-end has been coded such that it can
tolerate a front-end that skips this step and transitions directly
to the Initialized state without waiting for the back-end.
sys/xen/interface/io/blkif.h:
o Increase BLKIF_MAX_SEGMENTS_PER_REQUEST to 255. This is
the maximum number possible without changing the blkif
request header structure (nr_segs is a uint8_t).
o Add two new constants:
BLKIF_MAX_SEGMENTS_PER_HEADER_BLOCK, and
BLKIF_MAX_SEGMENTS_PER_SEGMENT_BLOCK. These respectively
indicate the number of segments that can fit in the first
ring-buffer entry of a request, and for each subsequent
(sg element only) ring-buffer entry associated with the
"header" ring-buffer entry of the request.
o Add the blkif_request_segment_t typedef for segment
elements.
o Add the BLKRING_GET_SG_REQUEST() macro which wraps the
RING_GET_REQUEST() macro and returns a properly cast
pointer to an array of blkif_request_segment_ts.
o Add the BLKIF_SEGS_TO_BLOCKS() macro which calculates the
number of ring entries that will be consumed by a blkif
request with the given number of segments.
sys/xen/blkif.h:
o Update for changes in interface/io/blkif.h macros.
o Update the BLKIF_MAX_RING_REQUESTS() macro to take the
ring size as an argument to allow this calculation on
multi-page rings.
o Add a companion macro to BLKIF_MAX_RING_REQUESTS(),
BLKIF_RING_PAGES(). This macro determines the number of
ring pages required in order to support a ring with the
supplied number of request blocks.
sys/dev/xen/blkback/blkback.c:
sys/dev/xen/blkfront/blkfront.c:
sys/dev/xen/blkfront/block.h:
o Negotiate with the other-end with the following limits:
Reqeust Size: MAXPHYS
Max Segments: (MAXPHYS/PAGE_SIZE) + 1
Max Requests: 256
Max Ring Pages: Sufficient to support Max Requests with
Max Segments.
o Dynamically allocate request pools and segemnts-per-request.
o Update ring allocation/attachment code to support a
multi-page shared ring.
o Update routines that access the shared ring to handle
multi-block requests.
sys/dev/xen/blkfront/blkfront.c:
o Track blkfront allocations in a blkfront driver specific
malloc pool.
o Strip out XenStore transaction retry logic in the
connection code. Transactions only need to be used when
the update to multiple XenStore nodes must be atomic.
That is not the case here.
o Fully disable blkif_resume() until it can be fixed
properly (it didn't work before this change).
o Destroy bus-dma objects during device instance tear-down.
o Properly handle backend devices with powef-of-2 sector
sizes larger than 512b.
sys/dev/xen/blkback/blkback.c:
Advertise support for and implement the BLKIF_OP_WRITE_BARRIER
and BLKIF_OP_FLUSH_DISKCACHE blkif opcodes using BIO_FLUSH and
the BIO_ORDERED attribute of bios.
sys/dev/xen/blkfront/blkfront.c:
sys/dev/xen/blkfront/block.h:
Fix various bugs in blkfront.
o gnttab_alloc_grant_references() returns 0 for success and
non-zero for failure. The check for < 0 is a leftover
Linuxism.
o When we negotiate with blkback and have to reduce some of our
capabilities, print out the original and reduced capability before
changing the local capability. So the user now gets the correct
information.
o Fix blkif_restart_queue_callback() formatting. Make sure we hold
the mutex in that function before calling xb_startio().
o Fix a couple of KASSERT()s.
o Fix a check in the xb_remove_* macro to be a little more specific.
sys/xen/gnttab.h:
sys/xen/gnttab.c:
Define GNTTAB_LIST_END publicly as GRANT_REF_INVALID.
sys/dev/xen/netfront/netfront.c:
Use GRANT_REF_INVALID instead of driver private definitions of the
same constant.
sys/xen/gnttab.h:
sys/xen/gnttab.c:
Add the gnttab_end_foreign_access_references() API.
This API allows a client to batch the release of an array of grant
references, instead of coding a private for loop. The implementation
takes advantage of this batching to reduce lock overhead to one
acquisition and release per-batch instead of per-freed grant reference.
While here, reduce the duration the gnttab_list_lock is held during
gnttab_free_grant_references() operations. The search to find the
tail of the incoming free list does not rely on global state and so
can be performed without holding the lock.
sys/dev/xen/xenpci/evtchn.c:
sys/dev/xen/evtchn/evtchn.c:
sys/xen/xen_intr.h:
o Implement the bind_interdomain_evtchn_to_irqhandler API for HVM mode.
This allows an HVM domain to serve back end devices to other domains.
This API is already implemented for PV mode.
o Synchronize the API between HVM and PV.
sys/dev/xen/xenpci/xenpci.c:
o Scan the full region of CPUID space in which the Xen VMM interface
may be implemented. On systems using SuSE as a Dom0 where the
Viridian API is also exported, the VMM interface is above the region
we used to search.
o Pass through bus_alloc_resource() calls so that XenBus drivers
attaching on an HVM system can allocate unused physical address
space from the nexus. The block back driver makes use of this
facility.
sys/i386/xen/xen_machdep.c:
Use the correct type for accessing the statically mapped xenstore
metadata.
sys/xen/interface/hvm/params.h:
sys/xen/xenstore/xenstore.c:
Move hvm_get_parameter() to the correct global header file instead
of as a private method to the XenStore.
sys/xen/interface/io/protocols.h:
Sync with vendor.
sys/xeninterface/io/ring.h:
Add macro for calculating the number of ring pages needed for an N
deep ring.
To avoid duplication within the macros, create and use the new
__RING_HEADER_SIZE() macro. This macro calculates the size of the
ring book keeping struct (producer/consumer indexes, etc.) that
resides at the head of the ring.
Add the __RING_PAGES() macro which calculates the number of shared
ring pages required to support a ring with the given number of
requests.
These APIs are used to support the multi-page ring version of the
Xen block API.
sys/xeninterface/io/xenbus.h:
Add Comments.
sys/xen/xenbus/...
o Refactor the FreeBSD XenBus support code to allow for both front and
backend device attachments.
o Make use of new config_intr_hook capabilities to allow front and back
devices to be probed/attached in parallel.
o Fix bugs in probe/attach state machine that could cause the system to
hang when confronted with a failure either in the local domain or in
a remote domain to which one of our driver instances is attaching.
o Publish all required state to the XenStore on device detach and
failure. The majority of the missing functionality was for serving
as a back end since the typical "hot-plug" scripts in Dom0 don't
handle the case of cleaning up for a "service domain" that is not
itself.
o Add dynamic sysctl nodes exposing the generic ivars of
XenBus devices.
o Add doxygen style comments to the majority of the code.
o Cleanup types, formatting, etc.
sys/xen/xenbus/xenbusb.c:
Common code used by both front and back XenBus busses.
sys/xen/xenbus/xenbusb_if.m:
Method definitions for a XenBus bus.
sys/xen/xenbus/xenbusb_front.c:
sys/xen/xenbus/xenbusb_back.c:
XenBus bus specialization for front and back devices.
MFC after: 1 month
added with hw.pci.do_powerstate but the PCI version was splitted into two
separate tunables later and now this is completely stale. To make it worse,
PCI devices enumerated in ACPI tree ignore this tunable as it is handled by
a function in acpi_pci.c instead.
knowledges from the file. All PCI devices enumerated in ACPI tree must use
correct one from acpi_pci.c any way. Reduce duplicate codes as we did for
pci.c in r213905. Do not return ESRCH from PCIB_POWER_FOR_SLEEP method.
When the method is not found, just return zero without modifying the given
default value as it is completely optional. As a side effect, the return
state must not be NULL. Note there is actually no functional change by
removing ESRCH because acpi_pcib_power_for_sleep() always returns zero.
Adjust debugging messages and add new ones under bootverbose to help
debugging device power state related issues.
Reviewed by: jhb, imp (earlier versions)
a critical section as apparently required by both. I don't think either
belongs in the event timer front-ends but the callback should handle
this as necessary instead just like for example intr_event_handle()
does but this is how the other architectures currently handle it, either
explicitly or implicitly.
- Further rename and reword references to hardclock as this front-end no
longer has a notion of actually calling it.
This can happen if the algos are built as modules but are not loaded. If
the selected ratectl algo is not available, try to load it (The load
module functions does nothing currently). Add a dummy ratectl algo which
always selects the first available rate. Use that one if the desired algo
is not available.
MFC after: 1 week
and print a diagnostic if the call fails.
This avoids a panic when a device with an invalid name is attempted to
be registered. For example the label class gets device names from
untrusted input.
Reviewed by: freebsd-geom
not support VFS_VGET, like msdosfs, do not call VOP_LOOKUP() for
dotdot on the root directory. Our filesystems expect that VFS handles
dotdot lookups on root on its own.
Reported and tested by: kevlo
MFC after: 2 weeks
by both clients. Since the NLM uses various fields of the
nfsmount structure, those fields were extracted and put in a
separate nfs_mountcommon structure stored in sys/nfs/nfs_mountcommon.h.
This structure also has a function pointer for a function that
extracts the required information from the mount point and nfs vnode
for that particular client, for information stored differently by the
clients.
Reviewed by: jhb
MFC after: 2 weeks
fixed the issues with file descriptor locks, but the same problems are
present for vnode lock/user map lock.
If the nfs_asyncio() cannot find the free nfsiod, schedule task to
create new nfsiod and return error. This causes fall back to the
synchronous i/o for nfs_strategy(), or does not start read at all in
the case of readahead. The caller that holds vnode and potentially
user map lock does not wait for kproc_create() to finish, preventing
the LORs.
The change effectively reverts r203072, because we never hand off the
request to newly created nfsiod thread anymore.
Reviewed by: jhb
Tested by: jhb, pluknet
MFC after: 3 weeks
- Implement proper combined mode decoding for Intel controllers to properly
identify SATA and PATA channels and associate ATA channels with SATA ports.
This fixes wrong reporting and in some cases hard resets to wrong SATA ports.
- Improve SATA registers support to handle hot-plug events and potentially
interface errors. For ICH5/6300ESB chipsets these registers accessible via
PCI config space. For later ones they may be accessible via PCI BAR(5).
- For controllers not generating interrupts on hot-plug events, implement
periodic status polling. Use it to detect hot-plug on Intel and VIA
controllers. Same probably could also be used for Serverworks and SIS.
root file system (starting with devfs and a synthesized configuration) can
contain directives for mounting another file system as root. The old root
file system is re-mounted under the new root file system (with /.mount or
/mnt as the mount point) to allow access to the underlying file system.
The configuration allows for creating vnode-backed memory disks that can
subsequently be mounted as root. This allows for an efficient and low-
cost way to distribute and boot FreeBSD software images that reside on
some storage media.
When trying a mount, the kernel will wait for the device in question to
arrive. The timeout is configurable and is part of the configuration.
This allows arbitrarily complex GEOM configurations to be constructed
on the fly.
A side-effect of this change is that all root specifications, whether
compiled into the kernel or typed at the prompt can contain root mount
options.
To protect against malicious software, we demand that the file name is at
a particular location (i.e. appended to the mdio structure) for it to be
treated as in-kernel.
counter on SMU-based systems, which causes FreeBSD to reject the RTC time
when used in a dual-boot environment. Since we don't use the day-of-week
counter anyway, solve this by just not checking that it matches.
MFC after: 3 weeks
drift in order to achieve a more stable clock as the tick intervals may
vary in the first place. In fact I haven't seen this code kick in when
in oneshot-mode so just skip it in that case.
- There's no need to explicitly stop the (S)TICK counter in oneshot-mode
with every tick as it just won't trigger again with the (S)TICK compare
register set to a value in the past (with a wrap-around once every ~195
years of uptime at 1.5 GHz this isn't something we have to worry about
in practice).
- Given that we'll disable interrupts completely anyway there's no
need to enter critical sections.
- In thr_exit() and kthread_exit(), only remove thread from
hash if it can directly exit, otherwise let exit1() do it.
- In thread_suspend_check(), fix cleanup code when thread needs
to exit.
This change seems fixed the "Bad link elm " panic found by
Peter Holm.
Stress testing: pho
This should make vnode_pager_getpages path a bit shorter and clearer.
Also this should eliminate problems with partially valid pages.
Having this method opens room for future optimizations.
To do: try to satisfy other pages besides the required one taking into
account tradeofs between number of page faults, read throughput and read
latency. Also, eventually vop_putpages should be added too.
Reviewed by: kib, mm, pjd
MFC after: 3 weeks
Make it harder to exploit certain in_control() related races between the
intiial lookup at the beginning and the time we will remove the entry
from the lists by re-checking that entry is still in the list before
trying to remove it.
(*) It is believed that with the current code and locking strategy we
cannot completely fix all race.
Reported by: Nima Misaghian (nima_misa hotmail.com) on net@ 20100817
Tested by: Nima Misaghian (nima_misa hotmail.com) (original version)
PR: kern/146250
Submitted by: Mikolaj Golub (to.my.trociny gmail.com) (different version)
MFC after: 1 week
important for USB 2.0 devices and some of them reported to have problems
with large transactions. But USB 3.0 benchmarks show that limited number
of transactions per second on USB makes impossible to reach high transfer
speeds without using bigger transactions.
On my tests this change allows to read up to 220MB/s from USB-attached SSD
(at block size of 256-512KB), comparing to only 113MB/s without it.
Reviewed by: hselasky
over all interfaces to make sure the address will neither change nor be
freed while we are working on it.
PR: kern/146250
Submitted by: Mikolaj Golub (to.my.trociny gmail.com)
MFC after: 1 week
Old scrolls tell that once upon a time IBM AT BIOS was known to put some
useful system diagnostic information into RTC NVRAM. It is not really
known if and for how long PC BIOSes followed that convention, but I
believe that many, if not all, modern BIOSes do not do that any more
(not mentioning other types of x86 firmware).
Some diagnostic bits don't even make any sense any longer.
The check results in confusing messages upon boot on some systems.
So I am removing it.
Discussed with: bde, jhb, mav
MFC after: 3 weeks
Move debug.ncnegfactor to vfs.ncnegfactor [1].
Provide some descriptions for the namecache related sysctls [1].
Based on the submission by: Rogier R. Mulhuijzen <drwilco drwilco net> [1]
MFC after: 2 weeks
X-MFC-note: remove debug.ncnegfactor in HEAD after MFC
too coarse grained to be useful and the default value significantly degrades TCP
performance on moderate to high bandwidth-delay product paths with non-zero loss
(e.g. 5+Mbps connections across the public Internet often suffer).
Replace the outgoing mechanism with an individual per-queue limit based on the
number of MSS segments that fit into the socket's receive buffer. This should
strike a good balance between performance and the potential for resource
exhaustion when FreeBSD is acting as a TCP receiver. With socket buffer
autotuning (which is enabled by default), the reassembly queue tracks the
socket buffer and benefits too.
As the XXX comment suggests, my testing uncovered some unexpected behaviour
which requires further investigation. By using so->so_rcv.sb_hiwat
instead of sbspace(&so->so_rcv), we allow more segments to be held across both
the socket receive buffer and reassembly queue than we probably should. The
tradeoff is better performance in at least one common scenario, versus a devious
sender's ability to consume more resources on a FreeBSD receiver.
Sponsored by: FreeBSD Foundation
Reviewed by: andre, gnn, rpaulo
MFC after: 2 weeks
"net.inet.tcp.reass.maxsegments" sysctl variables to be based on UMA zone
stats. The value returned by the cursegments sysctl is approximate owing to
the way in which uma_zone_get_cur is implemented.
- Discontinue use of V_tcp_reass_qsize as a global reassembly segment count
variable in the reassembly implementation. The variable was used without
proper synchronisation and was duplicating accounting done by UMA already. The
lack of synchronisation was particularly problematic on SMP systems
terminating many TCP sessions, resulting in poor TCP performance for
connections with non-zero packet loss.
Sponsored by: FreeBSD Foundation
Reviewed by: andre, gnn, rpaulo (as part of a larger patch)
MFC after: 2 weeks
rounding. The same value can also be obtained with uma_zone_get_max, but this
change avoids a caller having to make two back-to-back calls.
Sponsored by: FreeBSD Foundation
Reviewed by: gnn, jhb
- Add uma_zone_get_cur which returns the current approximate occupancy of
a zone. This is useful for providing stats via sysctl amongst other things.
Sponsored by: FreeBSD Foundation
Reviewed by: gnn, jhb
MFC after: 2 weeks
the NIC drivers as well as the PHY drivers to take advantage of the
mii_attach() introduced in r213878 to get rid of certain hacks. For
the most part these were:
- Artificially limiting miibus_{read,write}reg methods to certain PHY
addresses; we now let mii_attach() only probe the PHY at the desired
address(es) instead.
- PHY drivers setting MIIF_* flags based on the NIC driver they hang
off from, partly even based on grabbing and using the softc of the
parent; we now pass these flags down from the NIC to the PHY drivers
via mii_attach(). This got us rid of all such hacks except those of
brgphy() in combination with bce(4) and bge(4), which is way beyond
what can be expressed with simple flags.
While at it, I took the opportunity to change the NIC drivers to pass
up the error returned by mii_attach() (previously by mii_phy_probe())
and unify the error message used in this case where and as appropriate
as mii_attach() actually can fail for a number of reasons, not just
because of no PHY(s) being present at the expected address(es).
This file was missed in r213893.
PowerMac7,2.
- The fcu driver lets us read and write the fan RPMs for all fans in the
PowerMac7,2. This driver is PowerMac specific.
- The ds1775 is a driver to read the temperature for the drive bay sensor.
- The max6690 is another driver to read temperatures. Here it is used to
read the inlet, the backside and the U3 heatsink temperature.
An additional driver, the ad7417, will follow later.
Thanks to nwhitehorn for guiding me through this driver development.
Approved by: nwhitehorn (mentor)
use pmap_extract() rather than pmap_kextract() on direct map addresses.
Thus, pmap_extract() needs to be able to deal with 1GB page mappings if
we are to use 1GB page mappings for the direct map. (See r197580.)
the NIC drivers as well as the PHY drivers to take advantage of the
mii_attach() introduced in r213878 to get rid of certain hacks. For
the most part these were:
- Artificially limiting miibus_{read,write}reg methods to certain PHY
addresses; we now let mii_attach() only probe the PHY at the desired
address(es) instead.
- PHY drivers setting MIIF_* flags based on the NIC driver they hang
off from, partly even based on grabbing and using the softc of the
parent; we now pass these flags down from the NIC to the PHY drivers
via mii_attach(). This got us rid of all such hacks except those of
brgphy() in combination with bce(4) and bge(4), which is way beyond
what can be expressed with simple flags.
While at it, I took the opportunity to change the NIC drivers to pass
up the error returned by mii_attach() (previously by mii_phy_probe())
and unify the error message used in this case where and as appropriate
as mii_attach() actually can fail for a number of reasons, not just
because of no PHY(s) being present at the expected address(es).
Reviewed by: jhb, yongari
- fix the leak of command struct on error
- simplify the cleanup logic
- EINPROGRESS is not a fatal error
- buggy comment and error message
Reviewed by: ken