#ifdef MSDOSFS_LARGE to run-time checks to see if "-o large" was specified.
Test case provided by Oliver Fromme:
truncate -s 200G test.img
mdconfig -a -t vnode -f test.img -u 9
newfs_msdos -s 419430400 -n 1 /dev/md9 zip250
mount -t msdosfs /dev/md9 /mnt # should fail
mount -t msdosfs -o large /dev/md9 /mnt # should succeed
PR: 105964
Requested by: Oliver Fromme <olli lurza secnetix de>
Tested by: trhodes
MFC after: 2 weeks
SOCK_DGRAM (i.e. UDP), respect the value configured earlier. This allows
TCP NFS root mounts using e.g. the boot.nfsroot.options="tcp" tunable.
In this case some of the connection parameters like the retry timer were
previously set appropriately for TCP but inappropriately for the UDP
socket that was actually used, leading to e.g. extremely long recovery
times (O(hours)) after a nfs server reboot.
Reviewed by: mohans
MFC After: 2 weeks
/usr/share/examples/etc/bsd-style-copyright. I've fixed a
few minor wording and formatting differences.
Approved by: matk, Hannu Savolainen <hannu@opensound.com>
Reviewed by: imp
We can't bind to a CPU which is not yet on-line, so add code that wait for
CPUs to go on-line before binding to them.
Reported by: Alin-Adrian Anton <aanton@spintech.ro>
MFC after: 2 weeks
configured and that in turn controls the descriptor layout; the rate
control module has no business peeking inside the descriptor but until
we can change the api so the driver records the tx rates and passes
them deal with it
unsolicited pin sense event and need manual control to turn off speaker
volume while attaching headphone.
Tested by: Ingeborg Hellemo <Ingeborg.Hellemo@cc.uit.no>
Disable global Acer + ALC883 headphone automute settings since there are
few models that does not respect this and causing broken behaviour.
Reported/Tested by: Pavel Argentov <argentoff@rtelekom.ru>
When the disk has an error, it will now print SMART
instead of 'Unknown CMD'.
PR: kern/93368
Submitted by: Garry Belka <garry at NetworkPhysics dot COM>
Approved by: sos
firmware in that module (eventhough this is a programming error) - drop the
reference to the module again.
Submitted by: Benjamin Close
MFC after: 3 days
the space allocated for the double fault handler since this space
is otherwise unused till the time a double fault occurs.
This change should have been committed alongside r1.127 of
"exception.S", but I somehow missed doing so.
Problem reported by: jeff
Pointy hat to: jkoshy
not used in any of our code. Also remove explicit padding variable that
kept the bpf_d structure the same size before and after the change in
select implementation, since binary compatibility is not required for this
data structure on 7-CURRENT.
IPv6 over point-to-point gif(4) tunnels.
These revisions caused a host route to the destination of a
point-to-point gif(4) interface to not get installed when the interface
and destination addresses were assigned. This caused
"no route to host" errors when trying to send traffic over the
interface. The first packet arriving inbound over the tunnel,
however, would cause the correct route to get installed, allowing
subsequent outbound traffic to be routed correctly.
gif(4) interfaces with prefix lengths of less than 128 bits
(i.e. no explicit destination address assigned) were not affected
by this bug.
This bug fix is a possible candidate for a 6.2-RELEASE errata note.
Approved by: jhay (original committer)
Discussed with: jhay, JINMEI Tatuya
MFC after: 3 days
blade systems, such as the Dell 1955 and the Intel SBXD132.
Development hardware for this work was provided by Broadcom and iXsystems.
A SBXD132 blade for testing was provided by Iron Systems.
is replaced with BSD gzip, let's make it possible to
distinguish between the two with a __FreeBSDversion bump,
just in case some developers want it.
Suggested by: linimon
minimize IPIs and rescheduling when scheduling like tasks while keeping
latency low for important threads.
1) An idle thread is running.
2) The current thread is worse than realtime and the new thread is
better than realtime. Realtime to realtime doesn't preempt.
3) The new thread's priority is less than the threshold.
with bypass header, to send it out to userland.
- Use ng_ppp_bypass() in ng_ppp_proto_recv().
- Use ng_ppp_bypass() in ng_ppp_comp_recv() and in
ng_ppp_crypt_recv() if compression or encryption is
disabled, respectively.
- Any LCP packet goes directly to ng_ppp_bypass(), instead
of passing through PPP stack.
- Any non-LCP packet on disabled link is discarded. This
is behavior defined in RFC.
Submitted by: Alexander Motin <mav alkar.net>
support sched_4bsd.
- Rename the KTR level for non schedgraph parsed events. They take event
space from things we'd like to graph.
- Reset our slice value after we sleep. The slice is simply there to
prevent starvation among equal priorities. A thread which had almost
exhausted it's slice and then slept doesn't need to be rescheduled a
tick after it wakes up.
- Set the maximum slice value to a more conservative 100ms now that it is
more accurately enforced.
carp_clone_destroy() we are on a safe side, we don't need to
unlock the cif, that can me already non-existent at this point.
Reported by: Anton Yuzhaninov <citrin rambler-co.ru>
apparently be confused by short TCP segments that have been manually
padded to the minimum ethernet frame size. The driver does short frame
padding in software as a workaround for a bug in the 8169 PCI devices
that causes short IP fragments to be corrupted due to an apparent
conflict between the hardware autopadding and hardware IP checksumming.
To fix this, we avoid software padding for short TCP segments, since
the hardware seems to autopad and checksum these correctly (even the
older 8169 NICs get these right). Short UDP packets appear to be
handled correctly in all cases. This should work around the IP header
checksum bug in the 8169 while not tripping the TCP checksum bug in
the 8111B/8168B and 8101E.
collisions with nfsclient's names. Even static names should have a
unique prefix so that they can be debugged easily.
Hide the unused colliding variable nfsv3_commit_on_close in "#if 0"
together with other unused sysctl variables. Duplicating the nfs sysctl
under nfs4 is probably just a bug.
Fix some nearby style bugs.
Remove duplicate $FreeBSD$.
nfs_* to nfs4_* to avoid collisions with nfsclient's names. Even
static names should have a unique prefix so that they can be debugged
easily.
Most of the renamed functions can probably be shared. nfs4_cmount()
and nfs4_sync() are identical to the nfs_* versions, and all the others
except nfs4_vfsops() seem to be idendentical except for style bugs,
missing support for mountroot, and bugs.
Fix some nearby style bugs.
Remove duplicate $FreeBSD$.
of duplicating it except for larger style bugs in the copy.
Fix some nearby style bugs (including a harmless type mismatch)
in and near the remaining copy.
This is part of fixing collisions of the 2 nfs*client's names. Even
static names should have a unique prefixes so that they can be debugged
easily.
zone. Cluster allocations fail when this happens. Also processes that may have
blocked on cluster allocations will never be woken up. Thanks to rwatson for
an overview of the issue and pointers to the mbuma paper and his tool to dump
out UMA zones.
Reviewed by: andre@
maxpages on a zone is woken up, with the rest never being woken up as
a result of the ZFLAG_FULL flag being cleared. Wakeup all such blocked
procsses instead. This change introduces a thundering herd, but since
this should be relatively infrequent, optimizing this (by introducing
a count of blocked processes, for example) may be premature.
Reviewd by: ups@
negative. Use unsigned integers for sleep and run time so this doesn't
disturb sched_interact_score(). This should fix the invalid interactive
priority panics reported by several users.
o remove errata_a0 and introduce the corresponding flags into 'errata'.
o introduce a new errata for K8, namely some platform might set the
PENDING_BIT but aren't able to unset it, also don't loop forever
waiting PENDING_BIT being cleared.
o try to introduce a workaround for the PENDING_BIT stuck problem,
o support now half multipliers for K8.
Tested by: Abdullah Al-Marrie
Approved by: njl
file are after snaplock, while other ffs device buffers are before
snaplock in global lock order. By itself, this could cause deadlock
when bdwrite() tries to flush dirty buffers on snapshotted ffs. If,
during the flush, COW activity for snapshot needs to allocate block
and ffs_alloccg() selects the cylinder group that is being written
by bdwrite(), then kernel would panic due to recursive buffer lock
acquision.
Avoid dealing with buffers in bdwrite() that are from other side of
snaplock divisor in the lock order then the buffer being written. Add
new BOP, bop_bdwrite(), to do dirty buffer flushing for same vnode in
the bdwrite(). Default implementation, bufbdflush(), refactors the code
from bdwrite(). For ffs device buffers, specialized implementation is
used.
Reviewed by: tegge, jeff, Russell Cattelan (cattelan xfs org, xfs changes)
Tested by: Peter Holm
X-MFC after: 3 weeks (if ever: it changes ABI)
- Define our own maybe_preempt() as sched_preempt(). We want to be able
to preempt idlethread in all cases.
- Define our idlethread to require preemption to exit.
- Get the cpu estimation tick from sched_tick() so we don't have to worry
about errors from a sampling interval that differs from the time
domain. This was the source of sched_priority prints/panics and
inaccurate pctcpu display in top.
for clock.h, so changing th i386 clock.h broke it. MFi386 (not tested):
Cleaned up declaration and initialization of clock_lock. It is only
used by clock code, so don't export it to the world for machdep.c to
initialize. There is a minor problem initializing it before it is
used, since although clock initialization is split up so that parts
of it can be done early, the first part was never done early enough
to actually work. Split it up a bit more and do the first part as
late as possible to document the necessary order. The functions that
implement the split are still bogusly exported.
Cleaned up initialization of the i8254 clock hardware using the new
split. Actually initialize it early enough, and don't work around it
not being initialized in DELAY() when DELAY() is called early for
initialization of some console drivers.
This unfortunately moves a little more code before the early debugger
breakpoint so that it is harder to debug. The ordering of console and
related initialization is delicate because we want to do as little as
possible before the breakpoint, but must initialize a console.
setrunqueue() was mostly empty. The few asserts and thread state
setting were moved to the individual schedulers. sched_add() was
chosen to displace it for naming consistency reasons.
- Remove adjustrunqueue, it was 4 lines of code that was ifdef'd to be
different on all three schedulers where it was only called in one place
each.
- Remove the long ifdef'd out remrunqueue code.
- Remove the now redundant ts_state. Inspect the thread state directly.
- Don't set TSF_* flags from kern_switch.c, we were only doing this to
support a feature in one scheduler.
- Change sched_choose() to return a thread rather than a td_sched. Also,
rely on the schedulers to return the idlethread. This simplifies the
logic in choosethread(). Aside from the run queue links kern_switch.c
mostly does not care about the contents of td_sched.
Discussed with: julian
- Move the idle thread loop into the per scheduler area. ULE wants to
do something different from the other schedulers.
Suggested by: jhb
Tested on: x86/amd64 sched_{4BSD, ULE, CORE}.
used by clock code, so don't export it to the world for machdep.c to
initialize. There is a minor problem initializing it before it is
used, since although clock initialization is split up so that parts
of it can be done early, the first part was never done early enough
to actually work. Split it up a bit more and do the first part as
late as possible to document the necessary order. The functions that
implement the split are still bogusly exported.
Cleaned up initialization of the i8254 clock hardware using the new
split. Actually initialize it early enough, and don't work around it
not being initialized in DELAY() when DELAY() is called early for
initialization of some console drivers.
This unfortunately moves a little more code before the early debugger
breakpoint so that it is harder to debug. The ordering of console and
related initialization is delicate because we want to do as little as
possible before the breakpoint, but must initialize a console.
the mount options list with vfs_deleteopt(). At this point, the export
information is saved in mp->mnt_export, so we can delete
the "export" mount option from mp->mnt_optnew and mp->mnt_opt.
This fixes read-write/read-only update mounts (mount -u -o rw, mount -u -o ro)
of NFS exported directories.
For some reason, I could only reproduce the problem with a configuration
supplied by Andre:
- "options QUOTA" enabled in kernel config
- "/ -maproot=root 10.0.1.105" in /etc/exports
Reported by: kris, Andre Guibert de Bruet <andy siliconlandmark com>,
Andrzej Tobola <ato iem pw edu pl>
Tested by: Andre Guibert de Bruet
addresses shall access invalid descriptor DMA addresses on PCIe
hardwares and then panicked the system.
To fix it set descriptor DMA addresses before enabling Tx and Rx
such that hardware can see valid descriptor DMA addresses. Also
set RL_EARLY_TX_THRESH before starting Tx and Rx.
Reported by: steve.tell AT crashmail DOT de
Tested by: steve.tell AT crashmail DOT de
Obtained from: NetBSD
MFC after: 1 week
- First off, device drivers really do need to know if they are allocating
MSI or MSI-X messages. MSI requires allocating powerof2() messages for
example where MSI-X does not. To address this, split out the MSI-X
support from pci_msi_count() and pci_alloc_msi() into new driver-visible
functions pci_msix_count() and pci_alloc_msix(). As a result,
pci_msi_count() now just returns a count of the max supported MSI
messages for the device, and pci_alloc_msi() only tries to allocate MSI
messages. To get a count of the max supported MSI-X messages, use
pci_msix_count(). To allocate MSI-X messages, use pci_alloc_msix().
pci_release_msi() still handles both MSI and MSI-X messages, however.
As a result of this change, drivers using the existing API will only
use MSI messages and will no longer try to use MSI-X messages.
- Because MSI-X allows for each message to have its own data and address
values (and thus does not require all of the messages to have their
MD vectors allocated as a group), some devices allow for "sparse" use
of MSI-X message slots. For example, if a device supports 8 messages
but the OS is only able to allocate 2 messages, the device may make the
best use of 2 IRQs if it enables the messages at slots 1 and 4 rather
than default of using the first N slots (or indicies) at 1 and 2. To
support this, add a new pci_remap_msix() function that a driver may call
after a successful pci_alloc_msix() (but before allocating any of the
SYS_RES_IRQ resources) to allow the allocated IRQ resources to be
assigned to different message indices. For example, from the earlier
example, after pci_alloc_msix() returned a value of 2, the driver would
call pci_remap_msix() passing in array of integers { 1, 4 } as the
new message indices to use. The rid's for the SYS_RES_IRQ resources
will always match the message indices. Thus, after the call to
pci_remap_msix() the driver would be able to access the first message
in slot 1 at SYS_RES_IRQ rid 1, and the second message at slot 4 at
SYS_RES_IRQ rid 4. Note that the message slots/indices are 1-based
rather than 0-based so that they will always correspond to the rid
values (SYS_RES_IRQ rid 0 is reserved for the legacy INTx interrupt).
To support this API, a new PCIB_REMAP_MSIX() method was added to the
pcib interface to change the message index for a single IRQ.
Tested by: scottl
control data but no payload data is passed.
Change m_uiotombuf() to return at least one empty mbuf if the requested
length was zero. Add comment to sosend_dgram and sosend_generic().
Diagnoses by: jhb
Regression test by: rwatson
Pointy hat to. andre
--------------------------
[Deadlock] is caused by a lock order reversal in vfs_lookup(), where
[some] process is trying to lock a directory vnode, that is the parent
directory of covered vnode) while holding an exclusive vnode lock on
covering vnode.
A simplified scenario:
root fs var fs
/ A / (/var) D
/var B /log (/var/log) E
vfs lock C vfs lock F
Within each file system, the lock order is clear: C->A->B and F->D->E
When traversing across mounts, the system can choose between two lock orders,
but everything must then follow that lock order:
L1: C->A->B
|
+->F->D->E
L2: F->D->E
|
+->C->A->B
The lookup() process for namei("/var") mixes those two lock orders:
VOP_LOOKUP() obtains B while A is held
vfs_busy() obtains a shared lock on F while A and B are held (follows L1,
violates L2)
vput() releases lock on B
VOP_UNLOCK() releases lock on A
VFS_ROOT() obtains lock on D while shared lock on F is held
vfs_unbusy() releases shared lock on F
vn_lock() obtains lock on A while D is held (violates L1, follows L2)
dounmount() follows L1 (B is locked while F is drained).
Without unmount activity, vfs_busy() will always succeed without blocking
and the deadlock isn't triggered (the system behaves as if L2 is followed).
With unmount, you can get 4 processes in a deadlock:
p1: holds D, want A (in lookup())
p2: holds shared lock on F, want D (in VFS_ROOT())
p3: holds B, want drain lock on F (in dounmount())
p4: holds A, want B (in VOP_LOOKUP())
You can have more than one instance of p2.
The reversal was introduced in revision 1.81 of src/sys/kern/vfs_lookup.c and
MFCed to revision 1.80.2.1, probably to avoid a cascade of vnode locks when nfs
servers are dead (VFS_ROOT() just hangs) spreading to the root fs root vnode.
- Tor Egge
To fix the LOR, ups@ noted that when crossing the mount point, ni_dvp
is actually not used by the callers of namei. Thus, placeholder deadfs
vnode vp_crossmp is introduced that is filled into ni_dvp.
Idea by: ups
Reviewed by: tegge, ups, jeff, rwatson (mac interaction)
Tested by: Peter Holm
MFC after: 2 weeks
sparc64 GENERIC and the sound device drivers known working on sparc64
to use bus_get_dma_tag() to obtain the parent DMA tag so we can get rid
of the sparc64_root_dma_tag kludge eventually. Except for ath(4), sk(4),
stge(4) and ti(4) these changes are runtime tested (unless I booted up
the wrong kernels again...).
a power saving mode otherwise.
- If the thread is already bound in sched_bind() unbind it before
re-binding it to a new cpu. I don't like these semantics but they are
expected by some code in the tree. Patch by jkoshy.
Dont expose em->shared to the outside world before its properly
initialized. Might not affect anything but its at least a better
coding style.
Dont expose em via p->p_emuldata until its properly initialized.
This also enables us to get rid of some locking and simplify the
code because we are workin on a local copy.
In linux_fork and linux_vfork create the process in stopped state
to be sure that the new process runs with fully initialized emuldata
structure [1]. Also fix the vfork (both in linux_clone and linux_vfork)
race that could result in never woken up process [2].
Reported by: Scot Hetzel [1]
Suggested by: jhb [2]
Reviewed by: jhb (at least some important parts)
Submitted by: rdivacky
Tested by: Scot Hetzel (on amd64)
Change 2 comments (in the new code) to comply to style(9).
Suggested by: jhb
work when we start requiring this.
- Don't specify an alignment when creating our own parent DMA tag;
the supported DMA engines require no alignment constraint (f.e. the
LANCE child does though) and it's no inherited by the child DMA
tags anyway (which probably is a bug though).
- Fix whitespace nits.
These are shared-memory variants based on Am79C90-compatible chips
that apart from the missing DMA engine are similar to the 'ledma'
variant including using a (pseudo-)bus/device for the buffer that
the actual LANCE device hangs off from. The performance of these is
close to that of the 'ledma' one, like expected at a few times the
CPU load though.
1) Do not do quota accounting for the actual quota data files
or for file system snapshot files ("system" files). This
prevents a deadlock descibed in PR kern/30958 if the kernel
ever has to grow the quota file. Snapshot files were already
exempt from the quota checks, but this change generalized the check.
2) Fix a cast that caused extremely large uids/gids to incorrectly
write the quota information to the data file at a truncated
value for a uint_t32 id value. The incorrect cast caused quota
files in this case to be around 4GB in size, with the correct cast
they can now be 131GB in size. Also related to PR kern/30958.
3) Check for what appear to be negative UIDs/GIDs and not account
for them. This prevents the quota files from becoming 131GB in
size and causing quotacheck to run forever at bootup. This could
also cause the kernel to try and expand the quota file, which might
deadlock due to the issue in #1. kern/30958 and kern/38156
(and some much older closed PR's).
4) With the deadlock problems gone, the kernel can now expand the
size of the quota database files if it needs to.
5) Pass in the i-node count change value to chkiq and chkiqchg as an
int, like it used to be before the common routine was split up
into 2 different routines to increase / decrease the i-node in-use
count. Prevents an underflow on the i-node count. Related
to PR kern/89247.
6) Prevent the block usage from growing slowly if a file system is
full and the write was denied due to that fact. PR kern/89247.
Some of these changes require an updated quotacheck to prevent
the creation of huge (131GB) quota data files (item #3).
#1/#4 probably fixes a lot of the random hangs when quotas are enabled,
possibly some of the jail hangs.
unlike documented may not take effect without an initialization. So
don't invoke (*sc_mediachange) directly in lance_mediachange() but
go through lance_init_locked(). It's suboptimal to impose this for
all chips but given that besides the affected PCI bus front-end the
only other front-end which supports media selection is and likely
ever will be the 'ledma' front-end I see not enough reason to break
the in-driver API for this (though one could argue both ways here).
the ipi settings. If NEEDRESCHED is set and an ipi is later delivered
it will clear it rather than cause extra context switches. However, if
we miss setting it we can have terrible latency.
- In sched_bind() correctly implement bind. Also be slightly more
tolerant of code which calls bind multiple times. However, we don't
change binding if another call is made with a different cpu. This
does not presently work with hwpmc which I believe should be changed.
front of isp_init so we can read NVRAM even if we're role ISP_NONE.
Prepare for reintroduction of channels (for FC) for N-Port
Virtualization.
Fix a botch in handle assignment that caused us to nuke one device
when a new one arrives and end up with two devices with the same
identity in the virtual target mapping table.
ifmedia_init() invocation. IFM_IMASK makes only sense here when all of
the maxium of 32 PHYs on each one MII bus support disjoint sets of media,
which generally isn't the case (though it would be nice if we had a way
to let NIC drivers indicate that for the few card models where the PHY
configuration is known/fixed and IFM_IMASK actually makes sense).
- Add and use a miibus_print_child() for the bus_print_child method which
additionally prints the PHY number (which actually is the PHY address)
so one can figure out the media instance <-> PHY number mapping from the
PHY driver attach output. This is intented to be usefull in situations
where the addresses of the PHYs on the bus are known (f.e. of internal/
integrated PHYs) so one can feed the appropriate media instance number
to ifconfig(8) (with the upcoming change for ifconfig(8)).
This is more or less inspired by the NetBSD mii_print().
multiple PHYs. In case some PHYs currently driven by ukphy(4) exhibit
problems when isolating due to incomplete implementations or silicon bugs
we'll need to add specific drivers for these. Looking at NetBSD and
OpenBSD I don't expect problems here though (quite the contrary; we still
seem to set MIIF_NOISOLATE without good reason in a bunch of PHY drivers).
- Fix a style(9) whitespace nit.
capability rather than hardcoded offsets for a particular card. While
I'm here, expand the constants some.
- Change the ahd(4) driver to use pci_find_extcap() to locate the PCI-X
capability to keep up with the first change.
Reviewed by: scottl, gibbs (earlier version)
- Switch back to direct modification of remote CPU run queues. This added
a lot of complexity with questionable gain. It's easy enough to
reimplement if it's shown to help on huge machines.
- Re-implement the old tdq_transfer() call as tdq_pickidle(). Change
sched_add() so we have selectable cpu choosers and simplify the logic
a bit here.
- Implement tdq_pickpri() as the new default cpu chooser. This algorithm
is similar to Solaris in that it tries to always run the threads with
the best priorities. It is actually slightly more complex than
solaris's algorithm because we also tend to favor the local cpu over
other cpus which has a boost in latency but also potentially enables
cache sharing between the waking thread and the woken thread.
- Add a bunch of tunables that can be used to measure effects of different
load balancing strategies. Most of these will go away once the
algorithm is more definite.
- Add a new mechanism to steal threads from busy cpus when we idle. This
is enabled with kern.sched.steal_busy and kern.sched.busy_thresh. The
threshold is the required length of a tdq's run queue before another
cpu will be able to steal runnable threads. This prevents most queue
imbalances that contribute the long latencies.
headers in .S directly rather than getting to their macros through
genassym.c/assym.s so there are less headers genassym.c has to be
kept in sync with.
While at it fix some stytle(9) bugs (indentation, prototype format,
sort headers, etc) and remove trailing whitespace.
that can be used to check whether receive data is ready, i.e. whether
the subsequent call of uart_poll() should return a char, and unlike
uart_poll() doesn't actually receive data.
- Remove the device-specific implementations of uart_poll() and implement
uart_poll() in terms of uart_getc() and the newly added uart_rxready()
in order to minimize code duplication.
- In sunkbd(4) take advantage of uart_rxready() and use it to implement
the polled mode part of sunkbd_check() so we don't need to buffer a
potentially read char in the softc.
- Fix some mis-indentation in sunkbd_read_char().
Discussed with: marcel