Andre:
First let's get major new features into the kernel in a clean and nice way,
and then start optimizing. In this case we don't have any obfuscation that
makes later profiling and/or optimizing difficult in any way.
Requested by: csjp, sam
either src or dst) fails. This closes a potential data loss case
(where the fsync failed with ENOSPC, for example).
Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com
Obtained from: Yahoo!
Kick off a readahead only when sequential access is detected. This
eliminates wasteful readaheads in random file access.
Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com
Obtained from: Yahoo!
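A minimal sketch of the idea, not the committed code: readahead is only worth starting when a read begins exactly where the previous one ended. The structure and function names here (ra_state, ra_should_readahead) are hypothetical and do not reflect the real nfsnode fields.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-file state; not the real nfsnode layout. */
struct ra_state {
	int64_t	next_offset;	/* offset one past the end of the previous read */
	int	seq_count;	/* consecutive sequential reads seen so far */
};

/*
 * Record a read of len bytes at offset and report whether a readahead
 * should be kicked off. A read counts as sequential only if it starts
 * where the previous read ended; any other offset resets the counter.
 */
static bool
ra_should_readahead(struct ra_state *st, int64_t offset, int64_t len)
{
	if (offset == st->next_offset)
		st->seq_count++;
	else
		st->seq_count = 0;
	st->next_offset = offset + len;
	return (st->seq_count > 0);
}
```

With next_offset initialized to -1, the first read and any random jump return false, so no readahead is issued for them.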
mechanism used by pfil. This shared locking mechanism will remove
a nasty lock order reversal which occurs when ucred based rules
are used which results in hard locks while mpsafenet=1.
So this removes the debug.mpsafenet=0 requirement when using
ucred based rules with IPFW.
It should be noted that this locking mechanism does not guarantee
fairness between read and write locks, and that it will favor
firewall chain readers over writers. This seemed acceptable since
write operations to firewall chains protected by this lock tend to
be less frequent than reads.
Reviewed by: andre, rwatson
Tested by: myself, seanc
Silence on: ipfw@
MFC after: 1 month
prematurely report that they were full and/or to panic the kernel
with the message ``ffs_clusteralloc: allocated out of group''.
Submitted by: Henry Whincup <henry@jot.to>
MFC after: 1 week
completely. For some reason (that I am still curious about) we no longer
manage to finish the initialization before the timeouts run for the first
time, leading to panics when using uninitialized mutexes etc.
The root of this problem is that we currently first link a domain to the
domains list and only later initialize the domain's protocols. This should
be reworked in the future, but with the current API it is not possible in
all situations. We settle with this lazy fix for now.
Tested by: gnn, ru, myself
stepped the process to the system call), we need to clear the trap flag
from the new frame unless the debugger had set PF_FORK on the parent.
Otherwise, the child will receive a (likely unexpected) SIGTRAP when it
executes the first instruction after returning to userland.
Reviewed by: bde
MFC after: 3 days
[Changes listed only since last public release 0.9.12.14; for changes
prior to that consult the CVS logs at http://madwifi.sourceforge.net]
o reorg directory structure to have a single set of public binary builds
shared by all systems
o support for new parts (all shipping pci/cardbus parts to this date work)
o new capabilities for identifying various chip features
o set/get tx power cap for supporting 802.11h information element
o revised api for set/get tx queue properties
o support for updating CTS in frames when doing packet bursting
o support for querying which tx queues have pending interrupts
here but it includes completed 802.11g, WPA, 802.11i, 802.1x, WME/WMM,
AP-side power-save, crypto plugin framework, authenticator plugin framework,
and access control plugin framework.
security.mac.portacl.autoport_exempt
This sysctl exempts binds to port '0' as long as IP_PORTRANGELOW hasn't
been set on the socket. This is quite useful as it allows applications
to use automatic binding without adding overly broad rules for the
binding of port 0. This sysctl defaults to enabled.
This is a slight variation on the patch submitted by the contributor.
MFC after: 2 weeks
Submitted by: Michal Mertl <mime at traveller dot cz>
the PCI bus. We presently have no drivers for these devices, so they
are powered down. This is undesirable behavior since it breaks the
system when the base peripherals go away suddenly in the middle of
boot.
# if we ever get generic drivers for memory and/or base peripherals, then
# we can remove the tests here.
revision 1.55, the address parameter to vnode_pager_addr() was changed
from an unsigned 32-bit quantity to a signed 64-bit quantity. However,
an out-of-range check on the address was not updated. Consequently,
memory-mapped I/O on files greater than 2GB could cause a kernel panic.
Since the address is now a signed 64-bit quantity, the problem resolution
is simply to remove a cast.
Reviewed by: bde@ and tegge@
PR: 73010
MFC after: 1 week
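To illustrate why a leftover 32-bit cast breaks once the address is a signed 64-bit quantity, here is a small sketch. The exact expression in vnode_pager_addr() is an assumption, and both function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Stale form (assumed): the 32-bit cast survives from the days when the
 * address was a 32-bit quantity. Offsets at or above 2GB wrap negative
 * on common two's-complement compilers and are rejected by mistake.
 */
static bool
addr_rejected_with_cast(int64_t address)
{
	return ((int32_t)address < 0);
}

/* Fixed form: compare the signed 64-bit value directly. */
static bool
addr_rejected(int64_t address)
{
	return (address < 0);
}
```

An offset of exactly 2GB is rejected by the first form and accepted by the second, while a genuinely negative address is rejected by both.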
as this may cause deadlocks.
This should fix kern/72123.
Discussed with: jhb
Tested by: Nik Azim Azam, Andy Farkas, Flack Man, Aykut KARA
Izzet BESKARDES, Jens Binnewies, Karl Keusgen
Approved by: sam (mentor)
split the conversion of the remaining three filesystems out from the root
mounting changes, so in one go:
cd9660:
Convert to nmount.
Add omount compat shims.
Remove dedicated rootfs mounting code.
Use vfs_mountedfrom()
Rely on vfs_mount.c calling VFS_STATFS()
nfs(client):
Convert to nmount (the simple way, mount_nfs(8) is still necessary).
Add omount compat shims.
Drop COMPAT_PRELITE2 mount arg compatibility.
ffs:
Convert to nmount.
Add omount compat shims.
Remove dedicated rootfs mounting code.
Use vfs_mountedfrom()
Rely on vfs_mount.c calling VFS_STATFS()
Remove vfs_omount() method, all filesystems are now converted.
Remove MNTK_WANTRDWR, handling RO/RW conversions is a filesystem
task, and they all do it now.
Change rootmounting to use DEVFS trampoline:
vfs_mount.c:
Mount devfs on /. Devfs needs no 'from' so this is clean.
Symlink /dev to /. This makes it possible to look up /dev/foo.
Mount "real" root filesystem on /.
Surgically move the devfs mountpoint from under the real root
filesystem onto /dev in the real root filesystem.
Remove now unnecessary getdiskbyname().
kern_init.c:
Don't do devfs mounting and rootvnode assignment here, it was
already handled by vfs_mount.c.
Remove now unused bdevvp(), addaliasu() and addalias(). Put the
few necessary lines in devfs where they belong. This eliminates the
second-last source of bogo vnodes, leaving only the lemming-syncer.
Remove rootdev variable, it has no meaning in a global context and
was not trustworthy anyway. Correct information is provided by
statfs(/).
These devices should be probed first because they are at fixed
locations and cannot be turned off. ISA PNP devices, on the other
hand, can be turned off and often can be flexible in the resources
they use. Probe them last, as always.
e.g. if the firmware load fails. Shortish MFC timeout so this can be merged
before the 4.11 freeze.
PR: kern/34306
Submitted by: gibbs
Approved by: gibbs, imp (mentor)
MFC after: 5 days
upcalls which do RPC header parsing and match up the reply with the
request. NFS calls now sleep on the nfsreq structure. This enables
us to eliminate the NFS recvlock.
Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com
Same comment as msdosfs applies: It would be nice if we had generic option
names for charset conversions.
Use vfs_mountedfrom(). Rely on vfs_mount.c calling VFS_STATFS().
- Do not put/remove node references, since this is no longer
needed.
- Remove timerActive flag, use callout flags.
- Schedule next callout after doing current one.
Reviewed by: archie
Approved by: julian (mentor)
the sx lock was used previously because we might sleep allocating
additional memory by using auto-extending sbufs. However, we no longer
do this, instead retaining the user-submitted rule string, so mutexes
can be used instead. Annotate the reason for not using the sbuf-related
rule-to-string code with a comment.
Switch to using TAILQ_CONCAT() instead of manual list copying, as it's
O(1), reducing the rule replacement step under the mutex from O(2N) to
O(2).
Remove now unneeded vnode-related includes.
MFC after: 2 weeks
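A rough sketch of the list swap using TAILQ_CONCAT() from <sys/queue.h>. The struct rule layout and the rules_replace() helper are hypothetical; in the kernel the two concatenations would run under the mutex, and the old rules would be freed after it is dropped.

```c
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical rule entry; the real rule structure differs. */
struct rule {
	int			id;
	TAILQ_ENTRY(rule)	link;
};
TAILQ_HEAD(rule_head, rule);

/*
 * Swap in a freshly parsed rule list. Each TAILQ_CONCAT() moves a whole
 * list by relinking head pointers, so the cost does not depend on the
 * number of rules. On return, old holds the previous rules, active holds
 * the new ones, and staged is empty.
 */
static void
rules_replace(struct rule_head *active, struct rule_head *staged,
    struct rule_head *old)
{
	TAILQ_INIT(old);
	TAILQ_CONCAT(old, active, link);	/* active is left empty */
	TAILQ_CONCAT(active, staged, link);	/* staged is left empty */
}
```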
- Change the cached mtime to a 'struct timespec' from a
time_t. Improving the precision of the cached mtime tightens up
NFS' "close-to-open" consistency considerably.
- Always force an over-the-wire consistency check from nfs_open()
(unless the file is marked modified). This further improves
NFS' "close-to-open" consistency.
Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com
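The effect of the extra precision can be sketched as follows. Both helpers are hypothetical and only show the comparison, not the actual attribute cache code.

```c
#include <stdbool.h>
#include <time.h>

/* Full-precision comparison: any difference in seconds or nanoseconds. */
static bool
mtime_changed(const struct timespec *cached, const struct timespec *server)
{
	return (cached->tv_sec != server->tv_sec ||
	    cached->tv_nsec != server->tv_nsec);
}

/* Old-style comparison on whole seconds only. */
static bool
mtime_changed_seconds_only(time_t cached, time_t server)
{
	return (cached != server);
}
```

Two modifications within the same second are invisible to the seconds-only check but are caught when the full timespec is compared.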
Add a vfs_cmount() function which converts omount argument structure
to nmount arguments.
Convert vfs_omount() to vfs_mount() and parse nmount arguments.
This is 100% compatible with existing userland.
Later on, but before userland gets converted to nmount, we may want
to revisit the names of the mount options; for instance, it may make
sense to use consistent options for charset conversion etc.
vnode EXCLUSIVE lock. This prevents threads from adding pages to
the vnode while an invalidation is in progress, closing potential
races. In the bioread() path, callers acquire the SHARED vnode lock
- so while an invalidate was in progress, it was possible to fault
in new pages onto the vnode causing the invalidation to take a while
or fail. We saw these races at Yahoo! with very large files+heavy
concurrent access. Forcing an upgrade to EXCLUSIVE lock before doing
the invalidation closes all these races.
Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com
vfs_flagopt() for binary/boolean options.
vfs_getopts() for string options
vfs_filteropt() to check for unknown options.
vfs_scanopt() for scanf() like processing of options.
Also add function for setting the stat.f_mntfromname field.
socket callbacks or similar callers, from both the NFS client and the
server.
Instituted nfsm_dissect_nonblock(), nfsm_dissect_xx_nonblock(). And
nfsm_disct() now takes an extra M_TRYWAIT/M_DONTWAIT argument.
Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com
is safe to turn off the nfsnode's NMODIFIED flag.
- Move the check for signals to the top of the loop where we loop
around the dirty buffers on the vnode, scheduling writes. This
ensures that we'll break out of the flush operation on reception of
a signal.
Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com
Root filesystems (like NFS) don't have an associated disk device,
and even if they had, the exact semantics would be filesystem
dependent and should be implemented there.
userland and a dedicated system call to get replies.
The vnode-bypass of fifos broke this into a panic.
Ditch all the magic and create a device /dev/nfslock instead, and
use that for both directions. Apart from the shorter path, this is
also faster because the device driver runs Giant free using the
vnode bypass.
Noticed by: marcel
actually is a property of the northbridge and applies to all PCI/PCI-X/PCIe
devices in the system, though only PCIe devices will respond to registers
higher than 256. This uses per-CPU pools of temporary mappings so that
the whole 256MB of configuration space doesn't have to be mapped all at
once. While the sf_buf API was considered for this, the fact that it
requires sleep locks and can return failure made it unsuitable for this use.
For now only the Intel Grantsdale and Lindenhurst (925 and 752x) chipsets are
supported. Since there doesn't appear to be a compatible way to determine
northbridge support, new chipsets will have to be explicitly added in the
future.
zero-copy receive of jumbo frames. This eliminates the need for the
jumbo frame allocator implemented in kern/uipc_jumbo.c and sys/jumbo.h.
Remove it.
Note: Zero-copy receive of jumbo frames did not work without these changes;
I believe there was insufficient locking on the jumbo vm object.
Tested by: ken@
Discussed with: gallatin@
properly support bounce buffers and resource shortages. This allows the
driver to work properly and reliably with more than 4GB of RAM. Of the
three data paths that exist in the driver, (block, CAM, ioctl), the ioctl
path has not been well tested with these changes due to difficulty with
finding an application that uses it that actually works.
Sponsored by: The FreeBSD Foundation and FreeBSD Systems, Inc.
and annotate that nfs_mountroot assumes it is OK to step on the
values in the global NFSv3 diskless structure as the mountroot
function is called during a serialized part of the boot, before
any other NFS client activity occurs.
MFC after: 2 weeks
doesn't. Most of the implementations have grown weeds for this so they
copy some fields from mnt_stat if the passed argument isn't that.
Fix this the cleaner way: Always call the implementation on mnt_stat
and copy that in toto to the VFS_STATFS argument if different.
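A minimal sketch of that wrapper shape, assuming simplified stand-in types. The sketch_ names are hypothetical and the real struct mount and struct statfs carry many more fields.

```c
/* Simplified stand-ins for struct statfs and struct mount. */
struct sketch_statfs {
	long	f_bfree;
	char	f_mntfromname[32];
};

struct sketch_mount {
	struct sketch_statfs mnt_stat;
	int	(*fs_statfs)(struct sketch_mount *, struct sketch_statfs *);
};

/*
 * Always let the filesystem fill in mp->mnt_stat, then copy the whole
 * structure to the caller's buffer if it is a different one. The
 * implementation no longer needs to copy fields such as f_mntfromname
 * from mnt_stat by hand.
 */
static int
sketch_vfs_statfs(struct sketch_mount *mp, struct sketch_statfs *sbp)
{
	int error;

	error = mp->fs_statfs(mp, &mp->mnt_stat);
	if (error != 0)
		return (error);
	if (sbp != &mp->mnt_stat)
		*sbp = mp->mnt_stat;
	return (0);
}

/* Example implementation that only updates the counters. */
static int
sketch_fs_statfs(struct sketch_mount *mp, struct sketch_statfs *sbp)
{
	(void)mp;
	sbp->f_bfree = 42;
	return (0);
}
```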
commit. In the new world order, the transitive closure on the vector
operations is not precomputed. As such, it's unsafe to actually use
any of the function pointers in an indirect function call. They can
be null, and we need to use the default vector in that case.
This is mostly a quick fix for the four function pointers that are
ed explicitly. A more generic or scalable solution is likely to see
the light of day.
No pathos on: current@
tcpip_fillheaders()
tcp_discardcb()
tcp_close()
tcp_notify()
tcp_new_isn()
tcp_xmit_bandwidth_limit()
Fix a locking comment in tcp_twstart(): the pcbinfo will be locked (and
is asserted).
MFC after: 2 weeks
inp->inp_moptions pointer, so that ip_getmoptions() can perform
necessary locking when doing non-atomic reads.
Lock the inpcb by default to copy any data to local variables, then
unlock before performing sooptcopyout().
MFC after: 2 weeks
to implement the sanity check should have been changed when we converted
the implementation of vm_pindex_t from 32 to 64 bits. (Thus, RELENG_4 is
not affected.) The consequence of this error would be a legitimate write to
an extremely large file being treated as an errant attempt to write
metadata.
Discussed with: tegge@
modifications to the inpcb IP options mbuf:
- Lock the inpcb before passing it into ip_pcbopts() in order to prevent
simultaneous reads and read-modify-writes that could result in races.
- Pass the inpcb reference into ip_pcbopts() instead of the option chain
pointer in the inpcb.
- Assert the inpcb lock in ip_pcbopts().
- Convert one or two uses of a pointer as a boolean or an integer
comparison to a comparison with NULL for readability.
- Always check that index number passed from userland
is <= NG_NETFLOW_MAXIFACES. [1]
- Increase NG_NETFLOW_MAXIFACES up to 512. [2]
Noticed by: Roman Palagin [1]
Requested by: Yuri Y. Bushmelev [2]
MFC after: 1 week
header. pf finds the first TCP/UDP/ICMP6 header to filter by traversing
the header chain. In the case where headers are skipped, the protocol
checksum verification used the wrong length (included the skipped headers),
leading to incorrectly mismatching checksums. Such IPv6 packets with
headers were silently dropped.
Discovered by: Bernhard Schmidt
MFC after: 1 week
and that I've verified things seem to basically work. I was able to
boot and hot plug usb devices. Please let me know if this causes
problems for anybody.
The push down of giant has proceeded to the point that this will start
to matter more and more.
anywhere in the DAG. This includes configurations that are not
allowed by the EFI specification.
o Reject a GPT partition table if it's not preceded by a PMBR.
There's no need to preserve the MBR partitioning anymore as GPT
is mature and with the first bullet extending the applicability
of GPT, it's better to be a bit more strict.
If we are resuming non-MPSAFE drivers, they need Giant held for them.
This may fix some obscure suspend/resume problems. It has fixed keyrate
setting problems that were triggered by cardbus (MPSAFE) changing the
ordering for syscons resume (non-MPSAFE). Also, add some asserts that
Giant is held in our suspend/resume and shutdown methods.
Found by: iedowse
MFC after: 2 days
because we know it then and we need it when inserting a component which
wasn't destroyed while device was running.
Reported by: Michael Handler <handler@grendel.net>
MFC after: 1 week
and if so call it.
The cmount method will gather and interpret omount() style arguments,
and issue a kern_[v]mount() call to execute the corresponding nmount
operation.
- Initialize sc->pcn_type during ATTACH as softc contents may not survive
from PROBE.
- Print out chip-id to assist with ongoing pcn(4) debugging efforts.
non-standard BIOSen. We used to implement this in local patches but
now that ACPI-CA has merged/re-implemented most of our fixes, they were
no longer needed and we just needed to turn this knob on. Also, remove
an unnecessary cast.
Tested by: phk
we really want vs. the size changing 'long' (i386 vs. AMD64).
This fixes the problem with DRM with Radeon's on AMD64.
Submitted by: Jung-uk Kim <jkim@niksun.com>
back on again in resume. Override the default of D3 with the value the
BIOS specifies in _SxD, if present. Skip serial devices (PNP05xx) since
they seem to hang when set to D3 and may require special driver support.
Also, skip non-type 0 PCI devices (i.e., bridges) since we don't yet
save/restore their config space and that seems to be necessary.
If this gives you trouble with suspend/resume, you can disable the new
ACPI and PCI power behavior separately with these tunables & sysctls:
debug.acpi.do_powerstate
hw.pci.do_powerstate
Approved by: imp (pci)
Tested by: acpi@ (numerous)
instead for the time being. Intel should fix this.
Note that if this commit is correct, it is made on the vendor branch.
We expect the Intel folks to fix it, and we don't want to unnecessarily
take files off the vendor branch.
Approved by: njl
MFC after: 1 week
initializations but we did have lofty goals and big ideals.
Adjust to more contemporary circumstances and gain type checking.
Replace the entire vop_t frobbing thing with properly typed
structures. The only casualty is that we can not add a new
VOP_ method with a loadable module. History has not given
us reason to believe this would ever be feasible in the
first place.
Eliminate in toto VOCALL(), vop_t, VNODEOP_SET() etc.
Give coda correct prototypes and function definitions for
all vop_()s.
Generate a bit more data from the vnode_if.src file: a
struct vop_vector and prototype typedefs for all vop methods.
Add a new vop_bypass() and make vop_default be a pointer
to another struct vop_vector.
Remove a lot of vfs_init since vop_vector is ready to use
from the compiler.
Cast various vop_mumble() to void * with uppercase name,
for instance VOP_PANIC, VOP_NULL etc.
Implement VCALL() by making vdesc_offset the offsetof() the
relevant function pointer in vop_vector. This is disgusting
but since the code is generated by a script comparatively
safe. The alternative for nullfs etc. would be much worse.
Fix up all vnode method vectors to remove casts so they
become typesafe. (The bulk of this is generated by scripts)
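The offsetof()-based lookup with fallback through vop_default can be sketched roughly as below. This is an illustration under assumptions, not the generated code: the struct layout, the vop_fn_t signature, vop_resolve() and the sample vectors are all hypothetical.

```c
#include <stddef.h>

/* Hypothetical, heavily reduced method vector. */
typedef int vop_fn_t(void *args);

struct vop_vector_sketch {
	const struct vop_vector_sketch *vop_default;
	vop_fn_t	*vop_open;
	vop_fn_t	*vop_close;
};

/*
 * Find the method stored at the given byte offset. A slot may be NULL
 * because the closure over the default vectors is not precomputed, so
 * walk the vop_default chain until a vector supplies the method.
 */
static vop_fn_t *
vop_resolve(const struct vop_vector_sketch *vop, size_t offset)
{
	vop_fn_t *fn;

	for (; vop != NULL; vop = vop->vop_default) {
		fn = *(vop_fn_t *const *)((const char *)vop + offset);
		if (fn != NULL)
			return (fn);
	}
	return (NULL);
}

/* Sample methods and vectors for demonstration only. */
static int default_open(void *a) { (void)a; return (1); }
static int default_close(void *a) { (void)a; return (2); }
static int myfs_open(void *a) { (void)a; return (3); }

static const struct vop_vector_sketch default_vnodeops = {
	.vop_default = NULL,
	.vop_open = default_open,
	.vop_close = default_close,
};

static const struct vop_vector_sketch myfs_vnodeops = {
	.vop_default = &default_vnodeops,
	.vop_open = myfs_open,
	/* vop_close left NULL: resolved through vop_default. */
};
```

Calling vop_resolve(&myfs_vnodeops, offsetof(struct vop_vector_sketch, vop_close)) yields default_close, which is why a direct call through a possibly NULL slot is unsafe.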
in the _PRS or _CRS of link devices. If faced with multiple DPFs in a
_PRS, we just use the first one. We assume that if _CRS has DPF tags they
only contain a single set since multiple DPFs wouldn't make any sense. In
practice, the only DPFs I've seen so far for link devices are that the one
IRQ resource is surrounded by a DPF tag pair for no apparent reason, and
this should handle that case fine now.
- Only allocate link structures for IRQ resources for link devices rather
than allocating a link structure for every resource.
Reviewed by: njl
Tested by: phk
in the error cases, causing panics.
Adapted from similar fix to NFSv3 mkdir submitted by Mohan Srinivasan mohans
at yahoo-inc dot com
Approved by: alfred
should not return ERESTART after it caught a signal, otherwise the
thr_wake() call will be lost. Also, a timeout wait should not be
restarted. Finally, use wakeup() rather than wakeup_one() to be safe.
they would leave enough elements on the stack that if you escaped to the
loader prompt and then typed 'setenv', it would pull in all of the leaked
junk and cause an exception in the environment. There still seems to be
3 leaked elements, but they don't appear to be coming from this file.
a deadlock (with NFS exclusive vnode locks enabled). Lookup
grabs the parent's lock and wants to lock child. Readdirplus
locks the child and wants to lock parent (for loading the attrs
for ".."). The fix is to not load the attrs for ".." in
readdirplus.
Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com
Reviewed by: rwatson
This closes a major hole in close-to-open consistency support.
Added a new sysctl so that this can be disabled for single NFS
client applications with very large amounts of mmap'ed IO (for
performance).
Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com
Reviewed by: rwatson
returned to df from a statfs call, causing df to print negative
values.
Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com
Reviewed by: rwatson
specified register, but a pointer to the in-memory representation of
that value. The reason for this is twofold:
1. Not all registers can be represented by a register_t. In particular
FP registers fall in that category. Passing the new register value
by reference instead of by value makes this point moot.
2. When we receive a G or P packet, both are for writing a register,
the packet will have the register value in target-byte order and
in the memory representation (modulo the fact that bytes are sent
as 2 printable hexadecimal numbers of course). We only need to
decode the packet to have a pointer to the register value.
This change fixes the bug of extracting the register value of the P
packet as a hexadecimal number instead of as a bit array. The quick
(and dirty) fix to bswap the register value in gdb_cpu_setreg() as
it has been added on i386 and amd64 can therefore be removed, and has
in fact been removed.
Tested on: alpha, amd64, i386, ia64, sparc64
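As an illustration of decoding a register value as a byte array rather than as one hexadecimal number, here is a small sketch. The helper names are hypothetical and this is not the gdb stub's actual parser.

```c
#include <stddef.h>

/* Return the value of one hexadecimal digit, or -1 if c is not one. */
static int
hex_nibble(char c)
{
	if (c >= '0' && c <= '9')
		return (c - '0');
	if (c >= 'a' && c <= 'f')
		return (c - 'a' + 10);
	if (c >= 'A' && c <= 'F')
		return (c - 'A' + 10);
	return (-1);
}

/*
 * Decode 2 * len hex digits into buf, one byte per digit pair, in the
 * order they arrive. The bytes therefore land in memory exactly as the
 * target laid them out and no byte swap is needed afterwards.
 * Returns 0 on success, -1 on a bad or missing digit.
 */
static int
hex_to_bytes(const char *hex, unsigned char *buf, size_t len)
{
	size_t i;
	int hi, lo;

	for (i = 0; i < len; i++) {
		hi = hex_nibble(hex[2 * i]);
		if (hi < 0)
			return (-1);
		lo = hex_nibble(hex[2 * i + 1]);
		if (lo < 0)
			return (-1);
		buf[i] = (unsigned char)((hi << 4) | lo);
	}
	return (0);
}
```

The caller can then pass a pointer to buf as the new register value, whatever the register's size or type.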
@sys/dev/acpica/acpi_pci_link.c:153" panic by backing out rev 1.37 in the SMP
case. It appears that on a dual-proc machine the assertions in the rev 1.37
commit log hold true.
Introduce domain_init_status to keep track of the init status of the domains
list (surprise). 0 = uninitialized, 1 = initialized/unpopulated, 2 =
initialized/done. Higher values can be used to support late addition of
domains which right now "works", but is potentially dangerous. I chose to
only give a warning when doing so.
Use domain_init_status with if_attachdomain[1]() to ensure that we have a
complete domains list when we init the if_afdata array. Store the current
value of domain_init_status in if_afdata_initialized. This way we can update
if_afdata after a new protocol has been added (once that is allowed).
Submitted by: se (with changes)
Reviewed by: julian, glebius, se
PR: kern/73321 (partly)
call net_add_domain(). Calling this function too early (or late) breaks
assertions about the global domains list.
Actually it should be forbidden to call net_add_domain() outside of
SI_SUB_PROTO_DOMAIN completely as there are many places where we traverse
the domains list unprotected, but for now we allow late calls (mostly to
support netgraph). In order to really fix this we have to lock the domains
list in all places or find another way to ensure that we can safely walk the
list while another thread might be adding a new domain.
Spotted by: se
Reviewed by: julian, glebius
PR: kern/73321 (partly)
lock collision.
2. Fix two race conditions. One is between _umtx_unlock and signal
delivery: after a thread has been marked TDF_UMTXWAKEUP by
_umtx_unlock, a signal delivered to the thread can cause msleep to
return EINTR, and the thread breaks out of the loop, so umtx
ownership is not transferred to the thread. The other is in
_umtx_unlock itself: when the function sets the umtx to the
UMTX_UNOWNED state, a new thread can come in and lock the umtx;
the function then tries to set the contested bit flag, but it will
fail. Although the function will wake a blocked thread, if that
thread breaks out of the loop because of a signal, no contested bit
will be set.
observations lead me to believe that the convention for pc98 boot
loaders is to have a jump instruction, followed by a string, followed
by code. The jump usually doesn't have a nop after it and usually the
string is NUL terminated, but Grub/98 breaks both of these rules.
# I looked for, but failed to find the Minux boot blocks for PC-9801 port.
512. If I had an audio cdrom in my cd player when I booted my system,
I'd get a panic from geom because you can't read 8192 bytes from an
audio cdrom.
Remove XXX comment about IPL1 and replace it with some information
from my soon to be published web page on the pc98 disk layout. The
IPL1 test was the result of an observation of a disk with FreeBSD's
boot0 program. It was testing part of an area that appears to be
reserved for a boot loader name, which comes after a jump over this
area. I don't yet know if it is required to be any specific jump
instruction, or if the destination has to be location 11. [1]
[1] FreeBSD Press No. 13, page 115, poorly translated by myself. The
picture there shows offset 8 as the destination of the jump, but
FreeBSD's boot0 program has three padding NULs after the IPL1 name and
uses a 16-bit 'jmp' instruction.
resource lists. It used to be sized based only on _CRS, hence _PRS could
perform an out-of-bounds access if it was larger (i.e., when there are
dependent functions). Add asserts to detect this case. Note, this is
only a temporary fix and I believe _PRS and _CRS should have separate
arrays.
Also, fix a typo where the wrong irq was being checked for the APIC case.
Submitted by: tegge
to do a window update to the peer (thru an ACK) from soreceive()
itself. TCP will do that upon return from the socket callback.
Sending a window update from soreceive() results in a lock reversal.
Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com
Reviewed by: rwatson
soreceive(), then pass in M_DONTWAIT to m_copym(). Also fix up error
handling for the case where m_copym() returns failure.
Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com
Reviewed by: rwatson
that the exclusive lock is already held, then we call panic. Don't
clobber internal lock state before panic'ing. This change improves
debugging if this case were to happen.
Submitted by: Mohan Srinivasan mohans at yahoo-inc dot com
Reviewed by: rwatson
Approved by: Robert Watson <rwatson@freebsd.org>
Add locking to the IPv6 scoping code.
All spl() like calls have also been removed.
Cleaning up the handling of ifnet data will happen at a later date.