Commit Graph

137989 Commits

Author SHA1 Message Date
Jeff Roberson
0ee6cecc9d - Greatly simplify vget() by removing the guarantee that any new
references to a vnode with VI_OWEINACT set will force the vinactive()
   call.  The kernel makes no guarantees about which reference was the
   last to close a file or when the actual inactive processing will
   happen.  The previous code was designed to preserve existing semantics
   in the face of shared locks, however, this was unnecessary.

Discussed with:	mckusick
2008-03-24 04:22:58 +00:00
Jeff Roberson
804e60d4cf - Don't acquire the vnode interlock in _vn_lock() unless no lock type
is requested.  Handle this case specially before the while loop.
 - Use the held vnode lock to check for VI_DOOMED.  The vnode lock and
   interlock must both be held to set VI_DOOMED so either one held, even
   shared, is sufficient to check it.

No objection by:	kib
2008-03-24 04:17:35 +00:00
Jeff Roberson
97735db712 - Remove an old comment; vnodes have been working without Giant for
years now.
 - Clarify the locking required for VI_DOOMED in preparation for
   simplifications to vget() and vn_lock().
2008-03-24 04:11:40 +00:00
Kip Macy
8815ab518a Label inp as unused in the non-INVARIANTS case 2008-03-24 00:29:01 +00:00
Peter Wemm
f001eabf3a First pass at (possibly futile) microoptimizing of cpu_switch. Results
are mixed.  Some pure context switch microbenchmarks show up to 29%
improvement.  Pipe based context switch microbenchmarks show up to 7%
improvement.  Real world tests are far less impressive as they are
dominated more by actual work than switch overheads, but depending on
the machine in question, workload, kernel options, phase of moon, etc, a
few percent gain might be seen.

Summary of changes:
- don't reload MSR_[FG]SBASE registers when context switching between
  non-threaded userland apps.  These typically cost 120 clock cycles each
  on an AMD cpu (less on Barcelona/Phenom).  Intel cores are probably no
  faster on this.
- The above change only helps unthreaded userland apps that tend to use
  the same value for gsbase.  Threaded apps will get no benefit from this.
- reorder things like accessing the pcb to be in memory order, to give
  prefetching a better chance of working.  Operations are now in increasing
  memory address order, rather than reverse or random.
- Push some lesser used code out of the main code paths.  Hopefully
  allowing better code density in cache lines.  This is probably futile.
- (part 2 of previous item) Reorder code so that branches have a more
  realistic static branch prediction hint.  Both Intel and AMD cpus
  default to predicting branches to lower memory addresses as being
  taken, and to higher memory addresses as not being taken.  This is
  overridden by the limited dynamic branch prediction subsystem.  A trip
  through userland might overflow this.
- Futule attempt at spreading the use of the results of previous operations
  in new operations.  Hopefully this will allow the cpus to execute in
  parallel better.
- stop wasting 16 bytes at the top of kernel stack, below the PCB.
- Never load the userland fs/gsbase registers for kthreads, but preserve
  curpcb->pcb_[fg]sbase as caches for the cpu. (Thanks Jeff!)

Microbenchmarking this code seems to be really sensitive to things like
scheduling luck, timing, cache behavior, tlb behavior, kernel options,
other random code changes, etc.

While it doesn't help heavy userland workloads much, it does help high
context switch loads a little, and should help those that involve
switching via kthreads a bit more.

A special thanks to Kris for the testing and reality checks, and Jeff for
tormenting me into doing this. :)

This is still work-in-progress.
2008-03-23 23:09:06 +00:00
Alan Cox
58680920e9 Correct an error in pmap_mincore() when applied to a 2MB page mapping:
Use PG_PS_FRAME, not PG_FRAME, to obtain the physical address of the
2MB physical page from the PDE.
2008-03-23 23:04:09 +00:00
Peter Wemm
22c0c6e9d3 Export TDP_KTHREAD to asm files. 2008-03-23 22:46:37 +00:00
Peter Wemm
6c73bb3557 Move pcb_flags to make trivially better use of cache lines. 2008-03-23 22:45:51 +00:00
Peter Wemm
3d60169ef4 Protect the setting of the fsbase/gsbase MSR registers and the
pcb_[fg]sbase values with a critical section, like the rest of the kernel.
2008-03-23 22:44:56 +00:00
Kip Macy
3d5853271e Insulate inpcb consumers outside the stack from the lock type and offset within the pcb by adding accessor functions.
Reviewed by: rwatson
MFC after: 3 weeks
2008-03-23 22:34:16 +00:00
Alan Cox
702006ff76 To date, we have assumed that the TLB will only set the PG_M bit in a
PTE if that PTE has the PG_RW bit set.  However, this assumption does
not hold on recent processors from Intel.  For example, consider a PTE
that has the PG_RW bit set but the PG_M bit clear.  Suppose this PTE
is cached in the TLB and later the PG_RW bit is cleared in the PTE,
but the corresponding TLB entry is not (yet) invalidated.
Historically, upon a write access using this (stale) TLB entry, the
TLB would observe that the PG_RW bit had been cleared and initiate a
page fault, aborting the setting of the PG_M bit in the PTE.  Now,
however, P4- and Core2-family processors will set the PG_M bit before
observing that the PG_RW bit is clear and initiating a page fault.  In
other words, the write does not occur but the PG_M bit is still set.

The real impact of this difference is not that great.  Specifically,
we should no longer assert that any PTE with the PG_M bit set must
also have the PG_RW bit set, and we should ignore the state of the
PG_M bit unless the PG_RW bit is set.  However, these changes enable
me to remove a work-around from pmap_promote_pde(), the superpage
promotion procedure.

(Note: The AMD processors that we have tested, including the latest,
the Phenom, still exhibit the historical behavior.)

Acknowledgments: After I observed the problem, Stephan (ups) was
instrumental in characterizing the exact behavior of Intel's recent
TLBs.

Tested by: Peter Holm
2008-03-23 20:38:01 +00:00
Konstantin Belousov
1be222e9df Yield the cpu in the kernel while iterating the list of the
vnodes belonging to the mountpoint. Also, yield when in the
softdep_process_worklist() even when we are not going to sleep due to
buffer drain.

It is believed that the ULE fixed the problem [1], but the yielding
seems to be needed at least for the 4BSD case.

Discussed:	on stable@, with bde
Reviewed by:	tegge, jeff [1]
MFC after:	2 weeks
2008-03-23 13:45:24 +00:00
Colin Percival
c58b62eff4 When updating the install list for files which have had local changes
merged with upgrade changes, don't try to compute the SHA256 hash of
files which don't exist.

Reported by:	Jaakko Heinonen
MFC after:	1 week
2008-03-23 13:41:54 +00:00
Jeff Roberson
fbb275f59d - Restore kse.h in this directory so other tools don't find it by mistake.
- Restore the ability to debug kse coredumps in 8.0.

Suggested by:	marcel
2008-03-23 09:38:11 +00:00
Konstantin Belousov
3f7905d29c Prevent the overflow in the calculation of the next page directory.
The overflow causes the wraparound with consequent corruption of the
(almost) whole address space mapping.

As Alan noted, pmap_copy() does not require the wrap-around checks
because it cannot be applied to the kernel's pmap. The checks there are
included for consistency.

Reported and tested by:	kris (i386/pmap.c:pmap_remove() part)
Reviewed by:	alc
MFC after:	1 week
2008-03-23 07:07:27 +00:00
Pyun YongHyeon
2000cf6c0b MSI handling on some RealTek chips are broken so disable it by
default.

Reported by:	Giulio Ferro ( auryn AT zirakzigil DOT org )
Tested by:	Giulio Ferro ( auryn AT zirakzigil DOT org )
2008-03-23 05:35:18 +00:00
Pyun YongHyeon
03ca7ae8a9 For MSI capable hardwares, enable MSI enable bit in RL_CFG2
register.  If MSI was disabled by hw.re.msi_disable tunable
expliclty clear the MSI enable bit.
2008-03-23 05:31:35 +00:00
Pyun YongHyeon
ce6283934e Some RealTek chips are known to be buggy on DAC handling, so
disable DAC by default.
2008-03-23 05:13:45 +00:00
Pyun YongHyeon
ccf34c81f8 VLAN hardware tag information should be set for all desciptors of a
multi-descriptor transmission attempt. Datasheet said nothing about
this requirements. This should fix a long-standing VLAN hardware
tagging issues with re(4).

Reported by:	Giulio Ferro ( auryn AT zirakzigil DOT org )
Tested by:	Giulio Ferro ( auryn AT zirakzigil DOT org )
2008-03-23 05:06:16 +00:00
Pyun YongHyeon
70acaecfd0 Always honor configured VLAN/checksum offload capabilities.
Previously re(4) used to blindly enable VLAN hardware tag stripping
and Rx checksum offload regardless of enabled optional features of
interface.
2008-03-23 04:59:13 +00:00
Bruce A. Mah
72b218d366 New release notes: KSE removed, cmx(4), uslcom(4), sf(4) update,
re(4) update, vr(4) update, TCP options padding fix, ata(4) spindown,
hptrr(4) 1.2, mmap(2)/ZFS fix, chflags(1) -v/-f, cp(1) -a, find(1)
primaries to match GNU find, split(1) -n, tar(1) -Z, bzip2 1.0.5.

Modified release notes:  CVS post-1.11.12 snapshot from 10 March 2008.
2008-03-23 04:12:07 +00:00
Craig Rodrigues
67361fdf2d Remove comment about "-r" flag from readlabel. "-r" is a no-op.
The is comment is left over from the old disklabel command.

Reviewed by:	phk
2008-03-23 03:01:10 +00:00
David Xu
34d05d83f6 Remove commented out code, thread suspension is done in thread library. 2008-03-23 02:03:06 +00:00
Jeff Roberson
e6b2545b3b - Only return 1 from sync_vnode() in cases where the vnode is still
at the head of the sync list.  This prevents sched_sync() from
   re-queueing a vnode which may have been freed already.

Discussed with:	kib
2008-03-23 01:44:28 +00:00
Marcel Moolenaar
807e684076 Instead of making a single geom_part.ko module, make a module
for each partitioning scheme. The gpart code is currently non-
optional.
2008-03-23 01:42:47 +00:00
Jeff Roberson
f6a8cecfc6 - Pass BO_MTX(bo) to lockmgr in vtruncbuf, we don't own the vnode
interlock here anymore.

Reported by:	kris
2008-03-23 01:42:19 +00:00
Marcel Moolenaar
4ffca444a5 Redefine G_PART_SCHEME_DECLARE() from populating a private linker set
to declaring a proper module. The module event handler is part of the
gpart core and will add the scheme to an internal list on module load
and will remove the scheme from the internal list on module unload.
This makes it possible to dynamically load and unload partitioning
schemes.
2008-03-23 01:31:59 +00:00
Marcel Moolenaar
8a8fcb0089 Add g_retaste(), which given a class will present all non-open providers
to it for tasting. This is useful when the class, through means outside
the scope of GEOM, can claim providers previously unclaimed.

The g_retaste() function posts an event which is handled by the
g_retaste_event().

Event suggested by: phk
2008-03-23 01:23:35 +00:00
Olivier Houchard
2c361379e4 We need to prototype _start() as well, as we use it to test if we're running
from flash or from RAM.

Reported by:	imp
MFC After:	3 days
2008-03-22 20:34:07 +00:00
Qing Li
c7a0fc800c Reuse the mbuf that was just retrieved from the receive ring if mbuf
exhaustion is encountered. There was a fix made previously for this
problem but the solution (breaking out of the receive loop) does not
seem to work. mbuf reuse strategy is already adopted by other drivers
such as if_bge.  The problem was recreated and the patch is also
verified in the same test environment.
2008-03-22 18:13:39 +00:00
Sam Leffler
dd5ac081b8 add hints to specify how NPE ports are mapped to MAC+PHY; these
could be commented out as they just duplicate the defaults that
are built into the code

Reviewed by:	imp
MFC after:	1 week
2008-03-22 16:55:51 +00:00
Sam Leffler
c7ad0d8736 Improve mac+phy configuration so that hints can be used to describe
layouts different than the defaults:
o hint.npe.0.mac="A", "B", etc. specifies the window for MAC register accesses
o hint.npe.0.mii="A", "B", etc. specifies PHY registers
o hint.npe.1.phy=%d specifies the PHY to map to a port

This allows devices like NSLU to be setup w/o code changes and will
also be used for forthcoming support for more Avila boards.

Reviewed by:	imp
MFC after	1 week
2008-03-22 16:53:28 +00:00
Sam Leffler
0e3ec582c6 sync w/ p4: minor cleanups to improve msgs 2008-03-22 16:39:30 +00:00
Poul-Henning Kamp
4218a7310b In abort2(2): Accept a NULL arg pointer if nargs == 0 2008-03-22 16:32:52 +00:00
Sam Leffler
c28953b424 (finally) add the hal status to the diagnostic generated after
a failed ath_hal_reset call

MFC after:	3 days
2008-03-22 16:27:47 +00:00
Sam Leffler
61063e478a Defer state change on disassociate to avoid unnecessarily dropping the
lease: track the current bssid and if it changes (as reported in an
assoc/reassoc) event only then kick the state machine.  This gives us
immediate response when roaming but otherwise causes us to fallback on
the normal state machine.

Reviewed by:	brooks, jhb
MFC after:	3 weeks
2008-03-22 16:24:02 +00:00
Sam Leffler
043f1935e0 correct syslog mask so LOG_DEBUG msgs are not lost
MFC after:	2 weeks
2008-03-22 16:18:07 +00:00
Stefan Farfeleder
c20ee5ab6d Add a test case for options.c revision 1.26. 2008-03-22 14:07:49 +00:00
Stefan Farfeleder
f9ec075e88 Reset the internal state used for the 'getopts' built-in when 'shift' or 'set'
are used to modify the arguments.  Not doing so caused random memory reads or
null pointer dereferences when 'getopts' was called again later (SUSv3 says
getopts produces unspecified results in this case).

PR:	48318
2008-03-22 14:06:01 +00:00
Remko Lodder
6764f54349 In route.c in newroute() there's a call to exit(0) if the command was
'get'. Since rtmsg() always gets called and returns 0 on success and -1
on failure, it's possible to exit with a suitable exit code by calling
exit(ret != 0) instead, as is done at the end of newroute().

PR:		bin/112303
Submitted by:	bruce@cran.org.uk
MFC after:	1 week
2008-03-22 12:50:43 +00:00
David Xu
9939a13667 Add POSIX pthread API pthread_getcpuclockid() to get a thread's cpu
time clock id.
2008-03-22 09:59:20 +00:00
Jeff Roberson
698b1a6643 - Complete part of the unfinished bufobj work by consistently using
BO_LOCK/UNLOCK/MTX when manipulating the bufobj.
 - Create a new lock in the bufobj to lock bufobj fields independently.
   This leaves the vnode interlock as an 'identity' lock while the bufobj
   is an io lock.  The bufobj lock is ordered before the vnode interlock
   and also before the mnt ilock.
 - Exploit this new lock order to simplify softdep_check_suspend().
 - A few sync related functions are marked with a new XXX to note that
   we may not properly interlock against a non-zero bv_cnt when
   attempting to sync all vnodes on a mountlist.  I do not believe this
   race is important.  If I'm wrong this will make these locations easier
   to find.

Reviewed by:	kib (earlier diff)
Tested by:	kris, pho (earlier diff)
2008-03-22 09:15:16 +00:00
Alfred Perlstein
435cdf88ea Fix a race where timeout/untimeout could cause crashes for Giant locked
code.

The bug:

There exists a race condition for timeout/untimeout(9) due to the
way that the softclock thread dequeues timeouts.

The softclock thread sets the c_func and c_arg of the callout to
NULL while holding the callout lock but not Giant.  It then drops
the callout lock and acquires Giant.

It is at this point where untimeout(9) on another cpu/thread could
be called.

Since c_arg and c_func are cleared, untimeout(9) does not touch the
callout and returns as if the callout is canceled.

The softclock then tries to acquire Giant and likely blocks due to
the other cpu/thread holding it.

The other cpu/thread then likely deallocates the backing store that
c_arg points to and finishes working and hence drops Giant.

Softclock resumes and acquires giant and calls the function with
the now free'd c_arg and we have corruption/crash.

The fix:

We need to track curr_callout even for timeout(9) (LOCAL_ALLOC)
callouts.  We need to free the callout after the softclock processes
it to deal with the race here.

Obtained from: Juniper Networks, iedowse
Reviewed by: jhb, iedowse
MFC After: 2 weeks.
2008-03-22 07:29:45 +00:00
David Xu
20b94d8035 Use linker set to collection all target operations. 2008-03-22 05:40:44 +00:00
Doug Ambrisko
a355f43ed2 Add in a compat. mode so you can either open the card's device
node or directly open mfi0 and specify the card you want to talk to
in the ioctl.
2008-03-22 02:57:49 +00:00
Warner Losh
d6aed19dfb No need to be gratuitously style(9) non-compliant here, even though
C++ lets me get away with it.
2008-03-21 20:38:28 +00:00
Remko Lodder
eba8219e9b Replace reference from vinum.8 to gvinum.8, it was advised in the PR to
replace this with vinum.4, but that's the kernel interface manual, which
is not appropriate in my understanding.  I think that gvinum is a suitable
replacement for this.

PR:		docs/121938
Submitted by:	"Federico" <federicogalvezdurand at yahoo dot com>
MFC after:	3 days
2008-03-21 20:16:25 +00:00
Bjoern A. Zeeb
3e43d2ae25 Add ';' missed with the SYSINIT changes.
Not noticed by tb as TCP_SIGNATURE is not in LINT.

MFC after:	1 month
2008-03-21 18:31:42 +00:00
Remko Lodder
5f185dbd84 Add the i915 GME device to DRM.
PR:		kern/121808
Submitted by:	Volker Werth <volker at vwsoft dot com>
Approved by:	imp (mentor, implicit for trivial changes)
MFC after:	3 days
2008-03-21 16:38:42 +00:00
Konstantin Belousov
e7ffdf423a Reduce contention on the vnode interlock by not acquiring the BO_LOCK
around the check for the BV_BKGRDINPROG in the brelse() and bqrelse().
See the comment for the explanation why it is safe.

Tested by:	pho
Submitted by:	jeff
2008-03-21 12:38:44 +00:00