Commit Graph

10333 Commits

Author SHA1 Message Date
jeff
a469063987 PR 117603
- Close a sleepqueue signal race by interlocking with the per-process
   spinlock.  This was mistakenly omitted from the thread_lock patch and
   has been a race since.

MFC After:	1 week
PR:		bin/117603
Reported by:	Danny Braniss <danny@cs.huji.ac.il>
2008-03-13 00:46:12 +00:00
jeff
acb93d599c Remove kernel support for M:N threading.
While the KSE project was quite successful in bringing threading to
FreeBSD, the M:N approach taken by the kse library was never developed
to its full potential.  Backwards compatibility will be provided via
libmap.conf for dynamically linked binaries and static binaries will
be broken.
2008-03-12 10:12:01 +00:00
jeff
3b1acbdce2 - Pass the priority argument from *sleep() into sleepq and down into
sched_sleep().  This removes extra thread_lock() acquisition and
   allows the scheduler to decide what to do with the static boost.
 - Change the priority arguments to cv_* to match sleepq/msleep/etc.
   where 0 means no priority change.  Catch -1 in cv_broadcastpri() and
   convert it to 0 for now.
 - Set a flag when sleeping in a way that is compatible with swapping
   since direct priority comparisons are meaningless now.
 - Add a sysctl to ule, kern.sched.static_boost, that defaults to on which
   controls the boost behavior.  Turning it off gives better performance
   in some workloads but needs more investigation.
 - While we're modifying sleepq, change signal and broadcast to both
   return with the lock held as the lock was held on enter.

Reviewed by:	jhb, peter
2008-03-12 06:31:06 +00:00
jeff
e30139dff5 - KSE may free a thread that was never actually forked. This will leave
td_cpuset NULL.  Check for this condition before dereferencing the
   cpuset.

Reported by:	david@catwhisker.org, miwi@freebsd.org
Sponsored by:	Nokia
2008-03-12 05:01:14 +00:00
jeff
540fa064d9 - Fix the invalid priority panics people are seeing by forcing
tdq_runq_add to select the runq rather than hoping we set it properly
   when we adjusted the priority.  This involves the same number of
   branches as before so should perform identically without the extra
   fragility.

Tested by:	bz
Reviewed by:	bz
2008-03-10 22:48:27 +00:00
jeff
e53ae3b798 - Don't rely on a side effect of sched_prio() to set the initial ts_runq
for thread0.  Set it directly in sched_setup().  This fixes traps on boot
   seen on some machines.

Reported by:	phk
2008-03-10 09:50:29 +00:00
jeff
aa3cc14d3d - Handle kdb switch panics outside of mi_switch() to remove some instructions
from the common path and make the code more clear.  Whether this has any
   impact on performance may depend on optimization levels.

Sponsored by:	Nokia
2008-03-10 03:16:51 +00:00
jeff
14a6f96adb Reduce ULE context switch time by over 25%.
- Only calculate timeshare priorities once per tick or when a thread is woken
   from sleeping.
 - Keep the ts_runq pointer valid after all priority changes.
 - Call tdq_runq_add() directly from sched_switch() without passing in via
   tdq_add().  We don't need to adjust loads or runqs anymore.
 - Sort tdq and ts_sched according to utilization to improve cache behavior.

Sponsored by:	Nokia
2008-03-10 03:15:19 +00:00
imp
be1fee2a1a Tiny bit of KNF to make bus_setup_intr() look like the rest of this
function.
2008-03-10 01:48:25 +00:00
jeff
128f1c2547 - Add the missing '2' case to the switch table for kern.smp.topology and
assign it to create the flat 'none' topology where all cpus are scheduled
   as if they are equal and unrelated.
2008-03-10 01:38:53 +00:00
jeff
7dc7c824ee - Add an implementation of sched_preempt() that avoids excessive IPIs.
- Normalize the preemption/ipi setting code by introducing sched_shouldpreempt()
   so the logic is identical and not repeated between tdq_notify() and
   sched_setpreempt().
 - In tdq_notify() don't set NEEDRESCHED as we may not actually own the thread lock;
   this could have caused us to lose td_flags settings.
 - Garbage collect some tunables that are no longer relevant.
2008-03-10 01:32:01 +00:00
jeff
171a608f92 - Add a sched_preempt() routine to be called by md code after IPI_PREEMPT is
delivered.
 - Add a simple implementation to 4bsd.
2008-03-10 01:30:35 +00:00
imp
a7ddb800e8 Any driver that relies on its parent to set the devclass has no way to
know if it has siblings that need an actual probe.  Introduce a special
return value called BUS_PROBE_NOWILDCARD.  If the driver returns
this, the probe is only successful for devices that have had a
specific devclass set for them.

Reviewed by: current@, jhb@, grehan@
2008-03-09 05:10:22 +00:00
antoine
514f31f40e Introduce a new F_DUP2FD command to fcntl(2), for compatibility with
Solaris and AIX.
fcntl(fd, F_DUP2FD, arg) and dup2(fd, arg) are functionally equivalent.
Document it.
Add some regression tests (identical to the dup2(2) regression tests).

PR:		120233
Submitted by:	Jukka Ukkonen
Approved by:	rwatson (mentor)
MFC after:	1 month
2008-03-08 22:02:21 +00:00
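
The equivalence stated in this commit can be illustrated from userland. The following is a minimal sketch, not code from the commit: it assumes a Python build whose fcntl module exposes F_DUP2FD (only the case on platforms that define it) and otherwise falls back to dup2. The helper names duplicate_onto() and demo() are hypothetical.

```python
import fcntl
import os


def duplicate_onto(fd: int, target: int) -> int:
    """Make `target` refer to the same open file as `fd`; return `target`."""
    cmd = getattr(fcntl, "F_DUP2FD", None)  # absent on platforms without it
    if cmd is not None:
        return fcntl.fcntl(fd, cmd, target)
    return os.dup2(fd, target)


def demo() -> bool:
    r, w = os.pipe()
    target = os.dup(r)  # any valid descriptor; it is closed and replaced
    try:
        got = duplicate_onto(w, target)
        os.write(target, b"ok")  # target now aliases the write end
        return got == target and os.read(r, 2) == b"ok"
    finally:
        for fd in (r, w, target):
            os.close(fd)
```

Either branch should leave target as an alias of the pipe's write end, which is what demo() checks.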
rwatson
bb69385843 Use sbuf routines to construct core dump filenames rather than custom
string buffer handling, making the code both easier to read and more
robust against string-handling bugs.

MFC after:	1 week
2008-03-08 16:31:29 +00:00
rwatson
32931f304a Unlock the process lock when expand_name() fails, or we may leak the
process lock leading to a hang.  This bug was introduced in
kern_sig.c:1.351, when the call to expand_name() was moved earlier
but this particular error case was not updated.
2008-03-08 15:48:06 +00:00
rwatson
d91018d529 Add __FBSDID() tag.
MFC after:	3 days
Pointed out by:	antoine
2008-03-07 15:27:08 +00:00
jeff
7b52c04658 - Add a missing unlock to cpuset_setaffinity(CPU_LEVEL_CPUSET, CPU_WHICH_PID)
Found by:	gallatin
2008-03-06 20:11:24 +00:00
jeff
f278be8741 - Don't overwrite the recently allocated 'nset' in cpuset_setthread() by
passing it to cpuset_which().  Pass in 'set' instead.  This argument
   is not used but for convenience cpuset_which() nulls all incoming
   parameters.

Submitted by:	davidxu
2008-03-05 08:08:32 +00:00
jeff
7e2fbaa872 - Verify that when a user supplies a mask that is bigger than the kernel
mask none of the upper bits are set.
 - Be more careful about enforcing the boundaries of masks and child sets.
 - Introduce a few more CPU_* macros for implementing these tests.
 - Change the cpusetsize argument to be bytes rather than bits to match
   other apis.

Sponsored by:	Nokia
2008-03-05 01:49:20 +00:00
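
The first check in this commit (a user mask larger than the kernel mask must have no upper bits set) can be modelled in a few lines. This is only an illustrative sketch with sizes expressed in bytes, as the commit describes for cpusetsize; the function name and the 8-byte kernel mask size are assumptions, and the kernel's real CPU_* macros are not reproduced here.

```python
KERNEL_MASK_BYTES = 8  # hypothetical size of the kernel's cpu mask, in bytes


def upper_bits_clear(user_mask: bytes, kernel_bytes: int = KERNEL_MASK_BYTES) -> bool:
    """Return True if no bit is set beyond the kernel-sized portion of the mask."""
    # Slicing past the end yields b"", so masks no larger than the kernel's pass.
    return not any(user_mask[kernel_bytes:])
```

A 16-byte mask with all bits clear in its upper 8 bytes passes; the same mask with any bit set in byte 8 or later is rejected.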
ru
fb19d1efe4 Make it possible to continue working after calling doadump()
manually from debugger.  (This got broken in rev. 1.122.)
2008-03-04 07:39:31 +00:00
raj
0757a4afb5 Initial support for Freescale PowerQUICC III MPC85xx system-on-chip family.
The PQ3 is a high performance integrated communications processing system
based on the e500 core, which is an embedded RISC processor that implements
the 32-bit Book E definition of the PowerPC architecture. For details refer
to: http://www.freescale.com/webapp/sps/site/prod_summary.jsp?code=MPC8555E

This port was tested and successfully run on the following members of the PQ3
family: MPC8533, MPC8541, MPC8548, MPC8555.

The following major integrated peripherals are supported:

  * On-chip peripherals bus
  * OpenPIC interrupt controller
  * UART
  * Ethernet (TSEC)
  * Host/PCI bridge
  * QUICC engine (SCC functionality)

This commit brings the main functionality and will be followed by individual
drivers that are logically separate from this base.

Approved by:	cognet (mentor)
Obtained from:	Juniper, Semihalf
MFp4:		e500
2008-03-03 17:17:00 +00:00
marcel
dcf8897ad7 Unbreak after cpuset: initialize td_cpuset in sched_fork_thread().
2008-03-02 21:34:57 +00:00
jeff
c12e39a76d Add support for the new cpu topology api:
- When searching for affinity search backwards in the tree from the last
   cpu we ran on while the thread still has affinity for the group.   This
   can take advantage of knowledge of shared L2 or L3 caches among a
   group of cores.
 - When searching for the least loaded cpu find the least loaded cpu via
   the least loaded path through the tree.  This load balances system bus
   links, individual cache levels, and hyper-threaded/SMT cores.
 - Make the periodic balancer recursively balance the highest and lowest
   loaded cpu across each link.

Add support for cpusets:
 - Convert the cpuset to a simple native cpumask_t while the kernel still
   only supports cpumask.
 - Pass the derived cpumask down through the cpu_search functions to
   restrict the result cpus.
 - Make the various steal functions resilient to failure since all threads
   can not run on all cpus any longer.

General improvements:
 - Precisely track the lowest priority thread on every runq with
   tdq_setlowpri().  Before it was more advisory but this ended up having
   pathological behaviors.
 - Remove many #ifdef SMP conditions to simplify the code.
 - Get rid of the old cumbersome tdq_group.  This is more naturally
   expressed via the cpu_group tree.

Sponsored by:	Nokia
Testing by:	kris
2008-03-02 08:20:59 +00:00
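
The "least loaded path through the tree" search, restricted by a cpuset-derived mask, can be sketched as follows. This is a simplified model and not the ULE implementation: the CpuGroup class, the load dictionary, and the choice to sum the load of every cpu in a group are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class CpuGroup:
    """Hypothetical topology node: a leaf lists cpus, an inner node lists children."""
    cpus: List[int] = field(default_factory=list)
    children: List["CpuGroup"] = field(default_factory=list)

    def all_cpus(self) -> List[int]:
        if not self.children:
            return list(self.cpus)
        return [cpu for child in self.children for cpu in child.all_cpus()]


def least_loaded_cpu(group: CpuGroup, load: Dict[int, int],
                     allowed: Set[int]) -> Optional[int]:
    """Descend into the least loaded child at each level; None if no cpu is allowed."""
    if not group.children:
        candidates = [c for c in group.cpus if c in allowed]
        return min(candidates, key=lambda c: load[c]) if candidates else None
    # Skip subtrees that contain no cpu the thread is allowed to run on.
    usable = [ch for ch in group.children
              if any(c in allowed for c in ch.all_cpus())]
    if not usable:
        return None
    best = min(usable, key=lambda ch: sum(load[c] for c in ch.all_cpus()))
    return least_loaded_cpu(best, load, allowed)
```

With two groups holding cpus {0, 1} at loads (0, 5) and cpus {2, 3} at loads (1, 1), the search picks cpu 2 rather than the globally idle cpu 0, because the second group is less loaded overall. Restricting the allowed set to {0, 1} yields cpu 0, and an empty set yields None, mirroring the need for callers to tolerate failure.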
jeff
ad2a31513f - Replace the old smp cpu topology specification with a new, more flexible
tree structure that encodes the level of cache sharing and other
   properties.
 - Provide several convenience functions for creating one and two level
   cpu trees as well as a default flat topology.  The system now always
   has some topology.
 - On i386 and amd64 create a separate level in the hierarchy for HTT
   and multi-core cpus.  This will allow the scheduler to intelligently
   load balance non-uniform cores.  Presently we don't detect what level
   of the cache hierarchy is shared at each level in the topology.
 - Add a mechanism for testing common topologies that have more information
   than the MD code is able to provide via the kern.smp.topology tunable.
   This should be considered a debugging tool only and not a stable api.

Sponsored by:	Nokia
2008-03-02 07:58:42 +00:00
jeff
9b809b84f1 - Regen for cpuset
Sponsored by:	Nokia
2008-03-02 07:41:10 +00:00
jeff
694203dedd Add cpuset, an api for thread to cpu binding and cpu resource grouping
and assignment.
 - Add a reference to a struct cpuset in each thread that is inherited from
   the thread that created it.
 - Release the reference when the thread is destroyed.
 - Add prototypes for syscalls and macros for manipulating cpusets in
   sys/cpuset.h
 - Add syscalls to create, get, and set new numbered cpusets:
   cpuset(), cpuset_{get,set}id()
 - Add syscalls for getting and setting affinity masks for cpusets or
   individual threads: cpuset_{get,set}affinity()
 - Add types for the 'level' and 'which' parameters for the cpuset.  This
   will permit expansion of the api to cover cpu masks for other objects
   identifiable with an id_t integer.  For example, IRQs and Jails may be
   coming soon.
 - The root set 0 contains all valid cpus.  All threads initially belong to
   cpuset 1.  This permits migrating all threads off of certain cpus to
   reserve them for special applications.

Sponsored by:	Nokia
Discussed with:	arch, rwatson, brooks, davidxu, deischen
Reviewed by:	antoine
2008-03-02 07:39:22 +00:00
jeff
3bd7de5a7c - Add a new sched_affinity() api to be used in the upcoming cpuset
implementation.
 - Add empty implementations of sched_affinity() to 4BSD and ULE.

Sponsored by:	Nokia
2008-03-02 07:19:35 +00:00
attilio
0d87334131 - Handle buffer lock waiters count directly in the buffer cache instead
of relying on the lockmgr support [1]:
  * bump the waiters only if the interlock is held
  * let brelvp() return the waiters count
  * rely on brelvp() instead of BUF_LOCKWAITERS() in order to check
    for the waiters number
- Remove a namespace pollution introduced recently with lockmgr.h
  including lock.h by including lock.h directly in the consumers and
  making it mandatory for using lockmgr.
- Modify flags accepted by lockinit():
  * introduce LK_NOPROFILE which disables lock profiling for the
    specified lockmgr
  * introduce LK_QUIET which disables ktr tracing for the specified
    lockmgr [2]
  * disallow LK_SLEEPFAIL and LK_NOWAIT to be passed there so that it
    can only be used on a per-instance basis
- Remove BUF_LOCKWAITERS() and lockwaiters() as they are no longer
  used

This patch breaks KPI so __FreeBSD_version will be bumped and manpages
updated by further commits. Additionally, the 'struct buf' changes result in
a disturbed ABI as well.

[2] Really, currently there is no ktr tracing in the lockmgr, but it
will be added soon.

[1] Submitted by:	kib
Tested by:	pho, Andrea Barberio <insomniac at slackware dot it>
2008-03-01 19:47:50 +00:00
kib
8b513f42f1 Do not assert any locks for VOP_PRINT. In particular, do not assert that
the vnode interlock is not held. vn_printf() already correctly handles
locked and unlocked vnode interlocks, and all the in-tree vop_print
methods are interlock-agnostic.

Some code calls vprintf() with the vnode interlock held, which causes
unjustified panics with INVARIANTS (ffs_syncvnode(), for example).

Reported by:	Peter Holm
2008-02-26 12:16:35 +00:00
attilio
4014b55830 Axe the 'thread' argument from VOP_ISLOCKED() and lockstatus() as it is
always curthread.

As KPI gets broken by this patch, manpages and __FreeBSD_version will be
updated by further commits.

Tested by:	Andrea Barberio <insomniac at slackware dot it>
2008-02-25 18:45:57 +00:00
attilio
0d54671a48 Introduce some functions in the vnode locks namespace and in the ffs
namespace in order to handle lockmgr fields in a controlled way instead
of spreading bogus stubs all around:
- VN_LOCK_AREC() allows lock recursion for a specified vnode
- VN_LOCK_ASHARE() allows lock sharing for a specified vnode

In FFS land:
- BUF_AREC() allows lock recursion for a specified buffer lock
- BUF_NOREC() disallows recursion for a specified buffer lock

Side note: union_subr.c::unionfs_node_update() is the only other function
directly handling lockmgr fields. As this is not simple to fix, it has
been left behind as "sole" exception.
2008-02-24 16:38:58 +00:00
cperciva
b02e531c35 After finishing sending file data in sendfile(2), don't forget to send
the provided trailers.  This has been broken since revision 1.240.

Submitted by:	Dan Nelson
PR:		kern/120948
"sounds ok to me" from:	phk
MFC after:	3 days
2008-02-24 00:07:00 +00:00
des
df26e399aa This patch adds a new ktrace(2) record type, KTR_STRUCT, whose payload
consists of the null-terminated name and the contents of any structure
you wish to record.  A new ktrstruct() function constructs and emits a
KTR_STRUCT record.  It is accompanied by convenience macros for struct
stat and struct sockaddr.

In kdump(1), KTR_STRUCT records are handled by a dispatcher function
that runs stringent sanity checks on its contents before handing it
over to individual decoding functions for each type of structure.
Currently supported structures are struct stat and struct sockaddr for
the AF_INET, AF_INET6 and AF_UNIX families; support for AF_APPLETALK
and AF_IPX is present but disabled, as I am unable to test it properly.

Since 's' was already taken, the letter 't' is used by ktrace(1) to
enable KTR_STRUCT trace points, and in kdump(1) to enable their
decoding.

Derived from patches by Andrew Li <andrew2.li@citi.com>.

PR:		kern/117836
MFC after:	3 weeks
2008-02-23 01:01:49 +00:00
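
The KTR_STRUCT payload layout described above (a null-terminated name followed by the structure contents) can be modelled as below. This is a sketch of the layout only: the function names are hypothetical, and the sanity checks shown are illustrative rather than the actual checks performed by kdump(1).

```python
from typing import Tuple


def encode_ktr_struct(name: str, data: bytes) -> bytes:
    """Build a payload: structure name, a NUL terminator, then the raw contents."""
    if not name or "\0" in name:
        raise ValueError("name must be non-empty and contain no NUL")
    return name.encode("ascii") + b"\0" + data


def decode_ktr_struct(payload: bytes) -> Tuple[str, bytes]:
    """Split a payload at the first NUL and reject obviously malformed records."""
    name, sep, data = payload.partition(b"\0")
    if not sep:
        raise ValueError("missing NUL terminator after structure name")
    if not name:
        raise ValueError("empty structure name")
    if not data:
        raise ValueError("empty structure contents")
    return name.decode("ascii"), data
```

A dispatcher would then look up a per-type decoder by the returned name (for example "stat" or "sockaddr") and hand it the contents.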
yar
ce8c493400 Undo the damage I did in sys/kern/vfs_mount.c #1.274 and
sbin/mount_nfs/mount_nfs.c #1.76.  Let the dragons sleep.

Requested by:	rodrigc, des
PR:		kern/120319 (welcome the bug back)
2008-02-18 20:58:57 +00:00
yar
2bac23abfa Add a remark on a questionable property of vfs_mergeopts().
2008-02-18 10:10:42 +00:00
antoine
fb176dbab6 Make sysctl_kern_arnd return a random buffer instead of a random long,
as it is expected by userland (stack protector guard setup for example).

PR:		119129
Approved by:	rwatson (mentor)
MFC after:	1 month
2008-02-17 16:44:48 +00:00
kris
8697995804 Switch from conditionally dropping Giant in exit1() to asserting it is
not held, which appears to be always true.
2008-02-17 15:28:28 +00:00
imp
0514d6cc34 Fix typo in comment.
2008-02-17 02:46:54 +00:00
antoine
ab8945769a Remove a superfluous line in run_interrupt_driven_config_hooks(),
next_entry is already initialized during TAILQ_FOREACH_SAFE().

PR:		kern/119604
Approved by:	rwatson (mentor)
MFC after:	1 month
2008-02-15 21:54:21 +00:00
attilio
265cb5fb91 - Introduce lockmgr_args() in the lockmgr space. This function performs
the same operation as lockmgr() but accepts a custom wmesg, prio and
  timo for the particular lock instance, overriding default values
  lkp->lk_wmesg, lkp->lk_prio and lkp->lk_timo.
- Use lockmgr_args() in order to implement BUF_TIMELOCK()
- Cleanup BUF_LOCK()
- Remove LK_INTERNAL as it is no longer used in the lockmgr namespace

Tested by:	Andrea Barberio <insomniac at slackware dot it>
2008-02-15 21:04:36 +00:00
yar
9713f1f445 In the new order of things dictated by nmount(2), a read-only mount
is to be requested via a "ro" option.  At the same time, MNT_RDONLY
is gradually becoming an indicator of the current state of the FS
instead of a command flag.  Today passing MNT_RDONLY alone to the
kernel's mount machinery will lead to various glitches.  (See the
PRs for examples.)

Therefore mount the root FS with a "ro" option instead of the
MNT_RDONLY flag.  (Note that MNT_RDONLY still is added to the mount
flags internally, by vfs_donmount(), if "ro" was specified.)

To be able to pass "ro" cleanly to kernel_vmount(), teach the latter
function to accept options with NULL values.

Also correct the comment explaining how mount_arg() handles length
of -1.

PR:		bin/106636 kern/120319
Submitted by:	Jaakko Heinonen <see PR kern/120319 for email> (originally)
2008-02-14 17:04:31 +00:00
simon
49aa39283b Fix sendfile(2) write-only file permission bypass.
Security:	FreeBSD-SA-08:03.sendfile
Submitted by:	kib
2008-02-14 11:44:31 +00:00
jhb
fd8332efc0 Add KASSERT()'s to catch attempts to recurse on spin mutexes that aren't
marked recursable either via mtx_lock_spin() or thread_lock().

MFC after:	1 week
2008-02-13 23:39:05 +00:00
jhb
32100bd15f Mark sleepqueue chain spin mutexes as recursable since the sleepq code
now recurses on them in sleepq_broadcast() and sleepq_signal() when
resuming threads that are fully asleep.

MFC after:	1 week
2008-02-13 23:36:56 +00:00
jhb
64735ffb5f Add a couple of assertions and KTR logging to thread_lock_flags() to
match mtx_lock_spin_flags().

MFC after:	1 week
2008-02-13 23:33:50 +00:00
jhb
e8b1d791b2 Add an automatic kernel module version dependency to prevent loading
modules using invalid ABI versions (e.g. a 7.x module with an 8.x kernel)
for a given kernel:
- Add a 'kernel' module version whose value is __FreeBSD_version.
- Add a version dependency on 'kernel' in every module that has an
  acceptable version range of __FreeBSD_version up to the end of the
  branch __FreeBSD_version is part of.  E.g. a module compiled on 701000
  would work on kernels with versions between 701000 and 799999 inclusive.

Discussed on:	arch@
MFC after:	1 week
2008-02-13 21:34:06 +00:00
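
The version range in the example above can be reproduced with simple arithmetic. This is a sketch under one stated assumption: that a branch spans 100000 consecutive __FreeBSD_version values, which is consistent with the 701000 to 799999 example in the message. The function names are hypothetical.

```python
BRANCH_SPAN = 100000  # assumption: one major branch covers 100000 version numbers


def acceptable_kernel_range(module_version: int):
    """Return (lowest, highest) kernel versions a module built at module_version accepts."""
    branch_start = (module_version // BRANCH_SPAN) * BRANCH_SPAN
    return module_version, branch_start + BRANCH_SPAN - 1


def module_may_load(module_version: int, kernel_version: int) -> bool:
    """True if the kernel version falls inside the module's acceptable range."""
    low, high = acceptable_kernel_range(module_version)
    return low <= kernel_version <= high
```

Under this model a module built at 701000 is accepted by a 701000 or 799999 kernel but refused by an 800000 kernel, and also by an older 700000 kernel.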
attilio
456bfb1f0f - Add real assertions to lockmgr locking primitives.
A couple of notes for this:
  * WITNESS support, when enabled, is only used for shared locks in order
    to avoid problems with the "disowned" locks
  * KA_HELD and KA_UNHELD only exist in the lockmgr namespace in order
    to assert that a generic thread (not curthread) does or does not own
    the lock.  Really, this kind of check is bogus but it seems very
    widespread in the consumers' code.  So, for the moment, we cater to
    this untrusted behaviour until the consumers are fixed and the
    options can be removed (hopefully during the 8.0-CURRENT lifecycle)
  * Implementing KA_HELD and KA_UNHELD (not supported natively by
    WITNESS) made it necessary to introduce LA_MASKASSERT, which
    specifies the range for default lock assertion flags
  * In other respects, lockmgr_assert() follows exactly what other
    locking primitives offer for this operation.

- Build real assertions for buffer cache locks on the top of
  lockmgr_assert().  They can be used with the BUF_ASSERT_*(bp)
  paradigm.

- Add checks at lock destruction time and use a cookie for verifying
  lock integrity at any operation.

- Redefine BUF_LOCKFREE() in order to not use a direct assert but
  let it rely on the aforementioned destruction time check.

The KPI is evidently broken, so a __FreeBSD_version bump and a
manpage update are necessary and will be committed soon.

Side note: lockmgr_assert() will be used soon in order to implement
real assertions in the vnode namespace replacing the legacy and still
bogus "VOP_ISLOCKED()" way.

Tested by:      kris (earlier version)
Reviewed by:    jhb
2008-02-13 20:44:19 +00:00
csjp
b24cb219b9 Make sure we restrict Linux-only IPC calls from being executed
through the FreeBSD ABI.  IPC_INFO, SHM_INFO, SHM_STAT were added
specifically for Linux binary support.  They are not documented
as being a part of the FreeBSD ABI, also, the structures necessary
for them have been hidden away from the users for a long time.

Also, the Linux ABI layer uses its own structures to populate the
responses back to the user to ensure that the ABI is consistent.

I think there is a bit more separation work that needs to happen.

Reviewed by:	jhb
Discussed with:	jhb
Discussed on:	freebsd-arch@ (very briefly)
MFC after:	1 month
2008-02-12 20:55:03 +00:00
ru
841dab65e0 Regenerate for readlink(2).
2008-02-12 20:11:54 +00:00