Commit Graph

370 Commits

Author SHA1 Message Date
julian
396ed947f6 Threading cleanup.. part 2 of several.
Make part of John Birrell's KSE patch permanent..
Specifically, remove:
Any reference of the ksegrp structure. This feature was
never fully utilised and made things overly complicated.
All code in the scheduler that tried to make threaded programs
fair to unthreaded programs.  Libpthread processes will already
do this to some extent and libthr processes already disable it.

Also:
Since this makes such a big change to the scheduler(s), take the opportunity
to rename some structures and elements that had to be moved anyhow.
This makes the code a lot more readable.

The ULE scheduler compiles again but I have no idea if it works.

The 4bsd scheduler still reqires a little cleaning and some functions that now do
ALMOST nothing will go away, but I thought I'd do that as a separate commit.

Tested by David Xu, and Dan Eischen using libthr and libpthread.
2006-12-06 06:34:57 +00:00
davidxu
c2eeccb29d Copy base user priority in NO_KSE case. 2006-11-12 11:48:37 +00:00
jb
f82c799735 Make KSE a kernel option, turned on by default in all GENERIC
kernel configs except sun4v (which doesn't process signals properly
with KSE).

Reviewed by:	davidxu@
2006-10-26 21:42:22 +00:00
rwatson
7beaaf5cd2 Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h.  sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from:	TrustedBSD Project
Sponsored by:	SPARTA
2006-10-22 11:52:19 +00:00
rwatson
aa39d5a21d SI_ORDER_THIRD + 2, not SI_ORDER_FOURTH + 2.
MFC after:	3 days
Submitted by:	mlaier
2006-09-26 00:15:56 +00:00
rwatson
ffe75b8d42 Add "FreeBSD" trademark statement to copyright section of boot messages.
MFC after:	3 days
Approved by:	core, board at FreeBSDFoundation dot org
2006-09-25 23:19:01 +00:00
davidxu
e2f7e9cc95 Initialize kg_base_user_pri. 2006-08-25 06:29:16 +00:00
benno
64e3a72f51 The VERBOSE_SYSINIT stuff sees the DDB define a lot better if we include
opt_ddb.h.

Spotted by:	benno
Pointy hat to:	benno
2006-05-14 07:11:28 +00:00
benno
ed056cb6c7 Add a new kernel config option, VERBOSE_SYSINIT.
When porting FreeBSD to a new platform, one of the more useful things to do is
get mi_startup() to let you know which SYSINIT it's up to.  Most people tend to
whack a printf in the SYSINIT loop to print the address of the function it's
about to call.  Going one better, jhb made a version that uses DDB to look up
the name of the function and print that instead.  This version is essentially
his with the addition of some ifdeffery to make it optional and to allow it to
work (although using only the function address, not the symbol) if you forgot
to enable DDB.

All the cool bits by:	jhb
Approved by:		scottl, rink, cognet, imp
2006-05-12 02:01:38 +00:00
phk
bb2f62f536 Modify the way we account for CPU time spent (step 1)
Keep track of time spent by the cpu in various contexts in units of
"cputicks" and scale to real-world microsec^H^H^H^H^H^H^H^Hclock_t
only when somebody wants to inspect the numbers.

For now "cputicks" are still derived from the current timecounter
and therefore things should by definition remain sensible also on
SMP machines.  (The main reason for this first milestone commit is
to verify that hypothesis.)

On slower machines, the avoided multiplications to normalize timestams
at every context switch, comes out as a 5-7% better score on the
unixbench/context1 microbenchmark.  On more modern hardware no change
in performance is seen.
2006-02-07 21:22:02 +00:00
cognet
41e16a89f7 rwlock expects the struct thread to be aligned on 8 bytes, so make sure
thread0 is.
2006-02-06 16:03:10 +00:00
rwatson
15f0f561fc Hook up audit to the initial process creation events (proc0, proc1).
Much help from:	wsalamon
Obtained from:	TrustedBSD Project
2006-02-02 01:16:31 +00:00
rwatson
2a5785fb21 Moderate rewrite of kernel ktrace code to attempt to generally improve
reliability when tracing fast-moving processes or writing traces to
slow file systems by avoiding unbounded queueuing and dropped records.
Record loss was previously possible when the global pool of records
become depleted as a result of record generation outstripping record
commit, which occurred quickly in many common situations.

These changes partially restore the 4.x model of committing ktrace
records at the point of trace generation (synchronous), but maintain
the 5.x deferred record commit behavior (asynchronous) for situations
where entering VFS and sleeping is not possible (i.e., in the
scheduler).  Records are now queued per-process as opposed to
globally, with processes responsible for committing records from their
own context as required.

- Eliminate the ktrace worker thread and global record queue, as they
  are no longer used.  Keep the global free record list, as records
  are still used.

- Add a per-process record queue, which will hold any asynchronously
  generated records, such as from context switches.  This replaces the
  global queue as the place to submit asynchronous records to.

- When a record is committed asynchronously, simply queue it to the
  process.

- When a record is committed synchronously, first drain any pending
  per-process records in order to maintain ordering as best we can.
  Currently ordering between competing threads is provided via a global
  ktrace_sx, but a per-process flag or lock may be desirable in the
  future.

- When a process returns to user space following a system call, trap,
  signal delivery, etc, flush any pending records.

- When a process exits, flush any pending records.

- Assert on process tear-down that there are no pending records.

- Slightly abstract the notion of being "in ktrace", which is used to
  prevent the recursive generation of records, as well as generating
  traces for ktrace events.

Future work here might look at changing the set of events marked for
synchronous and asynchronous record generation, re-balancing queue
depth, timeliness of commit to disk, and so on.  I.e., performing a
drain every (n) records.

MFC after:	1 month
Discussed with:	jhb
Requested by:	Marc Olzheim <marcolz at stack dot nl>
2005-11-13 13:27:44 +00:00
rwatson
583b25a64f Remove mac_create_root_mount() and mpo_create_root_mount(), which
provided access to the root file system before the start of the
init process.  This was used briefly by SEBSD before it knew about
preloading data in the loader, and using that method to gain
access to data earlier results in fewer inconsistencies in the
approach.  Policy modules still have access to the root file system
creation event through the mac_create_mount() entry point.

Removed now, and will be removed from RELENG_6, in order to gain
third party policy dependencies on the entry point for the lifetime
of the 6.x branch.

MFC after:	3 days
Submitted by:	Chris Vance <Christopher dot Vance at SPARTA dot com>
Sponsored by:	SPARTA
2005-09-19 13:59:57 +00:00
rse
56379f0e5b Fix system shutdown timeout handling by again supporting longer running
shutdown procedures (which have a duration of more than 120 seconds).

We have two user-space affecting shutdown timeouts: a "soft" one in
/etc/rc.shutdown and a "hard" one in init(8). The first one can be
configured via /etc/rc.conf variable "rcshutdown_timeout" and defaults
to 30 seconds. The second one was originally (in 1998) intended to be
configured via sysctl(8) variable "kern.shutdown_timeout" and defaults
to 120 seconds.

Unfortunately, the "kern.shutdown_timeout" was declared "unused" in 1999
(as it obviously is actually not used within the kernel itself) and
hence was intentionally but misleadingly removed in revision 1.107 from
init_main.c. Kernel sysctl(8) variables are certainly a wrong way to
control user-space processes in general, but in this particular case the
sysctl(8) variable should have remained as it supports init(8), which
isn't passed command line flags (which in turn could have been set via
/etc/rc.conf), etc.

As there is already a similar "kern.init_path" sysctl(8) variable which
directly affects init(8), resurrect the init(8) shutdown timeout under
sysctl(8) variable "kern.init_shutdown_timeout". But this time document
it as being intentionally unused within the kernel and used by init(8).
Also document it in the manpages init(8) and rc.conf(5).

Reviewed by: phk
MFC after: 2 weeks
2005-09-15 13:16:07 +00:00
ssouhlal
efe31cd3da Fix the recent panics/LORs/hangs created by my kqueue commit by:
- Introducing the possibility of using locks different than mutexes
for the knlist locking. In order to do this, we add three arguments to
knlist_init() to specify the functions to use to lock, unlock and
check if the lock is owned. If these arguments are NULL, we assume
mtx_lock, mtx_unlock and mtx_owned, respectively.

- Using the vnode lock for the knlist locking, when doing kqueue operations
on a vnode. This way, we don't have to lock the vnode while holding a
mutex, in filt_vfsread.

Reviewed by:	jmg
Approved by:	re (scottl), scottl (mentor override)
Pointyhat to:	ssouhlal
Will be happy:	everyone
2005-07-01 16:28:32 +00:00
des
627292879c Add /rescue/init to the default init_path, before /stand/sysinstall.
MFC after:	2 weeks
2005-02-17 10:00:10 +00:00
imp
20280f1431 /* -> /*- for copyright notices, minor format tweaks as necessary 2005-01-06 23:35:40 +00:00
phk
4a639d6164 The remaining part of nmount/omount/rootfs mount changes. I cannot sensibly
split the conversion of the remaining three filesystems out from the root
mounting changes, so in one go:

cd9660:
	Convert to nmount.
	Add omount compat shims.
	Remove dedicated rootfs mounting code.
	Use vfs_mountedfrom()
	Rely on vfs_mount.c calling VFS_STATFS()

nfs(client):
	Convert to nmount (the simple way, mount_nfs(8) is still necessary).
	Add omount compat shims.
	Drop COMPAT_PRELITE2 mount arg compatibility.

ffs:
	Convert to nmount.
	Add omount compat shims.
	Remove dedicated rootfs mounting code.
	Use vfs_mountedfrom()
	Rely on vfs_mount.c calling VFS_STATFS()

Remove vfs_omount() method, all filesystems are now converted.

Remove MNTK_WANTRDWR, handling RO/RW conversions is a filesystem
task, and they all do it now.

Change rootmounting to use DEVFS trampoline:

vfs_mount.c:
	Mount devfs on /.  Devfs needs no 'from' so this is clean.
	symlink /dev to /.  This makes it possible to lookup /dev/foo.
	Mount "real" root filesystem on /.
	Surgically move the devfs mountpoint from under the real root
	filesystem onto /dev in the real root filesystem.

Remove now unnecessary getdiskbyname().

kern_init.c:
	Don't do devfs mounting and rootvnode assignment here, it was
	already handled by vfs_mount.c.

Remove now unused bdevvp(), addaliasu() and addalias().  Put the
few necessary lines in devfs where they belong.  This eliminates the
second-last source of bogo vnodes, leaving only the lemming-syncer.

Remove rootdev variable, it doesn't give meaning in a global context and
was not trustworth anyway.  Correct information is provided by
statfs(/).
2004-12-07 08:15:41 +00:00
das
130bed6547 Don't include sys/user.h merely for its side-effect of recursively
including other headers.
2004-11-27 06:51:39 +00:00
das
22907ad4ac Malloc p_stats instead of putting it in the U area. We should consider
simply embedding it in struct proc.

Reviewed by:	arch@
2004-11-20 02:28:48 +00:00
phk
418eb9d34b Allow fdinit() to be called with a NULL fdp argument so we can use
it when setting up init.

Make fdinit() lock the fdp argument as needed.
2004-11-07 12:39:28 +00:00
jhb
ce2d3f89af Rework how we store process times in the kernel such that we always store
the raw values including for child process statistics and only compute the
system and user timevals on demand.

- Fix the various kern_wait() syscall wrappers to only pass in a rusage
  pointer if they are going to use the result.
- Add a kern_getrusage() function for the ABI syscalls to use so that they
  don't have to play stackgap games to call getrusage().
- Fix the svr4_sys_times() syscall to just call calcru() to calculate the
  times it needs rather than calling getrusage() twice with associated
  stackgap, etc.
- Add a new rusage_ext structure to store raw time stats such as tick counts
  for user, system, and interrupt time as well as a bintime of the total
  runtime.  A new p_rux field in struct proc replaces the same inline fields
  from struct proc (i.e. p_[isu]ticks, p_[isu]u, and p_runtime).  A new p_crux
  field in struct proc contains the "raw" child time usage statistics.
  ruadd() has been changed to handle adding the associated rusage_ext
  structures as well as the values in rusage.  Effectively, the values in
  rusage_ext replace the ru_utime and ru_stime values in struct rusage.  These
  two fields in struct rusage are no longer used in the kernel.
- calcru() has been split into a static worker function calcru1() that
  calculates appropriate timevals for user and system time as well as updating
  the rux_[isu]u fields of a passed in rusage_ext structure.  calcru() uses a
  copy of the process' p_rux structure to compute the timevals after updating
  the runtime appropriately if any of the threads in that process are
  currently executing.  It also now only locks sched_lock internally while
  doing the rux_runtime fixup.  calcru() now only requires the caller to
  hold the proc lock and calcru1() only requires the proc lock internally.
  calcru() also no longer allows callers to ask for an interrupt timeval
  since none of them actually did.
- calcru() now correctly handles threads executing on other CPUs.
- A new calccru() function computes the child system and user timevals by
  calling calcru1() on p_crux.  Note that this means that any code that wants
  child times must now call this function rather than reading from p_cru
  directly.  This function also requires the proc lock.
- This finishes the locking for rusage and friends so some of the Giant locks
  in exit1() and kern_wait() are now gone.
- The locking in ttyinfo() has been tweaked so that a shared lock of the
  proctree lock is used to protect the process group rather than the process
  group lock.  By holding this lock until the end of the function we now
  ensure that the process/thread that we pick to dump info about will no
  longer vanish while we are trying to output its info to the console.

Submitted by:	bde (mostly)
MFC after:	1 month
2004-10-05 18:51:11 +00:00
julian
5813d27029 Refactor a bunch of scheduler code to give basically the same behaviour
but with slightly cleaned up interfaces.

The KSE structure has become the same as the "per thread scheduler
private data" structure. In order to not make the diffs too great
one is #defined as the other at this time.

The KSE (or td_sched) structure is  now allocated per thread and has no
allocation code of its own.

Concurrency for a KSEGRP is now kept track of via a simple pair of counters
rather than using KSE structures as tokens.

Since the KSE structure is different in each scheduler, kern_switch.c
is now included at the end of each scheduler. Nothing outside the
scheduler knows the contents of the KSE (aka td_sched) structure.

The fields in the ksegrp structure that are to do with the scheduler's
queueing mechanisms are now moved to the kg_sched structure.
(per ksegrp scheduler private data structure). In other words how the
scheduler queues and keeps track of threads is no-one's business except
the scheduler's. This should allow people to write experimental
schedulers with completely different internal structuring.

A scheduler call sched_set_concurrency(kg, N) has been added that
notifies teh scheduler that no more than N threads from that ksegrp
should be allowed to be on concurrently scheduled. This is also
used to enforce 'fainess' at this time so that a ksegrp with
10000 threads can not swamp a the run queue and force out a process
with 1 thread, since the current code will not set the concurrency above
NCPU, and both schedulers will not allow more than that many
onto the system run queue at a time. Each scheduler should eventualy develop
their own methods to do this now that they are effectively separated.

Rejig libthr's kernel interface to follow the same code paths as
linkse for scope system threads. This has slightly hurt libthr's performance
but I will work to recover as much of it as I can.

Thread exit code has been cleaned up greatly.
exit and exec code now transitions a process back to
'standard non-threaded mode' before taking the next step.
Reviewed by:	scottl, peter
MFC after:	1 week
2004-09-05 02:09:54 +00:00
julian
e9d9514975 Give setrunqueue() and sched_add() more of a clue as to
where they are coming from and what is expected from them.

MFC after:	2 days
2004-09-01 02:11:28 +00:00
jmg
bc1805c6e8 Add locking to the kqueue subsystem. This also makes the kqueue subsystem
a more complete subsystem, and removes the knowlege of how things are
implemented from the drivers.  Include locking around filter ops, so a
module like aio will know when not to be unloaded if there are outstanding
knotes using it's filter ops.

Currently, it uses the MTX_DUPOK even though it is not always safe to
aquire duplicate locks.  Witness currently doesn't support the ability
to discover if a dup lock is ok (in some cases).

Reviewed by:	green, rwatson (both earlier versions)
2004-08-15 06:24:42 +00:00
phk
075684f5fd Remove global variable rootdevs and rootvp, they are unused as such.
Add local rootvp variables as needed.

Remove checks for miniroot's in the swappartition.  We never did that
and most of the filesystems could never be used for that, but it had
still been copy&pasted all over the place.
2004-07-28 20:21:04 +00:00
alfred
8a1713aada Make VFS_ROOT() and vflush() take a thread argument.
This is to allow filesystems to decide based on the passed thread
which vnode to return.
Several filesystems used curthread, they now use the passed thread.
2004-07-12 08:14:09 +00:00
julian
6c9d81ae0d Nice, is a property of a process as a whole..
I mistakenly moved it to the ksegroup when breaking up the process
structure. Put it back in the proc structure.
2004-06-16 00:26:31 +00:00
phk
fdda2333fa Loudly announce WITNESS and DIAGNOSTIC options and warn about reduced
performance.
2004-02-29 16:56:54 +00:00
jhb
279b2b8278 Locking for the per-process resource limits structure.
- struct plimit includes a mutex to protect a reference count.  The plimit
  structure is treated similarly to struct ucred in that is is always copy
  on write, so having a reference to a structure is sufficient to read from
  it without needing a further lock.
- The proc lock protects the p_limit pointer and must be held while reading
  limits from a process to keep the limit structure from changing out from
  under you while reading from it.
- Various global limits that are ints are not protected by a lock since
  int writes are atomic on all the archs we support and thus a lock
  wouldn't buy us anything.
- All accesses to individual resource limits from a process are abstracted
  behind a simple lim_rlimit(), lim_max(), and lim_cur() API that return
  either an rlimit, or the current or max individual limit of the specified
  resource from a process.
- dosetrlimit() was renamed to kern_setrlimit() to match existing style of
  other similar syscall helper functions.
- The alpha OSF/1 compat layer no longer calls getrlimit() and setrlimit()
  (it didn't used the stackgap when it should have) but uses lim_rlimit()
  and kern_setrlimit() instead.
- The svr4 compat no longer uses the stackgap for resource limits calls,
  but uses lim_rlimit() and kern_setrlimit() instead.
- The ibcs2 compat no longer uses the stackgap for resource limits.  It
  also no longer uses the stackgap for accessing sysctl's for the
  ibcs2_sysconf() syscall but uses kernel_sysctl() instead.  As a result,
  ibcs2_sysconf() no longer needs Giant.
- The p_rlimit macro no longer exists.

Submitted by:	mtm (mostly, I only did a few cleanups and catchups)
Tested on:	i386
Compiled on:	alpha, amd64
2004-02-04 21:52:57 +00:00
rwatson
c294b6d48a KASSERT() that initproc->p_pid is 1. Very bad things happen if init's
pid isn't 1, and it can actually occur if kthread_create() is called
before SUB_SI_CREATE_INIT without RFHIGHPID.

Discussed with:	jhb
2004-01-16 20:29:23 +00:00
des
f270054cd3 New file descriptor allocation code, derived from similar code introduced
in OpenBSD by Niels Provos.  The patch introduces a bitmap of allocated
file descriptors which is used to locate available descriptors when a new
one is needed.  It also moves the task of growing the file descriptor table
out of fdalloc(), reducing complexity in both fdalloc() and do_dup().

Debts of gratitude are owed to tjr@ (who provided the original patch on
which this work is based), grog@ (for the gdb(4) man page) and rwatson@
(for assistance with pxeboot(8)).
2004-01-15 10:15:04 +00:00
rwatson
1c522512fd Remove the global variable 'cmask', which was used to initialize the
fd_cmask field in the file descriptor structure for the first process
indirectly from CMASK, and when an fd structure is initialized before
being filled in, and instead just use CMASK.  This appears to be an
artifact left over from the initial integration of quotas into BSD.

Suggested by:	peter
2003-10-02 03:57:59 +00:00
peter
8ecb3577d8 Add sysentvec->sv_fixlimits() hook so that we can catch cases on 64 bit
systems where the data/stack/etc limits are too big for a 32 bit process.

Move the 5 or so identical instances of ELF_RTLD_ADDR() into imgact_elf.c.

Supply an ia32_fixlimits function.  Export the clip/default values to
sysctl under the compat.ia32 heirarchy.

Have mmap(0, ...) respect the current p->p_limits[RLIMIT_DATA].rlim_max
value rather than the sysctl tweakable variable.  This allows mmap to
place mappings at sensible locations when limits have been reduced.

Have the imgact_elf.c ld-elf.so.1 placement algorithm use the same
method as mmap(0, ...) now does.

Note that we cannot remove all references to the sysctl tweakable
maxdsiz etc variables because /etc/login.conf specifies a datasize
of 'unlimited'.  And that causes exec etc to fail since it can no
longer find space to mmap things.
2003-09-25 01:10:26 +00:00
sam
59ff2ad5c7 Change instances of callout_init that specify MPSAFE behaviour to
use CALLOUT_MPSAFE instead of "1" for the second parameter.  This
does not change the behaviour; it just makes the intent more clear.
2003-08-19 17:51:11 +00:00
phk
eb30c92e49 Revert stuff which accidentally ended up in the previous commit. 2003-07-22 10:36:36 +00:00
phk
c4a9334fa6 Don't attempt to inline large functions mb_alloc() and mb_free(),
it more than doubles the text size of this file.

GCC has wisely ignored us on this previously
2003-07-22 10:24:41 +00:00
obrien
3b8fff9e4c Use __FBSDID(). 2003-06-11 00:56:59 +00:00
tegge
e41badac0a Add tracking of process leaders sharing a file descriptor table and
allow a file descriptor table to be shared between multiple process
leaders.

PR:		50923
2003-06-02 16:05:32 +00:00
jhb
89a4eb17de - Merge struct procsig with struct sigacts.
- Move struct sigacts out of the u-area and malloc() it using the
  M_SUBPROC malloc bucket.
- Add a small sigacts_*() API for managing sigacts structures: sigacts_alloc(),
  sigacts_free(), sigacts_copy(), sigacts_share(), and sigacts_shared().
- Remove the p_sigignore, p_sigacts, and p_sigcatch macros.
- Add a mutex to struct sigacts that protects all the members of the struct.
- Add sigacts locking.
- Remove Giant from nosys(), kill(), killpg(), and kern_sigaction() now
  that sigacts is locked.
- Several in-kernel functions such as psignal(), tdsignal(), trapsignal(),
  and thread_stopped() are now MP safe.

Reviewed by:	arch@
Approved by:	re (rwatson)
2003-05-13 20:36:02 +00:00
des
8ed712ead1 Instead of recording the Unix time in a process when it starts, record the
uptime.  Where necessary, convert it back to Unix time by adding boottime
to it.  This fixes a potential problem in the accounting code, which would
compute the elapsed time incorrectly if the Unix time was stepped during
the lifetime of the process.
2003-05-01 16:59:23 +00:00
jake
0827d0138a Made vmspace0 non-static. Its useful to be able to identify a vmspace as
the kernel vmspace.
2003-04-13 21:29:11 +00:00
julian
6f175a0e20 Move the _oncpu entry from the KSE to the thread.
The entry in the KSE still exists but it's purpose will change a bit
when we add the ability to lock a KSE to a cpu.
2003-04-10 17:35:44 +00:00
tjr
6ebeaa8ec8 Use the proc lock to protect p_realtimer instead of Giant, and obtain
sched_lock around accesses to p_stats->p_timer[] to avoid a potential
race with hardclock. getitimer(), setitimer() and the realitexpire()
callout are now Giant-free.
2003-02-17 10:03:02 +00:00
jeff
590a39e29b - Split the struct kse into struct upcall and struct kse. struct kse will
soon be visible only to schedulers.  This greatly simplifies much the
   KSE code.

Submitted by:	davidxu
2003-02-17 05:14:26 +00:00
des
0fb179fc08 It seems the extra precautions are no longer needed. 2003-02-13 10:05:20 +00:00
des
79e32cc4b8 Correct grammatical error in previous commit. 2003-02-04 18:47:17 +00:00
des
3038c34616 Extra precautions before trying to start init(8). 2003-02-04 18:16:50 +00:00
julian
e8efa7328e Reversion of commit by Davidxu plus fixes since applied.
I'm not convinced there is anything major wrong with the patch but
them's the rules..

I am using my "David's mentor" hat to revert this as he's
offline for a while.
2003-02-01 12:17:09 +00:00
phk
9f5bb44936 NODEVFS cleanup: remove #ifdefs. 2003-01-30 12:35:17 +00:00
davidxu
4b9b549ca2 Move UPCALL related data structure out of kse, introduce a new
data structure called kse_upcall to manage UPCALL. All KSE binding
and loaning code are gone.

A thread owns an upcall can collect all completed syscall contexts in
its ksegrp, turn itself into UPCALL mode, and takes those contexts back
to userland. Any thread without upcall structure has to export their
contexts and exit at user boundary.

Any thread running in user mode owns an upcall structure, when it enters
kernel, if the kse mailbox's current thread pointer is not NULL, then
when the thread is blocked in kernel, a new UPCALL thread is created and
the upcall structure is transfered to the new UPCALL thread. if the kse
mailbox's current thread pointer is NULL, then when a thread is blocked
in kernel, no UPCALL thread will be created.

Each upcall always has an owner thread. Userland can remove an upcall by
calling kse_exit, when all upcalls in ksegrp are removed, the group is
atomatically shutdown. An upcall owner thread also exits when process is
in exiting state. when an owner thread exits, the upcall it owns is also
removed.

KSE is a pure scheduler entity. it represents a virtual cpu. when a thread
is running, it always has a KSE associated with it. scheduler is free to
assign a KSE to thread according thread priority, if thread priority is changed,
KSE can be moved from one thread to another.

When a ksegrp is created, there is always N KSEs created in the group. the
N is the number of physical cpu in the current system. This makes it is
possible that even an userland UTS is single CPU safe, threads in kernel still
can execute on different cpu in parallel. Userland calls kse_create to add more
upcall structures into ksegrp to increase concurrent in userland itself, kernel
is not restricted by number of upcalls userland provides.

The code hasn't been tested under SMP by author due to lack of hardware.

Reviewed by: julian
2003-01-26 11:41:35 +00:00
phk
24596ddb76 Originally when DEVFS was added, a global variable "devfs_present"
was used to control code which were conditional on DEVFS' precense
since this avoided the need for large-scale source pollution with
#include "opt_geom.h"

Now that we approach making DEVFS standard, replace these tests
with an #ifdef to facilitate mechanical removal once DEVFS becomes
non-optional.

No functional change by this commit.
2003-01-19 11:03:07 +00:00
jake
a307536f90 Improve the way that an elf image activator for an alternate word size is
included in the kernel.  Include imgact_elf.c in conf/files,  instead of
both imgact_elf32.c and imgact_elf64.c, which will use the default word
size for an architecture as defined in machine/elf.h.  Architectures that
wish to build an additional image activator for an alternate word size can
include either imgact_elf32.c or imgact_elf64.c in files.${ARCH}, which
allows it to be dependent on MD options instead of solely on architecture.

Glanced at by:	peter
2003-01-04 22:07:48 +00:00
julian
dde96893c9 Add code to ddb to allow backtracing an arbitrary thread.
(show thread {address})

Remove the IDLE kse state and replace it with a change in
the way threads sahre KSEs. Every KSE now has a thread, which is
considered its "owner" however a KSE may also be lent to other
threads in the same group to allow completion of in-kernel work.
n this case the owner remains the same and the KSE will revert to the
owner when the other work has been completed.

All creations of upcalls etc. is now done from
kse_reassign() which in turn is called from mi_switch or
thread_exit(). This means that special code can be removed from
msleep() and cv_wait().

kse_release() does not leave a KSE with no thread any more but
converts the existing thread into teh KSE's owner, and sets it up
for doing an upcall. It is just inhibitted from being scheduled until
there is some reason to do an upcall.

Remove all trace of the kse_idle queue since it is no-longer needed.
"Idle" KSEs are now on the loanable queue.
2002-12-28 01:23:07 +00:00
keramida
791bba5ef9 Fix typo in comment. It's SYSINIT, not SYSINT.
Approved by:	re (murray)
2002-11-30 22:15:30 +00:00
jeff
49bf96275f - Implement a mechanism for allowing schedulers to place scheduler dependant
data in the scheduler independant structures (proc, ksegrp, kse, thread).
 - Implement unused stubs for this mechanism in sched_4bsd.

Approved by:	re
Reviewed by:	luigi, trb
Tested on:	x86, alpha
2002-11-21 01:22:38 +00:00
robert
1e0cdb534a Use strlcpy() instead of strncpy() to copy NUL terminated strings
for safety and consistency.
2002-10-17 20:03:38 +00:00
jhb
7cc0ed53c2 - Move p_cpulimit to struct proc from struct plimit and protect it with
sched_lock.  This means that we no longer access p_limit in mi_switch()
  and the p_limit pointer can be protected by the proc lock.
- Remove PRS_ZOMBIE check from CPU limit test in mi_switch().  PRS_ZOMBIE
  processes don't call mi_switch(), and even if they did there is no longer
  the danger of p_limit being NULL (which is what the original zombie check
  was added for).
- When we bump the current processes soft CPU limit in ast(), just bump the
  private p_cpulimit instead of the shared rlimit.  This fixes an XXX for
  some value of fix.  There is still a (probably benign) bug in that this
  code doesn't check that the new soft limit exceeds the hard limit.

Inspired by:	bde (2)
2002-10-09 17:17:24 +00:00
julian
6b6ba96b60 Round out the facilty for a 'bound' thread to loan out its KSE
in specific situations. The owner thread must be blocked, and the
borrower can not proceed back to user space with the borrowed KSE.
The borrower will return the KSE on the next context switch where
teh owner wants it back. This removes a lot of possible
race conditions and deadlocks. It is consceivable that the
borrower should inherit the priority of the owner too.
that's another discussion and would be simple to do.

Also, as part of this, the "preallocatd spare thread" is attached to the
thread doing a syscall rather than the KSE. This removes the need to lock
the scheduler when we want to access it, as it's now "at hand".

DDB now shows a lot mor info for threaded proceses though it may need
some optimisation to squeeze it all back into 80 chars again.
(possible JKH project)

Upcalls are now "bound" threads, but "KSE Lending" now means that
other completing syscalls can be completed using that KSE before the upcall
finally makes it back to the UTS. (getting threads OUT OF THE KERNEL is
one of the highest priorities in the KSE system.) The upcall when it happens
will present all the completed syscalls to the KSE for selection.
2002-10-09 02:33:36 +00:00
scottl
3a150bca9c Some kernel threads try to do significant work, and the default KSTACK_PAGES
doesn't give them enough stack to do much before blowing away the pcb.
This adds MI and MD code to allow the allocation of an alternate kstack
who's size can be speficied when calling kthread_create.  Passing the
value 0 prevents the alternate kstack from being created.  Note that the
ia64 MD code is missing for now, and PowerPC was only partially written
due to the pmap.c being incomplete there.
Though this patch does not modify anything to make use of the alternate
kstack, acpi and usb are good candidates.

Reviewed by:	jake, peter, jhb
2002-10-02 07:44:29 +00:00
jmallett
7a693db242 Back our kernel support for reliable signal queues.
Requested by:	rwatson, phk, and many others
2002-10-01 17:15:53 +00:00
jmallett
0341f71df1 First half of implementation of ksiginfo, signal queues, and such. This
gets signals operating based on a TailQ, and is good enough to run X11,
GNOME, and do job control.  There are some intricate parts which could be
more refined to match the sigset_t versions, but those require further
evaluation of directions in which our signal system can expand and contract
to fit our needs.

After this has been in the tree for a while, I will make in kernel API
changes, most notably to trapsignal(9) and sendsig(9), to use ksiginfo
more robustly, such that we can actually pass information with our
(queued) signals to the userland.  That will also result in using a
struct ksiginfo pointer, rather than a signal number, in a lot of
kern_sig.c, to refer to an individual pending signal queue member, but
right now there is no defined behaviour for such.

CODAFS is unfinished in this regard because the logic is unclear in
some places.

Sponsored by:	New Gold Technology
Reviewed by:	bde, tjr, jake [an older version, logic similar]
2002-09-30 20:20:22 +00:00
jake
2b71a04b1e Use the fields in the sysentvec and in the vm map header in place of the
constants VM_MIN_ADDRESS, VM_MAXUSER_ADDRESS, USRSTACK and PS_STRINGS.
This is mainly so that they can be variable even for the native abi, based
on different machine types.  Get stack protections from the sysentvec too.
This makes it trivial to map the stack non-executable for certain abis, on
machines that support it.
2002-09-21 22:07:17 +00:00
julian
c7e9e7e892 Allocate KSEs and KSEGRPs separatly and remove them from the proc structure.
next step is to allow > 1 to be allocated per process. This would give
multi-processor threads. (when the rest of the infrastructure is
in place)

While doing this I noticed libkvm and sys/kern/kern_proc.c:fill_kinfo_proc
are diverging more than they should.. corrective action needed soon.
2002-09-15 23:52:25 +00:00
julian
5702a380a5 Completely redo thread states.
Reviewed by:	davidxu@freebsd.org
2002-09-11 08:13:56 +00:00
jake
ce650f8c33 Added fields for VM_MIN_ADDRESS, PS_STRINGS and stack protections to
sysentvec.  Initialized all fields of all sysentvecs, which will allow
them to be used instead of constants in more places.  Provided stack
fixup routines for emulations that previously used the default.
2002-09-01 21:41:24 +00:00
iedowse
be17b12cb6 Split out a number of mostly VFS and signal related syscalls into
a kernel-internal kern_*() version and a wrapper that is called via
the syscall vector table. For paths and structure pointers, the
internal version either takes a uio_seg parameter or requires the
caller to copyin() the data to kernel memory as appropiate. This
will permit emulation layers to use these syscalls without having
to copy out translated arguments to the stack gap.

Discussed on:		-arch
Review/suggestions:	bde, jhb, peter, marcel
2002-09-01 20:37:28 +00:00
rwatson
647037845a Refresh the credential on the first initproc thread following divorcing
the initproc credential from the proc0 credential.  Otherwise, the
proc0 credential is used instead of initproc's credentil when authorizing
start_init() activities prior to initproc hitting userland for the
first time.  This could result in the incorrect credential being used
to authorize mounting of the root file system, which could in turn cause
problems for NFS when used in combination with uid/gid ipfw rules, or
with MAC.

Discussed with:	julian
2002-08-07 17:53:31 +00:00
rwatson
6228fca505 Introduce support for Mandatory Access Control and extensible
kernel access control.

Invoke the necessary MAC entry points to maintain labels on
mount structures.  In particular, invoke entry points for
intialization and destruction in various scenarios (root,
non-root).  Also introduce an entry point in the boot procedure
following the mount of the root file system, but prior to the
start of the userland init process to permit policies to
perform further initialization.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, NAI Labs
2002-07-31 01:11:29 +00:00
rwatson
3c013fb68e Introduce support for Mandatory Access Control and extensible
kernel access control.

Invoke the necessary MAC entry points to maintain labels on
process credentials.  In particular, invoke entry points for
the initialization and destruction of struct ucred, the copying
of struct ucred, and permit the initial labels to be set for
both process 0 (parent of all kernel processes) and process 1
(parent of all user processes).

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, NAI Labs
2002-07-31 00:39:19 +00:00
peter
cc7b2e4248 Infrastructure tweaks to allow having both an Elf32 and an Elf64 executable
handler in the kernel at the same time.  Also, allow for the
exec_new_vmspace() code to build a different sized vmspace depending on
the executable environment.  This is a big help for execing i386 binaries
on ia64.   The ELF exec code grows the ability to map partial pages when
there is a page size difference, eg: emulating 4K pages on 8K or 16K
hardware pages.

Flesh out the i386 emulation support for ia64.  At this point, the only
binary that I know of that fails is cvsup, because the cvsup runtime
tries to execute code in pages not marked executable.

Obtained from:  dfr (mostly, many tweaks from me).
2002-07-20 02:56:12 +00:00
julian
d84464f213 Thinking about it I came to the conclusion that the KSE states were incorrectly
formulated.  The correct states should be:
IDLE:  On the idle KSE list for that KSEG
RUNQ:  Linked onto the system run queue.
THREAD: Attached to a thread and slaved to whatever state the thread is in.

This means that most places where we were adjusting kse state can go away
as it is just moving around because the thread is..
The only places we need to adjust the KSE state is in transition to and from
the idle and run queues.

Reviewed by:	jhb@freebsd.org
2002-07-14 03:43:33 +00:00
mux
d6e07ac29c Remove an unused argument in vfs_mountroot(). 2002-07-03 08:52:37 +00:00
julian
aa2dc0a5d9 Part 1 of KSE-III
The ability to schedule multiple threads per process
(one one cpu) by making ALL system calls optionally asynchronous.
to come: ia64 and power-pc patches, patches for gdb, test program (in tools)

Reviewed by:	Almost everyone who counts
	(at various times, peter, jhb, matt, alfred, mini, bernd,
	and a cast of thousands)

	NOTE: this is still Beta code, and contains lots of debugging stuff.
	expect slight instability in signals..
2002-06-29 17:26:22 +00:00
trhodes
28d42899b7 More s/file system/filesystem/g 2002-05-16 21:28:32 +00:00
mux
85b0c22bf2 Convert devfs to nmount.
Reviewed by:	phk
2002-05-02 20:27:42 +00:00
rwatson
63ab78794e Divorce proc0 and proc1 credentials earlier; while this isn't technically
needed in the current code, in the MAC tree, create_init() relies on the
ability to modify the credentials present for initproc, and should not
perform that modification on a shared credential.  Pro-active diff
reduction against MAC changes that are in the queue; also facilitates
other work, including the capabilities implementation.

Submitted by:	green
Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, NAI Labs
2002-04-19 13:35:53 +00:00
mux
a207e41bef Rework the kernel environment subsystem. We now convert the static
environment needed at boot time to a dynamic subsystem when VM is
up.  The dynamic kernel environment is protected by an sx lock.

This adds some new functions to manipulate the kernel environment :
freeenv(), setenv(), unsetenv() and testenv().  freeenv() has to be
called after every getenv() when you have finished using the string.
testenv() only tests if an environment variable is present, and
doesn't require a freeenv() call. setenv() and unsetenv() are self
explanatory.

The kenv(2) syscall exports these new functionalities to userland,
mainly for kenv(1).

Reviewed by:	peter
2002-04-17 13:06:36 +00:00
jhb
db9aa81e23 Change callers of mtx_init() to pass in an appropriate lock type name. In
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.

Tested on:	i386, alpha, sparc64
2002-04-04 21:03:38 +00:00
tanimura
9ae6d1242c The description of fd_mtx is "filedesc structure." 2002-03-29 11:26:05 +00:00
jeff
dff418f166 Add a new mtx_init option "MTX_DUPOK" which allows duplicate acquires of locks
with this flag.  Remove the dup_list and dup_ok code from subr_witness.  Now
we just check for the flag instead of doing string compares.

Also, switch the process lock, process group lock, and uma per cpu locks over
to this interface.  The original mechanism did not work well for uma because
per cpu lock names are unique to each zone.

Approved by:	jhb
2002-03-27 09:23:41 +00:00
phk
0a13c0f2f6 Move the mount of the root filesystem to happen in the init process before
the exec if /sbin/init.

This allows the scheduler to get started and kthreads a chance to run
before we start filesystem operations.
2002-03-08 10:33:11 +00:00
peter
b7d611f95e Fix warning. s/microuptime()/binuptime()/ for switchtime initial value. 2002-02-26 01:03:39 +00:00
tanimura
a09da29859 Lock struct pgrp, session and sigio.
New locks are:

- pgrpsess_lock which locks the whole pgrps and sessions,
- pg_mtx which protects the pgrp members, and
- s_mtx which protects the session members.

Please refer to sys/proc.h for the coverage of these locks.

Changes on the pgrp/session interface:

- pgfind() needs the pgrpsess_lock held.

- The caller of enterpgrp() is responsible to allocate a new pgrp and
  session.

- Call enterthispgrp() in order to enter an existing pgrp.

- pgsignal() requires a pgrp lock held.

Reviewed by:	jhb, alfred
Tested on:	cvsup.jp.FreeBSD.org
		(which is a quad-CPU machine running -current)
2002-02-23 11:12:57 +00:00
phk
fa959f1afd Convert p->p_runtime and PCPU(switchtime) to bintime format. 2002-02-22 13:32:01 +00:00
julian
37369620df In a threaded world, differnt priorirites become properties of
different entities.  Make it so.

Reviewed by:	jhb@freebsd.org (john baldwin)
2002-02-11 20:37:54 +00:00
julian
b5eb64d6f0 Pre-KSE/M3 commit.
this is a low-functionality change that changes the kernel to access the main
thread of a process via the linked list of threads rather than
assuming that it is embedded in the process. It IS still embeded there
but remove all teh code that assumes that in preparation for the next commit
which will actually move it out.

Reviewed by: peter@freebsd.org, gallatin@cs.duke.edu, benno rice,
2002-02-07 20:58:47 +00:00
alfred
5e2f4cf200 Include sys/_lock.h and sys/_mutex.h to reduce namespace pollution.
Requested by: jhb
2002-01-13 21:37:49 +00:00
alfred
844237b396 SMP Lock struct file, filedesc and the global file list.
Seigo Tanimura (tanimura) posted the initial delta.

I've polished it quite a bit reducing the need for locking and
adapting it for KSE.

Locks:

1 mutex in each filedesc
   protects all the fields.
   protects "struct file" initialization, while a struct file
     is being changed from &badfileops -> &pipeops or something
     the filedesc should be locked.

1 mutex in each struct file
   protects the refcount fields.
   doesn't protect anything else.
   the flags used for garbage collection have been moved to
     f_gcflag which was the FILLER short, this doesn't need
     locking because the garbage collection is a single threaded
     container.
  could likely be made to use a pool mutex.

1 sx lock for the global filelist.

struct file *	fhold(struct file *fp);
        /* increments reference count on a file */

struct file *	fhold_locked(struct file *fp);
        /* like fhold but expects file to locked */

struct file *	ffind_hold(struct thread *, int fd);
        /* finds the struct file in thread, adds one reference and
                returns it unlocked */

struct file *	ffind_lock(struct thread *, int fd);
        /* ffind_hold, but returns file locked */

I still have to smp-safe the fget cruft, I'll get to that asap.
2002-01-13 11:58:06 +00:00
luigi
4893656ff8 Add/correct description for some sysctl variables where it was missing.
The description field is unused in -stable, so the MFC there is equivalent
to a comment. It can be done at any time, i am just setting a reminder
in 45 days when hopefully we are past 4.5-release.

MFC after: 45 days
2001-12-16 16:07:20 +00:00
jhb
21b6b26912 Overhaul the per-CPU support a bit:
- The MI portions of struct globaldata have been consolidated into a MI
  struct pcpu.  The MD per-CPU data are specified via a macro defined in
  machine/pcpu.h.  A macro was chosen over a struct mdpcpu so that the
  interface would be cleaner (PCPU_GET(my_md_field) vs.
  PCPU_GET(md.md_my_md_field)).
- All references to globaldata are changed to pcpu instead.  In a UP kernel,
  this data was stored as global variables which is where the original name
  came from.  In an SMP world this data is per-CPU and ideally private to each
  CPU outside of the context of debuggers.  This also included combining
  machine/globaldata.h and machine/globals.h into machine/pcpu.h.
- The pointer to the thread using the FPU on i386 was renamed from
  npxthread to fpcurthread to be identical with other architectures.
- Make the show pcpu ddb command MI with a MD callout to display MD
  fields.
- The globaldata_register() function was renamed to pcpu_init() and now
  init's MI fields of a struct pcpu in addition to registering it with
  the internal array and list.
- A pcpu_destroy() function was added to remove a struct pcpu from the
  internal array and list.

Tested on:	alpha, i386
Reviewed by:	peter, jake
2001-12-11 23:33:44 +00:00
jhb
46e3f92a5d Add a per-thread ucred reference for syscalls and synchronous traps from
userland.  The per thread ucred reference is immutable and thus needs no
locks to be read.  However, until all the proc locking associated with
writes to p_ucred are completed, it is still not safe to use the per-thread
reference.

Tested on:	x86 (SMP), alpha, sparc64
2001-10-26 08:12:54 +00:00
jhb
b03804752e Don't initialize proc0's mutex twice. It is already done earlier on in the
MD startup code.
2001-09-18 22:09:47 +00:00
peter
3a353f88b0 In the devfs case, have initproc attempt the easy cases of mounting /dev.
This works if /dev exists, or if / is read/write (nfsroot).  If it is
too hard, leave it up to init -d (which will probably fail if /dev does
not exist, but there isn't much else we can do short of making a union
mount on /).

This means we get a proper /dev if you boot a 5.x kernel on a 4.x world,
which I happen to do often (the ramdisks on our install netboot servers
have 4.x userland worlds on them).
2001-09-15 11:15:22 +00:00
julian
5596676e6c KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after:    ha ha ha ha
2001-09-12 08:38:13 +00:00
dillon
e028603b7e With Alfred's permission, remove vm_mtx in favor of a fine-grained approach
(this commit is just the first stage).  Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.
2001-07-04 16:20:28 +00:00
peter
f10fa038c1 With this commit, I hereby pronounce gensetdefs past its use-by date.
Replace the a.out emulation of 'struct linker_set' with something
a little more flexible.  <sys/linker_set.h> now provides macros for
accessing elements and completely hides the implementation.

The linker_set.h macros have been on the back burner in various
forms since 1998 and has ideas and code from Mike Smith (SET_FOREACH()),
John Polstra (ELF clue) and myself (cleaned up API and the conversion
of the rest of the kernel to use it).

The macros declare a strongly typed set.  They return elements with the
type that you declare the set with, rather than a generic void *.

For ELF, we use the magic ld symbols (__start_<setname> and
__stop_<setname>).  Thanks to Richard Henderson <rth@redhat.com> for the
trick about how to force ld to provide them for kld's.

For a.out, we use the old linker_set struct.

NOTE: the item lists are no longer null terminated.  This is why
the code impact is high in certain areas.

The runtime linker has a new method to find the linker set
boundaries depending on which backend format is in use.

linker sets are still module/kld unfriendly and should never be used
for anything that may be modular one day.

Reviewed by:	eivind
2001-06-13 10:58:39 +00:00
rwatson
f504530d9f o Merge contents of struct pcred into struct ucred. Specifically, add the
real uid, saved uid, real gid, and saved gid to ucred, as well as the
  pcred->pc_uidinfo, which was associated with the real uid, only rename
  it to cr_ruidinfo so as not to conflict with cr_uidinfo, which
  corresponds to the effective uid.
o Remove p_cred from struct proc; add p_ucred to struct proc, replacing
  original macro that pointed.
  p->p_ucred to p->p_cred->pc_ucred.
o Universally update code so that it makes use of ucred instead of pcred,
  p->p_ucred instead of p->p_pcred, cr_ruidinfo instead of p_uidinfo,
  cr_{r,sv}{u,g}id instead of p_*, etc.
o Remove pcred0 and its initialization from init_main.c; initialize
  cr_ruidinfo there.
o Restruction many credential modification chunks to always crdup while
  we figure out locking and optimizations; generally speaking, this
  means moving to a structure like this:
        newcred = crdup(oldcred);
        ...
        p->p_ucred = newcred;
        crfree(oldcred);
  It's not race-free, but better than nothing.  There are also races
  in sys_process.c, all inter-process authorization, fork, exec, and
  exit.
o Remove sigio->sio_ruid since sigio->sio_ucred now contains the ruid;
  remove comments indicating that the old arrangement was a problem.
o Restructure exec1() a little to use newcred/oldcred arrangement, and
  use improved uid management primitives.
o Clean up exit1() so as to do less work in credential cleanup due to
  pcred removal.
o Clean up fork1() so as to do less work in credential cleanup and
  allocation.
o Clean up ktrcanset() to take into account changes, and move to using
  suser_xxx() instead of performing a direct uid==0 comparision.
o Improve commenting in various kern_prot.c credential modification
  calls to better document current behavior.  In a couple of places,
  current behavior is a little questionable and we need to check
  POSIX.1 to make sure it's "right".  More commenting work still
  remains to be done.
o Update credential management calls, such as crfree(), to take into
  account new ruidinfo reference.
o Modify or add the following uid and gid helper routines:
      change_euid()
      change_egid()
      change_ruid()
      change_rgid()
      change_svuid()
      change_svgid()
  In each case, the call now acts on a credential not a process, and as
  such no longer requires more complicated process locking/etc.  They
  now assume the caller will do any necessary allocation of an
  exclusive credential reference.  Each is commented to document its
  reference requirements.
o CANSIGIO() is simplified to require only credentials, not processes
  and pcreds.
o Remove lots of (p_pcred==NULL) checks.
o Add an XXX to authorization code in nfs_lock.c, since it's
  questionable, and needs to be considered carefully.
o Simplify posix4 authorization code to require only credentials, not
  processes and pcreds.  Note that this authorization, as well as
  CANSIGIO(), needs to be updated to use the p_cansignal() and
  p_cansched() centralized authorization routines, as they currently
  do not take into account some desirable restrictions that are handled
  by the centralized routines, as well as being inconsistent with other
  similar authorization instances.
o Update libkvm to take these changes into account.

Obtained from:	TrustedBSD Project
Reviewed by:	green, bde, jhb, freebsd-arch, freebsd-audit
2001-05-25 16:59:11 +00:00
jhb
87198ba253 - Lock the VM when initializing the vmspace for proc0.
- Don't bother releasing Giant while doing a lookup on the vm_map of
  initproc while starting up init.  We have to grab it again right after
  the lookup anyways.
2001-05-23 22:06:47 +00:00
alfred
a3f0842419 Introduce a global lock for the vm subsystem (vm_mtx).
vm_mtx does not recurse and is required for most low level
vm operations.

faults can not be taken without holding Giant.

Memory subsystems can now call the base page allocators safely.

Almost all atomic ops were removed as they are covered under the
vm mutex.

Alpha and ia64 now need to catch up to i386's trap handlers.

FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).

Reviewed (partially) by: jake, jhb
2001-05-19 01:28:09 +00:00
jhb
59ffccfbd6 - Move the setting of bootverbose to a MI SI_SUB_TUNABLES SYSINIT.
- Attach a writable sysctl to bootverbose (debug.bootverbose) so it can be
  toggled after boot.
- Move the printf of the version string to a SI_SUB_COPYRIGHT SYSINIT just
  afer the display of the copyright message instead of doing it by hand in
  three MD places.
2001-05-17 22:28:46 +00:00
grog
4b9d9cbaac Revert consequences of changes to mount.h, part 2.
Requested by:	bde
2001-04-29 02:45:39 +00:00
grog
1f5de30718 Correct #includes to work with fixed sys/mount.h. 2001-04-23 09:05:15 +00:00
jhb
3588cc574a Stick proc0 in the PID hash table. 2001-04-11 18:50:50 +00:00
jhb
79cf991a6b Convert the allproc and proctree locks from lockmgr locks to sx locks. 2001-03-28 11:52:56 +00:00
obrien
1814a78315 Do not set a default ELF syscall ABI fallback.
If one runs an un-branded Linux static binary that calls Linux's fcntl
the machine will reboot when interupted by the FreeBSD syscall ABI.
2001-03-04 11:58:50 +00:00
iedowse
63324ae546 The kernel did not hold a vnode reference associated with the
`rootvnode' pointer, but vfs_syscalls.c's checkdirs() assumed that
it did. This bug reliably caused a panic at reboot time if any
filesystem had been mounted directly over /.

The checkdirs() function is called at mount time to find any process
fd_cdir or fd_rdir pointers referencing the covered mountpoint
vnode. It transfers these to point at the root of the new filesystem.
However, this process was not reversed at unmount time, so processes
with a cwd/root at a mount point would unexpectedly lose their
cwd/root following a mount-unmount cycle at that mountpoint.

This change should fix both of the above issues. Start_init() now
holds an extra vnode reference corresponding to `rootvnode', and
dounmount() releases this reference when the root filesystem is
unmounted just before reboot. Dounmount() now undoes the actions
taken by checkdirs() at mount time; any process cdir/rdir pointers
that reference the root vnode of the unmounted filesystem are
transferred to the now-uncovered vnode.

Reviewed by:	bde, phk
2001-02-28 20:54:28 +00:00
jake
623f06756a Sigh. Try to get priorities sorted out. Don't bother trying to
update native priority, it is diffcult to get right and likely
to end up horribly wrong.  Use an honestly wrong fixed value
that seems to work; PUSER for user threads, and the interrupt
priority for ithreads.  Set it once when the process is created
and forget about it.

Suggested by:	bde
Pointy hat:	me
2001-02-28 02:53:44 +00:00
jake
3d95b58cac Initialize native priority to PRI_MAX. It was usually 0 which made a
process's priority go through the roof when it released a (contested)
mutex.  Only set the native priority in mtx_lock if hasn't already
been set.

Reviewed by:	jhb
2001-02-26 23:27:35 +00:00
jhb
a97d568931 It turns out the kernel console works fine and thus doesn't need quite this
much extra testing.
2001-02-24 03:40:23 +00:00
peter
c4ff4fe19e Stricter style(9) conformance - remove unnecessary blank lines in previous
commit.
2001-02-23 23:05:46 +00:00
jhb
1e88ca386b Test out the kernel console just before launching the AP's. 2001-02-23 19:44:25 +00:00
rwatson
ab5676fc87 o Move per-process jail pointer (p->pr_prison) to inside of the subject
credential structure, ucred (cr->cr_prison).
o Allow jail inheritence to be a function of credential inheritence.
o Abstract prison structure reference counting behind pr_hold() and
  pr_free(), invoked by the similarly named credential reference
  management functions, removing this code from per-ABI fork/exit code.
o Modify various jail() functions to use struct ucred arguments instead
  of struct proc arguments.
o Introduce jailed() function to determine if a credential is jailed,
  rather than directly checking pointers all over the place.
o Convert PRISON_CHECK() macro to prison_check() function.
o Move jail() function prototypes to jail.h.
o Emulate the P_JAILED flag in fill_kinfo_proc() and no longer set the
  flag in the process flags field itself.
o Eliminate that "const" qualifier from suser/p_can/etc to reflect
  mutex use.

Notes:

o Some further cleanup of the linux/jail code is still required.
o It's now possible to consider resolving some of the process vs
  credential based permission checking confusion in the socket code.
o Mutex protection of struct prison is still not present, and is
  required to protect the reference count plus some fields in the
  structure.

Reviewed by:	freebsd-arch
Obtained from:	TrustedBSD Project
2001-02-21 06:39:57 +00:00
jake
55d5108ac5 Implement a unified run queue and adjust priority levels accordingly.
- All processes go into the same array of queues, with different
  scheduling classes using different portions of the array.  This
  allows user processes to have their priorities propogated up into
  interrupt thread range if need be.
- I chose 64 run queues as an arbitrary number that is greater than
  32.  We used to have 4 separate arrays of 32 queues each, so this
  may not be optimal.  The new run queue code was written with this
  in mind; changing the number of run queues only requires changing
  constants in runq.h and adjusting the priority levels.
- The new run queue code takes the run queue as a parameter.  This
  is intended to be used to create per-cpu run queues.  Implement
  wrappers for compatibility with the old interface which pass in
  the global run queue structure.
- Group the priority level, user priority, native priority (before
  propogation) and the scheduling class into a struct priority.
- Change any hard coded priority levels that I found to use
  symbolic constants (TTIPRI and TTOPRI).
- Remove the curpriority global variable and use that of curproc.
  This was used to detect when a process' priority had lowered and
  it should yield.  We now effectively yield on every interrupt.
- Activate propogate_priority().  It should now have the desired
  effect without needing to also propogate the scheduling class.
- Temporarily comment out the call to vm_page_zero_idle() in the
  idle loop.  It interfered with propogate_priority() because
  the idle process needed to do a non-blocking acquire of Giant
  and then other processes would try to propogate their priority
  onto it.  The idle process should not do anything except idle.
  vm_page_zero_idle() will return in the form of an idle priority
  kernel thread which is woken up at apprioriate times by the vm
  system.
- Update struct kinfo_proc to the new priority interface.  Deliberately
  change its size by adjusting the spare fields.  It remained the same
  size, but the layout has changed, so userland processes that use it
  would parse the data incorrectly.  The size constraint should really
  be changed to an arbitrary version number.  Also add a debug.sizeof
  sysctl node for struct kinfo_proc.
2001-02-12 00:20:08 +00:00
jhb
6e847a265b Move the initailization of the proc lock for proc0 very early into the MD
startup code.
2001-02-09 16:25:16 +00:00
bmilekic
f364d4ac36 Change and clean the mutex lock interface.
mtx_enter(lock, type) becomes:

mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)

similarily, for releasing a lock, we now have:

mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.

The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.

Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:

MTX_QUIET and MTX_NOSWITCH

The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:

mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.

Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.

Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.

Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.

Finally, caught up to the interface changes in all sys code.

Contributors: jake, jhb, jasone (in no particular order)
2001-02-09 06:11:45 +00:00
jhb
eef4e5d2f9 - Catch up to p_sflag changes.
- The MD code now initializes proc0.p_heldmtx, proc0.p_contested, and
  curproc.
- The MD code calls here with Giant already held.
- Proc locking.
2001-01-24 10:40:56 +00:00
jake
4f5d8ed825 Use PCPU_GET, PCPU_PTR and PCPU_SET to access all per-cpu variables
other then curproc.
2001-01-10 04:43:51 +00:00
jake
a4ad237eaa - Change the allproc_lock to use a macro, ALLPROC_LOCK(how), instead
of explicit calls to lockmgr.  Also provides macros for the flags
  pased to specify shared, exclusive or release which map to the
  lockmgr flags.  This is so that the use of lockmgr can be easily
  replaced with optimized reader-writer locks.
- Add some locking that I missed the first time.
2000-12-13 00:17:05 +00:00
jhb
fb0ab4bf94 - Add a mutex to the proc structure p_mtx that will be used to lock accesses
to each individual proc.
- Initialize the lock during fork1(), and destroy it in wait1().
2000-12-03 01:22:34 +00:00
jake
95fb73d495 Use an mp-safe callout for endtsleep. 2000-12-01 04:55:52 +00:00
jake
9326f655fc Use callout_reset instead of timeout(9). Most callouts are statically
allocated, 2 have been added to struct proc for setitimer and sleep.

Reviewed by:	jhb, jlemon
2000-11-27 22:52:31 +00:00
jake
0c0be4e826 Protect the following with a lockmgr lock:
allproc
	zombproc
	pidhashtbl
	proc.p_list
	proc.p_hash
	nextpid

Reviewed by:	jhb
Obtained from:	BSD/OS and netbsd
2000-11-22 07:42:04 +00:00
obrien
c3b27201d1 ELF kernels should use an ELF sysvec. This allows us to move a.out
specific files to those platforms that acutally support a.out.
2000-11-05 10:41:35 +00:00
jhb
d944886e4d Catch up to moving headers:
- machine/ipl.h -> sys/ipl.h
- machine/mutex.h -> sys/mutex.h
2000-10-20 07:58:15 +00:00
jhb
ad31d928a6 Release Giant before starting up init.
Submitted by:	jake
2000-09-15 19:25:29 +00:00
dfr
d721d5e1d6 Move the include of <sys/systm.h> so that KTR gets a declaration for
snprintf().
2000-09-10 13:54:52 +00:00
jasone
769e0f974d Major update to the way synchronization is done in the kernel. Highlights
include:

* Mutual exclusion is used instead of spl*().  See mutex(9).  (Note: The
  alpha port is still in transition and currently uses both.)

* Per-CPU idle processes.

* Interrupts are run in their own separate kernel threads and can be
  preempted (i386 only).

Partially contributed by:	BSDi (BSD/OS)
Submissions by (at least):	cp, dfr, dillon, grog, jake, jhb, sheldonh
2000-09-07 01:33:02 +00:00
truckman
0575d8a3f9 Remove uidinfo hash table lookup and maintenance out of chgproccnt() and
chgsbsize(), which are called rather frequently and may be called from an
interrupt context in the case of chgsbsize().  Instead, do the hash table
lookup and maintenance when credentials are changed, which is a lot less
frequent.  Add pointers to the uidinfo structures to the ucred and pcred
structures for fast access.  Pass a pointer to the credential to chgproccnt()
and chgsbsize() instead of passing the uid.  Add a reference count to the
uidinfo structure and use it to decide when to free the structure rather
than freeing the structure when the resource consumption drops to zero.
Move the resource tracking code from kern_proc.c to kern_resource.c.  Move
some duplicate code sequences in kern_prot.c to separate helper functions.
Change KASSERTs in this code to unconditional tests and calls to panic().
2000-09-05 22:11:13 +00:00
phk
e47f61e183 Avoid the modules madness I inadvertently introduced by making the
cloning infrastructure standard in kern_conf.  Modules are now
the same with or without devfs support.

If you need to detect if devfs is present, in modules or elsewhere,
check the integer variable "devfs_present".

This happily removes an ugly hack from kern/vfs_conf.c.

This forces a rename of the eventhandler and the standard clone
helper function.

Include <sys/eventhandler.h> in <sys/conf.h>: it's a helper #include
like <sys/queue.h>

Remove all #includes of opt_devfs.h they no longer matter.
2000-09-02 19:17:34 +00:00
phk
b648921acc Remove all traces of Julians DEVFS (incl from kern/subr_diskslice.c)
Remove old DEVFS support fields from dev_t.

  Make uid, gid & mode members of dev_t and set them in make_dev().

  Use correct uid, gid & mode in make_dev in disk minilayer.

  Add support for registering alias names for a dev_t using the
  new function make_dev_alias().  These will show up as symlinks
  in DEVFS.

  Use makedev() rather than make_dev() for MFSs magic devices to prevent
  DEVFS from noticing this abuse.

  Add a field for DEVFS inode number in dev_t.

  Add new DEVFS in fs/devfs.

  Add devfs cloning to:
        disk minilayer (ie: ad(4), sd(4), cd(4) etc etc)
        md(4), tun(4), bpf(4), fd(4)

  If DEVFS add -d flag to /sbin/inits args to make it mount devfs.

  Add commented out DEVFS to GENERIC
2000-08-20 21:34:39 +00:00
peter
85c9a2ddc1 Clean up some low level bootstrap code:
- stop using the evil 'struct trapframe' argument for mi_startup()
  (formerly main()).  There are much better ways of doing it.
- do not use prepare_usermode() - setregs() in execve() will do it
  all for us as long as the p_md.md_regs pointer is set.  (which is
  now done in machdep.c rather than init_main.c.  The Alpha port did it
  this way all along and is much cleaner).
- collect all the magic %cr0 etc register settings into one place and
  have the AP's call that instead of using magic numbers (!!) that keep
  changing over and over again.
- Make it safe to call kthread_create() earlier, including during the
  device probe sequence.  It doesn't need the callback mechanism that
  NetBSD's version uses.
- kthreads created this way are root-less as they exist before the root
  filesystem is mounted.  init(1) is set up so that it aquires the root
  pointers prior to running.  If other kthreads want filesystem acccess
  we can make this code more generic.
- set all threads start times once we have decided what time it is.
- init uses a trampoline rather than the evil prepare_usermode() hack.
- kern_descrip.c has a couple of tweaks to deal with forking when there
  is no rootdir or cwd etc.
- adjust the early SYSINIT() sequence so that a few prereqisites are in
  place. eg: make sure the run queue is initialized before doing forks.

With this, the USB code can easily create a kthread to do the device
tree discovery.  (I have tested it, it works nicely).

There are still some open issues before this is truely useful.
- tsleep() does not like working before the clock is running.  It
  sort-of tries to spin wait, but it can do more useful things now.
- stopping a kthread in kld code at unload time is "interesting" but
  we have a solution for that.

The Alpha code needs no changes for this.  It already uses pretty much the
same strategies, but a little cleaner.
2000-08-11 09:05:12 +00:00
peter
ee51d18b57 Fix the SYSINIT() bubble sort. This was fixed in kern_linker.c already. 2000-08-02 21:05:21 +00:00
markm
137a7a4f8d Remove no-longer-relevant comment. 2000-06-25 10:14:06 +00:00
alfred
7f71a1a091 fix races in the uidinfo subsystem, several problems existed:
1) while allocating a uidinfo struct malloc is called with M_WAITOK,
   it's possible that while asleep another process by the same user
   could have woken up earlier and inserted an entry into the uid
   hash table.  Having redundant entries causes inconsistancies that
   we can't handle.

   fix: do a non-waiting malloc, and if that fails then do a blocking
   malloc, after waking up check that no one else has inserted an entry
   for us already.

2) Because many checks for sbsize were done as "test then set" in a non
   atomic manner it was possible to exceed the limits put up via races.

   fix: instead of querying the count then setting, we just attempt to
   set the count and leave it up to the function to return success or
   failure.

3) The uidinfo code was inlining and repeating, lookups and insertions
   and deletions needed to be in their own functions for clarity.

Reviewed by: green
2000-06-22 22:27:16 +00:00
jkh
71756b1d36 Add new oid, debug.boothowto. This allows userland apps to see
how the kernel was booted and perhaps do conditional things
based upon it (sysinstall, for example, will now turn Debug mode
on automatically if boot -v was done).

Submitted by:	msmith
Suggested by:	ulf
2000-02-25 11:43:08 +00:00
grog
1af318cfbc If we fail to find init, print out the search path used. This helps
differentiate between one of three different scenarios:

1.  No init.
2.  Path to init munged by an incorrect loader configuration.
3.  Root file system not mounted.

Reviewed-by:  billf
1999-12-20 02:50:49 +00:00
phk
1adcecffd9 struct mountlist and struct mount.mnt_list have no business being
a CIRCLEQ.  Change them to TAILQ_HEAD and TAILQ_ENTRY respectively.

This removes ugly  mp != (void*)&mountlist  comparisons.

Requested by:   phk
Submitted by:   Jake Burkholder jake@checker.org
PR:             14967
1999-11-20 10:00:46 +00:00
phk
8fca18de89 This is a partial commit of the patch from PR 14914:
Alot of the code in sys/kern directly accesses the *Q_HEAD and *Q_ENTRY
   structures for list operations.  This patch makes all list operations
   in sys/kern use the queue(3) macros, rather than directly accessing the
   *Q_{HEAD,ENTRY} structures.

This batch of changes compile to the same object files.

Reviewed by:    phk
Submitted by:   Jake Burkholder <jake@checker.org>
PR:     14914
1999-11-16 10:56:05 +00:00
msmith
6270a8f009 swapinit isn't called from vfs_mountroot, so don't complain about it in
a #if 0'ed comment.

Call the machine-dependant cpu_rootconf functions from sysinits in their
respective areas, don't do it from a stub here.
1999-11-01 23:53:27 +00:00
phk
8e3c3eafed useracc() the prequel:
Merge the contents (less some trivial bordering the silly comments)
of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>.  This puts
the #defines for the vm_inherit_t and vm_prot_t types next to their
typedefs.

This paves the road for the commit to follow shortly: change
useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE}
as argument.
1999-10-29 18:09:36 +00:00
peter
787140aa42 Trim unused options (or #ifdef for undoc options).
Submitted by:	phk
1999-10-11 15:19:12 +00:00
bde
10335a428d Moved the definition of `boottime' and its sysctl to the correct file. 1999-09-13 14:22:27 +00:00
peter
3b842d34e8 $Id$ -> $FreeBSD$ 1999-08-28 01:08:13 +00:00
phk
663cbe4fc2 Convert DEVFS hooks in (most) drivers to make_dev().
Diskslice/label code not yet handled.

Vinum, i4b, alpha, pc98 not dealt with (left to respective Maintainers)

Add the correct hook for devfs to kern_conf.c

The net result of this excercise is that a lot less files depends on DEVFS,
and devtoname() gets more sensible output in many cases.

A few drivers had minor additional cleanups performed relating to cdevsw
registration.

A few drivers don't register a cdevsw{} anymore, but only use make_dev().
1999-08-23 20:59:21 +00:00
peter
6cb5fe6c6b Slight reorganization of kernel thread/process creation. Instead of using
SYSINIT_KT() etc (which is a static, compile-time procedure), use a
NetBSD-style kthread_create() interface.  kproc_start is still available
as a SYSINIT() hook.  This allowed simplification of chunks of the
sysinit code in the process.  This kthread_create() is our old kproc_start
internals, with the SYSINIT_KT fork hooks grafted in and tweaked to work
the same as the NetBSD one.

One thing I'd like to do shortly is get rid of nfsiod as a user initiated
process.  It makes sense for the nfs client code to create them on the
fly as needed up to a user settable limit.  This means that nfsiod
doesn't need to be in /sbin and is always "available".  This is a fair bit
easier to do outside of the SYSINIT_KT() framework.
1999-07-01 13:21:46 +00:00
peter
a9c3f31bb0 Slight tweak to fork1() calling conventions. Add a third argument so
the caller can easily find the child proc struct.  fork(), rfork() etc
syscalls set p->p_retval[] themselves.  Simplify the SYSINIT_KT() code
and other kernel thread creators to not need to use pfind() to find the
child based on the pid.  While here, partly tidy up some of the fork1()
code for RF_SIGSHARE etc.
1999-06-30 15:33:41 +00:00
jb
154417badf Use colons instead of semi-colons to behave like UNIX instead of DOS.
Suggested by: bde
1999-05-11 10:08:10 +00:00
peter
2bc4e0eea2 Lites2 seems to have pretty much disappeared from the radar, and I suspect
far more than this hack would be needed now..
1999-05-09 20:42:45 +00:00