__mac_get_pid Retrieve MAC label of a process by pid
Similar to __mac_get_proc() except that the target process of
the operation is explicitly specified rather than assuming
curthread.
__mac_get_link Retrieve MAC label of a path with NOFOLLOW
__mac_set_link Set MAC label of a path with NOFOLLOW
extattr_set_link Set EAs on a path with NOFOLLOW
extattr_get_link Retrieve EAs on a path with NOFOLLOW
extattr_delete_link Delete EAs on a path with NOFOLLOW
These calls are similar to __mac_get_file(), __mac_set_file(),
extattr_set_file(), extattr_get_file(), and extattr_delete_file(),
except that they do not follow symlinks. The distinction between
these calls is similar to lchown() vs chown().
Implementations to follow.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
I've added a structure, kernel-private, to represent a pending or in-delivery
signal, called `ksiginfo'. It is roughly analogous to the basic information
that is exported by the POSIX interface 'siginfo_t', but more basic. I've
added functions to allocate these structures, and further to wrap all signal
operations using them.
Once the operations are wrapped, I've added a TailQ (see queue(3)) of these
structures to 'struct proc', and all pending signals are in that TailQ. When
a signal is being delivered, it is dequeued from the list. Once I finish
the spreading of ksiginfo throughout the tree, the dequeued structure will be
delivered to the process in question, whereas currently and normally, the
signal number is what is used.
has exceeded its CPU time limit.
- In mi_switch(), set PS_XCPU when the CPU time limit is exceeded.
- Perform actual CPU time limit exceeded work in ast() when PS_XCPU is set.
Requested by: many
interlock in getnewvnode() to avoid possible sleeps while holding
the mutex. Note that the warning from Witness is a slight false
positive since we know there will be no contention on the interlock
since we haven't made the vnode available for use yet, but the theory
is not a bad one.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
gets signals operating based on a TailQ, and is good enough to run X11,
GNOME, and do job control. There are some intricate parts which could be
more refined to match the sigset_t versions, but those require further
evaluation of directions in which our signal system can expand and contract
to fit our needs.
After this has been in the tree for a while, I will make in kernel API
changes, most notably to trapsignal(9) and sendsig(9), to use ksiginfo
more robustly, such that we can actually pass information with our
(queued) signals to the userland. That will also result in using a
struct ksiginfo pointer, rather than a signal number, in a lot of
kern_sig.c, to refer to an individual pending signal queue member, but
right now there is no defined behaviour for such.
CODAFS is unfinished in this regard because the logic is unclear in
some places.
Sponsored by: New Gold Technology
Reviewed by: bde, tjr, jake [an older version, logic similar]
from stopping another thread from completing a syscall, and this allows it to
release its resources etc. Probably more related commits to follow (at least
one I know of)
Initial concept by: julian, dillon
Submitted by: davidxu
device_t the same throughout kernel.
This is a very fine point of C which fortunatly does not make any
difference in normal circumstances but which due to the pervasiveness
of device_t in the kernel can make a lint barf a lot.
sparc v9 ABI. The Elf_Rela records for local symbols appear to already
have the symbol's value added in to the addend field, even though the ABI
specifies we need to lookup the symbol and add its value too. This breaks
text relocations in klds because the symbol's value is added twice, and
the resulting address points off into nowhere land, so for now just use
the addend.
Tested by: rwatson
if they are not going to cross over themselves. Also change how the list of
completed user threads is tracked and passed to the KSE. This is not
a change in design but rather the implementation of what was originally
envisionned.
- Make the VI asserts more orthogonal to the rest of the asserts by using a
new, common vfs_badlock() function and adding a 'str' arg.
- Adjust generated ASSERTS to match the new prototype.
- Adjust explicit ASSERTS to match the new prototype.
- Enable vfs_badlock_mutex by default.
- Assert that the vp is locked in VOP_UNLOCK.
- Use standard interlock macros in remaining code.
- Correct a race in getnewvnode().
- Lock access to v_numoutput with interlock.
- Lock access to buf lists and splay tree with interlock.
- Add VOP and VI asserts.
- Lock b_vnbufs with the vnode interlock.
- Add vrefcnt() for callers who want to retreive the vnode ref without
holding a lock. Add a comment that describes when this is safe.
- Add vholdl() and vdropl() so that callers who already own the interlock
can avoid race conditions and unnecessary unlocking.
- Move the VOP_GETATTR() in vflush() into the WRITECLOSE conditional case.
- Hold the interlock before droping the mntlist_mtx in vflush() to avoid
a race.
- Fix locking in vfs_msync().
ALQ (Asynch. Logging Queues). ALQ supports many seperate queues with
different record and buffer sizes. It opens and logs to any vnode so
it can be used with character devices as well as regular files.
Reviewed in part by: phk, jake, markm
constants VM_MIN_ADDRESS, VM_MAXUSER_ADDRESS, USRSTACK and PS_STRINGS.
This is mainly so that they can be variable even for the native abi, based
on different machine types. Get stack protections from the sysentvec too.
This makes it trivial to map the stack non-executable for certain abis, on
machines that support it.
and predictable way, and I apologize if I have gotten it wrong anywhere,
getting prior review on a patch like this is not feasible, considering
the number of people involved and hardware availability etc.)
If struct disklabel is the messenger: kill the messenger.
Inside struct disk we had a struct disklabel which disk drivers used to
communicate certain metrics to the disklayer above (GEOM or the disk
mini-layer). This commit changes this communication to use four
explicit fields instead.
Amongst the benefits is that the fields do not get overwritten by
wrong or bogus on-disk disklabels.
Once that is clear, <sys/disk.h> which is included in the drivers
no longer need to pull <sys/disklabel.h> and <sys/diskslice.h> in,
the few places that needs them, have gotten explicit #includes for
them.
The disklabel inside struct disk is now only for internal use in
the disk mini-layer, so instead of embedding it, we malloc it as
we need it.
This concludes (modulus any mistakes) the series of disklabel related
commits.
I belive it all amounts to a NOP for all the rest of you :-)
Sponsored by: DARPA & NAI Labs.
function were put in i386/i386/machdep.c from where it has been
cut and pasted to other architectures with only minor corruption.
Disklabel is really a MI format in many ways, at least it certainly
is when you operate on struct disklabel.
Put bounds_check_with_label() back in subr_disklabel.c where it belongs.
Sponsored by: DARPA & NAI Labs.
Rename bioqdisksort() to bioq_disksort().
Keep a #define around to avoid changing all diskdrivers right now.
Move it from subr_disklabel.c to subr_disk.c.
Move prototype from <sys/disklabel.h> to <sys/bio.h>
Sponsored by: DARPA and NAI Labs.
Rename diskerr() to disk_err() for naming consistency.
Drop the by now entirely useless struct disklabel argument.
Add a flag argument for new-line termination.
Fix a couple of printf-format-casts to %j instead of %l.
Correctly print the name of all bio commands.
Move the function from subr_disklabel.c to subr_disk.c,
and from <sys/disklabel.h> to <sys/disk.h>.
Use the new disk_err() throughout, #include <sys/disk.h> as needed.
Bump __FreeBSD_version for the sake of the aac disk drivers #ifdefs.
Remove unused disklabel members of softc for aac, amr and mlx, which seem
to originally have been intended for diskerr() use, but which only rotted
and got Copy&Pasted at least two times to many.
Sponsored by: DARPA & NAI Labs.
the kernel. By default this is turned off since otherwise it could scroll
valuable panic messages off of the screen. This option can be turned on
by the DDB_TRACE kernel option as well as the debug.trace_on_panic sysctl.
Also, fix the DDB_UNATTENDED option to use its own header instead of
abusing opt_ddb.h. This way turning that one option on or off doesn't
force you to recompile all of ddb.
Requested by: many (1), bde (2*)
* - I know bde prefers !abusing option headers in general but can't
remember if he as brought up this specific case.
wasn't doing. Rather than just lock and unlock the vnode around the call
to VOP_FSYNC(), implement rwatson's suggestion to lock the file vnode
in kern_link() before calling VOP_LINK(), since the other filesystems
also locked the file vnode right away in their link methods. Remove the
locking and and unlocking from the leaf filesystem link methods.
Reviewed by: rwatson, bde (except for the unionfs_link() changes)
header and return that length, was misguided.
The check itself didn't take into account the fact that the
mbuf pointer pased in may be null, and the function is
defined specifically for cases where the caller knows what it wants.
Rather than fix the check I'm removing it as phk suggested.
Submitted by: phk@freebsd.org
Option 'P1003_1B_SEMAPHORES' to compile them in, or load the "sem" module
to activate them.
Have kern/makesyscalls.sh emit an include for sys/_semaphore.h into sysproto.h
to pull in the typedef for semid_t.
Add the syscalls to the syscall table as module stubs.
that are shareable between processes.
There will be a cleanup shortly along with the necessary changes made to
libc, libc_r, libpthread as well as the hooks into sys/conf and sys/modules.
- Remove all instances of the mallochash.
- Stash the slab pointer in the vm page's object pointer when allocating from
the kmem_obj.
- Use the overloaded object pointer to find slabs for malloced memory.
under way to move the remnants of the a.out toolchain to ports. As the
comment in src/Makefile said, this stuff is deprecated and one should not
expect this to remain beyond 4.0-REL. It has already lasted WAY beyond
that.
Notable exceptions:
gcc - I have not touched the a.out generation stuff there.
ldd/ldconfig - still have some code to interface with a.out rtld.
old as/ld/etc - I have not removed these yet, pending their move to ports.
some includes - necessary for ldd/ldconfig for now.
Tested on: i386 (extensively), alpha
- Use ucontext_t's to store KSE thread state.
- Synthesize state for the UTS upon each upcall, rather than
saving and copying a trapframe.
- Deliver signals to KSE-aware processes via upcall.
- Rename kse mailbox structure fields to be more BSD-like.
- Store the UTS's stack in struct proc in a stack_t.
Reviewed by: bde, deischen, julian
Approved by: -arch
next step is to allow > 1 to be allocated per process. This would give
multi-processor threads. (when the rest of the infrastructure is
in place)
While doing this I noticed libkvm and sys/kern/kern_proc.c:fill_kinfo_proc
are diverging more than they should.. corrective action needed soon.
functions. We add pnpinfo, locationinfo, devflags (the newbus flags
on the device), flags (the flags that device_get_flags returns) and
state to the list of things we return.
pnpinfo and locationinfo are place holders at the moment that will be
filled in by the device's parent (optionally). Userland programs will
likely use this information from time to time and take appropriate
actions.
Improvements to devinfo to follow.
recursion when closef() calls pfind() which also wants the proc lock.
This case only occurred when setugidsafety() needed to close unsafe files.
Reviewed by: truckman
v_tag is now const char * and should only be used for debugging.
Additionally:
1. All users of VT_NTS now check vfsconf->vf_type VFCF_NETWORK
2. The user of VT_PROCFS now checks for the new flag VV_PROCDEP, which
is propagated by pseudofs to all child vnodes if the fs sets PFS_PROCDEP.
Suggested by: phk
Reviewed by: bde, rwatson (earlier version)
Introduce biowait() function. Currently there is a race condition and the
mitigation is a timeout/retry. It is not obvious what kind of locking (if any)
is suitable for BIO_DONE, since the majority of users take are of this
themselves, and only a few places actually rely on the wakeup.
Sponsored by: DARPA & NAI Labs.
to be dropped, 2) attempting to lock acct_mtx while already holding it.
Sorry to those who experienced pain.
- Added two comments referring to two areas in which acct_mtx is held over
vnode operations that might sleep. Patch in the works for this.
request structure.
- Re-optimize the case of utrace being disabled by doing an explicit
KTRPOINT check instead of relying on the one in ktr_getrequest() so that
we don't waste time on a malloc in the non-tracing case.
- Change utrace() to return an error if the copyin() fails. Before it
would just ignore the request but still return success. This last is
a change in behavior and can be backed out if necessary.
transfer to a malloc'd buffer and use that bufer for the ktrace event.
This means that genio ktrace events no longer need to be synchronous.
- Now that ktr_buffer isn't overloaded to sometimes point to a cached uio
pointer for genio requests and always points to a malloc'd buffer if not
NULL, free the buffer in ktr_freerequest() instead of in
ktr_writerequest(). This closes a memory leak for ktrace events that
used a malloc'd buffer that had their vnode ripped out from under them
while they were on the todo list.
Suggested by: bde (1, in principle)
- Rename kern.ktrace_request_pool tunable/sysctl to
kern.ktrace.request_pool.
- Add a variable to control the max amount of data to log for genio events.
This variable is tunable via the tunable/sysctl kern.ktrace.genio_size
and defaults to one page.
Changed rename(2) to follow the letter of the POSIX spec. POSIX
requires rename() to have no effect if its args "resolve to the same
existing file". I think "file" can only reasonably be read as referring
to the inode, although the rationale and "resolve" seem to say that
sameness is at the level of (resolved) directory entries.
ext2fs_vnops.c, ufs_vnops.c:
Replaced code that gave the historical BSD behaviour of removing one
link name by checks that this code is now unreachable. This fixes
some races. All vnodes needed to be unlocked for the removal, and
locking at another level using something like IN_RENAME was not even
attempted, so it was possible for rename(x, y) to return with both x
and y removed even without any unlink(2) syscalls (one process can
remove x using rename(x, y) and another process can remove y using
rename(y, x)).
Prodded by: alfred
MFC after: 8 weeks
PR: 42617
- Get the initial mode from the prom settings and don't clobber the mode
on open.
- Copy output into an internal ring buffer instead of accessing the tty
outq directly in the interrupt handler. This fixes a problem where
garbage would show up in the output stream.
- Reset the console port completely and reprogram all the parameters
before enabling it. This fixes seemingly random hangs on startup
when using a fast interrupt handler.
- Add minimal locking in place of spls.
- Remove dead code and minor cleanups.
available at module compile time. Do not #include the bogus
opt_kstack_pages.h at this point and instead refer to the variables that
are also exported via sysctl.
PS_STRINGS and USRSTACK is. This is necessary in order to decode a.out
core dumps. kern_proc.c was already referring to both of these values
but was missing the #include "opt_kstack_pages.h". Make the sysctl
variables visible so that certain kld modules can see how their parent
kernel was configured.
The process allocator now caches and hands out complete process structures
*including substructures* .
i.e. it get's the process structure with the first thread (and soon KSE)
already allocated and attached, all in one hit.
For the average non threaded program (non KSE that is) the allocated thread and its stack remain attached to the process, even when the process is
unused and in the process cache. This saves having to allocate and attach it
later, effectively bringing us (hopefully) close to the efficiency
of pre-KSE systems where these were a single structure.
Reviewed by: davidxu@freebsd.org, peter@freebsd.org
Together these two implement a simple transcation style grouping for
modifications of extended attributes on a vnode.
VOP_CLOSEEXTATTR() takes a boolean "commit" argument, which determines
if the aggregate changes are attempted written or not. A commit will
fail if any of the VOP_SETEXTATTR() calls since the VOP_OPENEXTATTR()
have failed to meet their objective or if the flush to disk fails.
The default operations for these two VOP's is to return EOPNOTSUPP.
This API may still be subject to change.
Sponsored by: DARPA & NAI Labs
s/SNGL/SINGLE/
s/SNGLE/SINGLE/
Fix abbreviation for P_STOPPED_* etc flags, in original code they were
inconsistent and difficult to distinguish between them.
Approved by: julian (mentor)
layers deep in <sys/proc.h> or <sys/vnode.h>.
Removed unused includes.
Fixed some printf format errors (1 fatal on i386's; 1 fatal on alphas;
1 not fatal on any supported machine).
functions which run for several milliseconds at a time and getting
in queue behind one or more of those makes us miss our rewind.
Instead call it from hardclock() like we used to do, but retain the
prescaler so we still cope with high HZ values.
out that there is no easy way to discern the difference between a text
segment and a data segment through the read-only OR execute attribute
in the elf segment header, so revert the algorithm to what it was before.
Neither can we account for multiple data load segments in the vmspace
structure (at least not without more work), due to assumptions obreak()
makes in regards to the data start and data size fields.
Retain RLIMIT_VMEM checking by using a local variable to track the
total bytes of data being loaded.
Reviewed by: peter
X-MFC after: ASAP
so that it works on the Alpha. This defines the segment that the entry
point exists in as 'text' and any others (usually one) as data.
Submitted by: tmm
Tested on: i386, alpha
it can do it w/o needing to hold the filelist_lock sx lock.
- fdalloc() doesn't need Giant to call free() anymore. It also doesn't
need to drop and reacquire the filedesc lock around free() now as a
result.
- Try to make the code that copies fd tables when extending the fd table in
fdalloc() a bit more readable by performing assignments in separate
statements. This is still a bit ugly though.
- Use max() instead of an if statement so to figure out the starting point
in the search-for-a-free-fd loop in fdalloc() so it reads better next to
the min() in the previous line.
- Don't grow nfiles in steps up to the size needed if we dup2() to some
really large number. Go ahead and double 'nfiles' in a loop prior
to doing the malloc().
- malloc() doesn't need Giant now.
- Use malloc() and free() instead of MALLOC() and FREE() in fdalloc().
- Check to see if the size we are going to grow to is too big, not if the
current size of the fd table is too big in the loop in fdalloc(). This
means if we are out of space or if dup2() requests too high of a fd,
then we will return an error before we go off and try to allocate some
huge table and copy the existing table into it.
- Move all of the logic for dup'ing a file descriptor into do_dup() instead
of putting some of it in do_dup() and duplicating other parts in four
different places. This makes dup(), dup2(), and fcntl(F_DUPFD) basically
wrappers of do_dup now. fcntl() still has an extra check since it uses
a different error return value in one case then the other functions.
- Add a KASSERT() for an assertion that may not always be true where the
fdcheckstd() function assumes that falloc() returns the fd requested and
not some other fd. I think that the assertion is always true because we
are always single-threaded when we get to this point, but if one was
using rfork() and another process sharing the fd table were playing with
the fd table, there might could be a problem.
- To handle the problem of a file descriptor we are dup()'ing being closed
out from under us in dup() in general, do_dup() now obtains a reference
on the file in question before calling fdalloc(). If after the call to
fdalloc() the file for the fd we are dup'ing is a different file, then
we drop our reference on the original file and return EBADF. This
race was only handled in the dup2() case before and would just retry
the operation. The error return allows the user to know they are being
stupid since they have a locking bug in their app instead of dup'ing
some other descriptor and returning it to them.
Tested on: i386, alpha, sparc64
sleep mutexes and vice versa. WITNESS normally should catch this but
not everyone uses WITNESS so this is a fallback to catch nasty but easy
to do bugs.
PCATCH means 'if we get a signal, interrupt me!" and tsleep returns
either EINTR or ERESTART depending on the circumstances. ERESTART is
"special" because it causes the system call to fail, but right as it
returns back to userland it tells the trap handler to move %eip back a
bit so that userland will immediately re-run the syscall.
This is a syscall restart. It only works for things like read() etc where
nothing has changed yet. Note that *userland* is tricked into restarting
the syscall by the kernel. The kernel doesn't actually do the restart. It
is deadly for things like select, poll, nanosleep etc where it might cause
the elapsed time to be reset and start again from scratch. So those
syscalls do this to prevent userland rerunning the syscall:
if (error == ERESTART) error = EINTR;
Fake "signals" like SIGTSTP from ^Z etc do not normally invoke userland
signal handlers. But, in -current, the PCATCH *is* being triggered and
tsleep is returning ERESTART, and the syscall is aborted even though no
userland signal handler was run.
That is the fault here. We're triggering the PCATCH in cases that we
shouldn't. ie: it is being triggered on *any* signal processing, rather
than the case where the signal is posted to userland.
--- Peter
The work of psignal() is a patchwork of special case required by the process
debugging and job-control facilities...
--- Kirk McKusick
"The design and impelementation of the 4.4BSD Operating system"
Page 105
in STABLE source, when psignal is posting a STOP signal to sleeping
process and the signal action of the process is SIG_DFL, system will
directly change the process state from SSLEEP to SSTOP, and when
SIGCONT is posted to the stopped process, if it finds that the process
is still on sleep queue, the process state will be restored to SSLEEP,
and won't wakeup the process.
this commit mimics the behaviour in STABLE source tree.
Reviewed by: Jon Mini, Tim Robbins, Peter Wemm
Approved by: julian@freebsd.org (mentor)
brand early in the process of loading an elf file, so that we can
identify the sysentvec, and so that we do not continue if we do not
have a brand (and thus a sysentvec). Use the values in the sysentvec
for the page size and vm ranges unconditionally, since they are all
filled in now.
sysentvec. Initialized all fields of all sysentvecs, which will allow
them to be used instead of constants in more places. Provided stack
fixup routines for emulations that previously used the default.
a kernel-internal kern_*() version and a wrapper that is called via
the syscall vector table. For paths and structure pointers, the
internal version either takes a uio_seg parameter or requires the
caller to copyin() the data to kernel memory as appropiate. This
will permit emulation layers to use these syscalls without having
to copy out translated arguments to the stack gap.
Discussed on: -arch
Review/suggestions: bde, jhb, peter, marcel
compat code. Clean up accounting for multiple segments. Part 1/2.
Submitted by: Andrey Alekseyev <uitm@zenon.net> (with some modifications)
MFC after: 3 days
in the original hardwired sysctl implementation.
The buf size calculator still overflows an integer on machines with large
KVA (eg: ia64) where the number of pages does not fit into an int. Use
'long' there.
Change Maxmem and physmem and related variables to 'long', mostly for
completeness. Machines are not likely to overflow 'int' pages in the
near term, but then again, 640K ought to be enough for anybody. This
comes for free on 32 bit machines, so why not?