Remove extraneous uses of vop_null, instead deferring to the default op.
Rename vnode type "vfs" to the more descriptive "syncer".
Fix formatting for various filesystems that use vop_print.
Initialize struct cdevsw using C99 sparse initialization and remove
all initializations to default values.
This patch is automatically generated and has been tested by compiling
LINT with all the fields in struct cdevsw in reverse order on alpha,
sparc64 and i386.
Approved by: re(scottl)
calculations. Keep these changes local to the function so the tick count
is in its natural form otherwise. Previously 1000 was added each time
a tick fired and we divided by 1000 when it was reported. This is done
to reduce rounding errors.
To do this, initialize the d_maj member of the cdevsw to MAJOR_AUTO.
When the cdevsw is first passed to make_dev() a free major number
will be assigned.
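For example (sketch; hypothetical "foo" driver, only non-default fields
listed):

    static struct cdevsw foo_cdevsw = {
        .d_open  = foo_open,
        .d_close = foo_close,
        .d_read  = foo_read,
        .d_name  = "foo",
        .d_maj   = MAJOR_AUTO,  /* a free major is assigned at make_dev() */
    };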
Until we have a bit more experience with this a printf will announce
this fact.
Major numbers are not reclaimed, so loading/unloading the same
device driver which uses MAJOR_AUTO will eventually deplete the
pool of free major numbers and the system will panic when it can
not allocate one. Still undecided whom to inconvenience with the
solution to this.
td_wmesg field in the thread structure points to the description string of
the condition variable or mutex. If the condvar or the mutex had been
initialized from a loadable module that was unloaded in the meantime,
td_wmesg may now point to invalid memory. Retrieving the process table now
may panic the kernel (or access junk). Setting the td_wmesg field to NULL
after unblocking on the condvar/mutex prevents this panic.
PR: kern/47408
Approved by: jake (mentor)
turn runs its tasks free of Giant too. It is intended that as drivers
become locked down, they will move out of the old, Giant-bound taskqueue
and into this new one. The old taskqueue has been renamed to
taskqueue_swi_giant, and the new one keeps the name taskqueue_swi.
delta 1.371) we must ensure that we do not get ourselves into a
recursive trap endlessly trying to clean up after ourselves.
Reported by: Attila Nagy <bra@fsn.hu>
Sponsored by: DARPA & NAI Labs.
o Always check for null when dereferencing the filename component.
o Implement a try-and-backoff method for allocating memory to
dump stats to avoid a spin-lock -> sleep-lock mutex lock order
panic with WITNESS.
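The pattern is roughly (sketch; the lock name is hypothetical):

    p = malloc(len, M_TEMP, M_NOWAIT);      /* try: may not sleep here */
    if (p == NULL) {
        mtx_unlock(&stats_mtx);             /* back off... */
        p = malloc(len, M_TEMP, M_WAITOK);  /* ...sleep safely... */
        mtx_lock(&stats_mtx);               /* ...and revalidate state */
    }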
Approved by: des, markm (mentor)
Not objected to by: jhb
track of the number of dirty buffers held by a vnode. When a
bdwrite is done on a buffer, check the existing number of dirty
buffers associated with its vnode. If the number rises above
vfs.dirtybufthresh (currently 90% of vfs.hidirtybuffers), one
of the other (hopefully older) dirty buffers associated with
the vnode is written (using bawrite). In the event that this
approach fails to curb the growth in the vnode's number of
dirty buffers (due to soft updates rollback dependencies),
the more drastic approach of doing a VOP_FSYNC on the vnode
is used. This code primarily affects very large and actively
written files such as snapshots. This change should eliminate
hanging when taking snapshots or doing background fsck on
very large filesystems.
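In outline, bdwrite() now does roughly this (sketch; the counter and
helper names are illustrative):

    if (vp->v_dirtybufcnt > dirtybufthresh) {
        nbp = oldest_unlocked_dirty_buf(vp);    /* hypothetical helper */
        if (nbp != NULL)
            bawrite(nbp);                   /* async write of older buf */
        else
            (void)VOP_FSYNC(vp, cred, MNT_NOWAIT, td); /* drastic fallback */
    }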
Hopefully, one day it will be possible to cache filesystem
metadata in the VM cache as is done with file data. As it
stands, only the buffer cache can be used which limits total
metadata storage to about 20Mb no matter how much memory is
available on the system. This rather small memory gets badly
thrashed causing a lot of extra I/O. For example, taking a
snapshot of a 1Tb filesystem minimally requires about 35,000
write operations, but because of the cache thrashing (we only
have about 350 buffers at our disposal) ends up doing about
237,540 I/O's thus taking twenty-five minutes instead of four
if it could run entirely in the cache.
Reported by: Attila Nagy <bra@fsn.hu>
Sponsored by: DARPA & NAI Labs.
- Remove the buftimelock mutex and acquire the buf's interlock to protect
these fields instead.
- Hold the vnode interlock while locking bufs on the clean/dirty queues.
This reduces some cases from one BUF_LOCK with a LK_NOWAIT and another
BUF_LOCK with a LK_TIMEFAIL to a single lock.
Reviewed by: arch, mckusick
- Get rid of the useless atop() / pmap_phys_address() detour. The
device mmap handlers must now give back the physical address
without atop()'ing it.
- Don't borrow the physical address of the mapping in the returned
int. Now we properly pass a vm_offset_t * and expect it to be
filled by the mmap handler when the mapping was successful. The
mmap handler must now return 0 when successful, any other value
is considered as an error. Previously, returning -1 was the only
way to fail. This change thus accidentally fixes some devices
which were bogusly returning errno constants which would have been
considered as addresses by the device pager.
- Garbage collect the poorly named pmap_phys_address() now that it's
no longer used.
- Convert all the d_mmap_t consumers to the new API.
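Under the new API a handler looks roughly like this (sketch;
hypothetical driver and constants):

    static int
    foo_mmap(dev_t dev, vm_offset_t offset, vm_offset_t *paddr, int nprot)
    {

        if (offset >= FOO_MEM_SIZE)
            return (EINVAL);            /* an errno, not -1 */
        *paddr = FOO_MEM_BASE + offset; /* physical address, no atop() */
        return (0);                     /* 0 means success */
    }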
I'm still not sure whether we need a __FreeBSD_version bump for this,
since we didn't guarantee API/ABI stability until 5.1-RELEASE.
Discussed with: alc, phk, jake
Reviewed by: peter
Compile-tested on: LINT (i386), GENERIC (alpha and sparc64)
Runtime-tested on: i386
in massive locking issues on diskless systems.
It is also not clear that this sysctl is non-dangerous in its
requirements for locked down memory on large RAM systems.
Retire the "d_dump_t" and use the "dumper_t" type instead.
Dumper_t takes a void * as first arg which is more general than the
dev_t taken by d_dump_t. (Remember: we could have net-dumpers if
somebody wrote us one!)
Define the convention for GEOM controlled disk devices to be that the
first argument to the dumper function is the struct disk pointer.
Change device drivers accordingly.
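A GEOM disk dumper therefore looks roughly like this (sketch;
hypothetical driver):

    static int
    foodisk_dump(void *priv, void *virtual, vm_offset_t physical,
        off_t offset, size_t length)
    {
        struct disk *dp = priv;     /* first arg is the struct disk */

        /* write 'length' bytes from 'virtual' at media offset 'offset' */
        return (foodisk_write(dp, virtual, offset, length));
    }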
dev_t to the method functions.
The dev_t can still be found at struct consdev *->cn_dev.
Add a void *cn_arg element to struct consdev which the drivers can use
for retrieving their softc.
In devsw() return dead_cdevsw instead of NULL in case the dev_t does not
have a si_devsw.
This may improve our survival chances with devices which go away unexpectedly.
compile-time constants). That is, a "bucket" now is not necessarily
a page's worth of mbufs or clusters; it is MBUF_BUCK_SZ or CLUS_BUCK_SZ
worth of mbufs or clusters, respectively.
o Rename {mbuf,clust}_limit to {mbuf,clust}_hiwm and introduce
{mbuf,clust}_lowm, which currently has no effect but will be used
to set the low watermarks.
o Fix netstat so that it can deal with the differently-sized buckets
and teach it about the low watermarks too.
o Make sure the per-cpu stats for an absent CPU have mb_active set to 0,
explicitly.
o Get rid of the allocate-refcounts-from-mbuf-map mess. Instead,
just malloc() the refcounts in one shot from mbuf_init().
o Clean up / update comments in subr_mbuf.c
used to share resource limits between rfork threads, but never was.
Removing it makes resource limit locking much simpler -- only the current
process can change the contents of the structure that p_limit points to.
reference counter array for mbuf clusters. I don't know
how this got past early testing nor how it survived so long
without getting caught. If anyone was seeing really really
bizarre memory corruption in a few mbufs this would be why.
#if'ed out for a while. Complete the deed and tidy up some other bits.
We need to be able to call this stuff from outer edges of interrupt
handlers for devices that have the ISR bits in pci config space. Making
the bios code mpsafe was just too hairy. We had also stubbed it out some
time ago due to there simply being too much brokenness in too many systems.
This adds a leaf lock so that it is safe to use pci_read_config() and
pci_write_config() from interrupt handlers. We still will use pcibios
to do interrupt routing if there is no acpi.. [yes, I tested this]
Briefly glanced at by: imp
sched_lock around accesses to p_stats->p_timer[] to avoid a potential
race with hardclock. getitimer(), setitimer() and the realitexpire()
callout are now Giant-free.
add a signal to a mailbox's pending set.
- Add a new function, thread_signal_upcall(), this causes the current thread
to upcall so that we can deliver pending signals.
Reviewed by: mini
I was in two minds as to where to put them in the first case..
I should have listened to the other mind.
Submitted by: parts by davidxu@
Reviewed by: jeff@ mini@
queue lock already held.
- In getblk() and flushbufqueues() use bremfreel() while we still have the
buf queue lock held to keep the lists consistent.
- Add LK_NOWAIT to two cases where we're essentially asserting that the bufs
are not locked while acquiring the locks. This will make sure that we get
the appropriate panic() and not another one for sleeping with a lock held.
- Mark the process leader as having an advisory lock
- Check if process leader is marked as having advisory lock when
closing file
- Check that file is still open after lock has been obtained
- Don't allow file descriptor table sharing between processes
with different leaders
PR: 10265
Reviewed by: alfred
freebsd4_sigaction() and osigaction() instead of around the whole
body of those functions. They now no longer hold Giant around calls
to copyin() and copyout(), and it is slightly more obvious what
Giant is protecting.
barrier between free'ing filedesc structures. Basically if you want to
access another process's filedesc, you want to hold this mutex over the
entire operation.
opposed to returning the top of the old chain when there was one and
the top of the newly allocated chain if there was no old chain.
Actually, it should be noted that prior to this fix, although the
comment above m_getm() advertised that m_getm() would return the
top of the old chain (if an old chain was being passed in) it
actually [wrongly] was returning the tail mbuf in the old chain
instead. This is a bug but since the one use of m_getm() in
the tree luckily did not depend on the behavior, it happened
to work out without notice.
Harti Brandt pointed out that the advertised behavior was actually
not the real behavior and so this change makes m_getm() ALWAYS
return the newly allocated chain (and fixes the comment). This
is less confusing and is the best course of action as then the
caller is always able to have both a reference to the top of
the original chain (because it's passing it in in the call) and
a reference to the newly attached chain. Although the API is
slightly modified, I don't think that any third-party code uses
m_getm() and if it does, it surely can't be working properly
because the old behavior was bogus.
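With the fix a caller can keep references to both chains (sketch):

    /* 'top' may be NULL or an existing chain to append to */
    new = m_getm(top, len, M_TRYWAIT, MT_DATA);
    if (new == NULL)
        return (ENOBUFS);       /* allocation failed */
    /* 'top' still heads the old chain; 'new' heads the new one */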
API bug pointed out by: Harti Brandt <brandt@fokus.fraunhofer.de>
To fix scsi, don't wait for ithreads if we're dumping, it makes the
debugger sad.
To fix ata, use what appears to be a polling method if we're dumping,
I stole this from tmm but added code to ensure that this change is
only in effect while dumping.
Tested by: des
The locking here needs to be revisited, but this ought to get rid of the
LOR messages that people are complaining about for now. I imagine either
I or someone else interested with smp will eventually clear this up.
- Use the ratio of kg_runtime / kg_slptime to determine our dynamic priority.
- Scale kg_runtime and kg_slptime back when the sum of the two exceeds
SCHED_SLP_RUN_MAX. This allows us to slowly forget old behavior.
- Scale back the runtime and slptime in fork so that the new process has the
same ratio but much less accumulated time. This causes new behavior to be
noticed more quickly.
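Roughly (sketch; constants and scaling simplified):

    sum = kg->kg_runtime + kg->kg_slptime;
    if (sum != 0)       /* priority rises with the fraction of time run */
        pri = SCHED_PRI_MIN + SCHED_PRI_RANGE * kg->kg_runtime / sum;
    if (sum > SCHED_SLP_RUN_MAX) {
        kg->kg_runtime /= 2;    /* decay both so old behavior is */
        kg->kg_slptime /= 2;    /* slowly forgotten */
    }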
that is protected by the vnode lock.
- Move B_SCANNED into b_vflags and call it BV_SCANNED.
- Create a vop_stdfsync() modeled after spec's sync.
- Replace spec_fsync, msdos_fsync, and hpfs_fsync with the stdfsync and some
fs specific processing. This gives all of these filesystems proper
behavior wrt MNT_WAIT/NOWAIT and the use of the B_SCANNED flag.
- Annotate the locking in buf.h
buf lists, synchronization variables, and atomic ops for the counters.
This change does not remove giant from any code although some pushdown
may be possible.
- In vfs_bio_awrite() don't access buf fields without the buf lock.
Change the si_name of dev_t's to be a char * and put a private buffer for
holding the name at the end of the struct.
Initialize si_name to point to the private buffer.
Put a KASSERT in geom_disk to prevent overrun on the fake dev_t we still
have to generate for the disk_drivers.
prevent the compiler from optimizing assignments into byte-copy
operations which might make access to the individual fields non-atomic.
Use the individual fields throughout, and don't bother locking them with
Giant: it is no longer needed.
Inspired by: tjr
statclock based on profhz when profiling is enabled MD, since most platforms
don't use this anyway. This removes the need for statclock_process, whose
only purpose was to subdivide profhz, and gets the profiling clock running
outside of sched_lock on platforms that implement suswintr.
Also changed the interface for starting and stopping the profiling clock to
do just that, instead of changing the rate of statclock, since they can now
be separate.
Reviewed by: jhb, tmm
Tested on: i386, sparc64
have some negative effect on interactivity but it yields great perf. gains.
This also brings the conditions under which ULE context switches inline
with SCHED_4BSD.
- Define some new kseq_* functions for manipulating the run queue.
- Add a new kseq member ksq_rslices and ksq_bload. rslices is the sum of
the slices of runnable kses. This will be used for push load balance
decisions. bload is the number of threads blocked waiting on IO.
I'm not convinced there is anything major wrong with the patch but
them's the rules..
I am using my "David's mentor" hat to revert this as he's
offline for a while.
than having change_dir() release the vnode lock on success, hold the
lock so that we can use it later when invoking MAC checks and
VOP_ACCESS() in the chroot() code. Update the comment to reflect
this calling convention. Update callers to unlock the vnode
lock. Correct a typo regarding vnode naming in the MAC case that
crept in via the previous patch applied.
cases: we might multiply vrele() a vnode when certain classes of
failures occur. This appears to stem from earlier Giant/file
descriptor lock pushdown and restructuring.
Submitted by: maxim
This implicitly removes the need for major numbers, but a number of
drivers still know things they shouldn't need to, and we need to
consider if there are applications which cache major(+minor) gleaned
from stat(2) and rely on it being constant over reboots before we
start assigning random majors.
sched_runnable() et al.
- Remove some dead code in sched_clock().
- Define two macros KSEQ_SELF() and KSEQ_CPU() for getting the kseq of the
current cpu or some alternate cpu.
- Start introducing kseq_() functions, such as kseq_choose() and kseq_setup().
run queue for each cpu.
- Introduce kse stealing into the sched_choose() code. This helps balance
cpus better in cases where process turnover is high. This implementation
is fairly trivial and will likely be only a temporary measure until
something more sophisticated has been written.
data structure called kse_upcall to manage UPCALLs. All KSE binding
and loaning code is gone.
A thread that owns an upcall can collect all completed syscall contexts
in its ksegrp, turn itself into UPCALL mode, and take those contexts
back to userland. Any thread without an upcall structure has to export
its context and exit at the user boundary.
Any thread running in user mode owns an upcall structure. When it
enters the kernel, if the kse mailbox's current thread pointer is not
NULL, then when the thread blocks in the kernel a new UPCALL thread is
created and the upcall structure is transferred to the new UPCALL
thread. If the kse mailbox's current thread pointer is NULL, then no
UPCALL thread is created when a thread blocks in the kernel.
Each upcall always has an owner thread. Userland can remove an upcall
by calling kse_exit; when all upcalls in a ksegrp are removed, the
group is automatically shut down. An upcall owner thread also exits
when the process is in the exiting state. When an owner thread exits,
the upcall it owns is also removed.
KSE is a pure scheduler entity. It represents a virtual CPU. When a
thread is running, it always has a KSE associated with it. The
scheduler is free to assign a KSE to a thread according to thread
priority; if a thread's priority is changed, its KSE can be moved from
one thread to another.
When a ksegrp is created, N KSEs are always created in the group,
where N is the number of physical CPUs in the current system. This
makes it possible that even if a userland UTS is only single-CPU safe,
threads in the kernel can still execute on different CPUs in parallel.
Userland calls kse_create to add more upcall structures to a ksegrp to
increase its own concurrency; the kernel is not restricted by the
number of upcalls userland provides.
The code hasn't been tested under SMP by the author due to lack of
hardware.
Reviewed by: julian
potential discontinuities in our UTC timescale.
Applications can monitor this variable if they want to be informed
about steps in the timescale. Slews (ntp and adjtime(2)) and
frequency adjustments (ntp) will not increment this counter, only
operations which set the clock. No attempt is made to classify
size or direction of the step.
correctly against PF_LOCAL. It seems that the test always fails when
sockaddr was not filled, so I added an else clause as a workaround.
I doubt it is the right fix, but it is better than nothing. I
found that NetBSD has the same potential problem; fortunately,
NetBSD has an equivalent else clause.
MFC after: 1 week
1. eliminate unnecessary loop which frees and re-allocates
the just allocated array
2. eliminate the newsize recomputation
3. eliminate unnecessary unlock and relock around free
4. correctly match the free with the malloc into M_KQUEUE instead of M_TEMP
5. eliminate conditional assignment of oldlist, which is equivalent to a
simple assignment
6. eliminate the oldlist temporary variable completely
Reviewed by: jhb
metadata. This fixes module dependency resolution by the kernel linker on
sparc64, where the relocations for the metadata are different than on other
architectures; the relative offset is in the addend of an Elf_Rela record
instead of the original value of the location being patched.
Also fix printf formats in debug code.
Submitted by: Hartmut Brandt <brandt@fokus.gmd.de>
PR: 46732
Tested on: alpha (obrien), i386, sparc64
was used to control code which was conditional on DEVFS' presence
since this avoided the need for large-scale source pollution with
#include "opt_geom.h"
Now that we approach making DEVFS standard, replace these tests
with an #ifdef to facilitate mechanical removal once DEVFS becomes
non-optional.
No functional change by this commit.
more than return ENXIO from its open routine, so most of this file
is unneeded.
A straight #ifdef'ing would look quite messy, and make the file
quite unreadable, so instead I have simply added the DEVFS version
of the file at the top, protected by #ifndef NODEVFS.
Once we have removed the NODEVFS option, we can retain the 86 lines at
the top and drop the other 287 lines.
SS_ISCONNECTING state, returning EINVAL (which is what POSIX mandates
in this case).
listen() on connected or connecting sockets would cause them to enter
a bad state; in the TCP case, this could cause sockets to go
catatonic or panics, depending on how the socket was connected.
Reviewed by: -net
MFC after: 2 weeks
vm_pageout_deficit:
1. Update vm_pageout_deficit before VM_WAIT. There is no sense in
delaying the update; the sooner the pageout daemon receives this
information the better. Reviewed by: tegge
2. Update vm_pageout_deficit according to the number of pages still
needed to complete the allocation, not the original size of the
allocation. Submitted by: tegge
(These errors have existed since the introduction of vm_pageout_deficit
in revision 1.144.)
portable copy. Note that pmap_extract() must be used instead of
pmap_kextract().
This is precursor work to a reorganization of vmapbuf() to close remaining
user/kernel races (which can lead to a panic).
in addition to secure level 1. The mask supports up to a secure level of 8
but only defines through CTLFLAG_SECURE3 are added for now.
As per the missive in the log entry for 1.11 of ip_fw2.c which added the
secure flag to the IPFW sysctl's in the first place, change the secure
level requirement from 1 to 3 now that we have support for it.
Reviewed by: imp
With Design Suggestions by: imp
dereferenced when a process exits due to the vmspace ref-count being
bumped. Change shmexit() and shmexit_myhook() to take a vmspace instead
of a process and call it in vmspace_dofree(). This way if it is missed
in exit1()'s early-resource-free it will still be caught when the zombie is
reaped.
Also fix a potential race in shmexit_myhook() by NULLing out
vmspace->vm_shm prior to calling shm_delete_mapping() and free().
MFC after: 7 days
to access the pctcpu. This will have to be sorted out more later as the
new scheduler requires a procedural interface for this data. A more
complete solution will follow.
pointer types, and remove a huge number of casts from code using it.
Change struct xfile xf_data to xun_data (ABI is still compatible).
If we need to add a #define for f_data and xf_data we can, but I don't
think it will be necessary. There are no operational changes in this
commit.
that crept in recently. GCC will optimize the divides and multiplies for us.
Submitted by: David Schultz <dschultz@uclink.Berkeley.EDU>
MFC after: 1 day
so that entities that want to use the post_sync hook to write stuff
to devices and other tidy-up can do so before the device tree is
shot down. eg: da doing a SYNC_CACHE etc. This should get crashdumps
working on mpt devices again, and stops the ia64 boxes locking up
on regular shutdown when da tries to issue the scsi commands to mpt.
Obtained from: njl, gibbs
which expects it to be NULL unless the return value was 0 will work.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
so its value is not sign extended when assigned to the uintmax_t variable
used internally by printf. For example, if bit 31 is set in the cpuid
feature word, then %b would print out the initial value as a 16 character
hexadecimal value. Now it only prints out an 8 character value.
Reviewed by: bde
The leak in lseek was introduced in vfs_syscalls.c revision 1.218.
The leak in do_dup was introduced in kern_descrip.c revision 1.158.
Submitted by: iedowse
Add support for GCC's --test-coverage --profile-arcs options.
Add code to call the functions listed in the .ctors section, these are
used to string the per .o file counter blocks into a linked list.
Add empty __bb_fork_func() to cope with GCC's magic handling of exec*()
named functions.
Adding support for other platforms should be trivial, but involves
determining the exact data types GCC uses on that platform.
called. Otherwise (depending on a non-deterministic sort), the timecounter
code can be initialized before the clock rate has been set (on ia64) and it
assumes hz = 100, rather than the real value of 1024. I'm not sure how much
gets upset by this.
Glanced at by: phk
then call do_setopt_accept_filter(so, NULL) which will free the filter
instead of duplicating the code in do_setopt_accept_filter().
Pointed out by: Hiten Pandya <hiten@angelica.unixdaemons.com>
to sort out disk-io from file-io in the vm/buffer/filesystem space.
The intent is to sort VOP_STRATEGY calls into those which operate
on "real" vnodes and those which operate on VCHR vnodes. For
the latter kind, the call will be changed to VOP_SPECSTRATEGY,
possibly conditionally for those places where dual-use happens.
Add a default VOP_SPECSTRATEGY method which will call the normal
VOP_STRATEGY. First time it is called it will print debugging
information. This will only happen if a normal vnode is passed
to VOP_SPECSTRATEGY by mistake.
Add a real VOP_SPECSTRATEGY in specfs, which does what VOP_STRATEGY
does on a VCHR vnode today.
Add a new VOP_STRATEGY method in specfs to catch instances where
the conversion to VOP_SPECSTRATEGY has not yet happened. Handle
the request just like we always did, but first time called print
debugging information.
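In outline, the default method is (sketch; names approximate):

    static int
    vop_stdspecstrategy(struct vop_specstrategy_args *ap)
    {
        static int warned;

        if (!warned++)
            printf("VOP_SPECSTRATEGY on non-VCHR vnode\n");
        return (VOP_STRATEGY(ap->a_vp, ap->a_bp));
    }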
Apart from up to two instances of console messages per boot, this amounts
to a glorified no-op commit.
If you get any of the messages on your console I would very much
like a copy of them mailed to phk@freebsd.org
included in the kernel. Include imgact_elf.c in conf/files, instead of
both imgact_elf32.c and imgact_elf64.c, which will use the default word
size for an architecture as defined in machine/elf.h. Architectures that
wish to build an additional image activator for an alternate word size can
include either imgact_elf32.c or imgact_elf64.c in files.${ARCH}, which
allows it to be dependent on MD options instead of solely on architecture.
Glanced at by: peter
void backtrace(void);
function which will print a backtrace if DDB is in the kernel and an
explanation if not.
This is useful for recording backtraces in non-fatal circumstances and
does not require pollution with DDB #includes in the files where it
is used.
It would of course be nice to have a non-DDB dependent version too,
but since the meat of a backtrace is MD it is probably not worth it.
On architectures with a non-executable stack, eg sparc64, this is used by
libgcc to determine at runtime if it's necessary to enable execute permissions
on a region of the stack which will be used to execute code, allowing the
call to mprotect to be avoided if the kernel is configured to map the stack
executable.
o Allow callers of m_extadd() to allocate their own reference
m_ext.ref_cnt pointer, rather than having the mbuf system allocate it
with a malloc() in the critical path. This speeds m_extadd() up, and
also simplifies locking (malloc() may need Giant).
A driver or subsystem wishing to use its own ref counter must
initialize m_ext.ref_cnt to point to its ref counter prior to
calling m_extadd(), and it must use EXT_EXTREF as its external type.
Eg:
    m->m_ext.ref_cnt = my_ref_cnt_ptr;
    m_extadd(....., EXT_EXTREF);
Reviewed by: bosko
this was causing filedesc work to be very painful.
In order to make this work split out sigio definitions to their own header
(sigio.h) which is included from proc.h for the time being.
take pointers to filedesc structures instead of threads. This makes
it more clear that they do not do any voodoo with the thread/proc
or anything other than the filedesc passed in or returned.
Remove some XXX KSE's as this resolves the issue.
calling getmicrouptime (but maintain the struct timeval-based calling
convention for compatibility)
o eliminate the use of timersub in ratecheck
Note that flood ping tests indicate ppsratecheck is inaccurate (but on the
conservative side) with this revised implementation. If more accuracy is
needed we'll have to introduce an alternate interface or increase the
overhead.
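Typical use is unchanged (sketch; the action taken is hypothetical):

    static struct timeval lasttv;
    static int curpps;

    if (ppsratecheck(&lasttv, &curpps, 100))    /* <= 100 events/sec */
        send_response();
    else
        dropped++;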
Reviewed by: silby, dillon, bde
were sometimes propagated using M_COPY_PKTHDR which actually did
something between a "move" and a "copy" operation. This is replaced
by M_MOVE_PKTHDR (which copies the pkthdr contents and "removes" it
from the source mbuf) and m_dup_pkthdr which copies the packet
header contents including any m_tag chain. This corrects numerous
problems whereby mbuf tags could be lost during packet manipulations.
These changes also introduce arguments to m_tag_copy and m_tag_copy_chain
to specify if the tag copy work should potentially block. This
introduces an incompatibility with openbsd which we may want to revisit.
Note that move/dup of packet headers does not handle target mbufs
that have a cluster bound to them. We may want to support this;
for now we watch for it with an assert.
Finally, M_COPYFLAGS was updated to include M_FIRSTFRAG|M_LASTFRAG.
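In a nutshell (sketch):

    /* move: 'm' relinquishes its pkthdr (and m_tag chain) to 'n' */
    M_MOVE_PKTHDR(n, m);

    /* copy: 'n' gets a duplicate pkthdr; the tag copy may allocate,
       hence the wait flag */
    if (m_dup_pkthdr(n, m, M_DONTWAIT) == 0)
        goto nospace;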
Supported by: Vernier Networks
Reviewed by: Robert Watson <rwatson@FreeBSD.org>
__acl_get_link(), __acl_set_link(), __acl_delete_link(), and
__acl_aclcheck_link(), with almost identical implementations to
the existing __acl_*_file() variants on these calls. Update
copyright.
Obtained from: TrustedBSD Project
__acl_get_link() Retrieve an ACL by name without following
symbolic links.
__acl_set_link() Set an ACL by name without following
symbolic links.
__acl_delete_link() Delete an ACL by name without following
symbolic links.
__acl_aclcheck_link() Check an ACL against a file by name without
following symbolic links.
These calls are similar in spirit to lstat(), lchown(), lchmod(), etc,
and will be used under similar circumstances.
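For example, to fetch the ACL of a symlink itself rather than of its
target (sketch):

    struct acl acl;

    if (__acl_get_link("/path/to/link", ACL_TYPE_ACCESS, &acl) == -1)
        err(1, "__acl_get_link");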
Obtained from: TrustedBSD Project
call is in progress on the vnode. When vput() or vrele() sees a
1->0 reference count transition, it now return without any further
action if this flag is set. This flag is necessary to avoid recursion
into VOP_INACTIVE if the filesystem inactive routine causes the
reference count to increase and then drop back to zero. It is also
used to guarantee that an unlocked vnode will not be recycled while
blocked in VOP_INACTIVE().
There are at least two cases where the recursion can occur: one is
that the softupdates code called by ufs_inactive() via ffs_truncate()
can call vput() on the vnode. This has been reported by many people
as "lockmgr: draining against myself" panics. The other case is
that nfs_inactive() can call vget() and then vrele() on the vnode
to clean up a sillyrename file.
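In outline, with the new flag (called VI_DOINGINACT here; sketch):

    if ((vp->v_iflag & VI_DOINGINACT) == 0) {
        vp->v_iflag |= VI_DOINGINACT;
        VOP_INACTIVE(vp, td);
        vp->v_iflag &= ~VI_DOINGINACT;
    }
    /* else: inactive processing already under way; just return */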
Reviewed by: mckusick (an older version of the patch)
to treat desiredvnodes much more like a limit than as a vague concept.
On a 2GB RAM machine where desired vnodes is 130k, we run out of
kmem_map space when we hit about 190k vnodes.
If we wake up the vnode washer in getnewvnode(), sleep until it is done,
so that it has a chance to offer us a washed vnode. If we don't sleep
here we'll just race ahead and allocate yet another vnode which will never
get freed.
In the vnodewasher, instead of doing 10 vnodes per mountpoint per
rotation, do 10% of the vnodes distributed evenly across the
mountpoints.
(show thread {address})
Remove the IDLE kse state and replace it with a change in
the way threads share KSEs. Every KSE now has a thread, which is
considered its "owner"; however, a KSE may also be lent to other
threads in the same group to allow completion of in-kernel work.
In this case the owner remains the same and the KSE will revert to the
owner when the other work has been completed.
All creation of upcalls etc. is now done from
kse_reassign() which in turn is called from mi_switch() or
thread_exit(). This means that special code can be removed from
msleep() and cv_wait().
kse_release() does not leave a KSE with no thread any more, but
converts the existing thread into the KSE's owner and sets it up
for doing an upcall. It is just inhibited from being scheduled until
there is some reason to do an upcall.
Remove all trace of the kse_idle queue since it is no longer needed.
"Idle" KSEs are now on the loanable queue.
The duplication is caused by the fact that imgact_elf.c is included
by both imgact_elf32.c and imgact_elf64.c and both are compiled by
default on ia64. Consequently, we have two separate copies of the
elf_legacy_coredump variable due to them being declared static, and
two entries for the same sysctl in the linker set, both referencing
the unique copy of the elf_legacy_coredump variable. Since the second
sysctl cannot be registered, one of the elf_legacy_coredump variables
can not be tuned (if ordering still holds, it's the ELF64 related one).
The only solution is to create two different sysctl variables, just
like the elf<32|64>_trace sysctl variables. This unfortunately is a
(user) interface change, but unavoidable. Thus, on ELF32 platforms
the sysctl variable is called elf32_legacy_coredump and on ELF64
platforms it is called elf64_legacy_coredump. Platforms that have
both ELF formats have both sysctl variables.
These variables should probably be retired sooner rather than later.
skipping read-only pages, which can result in valuable non-text-related
data not getting dumped, the ELF loader and the dynamic loader now mark
read-only text pages NOCORE and the coredump code only checks (primarily) for
complete inaccessibility of the page or NOCORE being set.
Certain applications which map large amounts of read-only data will
produce much larger cores. A new sysctl has been added,
debug.elf_legacy_coredump, which will revert to the old behavior.
This commit represents collaborative work by all parties involved.
The PR contains a program demonstrating the problem.
PR: kern/45994
Submitted by: "Peter Edwards" <pmedwards@eircom.net>, Archie Cobbs <archie@dellroad.org>
Reviewed by: jdp, dillon
MFC after: 7 days
_KERNEL scope from "src/sys/sys/mchain.h".
Replace each occurrence of the above in _KERNEL scope with the
appropriate macro from the set of hto(be|le)(16|32|64) and
(be|le)toh(16|32|64) from "src/sys/sys/endian.h".
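For example (sketch):

    len = htoles(len);      /* before: mchain.h */
    len = htole16(len);     /* after: endian.h */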
Tested by: tjr
Requested by: comment marked with XXX
resource starvation we clean-up as much of the vmspace structure as we
can when the last process using it exits. The rest of the structure
is cleaned up when it is reaped. But since exit1() decrements the ref
count it is possible for a double-free to occur if someone else, such as
the process swapout code, references and then dereferences the structure.
Additionally, the final cleanup of the structure should not occur until
the last process referencing it is reaped.
This commit solves the problem by introducing a secondary reference count,
called 'vm_exitingcnt'. The normal reference count is decremented on exit
and vm_exitingcnt is incremented. vm_exitingcnt is decremented when the
process is reaped. When both vm_exitingcnt and vm_refcnt are 0, the
structure is freed for real.
MFC after: 3 weeks
they may be the only viable ones to flush. Thus it will now wait for
an inode lock if the other alternatives will result in rollbacks (and
immediate redirtying of the buffer). If only buffers with rollbacks
are available, one will be flushed, but then the buffer daemon will
wait briefly before proceeding. Failing to wait briefly effectively
deadlocks a uniprocessor since every other process writing to that
filesystem will wait for the buffer daemon to clean up which takes
close enough to forever to feel like a deadlock.
Reported by: Archie Cobbs <archie@dellroad.org>
Sponsored by: DARPA & NAI Labs.
Approved by: re
These call uma_large_malloc() and uma_large_free() which require Giant.
Fixes panic when descriptor table is larger than KMEM_ZMAX bytes
noticed by kkenn.
Reviewed by: jhb
unused. Replace it with a dm_mount back-pointer to the struct mount
that the devfs_mount is associated with. Export that pointer to MAC
Framework entry points, where all current policies don't use the
pointer. This permits the SEBSD port of SELinux's FLASK/TE to compile
out-of-the-box on 5.0-CURRENT with full file system labeling support.
Approved by: re (murray)
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
converting from individual vnode locks to the snapshot
lock, be sure to pass any waiting processes along to the
new lock as well. This transfer is done by a new function
in the lock manager, transferlockers(from_lock, to_lock);
Thanks to Lamont Granquist <lamont@scriptkiddie.org> for
his help in pounding on snapshots beyond all reason and
finding this deadlock.
Sponsored by: DARPA & NAI Labs.
1) Record all device events when devctl is enabled, rather than just when
devd has devctl open. This is necessary to prevent races between when
a device arrives, and when devd starts.
2) Add hw.bus.devctl_disable to disable devctl, this can also be set as a
tunable.
3) Fix async support. Reset nonblocking and async_td in open. Remove
async flags.
4) Free all memory when devctl is disabled.
Approved by: re (blanket)
on this.
o Update the `cur' pointer in the cluster loop in m_getm() to avoid
incorrect truncation and leaked mbufs.
Reviewed by: bmilekic
Approved by: re
create an ABI that encodes offsets and sizes of structures into client
drivers. The functions isolate the ABI from changes to the resource
structure. Since these are used very rarely (once at startup), the
speed penalty will be down in the noise.
Also, add r_rid to the structure so that clients can save the 'rid' of
the resource in the struct resource, plus accessor functions. Future
additions to newbus will make use of this to present a simplified
interface for resource specification.
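Client drivers then use the accessors instead of dereferencing struct
resource (sketch):

    start = rman_get_start(res);
    size  = rman_get_size(res);
    rid   = rman_get_rid(res);      /* new accessor for the saved rid */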
Approved by: re (jhb)
Reviewed by: jhb, jake
problem was a locked directory vnode), do not give the process a chance
to sleep in state "stopevent" (depends on the S_EXEC bit being set in
p_stops) until most resources have been released again.
Approved by: re
instead of panicking. Also, perform some of the simpler sanity checks on
the fds before acquiring the filedesc lock.
Approved by: re
Reported by: Dan Nelson <dan@emsphone.com> and others
by policy modules making use of downgrades in the MAC AST event. This
is required by the mac_lomac port of LOMAC to the MAC Framework.
Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
i386 cpu_thread_exit(). This resulted in a panic with WITNESS
since we need to hold Giant to call kmem_free(), and we weren't
holding it anymore in cpu_thread_exit(). We now do this from a
new MD function, cpu_thread_dtor(), called by thread_dtor().
Approved by: re@
Suggested by: jhb
- Provide a routine in sched_4bsd to add this functionality.
- Use sched_pctcpu() in kern_proc, which is the one place outside of
sched_4bsd where the old pctcpu value was accessed directly.
Approved by: re
data in the scheduler independant structures (proc, ksegrp, kse, thread).
- Implement unused stubs for this mechanism in sched_4bsd.
Approved by: re
Reviewed by: luigi, trb
Tested on: x86, alpha
in struct proc. While the process label is actually stored in the
struct ucred pointed to by p_ucred, there is a need for transient
storage that may be used when asynchronous (deferred) updates need to
be performed on the "real" label for locking reasons. Unlike other
label storage, this label has no locking semantics, relying on policies
to provide their own protection for the label contents, meaning that
a policy leaf mutex may be used, avoiding lock order issues. This
permits policies that act based on historical process behavior (such
as audit policies, the MAC Framework port of LOMAC, etc) to update
process properties even when many existing locks are held without
violating the lock order. No currently committed policies implement use
of this label storage.
Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
checks permit policy modules to augment the system policy for permitting
kld operations. This permits policies to limit access to kld operations
based on credential (and other) properties, as well as to perform checks
on the kld being loaded (integrity, etc).
Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
leader wasn't exiting during a fork; instead, do remember to release
the lock, avoiding lock order reversals and a recursion panic.
Reported by: "Joel M. Baldwin" <qumqats@outel.org>
also add rusage time in the thread mailbox.
2. Minor change for the thread limit code in thread_user_enter();
fix a typo in kse_release() from my last commit.
Reviewed by: deischen, mini
kern.threads.max_threads_per_proc
kern.threads.max_groups_per_proc
2. Temporarily disable a borrower thread stashing itself as the
owner thread's spare thread in thread_exit(). There
is a race between the owner thread and the borrower thread:
an owner thread may allocate a spare thread like this:
    if (td->td_standin == NULL)
        td->td_standin = thread_alloc();
but thread_alloc() can block the thread, and then a borrower
thread could possibly stash itself as the owner's spare
thread in thread_exit(). After the owner is resumed, the result
is a thread leak in the kernel. A double check in the owner could
avoid the race, but it would be ugly and is not worth doing.
sysconf.c:
Use 'break' rather than 'goto yesno' in sysconf.c so that we report a '0'
return value from the kernel sysctl.
vfs_aio.c:
Make aio reset its configuration parameters to -1 after unloading
instead of 0.
posix4_mib.c:
Initialize the aio configuration parameters to -1
to indicate that it is not loaded.
Add a facility (p31b_iscfg()) to determine if a posix4 facility has been
initialized to avoid having to re-order the SYSINITs.
Use p31b_iscfg() to determine if aio has had a chance to run yet which
is likely if it is compiled into the kernel and avoid spamming its
values.
Introduce a macro P31B_VALID() instead of doing the same comparison over
and over.
posix4.h:
Prototype p31b_iscfg().
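P31B_VALID() is just the range check that used to be open-coded,
roughly:

    #define P31B_VALID(num) ((num) >= 1 && (num) < CTL_P1003_1B_MAXID)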
Previously these were libc functions but were requested to
be made into system calls for atomicity and to coalesce what
might be two entrances into the kernel (signal mask setting
and floating point trap) into one.
A few style nits and comments from bde are also included.
Tested on alpha by: gallatin
signed, since they describe a ring buffer and signed arithmetic is
performed on them. This avoids some evilish casts.
Since this changes all but two members of this structure, style(9)
those remaining ones, too.
Requested by: bde
Reviewed by: bde (earlier version)
the MAC policy list is busy during a load or unload attempt.
We assert no locks held during the cv wait, meaning we should
be fairly deadlock-safe. Because of the cv model and busy
count, it's possible for a cv waiter waiting for exclusive
access to the policy list to be starved by active and
long-lived access control/labeling events. For now, we
accept that as a necessary tradeoff.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
we brought in the new cache and locking model for vnode labels. We
now rely on mac_associate_devfs_vnode().
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
earlier acquired lock with the same witness as the lock currently being
acquired. If we had released several earlier acquired locks after
acquiring enough locks to require another lock_list_entry bucket in the
lock list, then subsequent lock_list_entry buckets could contain only one
lock instance in which case i would be zero.
Reported by: Joel M. Baldwin <qumqats@outel.org>
dynamic mapping of an operation vector into an operation structure;
rather, we rely on C99 sparse structure initialization.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
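That is, a policy module now declares something like (sketch;
hypothetical policy):

    static struct mac_policy_ops foo_policy_ops = {
        .mpo_init =        foo_policy_init,
        .mpo_create_cred = foo_create_cred,
        /* entry points left unset remain NULL */
    };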
indirectly through vm_page_protect(). The one remaining page flag that
is updated by vm_page_protect() is already being updated by our various
pmap implementations.
Note: A later commit will similarly change the VM_PROT_READ case and
eliminate vm_page_protect().
in the ELF code. Missed in earlier merge from the MAC tree.
Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
mac_thread_userret() only if PS_MACPEND is set in the process AST mask.
This avoids the cost of the entry point in the common case, but
requires policies interested in the userret event to set the flag
(protected by the scheduler lock) if they do want the event. Since
all the policies that we're working with which use mac_thread_userret()
use the entry point only selectively to perform operations deferred
for locking reasons, this maintains the desired semantics.
Approved by: re
Requested by: bde
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
points, rather than relying on policies to grub around in the
image activator instance structure.
Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
sysctls to MI code; this reduces code duplication and makes all of them
available on sparc64, and the latter two on powerpc.
The semantics of the i386 and pc98 hw.availpages are slightly changed:
previously, holes between ranges of available pages would be included,
while they are excluded now. The new behaviour should be more correct
and brings i386 in line with the other architectures.
Move physmem to vm/vm_init.c, where this variable is used in MI code.
- Remove the comments which were justifying this by the fact
that we don't have %q in the kernel; this was probably right
back in the day, but we now have %q, and we even have something
better to print those types (%j).
of the original AIO request: save and restore the active thread credential
as well as using the file credential, since MAC (and some other bits of
the system) rely on the thread credential instead of/as well as the
file credential. In brief: cache td->td_ucred when the AIO operation
is queued, temporarily set and restore the kernel thread credential,
and release the credential when done. Similar to ktrace credential
management.
Reviewed by: alc
Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
manipulated directly (rather than using sballoc()/sbfree()); update them
to tweak the new sb_ctl field too.
Sponsored by: NTT Multimedia Communications Labs
(1) Permit userland applications to request a change of label atomic
with an execve() via mac_execve(). This is required for the
SEBSD port of SELinux/FLASK. Attempts to invoke this without
MAC compiled in result in ENOSYS, as with all other MAC system
calls. Complexity, if desired, is present in policy modules,
rather than the framework.
(2) Permit policies to have access to both the label of the vnode
being executed as well as the interpreter if it's a shell
script or related UNIX nonsense. Because we can't hold both
vnode locks at the same time, cache the interpreter label.
SEBSD relies on this because it supports secure transitioning
via shell script executables. Other policies might want to
take both labels into account during an integrity or
confidentiality decision at execve()-time.
Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
Allow transitioning to be twiddled off using the process and fs enforcement
flags, although at some point this should probably be its own flag.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
entrypoints, #ifdef MAC. The supporting logic already existed in
kern_mac.c, so no change there. This permits MAC policies to cause
a process label change as the result of executing a binary --
typically, as a result of executing a specially labeled binary.
For example, the SEBSD port of SELinux/FLASK uses this functionality
to implement TE type transitions on processes using transitioning
binaries, in a manner similar to setuid. Policies not implementing
a notion of transition (all the ones in the tree right now) require
no changes, since the old label data is copied to the new label
via mac_create_cred() even if a transition does occur.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
describes an image activation instance. Instead, make use of the
existing fname structure entry, and introduce two new entries,
userspace_argv, and userspace_envv. With the addition of
mac_execve(), this divorces the image structure from the specifics
of the execve() system call, removes a redundant pointer, etc.
No semantic change from current behavior, but it means that the
structure doesn't depend on syscalls.master-generated includes.
There seems to be some redundant initialization of imgact entries,
which I have maintained, but which could probably use some cleaning
up at some point.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
system accounting configuration and for nfsd server thread attach.
Policies might use this to protect the integrity or confidentiality
of accounting data, limit the ability to turn on or off accounting,
as well as to prevent inappropriately labeled threads from becoming nfs
server threads.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
the data value returned by kevent()'s EVFILT_READ filter on non-TCP
sockets accurately reflects the amount of data that can be read from the
sockets by applications.
PR: 30634
Reviewed by: -net, -arch
Sponsored by: NTT Multimedia Communications Labs
MFC after: 2 weeks
permitting MAC policies to limit access to the kernel environment.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
malloc(9) failed last time. This is intended to help code adjust
memory usage to the current circumstances.
A typical use could be:
    if (malloc_last_fail() < 60)
        reduce_cache_by_one();