o Add reference to TrustedBSD Project in license header.
o Update dated comments, including comment in extattr.h claiming that
no file systems support extended attributes.
o Improve comment consistency.
credential selection, rather than reference via a thread or process
pointer. This is part of a gradual migration to suser() accepting
a struct ucred instead of a struct proc, simplifying the reference
and locking semantics of suser().
Obtained from: TrustedBSD Project
- Now that apm loadable module can inform its existence to other kernel
components (e.g. i386/isa/clock.c:startrtclock()'s TCS hack).
- Exchange priority of SI_SUB_CPU and SI_SUB_KLD for above purpose.
- Add simple arbitration mechanism for APM vs. ACPI. This prevents
the kernel enables both of them.
- Remove obsolete `#ifdef DEV_APM' related code.
- Add abstracted interface for Powermanagement operations. Public apm(4)
functions, such as apm_suspend(), should be replaced new interfaces.
Currently only power_pm_suspend (successor of apm_suspend) is implemented.
Reviewed by: peter, arch@ and audit@
function symbols in the kernel in a list of C strings, with an extra
nul-termination at the end.
This sysctl requires addition of a new linker operation. Now,
linker_file_t's need to respond to "each_function_name" to export
their function symbols.
Note that the sysctl doesn't currently allow distinguishing multiple
symbols with the same name from different modules, but could quite
easily without a change to the linker operation. This will be a nicety
to have when it can be used.
Obtained from: NAI Labs CBOSS project
Funded by: DARPA
by the profiler on a running system. This is not done sparsely, as
memory is cheaper than processor speed and each gprof mcount() and
mexitcount() operation is already very expensive.
Obtained from: NAI Labs CBOSS project
Funded by: DARPA
number) instead of allocating next free unit for them. If someone needs
fixed place, he must specify it correctly. "Allocating next" is especially bad
because leads to double device detection and to "repeat make_dev panic" as
result. This can happens if the same devices present somewhere on PCI bus,
hints and ACPI. Making them present in one place only not always
possible, "sc" f.e. can't be removed from hints, it results to no console at
all.
2) In make_device(), detect when devclass_add_device() fails, free dev and
return. I.e. add missing error checking. This part needed to finish fix in 1),
but must be done this way in anycase, with old variant too.
non-existent disk in a legacy /dev on a DEVFS system would panic the system
if stat(2)'ed.
Do not whine about anonymous device nodes not having a si_devsw, they're
not supposed to.
Make it a panic to repeat make_dev() or destroy_dev(), this check
should maybe be neutered when -current goes -stable.
Whine if devsw() is called on anon dev_t's in a devfs system.
Make a hack to avoid our lazy-eval disk code triggering the above whine.
Fix the multiple make_dev() in disk code by making ${disk}${unit}s${slice}
an alias/symlink to ${disk}${unit}s${slice}c
it has not yet returned. Use this flag to deny debugging requests while
the process is execve()ing, and close once and for all any race conditions
that might occur between execve() and various debugging interfaces.
Reviewed by: jhb, rwatson
strip the space from '( struct thread *...', wrap long lines.
o Remove an unneeded comment on the topic of no lock being required as
part of the NDINIT() in __acl_get_file(), as it's really not required
there.
Obtained from: TrustedBSD Project
of Giant during the Giant unwinding phase, and start work on instrumenting
Giant for the file and proc mutexes.
These wrappers allow developers to turn on and off Giant around various
subsystems. DEVELOPERS SHOULD NEVER TURN OFF GIANT AROUND A SUBSYSTEM JUST
BECAUSE THE SYSCTL EXISTS! General developers should only considering
turning on Giant for a subsystem whos default is off (to help track down
bugs). Only developers working on particular subsystems who know what
they are doing should consider turning off Giant.
These wrappers will greatly improve our ability to unwind Giant and test
the kernel on a (mostly) subsystem by subsystem basis. They allow Giant
unwinding developers (GUDs) to emplace appropriate subsystem and structural
mutexes in the main tree and then request that the larger community test
the work by turning off Giant around the subsystem(s), without the larger
community having to mess around with patches. These wrappers also allow
GUDs to boot into a (more likely to be) working system in the midst of
their unwinding work and to test that work under more controlled
circumstances.
There is a master sysctl, kern.giant.all, which defaults to 0 (off). If
turned on it overrides *ALL* other kern.giant sysctls and forces Giant to
be turned on for all wrapped subsystems. If turned off then Giant around
individual subsystems are controlled by various other kern.giant.XXX sysctls.
Code which overlaps multiple subsystems must have all related subsystem Giant
sysctls turned off in order to run without Giant.
while it is on a queue with the queue lock and remove the per-task locks.
- Remove TASK_DESTROY now that it is no longer needed.
- Go back to inlining TASK_INIT now that it is short again.
Inspired by: dfr
userland. The per thread ucred reference is immutable and thus needs no
locks to be read. However, until all the proc locking associated with
writes to p_ucred are completed, it is still not safe to use the per-thread
reference.
Tested on: x86 (SMP), alpha, sparc64
queue, and a mutex to protect the global list of taskqueues. The only
visible change is that a TASK_DESTROY() macro has been added to mirror
the TASK_INIT() macro to destroy a task before it is free'd.
Submitted by: Andrew Reiter <awr@watson.org>
splhigh() before the mtx_unlock and tsleep(). The splhigh() was probably
correct in the original code using simplelocks but is not correct in
5.0-current.
Noticed by: Andrew Reiter <awr@FreeBSD.org>
real effect.
Optimize vfs_msync(). Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list. Improves looping case by 500%.
Optimize ffs_sync(). Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list. This makes a couple of assumptions,
which I believe are ok, in regards to vnode stability when the mount list
mutex is held. Improves looping case by 500%.
(more optimization work is needed on top of these fixes)
MFC after: 1 week
or the cluster will not be properly merged. Dup the code from
cluster_wbuild() and add some printf()s to see if bad cases are present.
MFC after: 2 weeks
returns an success/failure code rather than the actual value.
- Add getenv_string() which copies a string from the environment to another
string and returns true on success.
all successful calls to VOP_OPEN() might be reflected in a call to
VOP_CLOSE(). For now, simply add a comment reflecting this problem;
this should be fixed at some point.
terminated and flushes pending dirty pages it is possible for the
object to be ref'd (0->1) and then deref'd (1->0) during termination.
We do not terminate the object a second time.
Document vop_stdgetvobject() to explicitly allow it to be called without
the vnode interlock held (for upcoming sync_msync() and ffs_sync()
performance optimizations)
MFC after: 3 days
the system load average. Previously, the load average measurement
was susceptible to synchronisation with processes that run at
regular intervals such as the system bufdaemon process.
Each interval is now chosen at random within the range of 4 to 6
seconds. This large variation is chosen so that over the shorter
5-minute load average timescale there is a good dispersion of
samples across the 5-second sample period (the time to perform 60
5-second samples now has a standard deviation of approx 4.5 seconds).
to kern_synch.c in preparation for adding some jitter to the
inter-sample time.
Note that the "vm.loadavg" sysctl still lives in vm_meter.c which
isn't the right place, but it is appropriate for the current (bad)
name of that sysctl.
Suggested by: jhb (some time ago)
Reviewed by: bde
off to witness_init() making the check for double intializating a lock by
testing the LO_INITIALIZED flag moot. Workaround this by checking the
LO_INITIALIZED flag ourself before we bzero the lock structure.
be so dangerous it isn't funny. eg: if you panic inside NFS or softdep,
and then try and sync you run into held locks and cause either deadlocks,
recursive panics or other interesting chaos. Default is unchanged.
number, portable OpenAFS applications don't have to attempt to determine
what system call number was dynamically allocated. No system call
prototype or implementation is defined.
Requested by: Tom Maher <tardis@watson.org>
This stops panics on unloading modules which define their own sysctl sets.
However, this also removes the protection against somebody actually
defining a static sysctl with an oid in the range of the dynamic ones,
which would break badly if there is already a dynamic sysctl with
the requested oid.
Apparently, the algorithm for removing sysctl sets needs a bit more work.
For the present, the panic I introduced only leads to Bad Things (tm).
Submitted by: many users of -current :(
Pointy hat to: roam (myself) for not testing rev. 1.112 enough.
- Add proc locking to the jail() syscall. This mostly involved shuffling
a few things around so that blockable things like malloc and copyin
were performed before acquiring the lock and checking the existing
ucred and then updating the ucred as one "atomic" change under the proc
lock.
- crhold() returns a reference to the ucred whose refcount it bumps.
- crcopy() now simply copies the credentials from one credential to
another and has no return value.
- a new crshared() primitive is added which returns true if a ucred's
refcount is > 1 and false (0) otherwise.
Updated by peter following KSE and Giant pushdown.
I've running with this patch for two week with no ill side effects.
PR: kern/12014: Fix SysV Semaphore handling
Submitted by: Peter Jeremy <peter.jeremy@alcatel.com.au>
a single kern.security.seeotheruids_permitted, describes as:
"Unprivileged processes may see subjects/objects with different real uid"
NOTE: kern.ps_showallprocs exists in -STABLE, and therefore there is
an API change. kern.ipc.showallsockets does not.
- Check kern.security.seeotheruids_permitted in cr_cansee().
- Replace visibility calls to socheckuid() with cr_cansee() (retain
the change to socheckuid() in ipfw, where it is used for rule-matching).
- Remove prison_unpcb() and make use of cr_cansee() against the UNIX
domain socket credential instead of comparing root vnodes for the
UDS and the process. This allows multiple jails to share the same
chroot() and not see each others UNIX domain sockets.
- Remove unused socheckproc().
Now that cr_cansee() is used universally for socket visibility, a variety
of policies are more consistently enforced, including uid-based
restrictions and jail-based restrictions. This also better-supports
the introduction of additional MAC models.
Reviewed by: ps, billf
Obtained from: TrustedBSD Project
processes to attach debugging to themselves even though the
global kern_unprivileged_procdebug_permitted policy might disallow
this.
o Move the kern_unprivileged_procdebug_permitted check above the
(p1==p2) check.
Reviewed by: des
Until now, the ptrace syscall was implemented as a wrapper that called
various functions in procfs depending on which ptrace operation was
requested. Most of these functions were themselves wrappers around
procfs_{read,write}_{,db,fp}regs(), with only some extra error checks,
which weren't necessary in the ptrace case anyway.
This commit moves procfs_rwmem() from procfs_mem.c into sys_process.c
(renaming it to proc_rwmem() in the process), and implements ptrace()
directly in terms of procfs_{read,write}_{,db,fp}regs() instead of
having it fake up a struct uio and then call procfs_do{,db,fp}regs().
It also moves the prototypes for procfs_{read,write}_{,db,fp}regs()
and proc_rwmem() from proc.h to ptrace.h, and marks all procfs files
except procfs_machdep.c as "optional procfs" instead of "standard".
confused. Since sa_sigaction and sa_handler alias each other in a
union, the bug was completely harmless. This had been fixed as part
of the SIGCHLD changes in revision 1.125, but it was reverted when
they were backed out in revision 1.126.
'regression.*'.
o Add 'regression.securelevel_nonmonotonic', conditional on 'options
REGRESSION', which allows the securelevel to be lowered for the purposes
of efficient regression testing of securelevel policy decisions.
Regression tests for securelevels will be committed shortly.
NOTE: 'options REGRESSION' should never be used on production machines, as
it permits violation of system invariants so as to improve the ability to
effectively test edge cases, and improve testing efficiency.
syscalls are of type NODEF but not in a way that fits the given
definition of that type. The exact difference of lkmressys and
lkmnosys is unclear, which makes it all the more confusing. A
reevaluation of what we have and what we really need is in order.
Spotted by: Maxime Henrion <mux@qualys.com>
Pointy hat: marcel
wait for both read AND write I/O to complete. Only NFS calls vinvalbuf()
on an active vnode (when the server indicates that the file is stale), so
this bug fix only effects NFS clients.
MFC after: 3 days
kern.ipc.showallsockets is set to 0.
Submitted by: billf (with modifications by me)
Inspired by: Dave McKay (aka pm aka Packet Magnet)
Reviewed by: peter
MFC after: 2 weeks
missed in the previous commit; a line that exceeded 80 characters. No
functional changes, but the object file's md5 checksum changes because some
lines have been displaced.
1) Allow the sending of more than one control message at a time
over a unix domain socket. This should cover the PR 29499.
2) This requires that unp_{ex,in}ternalize and unp_scan understand
mbufs with more than one control message at a time.
3) Internalize and externalize used to work on the mbuf in-place.
This made life quite complicated and the code for sizeof(int) <
sizeof(file *) could end up doing the wrong thing. The patch always
create a new mbuf/cluster now. This resulted in the change of the
prototype for the domain externalise function.
4) You can now send SCM_TIMESTAMP messages.
5) Always use CMSG_DATA(cm) to determine the start where the data
in unp_{ex,in}ternalize. It was using ((struct cmsghdr *)cm + 1)
in some places, which gives the wrong alignment on the alpha.
(NetBSD made this fix some time ago).
This results in an ABI change for discriptor passing and creds
passing on the alpha. (Probably on the IA64 and Spare ports too).
6) Fix userland programs to use CMSG_* macros too.
7) Be more careful about freeing mbufs containing (file *)s.
This is made possible by the prototype change of externalise.
PR: 29499
MFC after: 6 weeks
in vfs_syscalls.c:
if (mp->mnt_stat.f_owner != p->p_ucred->cr_uid &&
(error = suser_td(td)) != 0) {
unwrap_lots_of_stuff();
return (error);
}
to:
if (mp->mnt_stat.f_owner != p->p_ucred->cr_uid) {
error = suser_td(td);
if (error) {
unwrap_lots_of_stuff();
return (error);
}
}
This makes the code more readable when complex clauses are in use,
and minimizes conflicts for large outstanding patchsets modifying the
kernel authorization code (of which I have several), especially where
existing authorization and context code are combined in the same if()
conditional.
Obtained from: TrustedBSD Project
to avoid removing higher level directory vnodes from the namecache has
no perceivable effect and will be removed. This is especially true
when vmiodirenable is turned on, which it is by default now. ( vmiodirenable
makes a huge difference in directory caching ). The vfs.vmiodirenable and
vfs.nameileafonly sysctls have been left in to allow further testing, but
I expect to rip out vfs.nameileafonly soon too.
I have also determined through testing that the real problem with numvnodes
getting too large is due to the VM Page cache preventing the vnode from
being reclaimed. The directory stuff made only a tiny dent relative
to Poul's original code, enough so that some tests succeeded. But tests
with several million small files show that the bigger problem is the VM Page
cache. This will have to be addressed by a future commit.
MFC after: 3 days
when I changed the allocator bits. This implements per-CPU mbtypes
stats by keeping net number of decrements/increments of a given mbtype
per-CPU and then summing all of the per-CPU mbtypes to produce the total
net number of allocated mbufs of the given mbtype.
Counters are carefully balanced to avoid/prevent underflows/overflows.
mbtypes stats are re-enabled with the idea that we may occasionally
(although very rarely) observe slight inconsistencies in the stat
reporting. Most of the time, we should be fine, though.
Also make appropriate modifications to netstat(1) and systat(1) to do
the necessary reporting.
Submitted by: Jiangyi Liu <jyliu@163.net>
the static callout list allocated by the system.
Change malloc type from M_TEMP to M_KQUEUE to better track memory.
Add a kern.kq_calloutmax to globally limit the amount of kernel memory
that can be allocated by callouts.
Submitted by: iedowse (items 1, 2)