On some architectures, the structures returned by PT_GET*REGS were not
fully populated and could contain uninitialized stack memory. The same
issue existed with the register files in procfs.
Reported by: Thomas Barabosch, Fraunhofer FKIE
Reviewed by: kib
MFC after: 3 days
Security: kernel stack memory disclosure
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18421
by doing most of the work in a new function prison_add_vfs in kern_jail.c
Now a jail-enabled filesystem need only mark itself with VFCF_JAIL, and
the rest is taken care of. This includes adding a jail parameter like
allow.mount.foofs, and a sysctl like security.jail.mount_foofs_allowed.
Both of these used to be a static list of known filesystems, with
predefined permission bits.
Reviewed by: kib
Differential Revision: D14681
opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compabibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.
Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensure opt_compat.h
is created on all architectures.
Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.
Reviewed by: kib, cem, jhb, jtl
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14941
The existing algorithm in procfs_map() to calculate count of resident
pages in an entry is too primitive, resulting in too long run time for
large sparse mapping entries. Re-use the kern_proc_vmmap_resident()
from kern_proc.c which only looks at the existing pages in the
iterations.
Also, this makes procfs to honor kern.proc_vmmap_skip_resident_count,
if user does not need this information.
Reported by: Glenn Weinberg <glenn.weinberg@intel.com>
PR: 224532
No objections from: des (procfs maintainer)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D13595
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
Mainly focus on files that use BSD 3-Clause license.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
Initially, only tag files that use BSD 4-Clause "Original" license.
RelNotes: yes
Differential Revision: https://reviews.freebsd.org/D13133
This interface has no in-tree consumers and has been more or less
non-functional for several releases.
Remove manpage note that the procfs special file 'mem' is grouped to
kmem. This hasn't been true since r81107.
Remove procfs' README file. It is an out of date duplication of the manpage
(quoth the README: "since the bsd kernel is single-processor...").
Reviewed by: vangyzen, bcr (manpage)
Approved by: des (procfs maintainer), vangyzen (mentor)
Differential Revision: https://reviews.freebsd.org/D9802
Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.
Submitted by: Jan Schaumann <jschauma@stevens.edu>
Pull Request: https://github.com/freebsd/freebsd/pull/96
and getboottimebin(9) KPI. Change consumers of boottime to use the
KPI. The variables were renamed to avoid shadowing issues with local
variables of the same name.
Issue is that boottime* should be adjusted from tc_windup(), which
requires them to be members of the timehands structure. As a
preparation, this commit only introduces the interface.
Some uses of boottime were found doubtful, e.g. NLM uses boottime to
identify the system boot instance. Arguably the identity should not
change on the leap second adjustment, but the commit is about the
timekeeping code and the consumers were kept bug-to-bug compatible.
Tested by: pho (as part of the bigger patch)
Reviewed by: jhb (same)
Discussed with: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
X-Differential revision: https://reviews.freebsd.org/D7302
is non-zero.
- Include the process address in the PROC_ASSERT_HELD() and
PROC_ASSERT_NOT_HELD() assertion messages so that the corresponding
process can be found easily when debugging.
MFC after: 1 week
SIGCHLD signal, should keep full 32 bits of the status passed to the
_exit(2).
Split the combined p_xstat of the struct proc into the separate exit
status p_xexit for normal process exit, and signalled termination
information p_xsig. Kernel-visible macro KW_EXITCODE() reconstructs
old p_xstat from p_xexit and p_xsig. p_xexit contains complete status
and copied out into si_status.
Requested by: Joerg Schilling
Reviewed by: jilles (previous version), pho
Tested by: pho
Sponsored by: The FreeBSD Foundation
When providing memory map information to userland, populate the vnode pointer
for tmpfs files. Set the memory mapping to appear as a vnode type, to match
FreeBSD 9 behavior.
This fixes the use of tmpfs files with the dtrace pid provider,
procstat -v, procfs, linprocfs, pmc (pmcstat), and ptrace (PT_VM_ENTRY).
Submitted by: Eric Badger <eric@badgerio.us> (initial revision)
Obtained from: Dell Inc.
PR: 198431
MFC after: 2 weeks
Reviewed by: jhb
Approved by: kib (mentor)
tracing. This matches the behavior of ptrace(PT_ATTACH). Also,
the procfs detach request assumes p_oppid is always set.
Reviewed by: kib
MFC after: 2 weeks
- Threads lifetime cycle, in particular, counting of the threads in
the process, and interlocking with process mutex and thread lock.
The main reason of this is that turnstile locks are after thread
locks, so you e.g. cannot unlock blockable mutex (think process
mutex) while owning thread lock.
- Virtual and profiling itimers, since the timers activation is done
from the clock interrupt context. Replace the p_slock by p_itimmtx
and PROC_ITIMLOCK().
- Profiling code (profil(2)), for similar reason. Replace the p_slock
by p_profmtx and PROC_PROFLOCK().
- Resource usage accounting. Need for the spinlock there is subtle,
my understanding is that spinlock blocks context switching for the
current thread, which prevents td_runtime and similar fields from
changing (updates are done at the mi_switch()). Replace the p_slock
by p_statmtx and PROC_STATLOCK().
The split is done mostly for code clarity, and should not affect
scalability.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.
The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
- VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
- VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
- VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
- VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
(in order to avoid visibility of implementation details)
- The read-mode operations are added:
VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
sys/mutex.h in consumers directly to cater its inlining functions
using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
the compat layer because the name clash between FreeBSD and solaris
versions must be avoided.
At this purpose zfs redefines the vm_object locking functions
directly, isolating the FreeBSD components in specific compat stubs.
The KPI results heavilly broken by this commit. Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).
Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: pjd (ZFS specific review)
Discussed with: alc
Tested by: pho
In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.
The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.
Conducted and reviewed by: attilio
Tested by: pho
to read strings completely to know the actual size.
As a side effect it fixes the issue with kern.proc.args and kern.proc.env
sysctls, which didn't return the size of available data when calling
sysctl(3) with the NULL argument for oldp.
Note, in get_ps_strings(), which does actual work for proc_getargv() and
proc_getenvv(), we still have a safety limit on the size of data read in
case of a corrupted procces stack.
Suggested by: kib
MFC after: 3 days
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.
Reviewed by: rwatson
Approved by: re (bz)
to resolve the path of the text vnode of the process. The behaviour is
very confusing for any consumer of the procfs, in particular, java.
Reported and tested by: bf
MFC after: 2 weeks
Approved by: re (bz)
in "struct vm_object". This is required to make it possible to account
for per-jail swap usage.
Reviewed by: kib@
Tested by: pho@
Sponsored by: FreeBSD Foundation
consequence, fill_regs() and fill_fpregs() access random data, usually
on the thread kernel stack. Most often the td_frame points to the
previous frame saved by last kernel entry sequence, but this is not
guaranteed.
For /proc/<pid>/{regs,fpregs} read access, require the thread to be in
stopped state. Otherwise, return EBUSY as is done for write case.
Reported and tested by: pho
Approved by: des (procfs maintainer)
MFC after: 1 week
via %s
Most of the cases looked harmless, but this is done for the sake of
correctness. In one case it even allowed to drop an intermediate buffer.
Found by: clang
MFC after: 2 week
flag, and for each thread, TDB_SUSPEND debug flag, same as it is done by
exit1() for orphaned debugee.
Approved by: des (procfs maintainer)
MFC after: 1 week
for upcoming 64-bit PowerPC and MIPS support. This renames the COMPAT_IA32
option to COMPAT_FREEBSD32, removes some IA32-specific code from MI parts
of the kernel and enhances the freebsd32 compatibility code to support
big-endian platforms.
Reviewed by: kib, jhb
a device pager (OBJT_DEVICE) object in that it uses fictitious pages to
provide aliases to other memory addresses. The primary difference is that
it uses an sglist(9) to determine the physical addresses for a given offset
into the object instead of invoking the d_mmap() method in a device driver.
Reviewed by: alc
Approved by: re (kensmith)
MFC after: 2 weeks
rlimit RLIMIT_SWAP that limits the amount of swap that may be reserved
for the uid.
The accounting information (charge) is associated with either map entry,
or vm object backing the entry, assuming the object is the first one
in the shadow chain and entry does not require COW. Charge is moved
from entry to object on allocation of the object, e.g. during the mmap,
assuming the object is allocated, or on the first page fault on the
entry. It moves back to the entry on forks due to COW setup.
The per-entry granularity of accounting makes the charge process fair
for processes that change uid during lifetime, and decrements charge
for proper uid when region is unmapped.
The interface of vm_pager_allocate(9) is extended by adding struct ucred *,
that is used to charge appropriate uid when allocation if performed by
kernel, e.g. md(4).
Several syscalls, among them is fork(2), may now return ENOMEM when
global or per-uid limits are enforced.
In collaboration with: pho
Reviewed by: alc
Approved by: re (kensmith)
by creating a child jail, which is visible to that jail and to any
parent jails. Child jails may be restricted more than their parents,
but never less. Jail names reflect this hierarchy, being MIB-style
dot-separated strings.
Every thread now points to a jail, the default being prison0, which
contains information about the physical system. Prison0's root
directory is the same as rootvnode; its hostname is the same as the
global hostname, and its securelevel replaces the global securelevel.
Note that the variable "securelevel" has actually gone away, which
should not cause any problems for code that properly uses
securelevel_gt() and securelevel_ge().
Some jail-related permissions that were kept in global variables and
set via sysctls are now per-jail settings. The sysctls still exist for
backward compatibility, used only by the now-deprecated jail(2) system
call.
Approved by: bz (mentor)
compare map->timestamp with saved timestamp after map read lock is
reacquired, not with saved timestamp + 1. The only consequence of the +1
was unconditional lookup of the next map entry, though.
Tested by: pho
Approved by: des
MFC after: 2 weeks
message for r185765.
Noted by: rdivacky
Requested by: des
Commit message for r185765 should be:
In procfs map handler, and in linprocfs maps handler, do not call
vn_fullpath() while having vm map locked. This is done in anticipation
of the vop_vptocnp commit, that would make vn_fullpath sometime
acquire vnode lock.
Also, in linprocfs, maps handler already acquires vnode lock.
No objections from: des
MFC after: 2 week
sbuf instead of doing uiomove. This allows for reads from non-zero
offsets to work.
Patch is forward-ported des@' one, and was adopted to current code
by dchagin@ and me.
Reviewed by: des (linprocfs part)
PR: kern/101453
MFC after: 1 week
vnode in question does not need to be held. All the data structures used
during the name lookup are protected by the global name cache lock.
Instead, the caller merely needs to ensure a reference is held on the
vnode (such as vhold()) to keep it from being freed.
In the case of procfs' <pid>/file entry, grab the process lock while we
gain a new reference (via vhold()) on p_textvp to fully close races with
execve(2).
For the kern.proc.vmmap sysctl handler, use a shared vnode lock around
the call to VOP_GETATTR() rather than an exclusive lock.
MFC after: 1 month