Commit Graph

311 Commits

Author SHA1 Message Date
kib
f315a59476 Account the writeable shared mappings backed by file in the vnode
v_writecount.  Keep the amount of the virtual address space used by
the mappings in the new vm_object un_pager.vnp.writemappings
counter. The vnode v_writecount is incremented when writemappings gets
non-zero value, and decremented when writemappings is returned to
zero.

Writeable shared vnode-backed mappings are accounted for in vm_mmap(),
and vm_map_insert() is instructed to set MAP_ENTRY_VN_WRITECNT flag on
the created map entry.  During deferred map entry deallocation,
vm_map_process_deferred() checks for MAP_ENTRY_VN_WRITECOUNT and
decrements writemappings for the vm object.

Now, the writeable mount cannot be demoted to read-only while
writeable shared mappings of the vnodes from the mount point
exist. Also, execve(2) fails for such files with ETXTBUSY, as it
should be.

Noted by:	tegge
Reviewed by:	tegge (long time ago, early version), alc
Tested by:	pho
MFC after:	3 weeks
2012-02-23 21:07:16 +00:00
alc
506ef17e7f When vm_mmap() is used to map a vm object into a kernel vm_map, it
makes no sense to check the size of the kernel vm_map against the
user-level resource limits for the calling process.

Reviewed by:	kib
2012-02-16 06:45:51 +00:00
kib
dacbfe950a Close a race due to dropping of the map lock between creating map entry
for a shared mapping and marking the entry for inheritance.
Other thread might execute vmspace_fork() in between (e.g. by fork(2)),
resulting in the mapping becoming private.

Noted and reviewed by:	alc
MFC after:	1 week
2012-02-11 17:29:07 +00:00
kmacy
99851f359e In order to maximize the re-usability of kernel code in user space this
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.

Reviewed by:	rwatson
Approved by:	re (bz)
2011-09-16 13:58:51 +00:00
kib
a9d505a22a Split the vm_page flags PG_WRITEABLE and PG_REFERENCED into atomic
flags field. Updates to the atomic flags are performed using the atomic
ops on the containing word, do not require any vm lock to be held, and
are non-blocking. The vm_page_aflag_set(9) and vm_page_aflag_clear(9)
functions are provided to modify afalgs.

Document the changes to flags field to only require the page lock.

Introduce vm_page_reference(9) function to provide a stable KPI and
KBI for filesystems like tmpfs and zfs which need to mark a page as
referenced.

Reviewed by:    alc, attilio
Tested by:      marius, flo (sparc64); andreast (powerpc, powerpc64)
Approved by:	re (bz)
2011-09-06 10:30:11 +00:00
rwatson
4af919b491 Second-to-last commit implementing Capsicum capabilities in the FreeBSD
kernel for FreeBSD 9.0:

Add a new capability mask argument to fget(9) and friends, allowing system
call code to declare what capabilities are required when an integer file
descriptor is converted into an in-kernel struct file *.  With options
CAPABILITIES compiled into the kernel, this enforces capability
protection; without, this change is effectively a no-op.

Some cases require special handling, such as mmap(2), which must preserve
information about the maximum rights at the time of mapping in the memory
map so that they can later be enforced in mprotect(2) -- this is done by
narrowing the rights in the existing max_protection field used for similar
purposes with file permissions.

In namei(9), we assert that the code is not reached from within capability
mode, as we're not yet ready to enforce namespace capabilities there.
This will follow in a later commit.

Update two capability names: CAP_EVENT and CAP_KEVENT become
CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they
represent.

Approved by:	re (bz)
Submitted by:	jonathan
Sponsored by:	Google Inc
2011-08-11 12:30:23 +00:00
kib
e45048555a Extract the code to translate VM error into errno, into an exported
function vm_mmap_to_errno(). It is useful for the drivers that implement
mmap(2)-like functionality, to be able to return error codes consistent
with mmap(2).

Sponsored by:	The FreeBSD Foundation
No objections from:	alc
MFC after:	1 week
2011-07-10 20:49:13 +00:00
kib
315e379ec2 Style.
MFC after:	3 days
2011-07-10 20:45:13 +00:00
trasz
4a17b24427 All the racct_*() calls need to happen with the proc locked. Fixing this
won't happen before 9.0.  This commit adds "#ifdef RACCT" around all the
"PROC_LOCK(p); racct_whatever(p, ...); PROC_UNLOCK(p)" instances, in order
to avoid useless locking/unlocking in kernels built without "options RACCT".
2011-07-06 20:06:44 +00:00
trasz
92bec9b84c Add accounting for most of the memory-related resources.
Sponsored by:	The FreeBSD Foundation
Reviewed by:	kib (earlier version)
2011-04-05 20:23:59 +00:00
pluknet
3061aea0d2 Remove sysctl vm.max_proc_mmap used to protect from KVA space exhaustion.
As it was pointed out by Alan Cox, that no longer serves its purpose with
the modern UMA allocator compared to the old one used in 4.x days.

The removal of sysctl eliminates max_proc_mmap type overflow leading to
the broken mmap(2) seen with large amount of physical memory on arches
with factually unbound KVA space (such as amd64).  It was found that
slightly less than 256GB of physmem was enough to trigger the overflow.

Reviewed by:	alc, kib
Approved by:	avg (mentor)
MFC after:	2 months
2011-02-24 09:22:56 +00:00
trasz
b2b2bfee2e Fix comment intentation. 2010-12-04 17:41:58 +00:00
kib
f8badff8ef Do not use __FreeBSD_version prefix for the special osrel version.
The ports/Mk/bsd.port.mk uses sys/param.h to fetch osrel, and cannot
grok several constants with the prefix.

Reported and tested by:	    swell.k gmail com
MFC after:   1 week
2010-11-14 21:59:11 +00:00
kib
336fd1996d Use symbolic names instead of hardcoding values for magic p_osrel constants.
MFC after:   1 week
2010-11-14 18:24:12 +00:00
alc
ec0b99e7f0 Allow a POSIX shared memory object that is opened for read but not for
write to nonetheless be mapped PROT_WRITE and MAP_PRIVATE, i.e.,
copy-on-write.

(This is a regression in the new implementation of POSIX shared memory
objects that is used by HEAD and RELENG_8.  This bug does not exist in
RELENG_7's user-level, file-based implementation.)

PR:		150260
MFC after:	3 weeks
2010-09-19 19:42:04 +00:00
rstone
62d7f50b87 Fix a typo in r212281. uintptr -> uintptr_t
Pointy hat to:  rstone

Approved by:    emaste (mentor)
MFC after:      2 weeks
2010-09-07 02:51:11 +00:00
rstone
0dd3ce30eb In munmap() downgrade the vm_map_lock to a read lock before taking a read
lock on the pmc-sx lock.  This prevents a deadlock with
pmc_log_process_mappings, which has an exclusive lock on pmc-sx and tries
to get a read lock on a vm_map.  Downgrading the vm_map_lock in munmap
allows pmc_log_process_mappings to continue, preventing the deadlock.

Without this change I could cause a deadlock on a multicore 8.1-RELEASE
system by having one thread constantly mmap'ing and then munmap'ing a
PROT_EXEC mapping in a loop while I repeatedly invoked and stopped pmcstat
in system-wide sampling mode.

Reviewed by:	fabient
Approved by:	emaste (mentor)
MFC after:	2 weeks
2010-09-07 00:23:45 +00:00
alc
115cb6b29f Add the MAP_PREFAULT_READ option to mmap(2).
Reviewed by:	jhb, kib
2010-08-28 16:57:07 +00:00
kib
ba7ee96f4a Add new make_dev_p(9) flag MAKEDEV_ETERNAL to inform devfs that created
cdev will never be destroyed. Propagate the flag to devfs vnodes as
VV_ETERNVALDEV. Use the flags to avoid acquiring devmtx and taking a
thread reference on such nodes.

In collaboration with:	pho
MFC after:	1 month
2010-08-06 09:42:15 +00:00
trasz
9d4312eb10 Fix commented out resource limit check in mlockall(2). It's still racy,
but at least less misleading.
2010-07-27 19:26:18 +00:00
alc
3f1d4b057c Push down page queues lock acquisition in pmap_enter_object() and
pmap_is_referenced().  Eliminate the corresponding page queues lock
acquisitions from vm_map_pmap_enter() and mincore(), respectively.  In
mincore(), this allows some additional cases to complete without ever
acquiring the page queues lock.

Assert that the page is managed in pmap_is_referenced().

On powerpc/aim, push down the page queues lock acquisition from
moea*_is_modified() and moea*_is_referenced() into moea*_query_bit().
Again, this will allow some additional cases to complete without ever
acquiring the page queues lock.

Reorder a few statements in vm_page_dontneed() so that a race can't lead
to an old reference persisting.  This scenario is described in detail by a
comment.

Correct a spelling error in vm_page_dontneed().

Assert that the object is locked in vm_page_clear_dirty(), and restrict the
page queues lock assertion to just those cases in which the page is
currently writeable.

Add object locking to vnode_pager_generic_putpages().  This was the one
and only place where vm_page_clear_dirty() was being called without the
object being locked.

Eliminate an unnecessary vm_page_lock() around vnode_pager_setsize()'s call
to vm_page_clear_dirty().

Change vnode_pager_generic_putpages() to the modern-style of function
definition.  Also, change the name of one of the parameters to follow
virtual memory system naming conventions.

Reviewed by:	kib
2010-05-26 18:00:44 +00:00
alc
32b13ee957 Roughly half of a typical pmap_mincore() implementation is machine-
independent code.  Move this code into mincore(), and eliminate the
page queues lock from pmap_mincore().

Push down the page queues lock into pmap_clear_modify(),
pmap_clear_reference(), and pmap_is_modified().  Assert that these
functions are never passed an unmanaged page.

Eliminate an inaccurate comment from powerpc/powerpc/mmu_if.m:
Contrary to what the comment says, pmap_mincore() is not simply an
optimization.  Without a complete pmap_mincore() implementation,
mincore() cannot return either MINCORE_MODIFIED or MINCORE_REFERENCED
because only the pmap can provide this information.

Eliminate the page queues lock from vfs_setdirty_locked_object(),
vm_pageout_clean(), vm_object_page_collect_flush(), and
vm_object_page_clean().  Generally speaking, these are all accesses
to the page's dirty field, which are synchronized by the containing
vm object's lock.

Reduce the scope of the page queues lock in vm_object_madvise() and
vm_page_dontneed().

Reviewed by:	kib (an earlier version)
2010-05-24 14:26:57 +00:00
kmacy
1dc1263413 On Alan's advice, rather than do a wholesale conversion on a single
architecture from page queue lock to a hashed array of page locks
(based on a patch by Jeff Roberson), I've implemented page lock
support in the MI code and have only moved vm_page's hold_count
out from under page queue mutex to page lock. This changes
pmap_extract_and_hold on all pmaps.

Supported by: Bitgravity Inc.

Discussed with: alc, jeffr, and kib
2010-04-30 00:46:43 +00:00
alc
0a905b1db9 Resurrect pmap_is_referenced() and use it in mincore(). Essentially,
pmap_ts_referenced() is not always appropriate for checking whether or
not pages have been referenced because it clears any reference bits
that it encounters.  For example, in mincore(), clearing the reference
bits has two negative consequences.  First, it throws off the activity
count calculations performed by the page daemon.  Specifically, a page
on which mincore() has called pmap_ts_referenced() looks less active
to the page daemon than it should.  Consequently, the page could be
deactivated prematurely by the page daemon.  Arguably, this problem
could be fixed by having mincore() duplicate the activity count
calculation on the page.  However, there is a second problem for which
that is not a solution.  In order to clear a reference on a 4KB page,
it may be necessary to demote a 2/4MB page mapping.  Thus, a mincore()
by one process can have the side effect of demoting a superpage
mapping within another process!
2010-04-24 17:32:52 +00:00
jhb
399c01844a Reject attempts to create a MAP_ANON mapping with a non-zero offset.
PR:		kern/71258
Submitted by:	Alexander Best
MFC after:	2 weeks
2010-03-23 21:08:07 +00:00
bz
3b88ee8187 Back out the functional parts from r197537. After r197711, affecting all
user mappings, mmap no longer needs special treatment.
2009-10-02 17:51:46 +00:00
simon
a0b7c793b4 Do not allow mmap with the MAP_FIXED argument to map at address zero.
This is done to make it harder to exploit kernel NULL pointer security
vulnerabilities.  While this of course does not fix vulnerabilities,
it does mitigate their impact.

Note that this may break some applications, most likely emulators or
similar, which for one reason or another require mapping memory at
zero.

This restriction can be disabled with the security.bsd.mmap_zero
sysctl variable.

Discussed with:	rwatson, bz
Tested by:	bz (Wine), simon (VirtualBox)
Submitted by:	jhb
2009-09-27 14:49:51 +00:00
kib
c04bcd3033 Old (a.out) rtld attempts to mmap zero-length region, e.g. when bss
of the linked object is zero-length. More old code assumes that mmap
of zero length returns success.

For a.out and pre-8 ELF binaries, allow the mmap of zero length.

Reported by:	tegge
Reviewed by:	tegge, alc, jhb
MFC after:	3 days
2009-09-20 12:40:56 +00:00
jhb
d81f73fcb5 - Change mmap() to fail requests with EINVAL that pass a length of 0. This
behavior is mandated by POSIX.
- Do not fail requests that pass a length greater than SSIZE_MAX
  (such as > 2GB on 32-bit platforms).  The 'len' parameter is actually
  an unsigned 'size_t' so negative values don't really make sense.

Submitted by:	Alexander Best  alexbestms at math.uni-muenster.de
Reviewed by:	alc
Approved by:	re (kib)
MFC after:	1 week
2009-07-14 19:45:36 +00:00
kib
fa686c638e Implement global and per-uid accounting of the anonymous memory. Add
rlimit RLIMIT_SWAP that limits the amount of swap that may be reserved
for the uid.

The accounting information (charge) is associated with either map entry,
or vm object backing the entry, assuming the object is the first one
in the shadow chain and entry does not require COW. Charge is moved
from entry to object on allocation of the object, e.g. during the mmap,
assuming the object is allocated, or on the first page fault on the
entry. It moves back to the entry on forks due to COW setup.

The per-entry granularity of accounting makes the charge process fair
for processes that change uid during lifetime, and decrements charge
for proper uid when region is unmapped.

The interface of vm_pager_allocate(9) is extended by adding struct ucred *,
that is used to charge appropriate uid when allocation if performed by
kernel, e.g. md(4).

Several syscalls, among them is fork(2), may now return ENOMEM when
global or per-uid limits are enforced.

In collaboration with:	pho
Reviewed by:	alc
Approved by:	re (kensmith)
2009-06-23 20:45:22 +00:00
rwatson
f4934662e5 Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with:	pjd
2009-06-05 14:55:22 +00:00
jhb
fea04a3fd1 Add an extension to the character device interface that allows character
device drivers to use arbitrary VM objects to satisfy individual mmap()
requests.
- A new d_mmap_single(cdev, &foff, objsize, &object, prot) callback is
  added to cdevsw.  This function is called for each mmap() request.
  If it returns ENODEV, then the mmap() request will fall back to using
  the device's device pager object and d_mmap().  Otherwise, the method
  can return a VM object to satisfy this entire mmap() request via
  *object.  It can also modify the starting offset into this object via
  *foff.  This allows device drivers to use the file offset as a cookie
  to identify specific VM objects.
- vm_mmap_vnode() has been changed to call vm_mmap_cdev() directly when
  mapping V_CHR vnodes.  This avoids duplicating all the cdev mmap
  handling code and simplifies some of vm_mmap_vnode().
- D_VERSION has been bumped to D_VERSION_02.  Older device drivers
  using D_VERSION_01 are still supported.

MFC after:	1 month
2009-06-01 21:32:52 +00:00
alc
85b0c58343 Retire VM_PROT_READ_IS_EXEC. It was intended to be a micro-optimization,
but I see no benefit from it today.

VM_PROT_READ_IS_EXEC was only intended for use on processors that do not
distinguish between read and execute permission.  On an mmap(2) or
mprotect(2), it automatically added execute permission if the caller
specified permissions included read permission.  The hope was that this
would reduce the number of vm map entries needed to implement an address
space because there would be fewer neighboring vm map entries that differed
only in the presence or absence of VM_PROT_EXECUTE.  (See vm/vm_mmap.c
revision 1.56.)

Today, I don't see any real applications that benefit from
VM_PROT_READ_IS_EXEC.  In any case, vm map entries are now organized
as a self-adjusting binary search tree instead of an ordered list.  So,
the need for coalescing vm map entries is not as great as it once was.
2009-04-04 23:12:14 +00:00
kib
66c697aade Revert the addition of the freelist argument for the vm_map_delete()
function, done in r188334. Instead, collect the entries that shall be
freed, in the deferred_freelist member of the map. Automatically purge
the deferred freelist when map is unlocked.

Tested by:	pho
Reviewed by:	alc
2009-02-24 20:57:43 +00:00
kib
ae7c4cfa11 Do not call vm_object_deallocate() from vm_map_delete(), because we
hold the map lock there, and might need the vnode lock for OBJT_VNODE
objects. Postpone object deallocation until caller of vm_map_delete()
drops the map lock. Link the map entries to be freed into the freelist,
that is released by the new helper function vm_map_entry_free_freelist().

Reviewed by:	tegge, alc
Tested by:	pho
2009-02-08 20:39:17 +00:00
jhb
dc43531891 Now that vfs_markatime() no longer requires an exclusive lock due to
the VOP_MARKATIME() changes, use a shared vnode lock for mmap().

Submitted by:	ups
2009-01-21 14:43:35 +00:00
rwatson
3667673537 Update mmap() comment: no more block devices, so no more block device
cache coherency questions.

MFC after:	3 days
2008-10-22 16:50:12 +00:00
kib
2a1cba02c6 Allow the d_mmap driver methods to use cdevpriv KPI during verification
phase of establishing mapping.

Discussed with:	rwatson, jhb, rnoland
Tested by:	rnoland
MFC after:	3 days
2008-09-20 19:56:02 +00:00
attilio
dbf35e279f Decontextualize the couplet VOP_GETATTR / VOP_SETATTR as the passed thread
was always curthread and totally unuseful.

Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
2008-08-28 15:23:18 +00:00
trhodes
f37865f7f0 Fill in a few sysctl descriptions.
Reviewed by:	alc, Matt Dillon <dillon@apollo.backplane.com>
Approved by:	alc
2008-08-03 14:26:15 +00:00
alc
933221c70b To date, our implementation of munmap(2) has required that the
entirety of the specified range be mapped.  Specifically, it has
returned EINVAL if the entire range is not mapped.  There is not,
however, any basis for this in either SuSv2 or our own man page.
Moreover, neither Linux nor Solaris impose this requirement.  This
revision removes this requirement.

Submitted by: Tijl Coosemans
PR: 118510
MFC after: 6 weeks
2008-05-24 21:57:16 +00:00
alc
0a1d595c3a In order to map device memory using superpages, mmap(2) must find a
superpage-aligned virtual address for the mapping.  Revision 1.65
implemented an overly simplistic and generally ineffectual method for
finding a superpage-aligned virtual address.  Specifically, it rounds
the virtual address corresponding to the end of the data segment up to
the next superpage-aligned virtual address.  If this virtual address
is unallocated, then the device will be mapped using superpages.
Unfortunately, in modern times, where applications like the X server
dynamically load much of their code, this virtual address is already
allocated.  In such cases, mmap(2) simply uses the first available
virtual address, which is not necessarily superpage aligned.

This revision changes mmap(2) to use a more robust method,
specifically, the VMFS_ALIGNED_SPACE option that is now implemented by
vm_map_find().
2008-05-17 19:32:48 +00:00
alc
e919c282cd vm_map_fixed(), unlike vm_map_find(), does not update "addr", so it can be
passed by value.
2008-04-28 05:30:23 +00:00
kib
de73f6b678 Do not dereference cdev->si_cdevsw, use the dev_refthread() to properly
obtain the reference. In particular, this fixes the panic reported in
the PR. Remove the comments stating that this needs to be done.

PR:	kern/119422
MFC after:	1 week
2008-03-20 16:08:42 +00:00
rwatson
877d7c65ba In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation.  This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after:	1 month
Discussed with:	imp, rink
2008-03-16 10:58:09 +00:00
jhb
8cd9437636 Add a new file descriptor type for IPC shared memory objects and use it to
implement shm_open(2) and shm_unlink(2) in the kernel:
- Each shared memory file descriptor is associated with a swap-backed vm
  object which provides the backing store.  Each descriptor starts off with
  a size of zero, but the size can be altered via ftruncate(2).  The shared
  memory file descriptors also support fstat(2).  read(2), write(2),
  ioctl(2), select(2), poll(2), and kevent(2) are not supported on shared
  memory file descriptors.
- shm_open(2) and shm_unlink(2) are now implemented as system calls that
  manage shared memory file descriptors.  The virtual namespace that maps
  pathnames to shared memory file descriptors is implemented as a hash
  table where the hash key is generated via the 32-bit Fowler/Noll/Vo hash
  of the pathname.
- As an extension, the constant 'SHM_ANON' may be specified in place of the
  path argument to shm_open(2).  In this case, an unnamed shared memory
  file descriptor will be created similar to the IPC_PRIVATE key for
  shmget(2).  Note that the shared memory object can still be shared among
  processes by sharing the file descriptor via fork(2) or sendmsg(2), but
  it is unnamed.  This effectively serves to implement the getmemfd() idea
  bandied about the lists several times over the years.
- The backing store for shared memory file descriptors are garbage
  collected when they are not referenced by any open file descriptors or
  the shm_open(2) virtual namespace.

Submitted by:	dillon, peter (previous versions)
Submitted by:	rwatson (I based this on his version)
Reviewed by:	alc (suggested converting getmemfd() to shm_open())
2008-01-08 21:58:16 +00:00
rwatson
60570a92bf Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

  mac_<object>_<method/action>
  mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme.  Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier.  Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods.  Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by:	SPARTA (original patches against Mac OS X)
Obtained from:	TrustedBSD Project, Apple Computer
2007-10-24 19:04:04 +00:00
peter
9b1e0dd3a8 Fix cosmetic bug in stale copy of msync_args. 'len' is size_t, not int. 2007-10-18 22:47:39 +00:00
kib
77766ce03f Do not drop vm_map lock between doing vm_map_remove() and vm_map_insert().
For this, introduce vm_map_fixed() that does that for MAP_FIXED case.

Dropping the lock allowed for parallel thread to occupy the freed space.

Reported by:	Tijl Coosemans <tijl ulyssis org>
Reviewed by:	alc
Approved by:	re (kensmith)
MFC after:	2 weeks
2007-08-20 12:05:45 +00:00
peter
fd45cf2a6d Add freebsd6_ wrappers for mmap/lseek/pread/pwrite/truncate/ftruncate
Approved by: re (kensmith)
2007-07-04 22:57:21 +00:00
mjacob
fadc531504 Make sure object is NULL- there is a possible case where you could
fall through to it being used w/o being set. Put a break in the default
case.
2007-06-17 04:17:48 +00:00
attilio
7dd8ed88a9 Revert VMCNT_* operations introduction.
Probabilly, a general approach is not the better solution here, so we should
solve the sched_lock protection problems separately.

Requested by: alc
Approved by: jeff (mentor)
2007-05-31 22:52:15 +00:00
jeff
e1996cb960 - define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating
vmcnts.  This can be used to abstract away pcpu details but also changes
   to use atomics for all counters now.  This means sched lock is no longer
   responsible for protecting counts in the switch routines.

Contributed by:		Attilio Rao <attilio@FreeBSD.org>
2007-05-18 07:10:50 +00:00
rwatson
10d0d9cf47 Sweep kernel replacing suser(9) calls with priv(9) calls, assigning
specific privilege names to a broad range of privileges.  These may
require some future tweaking.

Sponsored by:           nCircle Network Security, Inc.
Obtained from:          TrustedBSD Project
Discussed on:           arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
                        Alex Lyashkov <umka at sevcity dot net>,
                        Skip Ford <skip dot ford at verizon dot net>,
                        Antoine Brodin <antoine dot brodin at laposte dot net>
2006-11-06 13:42:10 +00:00
rwatson
7beaaf5cd2 Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h.  sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from:	TrustedBSD Project
Sponsored by:	SPARTA
2006-10-22 11:52:19 +00:00
kib
b90260e703 Make the mincore(2) return ENOMEM when requested range is not fully mapped.
Requested by:	Bruno Haible <bruno at clisp org>
Reviewed by:	alc
Approved by:	pjd (mentor)
MFC after:	1 month
2006-06-21 12:59:05 +00:00
trhodes
2b289f67a8 It seems that POSIX would rather ENODEV returned in place of EINVAL when
trying to mmap() an fd that isn't a normal file.

Reference: http://www.opengroup.org/onlinepubs/009695399/functions/mmap.html
Submitted by:	fanf
2006-04-21 07:17:25 +00:00
jkoshy
48e5e4792d MFP4: Support for profiling dynamically loaded objects.
Kernel changes:

  Inform hwpmc of executable objects brought into the system by
  kldload() and mmap(), and of their removal by kldunload() and
  munmap().  A helper function linker_hwpmc_list_objects() has been
  added to "sys/kern/kern_linker.c" and is used by hwpmc to retrieve
  the list of currently loaded kernel modules.

  The unused `MAPPINGCHANGE' event has been deprecated in favour
  of separate `MAP_IN' and `MAP_OUT' events; this change reduces
  space wastage in the log.

  Bump the hwpmc's ABI version to "2.0.00".  Teach hwpmc(4) to
  handle the map change callbacks.

  Change the default per-cpu sample buffer size to hold
  32 samples (up from 16).

  Increment __FreeBSD_version.

libpmc(3) changes:

  Update libpmc(3) to deal with the new events in the log file; bring
  the pmclog(3) manual page in sync with the code.

pmcstat(8) changes:

  Introduce new options to pmcstat(8): "-r" (root fs path), "-M"
  (mapfile name), "-q"/"-v" (verbosity control).  Option "-k" now
  takes a kernel directory as its argument but will also work with
  the older invocation syntax.

  Rework string handling in pmcstat(8) to use an opaque type for
  interned strings.  Clean up ELF parsing code and add support for
  tracking dynamic object mappings reported by a v2.0.00 hwpmc(4).

  Report statistics at the end of a log conversion run depending
  on the requested verbosity level.

Reviewed by:	jhb, dds (kernel parts of an earlier patch)
Tested by:	gallatin (earlier patch)
2006-03-26 12:20:54 +00:00
dds
0fb2e655fd Move execve's access time update functionality into a new
vfs_mark_atime() function, and use the new function for
performing efficient atime updates in mmap().

Reviewed by:	bde
MFC after:	2 weeks
2005-10-12 06:56:00 +00:00
dds
c066055957 Update the vnode's access time after an mmap operation on it.
Before this change a copy operation with cp(1) would not update the
file access times.

According to the POSIX mmap(2) documentation: the st_atime field
of the mapped file may be marked for update at any time between the
mmap() call and the corresponding munmap() call. The initial read
or write reference to a mapped region shall cause the file's st_atime
field to be marked for update if it has not already been marked for
update.
2005-10-04 14:58:58 +00:00
peter
eb46ffb067 Remove unused (but initialized) variable 'objsize' from vm_mmap() 2005-09-20 22:08:27 +00:00
csjp
e89e83d7fe Move MAC check_vnode_mmap entry point out from being exclusive to
MAP_SHARED so that the entry point gets executed un-conditionally.
This may be useful for security policies which want to perform access
control checks around run-time linking.

-add the mmap(2) flags argument to the check_vnode_mmap entry point
 so that we can make access control decisions based on the type of
 mapped object.
-update any dependent API around this parameter addition such as
 function prototype modifications, entry point parameter additions
 and the inclusion of sys/mman.h header file.
-Change the MLS, BIBA and LOMAC security policies so that subject
 domination routines are not executed unless the type of mapping is
 shared. This is done to maintain compatibility between the old
 vm_mmap_vnode(9) and these policies.

Reviewed by:	rwatson
MFC after:	1 month
2005-04-14 16:03:30 +00:00
jhb
a3c6b782c3 - Change the vm_mmap() function to accept an objtype_t parameter specifying
the type of object represented by the handle argument.
- Allow vm_mmap() to map device memory via cdev objects in addition to
  vnodes and anonymous memory.  Note that mmaping a cdev directly does not
  currently perform any MAC checks like mapping a vnode does.
- Unbreak the DRM getbufs ioctl by having it call vm_mmap() directly on the
  cdev the ioctl is acting on rather than trying to find a suitable vnode
  to map from.

Reviewed by:	alc, arch@
2005-04-01 20:00:11 +00:00
phk
796d435574 Don't use VOP_GETVOBJECT, use vp->v_object directly. 2005-01-25 00:40:01 +00:00
jeff
1dd5432139 - Remove GIANT_REQUIRED where giant is no longer required.
- Use VFS_LOCK_GIANT() rather than directly acquiring giant in places
   where giant is only held because vfs requires it.

Sponsored By:   Isilon Systems, Inc.
2005-01-24 10:48:29 +00:00
imp
f0bf889d0d /* -> /*- for license, minor formatting changes 2005-01-07 02:29:27 +00:00
phk
76b805d6f4 Don't clear flags we just checked were not set. 2004-10-26 05:57:29 +00:00
phk
8d623dca9a XXX mark two places where we do not hold a threadcount on the dev when
frobbing the cdevsw.

In both cases we examine only the cdevsw and it is a good question if we
weren't better off copying those properties into the cdev in the first
place.  This question will be revisited.
2004-09-24 08:32:36 +00:00
alc
f58b09685d Remove dead code. 2004-09-01 19:58:37 +00:00
phk
007ee2473b Remove a product specific workaround for wrong modes when mmap(2)'ing
devices.  They have had plenty of time to adjust now.
2004-08-05 07:04:33 +00:00
alc
afa1e8a9c9 Eliminate the acquisition and release of Giant around the call to
pmap_mincore() in mincore(2).  Either pmap locking exists (alpha, amd64,
i386, ia64) or pmap_mincore() is unimplemented (arm, powerpc, sparc64).
2004-08-02 03:31:05 +00:00
phk
86602fc06c Deorbit COMPAT_SUNOS.
We inherited this from the sparc32 port of BSD4.4-Lite1.  We have neither
a sparc32 port nor a SunOS4.x compatibility desire these days.
2004-06-11 11:16:26 +00:00
tjr
a87b9baadb To handle orphaned character device vnodes properly in mmap(), check that
v_mount is non-null before dereferencing it. If it's null, behave as if
MNT_NOEXEC was not set on the mount that originally containined it.
2004-05-11 10:26:37 +00:00
imp
04c9462c02 Remove advertising clause from University of California Regent's license,
per letter dated July 22, 1999.

Approved by: core
2004-04-06 20:15:37 +00:00
kan
fee73bd419 Delay permission checks for VCHR vnodes until after vnode is locked in
vm_mmap_vnode function, where we can safely check for a special /dev/zero
case. Rev. 1.180 has reordered checks and introduced a regression.

Submitted by:	alc
Was broken by:	kan
2004-04-05 04:54:22 +00:00
guido
365db5dd01 When mmap-ing a file from a noexec mount, be sure not to grant the right
to mmap it PROT_EXEC. This also depends on the architecture, as some
architextures (e.g. i386) do not distinguish between read and exec pages

Inspired by: 	http://linux.bkbits.net:8080/linux-2.4/cset@1.1267.1.85
Reviewed by:	alc
2004-03-18 20:58:51 +00:00
truckman
df17b6c2c8 Make overflow/wraparound checking more robust and unbreak len=0 in
vslock(), mlock(), and munlock().

Reviewed by:	bde
2004-03-15 09:11:23 +00:00
truckman
87ef1d0222 Style(9) changes.
Pointed out by:	bde
2004-03-15 06:43:51 +00:00
truckman
4a8aedf7cf Remove redundant suser() check. 2004-03-15 06:36:55 +00:00
truckman
367b608998 Undo the merger of mlock()/vslock and munlock()/vsunlock() and the
introduction of kern_mlock() and kern_munlock() in
        src/sys/kern/kern_sysctl.c      1.150
        src/sys/vm/vm_extern.h          1.69
        src/sys/vm/vm_glue.c            1.190
        src/sys/vm/vm_mmap.c            1.179
because different resource limits are appropriate for transient and
"permanent" page wiring requests.

Retain the kern_mlock() and kern_munlock() API in the revived
vslock() and vsunlock() functions.

Combine the best parts of each of the original sets of implementations
with further code cleanup.  Make the mclock() and vslock()
implementations as similar as possible.

Retain the RLIMIT_MEMLOCK check in mlock().  Move the most strigent
test, which can return EAGAIN, last so that requests that have no
hope of ever being satisfied will not be retried unnecessarily.

Disable the test that can return EAGAIN in the vslock() implementation
because it will cause the sysctl code to wedge.

Tested by:	Cy Schubert <Cy.Schubert AT komquats.com>
2004-03-05 22:03:11 +00:00
kan
9fe9a30730 Pich up a do {} while(0) cleanup by phk that was discarded accidentally in
previous revision.

Submitted by:	alc
2004-03-01 02:44:33 +00:00
kan
68d7945627 Move the code dealing with vnode out of several functions into a single
helper function vm_mmap_vnode.

Discussed with:	jeffr,alc (a while ago)
2004-02-27 22:02:15 +00:00
truckman
1de257deb3 Split the mlock() kernel code into two parts, mlock(), which unpacks
the syscall arguments and does the suser() permission check, and
kern_mlock(), which does the resource limit checking and calls
vm_map_wire().  Split munlock() in a similar way.

Enable the RLIMIT_MEMLOCK checking code in kern_mlock().

Replace calls to vslock() and vsunlock() in the sysctl code with
calls to kern_mlock() and kern_munlock() so that the sysctl code
will obey the wired memory limits.

Nuke the vslock() and vsunlock() implementations, which are no
longer used.

Add a member to struct sysctl_req to track the amount of memory
that is wired to handle the request.

Modify sysctl_wire_old_buffer() to return an error if its call to
kern_mlock() fails.  Only wire the minimum of the length specified
in the sysctl request and the length specified in its argument list.
It is recommended that sysctl handlers that use sysctl_wire_old_buffer()
should specify reasonable estimates for the amount of data they
want to return so that only the minimum amount of memory is wired
no matter what length has been specified by the request.

Modify the callers of sysctl_wire_old_buffer() to look for the
error return.

Modify sysctl_old_user to obey the wired buffer length and clean up
its implementation.

Reviewed by:	bms
2004-02-26 00:27:04 +00:00
jhb
279b2b8278 Locking for the per-process resource limits structure.
- struct plimit includes a mutex to protect a reference count.  The plimit
  structure is treated similarly to struct ucred in that is is always copy
  on write, so having a reference to a structure is sufficient to read from
  it without needing a further lock.
- The proc lock protects the p_limit pointer and must be held while reading
  limits from a process to keep the limit structure from changing out from
  under you while reading from it.
- Various global limits that are ints are not protected by a lock since
  int writes are atomic on all the archs we support and thus a lock
  wouldn't buy us anything.
- All accesses to individual resource limits from a process are abstracted
  behind a simple lim_rlimit(), lim_max(), and lim_cur() API that return
  either an rlimit, or the current or max individual limit of the specified
  resource from a process.
- dosetrlimit() was renamed to kern_setrlimit() to match existing style of
  other similar syscall helper functions.
- The alpha OSF/1 compat layer no longer calls getrlimit() and setrlimit()
  (it didn't used the stackgap when it should have) but uses lim_rlimit()
  and kern_setrlimit() instead.
- The svr4 compat no longer uses the stackgap for resource limits calls,
  but uses lim_rlimit() and kern_setrlimit() instead.
- The ibcs2 compat no longer uses the stackgap for resource limits.  It
  also no longer uses the stackgap for accessing sysctl's for the
  ibcs2_sysconf() syscall but uses kernel_sysctl() instead.  As a result,
  ibcs2_sysconf() no longer needs Giant.
- The p_rlimit macro no longer exists.

Submitted by:	mtm (mostly, I only did a few cleanups and catchups)
Tested on:	i386
Compiled on:	alpha, amd64
2004-02-04 21:52:57 +00:00
alc
a0a304d068 - Correct an error in mincore(2) that has existed since its introduction:
mincore(2) should check that the page is valid, not just allocated.
   Otherwise, it can return a false positive for a page that is not yet
   resident because it is being read from disk.
2003-12-21 06:03:40 +00:00
kan
ff9eb2d32f Remove trailing whitespace. 2003-12-08 02:45:45 +00:00
alc
672d48f582 Addendum to revision 1.174: In the case where vm_pager_allocate() is called
to create a vnode-backed object, the vnode lock must be held by the caller.

Reported by:	truckman
Discussed with:	kan
2003-12-08 00:47:33 +00:00
alc
3b8e185f65 Fix a deadlock between vm_fault() and vm_mmap(): The expected lock ordering
between vm_map and vnode locks is that vm_map locks are acquired first.  In
revision 1.150 mmap(2) was changed to pass a locked vnode into vm_mmap().
This creates a lock-order reversal when vm_mmap() calls one of the vm_map
routines that acquires a vm_map lock.  The solution implemented herein is
to release the vnode lock in mmap() before calling vm_mmap() and reacquire
this lock if necessary in vm_mmap().

Approved by:	re (scottl)
Reviewed by:	jeff, kan, rwatson
2003-12-06 05:45:32 +00:00
alc
9b4ee6c4dd - Remove long dead code. 2003-11-14 08:22:38 +00:00
alc
58630d7148 Changes to msync(2)
- Return EBUSY if the region was wired by mlock(2) and MS_INVALIDATE
   is specified to msync(2).  This is required by the Open Group Base
   Specifications Issue 6.
 - vm_map_sync() doesn't return KERN_FAILURE.  Thus, msync(2) can't
   possibly return EIO.
 - The second major loop in vm_map_sync() handles sub maps.  Thus,
   failing on sub maps in the first major loop isn't necessary.
2003-11-14 06:55:11 +00:00
alc
fa4ea5d2f2 - The Open Group Base Specifications Issue 6 specifies that an munmap(2)
must return EINVAL if size is zero.  Submitted by: tegge
 - In order to avoid a race condition in multithreaded applications, the
   check and removal operations by munmap(2) must be in the same critical
   section.  To accomodate this, vm_map_check_protection() is modified to
   require its caller to obtain at least a read lock on the map.
2003-11-10 01:37:40 +00:00
alc
b2bc11d840 - Remove Giant from msync(2). Giant is still acquired by the lower layers
if we drop into the pmap or vnode layers.
 - Migrate the handling of zero-length msync(2)s into vm_map_sync() so that
   multithread applications can't change the map between implementing the
   zero-length hack in msync(2) and reacquiring the map lock in
   vm_map_sync().

Reviewed by:	tegge
2003-11-09 22:09:04 +00:00
alc
269cf5aa09 - Rename vm_map_clean() to vm_map_sync(). This better reflects the fact
that msync(2) is its only caller.
 - Migrate the parts of the old vm_map_clean() that examined the internals
   of a vm object to a new function vm_object_sync() that is implemented in
   vm_object.c.  At the same, introduce the necessary vm object locking so
   that vm_map_sync() and vm_object_sync() can be called without Giant.

Reviewed by:	tegge
2003-11-09 05:25:35 +00:00
bms
0ff257ed72 Only the super-user should be able to wire pages via the mlock() family
of system calls at this time.  Remove various #ifdef's to enforce this.
2003-10-06 01:59:04 +00:00
marcel
d75cf98307 Part 2 of implementing rstacks: add the ability to create rstacks and
use the ability on ia64 to map the register stack. The orientation of
the stack (i.e. its grow direction) is passed to vm_map_stack() in the
overloaded cow argument. Since the grow direction is represented by
bits, it is possible and allowed to create bi-directional stacks.
This is not an advertised feature, more of a side-effect.

Fix a bug in vm_map_growstack() that's specific to rstacks and which
we could only find by having the ability to create rstacks: when
the mapped stack ends at the faulting address, we have not actually
mapped the faulting address. we need to include or cover the faulting
address.

Note that at this time mmap(2) has not been extended to allow the
creation of rstacks by processes. If such a need arises, this can
be done.

Tested on: alpha, i386, ia64, sparc64
2003-09-27 22:28:14 +00:00
peter
8ecb3577d8 Add sysentvec->sv_fixlimits() hook so that we can catch cases on 64 bit
systems where the data/stack/etc limits are too big for a 32 bit process.

Move the 5 or so identical instances of ELF_RTLD_ADDR() into imgact_elf.c.

Supply an ia32_fixlimits function.  Export the clip/default values to
sysctl under the compat.ia32 heirarchy.

Have mmap(0, ...) respect the current p->p_limits[RLIMIT_DATA].rlim_max
value rather than the sysctl tweakable variable.  This allows mmap to
place mappings at sensible locations when limits have been reduced.

Have the imgact_elf.c ld-elf.so.1 placement algorithm use the same
method as mmap(0, ...) now does.

Note that we cannot remove all references to the sysctl tweakable
maxdsiz etc variables because /etc/login.conf specifies a datasize
of 'unlimited'.  And that causes exec etc to fail since it can no
longer find space to mmap things.
2003-09-25 01:10:26 +00:00
alc
1cb490a309 Revise the locking in mincore(2). 2003-09-07 18:47:54 +00:00
bms
44aa51e3ae Add the mlockall() and munlockall() system calls.
- All those diffs to syscalls.master for each architecture *are*
   necessary. This needed clarification; the stub code generation for
   mlockall() was disabled, which would prevent applications from
   linking to this API (suggested by mux)
 - Giant has been quoshed. It is no longer held by the code, as
   the required locking has been pushed down within vm_map.c.
 - Callers must specify VM_MAP_WIRE_HOLESOK or VM_MAP_WIRE_NOHOLES
   to express their intention explicitly.
 - Inspected at the vmstat, top and vm pager sysctl stats level.
   Paging-in activity is occurring correctly, using a test harness.
 - The RES size for a process may appear to be greater than its SIZE.
   This is believed to be due to mappings of the same shared library
   page being wired twice. Further exploration is needed.
 - Believed to back out of allocations and locks correctly
   (tested with WITNESS, MUTEX_PROFILING, INVARIANTS and DIAGNOSTIC).

PR:             kern/43426, standards/54223
Reviewed by:    jake, alc
Approved by:    jake (mentor)
MFC after:	2 weeks
2003-08-11 07:14:08 +00:00
phk
3875fcca40 Remove unnecessary cast. 2003-07-04 12:23:43 +00:00
phk
c81c59299b Add a f_vnode field to struct file.
Several of the subtypes have an associated vnode which is used for
stuff like the f*() functions.

By giving the vnode a speparate field, a number of checks for the specific
subtype can be replaced simply with a check for f_vnode != NULL, and
we can later free f_data up to subtype specific use.

At this point in time, f_data still points to the vnode, so any code I
might have overlooked will still work.
2003-06-22 08:41:43 +00:00