Commit Graph

527 Commits

Author SHA1 Message Date
Konstantin Belousov
f5cf758998 Provide storage for the process feature control flags in struct proc.
The flags are cleared on exec, it is up to the image activator to set
them.

Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2018-11-23 23:07:57 +00:00
Brooks Davis
12e69f96a2 Add const to input-only char * arguments.
These arguments are mostly paths handled by NAMEI*() macros which already
take const char * arguments.

This change improves the match between syscalls.master and the public
declerations of system calls.

Reviewed by:	kib (prior version)
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D17812
2018-11-02 20:50:22 +00:00
Mariusz Zaborski
2fe6aefff8 capsicum: allow the setproctitle(3) function in capability mode
Capsicum in past allowed to change the process title.
This was broken with r335939.

PR:		230584
Submitted by:	Yuichiro NAITO <naito.yuichiro@gmail.com>
Reported by:	ian@niw.com.au
MFC after:	1 week
2018-08-17 14:35:10 +00:00
Brooks Davis
d92da75941 Round down the location of execpathp to slightly improve copyout speed.
In practice, this moves the padding from below the canary to above
execpathp has no impact on stack consumption.

Submitted by:		Wuyang-Chung (via github pull request #159)
MFC after:	1 week
2018-07-13 11:32:27 +00:00
Andrew Turner
2bf9501287 Create a new macro for static DPCPU data.
On arm64 (and possible other architectures) we are unable to use static
DPCPU data in kernel modules. This is because the compiler will generate
PC-relative accesses, however the runtime-linker expects to be able to
relocate these.

In preparation to fix this create two macros depending on if the data is
global or static.

Reviewed by:	bz, emaste, markj
Sponsored by:	ABT Systems Ltd
Differential Revision:	https://reviews.freebsd.org/D16140
2018-07-05 17:13:37 +00:00
Alan Cox
3e7cb27cdd Use a single, consistent approach to returning success versus failure in
vm_map_madvise().  Previously, vm_map_madvise() used a traditional Unix-
style "return (0);" to indicate success in the common case, but Mach-
style return values in the edge cases.  Since KERN_SUCCESS equals zero,
the only problem with this inconsistency was stylistic.  vm_map_madvise()
has exactly two callers in the entire source tree, and only one of them
cares about the return value.  That caller, kern_madvise(), can be
simplified if vm_map_madvise() consistently uses Unix-style return
values.

Since vm_map_madvise() uses the variable modify_map as a Boolean, make it
one.

Eliminate a redundant error check from kern_madvise().  Add a comment
explaining where the check is performed.

Explicitly note that exec_release_args_kva() doesn't care about
vm_map_madvise()'s return value.  Since MADV_FREE is passed as the
behavior, the return value will always be zero.

Reviewed by:	kib, markj
MFC after:	7 days
2018-06-04 16:28:06 +00:00
Ed Maste
b8d908b71e ANSIfy sys/kern 2018-06-01 13:26:45 +00:00
Brooks Davis
5f77b8a88b Avoid two suword() calls per auxarg entry.
Instead, construct an auxargs array and copy it out all at once.

Use an array of Elf_Auxinfo rather than pairs of Elf_Addr * to represent
the array. This is the correct type where pairs of words just happend
to work. To reduce the size of the diff, AUXARGS_ENTRY is altered to act
on this array rather than introducing a new macro.

Return errors on copyout() and suword() failures and handle them in the
caller.

Incidentally fixes AT_RANDOM and AT_EXECFN in 32-bit linux on amd64
which incorrectly used AUXARG_ENTRY instead of AUXARGS_ENTRY_32
(now removed due to the use of proper types).

Reviewed by:	kib
Comments from:	emaste, jhb
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D15485
2018-05-24 16:25:18 +00:00
Matt Macy
cbd92ce62e Eliminate the overhead of gratuitous repeated reinitialization of cap_rights
- Add macros to allow preinitialization of cap_rights_t.

- Convert most commonly used code paths to use preinitialized cap_rights_t.
  A 3.6% speedup in fstat was measured with this change.

Reported by:	mjg
Reviewed by:	oshogbo
Approved by:	sbruno
MFC after:	1 month
2018-05-09 18:47:24 +00:00
Konstantin Belousov
1302eea7bb Rename PROC_PDEATHSIG_SET -> PROC_PDEATHSIG_CTL and PROC_PDEATHSIG_GET
-> PROC_PDEATHSIG_STATUS for consistency with other procctl(2)
operations names.

Requested by:	emaste
Sponsored by:	The FreeBSD Foundation
MFC after:	13 days
2018-04-20 15:19:27 +00:00
John Baldwin
73c8686e91 Simplify the code to allocate stack for auxv, argv[], and environment vectors.
Remove auxarg_size as it was only used once right after a confusing
assignment in each of the variants of exec_copyout_strings().

Reviewed by:	emaste
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D15123
2018-04-19 16:00:34 +00:00
Konstantin Belousov
b940886338 Add PROC_PDEATHSIG_SET to procctl interface.
Allow processes to request the delivery of a signal upon death of
their parent process.  Supposed consumer of the feature is PostgreSQL.

Submitted by:	Thomas Munro
Reviewed by:	jilles, mjg
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D15106
2018-04-18 21:31:13 +00:00
Brooks Davis
6469bdcdb6 Move most of the contents of opt_compat.h to opt_global.h.
opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compabibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.

Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c.  A fake _COMPAT_LINUX option ensure opt_compat.h
is created on all architectures.

Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.

Reviewed by:	kib, cem, jhb, jtl
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14941
2018-04-06 17:35:35 +00:00
Andriy Gapon
b2387aa652 exec_map_first_page: fix an inverse condition introduced in r254138
While the bug itself was serious, as we could either pass a non-busied
page to vm_pager_get_pages() or leak a busy page, it could only be
triggered under a very rare condition where the page is already inserted
into the object, but it is not valid yet.

Reviewed by:	kib
MFC after:	2 weeks
2018-02-07 21:51:59 +00:00
Pedro F. Giffuni
d821d36419 Unsign some values related to allocation.
When allocating memory through malloc(9), we always expect the amount of
memory requested to be unsigned as a negative value would either stand for
an error or an overflow.
Unsign some values, found when considering the use of mallocarray(9), to
avoid unnecessary casting. Also consider that indexes should be of
at least the same size/type as the upper limit they pretend to index.

MFC after:	3 weeks
2018-01-22 02:08:10 +00:00
Pedro F. Giffuni
8a36da99de sys/kern: adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
2017-11-27 15:20:12 +00:00
Ed Schouten
814629dd64 Don't let cpu_set_syscall_retval() clobber exec_setregs().
Upon successful completion, the execve() system call invokes
exec_setregs() to initialize the registers of the initial thread of the
newly executed process. What is weird is that when execve() returns, it
still goes through the normal system call return path, clobbering the
registers with the system call's return value (td->td_retval).

Though this doesn't seem to be problematic for x86 most of the times (as
the value of eax/rax doesn't matter upon startup), this can be pretty
frustrating for architectures where function argument and return
registers overlap (e.g., ARM). On these systems, exec_setregs() also
needs to initialize td_retval.

Even worse are architectures where cpu_set_syscall_retval() sets
registers to values not derived from td_retval. On these architectures,
there is no way cpu_set_syscall_retval() can set registers to the way it
wants them to be upon the start of execution.

To get rid of this madness, let sys_execve() return EJUSTRETURN. This
will cause cpu_set_syscall_retval() to leave registers intact. This
makes process execution easier to understand. It also eliminates the
difference between execution of the initial process and successive ones.
The initial call to sys_execve() is not performed through a system call
context.

Reviewed by:	kib, jhibbits
Differential Revision:	https://reviews.freebsd.org/D13180
2017-11-24 07:35:08 +00:00
Matt Joras
2ca45184dc Introduce EVENTHANDLER_LIST and some users.
This introduces a facility to EVENTHANDLER(9) for explicitly defining a
reference to an event handler list. This is useful since previously all
invokers of events had to do a locked traversal of the global list of
event handler lists in order to find the appropriate event handler list.
By keeping a pointer to the appropriate list an invoker can avoid this
traversal completely. The pointer is initialized with SYSINIT(9) during
the eventhandler stage. Users registering interest in events do not need
to know if the event is backed by such a list, since the list is added
to the global list of lists. As with lists that are not pre-defined it
is safe to register for the events before the list has been created.

This converts the process_* and thread_* events to using the new
facility, as these are events whose locked traversals end up showing up
significantly in ports build workflows (and presumably other workflows
with many short lived threads/procs). It may be advantageous to convert
other events to using the new facility.

The el_flags field is now unused, but leave it be so that this revision
can be MFC'd.

Reviewed by:	bdrewery, markj, mjg
Approved by:	rstone (mentor)
In collaboration with:  ian
MFC after:      4 weeks
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D12814
2017-11-09 22:51:48 +00:00
Mateusz Guzik
e6b645ae89 execve: avoid one proc lock/unlock trip unless PTRACE_EXEC is set
MFC after:	1 week
2017-10-19 00:46:15 +00:00
Mateusz Guzik
80a2397a38 Tidy up pmc support at execve.
The proc-specific check is inherently racy, so the code can just unlock
beforehand.

MFC after:	1 week
2017-10-19 00:38:14 +00:00
Konstantin Belousov
467b3975a3 Resolve confusion between different error code spaces.
The vm_map_fixed() and vm_map_stack() VM functions return Mach error
codes.  Convert them into errno values before returning result from
exec_new_vmspace().

While there, modernize the comment and do minor style adjustments.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2017-07-03 20:44:01 +00:00
Alan Cox
8056df6e25 Clear the MAP_WIREFUTURE flag on the vm map in exec_new_vmspace() when it
recycles the current vm space.  Otherwise, an mlockall(MCL_FUTURE) could
still be in effect on the process after an execve(2), which violates the
specification for mlockall(2).

It's pointless for vm_map_stack() to check the MEMLOCK limit.  It will
never be asked to wire the stack.  Moreover, it doesn't even implement
wiring of the stack.

Reviewed by:	kib, markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D11421
2017-06-30 15:49:36 +00:00
Ed Maste
3e85b721d6 Remove register keyword from sys/ and ANSIfy prototypes
A long long time ago the register keyword told the compiler to store
the corresponding variable in a CPU register, but it is not relevant
for any compiler used in the FreeBSD world today.

ANSIfy related prototypes while here.

Reviewed by:	cem, jhb
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D10193
2017-05-17 00:34:34 +00:00
Mark Johnston
c6a4ba5a38 Apply MADV_FREE to exec_map entries only after a lowmem event.
This effectively provides the same benefit as applying MADV_FREE inline
upon every execve, since the page daemon invokes lowmem handlers prior to
scanning the inactive queue. It also has less overhead; the cost of
applying MADV_FREE is very noticeable on many-CPU systems since it includes
that of a TLB shootdown of global PTEs. For instance, this change nearly
halves the system CPU usage during a buildkernel on a 128-vCPU EC2
instance (with some other patches applied).

Benchmarked by:	cperciva (earlier version)
Reviewed by:	kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D9586
2017-02-15 01:50:58 +00:00
Konstantin Belousov
6e89d383c7 Explicitely add "opt_compat.h" to kern_exec.c: fix powerpc LINT builds.
sys/ptrace.h includes sys/signal.h, which includes sys/_sigset.h.
Note that sys/_sigset.h only defines osigset_t if COMPAT_43 was defined.

Two lines later, sys/ptrace.h includes machine/reg.h, which in case of
powerpc, includes opt_compat.h.

After the include headers reordering in r311345, we have sys/ptrace.h
included before sys/sysproto.h.

If COMPAT_43 was requested in the kernel config, the result is that
sys/_sigset.h does not define osigset_t, but sys/sysproto.h sees
COMPAT_43 and uses osigset_t.

Fix this by explicitely including opt_compat.h to cover the whole
kern/kern_exec.c scope.

Sponsored by:	The FreeBSD Foundation
2017-01-06 16:56:24 +00:00
Mark Johnston
ec492b13f1 Add a small allocator for exec_map entries.
Upon each execve, we allocate a KVA range for use in copying data to the
new image. Pages must be faulted into the range, and when the range is
freed, the backing pages are freed and their mappings are destroyed. This
is a lot of needless overhead, and the exec_map management becomes a
bottleneck when many CPUs are executing execve concurrently. Moreover, the
number of available ranges is fixed at 16, which is insufficient on large
systems and potentially excessive on 32-bit systems.

The new allocator reduces overhead by making exec_map allocations
persistent. When a range is freed, pages backing the range are marked clean
and made easy to reclaim. With this change, the exec_map is sized based on
the number of CPUs.

Reviewed by:	kib
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D8921
2017-01-05 01:44:12 +00:00
Mark Johnston
eeeaa7ba22 Sort includes in kern_exec.c.
MFC after:	1 week
2017-01-05 01:28:08 +00:00
Alan Cox
7667839a7e Remove most of the code for implementing PG_CACHED pages. (This change does
not remove user-space visible fields from vm_cnt or all of the references to
cached pages from comments.  Those changes will come later.)

Reviewed by:	kib, markj
Tested by:	pho
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8497
2016-11-15 18:22:50 +00:00
Mateusz Guzik
53dc58f2dc Mark a bunch of mpsafe sysctls as such.
This gives me a sysctl Giant-free buildworld.
2016-10-19 19:42:01 +00:00
Alan Cox
ce3ee09b53 Eliminate unneeded vm_page_xbusy() and vm_page_xunbusy() operations when
neither vm_pager_has_page() nor vm_pager_get_pages() is called.

Reviewed by:	kib, markj
MFC after:	3 weeks
2016-08-14 22:00:45 +00:00
Alan Cox
fc9bbf2794 Eliminate two calls to vm_page_xunbusy() that are both unnecessary and
incorrect from the error cases in exec_map_first_page().  They are
unnecessary because we automatically unbusy the page in vm_page_free()
when we remove it from the object.  The calls are incorrect because they
happen after the page is freed, so we might actually unbusy the page
after it has been reallocated to a different object.  (This error was
introduced in r292373.)

Reviewed by:	kib
MFC after:	1 week
2016-08-13 18:10:32 +00:00
Konstantin Belousov
77d6809483 The assertion re-added in r302614 was triggered when stopping signal
is delivered to vforked child.  Issue is that we avoid stopping such
children in issignal() to not block parents.  But executed AST, which
ignored stops, leaves the child with the signal pending but no AST
pending.

On first exec after vfork(), call signotify() to handle pending
reenabled signals.  Adjust the assert to not check vfork children
until exec.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-07-18 10:53:47 +00:00
John Baldwin
8d570f64aa Add a mask of optional ptrace() events.
ptrace() now stores a mask of optional events in p_ptevents.  Currently
this mask is a single integer, but it can be expanded into an array of
integers in the future.

Two new ptrace requests can be used to manipulate the event mask:
PT_GET_EVENT_MASK fetches the current event mask and PT_SET_EVENT_MASK
sets the current event mask.

The current set of events include:
- PTRACE_EXEC: trace calls to execve().
- PTRACE_SCE: trace system call entries.
- PTRACE_SCX: trace syscam call exits.
- PTRACE_FORK: trace forks and auto-attach to new child processes.
- PTRACE_LWP: trace LWP events.

The S_PT_SCX and S_PT_SCE events in the procfs p_stops flags have
been replaced by PTRACE_SCE and PTRACE_SCX.  PTRACE_FORK replaces
P_FOLLOW_FORK and PTRACE_LWP replaces P2_LWP_EVENTS.

The PT_FOLLOW_FORK and PT_LWP_EVENTS ptrace requests remain for
compatibility but now simply toggle corresponding flags in the
event mask.

While here, document that PT_SYSCALL, PT_TO_SCE, and PT_TO_SCX both
modify the event mask and continue the traced process.

Reviewed by:	kib
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D7044
2016-07-15 15:32:09 +00:00
Konstantin Belousov
9e590ff04b When filt_proc() removes event from the knlist due to the process
exiting (NOTE_EXIT->knlist_remove_inevent()), two things happen:
- knote kn_knlist pointer is reset
- INFLUX knote is removed from the process knlist.
And, there are two consequences:
- KN_LIST_UNLOCK() on such knote is nop
- there is nothing which would block exit1() from processing past the
  knlist_destroy() (and knlist_destroy() resets knlist lock pointers).
Both consequences result either in leaked process lock, or
dereferencing NULL function pointers for locking.

Handle this by stopping embedding the process knlist into struct proc.
Instead, the knlist is allocated together with struct proc, but marked
as autodestroy on the zombie reap, by knlist_detach() function.  The
knlist is freed when last kevent is removed from the list, in
particular, at the zombie reap time if the list is empty.  As result,
the knlist_remove_inevent() is no longer needed and removed.

Other changes:

In filt_procattach(), clear NOTE_EXEC and NOTE_FORK desired events
from kn_sfflags for knote registered by kernel to only get NOTE_CHILD
notifications.  The flags leak resulted in excessive
NOTE_EXEC/NOTE_FORK reports.

Fix immediate note activation in filt_procattach().  Condition should
be either the immediate CHILD_NOTE activation, or immediate NOTE_EXIT
report for the exiting process.

In knote_fork(), do not perform racy check for KN_INFLUX before kq
lock is taken.  Besides being racy, it did not accounted for notes
just added by scan (KN_SCAN).

Some minor and incomplete style fixes.

Analyzed and tested by:	Eric Badger <eric@badgerio.us>
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Approved by:	re (gjb)
Differential revision:	https://reviews.freebsd.org/D6859
2016-06-27 21:52:17 +00:00
Konstantin Belousov
3fc292d56b Old process credentials for setuid execve must not be dereferenced
when the process credentials were not changed.  This can happen if an
error occured trying to activate the setuid binary.  And on error, if
new credentials were not yet assigned, they must be freed to not
create the leak.

Use oldcred == NULL as the predicate to detect credential
reassignment.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
2016-06-08 04:37:03 +00:00
Mateusz Guzik
cda688443a exec: get rid of one vnode lock/unlock pair in do_execve
The lock was temporarily dropped for vrele calls, but they can be
postponed to a point where the lock is not held in the first place.

While here shuffle other code not needing the lock.
2016-05-27 15:03:38 +00:00
Bryan Drewery
1afd78b34d exec: Provide execpath in imgp for the process_exec hook.
This was previously set after the hook and only if auxargs were present.
Now always provide it if possible.

MFC after:	2 weeks
Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D6546
2016-05-26 23:19:39 +00:00
Bryan Drewery
881010f05d exec: Add credential change information into imgp for process_exec hook.
This allows an EVENTHANDLER(process_exec) hook to see if the new image
will cause credentials to change whether due to setgid/setuid or because
of POSIX saved-id semantics.

This adds 3 new fields into image_params:
  struct ucred *newcred		Non-null if the credentials will change.
  bool credential_setid		True if the new image is setuid or setgid.

This will pre-determine the new credentials before invoking the image
activators, where the process_exec hook is called.  The new credentials
will be installed into the process in the same place as before, after
image activators are done handling the image.

MFC after:	2 weeks
Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D6544
2016-05-26 23:18:54 +00:00
Pedro F. Giffuni
e3043798aa sys/kern: spelling fixes in comments.
No functional change.
2016-04-29 22:15:33 +00:00
Edward Tomasz Napierala
35030a5dd4 Remove some NULL checks for M_WAITOK allocations.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-03-29 13:56:59 +00:00
Bryan Drewery
3b77ac5f9e Correct a comment. 2016-03-01 23:58:53 +00:00
Mark Johnston
3616095801 Fix style issues around existing SDT probes.
- Use SDT_PROBE<N>() instead of SDT_PROBE(). This has no functional effect
  at the moment, but will be needed for some future changes.
- Don't hardcode the module component of the probe identifier. This is
  set automatically by the SDT framework.

MFC after:	1 week
2015-12-16 23:39:27 +00:00
Gleb Smirnoff
b0cd20172d A change to KPI of vm_pager_get_pages() and underlying VOP_GETPAGES().
o With new KPI consumers can request contiguous ranges of pages, and
  unlike before, all pages will be kept busied on return, like it was
  done before with the 'reqpage' only. Now the reqpage goes away. With
  new interface it is easier to implement code protected from race
  conditions.

  Such arrayed requests for now should be preceeded by a call to
  vm_pager_haspage() to make sure that request is possible. This
  could be improved later, making vm_pager_haspage() obsolete.

  Strenghtening the promises on the business of the array of pages
  allows us to remove such hacks as swp_pager_free_nrpage() and
  vm_pager_free_nonreq().

o New KPI accepts two integer pointers that may optionally point at
  values for read ahead and read behind, that a pager may do, if it
  can. These pages are completely owned by pager, and not controlled
  by the caller.

  This shifts the UFS-specific readahead logic from vm_fault.c, which
  should be file system agnostic, into vnode_pager.c. It also removes
  one VOP_BMAP() request per hard fault.

Discussed with:	kib, alc, jeff, scottl
Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2015-12-16 21:30:45 +00:00
Conrad Meyer
e6b95927f3 Fix core corruption caused by race in note_procstat_vmmap
This fix is spiritually similar to r287442 and was discovered thanks to
the KASSERT added in that revision.

NT_PROCSTAT_VMMAP output length, when packing kinfo structs, is tied to
the length of filenames corresponding to vnodes in the process' vm map
via vn_fullpath.  As vnodes may move during coredump, this is racy.

We do not remove the race, only prevent it from causing coredump
corruption.

- Add a sysctl, kern.coredump_pack_vmmapinfo, to allow users to disable
  kinfo packing for PROCSTAT_VMMAP notes.  This avoids VMMAP corruption
  and truncation, even if names change, at the cost of up to PATH_MAX
  bytes per mapped object.  The new sysctl is documented in core.5.

- Fix note_procstat_vmmap to self-limit in the second pass.  This
  addresses corruption, at the cost of sometimes producing a truncated
  result.

- Fix PROCSTAT_VMMAP consumers libutil (and libprocstat, via copy-paste)
  to grok the new zero padding.

Reported by:	pho (https://people.freebsd.org/~pho/stress/log/datamove4-2.txt)
Relnotes:	yes
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D3824
2015-10-06 18:07:00 +00:00
Andriy Gapon
2f2f522b5d save some bytes by using more concise SDT_PROBE<n> instead of SDT_PROBE
SDT_PROBE requires 5 parameters whereas SDT_PROBE<n> requires n parameters
where n is typically smaller than 5.

Perhaps SDT_PROBE should be made a private implementation detail.

MFC after:	20 days
2015-09-28 12:14:16 +00:00
Conrad Meyer
bcb60d52e6 Follow-up to r287442: Move sysctl to compiled-once file
Avoid duplicate sysctl nodes.

Found by:	tijl
Approved by:	markj (mentor)
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D3586
2015-09-07 16:44:28 +00:00
Ed Schouten
39f5ebb774 Add sysent flag to switch to capabilities mode on startup.
CloudABI processes should run in capabilities mode automatically. There
is no need to switch manually (e.g., by calling cap_enter()). Add a
flag, SV_CAPSICUM, that can be used to call into cap_enter() during
execve().

Reviewed by:	kib
2015-08-03 13:41:47 +00:00
Konstantin Belousov
b4490c6e93 The si_status field of the siginfo_t, provided by the waitid(2) and
SIGCHLD signal, should keep full 32 bits of the status passed to the
_exit(2).

Split the combined p_xstat of the struct proc into the separate exit
status p_xexit for normal process exit, and signalled termination
information p_xsig.  Kernel-visible macro KW_EXITCODE() reconstructs
old p_xstat from p_xexit and p_xsig.  p_xexit contains complete status
and copied out into si_status.

Requested by:	Joerg Schilling
Reviewed by:	jilles (previous version), pho
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2015-07-18 09:02:50 +00:00
Ed Schouten
457f7e23b1 Implement CloudABI's exec() call.
Summary:
In a runtime that is purely based on capability-based security, there is
a strong emphasis on how programs start their execution. We need to make
sure that we execute an new program with an exact set of file
descriptors, ensuring that credentials are not leaked into the process
accidentally.

Providing the right file descriptors is just half the problem. There
also needs to be a framework in place that gives meaning to these file
descriptors. How does a CloudABI mail server know which of the file
descriptors corresponds to the socket that receives incoming emails?
Furthermore, how will this mail server acquire its configuration
parameters, as it cannot open a configuration file from a global path on
disk?

CloudABI solves this problem by replacing traditional string command
line arguments by tree-like data structure consisting of scalars,
sequences and mappings (similar to YAML/JSON). In this structure, file
descriptors are treated as a first-class citizen. When calling exec(),
file descriptors are passed on to the new executable if and only if they
are referenced from this tree structure. See the cloudabi-run(1) man
page for more details and examples (sysutils/cloudabi-utils).

Fortunately, the kernel does not need to care about this tree structure
at all. The C library is responsible for serializing and deserializing,
but also for extracting the list of referenced file descriptors. The
system call only receives a copy of the serialized data and a layout of
what the new file descriptor table should look like:

    int proc_exec(int execfd, const void *data, size_t datalen, const int *fds,
              size_t fdslen);

This change introduces a set of fd*_remapped() functions:

- fdcopy_remapped() pulls a copy of a file descriptor table, remapping
  all of the file descriptors according to the provided mapping table.
- fdinstall_remapped() replaces the file descriptor table of the process
  by the copy created by fdcopy_remapped().
- fdescfree_remapped() frees the table in case we aborted before
  fdinstall_remapped().

We then add a function exec_copyin_data_fds() that builds on top these
functions. It copies in the data and constructs a new remapped file
descriptor. This is used by cloudabi_sys_proc_exec().

Test Plan:
cloudabi-run(1) is capable of spawning processes successfully, providing
it data and file descriptors. procstat -f seems to confirm all is good.
Regular FreeBSD processes also work properly.

Reviewers: kib, mjg

Reviewed By: mjg

Subscribers: imp

Differential Revision: https://reviews.freebsd.org/D3079
2015-07-16 07:05:42 +00:00
Mateusz Guzik
6161705823 exec: textvp -> oldtextvp; binvp -> newtextvp
This makes it consistent with the rest of the naming in do_execve.

No functional changes.
2015-07-14 01:13:37 +00:00
Mateusz Guzik
853be5ffef exec plug a redundant vref + vrele of the image vnode 2015-07-14 00:43:08 +00:00
Konstantin Belousov
6ef120027f Do not calculate the stack's bottom address twice.
Submitted by:	Olivц╘r Pintц╘r
Review:	https://reviews.freebsd.org/D2953
MFC after:	1 week
2015-06-30 15:22:47 +00:00
Gleb Smirnoff
093c7f396d Make KPI of vm_pager_get_pages() more strict: if a pager changes a page
in the requested array, then it is responsible for disposition of previous
page and is responsible for updating the entry in the requested array.
Now consumers of KPI do not need to re-lookup the pages after call to
vm_pager_get_pages().

Reviewed by:	kib
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2015-06-12 11:32:20 +00:00
Mateusz Guzik
f6f6d24062 Implement lockless resource limits.
Use the same scheme implemented to manage credentials.

Code needing to look at process's credentials (as opposed to thred's) is
provided with *_proc variants of relevant functions.

Places which possibly had to take the proc lock anyway still use the proc
pointer to access limits.
2015-06-10 10:48:12 +00:00
Konstantin Belousov
7b445033ff On exec, single-threading must be enforced before arguments space is
allocated from exec_map.  If many threads try to perform execve(2) in
parallel, the exec map is exhausted and some threads sleep
uninterruptible waiting for the map space.  Then, the thread which won
the race for the space allocation, cannot single-thread the process,
causing deadlock.

Reported and tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-05-10 09:00:40 +00:00
Konstantin Belousov
fe8a824ca6 Handle incorrect ELF images specifying size for PT_GNU_STACK not being
multiple of page size.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2015-04-23 11:27:21 +00:00
Konstantin Belousov
316b384343 Implement support for binary to requesting specific stack size for the
initial thread.  It is read by the ELF image activator as the virtual
size of the PT_GNU_STACK program header entry, and can be specified by
the linker option -z stack-size in newer binutils.

The soft RLIMIT_STACK is auto-increased if possible, to satisfy the
binary' request.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-04-15 08:13:53 +00:00
Alan Cox
3d653db063 Introduce vm_object_color() and use it in mmap(2) to set the color of
named objects to zero before the virtual address is selected.  Previously,
the color setting was delayed until after the virtual address was
selected.  In rtld, this delay effectively prevented the mapping of a
shared library's code section using superpages.  Now, for example, we see
the first 1 MB of libc's code on armv6 mapped by a superpage after we've
gotten through the initial cold misses that bring the first 1 MB of code
into memory.  (With the page clustering that we perform on read faults,
this happens quickly.)

Differential Revision:	https://reviews.freebsd.org/D2013
Reviewed by:	jhb, kib
Tested by:	Svatopluk Kraus (armv6)
MFC after:	6 weeks
2015-03-21 17:56:55 +00:00
Mateusz Guzik
daf63fd2f9 cred: add proc_set_cred helper
The goal here is to provide one place altering process credentials.

This eases debugging and opens up posibilities to do additional work when such
an action is performed.
2015-03-16 00:10:03 +00:00
Konstantin Belousov
677258f7e7 Add procctl(2) PROC_TRACE_CTL command to enable or disable debugger
attachment to the process.  Note that the command is not intended to
be a security measure, rather it is an obfuscation feature,
implemented for parity with other operating systems.

Discussed with:	jilles, rwatson
Man page fixes by:	rwatson
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-01-18 15:13:11 +00:00
Konstantin Belousov
6ddcc23386 Add facility to stop all userspace processes. The supposed use of the
feature is to quisce the system before suspend.

Stop is implemented by reusing the thread_single(9) with the special
mode SINGLE_ALLPROC.  SINGLE_ALLPROC differs from the existing
single-threading modes by allowing (requiring) caller to operate on
other process.  Interruptible sleeps for !TDF_SBDRY threads are
suspended like SIGSTOP does it, instead of aborting the sleep, like
SINGLE_NO_EXIT, to avoid spurious EINTRs on resume.

Provide debugging sysctl debug.stop_all_proc, which causes total stop
and suspends syncer, while waiting for variable reset for resume.  It
is used for debugging; should be removed after the real use of the
interface is added.

In collaboration with:	pho
Discussed with:	avg
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-12-13 16:18:29 +00:00
Mateusz Guzik
8a5177cca3 filedesc: fix missed comments about fdsetugidsafety
While here just note that both fdsetugidsafety and fdcheckstd take sleepable
locks.
2014-10-31 09:56:00 +00:00
Konstantin Belousov
0a2c94b86e Replace some calls to fuword() by fueword() with proper error checking.
Sponsored by:	The FreeBSD Foundation
Tested by:	pho
MFC after:	3 weeks
2014-10-28 15:28:20 +00:00
Mateusz Guzik
11888da8d9 filedesc: cleanup setugidsafety a little
Rename it to fdsetugidsafety for consistency with other functions.

There is no need to take filedesc lock if not closing any files.

The loop has to verify each file and we are guaranteed fdtable has space
for at least 20 fds. As such there is no need to check fd_lastfile.

While here tidy up is_unsafe.
2014-10-22 00:23:43 +00:00
Mateusz Guzik
5c37b305fd Plug unnecessary binvp NULL initialization and test.
Reported by: Coverity
CID: 1018889
2014-10-20 22:52:15 +00:00
Mateusz Guzik
8e572983d3 Use bzero instead of explicitly zeroing stuff in do_execve.
While strictly speaking this is not correct since some fields are pointers,
it makes no difference on all supported archs and we already rely on it doing
the right thing in other places.

No functional changes.
2014-09-29 23:59:19 +00:00
Konstantin Belousov
70978c93b8 If vm_page_grab() allocates a new page, the page is not inserted into
page queue even when the allocation is not wired.  It is
responsibility of the vm_page_grab() caller to ensure that the page
does not end on the vm_object queue but not on the pagedaemon queue,
which would effectively create unpageable unwired page.

In exec_map_first_page() and vm_imgact_hold_page(), activate the page
immediately after unbusying it, to avoid leak.

In the uiomove_object_page(), deactivate page before the object is
unlocked.  There is no leak, since the page is deactivated after
uiomove_fromphys() finished.  But allowing non-queued non-wired page
in the unlocked object queue makes it impossible to assert that leak
does not happen in other places.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-08-13 05:44:08 +00:00
Mateusz Guzik
965d08605f Plug p_pptr null test in do_execve. It is always true. 2014-07-14 22:40:46 +00:00
Mateusz Guzik
5e2554b7f8 Don't call crdup nor uifind under vnode lock.
A locked vnode can get into the way of satisyfing malloc with M_WATOK.

This is a fixup to r268087.

Suggested by:	kib
MFC after:	1 week
2014-07-07 14:03:30 +00:00
Marcel Moolenaar
e7d939bda2 Remove ia64.
This includes:
o   All directories named *ia64*
o   All files named *ia64*
o   All ia64-specific code guarded by __ia64__
o   All ia64-specific makefile logic
o   Mention of ia64 in comments and documentation

This excludes:
o   Everything under contrib/
o   Everything under crypto/
o   sys/xen/interface
o   sys/sys/elf_common.h

Discussed at: BSDcan
2014-07-07 00:27:09 +00:00
Mateusz Guzik
a6bad85e8e Plug gcc warning after r268074 about unitialized newsigacts
Reported by:	Gary Jennejohn <gljennjohn gmail.com>
2014-07-02 05:45:40 +00:00
Mateusz Guzik
350d51816e Don't call crcopysafe or uifind unnecessarily in execve.
MFC after:	1 week
2014-07-01 09:21:32 +00:00
Mateusz Guzik
d00c8ea429 Perform a lockless check in sigacts_shared.
It is used only during execve (i.e. singlethreaded), so there is no fear
of returning 'not shared' which soon becomes 'shared'.

While here reorganize the code a little to avoid proc lock/unlock in
shared case.

MFC after:	1 week
2014-07-01 06:29:15 +00:00
Mateusz Guzik
b0bc0cadbe Call fdcloseexec right after fdunshare.
No functional changes.

MFC after:	1 week
2014-06-28 05:51:45 +00:00
Mateusz Guzik
b9d32c36fa Make fdunshare accept only td parameter.
Proc had to match the thread anyway and 2 parameters were inconsistent
with the rest.

MFC after:	1 week
2014-06-28 05:41:53 +00:00
Hans Petter Selasky
af3b2549c4 Pull in r267961 and r267973 again. Fix for issues reported will follow. 2014-06-28 03:56:17 +00:00
Glen Barber
37a107a407 Revert r267961, r267973:
These changes prevent sysctl(8) from returning proper output,
such as:

 1) no output from sysctl(8)
 2) erroneously returning ENOMEM with tools like truss(1)
    or uname(1)
 truss: can not get etype: Cannot allocate memory
2014-06-27 22:05:21 +00:00
Hans Petter Selasky
3da1cf1e88 Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after:	2 weeks
Sponsored by:	Mellanox Technologies
2014-06-27 16:33:43 +00:00
Pawel Jakub Dawidek
e16406c7ba Remove duplicated includes.
Submitted by:	Mariusz Zaborski <oshogbo@FreeBSD.org>
2014-06-26 13:57:44 +00:00
Alan Cox
78960940fe Refresh a comment. The VM_STACK option was eliminated in r43209.
Sponsored by:	EMC / Isilon Storage Division
2014-06-09 00:15:16 +00:00
Konstantin Belousov
7032434e98 When exec_new_vmspace() decides that current vmspace cannot be reused
on execve(2), it calls vmspace_exec(), which frees the current
vmspace.  The thread executing an exec syscall gets new vmspace
assigned, and old vmspace is freed if only referenced by the current
process.  The free operation includes pmap_release(), which
de-constructs the paging structures used by hardware.

If the calling process is multithreaded, other threads are suspended
in the thread_suspend_check(), and need to be unsuspended and run to
be able to exit on successfull exec.  Now, since the old vmspace is
destroyed, paging structures are invalid, threads are resumed on the
non-existent pmaps (page tables), which leads to triple fault on x86.

To fix, postpone the free of old vmspace until the threads are resumed
and exited.  To avoid modifications to all image activators all of
which use exec_new_vmspace(), memoize the current (old) vmspace in
kern_execve(), and notify it about the need to call vmspace_free()
with a thread-private flag TDP_EXECVMSPC.

http://bugs.debian.org/743141

Reported by:	Ivo De Decker <ivo.dedecker@ugent.be> through secteam
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2014-05-20 09:19:35 +00:00
Konstantin Belousov
88b124cede Make the array pointed to by AT_PAGESIZES auxv properly aligned.
Also, remove the expression which calculated the location of the
strings for a new image and grown over the time to be
non-comprehensible.  Instead, calculate the offsets by steps, which
also makes fixing the alignments much cleaner.

Reported and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-03-19 12:35:04 +00:00
Robert Watson
4a14441044 Update kernel inclusions of capability.h to use capsicum.h instead; some
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.

MFC after:	3 weeks
2014-03-16 10:55:57 +00:00
Andriy Gapon
d9fae5ab88 dtrace sdt: remove the ugly sname parameter of SDT_PROBE_DEFINE
In its stead use the Solaris / illumos approach of emulating '-' (dash)
in probe names with '__' (two consecutive underscores).

Reviewed by:	markj
MFC after:	3 weeks
2013-11-26 08:46:27 +00:00
Attilio Rao
54366c0bd7 - For kernel compiled only with KDTRACE_HOOKS and not any lock debugging
option, unbreak the lock tracing release semantic by embedding
  calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
  version of the releasing functions for mutex, rwlock and sxlock.
  Failing to do so skips the lockstat_probe_func invokation for
  unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
  kernel compiled without lock debugging options, potentially every
  consumer must be compiled including opt_kdtrace.h.
  Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
  dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
  is linked there and it is only used as a compile-time stub [0].

[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested.  As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while.  Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].

Sponsored by:	EMC / Isilon storage division
Discussed with:	rstone
[0] Reported by:	rstone
[1] Discussed with:	philip
2013-11-25 07:38:45 +00:00
Konstantin Belousov
eda6009c04 Add a sysctl kern.disallow_high_osrel which disables executing the
images compiled on the world with higher major version number than the
high version number of the booted kernel.  Default to disable.

Sponsored by:	The FreeBSD Foundation
Discussed with:	bapt
MFC after:	1 week
2013-10-15 06:38:40 +00:00
Pawel Jakub Dawidek
7008be5bd7 Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.

The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.

The structure definition looks like this:

	struct cap_rights {
		uint64_t	cr_rights[CAP_RIGHTS_VERSION + 2];
	};

The initial CAP_RIGHTS_VERSION is 0.

The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.

The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.

To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.

	#define	CAP_PDKILL	CAPRIGHT(1, 0x0000000000000800ULL)

We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:

	#define	CAP_LOOKUP	CAPRIGHT(0, 0x0000000000000400ULL)
	#define	CAP_FCHMOD	CAPRIGHT(0, 0x0000000000002000ULL)

	#define	CAP_FCHMODAT	(CAP_FCHMOD | CAP_LOOKUP)

There is new API to manage the new cap_rights_t structure:

	cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
	void cap_rights_set(cap_rights_t *rights, ...);
	void cap_rights_clear(cap_rights_t *rights, ...);
	bool cap_rights_is_set(const cap_rights_t *rights, ...);

	bool cap_rights_is_valid(const cap_rights_t *rights);
	void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
	void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
	bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);

Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:

	cap_rights_t rights;

	cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);

There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:

	#define	cap_rights_set(rights, ...)				\
		__cap_rights_set((rights), __VA_ARGS__, 0ULL)
	void __cap_rights_set(cap_rights_t *rights, ...);

Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:

	cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);

Providing several rights that belongs to the same array's element this way is
correct, but is not advised. It should only be used for aliases definition.

This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.

Sponsored by:	The FreeBSD Foundation
2013-09-05 00:09:56 +00:00
Konstantin Belousov
5944de8ecd Remove the deprecated VM_ALLOC_RETRY flag for the vm_page_grab(9).
The flag was mandatory since r209792, where vm_page_grab(9) was
changed to only support the alloc retry semantic.

Suggested and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
2013-08-22 07:39:53 +00:00
Mark Johnston
7b77e1fe0f Specify SDT probe argument types in the probe definition itself rather than
using SDT_PROBE_ARGTYPE(). This will make it easy to extend the SDT(9) API
to allow probes with dynamically-translated types.

There is no functional change.

MFC after:	2 weeks
2013-08-15 04:08:55 +00:00
Attilio Rao
c7aebda8a1 The soft and hard busy mechanism rely on the vm object lock to work.
Unify the 2 concept into a real, minimal, sxlock where the shared
acquisition represent the soft busy and the exclusive acquisition
represent the hard busy.
The old VPO_WANTED mechanism becames the hard-path for this new lock
and it becomes per-page rather than per-object.
The vm_object lock becames an interlock for this functionality:
it can be held in both read or write mode.
However, if the vm_object lock is held in read mode while acquiring
or releasing the busy state, the thread owner cannot make any
assumption on the busy state unless it is also busying it.

Also:
- Add a new flag to directly shared busy pages while vm_page_alloc
  and vm_page_grab are being executed.  This will be very helpful
  once these functions happen under a read object lock.
- Move the swapping sleep into its own per-object flag

The KPI is heavilly changed this is why the version is bumped.
It is very likely that some VM ports users will need to change
their own code.

Sponsored by:	EMC / Isilon storage division
Discussed with:	alc
Reviewed by:	jeff, kib
Tested by:	gavin, bapt (older version)
Tested by:	pho, scottl
2013-08-09 11:11:11 +00:00
Jeff Roberson
5df87b21d3 Replace kernel virtual address space allocation with vmem. This provides
transparent layering and better fragmentation.

 - Normalize functions that allocate memory to use kmem_*
 - Those that allocate address space are named kva_*
 - Those that operate on maps are named kmap_*
 - Implement recursive allocation handling for kmem_arena in vmem.

Reviewed by:	alc
Tested by:	pho
Sponsored by:	EMC / Isilon Storage Division
2013-08-07 06:21:20 +00:00
Attilio Rao
be99683637 Revert r253939:
We cannot busy a page before doing pagefaults.
Infact, it can deadlock against vnode lock, as it tries to vget().
Other functions, right now, have an opposite lock ordering, like
vm_object_sync(), which acquires the vnode lock first and then
sleeps on the busy mechanism.

Before this patch is reinserted we need to break this ordering.

Sponsored by:	EMC / Isilon storage division
Reported by:	kib
2013-08-05 08:55:35 +00:00
Attilio Rao
3b6714cacb The page hold mechanism is fast but it has couple of fallouts:
- It does not let pages respect the LRU policy
- It bloats the active/inactive queues of few pages

Try to avoid it as much as possible with the long-term target to
completely remove it.
Use the soft-busy mechanism to protect page content accesses during
short-term operations (like uiomove_fromphys()).

After this change only vm_fault_quick_hold_pages() is still using the
hold mechanism for page content access.
There is an additional complexity there as the quick path cannot
immediately access the page object to busy the page and the slow path
cannot however busy more than one page a time (to avoid deadlocks).

Fixing such primitive can bring to complete removal of the page hold
mechanism.

Sponsored by:	EMC / Isilon storage division
Discussed with:	alc
Reviewed by:	jeff
Tested by:	pho
2013-08-04 21:07:24 +00:00
Alan Cox
27a18d6a23 Don't busy the page unless we are likely to release the object lock.
Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
2013-06-06 06:17:20 +00:00
Konstantin Belousov
1e65d73c74 Do not map the shared page COW. If the process wired its address
space, fork(2) would cause shadowing of the physical object and
copying of the shared page into private copy, effectively preventing
updates for the exported timehands structure and stopping the clock.

Specify the maximum allowed permissions for the page to be read and
execute, preventing write from the user mode.

Reported and tested by:	<huanghwh@yahoo.com>
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-06-03 04:32:53 +00:00
Attilio Rao
89f6b8632c Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
  - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
  - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
  - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
  - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
    (in order to avoid visibility of implementation details)
  - The read-mode operations are added:
    VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
    VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
  sys/mutex.h in consumers directly to cater its inlining functions
  using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
  consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
  the compat layer because the name clash between FreeBSD and solaris
  versions must be avoided.
  At this purpose zfs redefines the vm_object locking functions
  directly, isolating the FreeBSD components in specific compat stubs.

The KPI results heavilly broken by this commit.  Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by:	EMC / Isilon storage division
Reviewed by:	jeff
Reviewed by:	pjd (ZFS specific review)
Discussed with:	alc
Tested by:	pho
2013-03-09 02:32:23 +00:00
Pawel Jakub Dawidek
2609222ab4 Merge Capsicum overhaul:
- Capability is no longer separate descriptor type. Now every descriptor
  has set of its own capability rights.

- The cap_new(2) system call is left, but it is no longer documented and
  should not be used in new code.

- The new syscall cap_rights_limit(2) should be used instead of
  cap_new(2), which limits capability rights of the given descriptor
  without creating a new one.

- The cap_getrights(2) syscall is renamed to cap_rights_get(2).

- If CAP_IOCTL capability right is present we can further reduce allowed
  ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
  ioctls can be retrived with cap_ioctls_get(2) syscall.

- If CAP_FCNTL capability right is present we can further reduce fcntls
  that can be used with the new cap_fcntls_limit(2) syscall and retrive
  them with cap_fcntls_get(2).

- To support ioctl and fcntl white-listing the filedesc structure was
  heavly modified.

- The audit subsystem, kdump and procstat tools were updated to
  recognize new syscalls.

- Capability rights were revised and eventhough I tried hard to provide
  backward API and ABI compatibility there are some incompatible changes
  that are described in detail below:

	CAP_CREATE old behaviour:
	- Allow for openat(2)+O_CREAT.
	- Allow for linkat(2).
	- Allow for symlinkat(2).
	CAP_CREATE new behaviour:
	- Allow for openat(2)+O_CREAT.

	Added CAP_LINKAT:
	- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
	- Allow to be target for renameat(2).

	Added CAP_SYMLINKAT:
	- Allow for symlinkat(2).

	Removed CAP_DELETE. Old behaviour:
	- Allow for unlinkat(2) when removing non-directory object.
	- Allow to be source for renameat(2).

	Removed CAP_RMDIR. Old behaviour:
	- Allow for unlinkat(2) when removing directory.

	Added CAP_RENAMEAT:
	- Required for source directory for the renameat(2) syscall.

	Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
	- Allow for unlinkat(2) on any object.
	- Required if target of renameat(2) exists and will be removed by this
	  call.

	Removed CAP_MAPEXEC.

	CAP_MMAP old behaviour:
	- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
	  PROT_WRITE.
	CAP_MMAP new behaviour:
	- Allow for mmap(2)+PROT_NONE.

	Added CAP_MMAP_R:
	- Allow for mmap(PROT_READ).
	Added CAP_MMAP_W:
	- Allow for mmap(PROT_WRITE).
	Added CAP_MMAP_X:
	- Allow for mmap(PROT_EXEC).
	Added CAP_MMAP_RW:
	- Allow for mmap(PROT_READ | PROT_WRITE).
	Added CAP_MMAP_RX:
	- Allow for mmap(PROT_READ | PROT_EXEC).
	Added CAP_MMAP_WX:
	- Allow for mmap(PROT_WRITE | PROT_EXEC).
	Added CAP_MMAP_RWX:
	- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).

	Renamed CAP_MKDIR to CAP_MKDIRAT.
	Renamed CAP_MKFIFO to CAP_MKFIFOAT.
	Renamed CAP_MKNODE to CAP_MKNODEAT.

	CAP_READ old behaviour:
	- Allow pread(2).
	- Disallow read(2), readv(2) (if there is no CAP_SEEK).
	CAP_READ new behaviour:
	- Allow read(2), readv(2).
	- Disallow pread(2) (CAP_SEEK was also required).

	CAP_WRITE old behaviour:
	- Allow pwrite(2).
	- Disallow write(2), writev(2) (if there is no CAP_SEEK).
	CAP_WRITE new behaviour:
	- Allow write(2), writev(2).
	- Disallow pwrite(2) (CAP_SEEK was also required).

	Added convinient defines:

	#define	CAP_PREAD		(CAP_SEEK | CAP_READ)
	#define	CAP_PWRITE		(CAP_SEEK | CAP_WRITE)
	#define	CAP_MMAP_R		(CAP_MMAP | CAP_SEEK | CAP_READ)
	#define	CAP_MMAP_W		(CAP_MMAP | CAP_SEEK | CAP_WRITE)
	#define	CAP_MMAP_X		(CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
	#define	CAP_MMAP_RW		(CAP_MMAP_R | CAP_MMAP_W)
	#define	CAP_MMAP_RX		(CAP_MMAP_R | CAP_MMAP_X)
	#define	CAP_MMAP_WX		(CAP_MMAP_W | CAP_MMAP_X)
	#define	CAP_MMAP_RWX		(CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
	#define	CAP_RECV		CAP_READ
	#define	CAP_SEND		CAP_WRITE

	#define	CAP_SOCK_CLIENT \
		(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
		 CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
	#define	CAP_SOCK_SERVER \
		(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
		 CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
		 CAP_SETSOCKOPT | CAP_SHUTDOWN)

	Added defines for backward API compatibility:

	#define	CAP_MAPEXEC		CAP_MMAP_X
	#define	CAP_DELETE		CAP_UNLINKAT
	#define	CAP_MKDIR		CAP_MKDIRAT
	#define	CAP_RMDIR		CAP_UNLINKAT
	#define	CAP_MKFIFO		CAP_MKFIFOAT
	#define	CAP_MKNOD		CAP_MKNODAT
	#define	CAP_SOCK_ALL		(CAP_SOCK_CLIENT | CAP_SOCK_SERVER)

Sponsored by:	The FreeBSD Foundation
Reviewed by:	Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with:	rwatson, benl, jonathan
ABI compatibility discussed with:	kib
2013-03-02 00:53:12 +00:00
Konstantin Belousov
888d4d4f86 When vforked child is traced, the debugging events are not generated
until child performs exec().  The behaviour is reasonable when a
debugger is the real parent, because the parent is stopped until
exec(), and sending a debugging event to the debugger would deadlock
both parent and child.

On the other hand, when debugger is not the parent of the vforked
child, not sending debugging signals makes it impossible to debug
across vfork.

Fix the issue by declining generating debug signals only when vfork()
was done and child called ptrace(PT_TRACEME).  Set a new process flag
P_PPTRACE from the attach code for PT_TRACEME, if P_PPWAIT flag is
set, which indicates that the process was created with vfork() and
still did not execed. Check P_PPTRACE from issignal(), instead of
refusing the trace outright for the P_PPWAIT case.  The scope of
P_PPTRACE is exactly contained in the scope of P_PPWAIT.

Found and tested by:  zont
Reviewed by:	pluknet
MFC after:	2 weeks
2013-02-07 15:34:22 +00:00
Konstantin Belousov
140dedb81c The r241025 fixed the case when a binary, executed from nullfs mount,
was still possible to open for write from the lower filesystem.  There
is a symmetric situation where the binary could already has file
descriptors opened for write, but it can be executed from the nullfs
overlay.

Handle the issue by passing one v_writecount reference to the lower
vnode if nullfs vnode has non-zero v_writecount.  Note that only one
write reference can be donated, since nullfs only keeps one use
reference on the lower vnode.  Always use the lower vnode v_writecount
for the checks.

Introduce the VOP_GET_WRITECOUNT to read v_writecount, which is
currently always bypassed to the lower vnode, and VOP_ADD_WRITECOUNT
to manipulate the v_writecount value, which manages a single bypass
reference to the lower vnode.  Caling the VOPs instead of directly
accessing v_writecount provide the fix described in the previous
paragraph.

Tested by:	pho
MFC after:	3 weeks
2012-11-02 13:56:36 +00:00
Konstantin Belousov
5050aa86cf Remove the support for using non-mpsafe filesystem modules.
In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.

Conducted and reviewed by:	attilio
Tested by:	pho
2012-10-22 17:50:54 +00:00