Commit Graph

455 Commits

Author SHA1 Message Date
kib
d7b44ae42b Explicitely add "opt_compat.h" to kern_exec.c: fix powerpc LINT builds.
sys/ptrace.h includes sys/signal.h, which includes sys/_sigset.h.
Note that sys/_sigset.h only defines osigset_t if COMPAT_43 was defined.

Two lines later, sys/ptrace.h includes machine/reg.h, which in case of
powerpc, includes opt_compat.h.

After the include headers reordering in r311345, we have sys/ptrace.h
included before sys/sysproto.h.

If COMPAT_43 was requested in the kernel config, the result is that
sys/_sigset.h does not define osigset_t, but sys/sysproto.h sees
COMPAT_43 and uses osigset_t.

Fix this by explicitely including opt_compat.h to cover the whole
kern/kern_exec.c scope.

Sponsored by:	The FreeBSD Foundation
2017-01-06 16:56:24 +00:00
markj
d0b3c7cb46 Add a small allocator for exec_map entries.
Upon each execve, we allocate a KVA range for use in copying data to the
new image. Pages must be faulted into the range, and when the range is
freed, the backing pages are freed and their mappings are destroyed. This
is a lot of needless overhead, and the exec_map management becomes a
bottleneck when many CPUs are executing execve concurrently. Moreover, the
number of available ranges is fixed at 16, which is insufficient on large
systems and potentially excessive on 32-bit systems.

The new allocator reduces overhead by making exec_map allocations
persistent. When a range is freed, pages backing the range are marked clean
and made easy to reclaim. With this change, the exec_map is sized based on
the number of CPUs.

Reviewed by:	kib
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D8921
2017-01-05 01:44:12 +00:00
markj
1c83f7347b Sort includes in kern_exec.c.
MFC after:	1 week
2017-01-05 01:28:08 +00:00
alc
2fa3607305 Remove most of the code for implementing PG_CACHED pages. (This change does
not remove user-space visible fields from vm_cnt or all of the references to
cached pages from comments.  Those changes will come later.)

Reviewed by:	kib, markj
Tested by:	pho
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8497
2016-11-15 18:22:50 +00:00
mjg
8f6db7095f Mark a bunch of mpsafe sysctls as such.
This gives me a sysctl Giant-free buildworld.
2016-10-19 19:42:01 +00:00
alc
c2131431d1 Eliminate unneeded vm_page_xbusy() and vm_page_xunbusy() operations when
neither vm_pager_has_page() nor vm_pager_get_pages() is called.

Reviewed by:	kib, markj
MFC after:	3 weeks
2016-08-14 22:00:45 +00:00
alc
adb85a5730 Eliminate two calls to vm_page_xunbusy() that are both unnecessary and
incorrect from the error cases in exec_map_first_page().  They are
unnecessary because we automatically unbusy the page in vm_page_free()
when we remove it from the object.  The calls are incorrect because they
happen after the page is freed, so we might actually unbusy the page
after it has been reallocated to a different object.  (This error was
introduced in r292373.)

Reviewed by:	kib
MFC after:	1 week
2016-08-13 18:10:32 +00:00
kib
2b0b2600f6 The assertion re-added in r302614 was triggered when stopping signal
is delivered to vforked child.  Issue is that we avoid stopping such
children in issignal() to not block parents.  But executed AST, which
ignored stops, leaves the child with the signal pending but no AST
pending.

On first exec after vfork(), call signotify() to handle pending
reenabled signals.  Adjust the assert to not check vfork children
until exec.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-07-18 10:53:47 +00:00
jhb
91d07047c4 Add a mask of optional ptrace() events.
ptrace() now stores a mask of optional events in p_ptevents.  Currently
this mask is a single integer, but it can be expanded into an array of
integers in the future.

Two new ptrace requests can be used to manipulate the event mask:
PT_GET_EVENT_MASK fetches the current event mask and PT_SET_EVENT_MASK
sets the current event mask.

The current set of events include:
- PTRACE_EXEC: trace calls to execve().
- PTRACE_SCE: trace system call entries.
- PTRACE_SCX: trace syscam call exits.
- PTRACE_FORK: trace forks and auto-attach to new child processes.
- PTRACE_LWP: trace LWP events.

The S_PT_SCX and S_PT_SCE events in the procfs p_stops flags have
been replaced by PTRACE_SCE and PTRACE_SCX.  PTRACE_FORK replaces
P_FOLLOW_FORK and PTRACE_LWP replaces P2_LWP_EVENTS.

The PT_FOLLOW_FORK and PT_LWP_EVENTS ptrace requests remain for
compatibility but now simply toggle corresponding flags in the
event mask.

While here, document that PT_SYSCALL, PT_TO_SCE, and PT_TO_SCX both
modify the event mask and continue the traced process.

Reviewed by:	kib
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D7044
2016-07-15 15:32:09 +00:00
kib
15841a7afe When filt_proc() removes event from the knlist due to the process
exiting (NOTE_EXIT->knlist_remove_inevent()), two things happen:
- knote kn_knlist pointer is reset
- INFLUX knote is removed from the process knlist.
And, there are two consequences:
- KN_LIST_UNLOCK() on such knote is nop
- there is nothing which would block exit1() from processing past the
  knlist_destroy() (and knlist_destroy() resets knlist lock pointers).
Both consequences result either in leaked process lock, or
dereferencing NULL function pointers for locking.

Handle this by stopping embedding the process knlist into struct proc.
Instead, the knlist is allocated together with struct proc, but marked
as autodestroy on the zombie reap, by knlist_detach() function.  The
knlist is freed when last kevent is removed from the list, in
particular, at the zombie reap time if the list is empty.  As result,
the knlist_remove_inevent() is no longer needed and removed.

Other changes:

In filt_procattach(), clear NOTE_EXEC and NOTE_FORK desired events
from kn_sfflags for knote registered by kernel to only get NOTE_CHILD
notifications.  The flags leak resulted in excessive
NOTE_EXEC/NOTE_FORK reports.

Fix immediate note activation in filt_procattach().  Condition should
be either the immediate CHILD_NOTE activation, or immediate NOTE_EXIT
report for the exiting process.

In knote_fork(), do not perform racy check for KN_INFLUX before kq
lock is taken.  Besides being racy, it did not accounted for notes
just added by scan (KN_SCAN).

Some minor and incomplete style fixes.

Analyzed and tested by:	Eric Badger <eric@badgerio.us>
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Approved by:	re (gjb)
Differential revision:	https://reviews.freebsd.org/D6859
2016-06-27 21:52:17 +00:00
kib
ef4ada7b76 Old process credentials for setuid execve must not be dereferenced
when the process credentials were not changed.  This can happen if an
error occured trying to activate the setuid binary.  And on error, if
new credentials were not yet assigned, they must be freed to not
create the leak.

Use oldcred == NULL as the predicate to detect credential
reassignment.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
2016-06-08 04:37:03 +00:00
mjg
4a70fa3997 exec: get rid of one vnode lock/unlock pair in do_execve
The lock was temporarily dropped for vrele calls, but they can be
postponed to a point where the lock is not held in the first place.

While here shuffle other code not needing the lock.
2016-05-27 15:03:38 +00:00
bdrewery
01a0751bcd exec: Provide execpath in imgp for the process_exec hook.
This was previously set after the hook and only if auxargs were present.
Now always provide it if possible.

MFC after:	2 weeks
Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D6546
2016-05-26 23:19:39 +00:00
bdrewery
ecf7a0caee exec: Add credential change information into imgp for process_exec hook.
This allows an EVENTHANDLER(process_exec) hook to see if the new image
will cause credentials to change whether due to setgid/setuid or because
of POSIX saved-id semantics.

This adds 3 new fields into image_params:
  struct ucred *newcred		Non-null if the credentials will change.
  bool credential_setid		True if the new image is setuid or setgid.

This will pre-determine the new credentials before invoking the image
activators, where the process_exec hook is called.  The new credentials
will be installed into the process in the same place as before, after
image activators are done handling the image.

MFC after:	2 weeks
Reviewed by:	kib
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D6544
2016-05-26 23:18:54 +00:00
pfg
28823d0656 sys/kern: spelling fixes in comments.
No functional change.
2016-04-29 22:15:33 +00:00
trasz
ca92bb3067 Remove some NULL checks for M_WAITOK allocations.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-03-29 13:56:59 +00:00
bdrewery
199dda9a01 Correct a comment. 2016-03-01 23:58:53 +00:00
markj
fa1b8e9a4f Fix style issues around existing SDT probes.
- Use SDT_PROBE<N>() instead of SDT_PROBE(). This has no functional effect
  at the moment, but will be needed for some future changes.
- Don't hardcode the module component of the probe identifier. This is
  set automatically by the SDT framework.

MFC after:	1 week
2015-12-16 23:39:27 +00:00
glebius
63cd1c131a A change to KPI of vm_pager_get_pages() and underlying VOP_GETPAGES().
o With new KPI consumers can request contiguous ranges of pages, and
  unlike before, all pages will be kept busied on return, like it was
  done before with the 'reqpage' only. Now the reqpage goes away. With
  new interface it is easier to implement code protected from race
  conditions.

  Such arrayed requests for now should be preceeded by a call to
  vm_pager_haspage() to make sure that request is possible. This
  could be improved later, making vm_pager_haspage() obsolete.

  Strenghtening the promises on the business of the array of pages
  allows us to remove such hacks as swp_pager_free_nrpage() and
  vm_pager_free_nonreq().

o New KPI accepts two integer pointers that may optionally point at
  values for read ahead and read behind, that a pager may do, if it
  can. These pages are completely owned by pager, and not controlled
  by the caller.

  This shifts the UFS-specific readahead logic from vm_fault.c, which
  should be file system agnostic, into vnode_pager.c. It also removes
  one VOP_BMAP() request per hard fault.

Discussed with:	kib, alc, jeff, scottl
Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2015-12-16 21:30:45 +00:00
cem
9c1e214f79 Fix core corruption caused by race in note_procstat_vmmap
This fix is spiritually similar to r287442 and was discovered thanks to
the KASSERT added in that revision.

NT_PROCSTAT_VMMAP output length, when packing kinfo structs, is tied to
the length of filenames corresponding to vnodes in the process' vm map
via vn_fullpath.  As vnodes may move during coredump, this is racy.

We do not remove the race, only prevent it from causing coredump
corruption.

- Add a sysctl, kern.coredump_pack_vmmapinfo, to allow users to disable
  kinfo packing for PROCSTAT_VMMAP notes.  This avoids VMMAP corruption
  and truncation, even if names change, at the cost of up to PATH_MAX
  bytes per mapped object.  The new sysctl is documented in core.5.

- Fix note_procstat_vmmap to self-limit in the second pass.  This
  addresses corruption, at the cost of sometimes producing a truncated
  result.

- Fix PROCSTAT_VMMAP consumers libutil (and libprocstat, via copy-paste)
  to grok the new zero padding.

Reported by:	pho (https://people.freebsd.org/~pho/stress/log/datamove4-2.txt)
Relnotes:	yes
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D3824
2015-10-06 18:07:00 +00:00
avg
425c0bb088 save some bytes by using more concise SDT_PROBE<n> instead of SDT_PROBE
SDT_PROBE requires 5 parameters whereas SDT_PROBE<n> requires n parameters
where n is typically smaller than 5.

Perhaps SDT_PROBE should be made a private implementation detail.

MFC after:	20 days
2015-09-28 12:14:16 +00:00
cem
a8fae65d69 Follow-up to r287442: Move sysctl to compiled-once file
Avoid duplicate sysctl nodes.

Found by:	tijl
Approved by:	markj (mentor)
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D3586
2015-09-07 16:44:28 +00:00
ed
b2ca400b88 Add sysent flag to switch to capabilities mode on startup.
CloudABI processes should run in capabilities mode automatically. There
is no need to switch manually (e.g., by calling cap_enter()). Add a
flag, SV_CAPSICUM, that can be used to call into cap_enter() during
execve().

Reviewed by:	kib
2015-08-03 13:41:47 +00:00
kib
48ccbdea81 The si_status field of the siginfo_t, provided by the waitid(2) and
SIGCHLD signal, should keep full 32 bits of the status passed to the
_exit(2).

Split the combined p_xstat of the struct proc into the separate exit
status p_xexit for normal process exit, and signalled termination
information p_xsig.  Kernel-visible macro KW_EXITCODE() reconstructs
old p_xstat from p_xexit and p_xsig.  p_xexit contains complete status
and copied out into si_status.

Requested by:	Joerg Schilling
Reviewed by:	jilles (previous version), pho
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2015-07-18 09:02:50 +00:00
ed
a85d617da6 Implement CloudABI's exec() call.
Summary:
In a runtime that is purely based on capability-based security, there is
a strong emphasis on how programs start their execution. We need to make
sure that we execute an new program with an exact set of file
descriptors, ensuring that credentials are not leaked into the process
accidentally.

Providing the right file descriptors is just half the problem. There
also needs to be a framework in place that gives meaning to these file
descriptors. How does a CloudABI mail server know which of the file
descriptors corresponds to the socket that receives incoming emails?
Furthermore, how will this mail server acquire its configuration
parameters, as it cannot open a configuration file from a global path on
disk?

CloudABI solves this problem by replacing traditional string command
line arguments by tree-like data structure consisting of scalars,
sequences and mappings (similar to YAML/JSON). In this structure, file
descriptors are treated as a first-class citizen. When calling exec(),
file descriptors are passed on to the new executable if and only if they
are referenced from this tree structure. See the cloudabi-run(1) man
page for more details and examples (sysutils/cloudabi-utils).

Fortunately, the kernel does not need to care about this tree structure
at all. The C library is responsible for serializing and deserializing,
but also for extracting the list of referenced file descriptors. The
system call only receives a copy of the serialized data and a layout of
what the new file descriptor table should look like:

    int proc_exec(int execfd, const void *data, size_t datalen, const int *fds,
              size_t fdslen);

This change introduces a set of fd*_remapped() functions:

- fdcopy_remapped() pulls a copy of a file descriptor table, remapping
  all of the file descriptors according to the provided mapping table.
- fdinstall_remapped() replaces the file descriptor table of the process
  by the copy created by fdcopy_remapped().
- fdescfree_remapped() frees the table in case we aborted before
  fdinstall_remapped().

We then add a function exec_copyin_data_fds() that builds on top these
functions. It copies in the data and constructs a new remapped file
descriptor. This is used by cloudabi_sys_proc_exec().

Test Plan:
cloudabi-run(1) is capable of spawning processes successfully, providing
it data and file descriptors. procstat -f seems to confirm all is good.
Regular FreeBSD processes also work properly.

Reviewers: kib, mjg

Reviewed By: mjg

Subscribers: imp

Differential Revision: https://reviews.freebsd.org/D3079
2015-07-16 07:05:42 +00:00
mjg
2ee8352407 exec: textvp -> oldtextvp; binvp -> newtextvp
This makes it consistent with the rest of the naming in do_execve.

No functional changes.
2015-07-14 01:13:37 +00:00
mjg
1fc8c9c24b exec plug a redundant vref + vrele of the image vnode 2015-07-14 00:43:08 +00:00
kib
ad2d1e43ad Do not calculate the stack's bottom address twice.
Submitted by:	Olivц╘r Pintц╘r
Review:	https://reviews.freebsd.org/D2953
MFC after:	1 week
2015-06-30 15:22:47 +00:00
glebius
519f1ccd36 Make KPI of vm_pager_get_pages() more strict: if a pager changes a page
in the requested array, then it is responsible for disposition of previous
page and is responsible for updating the entry in the requested array.
Now consumers of KPI do not need to re-lookup the pages after call to
vm_pager_get_pages().

Reviewed by:	kib
Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2015-06-12 11:32:20 +00:00
mjg
d7bc9285a6 Implement lockless resource limits.
Use the same scheme implemented to manage credentials.

Code needing to look at process's credentials (as opposed to thred's) is
provided with *_proc variants of relevant functions.

Places which possibly had to take the proc lock anyway still use the proc
pointer to access limits.
2015-06-10 10:48:12 +00:00
kib
f371322983 On exec, single-threading must be enforced before arguments space is
allocated from exec_map.  If many threads try to perform execve(2) in
parallel, the exec map is exhausted and some threads sleep
uninterruptible waiting for the map space.  Then, the thread which won
the race for the space allocation, cannot single-thread the process,
causing deadlock.

Reported and tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-05-10 09:00:40 +00:00
kib
fe18a7c9f2 Handle incorrect ELF images specifying size for PT_GNU_STACK not being
multiple of page size.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2015-04-23 11:27:21 +00:00
kib
b4924fbf83 Implement support for binary to requesting specific stack size for the
initial thread.  It is read by the ELF image activator as the virtual
size of the PT_GNU_STACK program header entry, and can be specified by
the linker option -z stack-size in newer binutils.

The soft RLIMIT_STACK is auto-increased if possible, to satisfy the
binary' request.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-04-15 08:13:53 +00:00
alc
b131a2abc8 Introduce vm_object_color() and use it in mmap(2) to set the color of
named objects to zero before the virtual address is selected.  Previously,
the color setting was delayed until after the virtual address was
selected.  In rtld, this delay effectively prevented the mapping of a
shared library's code section using superpages.  Now, for example, we see
the first 1 MB of libc's code on armv6 mapped by a superpage after we've
gotten through the initial cold misses that bring the first 1 MB of code
into memory.  (With the page clustering that we perform on read faults,
this happens quickly.)

Differential Revision:	https://reviews.freebsd.org/D2013
Reviewed by:	jhb, kib
Tested by:	Svatopluk Kraus (armv6)
MFC after:	6 weeks
2015-03-21 17:56:55 +00:00
mjg
054f9cab59 cred: add proc_set_cred helper
The goal here is to provide one place altering process credentials.

This eases debugging and opens up posibilities to do additional work when such
an action is performed.
2015-03-16 00:10:03 +00:00
kib
aa0ac99391 Add procctl(2) PROC_TRACE_CTL command to enable or disable debugger
attachment to the process.  Note that the command is not intended to
be a security measure, rather it is an obfuscation feature,
implemented for parity with other operating systems.

Discussed with:	jilles, rwatson
Man page fixes by:	rwatson
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-01-18 15:13:11 +00:00
kib
d41bf48327 Add facility to stop all userspace processes. The supposed use of the
feature is to quisce the system before suspend.

Stop is implemented by reusing the thread_single(9) with the special
mode SINGLE_ALLPROC.  SINGLE_ALLPROC differs from the existing
single-threading modes by allowing (requiring) caller to operate on
other process.  Interruptible sleeps for !TDF_SBDRY threads are
suspended like SIGSTOP does it, instead of aborting the sleep, like
SINGLE_NO_EXIT, to avoid spurious EINTRs on resume.

Provide debugging sysctl debug.stop_all_proc, which causes total stop
and suspends syncer, while waiting for variable reset for resume.  It
is used for debugging; should be removed after the real use of the
interface is added.

In collaboration with:	pho
Discussed with:	avg
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-12-13 16:18:29 +00:00
mjg
6b53d30f11 filedesc: fix missed comments about fdsetugidsafety
While here just note that both fdsetugidsafety and fdcheckstd take sleepable
locks.
2014-10-31 09:56:00 +00:00
kib
ad7bf17db7 Replace some calls to fuword() by fueword() with proper error checking.
Sponsored by:	The FreeBSD Foundation
Tested by:	pho
MFC after:	3 weeks
2014-10-28 15:28:20 +00:00
mjg
2ebe66c290 filedesc: cleanup setugidsafety a little
Rename it to fdsetugidsafety for consistency with other functions.

There is no need to take filedesc lock if not closing any files.

The loop has to verify each file and we are guaranteed fdtable has space
for at least 20 fds. As such there is no need to check fd_lastfile.

While here tidy up is_unsafe.
2014-10-22 00:23:43 +00:00
mjg
a21afe6138 Plug unnecessary binvp NULL initialization and test.
Reported by: Coverity
CID: 1018889
2014-10-20 22:52:15 +00:00
mjg
24150776bd Use bzero instead of explicitly zeroing stuff in do_execve.
While strictly speaking this is not correct since some fields are pointers,
it makes no difference on all supported archs and we already rely on it doing
the right thing in other places.

No functional changes.
2014-09-29 23:59:19 +00:00
kib
065b68902b If vm_page_grab() allocates a new page, the page is not inserted into
page queue even when the allocation is not wired.  It is
responsibility of the vm_page_grab() caller to ensure that the page
does not end on the vm_object queue but not on the pagedaemon queue,
which would effectively create unpageable unwired page.

In exec_map_first_page() and vm_imgact_hold_page(), activate the page
immediately after unbusying it, to avoid leak.

In the uiomove_object_page(), deactivate page before the object is
unlocked.  There is no leak, since the page is deactivated after
uiomove_fromphys() finished.  But allowing non-queued non-wired page
in the unlocked object queue makes it impossible to assert that leak
does not happen in other places.

Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-08-13 05:44:08 +00:00
mjg
b55267950c Plug p_pptr null test in do_execve. It is always true. 2014-07-14 22:40:46 +00:00
mjg
8a494537a4 Don't call crdup nor uifind under vnode lock.
A locked vnode can get into the way of satisyfing malloc with M_WATOK.

This is a fixup to r268087.

Suggested by:	kib
MFC after:	1 week
2014-07-07 14:03:30 +00:00
marcel
9f28abd980 Remove ia64.
This includes:
o   All directories named *ia64*
o   All files named *ia64*
o   All ia64-specific code guarded by __ia64__
o   All ia64-specific makefile logic
o   Mention of ia64 in comments and documentation

This excludes:
o   Everything under contrib/
o   Everything under crypto/
o   sys/xen/interface
o   sys/sys/elf_common.h

Discussed at: BSDcan
2014-07-07 00:27:09 +00:00
mjg
3820ac174b Plug gcc warning after r268074 about unitialized newsigacts
Reported by:	Gary Jennejohn <gljennjohn gmail.com>
2014-07-02 05:45:40 +00:00
mjg
24f1528d0b Don't call crcopysafe or uifind unnecessarily in execve.
MFC after:	1 week
2014-07-01 09:21:32 +00:00
mjg
cd12aa0ab7 Perform a lockless check in sigacts_shared.
It is used only during execve (i.e. singlethreaded), so there is no fear
of returning 'not shared' which soon becomes 'shared'.

While here reorganize the code a little to avoid proc lock/unlock in
shared case.

MFC after:	1 week
2014-07-01 06:29:15 +00:00
mjg
3bf95dde7c Call fdcloseexec right after fdunshare.
No functional changes.

MFC after:	1 week
2014-06-28 05:51:45 +00:00