Commit Graph

451 Commits

Author SHA1 Message Date
rwatson
33fdc14c0c Update kernel inclusions of capability.h to use capsicum.h instead; some
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.

MFC after:	3 weeks
2014-03-16 10:55:57 +00:00
pjd
4ac2e7d8d9 Make process descriptors standard part of the kernel. rwhod(8) already
requires process descriptors to work and having PROCDESC in GENERIC
seems not enough, especially that we hope to have more and more consumers
in the base.

MFC after:	3 days
2013-11-30 15:08:35 +00:00
avg
71889a5eff dtrace sdt: remove the ugly sname parameter of SDT_PROBE_DEFINE
In its stead use the Solaris / illumos approach of emulating '-' (dash)
in probe names with '__' (two consecutive underscores).

Reviewed by:	markj
MFC after:	3 weeks
2013-11-26 08:46:27 +00:00
attilio
7ee4e910ce - For kernel compiled only with KDTRACE_HOOKS and not any lock debugging
option, unbreak the lock tracing release semantic by embedding
  calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
  version of the releasing functions for mutex, rwlock and sxlock.
  Failing to do so skips the lockstat_probe_func invokation for
  unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
  kernel compiled without lock debugging options, potentially every
  consumer must be compiled including opt_kdtrace.h.
  Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
  dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
  is linked there and it is only used as a compile-time stub [0].

[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested.  As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while.  Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].

Sponsored by:	EMC / Isilon storage division
Discussed with:	rstone
[0] Reported by:	rstone
[1] Discussed with:	philip
2013-11-25 07:38:45 +00:00
jilles
10685f105c Fix siginfo_t.si_status for wait6/waitid/SIGCHLD.
Per POSIX, si_status should contain the value passed to exit() for
si_code==CLD_EXITED and the signal number for other si_code. This was
incorrect for CLD_EXITED and CLD_DUMPED.

This is still not fully POSIX-compliant (Austin group issue #594 says that
the full value passed to exit() shall be returned via si_status, not just
the low 8 bits) but is sufficient for a si_status-related test in libnih
(upstart, Debian/kFreeBSD).

PR:		kern/184002
Reported by:	Dmitrijs Ledkovs
Tested by:	Dmitrijs Ledkovs
2013-11-17 22:31:23 +00:00
pjd
029a6f5d92 Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.

The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.

The structure definition looks like this:

	struct cap_rights {
		uint64_t	cr_rights[CAP_RIGHTS_VERSION + 2];
	};

The initial CAP_RIGHTS_VERSION is 0.

The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.

The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.

To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.

	#define	CAP_PDKILL	CAPRIGHT(1, 0x0000000000000800ULL)

We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:

	#define	CAP_LOOKUP	CAPRIGHT(0, 0x0000000000000400ULL)
	#define	CAP_FCHMOD	CAPRIGHT(0, 0x0000000000002000ULL)

	#define	CAP_FCHMODAT	(CAP_FCHMOD | CAP_LOOKUP)

There is new API to manage the new cap_rights_t structure:

	cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
	void cap_rights_set(cap_rights_t *rights, ...);
	void cap_rights_clear(cap_rights_t *rights, ...);
	bool cap_rights_is_set(const cap_rights_t *rights, ...);

	bool cap_rights_is_valid(const cap_rights_t *rights);
	void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
	void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
	bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);

Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:

	cap_rights_t rights;

	cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);

There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:

	#define	cap_rights_set(rights, ...)				\
		__cap_rights_set((rights), __VA_ARGS__, 0ULL)
	void __cap_rights_set(cap_rights_t *rights, ...);

Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:

	cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);

Providing several rights that belongs to the same array's element this way is
correct, but is not advised. It should only be used for aliases definition.

This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.

Sponsored by:	The FreeBSD Foundation
2013-09-05 00:09:56 +00:00
markj
6c24c9fb32 Specify SDT probe argument types in the probe definition itself rather than
using SDT_PROBE_ARGTYPE(). This will make it easy to extend the SDT(9) API
to allow probes with dynamically-translated types.

There is no functional change.

MFC after:	2 weeks
2013-08-15 04:08:55 +00:00
mjg
de1e379f48 Remove duplicate assertion from tdsendsignal.
MFC after:	2 weeks
2013-07-22 00:44:37 +00:00
glebius
49b5e1ded6 Fix memory leak in coredump().
Reviewed by:	kib
2013-04-05 20:24:51 +00:00
jhb
8604015a2e Tweak some comments. 2013-03-18 18:04:09 +00:00
jhb
8b099870ed Partially revert r195702. Deferring stops is now implemented via a set of
calls to toggle TDF_SBDRY rather than passing PBDRY to individual sleep
calls.
- Remove the stop_allowed parameters from cursig() and issignal().
  issignal() checks TDF_SBDRY directly.
- Remove the PBDRY and SLEEPQ_STOP_ON_BDRY flags.
2013-03-18 17:23:58 +00:00
jhb
ca1e2e0739 Further refine the handling of stop signals in the NFS client. The
changes in r246417 were incomplete as they did not add explicit calls to
sigdeferstop() around all the places that previously passed SBDRY to
_sleep().  In addition, nfs_getcacheblk() could trigger a write RPC from
getblk() resulting in sigdeferstop() recursing.  Rather than manually
deferring stop signals in specific places, change the VFS_*() and VOP_*()
methods to defer stop signals for filesystems which request this behavior
via a new VFCF_SBDRY flag.  Note that this has to be a VFC flag rather than
a MNTK flag so that it works properly with VFS_MOUNT() when the mount is
not yet fully constructed.  For now, only the NFS clients are set this new
flag in VFS_SET().

A few other related changes:
- Add an assertion to ensure that TDF_SBDRY doesn't leak to userland.
- When a lookup request uses VOP_READLINK() to follow a symlink, mark
  the request as being on behalf of the thread performing the lookup
  (cnp_thread) rather than using a NULL thread pointer.  This causes
  NFS to properly handle signals during this VOP on an interruptible
  mount.

PR:		kern/176179
Reported by:	Russell Cattelan (sigdeferstop() recursion)
Reviewed by:	kib
MFC after:	1 month
2013-02-21 19:02:50 +00:00
pjd
30ca80d1fc Add break to the default case. 2013-02-17 11:47:58 +00:00
kib
de5bb93ec1 When vforked child is traced, the debugging events are not generated
until child performs exec().  The behaviour is reasonable when a
debugger is the real parent, because the parent is stopped until
exec(), and sending a debugging event to the debugger would deadlock
both parent and child.

On the other hand, when debugger is not the parent of the vforked
child, not sending debugging signals makes it impossible to debug
across vfork.

Fix the issue by declining generating debug signals only when vfork()
was done and child called ptrace(PT_TRACEME).  Set a new process flag
P_PPTRACE from the attach code for PT_TRACEME, if P_PPWAIT flag is
set, which indicates that the process was created with vfork() and
still did not execed. Check P_PPTRACE from issignal(), instead of
refusing the trace outright for the P_PPWAIT case.  The scope of
P_PPTRACE is exactly contained in the scope of P_PPWAIT.

Found and tested by:  zont
Reviewed by:	pluknet
MFC after:	2 weeks
2013-02-07 15:34:22 +00:00
jhb
0fee3f66b8 Rework the handling of stop signals in the NFS client. The changes in
195702, 195703, and 195821 prevented a thread from suspending while holding
locks inside of NFS by forcing the thread to fail sleeps with EINTR or
ERESTART but defer the thread suspension to the user boundary.  However,
this had the effect that stopping a process during an NFS request could
abort the request and trigger EINTR errors that were visible to userland
processes (previously the thread would have suspended and completed the
request once it was resumed).

This change instead effectively masks stop signals while in the NFS client.
It uses the existing TDF_SBDRY flag to effect this since SIGSTOP cannot
be masked directly.  Also, instead of setting PBDRY on individual sleeps,
the NFS client now sets the TDF_SBDRY flag around each NFS request and
stop signals are masked for all sleeps during that region (the previous
change missed sleeps in lockmgr locks).  The end result is that stop
signals sent to threads performing an NFS request are completely
ignored until after the NFS request has finished processing and the
thread prepares to return to userland.  This restores the behavior of
stop signals being transparent to userland processes while still
preventing threads from suspending while holding NFS locks.

Reviewed by:	kib
MFC after:	1 month
2013-02-06 17:06:51 +00:00
pjd
f4da2a6634 Replace expand_name() function with corefile_open() function, which not
only returns name, but also vnode of corefile to use.

This simplifies the code and closes few races, especially in %I handling.

Reviewed by:	kib
Obtained from:	WHEEL Systems
2012-12-19 23:59:48 +00:00
pjd
15927a4dbb Use correct file permissions when looking for available core file if
kern.corefile contains %I.

Obtained from:	WHEEL Systems
2012-12-19 23:40:02 +00:00
pjd
ab9dc4d86b The 'flags' argument can be modified in vn_open_cred(), so we need to
set it for every loop interation.

Pointed out by:	kib
2012-12-19 12:14:08 +00:00
pjd
7e5e69b41a Do not audit paths we try when kern.corefile contains %I.
Obtained from:	WHEEL Systems
2012-12-19 12:12:53 +00:00
pjd
a03815156a Style cleanups. 2012-12-19 12:10:14 +00:00
pjd
8c17bcbd37 The expand_name() function isn't called with the process lock held anymore,
so we can safely use malloc(M_WAITOK) now.

Pointed out by:	kib
2012-12-19 12:00:09 +00:00
pjd
dbdae53685 Minor style tweaks.
Obtained from:	WHEEL Systems
2012-12-17 10:51:22 +00:00
pjd
ec8b5e0006 Better variables naming in expand_name() to be more consistent with coredump().
Obtained from:	WHEEL Systems
2012-12-17 10:48:10 +00:00
pjd
3e4021fec3 Move expand_name() after process lock is released.
This fixed panic where we hold mutex (process lock) and try to obtain sleepable
lock (vnode lock in expand_name()). The panic could occur when %I was used
in kern.corefile.

Additionally we avoid expand_name() overhead when coredumps are disabled.

Obtained from:	WHEEL Systems
2012-12-16 14:53:27 +00:00
pjd
a198747e4a Don't add audit record when coredumps are disabled or name cannot be expanded.
Discussed with:	rwatson
Obtained from:	WHEEL Systems
2012-12-16 14:24:59 +00:00
pjd
c6ce471b73 Make the check easier to read.
Obtained from:	WHEEL Systems
2012-12-16 14:14:18 +00:00
pjd
b6b2ae8e26 Use 'cred' variable.
Obtained from:	WHEEL Systems
2012-12-16 13:56:38 +00:00
pjd
e4de9f38a2 Add kern.capmode_coredump sysctl/tunable to allow processes in capability mode
to dump core.

Reviewed by:	rwatson
Obtained from:	WHEEL Systems
MFC after:	2 weeks
2012-11-27 10:38:11 +00:00
pjd
c079174e14 Allow to use kill(2) in capability mode, but process can send a signal only
to himself. For example abort(3) at first tries to do kill(getpid(), SIGABRT)
which was failing in capability mode, so the code was failing back to exit(1).

Reviewed by:	rwatson
Obtained from:	WHEEL Systems
MFC after:	2 weeks
2012-11-27 10:22:40 +00:00
pjd
02c0badfc1 Allow to modify kern.sugid_coredump and kern.corefile from loader.conf.
Obtained from:	WHEEL Systems
2012-11-27 10:16:48 +00:00
pjd
8dbfcd9003 More style fixes. 2012-11-27 10:15:58 +00:00
pjd
0b5aef9e2b Style fixes (mostly whitespaces). 2012-11-27 10:11:54 +00:00
kib
560aa751e0 Remove the support for using non-mpsafe filesystem modules.
In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.

Conducted and reviewed by:	attilio
Tested by:	pho
2012-10-22 17:50:54 +00:00
eadler
24cc54b983 Correct the killpg(2) return values:
Return EPERM if processes were found but they
were unable to be signaled.

Return the first error from p_cansignal if no signal was successful.

Reviewed by:	jilles
Approved by:	cperciva
MFC after:	1 week
2012-10-22 03:43:02 +00:00
eadler
f38062a582 Colin acked the wrong diff originally. fixed version coming soon.
Approved by:	cperciva (implicit)
2012-10-22 03:36:44 +00:00
eadler
f48bd672c6 Correct the killpg(2) return values:
Return EPERM if processes were found but they
were unable to be signaled.

Return the first error from p_cansignal if no signal was successful.

Reviewed by:	jilles
Approved by:	cperciva
MFC after:	1 week
2012-10-22 03:34:43 +00:00
jhb
1e4174de1a Ignore stop and continue signals sent to an exiting process. Stop signals
set p_xstat to the signal that triggered the stop, but p_xstat is also
used to hold the exit status of an exiting process.  Without this change,
a stop signal that arrived after a process was marked P_WEXIT but before
it was marked a zombie would overwrite the exit status with the stop signal
number.

Reviewed by:	kib
MFC after:	1 week
2012-09-13 15:51:18 +00:00
kib
4382478fb0 Deliver SIGSYS to the guilty thread, not to the process.
MFC after:	1 week
2012-08-18 18:17:10 +00:00
davidxu
affa007492 Always clear p_xthread if current thread no longer needs it, in theory, if
debugger exited without calling ptrace(PT_DETACH), there is a time window
that the p_xthread may be pointing to non-existing thread, in practical,
this is not a problem because child process soon will be killed by parent
process.
2012-07-10 05:45:13 +00:00
kib
319ab382ef Add thread-private flag to indicate that error value is already placed
in td_errno. Flag is supposed to be used by syscalls returning
EJUSTRETURN because errno was already placed into the usermode frame
by a call to set_syscall_retval(9). Both ktrace and dtrace get errno
value from td_errno if the flag is set.

Use the flag to fix sigsuspend(2) error return ktrace records.

Requested by:	bde
MFC after:	1 week
2012-04-12 10:48:43 +00:00
jilles
4360dc9ca8 Remove unused and wrong SA_PROC internal signal property.
The SA_PROC signal property indicated whether each signal number is directed
at a specific thread or at the process in general. However, that depends on
how the signal was generated and not on the signal number. SA_PROC was not
used.
2012-04-09 21:58:58 +00:00
ed
0c56cf839d Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.
The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.
2011-11-07 15:43:11 +00:00
pluknet
f9aa1bdb23 Fix arguments list for proc:::signal-discard DTrace probe.
Reported by:	Anton Yuzhaninov <citrin citrin ru>
MFC after:	1 week
2011-10-28 15:22:51 +00:00
kib
71b9169fe9 The sigwait(3) function shall not return EINTR, according to the
POSIX/SUSvN. The sigwait(2) syscall does return EINTR, and libc.so.7
contains the wrapper sigwait(3) which hides EINTR from callers.  The
EINTR return is used by libthr to handle required cancellation point
in the sigwait(3).

To help the binaries linked against pre-libc.so.7, i.e. RELENG_6 and
earlier, to have right ABI for sigwait(3), transform EINTR return from
sigwait(2) into ERESTART.

Discussed with:	davidxu
MFC after:	1 week
2011-10-01 10:18:55 +00:00
kmacy
99851f359e In order to maximize the re-usability of kernel code in user space this
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.

Reviewed by:	rwatson
Approved by:	re (bz)
2011-09-16 13:58:51 +00:00
jonathan
5ecd1c9d40 Add experimental support for process descriptors
A "process descriptor" file descriptor is used to manage processes
without using the PID namespace. This is required for Capsicum's
Capability Mode, where the PID namespace is unavailable.

New system calls pdfork(2) and pdkill(2) offer the functional equivalents
of fork(2) and kill(2). pdgetpid(2) allows querying the PID of the remote
process for debugging purposes. The currently-unimplemented pdwait(2) will,
in the future, allow querying rusage/exit status. In the interim, poll(2)
may be used to check (and wait for) process termination.

When a process is referenced by a process descriptor, it does not issue
SIGCHLD to the parent, making it suitable for use in libraries---a common
scenario when using library compartmentalisation from within large
applications (such as web browsers). Some observers may note a similarity
to Mach task ports; process descriptors provide a subset of this behaviour,
but in a UNIX style.

This feature is enabled by "options PROCDESC", but as with several other
Capsicum kernel features, is not enabled by default in GENERIC 9.0.

Reviewed by: jhb, kib
Approved by: re (kib), mentor (rwatson)
Sponsored by: Google Inc
2011-08-18 22:51:30 +00:00
trasz
6a13eaa4d1 Fix support for RACCT_CORE by merging forgotten file. 2011-05-26 18:54:07 +00:00
jilles
a860eeb5be ktrace: Log the code for all signals (PSIG events).
The code provides information on how the signal was generated.

Formerly, the code was only logged for traps, much like only signal handlers
for traps received a meaningful si_code before FreeBSD 7.0.

In rare cases, no information is available and 0 is still logged.

MFC after:	1 week
2011-04-17 14:38:11 +00:00
jhb
96b1d8b6d7 Fix several places to ignore processes that are not yet fully constructed.
MFC after:	1 week
2011-04-06 17:47:22 +00:00
jhb
df83f2ffe5 Small style fix. 2011-03-23 13:44:32 +00:00