Commit Graph

905 Commits

Author SHA1 Message Date
David Bright
2d5603fe65 Jail and capability mode for shm_rename; add audit support for shm_rename
Co-mingling two things here:

  * Addressing some feedback from Konstantin and Kyle re: jail,
    capability mode, and a few other things
  * Adding audit support as promised.

The audit support change includes a partial refresh of OpenBSM from
upstream, where the change to add shm_rename has already been
accepted. Matthew doesn't plan to work on refreshing anything else to
support audit for those new event types.

Submitted by:	Matthew Bryan <matthew.bryan@isilon.com>
Reviewed by:	kib
Relnotes:	Yes
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D22083
2019-11-18 13:31:16 +00:00
Doug Moore
7cdcf86360 Define wrapper functions vm_map_entry_{succ,pred} to act as wrappers
around entry->{next,prev} when those are used for ordered list
traversal, and use those wrapper functions everywhere. Where the next
field is used for maintaining a stack of deferred operations, #define
defer_next to make that different usage clearer, and then use the
'right' pointer instead of 'next' for that purpose.

Approved by: markj
Tested by: pho (as part of a larger patch)
Differential Revision: https://reviews.freebsd.org/D22347
2019-11-13 15:56:07 +00:00
Doug Moore
2288078c5e Define macro VM_MAP_ENTRY_FOREACH for enumerating the entries in a vm_map.
In case the implementation ever changes from using a chain of next pointers,
then changing the macro definition will be necessary, but changing all the
files that iterate over vm_map entries will not.

Drop a counter in vm_object.c that would have an effect only if the
vm_map entry count was wrong.

Discussed with: alc
Reviewed by: markj
Tested by: pho (earlier version)
Differential Revision:	https://reviews.freebsd.org/D21882
2019-10-08 07:14:21 +00:00
Doug Moore
83ea714f4f vm_map_simplify_entry considers merging an entry with its two
neighbors, and is used in a way so that if entries a and b cannot be
merged, we consider them twice, first not-merging a with its successor
b, and then not-merging b with its predecessor a. This change replaces
vm_map_simplify_entry with vm_map_try_merge_entries, which compares
two adjacent entries only, and uses it to avoid duplicated
merge-checks.

Tested by: pho
Reviewed by: alc
Approved by: markj (implicit)
Differential Revision: https://reviews.freebsd.org/D20814
2019-08-25 07:06:51 +00:00
Marcin Wojtas
e7c5d9d3f2 Fix mac_veriexec_parser build after r347938
In r347938 the definition of mac_veriexec_metadata_add_file
so adjust the argument list accordingly.

Submitted by: Kornel Duleba <mindal@semihalf.com>
2019-08-08 16:51:49 +00:00
Conrad Meyer
e2e050c8ef Extract eventfilter declarations to sys/_eventfilter.h
This allows replacing "sys/eventfilter.h" includes with "sys/_eventfilter.h"
in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header
pollution substantially.

EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c
files into appropriate headers (e.g., sys/proc.h, powernv/opal.h).

As a side effect of reduced header pollution, many .c files and headers no
longer contain needed definitions.  The remainder of the patch addresses
adding appropriate includes to fix those files.

LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by
sys/mutex.h since r326106 (but silently protected by header pollution prior
to this change).

No functional change (intended).  Of course, any out of tree modules that
relied on header pollution for sys/eventhandler.h, sys/lock.h, or
sys/mutex.h inclusion need to be fixed.  __FreeBSD_version has been bumped.
2019-05-20 00:38:23 +00:00
Stephen J. Kiernan
942886743b Add a new ioctl for the larger params struct that includes the label.
We need to make the find_veriexec_file() function available publicly, so
rename it to mac_veriexec_metadata_find_file_info() and make it non-static.

Bump the version of the veriexec device interface so user space will know
the labelized version of fingerprint loading is available.

Approved by:	sjg
Obtained from:	Juniper Networks, Inc.
Differential Revision:	https://reviews.freebsd.org/D20295
2019-05-17 19:27:07 +00:00
Stephen J. Kiernan
6cbc970317 Obtain a shared lock instead of exclusive in the MAC/veriexec
MAC_VERIEXEC_CHECK_PATH_SYSCALL per-MAC policy system call.

When we are checking the status of the fingerprint on a vnode using the
per-MAC-policy syscall, we do not need an exclusive lock on the vnode.

Even if there is more than one thread requesting the status at the same time,
the worst we can end up doing is processing the file more than once.

This can potentially be improved in the future with offloading the fingerprint
evaluation to a separate thread and blocking until the update completes. But
for now the race is acceptable.

Obtained from:	Juniper Networks, Inc.
MFC after:	1 week
2019-05-17 18:13:43 +00:00
Stephen J. Kiernan
ed377cf415 sysctls which should be restricted when securelevel is raised should also
be restricted when veriexec is enforced.

Add mpo_system_check_sysctl method to mac_veriexec which does this.

Obtained from:	Juniper Networks, Inc.
MFC after:	1 week
2019-05-17 18:09:48 +00:00
Stephen J. Kiernan
3d53cd0fbb Fix format strings for some debug messages that could have arguments that
are different types across architectures by using %ju and typecasting to
uintmax_t, where appropriate.

Obtained from:	Juniper Networks, Inc.
MFC after:	1 week
2019-05-17 18:06:24 +00:00
Stephen J. Kiernan
3da3012ace Ensure we have obtained a lock on the process before calling
mac_veriexec_get_executable_flags(). Only try locking/unlocking if the caller
has not already acquired the process lock.

Obtained from:	Juniper Networks, Inc.
MFC after:	1 week
2019-05-17 17:50:01 +00:00
Robert Watson
5c95417dad When MAC is enabled and a policy module is loaded, don't unconditionally
lock mac_ifnet_mtx, which protects labels on struct ifnet, unless at least
one policy is actively using labels on ifnets.  This avoids a global mutex
acquire in certain fast paths -- most noticeably ifnet transmit.  This was
previously invisible by default, as no MAC policies were loaded by default,
but recently became visible due to mac_ntpd being enabled by default.

gallatin@ reports a reduction in PPS overhead from 300% to 2.2% with this
change.  We will want to explore further MAC Framework optimisation to
reduce overhead further, but this brings things more back into the world
of the sane.

MFC after:	3 days
2019-05-03 20:38:43 +00:00
Marcin Wojtas
b0fefb25c5 Create kernel module to parse Veriexec manifest based on envs
The current approach of injecting manifest into mac_veriexec is to
verify the integrity of it in userspace (veriexec (8)) and pass its
entries into kernel using a char device (/dev/veriexec).
This requires verifying root partition integrity in loader,
for example by using memory disk and checking its hash.
Otherwise if rootfs is compromised an attacker could inject their own data.

This patch introduces an option to parse manifest in kernel based on envs.
The loader sets manifest path and digest.
EVENTHANDLER is used to launch the module right after the rootfs is mounted.
It has to be done this way, since one might want to verify integrity of the init file.
This means that manifest is required to be present on the root partition.
Note that the envs have to be set right before boot to make sure that no one can spoof them.

Submitted by: Kornel Duleba <mindal@semihalf.com>
Reviewed by: sjg
Obtained from: Semihalf
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D19281
2019-04-03 03:57:37 +00:00
Kirk McKusick
88640c0e8b Create new EINTEGRITY error with message "Integrity check failed".
An integrity check such as a check-hash or a cross-correlation failed.
The integrity error falls between EINVAL that identifies errors in
parameters to a system call and EIO that identifies errors with the
underlying storage media. EINTEGRITY is typically raised by intermediate
kernel layers such as a filesystem or an in-kernel GEOM subsystem when
they detect inconsistencies. Uses include allowing the mount(8) command
to return a different exit value to automate the running of fsck(8)
during a system boot.

These changes make no use of the new error, they just add it. Later
commits will be made for the use of the new error number and it will
be added to additional manual pages as appropriate.

Reviewed by:    gnn, dim, brueffer, imp
Discussed with: kib, cem, emaste, ed, jilles
Differential Revision: https://reviews.freebsd.org/D18765
2019-01-17 06:35:45 +00:00
Mateusz Guzik
6dcf45feda mac: reduce pessimization of sdt probe handling
Prior to the change the code would branch on return value and then check
if probes are enabled. Since vast majority of the time they are not, this
is clearly wasteful. Check probes first.

Sponsored by:	The FreeBSD Foundation
2018-12-19 22:30:26 +00:00
Mateusz Guzik
cc426dd319 Remove unused argument to priv_check_cred.
Patch mostly generated with cocinnelle:

@@
expression E1,E2;
@@

- priv_check_cred(E1,E2,0)
+ priv_check_cred(E1,E2)

Sponsored by:	The FreeBSD Foundation
2018-12-11 19:32:16 +00:00
Mateusz Guzik
e8451da5e8 audi: replace open-coded TDP_AUDITREC checks with the macro
Sponsored by:	The FreeBSD Foundation
2018-12-11 17:14:12 +00:00
Mateusz Guzik
c0282e1e07 audit: predict AUDITING_TD as false
By default it is compiled in and disabled.

Sponsored by:	The FreeBSD Foundation
2018-11-29 09:19:48 +00:00
Mateusz Guzik
159dcc30a5 audit: change audit_syscalls_enabled type to bool
So that it fits better in __read_frequently.

Sponsored by:	The FreeBSD Foundation
2018-11-29 08:37:33 +00:00
Brooks Davis
12e69f96a2 Add const to input-only char * arguments.
These arguments are mostly paths handled by NAMEI*() macros which already
take const char * arguments.

This change improves the match between syscalls.master and the public
declerations of system calls.

Reviewed by:	kib (prior version)
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D17812
2018-11-02 20:50:22 +00:00
Robert Watson
2ddefb6d5d Rework the logic around quick checks for auditing that take place at
system-call entry and whenever audit arguments or return values are
captured:

1. Expose a single global, audit_syscalls_enabled, which controls
   whether the audit framework is entered, rather than exposing
   components of the policy -- e.g., if the trail is enabled,
   suspended, etc.

2. Introduce a new function audit_syscalls_enabled_update(), which is
   called to update audit_syscalls_enabled whenever an aspect of the
   policy changes, so that the value can be updated.

3. Remove a check of trail enablement/suspension from audit_new() --
   at the point where this function has been entered, we believe that
   system-call auditing is already in force, or we wouldn't get here,
   so simply proceed to more expensive policy checks.

4. Use an audit-provided global, audit_dtrace_enabled, rather than a
   dtaudit-provided global, to provide policy indicating whether
   dtaudit would like system calls to be audited.

5. Do some minor cosmetic renaming to clarify what various variables
   are for.

These changes collectively arrange it so that traditional audit
(trail, pipes) or the DTrace audit provider can enable system-call
probes without the other configured.  Otherwise, dtaudit cannot
capture system-call data without auditd(8) started.

Reviewed by:		gnn
Sponsored by:		DARPA, AFRL
Approved by:		re (gjb)
Differential Revision:	https://reviews.freebsd.org/D17348
2018-10-02 15:58:17 +00:00
Robert Watson
deea362c80 The kernel DTrace audit provider (dtaudit) relies on auditd(8) to load
/etc/security/audit_event to provide a list of audit event-number <->
name mappings.  However, this occurs too late for anonymous tracing.
With this change, adding 'audit_event_load="YES"' to /boot/loader.conf
will cause the boot loader to preload the file, and then the kernel
audit code will parse it to register an initial set of audit event-number
<-> name mappings.  Those mappings can later be updated by auditd(8) if
the configuration file changes.

Reviewed by:	gnn, asomers, markj, allanjude
Discussed with:	jhb
Approved by:	re (kib)
MFC after:	1 week
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D16589
2018-09-03 14:26:43 +00:00
Mark Johnston
6324de037c Require that MAC label buffers be able to store a non-empty string.
The buffer size may be used to initialize an sbuf in
MAC_POLICY_EXTERNALIZE, and without this constraint it's possible to
trigger an assertion failure in the sbuf code.  With INVARIANTS
disabled, the first attempt to write to the sbuf will fail.

Reported by:	pho
Reviewed by:	delphij
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D16527
2018-08-01 03:46:07 +00:00
Andriy Gapon
dc8240f0da fix incorrect operator in the AUDITPIPE_SET_QLIMIT bounds check
PR:		229983
Submitted by:	Aniket Pandey <aniketp@iitk.ac.in>
Reported by:	Aniket Pandey <aniketp@iitk.ac.in>
MFC after:	1 week
2018-07-23 16:56:49 +00:00
Alan Somers
12395dc9f6 Fix audit of chflagsat, lgetfh, and setfib
These syscalls were always supposed to have been auditted, but due to
oversights never were.

PR:		228374
Reported by:	aniketp
Reviewed by:	aniketp
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D16388
2018-07-22 14:11:52 +00:00
Ian Lepore
3496c981ac Make it possible to run ntpd as a non-root user, add ntpd uid and gid.
Code analysis and runtime analysis using truss(8) indicate that the only
privileged operations performed by ntpd are adjusting system time, and
(re-)binding to privileged UDP port 123. These changes add a new mac(4)
policy module, mac_ntpd(4), which grants just those privileges to any
process running with uid 123.

This also adds a new user and group, ntpd:ntpd, (uid:gid 123:123), and makes
them the owner of the /var/db/ntp directory, so that it can be used as a
location where the non-privileged daemon can write files such as the
driftfile, and any optional logfile or stats files.

Because there are so many ways to configure ntpd, the question of how to
configure it to run without root privs can be a bit complex, so that will be
addressed in a separate commit. These changes are just what's required to
grant the limited subset of privs to ntpd, and the small change to ntpd to
prevent it from exiting with an error if running as non-root.

Differential Revision:	https://reviews.freebsd.org/D16281
2018-07-19 23:55:29 +00:00
Alan Somers
f7c188aeeb auditon(2): fix A_SETPOLICY with 64-bit values
A_SETPOLICY is supposed to work with either 64 or 32-bit values, but due to a
typo the 64-bit version has never worked correctly.

Submitted by:	aniketp
Reviewed by:	asomers, cem
MFC after:	2 weeks
Sponsored by:	Google, Inc. (GSoC 2018)
Differential Revision:	https://reviews.freebsd.org/D16222
2018-07-15 21:10:19 +00:00
Stephen J. Kiernan
ade9788665 Add mpo_vnode_check_setmode MAC method to MAC/veriexec.
In the method, disallow changing SUID/SGID on verified files.

Obtained from:	Juniper Networks, Inc.
2018-07-14 17:21:16 +00:00
Stephen J. Kiernan
1db017d006 Fix a typo which could cause a build breakage when building with MAC/veriexec
enabled in the kernel config.

Remove unused mac_veriexec_print_db prototype in internal header file.
2018-07-14 17:15:28 +00:00
Stephen J. Kiernan
38d5d2d53b Remove RIPEMD-160 fingerprint modules for veriexec, since it has very
little practical use and would not be recommended for anyone to use in
a production environment.

Reviewed by:	sjg
2018-07-14 16:59:17 +00:00
Stephen J. Kiernan
593d37bb8e Fix build breakage in veriexec for 32-bit architectures.
fsid_t and ino_t are 64-bit entities, use uintmax_t typecast to ensure we
can print it on 32-bit or 64-bit architectures by using the %ju format for
prints.

Obtained from:	Juniper Networks, Inc.
2018-06-20 06:54:38 +00:00
Stephen J. Kiernan
fb47a3769c MAC/veriexec implements a verified execution environment using the MAC
framework.

The code is organized into a few distinct pieces:

* The meta-data store (in veriexec_metadata.c) which maps a file system
  identifier, file identifier, and generation key tuple to veriexec
  meta-data record.

* Fingerprint management (in veriexec_fingerprint.c) which deals with
  calculating the cryptographic hash for a file and verifying it. It also
  manages the loadable fingerprint modules.

* MAC policy implementation (in mac_veriexec.c) which implements the
  following MAC methods:

mpo_init
  Initializes the veriexec state, meta-data store, fingerprint modules,
  and registers mount and unmount EVENTHANDLERs

mpo_syscall
  Implements the following per-policy system calls:
  MAC_VERIEXEC_CHECK_FD_SYSCALL
    Check a file descriptor to see if the referenced file has a valid
    fingerprint.
  MAC_VERIEXEC_CHECK_PATH_SYSCALL
    Check a path to see if the referenced file has a valid fingerprint.

mpo_kld_check_load
  Check if loading a kld is allowed. This checks if the referenced vnode
  has a valid fingerprint.

mpo_mount_destroy_label
  Clears the veriexec slot data in a mount point label.

mpo_mount_init_label
  Initializes the veriexec slot data in a mount point label.
  The file system identifier is saved in the veriexec slot data.

mpo_priv_check
  Check if a process is allowed to write to /dev/kmem and /dev/mem
  devices.
  If a process is flagged as trusted, it is allowed to write.

mpo_proc_check_debug
  Check if a process is allowed to be debugged. If a process is not
  flagged with VERIEXEC_NOTRACE, then debugging is allowed.

mpo_vnode_check_exec
  Check is an exectuable is allowed to run. If veriexec is not enforcing
  or the executable has a valid fingerprint, then it is allowed to run.
  NOTE: veriexec will complain about mismatched fingerprints if it is
  active, regardless of the state of the enforcement.

mpo_vnode_check_open
  Check is a file is allowed to be opened. If verification was not
  requested, veriexec is not enforcing, or the file has a valid
  fingerprint, then veriexec will allow the file to be opened.

mpo_vnode_copy_label
  Copies the veriexec slot data from one label to another.

mpo_vnode_destroy_label
  Clears the veriexec slot data in a vnode label.

mpo_vnode_init_label
  Initializes the veriexec slot data in a vnode label.
  The fingerprint status for the file is stored in the veriexec slot data.

* Some sysctls, under security.mac.veriexec, for setting debug level,
  fetching the current state in a human-readable form, and dumping the
  fingerprint database are implemented.

* The MAC policy implementation source file also contains some utility
  functions.

* A set of fingerprint modules for the following cryptographic hash
  algorithms:
  RIPEMD-160, SHA1, SHA2-256, SHA2-384, SHA2-512

* Loadable module builds for MAC/veriexec and fingerprint modules.

 WARNING: Using veriexec with NFS (or other network-based) file systems is
          not recommended as one cannot guarantee the integrity of the files
          served, nor the uniqueness of file system identifiers which are
          used as key in the meta-data store.

Reviewed by:	ian, jtl
Obtained from:	Juniper Networks, Inc.
Differential Revision:	https://reviews.freebsd.org/D8554
2018-06-20 00:41:30 +00:00
Alan Somers
ebc0d5599d audit(4): fix the definition of ARG_TERMID_ADDR
Due to a copy/paste error in r168688, ARG_TERMID_ADDR has the same
definition as ARG_SADDRUNIX.  Fix it.

The header change, while publicly visible, is guarded by #ifdef KERNEL, and
I can't find any kmod ports that use it.  So I'm not bumping
__FreeBSD_version.

PR:		228820
Submitted by:	aniketp
Sponsored by:	Google, Inc. (GSoC 2018)
Differential Revision:	https://reviews.freebsd.org/D15702
2018-06-13 14:55:31 +00:00
Alan Somers
1c97d64332 #include <bsm/audit.h> in security/audit/audit_ioctl.h
security/audit/audit_ioctl.h uses a type from bsm/audit.h, so needs to
include it.  And it needs to know the type's size, so it can't just
forward-declare.

PR:		228470
Submitted by:	aniketp
MFC after:	2 weeks
Sponsored by:	Google, Inc. (GSoC 2018)
Differential Revision:	https://reviews.freebsd.org/D15561
2018-05-30 21:50:23 +00:00
Alan Somers
cb3d7cd839 Fix "Bad tailq" panic when auditing auditon(A_SETCLASS, ...)
Due to an oversight in r195280, auditon(A_SETCLASS, ...) would cause a tailq
element to get added to the tailq twice, resulting in a circular tailq. This
panics when INVARIANTS are on.

MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D15381
2018-05-28 20:47:39 +00:00
Brooks Davis
541d96aaaf Use an accessor function to access ifr_data.
This fixes 32-bit compat (no ioctl command defintions are required
as struct ifreq is the same size).  This is believed to be sufficent to
fully support ifconfig on 32-bit systems.

Reviewed by:	kib
Obtained from:	CheriBSD
MFC after:	1 week
Relnotes:	yes
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14900
2018-03-30 18:50:13 +00:00
Alan Somers
b521cf275c audit(4): fix a typo in a comment
no functional change
2018-03-17 17:56:08 +00:00
Eugene Grosbein
df850530e3 mac_portacl(4): stop panicing INVARIANTS-enabled kernel by loading .ko
when kernel already has options MAC_PORTACL.

PR:		183817
Approved by:	avg (mentor)
MFC after:	1 week
2018-02-25 23:10:13 +00:00
Brooks Davis
d88fe103eb Reduce duplication in __mac_*_(file|link)(2) implementation.
Reviewed by:	rwatson
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14175
2018-02-15 18:57:22 +00:00
Alexander Kabaev
151ba7933a Do pass removing some write-only variables from the kernel.
This reduces noise when kernel is compiled by newer GCC versions,
such as one used by external toolchain ports.

Reviewed by: kib, andrew(sys/arm and sys/arm64), emaste(partial), erj(partial)
Reviewed by: jhb (sys/dev/pci/* sys/kern/vfs_aio.c and sys/kern/kern_synch.c)
Differential Revision: https://reviews.freebsd.org/D10385
2017-12-25 04:48:39 +00:00
Pedro F. Giffuni
51369649b0 sys: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
2017-11-20 19:43:44 +00:00
Mateusz Guzik
574adb65c8 Sprinkle __read_frequently on few obvious places.
Note that some of annotated variables should probably change their types
to something smaller, preferably bit-sized.
2017-09-06 20:33:33 +00:00
Ed Maste
993114dd2a Correct bitwise test in mac_bsdextended ugidfw_rule_valid()
PR:		218039
CID:		1008934
Reported by:	Coverity, PVS-Studio
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D10300
2017-06-13 01:17:58 +00:00
Konstantin Belousov
6992112349 Commit the 64-bit inode project.
Extend the ino_t, dev_t, nlink_t types to 64-bit ints.  Modify
struct dirent layout to add d_off, increase the size of d_fileno
to 64-bits, increase the size of d_namlen to 16-bits, and change
the required alignment.  Increase struct statfs f_mntfromname[] and
f_mntonname[] array length MNAMELEN to 1024.

ABI breakage is mitigated by providing compatibility using versioned
symbols, ingenious use of the existing padding in structures, and
by employing other tricks.  Unfortunately, not everything can be
fixed, especially outside the base system.  For instance, third-party
APIs which pass struct stat around are broken in backward and
forward incompatible ways.

Kinfo sysctl MIBs ABI is changed in backward-compatible way, but
there is no general mechanism to handle other sysctl MIBS which
return structures where the layout has changed. It was considered
that the breakage is either in the management interfaces, where we
usually allow ABI slip, or is not important.

Struct xvnode changed layout, no compat shims are provided.

For struct xtty, dev_t tty device member was reduced to uint32_t.
It was decided that keeping ABI compat in this case is more useful
than reporting 64-bit dev_t, for the sake of pstat.

Update note: strictly follow the instructions in UPDATING.  Build
and install the new kernel with COMPAT_FREEBSD11 option enabled,
then reboot, and only then install new world.

Credits: The 64-bit inode project, also known as ino64, started life
many years ago as a project by Gleb Kurtsou (gleb).  Kirk McKusick
(mckusick) then picked up and updated the patch, and acted as a
flag-waver.  Feedback, suggestions, and discussions were carried
by Ed Maste (emaste), John Baldwin (jhb), Jilles Tjoelker (jilles),
and Rick Macklem (rmacklem).  Kris Moore (kris) performed an initial
ports investigation followed by an exp-run by Antoine Brodin (antoine).
Essential and all-embracing testing was done by Peter Holm (pho).
The heavy lifting of coordinating all these efforts and bringing the
project to completion were done by Konstantin Belousov (kib).

Sponsored by:	The FreeBSD Foundation (emaste, kib)
Differential revision:	https://reviews.freebsd.org/D10439
2017-05-23 09:29:05 +00:00
Robert Watson
709557d903 Break audit_bsm_klib.c into two files: one (audit_bsm_klib.c)
retaining various utility functions used during BSM generation,
and a second (audit_bsm_db.c) that contains the various in-kernel
databases supporting various audit activities (the class and
event-name tables).

(No functional change is intended.)

Obtained from:	TrustedBSD Project
MFC after:	3 weeks
Sponsored by:	DARPA, AFRL
2017-04-03 10:15:58 +00:00
Robert Watson
475e1fc01f Correct macro names and signatures for !AUDIT versions of canonical
path auditing.

Obtained from:	TrustedBSD Project
MFC after:	3 weeks
Sponsored by:	DARPA, AFRL
2017-03-31 14:13:13 +00:00
Robert Watson
15bcf785ba Audit arguments to POSIX message queues, semaphores, and shared memory.
This requires minor changes to the audit framework to allow capturing
paths that are not filesystem paths (i.e., will not be canonicalised
relative to the process current working directory and/or filesystem
root).

Obtained from:	TrustedBSD Project
MFC after:	3 weeks
Sponsored by:	DARPA, AFRL
2017-03-31 13:43:00 +00:00
Robert Watson
1c2da02938 Audit arguments to System V IPC system calls implementing sempahores,
message queues, and shared memory.

Obtained from:	TrustedBSD Project
MFC after:	3 weeks
Sponsored by:	DARPA, AFRL
2017-03-30 22:26:15 +00:00
Robert Watson
b65ec5e523 Various BSM generation improvements when auditing AUE_ACCEPT,
AUE_PROCCTL, AUE_SENDFILE, AUE_ACL_*, and AUE_POSIX_FALLOCATE.
Audit AUE_SHMUNLINK path in the path token rather than as a
text string, and AUE_SHMOPEN flags as an integer token rather
than a System V IPC address token.

Obtained from:	TrustedBSD Project
MFC after:	3 weeks
Sponsored by:	DARPA, AFRL
2017-03-30 21:39:03 +00:00
Robert Watson
7cad64f796 Don't ifdef KDTRACE_HOOKS struct, variable, and function prototype
definitions for the DTrace audit provider, so that the dtaudit module
can compile in the absence of kernel DTrace support.  This doesn't
really make run-time sense (since the binary dependencies for the
module won't be present), but it allows the dtaudit module to compile
successfully regardless of the kernel configuration.

MFC after:	3 weeks
Sponsored by:	DARPA, AFRL
Reported by:	kib
2017-03-30 12:35:56 +00:00
Robert Watson
b783025921 When handling msgsys(2), semsys(2), and shmsys(2) multiplex system calls,
map the 'which' argument into a suitable audit event identifier for the
specific operation requested.

Obtained from:	TrustedBSD Project
MFC after:	3 weeks
Sponsored by:	DARPA, AFRL
2017-03-29 23:31:35 +00:00
Robert Watson
1811d6bf7f Add an experimental DTrace audit provider, which allows users of DTrace to
instrument security event auditing rather than relying on conventional BSM
trail files or audit pipes:

- Add a set of per-event 'commit' probes, which provide access to
  particular auditable events at the time of commit in system-call return.
  These probes gain access to audit data via the in-kernel audit_record
  data structure, providing convenient access to system-call arguments and
  return values in a single probe.

- Add a set of per-event 'bsm' probes, which provide access to particular
  auditable events at the time of BSM record generation in the audit
  worker thread. These probes have access to the in-kernel audit_record
  data structure and BSM representation as would be written to a trail
  file or audit pipe -- i.e., asynchronously in the audit worker thread.

DTrace probe arguments consist of the name of the audit event (to support
future mechanisms of instrumenting multiple events via a single probe --
e.g., using classes), a pointer to the in-kernel audit record, and an
optional pointer to the BSM data and its length. For human convenience,
upper-case audit event names (AUE_...) are converted to lower case in
DTrace.

DTrace scripts can now cause additional audit-based data to be collected
on system calls, and inspect internal and BSM representations of the data.
They do not affect data captured in the audit trail or audit pipes
configured in the system. auditd(8) must be configured and running in
order to provide a database of event information, as well as other audit
configuration parameters (e.g., to capture command-line arguments or
environmental variables) for the provider to operate.

Reviewed by:	gnn, jonathan, markj
Sponsored by:	DARPA, AFRL
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D10149
2017-03-29 19:58:00 +00:00
Robert Watson
759c8caa5a Introduce an audit event identifier -> audit event name mapping
database in the kernel audit implementation, similar the exist
class mapping database.  This will be used by the DTrace audit
provider to map audit event identifiers originating in the
system-call table back into strings for the purposes of setting
probe names.  The database is initialised and maintained by
auditd(8), which reads values in from the audit_events
configuration file, and then manages them using the A_GETEVENT
and A_SETEVENT auditon(2) operations.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, AFRL
MFC after:	3 weeks
2017-03-27 10:38:53 +00:00
Robert Watson
d422682f67 Extend comment describing path canonicalisation in audit.
Sponsored by:	DARPA, AFRL
Obtained from:	TrustedBSD Project
MFC after:	3 days
2017-03-27 08:29:17 +00:00
Robert Watson
1279fdafce Audit 'fd' and 'cmd' arguments to fcntl(2), and when generating BSM,
always audit the file-descriptor number and vnode information for all
fnctl(2) commands, not just locking-related ones.  This was likely an
oversight in the original adaptation of this code from XNU.

MFC after:	3 days
Sponsored by:	DARPA, AFRL
2016-11-22 00:41:24 +00:00
Bryan Drewery
28323add09 Fix improper use of "its".
Sponsored by:	Dell EMC Isilon
2016-11-08 23:59:41 +00:00
John Baldwin
b3db2736b1 Don't check aq64_minfree which is unsigned for negative values.
This fixes a tautological comparison warning.

Reviewed by:	rwatson
Differential Revision:	https://reviews.freebsd.org/D7682
2016-09-08 19:47:57 +00:00
Robert Watson
70a98c110e Audit the accepted (or rejected) username argument to setlogin(2).
(NB: This was likely a mismerge from XNU in audit support, where the
text argument to setlogin(2) is captured -- but as a text token,
whereas this change uses the dedicated login-name field in struct
audit_record.)

MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
2016-08-20 20:28:08 +00:00
Robert Watson
98daa3e5db Add AUE_WAIT6 handling to the BSM conversion switch statement, reusing
the BSM encoding used for AUE_WAIT4.

MFC after:	3 days
Sponsored by:	DARPA, AFRL
2016-07-11 13:06:17 +00:00
Robert Watson
2aa8c03917 Implement AUE_PREAD and AUE_PWRITE BSM conversion support, eliminating
console warnings when pread(2) and pwrite(2) are used with full
system-call auditing enabled.  We audit the same file-descriptor data
for these calls as we do read(2) and write(2).

Approved by:	re (kib)
MFC after:	3 days
Sponsored by:	DARPA, AFRL
2016-06-13 09:22:20 +00:00
Pedro F. Giffuni
bc5ade0d10 sys/security: minor spelling fixes.
No functional change.
2016-05-06 16:59:04 +00:00
Pedro F. Giffuni
323b076e9c sys: use our nitems() macro when param.h is available.
This should cover all the remaining cases in the kernel.

Discussed in:	freebsd-current
2016-04-21 19:40:10 +00:00
Pedro F. Giffuni
8dfea46460 Remove slightly used const values that can be replaced with nitems().
Suggested by:	jhb
2016-04-21 15:38:28 +00:00
Pedro F. Giffuni
2e4eeb703a audit(8): leave unsigned comparison for last.
aq64_minfree is unsigned so comparing to find out if it is less
than zero is a nonsense. Move the comparison to the last position
as we don't want to spend time if any of the others triggers first.

hile it would be tempting to just remove it, it may be important to
keep  it for portability with platforms where may be signed(?) or
in case we may want to change it in the future.
2016-04-08 03:26:21 +00:00
Konstantin Belousov
27725229dc Busy the mount point which is the owner of the audit vnode, around
audit_record_write().  This is important so that VFS_STATFS() is not
done on the NULL or freed mp and the check for free space is
consistent with the vnode used for write.

Add vn_start_write() braces around VOP_FSYNC() calls on the audit vnode.

Move repeated code to fsync vnode and panic to the helper
audit_worker_sync_vp().

Reviewed by:	rwatson
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-01-16 10:06:33 +00:00
Konstantin Belousov
c77f6350ee Move the funsetown(9) call from audit_pipe_close() to cdevpriv
destructor.  As result, close method becomes trivial and removed.
Final cdevsw close method might be called without file
context (e.g. in vn_open_vnode() if the vnode is reclaimed meantime),
which leaves ap_sigio registered for notification, despite cdevpriv
destructor frees the memory later.

Call destructor instead of doing a cleanup inline, for
devfs_set_cdevpriv() failure in open.  This adds missed funsetown(9)
call and locks ap to satisfy audit_pipe_free() invariants.

Reported and tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-01-13 14:02:07 +00:00
Christian Brueffer
8a0f5c0b6c Merge from contrib/openbsm to bring the kernel audit bits up to date with OpenBSM 1.2 alpha 4:
- remove $P4$
- fix a comment
2015-12-20 23:22:04 +00:00
Mark Johnston
3616095801 Fix style issues around existing SDT probes.
- Use SDT_PROBE<N>() instead of SDT_PROBE(). This has no functional effect
  at the moment, but will be needed for some future changes.
- Don't hardcode the module component of the probe identifier. This is
  set automatically by the SDT framework.

MFC after:	1 week
2015-12-16 23:39:27 +00:00
Mateusz Guzik
f131759f54 fd: make 'rights' a manadatory argument to fget* functions 2015-07-05 19:05:16 +00:00
Mateusz Guzik
4da8456f0a Replace struct filedesc argument in getvnode with struct thread
This is is a step towards removal of spurious arguments.
2015-06-16 13:09:18 +00:00
Mateusz Guzik
9ef8328d52 fd: make rights a mandatory argument to fget_unlocked 2015-06-16 09:52:36 +00:00
Mateusz Guzik
daf63fd2f9 cred: add proc_set_cred helper
The goal here is to provide one place altering process credentials.

This eases debugging and opens up posibilities to do additional work when such
an action is performed.
2015-03-16 00:10:03 +00:00
Mateusz Guzik
eb0b6ba016 audit: fix cred assignment when A_SETPMASK is used
The code used to modify curproc instead of the target process.

Discussed with: rwatson
MFC after:	3 days
2015-03-15 21:43:43 +00:00
Gleb Kurtsou
dde58752db Adjust printf format specifiers for dev_t and ino_t in kernel.
ino_t and dev_t are about to become uint64_t.

Reviewed by:	kib, mckusick
2014-12-17 07:27:19 +00:00
Davide Italiano
b40ea294f9 Replace dev_clone with cdevpriv(9) KPI in audit_pipe code.
This is (yet another) step towards the removal of device
cloning from our kernel.

CR:	https://reviews.freebsd.org/D441
Reviewed by:	kib, rwatson
Tested by:	pho
2014-08-20 16:04:30 +00:00
Hans Petter Selasky
af3b2549c4 Pull in r267961 and r267973 again. Fix for issues reported will follow. 2014-06-28 03:56:17 +00:00
Glen Barber
37a107a407 Revert r267961, r267973:
These changes prevent sysctl(8) from returning proper output,
such as:

 1) no output from sysctl(8)
 2) erroneously returning ENOMEM with tools like truss(1)
    or uname(1)
 truss: can not get etype: Cannot allocate memory
2014-06-27 22:05:21 +00:00
Hans Petter Selasky
3da1cf1e88 Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after:	2 weeks
Sponsored by:	Mellanox Technologies
2014-06-27 16:33:43 +00:00
Mateusz Guzik
00b85f6218 audit: plug FILEDESC_LOCK leak in audit_canon_path.
MFC after:	3 days
2014-03-21 01:30:33 +00:00
Robert Watson
4a14441044 Update kernel inclusions of capability.h to use capsicum.h instead; some
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.

MFC after:	3 weeks
2014-03-16 10:55:57 +00:00
Gleb Smirnoff
45c203fce2 Remove AppleTalk support.
AppleTalk was a network transport protocol for Apple Macintosh devices
in 80s and then 90s. Starting with Mac OS X in 2000 the AppleTalk was
a legacy protocol and primary networking protocol is TCP/IP. The last
Mac OS X release to support AppleTalk happened in 2009. The same year
routing equipment vendors (namely Cisco) end their support.

Thus, AppleTalk won't be supported in FreeBSD 11.0-RELEASE.
2014-03-14 06:29:43 +00:00
Gleb Smirnoff
2c284d9395 Remove IPX support.
IPX was a network transport protocol in Novell's NetWare network operating
system from late 80s and then 90s. The NetWare itself switched to TCP/IP
as default transport in 1998. Later, in this century the Novell Open
Enterprise Server became successor of Novell NetWare. The last release
that claimed to still support IPX was OES 2 in 2007. Routing equipment
vendors (e.g. Cisco) discontinued support for IPX in 2011.

Thus, IPX won't be supported in FreeBSD 11.0-RELEASE.
2014-03-14 02:58:48 +00:00
Bjoern A. Zeeb
523b2279b8 As constantly reported during kernel compilation, m_buflen is unsigned so
can never be < 0.  Remove the expression, which can never be true.

MFC after:	1 week
2013-12-25 20:10:17 +00:00
John Baldwin
44ddb776c6 There is no sysctl with the MIB { CTL_KERN, KERN_MAXID }.
MFC after:	2 weeks
2013-12-05 21:55:10 +00:00
Andriy Gapon
d9fae5ab88 dtrace sdt: remove the ugly sname parameter of SDT_PROBE_DEFINE
In its stead use the Solaris / illumos approach of emulating '-' (dash)
in probe names with '__' (two consecutive underscores).

Reviewed by:	markj
MFC after:	3 weeks
2013-11-26 08:46:27 +00:00
Attilio Rao
54366c0bd7 - For kernel compiled only with KDTRACE_HOOKS and not any lock debugging
option, unbreak the lock tracing release semantic by embedding
  calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
  version of the releasing functions for mutex, rwlock and sxlock.
  Failing to do so skips the lockstat_probe_func invokation for
  unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
  kernel compiled without lock debugging options, potentially every
  consumer must be compiled including opt_kdtrace.h.
  Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
  dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
  is linked there and it is only used as a compile-time stub [0].

[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested.  As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while.  Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].

Sponsored by:	EMC / Isilon storage division
Discussed with:	rstone
[0] Reported by:	rstone
[1] Discussed with:	philip
2013-11-25 07:38:45 +00:00
Gleb Smirnoff
76039bc84f The r48589 promised to remove implicit inclusion of if_var.h soon. Prepare
to this event, adding if_var.h to files that do need it. Also, include
all includes that now are included due to implicit pollution via if_var.h

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-26 17:58:36 +00:00
Mark Johnston
92c6196caa Fix some typos that were causing probe argument types to show up as unknown.
Reviewed by:	rwatson (mac provider)
Approved by:	re (glebius)
MFC after:	1 week
2013-10-01 15:40:27 +00:00
Konstantin Belousov
f7fadf1f22 Make the mac_policy_rm lock recursable, which allows reentrance into
the mac framework.  It is needed when priv_check_cred(9) is called from
the mac callback, e.g. in the mac_portacl(4).

Reported by:	az
Reviewed by:	rwatson
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (gjb)
2013-09-29 20:21:34 +00:00
Davide Italiano
d56b4cd4ac - Use make_dev_credf(MAKEDEV_REF) instead of the race-prone make_dev()+
dev_ref() in the clone handlers that still use it.
- Don't set SI_CHEAPCLONE flag, it's not used anywhere neither in devfs
(for anything real)

Reviewed by:	kib
2013-09-07 13:45:44 +00:00
Pawel Jakub Dawidek
ab568de789 Handle cases where capability rights are not provided.
Reported by:	kib
2013-09-05 11:58:12 +00:00
Pawel Jakub Dawidek
7008be5bd7 Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.

The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.

The structure definition looks like this:

	struct cap_rights {
		uint64_t	cr_rights[CAP_RIGHTS_VERSION + 2];
	};

The initial CAP_RIGHTS_VERSION is 0.

The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.

The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.

To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.

	#define	CAP_PDKILL	CAPRIGHT(1, 0x0000000000000800ULL)

We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:

	#define	CAP_LOOKUP	CAPRIGHT(0, 0x0000000000000400ULL)
	#define	CAP_FCHMOD	CAPRIGHT(0, 0x0000000000002000ULL)

	#define	CAP_FCHMODAT	(CAP_FCHMOD | CAP_LOOKUP)

There is new API to manage the new cap_rights_t structure:

	cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
	void cap_rights_set(cap_rights_t *rights, ...);
	void cap_rights_clear(cap_rights_t *rights, ...);
	bool cap_rights_is_set(const cap_rights_t *rights, ...);

	bool cap_rights_is_valid(const cap_rights_t *rights);
	void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
	void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
	bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);

Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:

	cap_rights_t rights;

	cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);

There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:

	#define	cap_rights_set(rights, ...)				\
		__cap_rights_set((rights), __VA_ARGS__, 0ULL)
	void __cap_rights_set(cap_rights_t *rights, ...);

Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:

	cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);

Providing several rights that belongs to the same array's element this way is
correct, but is not advised. It should only be used for aliases definition.

This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.

Sponsored by:	The FreeBSD Foundation
2013-09-05 00:09:56 +00:00
Konstantin Belousov
940cb0e2bb Implement read(2)/write(2) and neccessary lseek(2) for posix shmfd.
Add MAC framework entries for posix shm read and write.

Do not allow implicit extension of the underlying memory segment past
the limit set by ftruncate(2) by either of the syscalls.  Read and
write returns short i/o, lseek(2) fails with EINVAL when resulting
offset does not fit into the limit.

Discussed with:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2013-08-21 17:45:00 +00:00
Andriy Gapon
a01669ea96 audit_proc_coredump: check return value of audit_new
audit_new may return NULL if audit is disabled or suspended.

Sponsored by:	HybridCluster
MFC after:	7 days
2013-07-09 09:03:01 +00:00
Alan Cox
a42159f0ee Relax the vm object locking in mac_proc_vm_revoke_recurse(). A read lock
suffices in one place.

Sponsored by:	EMC / Isilon Storage Division
2013-06-04 17:23:09 +00:00
Attilio Rao
89f6b8632c Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
  - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
  - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
  - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
  - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
    (in order to avoid visibility of implementation details)
  - The read-mode operations are added:
    VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
    VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
  sys/mutex.h in consumers directly to cater its inlining functions
  using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
  consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
  the compat layer because the name clash between FreeBSD and solaris
  versions must be avoided.
  At this purpose zfs redefines the vm_object locking functions
  directly, isolating the FreeBSD components in specific compat stubs.

The KPI results heavilly broken by this commit.  Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by:	EMC / Isilon storage division
Reviewed by:	jeff
Reviewed by:	pjd (ZFS specific review)
Discussed with:	alc
Tested by:	pho
2013-03-09 02:32:23 +00:00
Pawel Jakub Dawidek
7493f24ee6 - Implement two new system calls:
int bindat(int fd, int s, const struct sockaddr *addr, socklen_t addrlen);
	int connectat(int fd, int s, const struct sockaddr *name, socklen_t namelen);

  which allow to bind and connect respectively to a UNIX domain socket with a
  path relative to the directory associated with the given file descriptor 'fd'.

- Add manual pages for the new syscalls.

- Make the new syscalls available for processes in capability mode sandbox.

- Add capability rights CAP_BINDAT and CAP_CONNECTAT that has to be present on
  the directory descriptor for the syscalls to work.

- Update audit(4) to support those two new syscalls and to handle path
  in sockaddr_un structure relative to the given directory descriptor.

- Update procstat(1) to recognize the new capability rights.

- Document the new capability rights in cap_rights_limit(2).

Sponsored by:	The FreeBSD Foundation
Discussed with:	rwatson, jilles, kib, des
2013-03-02 21:11:30 +00:00
Pawel Jakub Dawidek
2609222ab4 Merge Capsicum overhaul:
- Capability is no longer separate descriptor type. Now every descriptor
  has set of its own capability rights.

- The cap_new(2) system call is left, but it is no longer documented and
  should not be used in new code.

- The new syscall cap_rights_limit(2) should be used instead of
  cap_new(2), which limits capability rights of the given descriptor
  without creating a new one.

- The cap_getrights(2) syscall is renamed to cap_rights_get(2).

- If CAP_IOCTL capability right is present we can further reduce allowed
  ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
  ioctls can be retrived with cap_ioctls_get(2) syscall.

- If CAP_FCNTL capability right is present we can further reduce fcntls
  that can be used with the new cap_fcntls_limit(2) syscall and retrive
  them with cap_fcntls_get(2).

- To support ioctl and fcntl white-listing the filedesc structure was
  heavly modified.

- The audit subsystem, kdump and procstat tools were updated to
  recognize new syscalls.

- Capability rights were revised and eventhough I tried hard to provide
  backward API and ABI compatibility there are some incompatible changes
  that are described in detail below:

	CAP_CREATE old behaviour:
	- Allow for openat(2)+O_CREAT.
	- Allow for linkat(2).
	- Allow for symlinkat(2).
	CAP_CREATE new behaviour:
	- Allow for openat(2)+O_CREAT.

	Added CAP_LINKAT:
	- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
	- Allow to be target for renameat(2).

	Added CAP_SYMLINKAT:
	- Allow for symlinkat(2).

	Removed CAP_DELETE. Old behaviour:
	- Allow for unlinkat(2) when removing non-directory object.
	- Allow to be source for renameat(2).

	Removed CAP_RMDIR. Old behaviour:
	- Allow for unlinkat(2) when removing directory.

	Added CAP_RENAMEAT:
	- Required for source directory for the renameat(2) syscall.

	Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
	- Allow for unlinkat(2) on any object.
	- Required if target of renameat(2) exists and will be removed by this
	  call.

	Removed CAP_MAPEXEC.

	CAP_MMAP old behaviour:
	- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
	  PROT_WRITE.
	CAP_MMAP new behaviour:
	- Allow for mmap(2)+PROT_NONE.

	Added CAP_MMAP_R:
	- Allow for mmap(PROT_READ).
	Added CAP_MMAP_W:
	- Allow for mmap(PROT_WRITE).
	Added CAP_MMAP_X:
	- Allow for mmap(PROT_EXEC).
	Added CAP_MMAP_RW:
	- Allow for mmap(PROT_READ | PROT_WRITE).
	Added CAP_MMAP_RX:
	- Allow for mmap(PROT_READ | PROT_EXEC).
	Added CAP_MMAP_WX:
	- Allow for mmap(PROT_WRITE | PROT_EXEC).
	Added CAP_MMAP_RWX:
	- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).

	Renamed CAP_MKDIR to CAP_MKDIRAT.
	Renamed CAP_MKFIFO to CAP_MKFIFOAT.
	Renamed CAP_MKNODE to CAP_MKNODEAT.

	CAP_READ old behaviour:
	- Allow pread(2).
	- Disallow read(2), readv(2) (if there is no CAP_SEEK).
	CAP_READ new behaviour:
	- Allow read(2), readv(2).
	- Disallow pread(2) (CAP_SEEK was also required).

	CAP_WRITE old behaviour:
	- Allow pwrite(2).
	- Disallow write(2), writev(2) (if there is no CAP_SEEK).
	CAP_WRITE new behaviour:
	- Allow write(2), writev(2).
	- Disallow pwrite(2) (CAP_SEEK was also required).

	Added convinient defines:

	#define	CAP_PREAD		(CAP_SEEK | CAP_READ)
	#define	CAP_PWRITE		(CAP_SEEK | CAP_WRITE)
	#define	CAP_MMAP_R		(CAP_MMAP | CAP_SEEK | CAP_READ)
	#define	CAP_MMAP_W		(CAP_MMAP | CAP_SEEK | CAP_WRITE)
	#define	CAP_MMAP_X		(CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
	#define	CAP_MMAP_RW		(CAP_MMAP_R | CAP_MMAP_W)
	#define	CAP_MMAP_RX		(CAP_MMAP_R | CAP_MMAP_X)
	#define	CAP_MMAP_WX		(CAP_MMAP_W | CAP_MMAP_X)
	#define	CAP_MMAP_RWX		(CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
	#define	CAP_RECV		CAP_READ
	#define	CAP_SEND		CAP_WRITE

	#define	CAP_SOCK_CLIENT \
		(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
		 CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
	#define	CAP_SOCK_SERVER \
		(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
		 CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
		 CAP_SETSOCKOPT | CAP_SHUTDOWN)

	Added defines for backward API compatibility:

	#define	CAP_MAPEXEC		CAP_MMAP_X
	#define	CAP_DELETE		CAP_UNLINKAT
	#define	CAP_MKDIR		CAP_MKDIRAT
	#define	CAP_RMDIR		CAP_UNLINKAT
	#define	CAP_MKFIFO		CAP_MKFIFOAT
	#define	CAP_MKNOD		CAP_MKNODAT
	#define	CAP_SOCK_ALL		(CAP_SOCK_CLIENT | CAP_SOCK_SERVER)

Sponsored by:	The FreeBSD Foundation
Reviewed by:	Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with:	rwatson, benl, jonathan
ABI compatibility discussed with:	kib
2013-03-02 00:53:12 +00:00
Pawel Jakub Dawidek
89adaea91f Remove redundant check. 2013-02-17 11:57:47 +00:00
Pawel Jakub Dawidek
9270ed9d38 Style. 2013-02-11 22:54:23 +00:00