0dc332bff2
fspacectl(2) is a system call to provide space management support to userspace applications. VOP_DEALLOCATE(9) is a VOP call to perform the deallocation. vn_deallocate(9) is a public KPI for kmods' use. The purpose of proposing a new system call, a KPI and a VOP call is to allow bhyve or other hypervisor monitors to emulate the behavior of SCSI UNMAP/NVMe DEALLOCATE on a plain file. fspacectl(2) comprises of cmd and flags parameters to specify the space management operation to be performed. Currently cmd has to be SPACECTL_DEALLOC, and flags has to be 0. fo_fspacectl is added to fileops. VOP_DEALLOCATE(9) is added as a new VOP call. A trivial implementation of VOP_DEALLOCATE(9) is provided. Sponsored by: The FreeBSD Foundation Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D28347
784 lines
14 KiB
Plaintext
784 lines
14 KiB
Plaintext
##
|
|
## Copyright (c) 2008-2010 Robert N. M. Watson
|
|
## All rights reserved.
|
|
##
|
|
## This software was developed at the University of Cambridge Computer
|
|
## Laboratory with support from a grant from Google, Inc.
|
|
##
|
|
## Redistribution and use in source and binary forms, with or without
|
|
## modification, are permitted provided that the following conditions
|
|
## are met:
|
|
## 1. Redistributions of source code must retain the above copyright
|
|
## notice, this list of conditions and the following disclaimer.
|
|
## 2. Redistributions in binary form must reproduce the above copyright
|
|
## notice, this list of conditions and the following disclaimer in the
|
|
## documentation and/or other materials provided with the distribution.
|
|
##
|
|
## THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
|
|
## ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
|
## IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
|
## ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
|
|
## FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
|
## DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
|
|
## OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
|
|
## HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
|
|
## LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
|
|
## OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
|
|
## SUCH DAMAGE.
|
|
##
|
|
## List of system calls enabled in capability mode, one name per line.
|
|
##
|
|
## System calls listed here operate either fully or partially in the absence
|
|
## of global namespaces or ambient authority. In capability mode system calls
|
|
## that operate only on global namespaces or require ambient authority have no
|
|
## utility and thus are not permitted.
|
|
##
|
|
## Notes:
|
|
## - sys_exit(2), abort2(2) and close(2) are very important.
|
|
## - Sorted alphabetically, please keep it that way.
|
|
##
|
|
## $FreeBSD$
|
|
##
|
|
|
|
##
|
|
## Allow ACL and MAC label operations by file descriptor, subject to
|
|
## capability rights. Allow MAC label operations on the current process but
|
|
## we will need to scope __mac_get_pid(2).
|
|
##
|
|
__acl_aclcheck_fd
|
|
__acl_delete_fd
|
|
__acl_get_fd
|
|
__acl_set_fd
|
|
__mac_get_fd
|
|
#__mac_get_pid
|
|
__mac_get_proc
|
|
__mac_set_fd
|
|
__mac_set_proc
|
|
|
|
##
|
|
## Allow creating special file descriptors like eventfd(2).
|
|
##
|
|
__specialfd
|
|
|
|
##
|
|
## Allow sysctl(2) as we scope internal to the call; this is a global
|
|
## namespace, but there are several critical sysctls required for almost
|
|
## anything to run, such as hw.pagesize. For now that policy lives in the
|
|
## kernel for performance and simplicity, but perhaps it could move to a
|
|
## proxying daemon in userspace.
|
|
##
|
|
__sysctl
|
|
__sysctlbyname
|
|
|
|
##
|
|
## Allow umtx operations as these are scoped by address space.
|
|
##
|
|
## XXRW: Need to check this very carefully.
|
|
##
|
|
_umtx_op
|
|
|
|
##
|
|
## Allow process termination using abort2(2).
|
|
##
|
|
abort2
|
|
|
|
##
|
|
## Allow accept(2) since it doesn't manipulate namespaces directly, rather
|
|
## relies on existing bindings on a socket, subject to capability rights.
|
|
##
|
|
accept
|
|
accept4
|
|
|
|
##
|
|
## Allow AIO operations by file descriptor, subject to capability rights.
|
|
##
|
|
aio_cancel
|
|
aio_error
|
|
aio_fsync
|
|
aio_read
|
|
aio_return
|
|
aio_suspend
|
|
aio_waitcomplete
|
|
aio_write
|
|
aio_writev
|
|
aio_readv
|
|
|
|
##
|
|
## audit(2) is a global operation, submitting to the global trail, but it is
|
|
## controlled by privilege, and it might be useful to be able to submit
|
|
## records from sandboxes. For now, disallow, but we may want to think about
|
|
## providing some sort of proxy service for this.
|
|
##
|
|
#audit
|
|
|
|
##
|
|
## Allow bindat(2).
|
|
##
|
|
bindat
|
|
|
|
##
|
|
## Allow capability mode and capability system calls.
|
|
##
|
|
cap_enter
|
|
cap_fcntls_get
|
|
cap_fcntls_limit
|
|
cap_getmode
|
|
cap_ioctls_get
|
|
cap_ioctls_limit
|
|
__cap_rights_get
|
|
cap_rights_limit
|
|
|
|
##
|
|
## Allow read-only clock operations.
|
|
##
|
|
clock_getres
|
|
clock_gettime
|
|
|
|
##
|
|
## Always allow file descriptor close(2).
|
|
##
|
|
close
|
|
close_range
|
|
closefrom
|
|
|
|
##
|
|
## Allow connectat(2).
|
|
##
|
|
connectat
|
|
|
|
##
|
|
## copy_file_range(2) reads from one descriptor and writes to the other.
|
|
##
|
|
copy_file_range
|
|
|
|
##
|
|
## cpuset(2) and related calls are limited to caller's own process/thread.
|
|
##
|
|
#cpuset
|
|
cpuset_getaffinity
|
|
cpuset_getdomain
|
|
#cpuset_getid
|
|
cpuset_setaffinity
|
|
cpuset_setdomain
|
|
#cpuset_setid
|
|
|
|
##
|
|
## Always allow dup(2) and dup2(2) manipulation of the file descriptor table.
|
|
##
|
|
dup
|
|
dup2
|
|
|
|
##
|
|
## Allow extended attribute operations by file descriptor, subject to
|
|
## capability rights.
|
|
##
|
|
extattr_delete_fd
|
|
extattr_get_fd
|
|
extattr_list_fd
|
|
extattr_set_fd
|
|
|
|
##
|
|
## Allow changing file flags, mode, and owner by file descriptor, subject to
|
|
## capability rights.
|
|
##
|
|
fchflags
|
|
fchmod
|
|
fchown
|
|
|
|
##
|
|
## For now, allow fcntl(2), subject to capability rights, but this probably
|
|
## needs additional scoping.
|
|
##
|
|
fcntl
|
|
|
|
##
|
|
## Allow fexecve(2), subject to capability rights. We perform some scoping,
|
|
## such as disallowing privilege escalation.
|
|
##
|
|
fexecve
|
|
|
|
##
|
|
## Allow flock(2), subject to capability rights.
|
|
##
|
|
flock
|
|
|
|
##
|
|
## Allow fork(2), even though it returns pids -- some applications seem to
|
|
## prefer this interface.
|
|
##
|
|
fork
|
|
|
|
##
|
|
## Allow fpathconf(2), subject to capability rights.
|
|
##
|
|
fpathconf
|
|
|
|
##
|
|
## Allow various file descriptor-based I/O operations, subject to capability
|
|
## rights.
|
|
##
|
|
freebsd11_fstat
|
|
freebsd11_fstatat
|
|
freebsd11_getdirentries
|
|
freebsd11_fstatfs
|
|
freebsd11_mknodat
|
|
freebsd6_ftruncate
|
|
freebsd6_lseek
|
|
freebsd6_mmap
|
|
freebsd6_pread
|
|
freebsd6_pwrite
|
|
|
|
##
|
|
## Allow I/O-related file operations, subject to capability rights.
|
|
##
|
|
fspacectl
|
|
|
|
##
|
|
## Allow querying file and file system state with fstat(2) and fstatfs(2),
|
|
## subject to capability rights.
|
|
##
|
|
fstat
|
|
fstatfs
|
|
|
|
##
|
|
## Allow further file descriptor-based I/O operations, subject to capability
|
|
## rights.
|
|
##
|
|
fdatasync
|
|
fsync
|
|
ftruncate
|
|
|
|
##
|
|
## Allow futimens(2) and futimes(2), subject to capability rights.
|
|
##
|
|
futimens
|
|
futimes
|
|
|
|
##
|
|
## Allow querying process audit state, subject to normal access control.
|
|
##
|
|
getaudit
|
|
getaudit_addr
|
|
getauid
|
|
|
|
##
|
|
## Allow thread context management with getcontext(2).
|
|
##
|
|
getcontext
|
|
|
|
##
|
|
## Allow directory I/O on a file descriptor, subject to capability rights.
|
|
## Originally we had separate capabilities for directory-specific read
|
|
## operations, but on BSD we allow reading the raw directory data, so we just
|
|
## rely on CAP_READ now.
|
|
##
|
|
getdents
|
|
getdirentries
|
|
|
|
##
|
|
## Allow querying certain trivial global state.
|
|
##
|
|
getdomainname
|
|
|
|
##
|
|
## Allow querying certain per-process resource limit state.
|
|
##
|
|
getdtablesize
|
|
|
|
##
|
|
## Allow querying current process credential state.
|
|
##
|
|
getegid
|
|
geteuid
|
|
|
|
##
|
|
## Allow querying certain trivial global state.
|
|
##
|
|
gethostid
|
|
gethostname
|
|
|
|
##
|
|
## Allow querying per-process timer.
|
|
##
|
|
getitimer
|
|
|
|
##
|
|
## Allow querying current process credential state.
|
|
##
|
|
getgid
|
|
getgroups
|
|
getlogin
|
|
getloginclass
|
|
|
|
##
|
|
## Allow querying certain trivial global state.
|
|
##
|
|
getpagesize
|
|
getpeername
|
|
|
|
##
|
|
## Allow querying certain per-process scheduling, resource limit, and
|
|
## credential state.
|
|
##
|
|
## XXXRW: getpgid(2) needs scoping. It's not clear if it's worth scoping
|
|
## getppid(2). getpriority(2) needs scoping. getrusage(2) needs scoping.
|
|
## getsid(2) needs scoping.
|
|
##
|
|
getpgid
|
|
getpgrp
|
|
getpid
|
|
getppid
|
|
getpriority
|
|
getresgid
|
|
getresuid
|
|
getrlimit
|
|
getrusage
|
|
getsid
|
|
|
|
##
|
|
## Allow getrandom
|
|
##
|
|
getrandom
|
|
|
|
##
|
|
## Allow querying socket state, subject to capability rights.
|
|
##
|
|
## XXXRW: getsockopt(2) may need more attention.
|
|
##
|
|
getsockname
|
|
getsockopt
|
|
|
|
##
|
|
## Allow querying the global clock.
|
|
##
|
|
gettimeofday
|
|
|
|
##
|
|
## Allow querying current process credential state.
|
|
##
|
|
getuid
|
|
|
|
##
|
|
## Allow ioctl(2), which hopefully will be limited by applications only to
|
|
## required commands with cap_ioctls_limit(2) syscall.
|
|
##
|
|
ioctl
|
|
|
|
##
|
|
## Allow querying current process credential state.
|
|
##
|
|
issetugid
|
|
|
|
##
|
|
## Allow kevent(2), as we will authorize based on capability rights on the
|
|
## target descriptor.
|
|
##
|
|
kevent
|
|
|
|
##
|
|
## Allow kill(2), as we allow the process to send signals only to himself.
|
|
##
|
|
kill
|
|
|
|
##
|
|
## Allow message queue operations on file descriptors, subject to capability
|
|
## rights.
|
|
## NOTE: Corresponding sysents are initialized in sys/kern/uipc_mqueue.c with
|
|
## SYF_CAPENABLED.
|
|
##
|
|
kmq_notify
|
|
kmq_setattr
|
|
kmq_timedreceive
|
|
kmq_timedsend
|
|
|
|
##
|
|
## Allow kqueue(2), we will control use.
|
|
##
|
|
kqueue
|
|
|
|
##
|
|
## Allow managing per-process timers.
|
|
##
|
|
ktimer_create
|
|
ktimer_delete
|
|
ktimer_getoverrun
|
|
ktimer_gettime
|
|
ktimer_settime
|
|
|
|
##
|
|
## We can't allow ktrace(2) because it relies on a global namespace, but we
|
|
## might want to introduce an fktrace(2) of some sort.
|
|
##
|
|
#ktrace
|
|
|
|
##
|
|
## Allow AIO operations by file descriptor, subject to capability rights.
|
|
##
|
|
lio_listio
|
|
|
|
##
|
|
## Allow listen(2), subject to capability rights.
|
|
##
|
|
## XXXRW: One might argue this manipulates a global namespace.
|
|
##
|
|
listen
|
|
|
|
##
|
|
## Allow I/O-related file descriptors, subject to capability rights.
|
|
##
|
|
lseek
|
|
|
|
##
|
|
## Allow simple VM operations on the current process.
|
|
##
|
|
madvise
|
|
mincore
|
|
minherit
|
|
mlock
|
|
mlockall
|
|
|
|
##
|
|
## Allow memory mapping a file descriptor, and updating protections, subject
|
|
## to capability rights.
|
|
##
|
|
mmap
|
|
mprotect
|
|
|
|
##
|
|
## Allow simple VM operations on the current process.
|
|
##
|
|
msync
|
|
munlock
|
|
munlockall
|
|
munmap
|
|
|
|
##
|
|
## Allow the current process to sleep.
|
|
##
|
|
nanosleep
|
|
|
|
##
|
|
## Allow querying the global clock.
|
|
##
|
|
ntp_gettime
|
|
|
|
##
|
|
## Allow AIO operations by file descriptor, subject to capability rights.
|
|
##
|
|
oaio_read
|
|
oaio_write
|
|
|
|
##
|
|
## Allow simple VM operations on the current process.
|
|
##
|
|
break
|
|
|
|
##
|
|
## Allow AIO operations by file descriptor, subject to capability rights.
|
|
##
|
|
olio_listio
|
|
|
|
##
|
|
## Operations relative to directory capabilities.
|
|
##
|
|
chflagsat
|
|
faccessat
|
|
fchmodat
|
|
fchownat
|
|
fstatat
|
|
futimesat
|
|
linkat
|
|
mkdirat
|
|
mkfifoat
|
|
mknodat
|
|
openat
|
|
readlinkat
|
|
renameat
|
|
symlinkat
|
|
unlinkat
|
|
funlinkat
|
|
utimensat
|
|
|
|
##
|
|
## Process descriptor-related system calls are allowed.
|
|
##
|
|
pdfork
|
|
pdgetpid
|
|
pdkill
|
|
#pdwait4 # not yet implemented
|
|
|
|
##
|
|
## Allow pipe(2).
|
|
##
|
|
pipe
|
|
pipe2
|
|
|
|
##
|
|
## Allow poll(2), which will be scoped by capability rights.
|
|
##
|
|
poll
|
|
ppoll
|
|
|
|
##
|
|
## Allow I/O-related file descriptors, subject to capability rights.
|
|
##
|
|
posix_fallocate
|
|
pread
|
|
preadv
|
|
|
|
##
|
|
## Allow access to profiling state on the current process.
|
|
##
|
|
profil
|
|
|
|
##
|
|
## Disallow ptrace(2) for now, but we do need debugging facilities in
|
|
## capability mode, so we will want to revisit this, possibly by scoping its
|
|
## operation.
|
|
##
|
|
#ptrace
|
|
|
|
##
|
|
## Allow I/O-related file descriptors, subject to capability rights.
|
|
##
|
|
pwrite
|
|
pwritev
|
|
read
|
|
readv
|
|
recv
|
|
recvfrom
|
|
recvmsg
|
|
|
|
##
|
|
## Allow real-time scheduling primitives to be used.
|
|
##
|
|
## XXXRW: These require scoping.
|
|
##
|
|
rtprio
|
|
rtprio_thread
|
|
|
|
##
|
|
## Allow simple VM operations on the current process.
|
|
##
|
|
sbrk
|
|
|
|
##
|
|
## Allow querying trivial global scheduler state.
|
|
##
|
|
sched_get_priority_max
|
|
sched_get_priority_min
|
|
|
|
##
|
|
## Allow various thread/process scheduler operations.
|
|
##
|
|
## XXXRW: Some of these require further scoping.
|
|
##
|
|
sched_getparam
|
|
sched_getscheduler
|
|
sched_rr_get_interval
|
|
sched_setparam
|
|
sched_setscheduler
|
|
sched_yield
|
|
|
|
##
|
|
## Allow I/O-related file descriptors, subject to capability rights.
|
|
## NOTE: Corresponding sysents are initialized in sys/netinet/sctp_syscalls.c
|
|
## with SYF_CAPENABLED.
|
|
##
|
|
sctp_generic_recvmsg
|
|
sctp_generic_sendmsg
|
|
sctp_generic_sendmsg_iov
|
|
sctp_peeloff
|
|
|
|
##
|
|
## Allow pselect(2) and select(2), which will be scoped by capability rights.
|
|
##
|
|
## XXXRW: But is it?
|
|
##
|
|
pselect
|
|
select
|
|
|
|
##
|
|
## Allow I/O-related file descriptors, subject to capability rights. Use of
|
|
## explicit addresses here is restricted by the system calls themselves.
|
|
##
|
|
send
|
|
sendfile
|
|
sendmsg
|
|
sendto
|
|
|
|
##
|
|
## Allow setting per-process audit state, which is controlled separately by
|
|
## privileges.
|
|
##
|
|
setaudit
|
|
setaudit_addr
|
|
setauid
|
|
|
|
##
|
|
## Allow setting thread context.
|
|
##
|
|
setcontext
|
|
|
|
##
|
|
## Allow setting current process credential state, which is controlled
|
|
## separately by privilege.
|
|
##
|
|
setegid
|
|
seteuid
|
|
setgid
|
|
|
|
##
|
|
## Allow use of the process interval timer.
|
|
##
|
|
setitimer
|
|
|
|
##
|
|
## Allow setpriority(2).
|
|
##
|
|
## XXXRW: Requires scoping.
|
|
##
|
|
setpriority
|
|
|
|
##
|
|
## Allow setting current process credential state, which is controlled
|
|
## separately by privilege.
|
|
##
|
|
setregid
|
|
setresgid
|
|
setresuid
|
|
setreuid
|
|
|
|
##
|
|
## Allow setting process resource limits with setrlimit(2).
|
|
##
|
|
setrlimit
|
|
|
|
##
|
|
## Allow creating a new session with setsid(2).
|
|
##
|
|
setsid
|
|
|
|
##
|
|
## Allow setting socket options with setsockopt(2), subject to capability
|
|
## rights.
|
|
##
|
|
## XXXRW: Might require scoping.
|
|
##
|
|
setsockopt
|
|
|
|
##
|
|
## Allow setting current process credential state, which is controlled
|
|
## separately by privilege.
|
|
##
|
|
setuid
|
|
|
|
##
|
|
## shm_open(2) is scoped so as to allow only access to new anonymous objects.
|
|
##
|
|
shm_open
|
|
shm_open2
|
|
|
|
##
|
|
## Allow I/O-related file descriptors, subject to capability rights.
|
|
##
|
|
shutdown
|
|
|
|
##
|
|
## Allow signal control on current process.
|
|
##
|
|
sigaction
|
|
sigaltstack
|
|
sigblock
|
|
sigfastblock
|
|
sigpending
|
|
sigprocmask
|
|
sigqueue
|
|
sigreturn
|
|
sigsetmask
|
|
sigstack
|
|
sigsuspend
|
|
sigtimedwait
|
|
sigvec
|
|
sigwaitinfo
|
|
sigwait
|
|
|
|
##
|
|
## Allow creating new socket pairs with socket(2) and socketpair(2).
|
|
##
|
|
socket
|
|
socketpair
|
|
|
|
##
|
|
## Allow simple VM operations on the current process.
|
|
##
|
|
## XXXRW: Kernel doesn't implement this, so drop?
|
|
##
|
|
sstk
|
|
|
|
##
|
|
## Do allow sync(2) for now, but possibly shouldn't.
|
|
##
|
|
sync
|
|
|
|
##
|
|
## Always allow process termination with sys_exit(2).
|
|
##
|
|
sys_exit
|
|
|
|
##
|
|
## sysarch(2) does rather diverse things, but is required on at least i386
|
|
## in order to configure per-thread data. As such, it's scoped on each
|
|
## architecture.
|
|
##
|
|
sysarch
|
|
|
|
##
|
|
## Allow thread operations operating only on current process.
|
|
##
|
|
thr_create
|
|
thr_exit
|
|
thr_kill
|
|
|
|
##
|
|
## Disallow thr_kill2(2), as it may operate beyond the current process.
|
|
##
|
|
## XXXRW: Requires scoping.
|
|
##
|
|
#thr_kill2
|
|
|
|
##
|
|
## Allow thread operations operating only on current process.
|
|
##
|
|
thr_new
|
|
thr_self
|
|
thr_set_name
|
|
thr_suspend
|
|
thr_wake
|
|
|
|
##
|
|
## Allow manipulation of the current process umask with umask(2).
|
|
##
|
|
umask
|
|
|
|
##
|
|
## Allow submitting of process trace entries with utrace(2).
|
|
##
|
|
utrace
|
|
|
|
##
|
|
## Allow generating UUIDs with uuidgen(2).
|
|
##
|
|
uuidgen
|
|
|
|
##
|
|
## Allow I/O-related file descriptors, subject to capability rights.
|
|
##
|
|
write
|
|
writev
|
|
|
|
##
|
|
## Allow processes to yield(2).
|
|
##
|
|
yield
|