vn_open_cred in default implementation. Valid struct ucred is needed for
audit and MAC, and curthread credentials may be wrong.
This further requires modifying the interface of vn_fullpath(9), but it
is out of scope of this change.
Reviewed by: rwatson
vn_open_cred invocations shall not audit namei path.
In particular, specify VN_OPEN_NOAUDIT for dotdot lookup performed by
default implementation of vop_vptocnp, and for the open done for core
file. vn_fullpath is called from the audit code, and vn_open there need
to disable audit to avoid infinite recursion. Core file is created on
return to user mode, that, in particular, happens during syscall return.
The creation of the core file is audited by direct calls, and we do not
want to overwrite audit information for syscall.
Reported, reviewed and tested by: rwatson
Now that we start to use credentials on character devices more often
(because of MPSAFE TTY), move the prison-checks that are in place in the
TTY code into devfs.
Instead of strictly comparing the prisons, use the more common
prison_check() function to compare credentials. This means that
pseudo-terminals are only visible in devfs by processes within the same
jail and parent jails.
Even though regular users in parent jails can now interact with
pseudo-terminals from child jails, this seems to be the right approach.
These processes are also capable of interacting with the jailed
processes anyway, through signals for example.
Reviewed by: kib, rwatson (older version)
NGROUPS_MAX, eliminate ABI dependencies on them, and raise the to 1024
and 1023 respectively. (Previously they were equal, but under a close
reading of POSIX, NGROUPS_MAX was defined to be too large by 1 since it
is the number of supplemental groups, not total number of groups.)
The bulk of the change consists of converting the struct ucred member
cr_groups from a static array to a pointer. Do the equivalent in
kinfo_proc.
Introduce new interfaces crcopysafe() and crsetgroups() for duplicating
a process credential before modifying it and for setting group lists
respectively. Both interfaces take care for the details of allocating
groups array. crsetgroups() takes care of truncating the group list
to the current maximum (NGROUPS) if necessary. In the future,
crsetgroups() may be responsible for insuring invariants such as sorting
the supplemental groups to allow groupmember() to be implemented as a
binary search.
Because we can not change struct xucred without breaking application
ABIs, we leave it alone and introduce a new XU_NGROUPS value which is
always 16 and is to be used or NGRPS as appropriate for things such as
NFS which need to use no more than 16 groups. When feasible, truncate
the group list rather than generating an error.
Minor changes:
- Reduce the number of hand rolled versions of groupmember().
- Do not assign to both cr_gid and cr_groups[0].
- Modify ipfw to cache ucreds instead of part of their contents since
they are immutable once referenced by more than one entity.
Submitted by: Isilon Systems (initial implementation)
X-MFC after: never
PR: bin/113398 kern/133867
Instead of locking the local unp followed by the remote unp, use the same
locking model as accept() and read lock the global link lock followed by
the remote unp while fetching the remote sockaddr.
Reported by: Mel Flynn mel.flynn of mailing.thruhere.net
Reviewed by: rwatson
MFC after: 1 week
allocator for the jumbo frames zones. This change has two benefits: (1) a
custom back-end deallocator is no longer required. UMA's standard
deallocator suffices. (2) It eliminates a potentially confusing artifact
of using contigmalloc(): The malloc(9) statistics contain bogus information
about the usage of jumbo frames. Specifically, the malloc(9) statistics
report all jumbo frames in use whereas the UMA zone statistics report the
"truth" about the number in use vs. the number free.
in the type field of system call tables. Specifically, one can now use
the 'NO*' types as flags in addition to the 'COMPAT*' types. For example,
to tag 'COMPAT*' system calls as living in a KLD via NOSTD. The COMPAT*
type is required to be listed first in this case.
- Add new functions 'type()' and 'flag()' to the embedded awk script in
makesyscalls.sh that return true if a requested flag is found in the
type field ($3). The flag() function checks all of the flags in the
field, but type() only checks the first flag. type() is meant to be
used in the top-level "switch" statement and flag() should be used
otherwise.
- Retire the CPT_NOA type, it is now replaced with "COMPAT|NOARGS" using
the flags approach.
- Tweak the comment descriptions of COMPAT[46] system calls so that they
say "freebsd[46] foo" rather than "old foo".
- Document the COMPAT6 type.
- Sync comments in compat32 syscall table with the master table.
- Mark nfsclnt as UNIMPL. It should have been NOSTD instead of NOIMPL back
when it lived in nfsclient.ko, but it was removed from that a long time
ago.
Actually, as it did receive few tuning, the support is disabled by
default, but it can opt-in with the option ADAPTIVE_LOCKMGRS.
Due to the nature of lockmgrs, adaptive spinning needs to be
selectively enabled for any interested lockmgr.
The support is bi-directional, or, in other ways, it will work in both
cases if the lock is held in read or write way. In particular, the
read path is passible of further tunning using the sysctls
debug.lockmgr.retries and debug.lockmgr.loops . Ideally, such sysctls
should be axed or compiled out before release.
Addictionally note that adaptive spinning doesn't cope well with
LK_SLEEPFAIL. The reason is that many (and probabilly all) consumers
of LK_SLEEPFAIL are mainly interested in knowing if the interlock was
dropped or not in order to reacquire it and re-test initial conditions.
This directly interacts with adaptive spinning because lockmgr needs
to drop the interlock while spinning in order to avoid a deadlock
(further details in the comments inside the patch).
Final note: finding someone willing to help on tuning this with
relevant workloads would be either very important and appreciated.
Tested by: jeff, pho
Requested by: many
The code that was in place in exit1() was mainly based on code from the
old TTY layer. The main reason behind this, was because at one moment I
ran a system that had two TTY layers in place at the same time. It is
now sufficient to do the following:
- Remove references from the session structure to the TTY vnode and the
session leader.
- If we have a controlling TTY and the session used by the TTY is equal
to our session, send the SIGHUP.
- If we have a vnode to the controlling TTY which has not been revoked,
revoke it.
While there, change sys/kern/tty.c to use s_ttyp in the comparison
instead of s_ttyvp. It should not make any difference, because s_ttyvp
can only become null when the session leader already left, but it's
nicer to compare against the proper value.
any open file descriptors >= 'lowfd'. It is largely identical to the same
function on other operating systems such as Solaris, DFly, NetBSD, and
OpenBSD. One difference from other *BSD is that this closefrom() does not
fail with any errors. In practice, while the manpages for NetBSD and
OpenBSD claim that they return EINTR, they ignore internal errors from
close() and never return EINTR. DFly does return EINTR, but for the common
use case (closing fd's prior to execve()), the caller really wants all
fd's closed and returning EINTR just forces callers to call closefrom() in
a loop until it stops failing.
Note that this implementation of closefrom(2) does not make any effort to
resolve userland races with open(2) in other threads. As such, it is not
multithread safe.
Submitted by: rwatson (initial version)
Reviewed by: rwatson
MFC after: 2 weeks
Right now the only way to make tcsetsid(3)/TIOCSCTTY work, is by
ensuring the session leader is dead. This means that an application that
catches SIGHUPs and performs a sleep prevents us from assigning a new
session leader.
Change the code to make it work on revoked TTYs as well. This allows us
to change init(8) to make the shutdown script run in a more clean
environment.
parameter "vnet" when it is created, a new vnet instance will be created
along with the jail. Networks interfaces can be moved between prisons
with an ioctl similar to the one that moves them between vimages.
For now vnets will co-exist under both jails and vimages, but soon
struct vimage will be going away.
Reviewed by: zec, julian
Approved by: bz (mentor)
There is an external use in the opensolaris code.
I am not sure how this ever worked but I have seen two reports of:
link_elf: symbol hardlink_check_uid undefined
lately.
Reported by: Scott Ullrich (sullrich gmail.com), pfsense
Reported by: Mister Olli (mister.olli googlemail.com)
Because our rc scripts also open the /etc/ttyv* nodes, it revokes the
console, preventing startup messages from being displayed.
I really have to think about this. Maybe we should just give the console
its own TTY and let it build on top of other TTYs. I'm still not sure
what to do with input handling there.
Even though I thought I fixed the staircase issue (and I was no longer
able to reproduce it), I got some reports of the issue still being
there. It turns out the staircase effect still occurred when
/dev/console was kept open while killing the getty on the same TTY
(ttyv0).
For some reason I can't figure out how the old TTY code dealt with that,
so I assume the issue has always been there. I only exposed it more by
merging consolectl with ttyv0, which means that the issue was present,
even on systems without a serial console.
I'm now marking the console device as being closed when closing the
regular TTY device node. This means that when the getty shuts down,
init(8) will open /dev/console, which means the termios attributes will
always be reset in this case.
in symtab_get method symtab parameter is made constant as this reflects
actual intention and usage of the method
Reviewed by: imp, current@
Approved by: jhb (mentor)
vnode interlock to protect the knote fields [1]. The locking assumes
that shared vnode lock is held, thus we get exclusive access to knote
either by exclusive vnode lock protection, or by shared vnode lock +
vnode interlock.
Do not use kl_locked() method to assert either lock ownership or the
fact that curthread does not own the lock. For shared locks, ownership
is not recorded, e.g. VOP_ISLOCKED can return LK_SHARED for the shared
lock not owned by curthread, causing false positives in kqueue subsystem
assertions about knlist lock.
Remove kl_locked method from knlist lock vector, and add two separate
assertion methods kl_assert_locked and kl_assert_unlocked, that are
supposed to use proper asserts. Change knlist_init accordingly.
Add convenience function knlist_init_mtx to reduce number of arguments
for typical knlist initialization.
Submitted by: jhb [1]
Noted by: jhb [2]
Reviewed by: jhb
Tested by: rnoland
So properly hide the already #ifdef SCTP code with
#if defined(INET) || defined(INET6) as well to get us
closer to a non-INET/INET6 kernel.
Discussed with: tuexen [1]
probe. The current device order is unchanged. This commit just adds the
infrastructure and ABI changes so that it is easier to merge later changes
into 8.x.
- Driver attachments now have an associated pass level. Attachments are
not allowed to probe or attach to drivers until the system-wide pass level
is >= the attachment's pass level. By default driver attachments use the
"last" pass level (BUS_PASS_DEFAULT). Driver's that wish to probe during
an earlier pass use EARLY_DRIVER_MODULE() instead of DRIVER_MODULE() which
accepts the pass level as an additional parameter.
- A new method BUS_NEW_PASS has been added to the bus interface. This
method is invoked when the system-wide pass level is changed to kick off
a rescan of the device tree so that drivers that have just been made
"eligible" can probe and attach.
- The bus_generic_new_pass() function provides a default implementation of
BUS_NEW_PASS(). It first allows drivers that were just made eligible for
this pass to identify new child devices. Then it propogates the rescan to
child devices that already have an attached driver by invoking their
BUS_NEW_PASS() method. It also reprobes devices without a driver.
- BUS_PROBE_NOMATCH() is only invoked for devices that do not have
an attached driver after being scanned during the final pass.
- The bus_set_pass() function is used during boot to raise the pass level.
Currently it is only called once during root_bus_configure() to raise
the pass level to BUS_PASS_DEFAULT. This has the effect of probing all
devices in a single pass identical to previous behavior.
Reviewed by: imp
Approved by: re (kib)
the ROUTETABLES kernel option thus there is no need to include opt_route.h
anymore in all consumers of vnet.h and no longer depend on it for module
builds.
Remove the hidden include in flowtable.h as well and leave the two
explicit #includes in ip_input.c and ip_output.c.
Vnet modules and protocol domains may now register destructor
functions to clean up and release per-module state. The destructor
mechanisms can be triggered by invoking "vimage -d", or a future
equivalent command which will be provided via the new jail framework.
While this patch introduces numerous placeholder destructor functions,
many of those are currently incomplete, thus leaking memory or (even
worse) failing to stop all running timers. Many of such issues are
already known and will be incrementaly fixed over the next weeks in
smaller incremental commits.
Apart from introducing new fields in structs ifnet, domain, protosw
and vnet_net, which requires the kernel and modules to be rebuilt, this
change should have no impact on nooptions VIMAGE builds, since vnet
destructors can only be called in VIMAGE kernels. Moreover,
destructor functions should be in general compiled in only in
options VIMAGE builds, except for kernel modules which can be safely
kldunloaded at run time.
Bump __FreeBSD_version to 800097.
Reviewed by: bz, julian
Approved by: rwatson, kib (re), julian (mentor)