list lock, as there has been a report that an alternative lock order
is getting introduced. This should help ferret it out.
Reported by: Ed Maste <emaste at phaedrus dot sandvine dot ca>
IFF_DRV_RUNNING, as well as the move from ifnet.if_flags to
ifnet.if_drv_flags. Device drivers are now responsible for
synchronizing access to these flags, as they are in if_drv_flags. This
helps prevent races between the network stack and device driver in
maintaining the interface flags field.
Many __FreeBSD__ and __FreeBSD_version checks maintained and continued;
some less so.
Reviewed by: pjd, bz
MFC after: 7 days
not holding any non-sleep-able-locks locks when copyin is called.
This gets executed un-conditionally since we have no function
to wire the buffer in this direction.
Pointed out by: truckman
MFC after: 1 week
event handler, dev_clone, which accepts a credential argument.
Implementors of the event can ignore it if they're not interested,
and most do. This avoids having multiple event handler types and
fall-back/precedence logic in devfs.
This changes the kernel API for /dev cloning, and may affect third
party packages containg cloning kernel modules.
Requested by: phk
MFC after: 3 days
the buffer has not been wired and we are holding any non-sleep-able locks,
drop a witness warning. If the buffer has not been wired, it is possible
that the writing of the data can sleep, especially if the page is not in
memory. This can result in a number of different locking issues, including
dead locks.
MFC after: 1 week
Discussed with: rwatson
Reviewed by: jhb
integer to an unsigned long. This lifts variables like the maximum
number of pages available for shared memory from 2^31 to 2^32 on 32
bit architectures, and from 2^31 to 2^64 on 64 bit architectures.
It should be noted that this changes breaks ABI on 64 bit architectures
because the size of the shmmax, shmmin, shmmni, shmseg and shmall members
of the shminfo structure has changed.
Silence on: current@
vnode is inactivated), possibly leading to a NULL dereference when
checking if the mount wants knotes to be activated in the VOP hooks.
So, we add a new vnode flag VV_NOKNOTE that is only set in getnewvnode(),
if necessary, and check it when activating knotes.
Since the flags are not erased when a vnode is being held, we can safely
read them.
Reviewed by: kris@
MFC after: 3 days
lists, as well as accessor macros. For now, this is a recursive mutex
due code sequences where IPv4 multicast calls into IGMP calls into
ip_output(), which then tests for a multicast forwarding case.
For support macros in in_var.h to check multicast address lists, assert
that in_multi_mtx is held.
Acquire in_multi_mtx around iteration over the IPv4 multicast address
lists, such as in ip_input() and ip_output().
Acquire in_multi_mtx when manipulating the IPv4 layer multicast addresses,
as well as over the manipulation of ifnet multicast address lists in order
to keep the two layers in sync.
Lock down accesses to IPv4 multicast addresses in IGMP, or assert the
lock when performing IGMP join/leave events.
Eliminate spl's associated with IPv4 multicast addresses, portions of
IGMP that weren't previously expunged by IGMP locking.
Add in_multi_mtx, igmp_mtx, and if_addr_mtx lock order to hard-coded
lock order in WITNESS, in that order.
Problem reported by: Ed Maste <emaste at phaedrus dot sandvine dot ca>
MFC after: 10 days
caller by saving the stack of the last locker/unlocker in lockmgr. We
also put the stack in KTR at the moment.
Contributed by: Antoine Brodin <antoine.brodin@laposte.net>
kenv environment in kern_environment.c switches to dynamic kenv. The prior
call sets the static variable hintp to the static hints in subr_hints.c
(hintmode==0).
However, changes to the environment are not detected by the resource_xxx
lookups after the change to dynamic kernel environment, so the lookup
routines only report the old stuff of hintmode==0, even after the change to
the dynamic kenv. This causes kenv users to see a different environment than
the kernel routines.
This is a problem in the mixer.c code that looks up initial mixer volume
settings from the hints: If the hints are dynamic and not from the
device.hints file, mixer.c doesn't see them, but kenv does.
The patch from the PR (modified to comply to the style of the function)
solves this.
PR: 83686
Submitted by: Harry Coin <harrycoin@qconline.com>
This has no security implications since only root is allowed to use
kenv(1) (and corrupt the kernel memory after adding too much variables
previous to this commit).
This is based upon the PR [1] mentioned below, but extended to check both
bounds (in case of an overflow of the counting variable) and to comply
to the style of the function. An overflow of the counting variable
shouldn't happen after adding the check for the upper bound, but better
safe than sorry (in case some other function in the kernel overwrites
random memory).
An interested soul may want to add a printf to notify root in case the
bounds are hit.
Also allocate KENV_SIZE+1 entries (the array is NULL-terminated), since
the comment for KENV_SIZE says it's the maximum number of environment
strings. [2]
PR: 83687 [1]
Submitted by: Harry Coin <harrycoin@qconline.com> [1]
Submitted by: Ariff Abdullah <skywizard@MyBSD.org.my> [2]
was not compiled with 'options HWPMC_HOOKS' or if the compiled-in
version numbers of the kernel and module are out of sync.
Reported by: cracauer
MFC after: 3 days
Make sure that there actually is a next packet before setting
nextrecord to that field.
PR: 83885
Submitted by: hirose@comm.yamaha.co.jp
Obtained from: Patch suggested in the PR
MFC after: 1 week
- increase number of allocations count only on successfull malloc(9),
so it doesn't confuse people;
- because we need to check if 'size > 0', hide 'mtsp->mts_memalloced += size;'
under the check as well, as for size=0 it is of course a no-op;
- avoid critical_enter()/critical_exit() in case of failure in
malloc_type_allocated() as there will be nothing to do.
OK'ed by: rwatson
MFC after: 2 days
make the b_iodone callback responsible for setting it if it is needed.
Previously, it was set unconditionally by bufdone() without holding
whichever lock is shared by the b_iodone callback and the corresponding
top-half function. Consequently, in a race, the top-half function could
conclude that operation was done before the b_iodone callback finished.
See, for example, aio_physwakeup() and aio_fphysio().
Note: I don't believe that the other, more widely-used b_iodone callbacks
are affected.
Discussed with: jeff
Reviewed by: phk
MFC after: 2 weeks
vnlru proc is extremely inefficient, potentially iteration over tens of
thousands of vnodes without blocking. Droping Giant allows other threads
to preempt us although we should revisit the algorithm to fix the runtime
problems especially since this may hold up all vnode allocations.
- Remove the LK_NOWAIT from the VOP_LOCK in vlrureclaim. This provides
a natural blocking point to help alleviate the situation described above
although it may not technically be desirable.
- yield after we make a pass on all mount points to prevent us from
blocking other threads which require Giant.
MFC after: 2 weeks
hokie and much more readable and expand the comment to explain why it is
the way that it is.
- Close a race where one CPU could free the process belonging to a thread
on another CPU that hasn't quite finished exiting yet but is beyond the
point of setting the process state as PRS_ZOMBIE.
Reported and tested by: ps (2)
MFC after: 3 days
are string names for their respective UMA zones and malloc types, and
are passed into uma_zcreate() and MALLOC_DEFINE(). Export them
outside of _KERNEL in mbuf.h so that netstat can reference them.
Change the names to improve consistency, with each zone/type
associated with the mbuf allocator being prefixed mbuf_.
MFC after: 1 week
variables rather than void * variables. This makes it easier and simpler
to get asm constraints and volatile keywords correct.
MFC after: 3 days
Tested on: i386, alpha, sparc64
Compiled on: ia64, powerpc, amd64
Kernel toolchain busted on: arm
statistics via a binary structure stream:
- Add structure 'malloc_type_stream_header', which defines a stream
version, definition of MAXCPUS used in the stream, and a number of
malloc_type records in the stream.
- Add structure 'malloc_type_header', which defines the name of the
malloc type being reported on.
- When the sysctl is queried, return a stream header, followed by a
series of type descriptions, each consisting of a type header
followed by a series of MAXCPUS malloc_type_stats structures holding
per-CPU allocation information. Typical values of MAXCPUS will be 1
(UP compiled kernel) and 16 (SMP compiled kernel).
This query mechanism allows user space monitoring tools to extract
memory allocation statistics in a machine-readable form, and to do so
at a per-CPU granularity, allowing monitoring of allocation patterns
across CPUs in order to better understand the distribution of work and
memory flow over multiple CPUs.
While here:
- Bump statistics width to uint64_t, and hard code using fixed-width
type in order to be more sure about structure layout in the stream.
We allocate and free a lot of memory.
- Add kmemcount, a counter of the number of registered malloc types,
in order to avoid excessive manual counting of types. Export via a
new sysctl to allow user-space code to better size buffers.
- De-XXX comment on no longer maintaining the high watermark in old
sysctl monitoring code.
A follow-up commit of libmemstat(3), a library to monitor kernel memory
allocation, will occur in the next few days. Likewise, similar changes
to UMA.
process that caused the clone event to take place for the device driver
creating the device. This allows cloned device drivers to adapt the
device node based on security aspects of the process, such as the uid,
gid, and MAC label.
- Add a cred reference to struct cdev, so that when a device node is
instantiated as a vnode, the cloning credential can be exposed to
MAC.
- Add make_dev_cred(), a version of make_dev() that additionally
accepts the credential to stick in the struct cdev. Implement it and
make_dev() in terms of a back-end make_dev_credv().
- Add a new event handler, dev_clone_cred, which can be registered to
receive the credential instead of dev_clone, if desired.
- Modify the MAC entry point mac_create_devfs_device() to accept an
optional credential pointer (may be NULL), so that MAC policies can
inspect and act on the label or other elements of the credential
when initializing the skeleton device protections.
- Modify tty_pty.c to register clone_dev_cred and invoke make_dev_cred(),
so that the pty clone credential is exposed to the MAC Framework.
While currently primarily focussed on MAC policies, this change is also
a prerequisite for changes to allow ptys to be instantiated with the UID
of the process looking up the pty. This requires further changes to the
pty driver -- in particular, to immediately recycle pty nodes on last
close so that the credential-related state can be recreated on next
lookup.
Submitted by: Andrew Reisse <andrew.reisse@sparta.com>
Obtained from: TrustedBSD Project
Sponsored by: SPAWAR, SPARTA
MFC after: 1 week
MFC note: Merge to 6.x, but not 5.x for ABI reasons
syscalls.master for the master list and the Alpha/OSF1 compat ABI to be
consistent with all the other compat ABIs where 'make sysent' already
works.
MFC after: 3 days
address, writting non-canonical address can cause kernel a panic,
by restricting base values to 0..VM_MAXUSER_ADDRESS, ensuring
only canonical values get written to the registers.
Reviewed by: peter, Josepha Koshy < joseph.koshy at gmail dot com >
Approved by: re (scottl)
and writev() except that they take an additional offset argument and do
not change the current file position. In SAT speak:
preadv:readv::pread:read and pwritev:writev::pwrite:write.
- Try to reduce code duplication some by merging most of the old
kern_foov() and dofilefoo() functions into new dofilefoo() functions
that are called by kern_foov() and kern_pfoov(). The non-v functions
now all generate a simple uio on the stack from the passed in arguments
and then call kern_foov(). For example, read() now just builds a uio and
calls kern_readv() and pwrite() just builds a uio and calls kern_pwritev().
PR: kern/80362
Submitted by: Marc Olzheim marcolz at stack dot nl (1)
Approved by: re (scottl)
MFC after: 1 week
which is invoked from socket() and socketpair(), permitting MAC
policy modules to control the creation of sockets by domain, type, and
protocol.
Obtained from: TrustedBSD Project
Sponsored by: SPARTA, SPAWAR
Approved by: re (scottl)
Requested by: SCC
- Introducing the possibility of using locks different than mutexes
for the knlist locking. In order to do this, we add three arguments to
knlist_init() to specify the functions to use to lock, unlock and
check if the lock is owned. If these arguments are NULL, we assume
mtx_lock, mtx_unlock and mtx_owned, respectively.
- Using the vnode lock for the knlist locking, when doing kqueue operations
on a vnode. This way, we don't have to lock the vnode while holding a
mutex, in filt_vfsread.
Reviewed by: jmg
Approved by: re (scottl), scottl (mentor override)
Pointyhat to: ssouhlal
Will be happy: everyone
- pmcstat(8) gprof output mode fixes:
lib/libpmc/pmclog.{c,h}, sys/sys/pmclog.h:
+ Add a 'is_usermode' field to the PMCLOG_PCSAMPLE event
+ Add an 'entryaddr' field to the PMCLOG_PROCEXEC event,
so that pmcstat(8) can determine where the runtime loader
/libexec/ld-elf.so.1 is getting loaded.
sys/kern/kern_exec.c:
+ Use a local struct to group the entry address of the image being
exec()'ed and the process credential changed flag to the exec
handling hook inside hwpmc(4).
usr.sbin/pmcstat/*:
+ Support "-k kernelpath", "-D sampledir".
+ Implement the ELF bits of 'gmon.out' profile generation in a new
file "pmcstat_log.c". Move all log related functions to this
file.
+ Move local definitions and prototypes to "pmcstat.h"
- Other bug fixes:
+ lib/libpmc/pmclog.c: correctly handle EOF in pmclog_read().
+ sys/dev/hwpmc_mod.c: unconditionally log a PROCEXIT event to all
attached PMCs when a process exits.
+ sys/sys/pmc.h: correct a function prototype.
+ Improve usage checks in pmcstat(8).
Approved by: re (blanket hwpmc)
This is good enough to be able to run a RELENG_4 gdb binary against
a RELENG_4 application, along with various other tools (eg: 4.x gcore).
We use this at work.
ia32_reg.[ch]: handle the 32 bit register file format, used by ptrace,
procfs and core dumps.
procfs_*regs.c: vary the format of proc/XXX/*regs depending on the client
and target application.
procfs_map.c: Don't print a 64 bit value to 32 bit consumers, or their
sscanf fails. They expect an unsigned long.
imgact_elf.c: produce a valid 32 bit coredump for 32 bit apps.
sys_process.c: handle 32 bit consumers debugging 32 bit targets. Note
that 64 bit consumers can still debug 32 bit targets.
IA64 has got stubs for ia32_reg.c.
Known limitations: a 5.x/6.x gdb uses get/setcontext(), which isn't
implemented in the 32/64 wrapper yet. We also make a tiny patch to
gdb pacify it over conflicting formats of ld-elf.so.1.
Approved by: re