Commit Graph

12640 Commits

Author SHA1 Message Date
trasz
b2747e472e Fix use-after-free in kern_jail_set() triggered e.g. by attempts
to clear "persist" flag from empty persistent jail, like this:

jail -c persist=1
jail -n 1 -m persist=0

Submitted by:	Mateusz Guzik <mjguzik at gmail dot com>
MFC after:	2 weeks
2012-05-22 19:43:20 +00:00
trasz
a25d879040 Don't leak locks in prison_racct_modify().
Submitted by:	Mateusz Guzik <mjguzik at gmail dot com>
MFC after:	2 weeks
2012-05-22 17:30:02 +00:00
trasz
3a811deac7 Fix panic with RACCT that could occur in low memory (or out of swap)
situations, due to fork1() calling racct_proc_exit() without calling
racct_proc_fork() first.

Submitted by:	Mateusz Guzik <mjguzik at gmail dot com> (earlier version)
Reviewed by:	Mateusz Guzik <mjguzik at gmail dot com>
2012-05-22 15:58:27 +00:00
harti
c7e30562ca Make dumptid non-static. It is used by libkvm to detect whether
this is a VNET-kernel or not. gcc used to put the static symbol into
the symbol table, clang does not. This fixes the 'netstat: no namelist'
error seen on clang+VNET systems.
2012-05-22 07:23:41 +00:00
melifaro
34ec5c8650 Fix old panic when BPF consumer attaches to destroying interface.
'flags' field is added to the end of bpf_if structure. Currently the only
flag is BPFIF_FLAG_DYING which is set on bpf detach and checked by bpf_attachd()
Problem can be easily triggered on SMP stable/[89] by the following command (sort of):
'while true; do ifconfig vlan222 create vlan 222 vlandev em0 up ; tcpdump -pi vlan222 & ; ifconfig vlan222 destroy ; done'

Fix possible use-after-free when BPF detaches itself from interface, freeing bpf_bif memory,
while interface is still UP and there can be routes via this interface.
Freeing is now delayed till ifnet_departure_event is received via eventhandler(9) api.

Convert bpfd rwlock back to mutex due lack of performance gain (currently checking if packet
matches filter is done without holding bpfd lock and we have to acquire write lock if packet matches)

Approved by:      kib(mentor)
MFC in:            4 weeks
2012-05-21 22:17:29 +00:00
iwasaki
31eddd58e3 Add SMP/i386 suspend/resume support.
Most part is merged from amd64.

- i386/acpica/acpi_wakecode.S
Replaced with amd64 code (from realmode to paging enabling code).

- i386/acpica/acpi_wakeup.c
Replaced with amd64 code (except for wakeup_pagetables stuff).

- i386/include/pcb.h
- i386/i386/genassym.c
Added PCB new members (CR0, CR2, CR4, DS, ED, FS, SS, GDT, IDT, LDT
and TR) needed for suspend/resume, not for context switch.

- i386/i386/swtch.s
Added suspendctx() and resumectx().
Note that savectx() was not changed and used for suspending (while
amd64 code uses it).
BSP and AP execute the same sequence, suspendctx(), acpi_wakecode()
and resumectx() for suspend/resume (in case of UP system also).

- i386/i386/apic_vector.s
Added cpususpend().

- i386/i386/mp_machdep.c
- i386/include/smp.h
Added cpususpend_handler().

- i386/include/apicvar.h
- kern/subr_smp.c
- sys/smp.h
Added IPI_SUSPEND and suspend_cpus().

- i386/i386/initcpu.c
- i386/i386/machdep.c
- i386/include/md_var.h
- pc98/pc98/machdep.c
Moved initializecpu() declarations to md_var.h.

MFC after:	3 days
2012-05-18 18:55:58 +00:00
gleb
3288f283ff Skip directory entries with zero inode number during traversal.
Entries with zero inode number are considered placeholders by libc and
UFS.  Fix remaining uses of VOP_READDIR in kernel: vop_stdvptocnp,
unionfs.

Sponsored by:	Google Summer of Code 2011
2012-05-16 10:44:09 +00:00
pluknet
7aab7d56be Fix typo in function name SDT_PROBE4 and unbreak 4BSD UP. 2012-05-15 10:58:17 +00:00
gber
112a2e964f Do not call bremfree for managed buffers.
Calling bremfree for these buffers results in panic:
"bremfree: buffer %p not on a queue."

Approved by: kib
2012-05-15 09:55:15 +00:00
rstone
a059a0e086 Implement the DTrace sched provider. This implementation aims to be
compatible with the sched provider implemented by Solaris and its open-
source derivatives.  Full documentation of the sched provider can be found
on Oracle's DTrace wiki pages.

Note that for compatibility with scripts originally written for Solaris,
serveral probes are defined that will never fire.  These probes are defined
to fire when Solaris-specific features perform certain actions.  As these
features are not present in FreeBSD, the probes can never fire.

Also, I have added a two probes that are not defined in Solaris, lend-pri
and load-change.  These probes have been added to make it possible to
collect schedgraph data with DTrace.

Finally, a few probes are defined in Solaris to take a cpuinfo_t *
argument.  As it was not immediately clear to me how to translate that to
FreeBSD, currently those probes are passed NULL in place of a cpuinfo_t *.

Sponsored by: Sandvine Incorporated
MFC after:	2 weeks
2012-05-15 01:30:25 +00:00
delphij
53e510d1ef Revert previous revision, misunderstood the code :( 2012-05-11 23:43:32 +00:00
delphij
f7e33a4a67 Release proc lock after setting signal queue.
PR:		kern/167727
Submitted by:	Jinjun Gao <gjinjun gmail com>
MFC after:	2 weeks
2012-05-11 23:41:52 +00:00
kib
c5f120d09b Move the code to call the callout callback into the helper function
softclock_call_cc(). While there, move some common code to callout_cc_del().

Requested by:	avg, jhb
Reviewed by:	jhb
MFC after:    1 week
2012-05-03 20:00:30 +00:00
kib
9e5fca0368 When callout_reset_on() cannot immediately migrate a callout since it
is running on other cpu, the CALLOUT_PENDING flag is temporarily
cleared. Then, callout_stop() on this, in fact active, callout fails
because CALLOUT_PENDING is not set, and callout_stop() returns 0.

Now, in sleepq_check_timeout(), the failed callout_stop() causes the
sleepq code to execute mi_switch() without even setting the wmesg,
since the switch-out is supposed to be transient. In fact, the thread
is put off the CPU for full timeout interval, instead of being put on
runq immediately.  Until timeout fires, the process is unkillable for
obvious reasons.

Fix this by marking the migrating callouts with CALLOUT_DFRMIGRATION
flag. The flag is cleared by callout_stop_safe() when the function
detects a migration, besides returning the success. The softclock()
rechecks the flag for migrating callout and cancels its execution if
the flag was cleared meantime.

PR:	 misc/166340
Reported, debugging traces provided and tested by:
	Christian Esken <christian.esken trivago com>
Reviewed by:	 avg, jhb
MFC after:	 1 week
2012-05-03 10:38:02 +00:00
jhb
c96b8c07a4 - Don't log messages saying that accounting is being disabled and enabled
if the accounting log file is atomically replaced with a new file
  (such as during log rotation).
- Simplify accounting log rotation a bit.  There is no need to re-run
  accton(8) after renaming the new log file to it's real name.

PR:		kern/167321
Tested by:	Jeremy Chadwick
2012-05-02 14:25:39 +00:00
kib
0e86d1558c Allow for the process information sysctls to accept a thread id in addition
to the process id.  It follows the ptrace(2) interface and allows debugging
libraries to use thread ids directly, without slow and verbose conversion
of thread id into pid.

The PGET_NOTID flag is provided to allow a specific sysctl to disallow
this behaviour.  All current callers of pget(9) have useful semantic to
operate on tid and do not need this flag.

Reviewed by:	jhb, trocini
MFC after:	1 week
2012-04-23 20:56:05 +00:00
trasz
023bd7c6bf Remove unused thread argument to vrecycle().
Reviewed by:	kib
2012-04-23 14:10:34 +00:00
trasz
baac623cd9 Remove unused thread argument from vtruncbuf().
Reviewed by:	kib
2012-04-23 13:21:28 +00:00
jhb
aa85973504 Include the associated wait channel message for context switch ktrace
records.  kdump supports both the old and new messages.

Submitted by:	Andrey Zonov  andrey zonov org
MFC after:	1 week
2012-04-20 15:32:36 +00:00
jh
433fc8eeff The value of flags matching VNOVAL can't be supported. Return EOPNOTSUPP
from setfflags() in this case. This fixes the return value of
chflags(path, -1).

Discussed with:	bde
MFC after:	2 weeks
2012-04-20 10:08:30 +00:00
mckusick
d9895ac1fe This update uses the MNT_VNODE_FOREACH_ACTIVE interface that loops
over just the active vnodes associated with a mount point to replace
MNT_VNODE_FOREACH_ALL in the vfs_msync, ffs_sync_lazy, and qsync
routines.

The vfs_msync routine is run every 30 seconds for every writably
mounted filesystem. It ensures that any files mmap'ed from the
filesystem with modified pages have those pages queued to be
written back to the file from which they are mapped.

The ffs_lazy_sync and qsync routines are run every 30 seconds for
every writably mounted UFS/FFS filesystem. The ffs_lazy_sync routine
ensures that any files that have been accessed in the previous
30 seconds have had their access times queued for updating in the
filesystem. The qsync routine ensures that any files with modified
quotas have those quotas queued to be written back to their
associated quota file.

In a system configured with 250,000 vnodes, less than 1000 are
typically active at any point in time. Prior to this change all
250,000 vnodes would be locked and inspected twice every minute
by the syncer. For UFS/FFS filesystems they would be locked and
inspected six times every minute (twice by each of these three
routines since each of these routines does its own pass over the
vnodes associated with a mount point). With this change the syncer
now locks and inspects only the tiny set of vnodes that are active.

Reviewed by: kib
Tested by:   Peter Holm
MFC after:   2 weeks
2012-04-20 07:00:28 +00:00
mckusick
5b7b29e35b This change creates a new list of active vnodes associated with
a mount point. Active vnodes are those with a non-zero use or hold
count, e.g., those vnodes that are not on the free list. Note that
this list is in addition to the list of all the vnodes associated
with a mount point.

To avoid adding another set of linkage pointers to the vnode
structure, the active list uses the existing linkage pointers
used by the free list (previously named v_freelist, now renamed
v_actfreelist).

This update adds the MNT_VNODE_FOREACH_ACTIVE interface that loops
over just the active vnodes associated with a mount point (typically
less than 1% of the vnodes associated with the mount point).

Reviewed by: kib
Tested by:   Peter Holm
MFC after:   2 weeks
2012-04-20 06:50:44 +00:00
mckusick
a9a210460f Delete a no longer useful VNASSERT missed during changes in 234400.
Suggested by: kib
2012-04-18 19:34:20 +00:00
mckusick
be8731298f Fix a memory leak of M_VNODE_MARKER introduced in 234386.
Found by:  Peter Holm
2012-04-18 19:30:22 +00:00
mckusick
841f20af50 Drop export of vdestroy() function from kern/vfs_subr.c as it is
used only as a helper function in that file. Replace sole call to
vbusy() with inline code in vholdl(). Replace sole calls to vfree()
and vdestroy() with inline code in vdropl().

The Clang compiler already inlines these functions, so they do not
show up in a kernel backtrace which is confusing. Also you cannot
set their frame in kgdb which means that it is impossible to view
their local variables. So, while the produced code is unchanged,
the debugging should be easier.

Discussed with: kib
MFC after:      2 weeks
2012-04-17 21:46:59 +00:00
mckusick
ffee40eeff Replace the MNT_VNODE_FOREACH interface with MNT_VNODE_FOREACH_ALL.
The primary changes are that the user of the interface no longer
needs to manage the mount-mutex locking and that the vnode that
is returned has its mutex locked (thus avoiding the need to check
to see if its is DOOMED or other possible end of life senarios).

To minimize compatibility issues for third-party developers, the
old MNT_VNODE_FOREACH interface will remain available so that this
change can be MFC'ed to 9. Following the MFC to 9, MNT_VNODE_FOREACH
will be removed in head.

The reason for this update is to prepare for the addition of the
MNT_VNODE_FOREACH_ACTIVE interface that will loop over just the
active vnodes associated with a mount point (typically less than
1% of the vnodes associated with the mount point).

Reviewed by: kib
Tested by:   Peter Holm
MFC after:   2 weeks
2012-04-17 16:28:22 +00:00
trasz
7f09aee7a1 Fix bug where NFSv4 ACL enforcement code wouldn't unconditionally
allow the owner to read and write ACL and file attributes when there
was no entry with subject matching the owner.  In other words,
'getfacl meh' shouldn't fail for the owner if the ACL looks like this:

# file: meh
# owner: trasz
# group: wheel
         user:root:------a-------:------:allow

Reported by:	kientzle
2012-04-17 14:54:00 +00:00
trasz
29ba0a35f6 Stop treating system processes as special. This fixes panics
like the one triggered by this:

# kldload geom_vinum
# pwait `pgrep -S gv_worker` &
# kldunload geom_vinum

or this:

GEOM_JOURNAL: Shutting down geom gjournal 3464572051.
panic: destroying non-empty racct: 1 allocated for resource 6

which were tracked by jh@ to be caused by checking p->p_flag,
while it wasn't initialised yet.  Basically, during fork, the code
checked p_flag, concluded the process isn't marked as P_SYSTEM,
incremented the counter, and later on, when exiting, checked that
the process was marked as P_SYSTEM, and thus didn't decrement it.

Also, I believe there wasn't any good reason for checking P_SYSTEM
in the first place.

Tested by:	jh
2012-04-17 14:31:02 +00:00
trasz
a41bb18a29 Fix panic, triggered like this: "int main() { thr_exit(); }"
Submitted by:	Mateusz Guzik
2012-04-17 13:44:40 +00:00
trasz
c37ffba90a Enforce upper bound on the input buffer length.
Reported by:	Mateusz Guzik
2012-04-17 13:28:14 +00:00
jkim
e210f689a8 - Implement pipe2 syscall for Linuxulator. This syscall appeared in 2.6.27
but GNU libc used it without checking its kernel version, e. g., Fedora 10.
- Move pipe(2) implementation for Linuxulator from MD files to MI file,
sys/compat/linux/linux_file.c.  There is no MD code for this syscall at all.
- Correct an argument type for pipe() from l_ulong * to l_int *.  Probably
this was the source of MI/MD confusion.

Reviewed by:	emulation
2012-04-16 21:22:02 +00:00
davide
63cc567af5 Fix a typo.
Approved by:	gnn (mentor)
MFC after:	2 days
2012-04-14 23:59:58 +00:00
davide
ff8b0a29f3 Fix some style bugs introduced in a previous commit (r233045)
Reported by:	glebius, jmallet
Reviewed by:	jmallet
Approved by:	gnn (mentor)
MFC after:	2 days
2012-04-14 23:53:31 +00:00
marius
6f1427f0e6 Fix !DDB build after r234190. 2012-04-14 11:21:24 +00:00
adrian
2c73480574 Use strdup() on the name (and free it when it's done) so non-static names
can be used in firmware_register().
2012-04-13 04:22:42 +00:00
jhb
20ac4e4f81 - Extend the KDB interface to add a per-debugger callback to print a
backtrace for an arbitrary thread (rather than the calling thread).
  A kdb_backtrace_thread() wrapper function uses the configured debugger
  if possible, otherwise it falls back to using stack(9) if that is
  available.
- Replace a direct call to db_trace_thread() in propagate_priority()
  with a call to kdb_backtrace_thread() instead.

MFC after:	1 week
2012-04-12 17:43:59 +00:00
jhb
51ec6999bb If a linker file contains at least one module, but all of the modules
fail to load (the MOD_LOAD event fails) during a kldload(2), unload the
linker file and fail the kldload(2) with ENOEXEC.

Reported by:	gcooper
MFC after:	1 week
2012-04-12 14:49:25 +00:00
kib
319ab382ef Add thread-private flag to indicate that error value is already placed
in td_errno. Flag is supposed to be used by syscalls returning
EJUSTRETURN because errno was already placed into the usermode frame
by a call to set_syscall_retval(9). Both ktrace and dtrace get errno
value from td_errno if the flag is set.

Use the flag to fix sigsuspend(2) error return ktrace records.

Requested by:	bde
MFC after:	1 week
2012-04-12 10:48:43 +00:00
mckusick
7901256b30 Export vinactive() from kern/vfs_subr.c (e.g., make it no longer
static and declare its prototype in sys/vnode.h) so that it can be
called from process_deferred_inactive() (in ufs/ffs/ffs_snapshot.c)
instead of the body of vinactive() being cut and pasted into
process_deferred_inactive().

Reviewed by: kib
MFC after:   2 weeks
2012-04-11 23:01:11 +00:00
jhb
294ae9574d Allow device_busy() and device_unbusy() to be invoked while a device is
being attached.  This is implemented by adding a new DS_ATTACHING state
while a device's DEVICE_ATTACH() method is being invoked.  A driver is
required to not fail an attach of a busy device.  The device's state will
be promoted to DS_BUSY rather than DS_ACTIVE() if the device was marked
busy during DEVICE_ATTACH().

Reviewed by:	kib
MFC after:	1 week
2012-04-11 20:57:41 +00:00
eadler
2a42c5c4e9 Return EBADF instead of EMFILE from dup2 when the second argument is
outside the range of valid file descriptors

PR:		kern/164970
Submitted by:	Peter Jeremy <peterjeremy@acm.org>
Reviewed by:	jilles
Approved by:	cperciva
MFC after:	1 week
2012-04-11 14:08:09 +00:00
jilles
4360dc9ca8 Remove unused and wrong SA_PROC internal signal property.
The SA_PROC signal property indicated whether each signal number is directed
at a specific thread or at the process in general. However, that depends on
how the signal was generated and not on the signal number. SA_PROC was not
used.
2012-04-09 21:58:58 +00:00
mav
e1ffe54fb7 Microoptimize cpu_search().
According to profiling, it makes one take 6% of CPU time on hackbench
with its million of context switches per second, instead of 8% before.
2012-04-09 18:24:58 +00:00
gleb
fb452e77b0 Add vfs_getopt_size. Support human readable file system options in tmpfs.
Increase maximum tmpfs file system size to 4GB*PAGE_SIZE on 32 bit archs.

Discussed with:	delphij
MFC after:	2 weeks
2012-04-07 15:27:34 +00:00
melifaro
8b1d10268c - Improve BPF locking model.
Interface locks and descriptor locks are converted from mutex(9) to rwlock(9).
This greately improves performance: in most common case we need to acquire 1
reader lock instead of 2 mutexes.

- Remove filter(descriptor) (reader) lock in bpf_mtap[2]
This was suggested by glebius@. We protect filter by requesting interface
writer lock on filter change.

- Cover struct bpf_if under BPF_INTERNAL define. This permits including bpf.h
without including rwlock stuff. However, this is is temporary solution,
struct bpf_if should be made opaque for any external caller.

Found by:       Dmitrij Tejblum <tejblum@yandex-team.ru>
Sponsored by:   Yandex LLC

Reviewed by:    glebius (previous version)
Reviewed by:    silence on -net@
Approved by:    (mentor)

MFC after:      3 weeks
2012-04-06 06:53:58 +00:00
jhb
5829de48d9 Add new ktrace records for the start and end of VM faults. This gives
a pair of records similar to syscall entry and return that a user can
use to determine how long page faults take.  The new ktrace records are
enabled via the 'p' trace type, and are enabled in the default set of
trace points.

Reviewed by:	kib
MFC after:	2 weeks
2012-04-05 17:13:14 +00:00
davidxu
cc55f4943b In sem_post, the field _has_waiters is no longer used, because some
application destroys semaphore after sem_wait returns. Just enter
kernel to wake up sleeping threads, only update _has_waiters if
it is safe. While here, check if the value exceed SEM_VALUE_MAX and
return EOVERFLOW if this is true.
2012-04-05 03:05:02 +00:00
davidxu
8c31e244f2 umtx operation UMTX_OP_MUTEX_WAKE has a side-effect that it accesses
a mutex after a thread has unlocked it, it event writes data to the mutex
memory to clear contention bit, there is a race that other threads
can lock it and unlock it, then destroy it, so it should not write
data to the mutex memory if there isn't any waiter.
The new operation UMTX_OP_MUTEX_WAKE2 try to fix the problem. It
requires thread library to clear the lock word entirely, then
call the WAKE2 operation to check if there is any waiter in kernel,
and try to wake up a thread, if necessary, the contention bit is set again
by the operation. This also mitgates the chance that other threads find
the contention bit and try to enter kernel to compete with each other
to wake up sleeping thread, this is unnecessary. With this change, the
mutex owner is no longer holding the mutex until it reaches a point
where kernel umtx queue is locked, it releases the mutex as soon as
possible.
Performance is improved when the mutex is contensted heavily.  On Intel
i3-2310M, the runtime of a benchmark program is reduced from 26.87 seconds
to 2.39 seconds, it even is better than UMTX_OP_MUTEX_WAKE which is
deprecated now. http://people.freebsd.org/~davidxu/bench/mutex_perf.c
2012-04-05 02:24:08 +00:00
np
307ef13f94 - Remove redundant call to pr_ctloutput from code that handles SO_SETFIB.
- Add a check for errors during copyin while here.

Reviewed by:	julian, bz
MFC after:	2 weeks
2012-04-03 18:38:00 +00:00
kib
ff6239a557 When process exists, not only the children shall be reparented to
init, but also the orphans shall be removed from the orphan list,
because the list header is destroyed.

Reported and tested by:	pho
MFC after:	3 days
2012-04-02 19:35:36 +00:00