12466 Commits

Author SHA1 Message Date
Kevin Lo
575cabed9e Fix a style bug
Spotted by:	avg
2012-01-16 14:54:48 +00:00
David Xu
29a06690ca Eliminate branch and insert an explicit reader memory barrier to ensure
that waiter bit is set before reading semaphore count.
2012-01-16 04:39:10 +00:00
Mikolaj Golub
fe7f89b71a Abrogate nchr argument in proc_getargv() and proc_getenvv(): we always want
to read strings completely to know the actual size.

As a side effect it fixes the issue with kern.proc.args and kern.proc.env
sysctls, which didn't return the size of available data when calling
sysctl(3) with the NULL argument for oldp.

Note, in get_ps_strings(), which does actual work for proc_getargv() and
proc_getenvv(), we still have a safety limit on the size of data read in
case of a corrupted process stack.

Suggested by:	kib
MFC after:	3 days
2012-01-15 18:47:24 +00:00
Martin Matuska
9cbe30e1d5 Fix omissions in r230129:
kern_jail.c: initialize fullpath_disabled to zero
vfs_cache.c: add missing dot in comment

Reported by:	kib
MFC after:	1 month
2012-01-15 18:08:15 +00:00
Ulrich Spörlein
9a14aa017b Convert files to UTF-8
2012-01-15 13:23:18 +00:00
Martin Matuska
f6e633a9e1 Introduce vn_path_to_global_path()
This function updates path string to vnode's full global path and checks
the size of the new path string against the pathlen argument.

In vfs_domount(), sys_unmount() and kern_jail_set() this new function
is used to update the supplied path argument to the respective global path.

Unbreaks jailed zfs(8) with enforce_statfs set to 1.

Reviewed by:	kib
MFC after:	1 month
2012-01-15 12:08:20 +00:00
Eitan Adler
886e862866 - Fix undefined behavior when device_get_name is null
- Make error message more informative

PR:		kern/149800
Submitted by:	olgeni
Approved by:	cperciva
MFC after:	1 week
2012-01-15 07:09:18 +00:00
Oleksandr Tymoshenko
4104e83567 Fix kernel modules loading for MIPS64 kernel:
On amd64, link_elf_obj.c must specify KERNBASE rather than
    VM_MIN_KERNEL_ADDRESS to vm_map_find() because kernel loadable
    modules must be mapped for execution in the same upper region
    of the kernel map as the kernel code and data segments.

    For MIPS32 KERNBASE lies below KVA area (it's less than
    VM_MIN_KERNEL_ADDRESS) so basically vm_map_find got whole
    KVA to look through. On MIPS64 it's not the case because
    KERNBASE is set to the very end of XKSEG, well out of KVA
    bounds, so vm_map_find always fails. We should use
    VM_MIN_KERNEL_ADDRESS as a base for vm_map_find.

Details obtained from: alc@
2012-01-14 00:36:07 +00:00
John Baldwin
fbcebf7f71 Convert the per-interface address list lock from a mutex to a reader/writer
lock.

Reviewed by:	bz
2012-01-09 19:34:12 +00:00
Andriy Gapon
90d8265326 enable stop_scheduler_on_panic by default
My plan is to make this behavior unconditional before 10.0 release.

X-MFC after:	r228424 (if ever)
2012-01-09 12:06:09 +00:00
Konstantin Belousov
3ab0160340 Avoid LOR between vfs_busy() lock and covered vnode lock on quotaon().
The vfs_busy() is after covered vnode lock in the global lock order, but
since quotaon() does recursive VFS call to open quota file, we usually
end up locking covered vnode after mp is busied in sys_quotactl().

Change the interface of VFS_QUOTACTL(), requiring that mp was unbusied
by fs code, and do not try to pick up vfs_busy() reference in ufs quotaon,
esp. if vfs_busy cannot succeed due to unmount being performed.

Reported and tested by:	pho
MFC after:	1 week
2012-01-08 23:06:53 +00:00
Alan Cox
2971897d51 Correct an error of omission in the implementation of the truncation
operation on POSIX shared memory objects and tmpfs.  Previously, neither of
these modules correctly handled the case in which the new size of the object
or file was not a multiple of the page size.  Specifically, they did not
handle partial page truncation of data stored on swap.  As a result, stale
data might later be returned to an application.

Interestingly, a data inconsistency was less likely to occur under tmpfs
than POSIX shared memory objects.  The reason is that a different mistake
in the tmpfs truncation operation helped avoid a data inconsistency.  If the
data was still resident in memory in a PG_CACHED page, then the tmpfs
truncation operation would reactivate that page, zero the truncated portion,
and leave the page pinned in memory.  More precisely, the benevolent error
was that the truncation operation didn't add the reactivated page to any of
the paging queues, effectively pinning the page.  This page would remain
pinned until the file was destroyed or the page was read or written.  With
this change, the page is now added to the inactive queue.

Discussed with:	jhb
Reviewed by:	kib (an earlier version)
MFC after:	3 weeks
2012-01-08 20:09:26 +00:00
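
The partial-page case described in 2971897d51 can be sketched in plain C. This is only an illustration of the idea, not the committed code: the flat buffer, the page size constant, and the function name are all assumptions made for the sketch. When the new size is not a multiple of the page size, the tail of the last retained page must be zeroed so that stale bytes cannot be read back later.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size for this sketch */

/*
 * Zero the bytes of the last retained page that lie beyond new_size.
 * `data` is a hypothetical flat buffer standing in for the object's
 * pages; it must extend to the end of the page containing new_size.
 * Returns the number of bytes zeroed (0 if new_size is page aligned).
 */
size_t
sketch_truncate_partial_page(unsigned char *data, size_t new_size)
{
	size_t offset_in_page = new_size % SKETCH_PAGE_SIZE;
	size_t stale;

	if (offset_in_page == 0)
		return (0);	/* no partial page to clean up */
	stale = SKETCH_PAGE_SIZE - offset_in_page;
	memset(data + new_size, 0, stale);
	return (stale);
}
```

A page-aligned new size leaves the buffer untouched, which is the only case the message says was handled before the change.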
Hiroki Sato
ca54e1aee3 Fix a typo. (s/nessesary/necessary/)
2012-01-08 18:48:36 +00:00
John Baldwin
71eeeaf256 Add 5 spare VOPs as placeholders to avoid breaking the KBI in the future
when new VOPs are MFC'd to a branch.

Reviewed by:	kib, bz
MFC after:	3 days
2012-01-06 20:06:45 +00:00
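
Why spare slots protect a binary interface (71eeeaf256) can be shown with a small sketch. The structures below are hypothetical and much simpler than the real VOP vector; they only illustrate that turning a reserved slot into a real operation leaves the size of the structure and the offsets of later members unchanged.

```c
#include <stddef.h>

typedef int (*sketch_op_t)(void *);

/* Layout shipped on a branch: two real operations plus spare slots. */
struct sketch_ops_v1 {
	sketch_op_t op_read;
	sketch_op_t op_write;
	sketch_op_t op_spare1;	/* placeholder, unused */
	sketch_op_t op_spare2;	/* placeholder, unused */
	int flags;
};

/* Later layout: a new operation takes over the first spare slot. */
struct sketch_ops_v2 {
	sketch_op_t op_read;
	sketch_op_t op_write;
	sketch_op_t op_newop;	/* was op_spare1 */
	sketch_op_t op_spare2;
	int flags;
};
```

Code compiled against the first layout still finds `flags` at the same offset in the second one, so adding the operation does not force a rebuild.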
John Baldwin
908cac07ce Use proper argument structure types for the extattr post-VOP hooks.
The wrong structure happened to work since the only argument used was
the vnode which is in the same place in both VOP_SETATTR() and the two
extattr VOPs.

MFC after:	3 days
2012-01-06 20:05:48 +00:00
John Baldwin
948c460971 Fix a logic bug in change 228207 in the check for a thread's new user
priority being a realtime priority.

MFC after:	3 days
2012-01-05 19:02:52 +00:00
John Baldwin
137f91e80f Convert all users of IF_ADDR_LOCK to use new locking macros that specify
either a read lock or write lock.

Reviewed by:	bz
MFC after:	2 weeks
2012-01-05 19:00:36 +00:00
John Baldwin
7e3a96ea37 Some small fixes to CPU accounting for threads:
- Only initialize the per-cpu switchticks and switchtime in sched_throw()
  for the very first context switch on APs during boot.  This avoids a
  small gap between the middle of thread_exit() and sched_throw() where
  time is not accounted to any thread.
- In thread_exit(), update the timestamp bookkeeping to track the changes
  to mi_switch() introduced by td_rux so that the code once again matches
  the comment claiming it is mimicking mi_switch().  Specifically, only
  update the per-thread stats directly and depend on ruxagg() to update
  p_rux rather than adjusting p_rux directly.  While here, move the
  timestamp bookkeeping as late in the function as possible.

Reviewed by:	bde, kib
MFC after:	1 week
2012-01-03 21:03:28 +00:00
Ed Schouten
dc15eac046 Use strchr() and strrchr().
It seems strchr() and strrchr() are used more often than index() and
rindex(). Therefore, simply migrate all kernel code to use them.

For the XFS code, remove an empty line to make the code identical to
the code in the Linux kernel.
2012-01-02 12:12:10 +00:00
Konstantin Belousov
cdb7a43117 Avoid double-unlock or double unreference for ndp->ni_dvp when the vnode dp
lock upgrade right after the 'success' label fails.

In collaboration with:	pho
MFC after:	1 week
2012-01-01 18:45:59 +00:00
John Baldwin
0c0d27d5dd Cap the priority calculated from the current thread's running tick count
at SCHED_PRI_RANGE to prevent overflows in the priority value.  This can
happen due to irregularities with clock interrupts under certain
virtualization environments.

Tested by:	Larry Rosenman  ler lerctr org
MFC after:	2 weeks
2011-12-29 16:17:16 +00:00
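
The cap described in 0c0d27d5dd amounts to a clamp. The sketch below assumes a made-up range value and function name; the real SCHED_PRI_RANGE and the surrounding priority arithmetic live in the scheduler and are not reproduced here.

```c
/* Hypothetical value; the real SCHED_PRI_RANGE is defined by the scheduler. */
#define SKETCH_PRI_RANGE 48

/*
 * Clamp a tick-derived priority contribution so that an inflated tick
 * count (for example after irregular clock interrupts) cannot push the
 * result past the end of the range.
 */
int
sketch_tick_priority(int ticks_score)
{
	return (ticks_score < SKETCH_PRI_RANGE ? ticks_score : SKETCH_PRI_RANGE);
}
```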
Lawrence Stewart
6cedd609b7 Introduce the sysclock_getsnapshot() and sysclock_snap2bintime() KPIs. The
sysclock_getsnapshot() function allows the caller to obtain a snapshot of all
the system clock and timecounter state required to create time stamps at a later
point. The sysclock_snap2bintime() function converts a previously obtained
snapshot into a bintime time stamp according to the specified flags e.g. which
system clock, uptime vs absolute time, etc.

These KPIs enable useful functionality, including direct comparison of the
feedback and feed-forward system clocks and generation of multiple time stamps
with different formats from a single timecounter read.

Committed on behalf of Julien Ridoux and Darryl Veitch from the University of
Melbourne, Australia, as part of the FreeBSD Foundation funded "Feed-Forward
Clock Synchronization Algorithms" project.

For more information, see http://www.synclab.org/radclock/

In collaboration with:	Julien Ridoux (jridoux at unimelb edu au)
2011-12-24 01:32:01 +00:00
John Baldwin
f0d6c5caf0 Add post-VOP hooks for VOP_DELETEEXTATTR() and VOP_SETEXTATTR() and use
these to trigger a NOTE_ATTRIB EVFILT_VNODE kevent when the extended
attributes of a vnode are changed.

Note that OS X already implements this behavior.

Reviewed by:	rwatson
MFC after:	2 weeks
2011-12-23 20:11:37 +00:00
John Baldwin
268e76d86e Use TASK_INITIALIZER() for dev_dtr_task rather than a dedicated SYSINIT().
2011-12-22 16:01:10 +00:00
Andriy Gapon
167057914b ule: ensure that batch timeshare threads are scheduled fairly
With the previous code, if the range of priorities for timeshare batch
threads was greater than RQ_NQS, then the threads with low priorities in
the part of the range above RQ_NQS would be scheduled to the run-queues
as if they had high priorities at the beginning of the range.
In other words, threads with a nice level of +N could be scheduled as
if they had a nice level of -M.

Reported by:	George Mitchell <george@m5p.com>
Reviewed by:	jhb
Tested by:	George Mitchell <george@m5p.com> (earlier version)
MFC after:	1 week
2011-12-19 20:01:21 +00:00
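
The aliasing described in 167057914b can be illustrated with two index functions. The constants are hypothetical, and scaling is shown only as one plausible way to avoid the wraparound; whether the commit uses exactly this formula is an assumption.

```c
#define SKETCH_NQS		64	/* hypothetical number of run queues */
#define SKETCH_BATCH_RANGE	100	/* hypothetical priority range, larger than NQS */

/* Wrapping index: offsets at or above SKETCH_NQS alias the front of the range. */
int
sketch_rq_wrapped(int pri_offset)
{
	return (pri_offset % SKETCH_NQS);
}

/* Scaled index: the whole range maps monotonically onto the queues. */
int
sketch_rq_scaled(int pri_offset)
{
	return (SKETCH_NQS * pri_offset / SKETCH_BATCH_RANGE);
}
```

With these numbers a low-priority offset of 70 wraps to queue 6, the same queue as a high-priority offset of 6, while the scaled mapping keeps the two apart (queues 44 and 3).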
Mikolaj Golub
547b155eb1 Fix style and white spaces.
MFC after:	1 week
2011-12-17 22:18:26 +00:00
Mikolaj Golub
fa3935bcea At their start, most sysctl_kern_proc functions use the same pattern:
locate a process by calling pfind() and then do some additional checks such as
p_candebug(). To reduce this code duplication, a new function pget() is
introduced and used.

As the function may be useful not only in kern_proc.c it is in the
kernel name space.

Suggested by:	kib
Reviewed by:	kib
MFC after:	2 weeks
2011-12-17 16:59:22 +00:00
Andriy Gapon
f389bc9585 belatedly transfer copyrights from libkern/gets.c to kern_cons.c
MFC after:	2 months
MFC with:	r228642
2011-12-17 15:50:45 +00:00
Andriy Gapon
f6ce353e58 replace uses of libkern gets with cngets
MFC after:	2 months
2011-12-17 15:26:34 +00:00
Andriy Gapon
8e62854265 introduce cngets, a method for kernel to read a string from console
This is intended as a replacement for libkern's gets and mostly borrows
its implementation.  It uses cngrab/cnungrab to delimit kernel's access
to console input.

Note: libkern's gets obviously doesn't share any bits of implementation
with libc's gets.  They also have different APIs and the former doesn't
have the overflow problems of the latter.

Inspired by:	bde
MFC after:	2 months
2011-12-17 15:16:54 +00:00
Andriy Gapon
bf8696b408 introduce cngrab/cnungrab stub calls in some places where they make sense
MFC after:	2 months
2011-12-17 15:11:22 +00:00
Andriy Gapon
9976156f12 kern cons: introduce infrastructure for console grabbing by kernel
At the moment grab and ungrab methods of all console drivers are no-ops.

Current intended meaning of the calls is that the kernel takes control of
console input.  In the future the semantics may be extended to mean that
the calling thread takes full ownership of the console (e.g. console
output from other threads could be suspended).

Inspired by:	bde
MFC after:	2 months
2011-12-17 15:08:43 +00:00
John Baldwin
f427c78b19 Fire a kevent if necessary after seeking on a regular file. This fixes a
case where a kevent would not fire on a regular file if an application read
to EOF and then seeked backwards into the file.

Reviewed by:	kib
MFC after:	2 weeks
2011-12-16 20:10:00 +00:00
John Baldwin
338e7cf235 Use vm_mmap_to_errno().
Submitted by:	kib
2011-12-15 15:17:19 +00:00
Jilles Tjoelker
6d1c58f8a2 Fix select/poll/kqueue for write on reverse direction before first write.
The reverse direction of a pipe is lazily allocated on the first write in
that direction (because pipes are usually used in one direction only).  A
special case is needed to ensure the pipe appears writable before the first
write because there are 0 bytes of pending data in 0 bytes of buffer space
at that point, leaving 0 bytes of data that can be written with the normal
code.

Note that the first write returns [ENOMEM] if kern.ipc.maxpipekva is
exceeded and does not block or return [EAGAIN], so selecting true for write
is correct even in that case.

PR:		kern/93685
Submitted by:	gianni
MFC after:	2 weeks
2011-12-14 22:26:39 +00:00
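
The special case in 6d1c58f8a2 can be sketched as a predicate. The structure, field names, and threshold below are assumptions for illustration and are not taken from the pipe implementation: a buffer of size 0 means the reverse direction has not been allocated yet, and it must still be reported as writable.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PIPE_BUF 512	/* assumed free-space threshold for this sketch */

struct sketch_pipe {
	size_t size;	/* allocated buffer space; 0 until the first write */
	size_t cnt;	/* bytes of pending data */
};

bool
sketch_pipe_writable(const struct sketch_pipe *p)
{
	if (p->size == 0)
		return (true);	/* special case: buffer not allocated yet */
	return (p->size - p->cnt >= SKETCH_PIPE_BUF);	/* the normal check */
}
```

Without the first test, a never-written direction would compute 0 - 0 bytes of free space and appear unwritable forever.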
John Baldwin
fb680e16f4 Add a helper API to allow in-kernel code to map portions of shared memory
objects created by shm_open(2) into the kernel's address space.  This
provides a convenient way for creating shared memory buffers between
userland and the kernel without requiring custom character devices.
2011-12-14 22:22:19 +00:00
David E. O'Brien
1c5151f3f8 Match other formatting.
2011-12-14 02:31:32 +00:00
David E. O'Brien
3d7618d8bf Disallow various debug.kdb sysctl's when securelevel is raised.
PR:	161350
2011-12-13 17:59:16 +00:00
Eitan Adler
9910b854c6 - Add a sysctl to allow non-root users the ability to set idle
priorities.

- While here fix up some style nits.

Discussed with: cperciva (briefly)
Reviewed by:	pjd (earlier version)
Reviewed by:	bde
Approved by:	jhb
MFC after:	1 month
2011-12-13 14:00:27 +00:00
Eitan Adler
3eb9ab5255 Document a large number of currently undocumented sysctls. While here
fix some style(9) issues and reduce redundancy.

PR:		kern/155491
PR:		kern/155490
PR:		kern/155489
Submitted by:	Galimov Albert <wtfcrap@mail.ru>
Approved by:	bde
Reviewed by:	jhb
MFC after:	1 week
2011-12-13 00:38:50 +00:00
Andriy Gapon
7a7ce668ef put sys/systm.h at its proper place or add it if missing
Reported by:	lstewart, tinderbox
Pointyhat to:	avg, attilio
MFC after:	1 week
MFC with:	r228430
2011-12-12 10:05:13 +00:00
Andriy Gapon
0e225211a0 kern_racct: move sys/systm.h inclusion to its proper place
This should fix the build failure introduced with r228424.
Also remove duplicate inclusion of sys/param.h.

Pointyhat to:	avg
MFC after:	1 week
2011-12-12 07:46:10 +00:00
Andriy Gapon
353705930f panic: add a switch and infrastructure for stopping other CPUs in SMP case
Historical behavior of letting other CPUs merrily go on remains the default for
the time being.  The new behavior can be switched on via the
kern.stop_scheduler_on_panic tunable and sysctl.

Stopping of the CPUs has (at least) the following benefits:
- more of the system state at panic time is preserved intact
- threads and interrupts do not interfere with dumping of the system
  state

Only one thread runs uninterrupted after panic if stop_scheduler_on_panic
is set.  That thread might call code that is also used in normal context
and that code might use locks to prevent concurrent execution of certain
parts.  Those locks might be held by the stopped threads and would never
be released.  To work around this issue, it was decided that instead of
explicit checks for panic context, we would rather put those checks
inside the locking primitives.

This change has substantial portions written and re-written by attilio
and kib at various times.  Other changes are heavily based on the ideas
and patches submitted by jhb and mdf.  bde has provided many insights
into the details and history of the current code.

The new behavior may cause problems for systems that use a USB keyboard
for interfacing with the system console.  This is because of some unusual
locking patterns in the ukbd code which have to be used because on one
hand ukbd is below syscons, but on the other hand it has to interface
with other usb code that uses regular mutexes/Giant for its concurrency
protection.  Dumping to USB-connected disks may also be affected.

PR:			amd64/139614 (at least)
In cooperation with:	attilio, jhb, kib, mdf
Discussed with:		arch@, bde
Tested by:		Eugene Grosbein <eugen@grosbein.net>,
			gnn,
			Steven Hartland <killing@multiplay.co.uk>,
			glebius,
			Andrew Boyer <aboyer@averesystems.com>
			(various versions of the patch)
MFC after:		3 months (or never)
2011-12-11 21:02:01 +00:00
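
The design choice in 353705930f, putting the panic check inside the locking primitive rather than at every call site, can be sketched as follows. The flag, the lock type, and the function are hypothetical stand-ins and do not reflect the kernel's real locking code.

```c
#include <stdbool.h>

static bool sketch_scheduler_stopped;	/* hypothetical flag set at panic time */

struct sketch_lock {
	bool held;
};

/* Returns true if the caller may proceed into the protected code. */
bool
sketch_lock_acquire(struct sketch_lock *lk)
{
	if (sketch_scheduler_stopped)
		return (true);	/* after panic: ignore lock state entirely */
	if (lk->held)
		return (false);	/* a real lock would block or spin here */
	lk->held = true;
	return (true);
}
```

Because the check lives in the primitive, code shared between normal and panic context needs no changes: a lock left held by a stopped thread simply no longer blocks the one thread that keeps running.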
Peter Holm
cdea31e305 Move cpu_set_upcall(newtd, td) up before the first call of
thread_free(newtd).  This avoids a possible page fault in
cpu_thread_clean() as seen on amd64 with syscall fuzzing.

Reviewed by:	kib
MFC after:	1 week
2011-12-09 17:19:41 +00:00
Eitan Adler
5a01b72672 - Fix ktrace leakage if error is set
PR:		kern/163098
Submitted by:	Loganaden Velvindron <loganaden@devio.us>
Approved by:	sbruno@
MFC after:	1 month
2011-12-08 03:20:38 +00:00
Alan Cox
ea3f07d3a0 Eliminate stale numbers from a comment.
2011-12-07 16:27:23 +00:00
Alan Cox
c749c003b8 Eliminate the possibility of 32-bit arithmetic overflow in the calculation
of vm_kmem_size that may occur if the system administrator has specified a
vm.vm_kmem_size tunable value that exceeds the hard cap.

PR:		162741
Submitted by:	Adam McDougall
Reviewed by:	bde@
MFC after:	3 weeks
2011-12-07 07:03:14 +00:00
Konstantin Belousov
93c26de0ad Most users of pipe(2) do not call fstat(2) on the returned pipe descriptors.
Optimize for that case by lazily allocating the pipe inode number at
fstat(2) time. If alloc_unr(9) fails, do not fail fstat(2), since
uses of inode numbers are even rarer than fstat(2) calls; instead report a zero
inode number forever. Note that alloc_unr() failure is unlikely because the total
number of pipes in the system is limited by the number of file descriptors.

Based on the submission by:	gianni
MFC after:	2 weeks
2011-12-06 11:24:03 +00:00
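
The lazy allocation in 93c26de0ad can be sketched with a tri-state field. The allocator below is a made-up stand-in for alloc_unr(9), and all names are assumptions. The inode number is requested only on the first fstat-style query, and a failed allocation is remembered so that later queries keep reporting zero instead of retrying.

```c
/* 0: not tried yet; -1: allocation failed; >0: allocated number */
struct sketch_pipe_pair {
	int ino;
};

static int sketch_next_ino = 1;
static int sketch_alloc_fails;	/* set non-zero to simulate exhaustion */

/* Stand-in for a unit-number allocator; returns -1 on failure. */
static int
sketch_alloc_ino(void)
{
	if (sketch_alloc_fails)
		return (-1);
	return (sketch_next_ino++);
}

/* Called from the fstat path: allocate on first use, never fail. */
unsigned
sketch_pipe_ino(struct sketch_pipe_pair *pp)
{
	if (pp->ino == 0)
		pp->ino = sketch_alloc_ino();	/* -1 is remembered on failure */
	return (pp->ino > 0 ? (unsigned)pp->ino : 0);
}
```

Pipes that are never passed to fstat never consume a number, which is the common case the message optimizes for.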
Mikolaj Golub
9e94d5b83f Really protect kern.proc.ps_strings sysctls with p_candebug(). This
was intended to be in r228288.

Spotted by:	many
MFC after:	1 week
2011-12-06 06:40:14 +00:00
Mikolaj Golub
c65932be9d Protect kern.proc.auxv and kern.proc.ps_strings sysctls with p_candebug().
Citing jilles:

If we are ever going to do ASLR, the AUXV information tells an attacker
where the stack, executable and RTLD are located, which defeats much of
the point of randomizing the addresses in the first place.

Given that the AUXV information seems to be used by debuggers only anyway,
I think it would be good to move it to p_candebug() now.

The full virtual memory maps (KERN_PROC_VMMAP, procstat -v) are already
under p_candebug().

Suggested by:	jilles
Discussed with:	rwatson
MFC after:	1 week
2011-12-05 19:34:02 +00:00