Commit Graph

11380 Commits

Author SHA1 Message Date
Robert Watson
72c35fc69c Reserve system call numbers for Capsicum security framework capabilities,
capability mode, and process descriptors: cap_new, cap_getrights, cap_enter,
cap_getmode, pdfork, pdkill, pdgetpid, and pdwait.

Obtained from:	TrustedBSD Project
Sponsored by:	Google
MFC after:	3 weeks
2009-09-30 08:46:01 +00:00
Jamie Gritton
d446857747 Set the prison in NFS anon and GSS SVC creds.
Reviewed by:	marcel
MFC after:	3 days
2009-09-28 18:07:16 +00:00
Xin LI
82aebf697c Add two new fcntls to enable/disable read-ahead:
- F_READAHEAD: specify the amount for sequential access.  The amount is
   specified in bytes and is rounded up to nearest block size.
 - F_RDAHEAD: Darwin compatible version that use 128KB as the sequential
   access size.

A third argument of zero disables the read-ahead behavior.

Please note that the read-ahead amount is also constrainted by sysctl
variable, vfs.read_max, which may need to be raised in order to better
utilize this feature.

Thanks Igor Sysoev for proposing the feature and submitting the original
version, and kib@ for his valuable comments.

Submitted by:	Igor Sysoev <is rambler-co ru>
Reviewed by:	kib@
MFC after:	1 month
2009-09-28 16:59:47 +00:00
Xin LI
13dcbd75c1 Use correct sizeof() object for klist 'list'. Currently, struct klist
contained only SLIST_HEAD as its member, thus sizeof(struct klist) would
equal to sizeof(struct klist *), so this change makes the code more
correct in terms of semantics, but should be a no-op to compiler at this
time.

Reported by:	MQ <antinvidia at gmail com>
2009-09-28 10:22:46 +00:00
David Xu
b101b127f3 In function do_rw_wrlock, when a writer got an error and before returning,
check if there are readers blocked by us via URWLOCK_WRITE_WAITERS flag,
and resume the readers. The error must be EAGAIN, otherwise there must
have memory problem, and nobody can rescue the buggy application.

The revision 197445 might be reverted.
2009-09-25 00:03:13 +00:00
Alexander Motin
6090db7d38 Do not call BUS_DRIVER_ADDED() for detached buses (attach failed) on
driver load. This fixes crash on atapicam module load on systems,
where some ata channels (usually ata1) was probed, but failed to attach.

Reviewed by:	jhb, imp
Tested by:	many
2009-09-24 17:03:32 +00:00
Roman Divacky
1c2825bd80 Change unsigned foo to u_foo as required by style(9).
Requested by:	bde
Approved by:	ed (mentor)
2009-09-22 16:16:02 +00:00
Edward Tomasz Napierala
a9315dded6 Add pieces of infrastructure required for NFSv4 ACL support in UFS.
Reviewed by:	rwatson
2009-09-22 15:15:03 +00:00
Konstantin Belousov
51a6ef34fb Remove forward_roundrobin(), it is unused for quite some time.
Reviewed by:	jhb
MFC after:	1 week
2009-09-21 13:09:56 +00:00
Michael Tuexen
8518270e20 Get SCTP working in combination with VIMAGE.
Contains code from bz.
Approved by: rrs (mentor)
MFC after: 1 month.
2009-09-19 14:02:16 +00:00
Alan Cox
fe105d45a2 Add a new sysctl for reporting all of the supported page sizes.
Reviewed by:	jhb
MFC after:	3 weeks
2009-09-18 17:04:57 +00:00
Attilio Rao
39df6da8cc Don't allocate new unnecessary pages when devstat_alloc() looses the
run for re-acuiring the lock, but recheck if new pages are allocatable
from the pool and free the previously allocated ones.

Tested by:	pho, Giovanni Trematerra
		<giovanni dot trematerra at gmail dot com>
2009-09-18 13:48:38 +00:00
Roman Divacky
6413e27b3a Fix the style of the previous commit.
Approved by:	ed (mentor, implicit)
2009-09-17 17:48:13 +00:00
Roman Divacky
abc8594d70 Make these argument/variable unsigned as the defines for them don't fit
into signed 32bit integer.

Approved by:	ed (mentor, implicit)
Approved by:	sson
2009-09-17 17:41:28 +00:00
Stacey Son
fdc1a1131e Add EV_RECEIPT to kevents.
EV_RECEIPT is useful to disambiguating error conditions when multiple
events structures are passed to kevent(2).  The error code is returned
in the data field and EV_ERROR is set.

Approved by:	rwatson (co-mentor)
2009-09-16 03:49:54 +00:00
Stacey Son
1a921c410a Add the EV_DISPATCH flag to kevents.
When the EV_DISPATCH flag is used the event source will be disabled
immediately after the delivery of an event.   This is similar to the
EV_ONESHOT flag but it doesn't delete the event.

Approved by:	rwatson (co-mentor)
2009-09-16 03:37:39 +00:00
Stacey Son
2c2e449905 Add EVFILT_USER to kevents.
Add user events support to kernel events which are not associated with any
kernel mechanism but are triggered by user level code.  This is useful for
adding user level events to an event handler that may also be monitoring
kernel events.

Approved by:	rwatson (co-mentor)
2009-09-16 03:30:12 +00:00
Stacey Son
95128e983f Add optional touch event filter hooks to kevents.
The touch event filter is called when a kernel event data is possibly
updated.  There are two hook points.  First, during a kevent() system
call.  Second, when an event has been triggered.

Approved by:	rwatson (co-mentor)
2009-09-16 03:15:57 +00:00
Andre Oppermann
11c99a6d7b -Put the optimized soreceive_stream() under a compile time option called
TCP_SORECEIVE_STREAM for the time being.

Requested by:	brooks

Once compiled in make it easily switchable for testers by using a tuneable
 net.inet.tcp.soreceive_stream
and a corresponding read-only sysctl to report the current state.

Suggested by:	rwatson

MFC after:	2 days
2009-09-15 22:23:45 +00:00
Attilio Rao
435068aab7 Fix sched_switch_migrate():
- In 8.x and above the run-queue locks are nomore shared even in the
  HTT case, so remove the special case.
- The deadlock explained in the removed comment here is still possible
  even with different locks, with the contribution of tdq_lock_pair().
  An explanation is here:
  (hypotesis: a thread needs to migrate on another CPU, thread1 is doing
  sched_switch_migrate() and thread2 is the one handling the sched_switch()
  request or in other words, thread1 is the thread that needs to migrate
  and thread2 is a thread that is going to be preempted, most likely an
  idle thread. Also, 'old' is referred to the context (in terms of
  run-queue and CPU) thread1 is leaving and 'new' is referred to the
  context thread1 is going into.  Finally, thread3 is doing tdq_idletd()
  or sched_balance() and definitively doing tdq_lock_pair())

  * thread1 blocks its td_lock. Now td_lock is 'blocked'
  * thread1 drops its old runqueue lock
  * thread1 acquires the new runqueue lock
  * thread1 adds itself to the new runqueue and sends an IPI_PREEMPT
    through tdq_notify() to the new CPU
  * thread1 drops the new lock
  * thread3, scanning the runqueues, locks the old lock
  * thread2 received the IPI_PREEMPT and does thread_lock() with td_lock
    pointing to the new runqueue
  * thread3 wants to acquire the new runqueue lock, but it can't because
    it is held by thread2 so it spins
  * thread1 wants to acquire old lock, but as long as it is held by
    thread3 it can't
  * thread2 going further, at some point wants to switchin in thread1,
    but it will wait forever because thread1->td_lock is in blocked state

This deadlock has been manifested mostly on 7.x and reported several time
on mailing lists under the voice 'spinlock held too long'.
Many thanks to des@ for having worked hard on producing suitable textdumps
and Jeff for help on the comment wording.

Reviewed by:	jeff
Reported by:	des, others
Tested by:	des, Giovanni Trematerra
		<giovanni dot trematerra at gmail dot com>
		(STABLE_7 based version)
2009-09-15 16:56:17 +00:00
Attilio Rao
4c68dee0fb Revert r196779 in order to implement a different scheme for newbus locking
methodology.

Requested by:	imp
2009-09-13 15:08:19 +00:00
Luigi Rizzo
446e861708 Make sure callouts are not processed one tick late.
The problem was introduced in SVN 180608/ rev 1.114 and affects
all users of callout_reset() (including select, usleep, setitimer).
A better fix probably involves replicating 'ticks' in the
struct callout_cpu; this commit is just a temporary thing so that
we can MFC it after a suitable test time and RE approval.

MFC after:	3 days
2009-09-12 21:44:34 +00:00
Robert Watson
e76d823b81 Use C99 initialization for struct filterops.
Obtained from:	Mac OS X
Sponsored by:	Apple Inc.
MFC after:	3 weeks
2009-09-12 20:03:45 +00:00
Nick Hibma
b22692bd0a Add a comment on the consequences of reducing the poweroff delay 2009-09-10 18:24:59 +00:00
Konstantin Belousov
b55ef216fe kern_select(9) copies fd_set in and out of userspace in quantities of
longs. Since 32bit processes longs are 4 bytes, 64bit kernel may copy in
or out 4 bytes more then the process expected.

Calculate the amount of bytes to copy taking into account size of fd_set
for the current process ABI.

Diagnosed and tested by:	Peter Jeremy <peterjeremy acm org>
Reviewed by:	jhb
MFC after:	1 week
2009-09-09 20:59:01 +00:00
Konstantin Belousov
1ef6ea9b60 Unlock the image vnode around the call of pmc PMC_FN_PROCESS_EXEC hook.
The hook calls vn_fullpath(9), that should not be executed with a vnode
lock held.

Reported by:	Bruce Cran <bruce cran org uk>
Tested by:	pho
MFC after:	3 days
2009-09-09 10:52:36 +00:00
Konstantin Belousov
427992ecdb In vfs_mark_atime(9), be resistent against reclaimed vnodes.
Assert that neccessary locks are taken, since vop might not be called.

Tested by:	pho
MFC after:	3 days
2009-09-09 10:51:50 +00:00
Poul-Henning Kamp
6778431478 Revert previous commit and add myself to the list of people who should
know better than to commit with a cat in the area.
2009-09-08 13:19:05 +00:00
Poul-Henning Kamp
b34421bf9c Add necessary include. 2009-09-08 13:16:55 +00:00
Antoine Brodin
bf3f1fe043 Change w_notrunning and w_stillcold from pointer to array so that sizeof
returns what is expected.

PR:		kern/138557
Discussed with:	brucec@
MFC after:	1 month
2009-09-06 13:31:05 +00:00
Konstantin Belousov
db17314ea4 In fhopen, vfs_ref() the mount point while vnode is unlocked, to prevent
vn_start_write(NULL, &mp) from operating on potentially freed or reused
struct mount *.

Remove unmatched vfs_rel() in cleanup.

Noted and reviewed by:	tegge
Tested by:	pho
MFC after:	3 days
2009-09-06 11:44:46 +00:00
Ed Schouten
4d3b1aacfc Move ptmx into pty(4).
Now that pty(4) is a loadable kernel module, I'd better move /dev/ptmx
in there as well. This means that pty(4) now provides almost all
pseudo-terminal compatibility code. This means it's very easy to test
whether applications use the proper library interfaces when allocating
pseudo-terminals (namely posix_openpt and openpty).
2009-09-06 10:27:45 +00:00
Jamie Gritton
babbbb9c57 Allow a jail's name to be the same as its jid (which is the default if no
name is specified), but still disallow other numeric names.

Reviewed by:	zec
Approved by:	bz (mentor)
MFC after:	3 days
2009-09-04 19:00:48 +00:00
Attilio Rao
80002a63db Add intermediate states for attaching and detaching that will be
reused by the enhached newbus locking once it is checked in.
This change can be easilly MFCed to STABLE_8 at the appropriate moment.

Reviewed by:	jhb, scottl
Tested by:	Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
2009-09-03 13:40:41 +00:00
Attilio Rao
8d3635c4db Fix some bugs related to adaptive spinning:
In the lockmgr support:
- GIANT_RESTORE() is just called when the sleep finishes, so the current
  code can ends up into a giant unlock problem.  Fix it by appropriately
  call GIANT_RESTORE() when needed.  Note that this is not exactly ideal
  because for any interation of the adaptive spinning we drop and restore
  Giant, but the overhead should be not a factor.
- In the lock held in exclusive mode case, after the adaptive spinning is
  brought to completition, we should just retry to acquire the lock
  instead to fallthrough. Fix that.
- Fix a style nit

In the sx support:
- Call GIANT_SAVE() before than looping. This saves some overhead because
  in the current code GIANT_SAVE() is called several times.

Tested by:	Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
2009-09-02 17:33:51 +00:00
Konstantin Belousov
579b976090 Fix mount reference leak when V_XSLEEP is specified to vn_start_write().
Submitted by:	tegge
2009-09-01 12:05:39 +00:00
Konstantin Belousov
8a945d109c Reintroduce the r196640, after fixing the problem with my testing.
Remove the altkstacks, instead instantiate threads with kernel stack
allocated with the right size from the start. For the thread that has
kernel stack cached, verify that requested stack size is equial to the
actual, and reallocate the stack if sizes differ [1].

This fixes the bug introduced by r173361 that was committed several days
after r173004 and consisted of kthread_add(9) ignoring the non-default
kernel stack size.

Also, r173361 removed the caching of the kernel stacks for a non-first
thread in the process. Introduce separate kernel stack cache that keeps
some limited amount of preallocated kernel stacks to lower the latency
of thread allocation. Add vm_lowmem handler to prune the cache on
low memory condition. This way, system with reasonable amount of the
threads get lower latency of thread creation, while still not exhausting
significant portion of KVA for unused kstacks.

Submitted by:	peter [1]
Discussed with:	jhb, julian, peter
Reviewed by:	jhb
Tested by:	pho (and retested according to new test scenarious)
MFC after:	1 week
2009-09-01 11:41:51 +00:00
Konstantin Belousov
a505c2c72c Make the mnt_writeopcount and mnt_secondary_writes counters,
used by the suspension code, not greater then mnt_ref reference
counter value. Increment mnt_ref together with write counter
in vn_start_write()/ vn_start_secondary_write(), releasing in
vn_finished_write/vn_finished_secondary_write().

Since r186197, unmount code requires that no writers occured after all
references are expired. We still could get write counter incremented
for freed or reused struct mount, but it seems to be innocent, since
corresponding vnode should be referenced and reclaimed then.

Reported by:	pho (last half a year), erwin
Reviewed by:	attilio
Tested by:	pho, erwin
MFC after:	1 week
2009-08-31 10:20:52 +00:00
Bjoern A. Zeeb
ecc2fda872 Make sure FreeBSD binaries without .note.ABI-tag section work
correctly and do not match a colliding Debian GNU/kFreeBSD
brandinfo statements.
For this mark the Debian GNU/kFreeBSD brandinfo that it must have
an .note.ABI-tag section and ignore the old EI_OSABI brandinfo
when comparing a possibly colliding set of options.

Due to SYSINIT we add the brandinfo in a non-deterministic order,
so native FreeBSD is not always first. We may want to consider
to force native FreeBSD to come first as well.

The only way a problem could currently be noticed is when running an
i386 binary without the .note.ABI-tag on amd64 and the Debian GNU/kFreeBSD
brandinfo  was matched first,  as the fallback to ld-elf32.so.1 does
not exist in that case.

Reported and tested by:	ticso
In collaboration with:	kib
MFC after:		3 days
2009-08-30 14:38:17 +00:00
Konstantin Belousov
f25fa6abb2 Reverse r196640 and r196644 for now. 2009-08-29 21:53:08 +00:00
Konstantin Belousov
b6b2d1bf88 Dispose the kernel stack of the proper thread.
Submitted by:	alc
MFC after:	1 week
2009-08-29 18:01:02 +00:00
Konstantin Belousov
c3cf0b476f Remove the altkstacks, instead instantiate threads with kernel stack
allocated with the right size from the start. For the thread that has
kernel stack cached, verify that requested stack size is equial to the
actual, and reallocate the stack if sizes differ [1].

This fixes the bug introduced by r173361 that was committed several days
after r173004 and consisted of kthread_add(9) ignoring the non-default
kernel stack size.

Also, r173361 removed the caching of the kernel stacks for a non-first
thread in the process. Introduce separate kernel stack cache that keeps
some limited amount of preallocated kernel stacks to lower the latency
of thread allocation. Add vm_lowmem handler to prune the cache on
low memory condition. This way, system with reasonable amount of the
threads get lower latency of thread creation, while still not exhausting
significant portion of KVA for unused kstacks.

Submitted by:	peter [1]
Discussed with:	jhb, julian, peter
Reviewed by:	jhb
Tested by:	pho
MFC after:	1 week
2009-08-29 13:28:02 +00:00
John Baldwin
2fa8c8d21e Extend the device pager to support different memory attributes on different
pages in an object.
- Add a new variant of d_mmap() currently called d_mmap2() which accepts
  an additional in/out parameter that is the memory attribute to use for
  the requested page.
- A driver either uses d_mmap() or d_mmap2() for all requests but not both.
  The current implementation uses a flag in the cdevsw (D_MMAP2) to indicate
  that the driver provides a d_mmap2() handler instead of d_mmap().  This
  is done to make the change ABI compatible with existing drivers and
  MFC'able to 7 and 8.

Submitted by:	alc
MFC after:	1 month
2009-08-28 14:06:55 +00:00
Jamie Gritton
c4884ffa6f Fix a LOR between allprison_lock and vnode locks by releasing
allprison_lock before releasing a prison's root vnode.

PR:		kern/138004
Reviewed by:	kib
Approved by:	bz (mentor)
MFC after:	3 days
2009-08-27 16:15:51 +00:00
Marius Strobl
5486ffc898 Add a temporary workaround which just lets init die instead of
causing a panic if it is killed due to a unsolved stack overflow
seen very late during shutdown on sparc64 when the gmirror worker
process exists, which is a regression introduced in 8.0.

Reviewed by:	kib
MFC after:	3 days
2009-08-26 21:10:47 +00:00
Konstantin Belousov
4f4946d337 Honor the vfs.timestamp_precision sysctl settings for utimes(path, NULL)
and similar calls.

Obtained from:	Petr Salinger, Debian GNU/kFreeBSD, Debian bug #489894
MFC after:	3 days
2009-08-26 14:32:37 +00:00
Jilles Tjoelker
74d1c4927a Fix poll() on half-closed sockets, while retaining POLLHUP for fifos.
This reverts part of r196460, so that sockets only return POLLHUP if both
directions are closed/error. Fifos get POLLHUP by closing the unused
direction immediately after creating the sockets.

The tools/regression/poll/*poll.c tests now pass except for two other things:
- if POLLHUP is returned, POLLIN is always returned as well instead of only
  when there is data left in the buffer to be read
- fifo old/new reader distinction does not work the way POSIX specs it

Reviewed by:	kib, bde
2009-08-25 21:44:14 +00:00
Warner Losh
58a745889f Rather than havnig enabled/disabled, implement a max queue depth.
While usually not an issue, this firewalls bugs in the code that may
run us out of memory.

Fix a memory exhaustion in the case where devctl was disabled, but the
link was bouncing.  The check to queue was in the wrong place.

Implement a new sysctl hw.bus.devctl_queue to control the depth.  Make
compatibility hacks for hw.bus.devctl_disable to ease transition.

Reviewed by:	emaste@
Approved by:	re@ (kib)
MFC after:	asap
2009-08-25 06:25:59 +00:00
Bjoern A. Zeeb
89ffc202d6 Fix handling of .note.ABI-tag section for GNU systems [1].
Handle GNU/Linux according to LSB Core Specification 4.0,
Chapter 11. Object Format, 11.8. ABI note tag.

Also check the first word of desc, not only name, according to
glibc abi-tags specification to distinguish between Linux and
kFreeBSD.

Add explicit handling for Debian GNU/kFreeBSD, which runs
on our kernels as well [2].

In {amd64,i386}/trap.c, when checking osrel of the current process,
also check the ABI to not change the signal behaviour for Linux
binary processes, now that we save an osrel version for all three
from the lists above in struct proc [2].

These changes make it possible to run FreeBSD, Debian GNU/kFreeBSD
and Linux binaries on the same machine again for at least i386 and
amd64, and no longer break kFreeBSD which was detected as GNU(/Linux).

PR:		kern/135468
Submitted by:	dchagin [1] (initial patch)
Suggested by:	kib [2]
Tested by:	Petr Salinger (Petr.Salinger seznam.cz) for kFreeBSD
Reviewed by:	kib
MFC after:	3 days
2009-08-24 16:19:47 +00:00
Ed Schouten
2992abe047 Allow multiple console devices per driver without insane code duplication.
Say, a driver wants to have multiple console devices to pick from, you
would normally write down something like this:

	CONSOLE_DRIVER(dev1);
	CONSOLE_DRIVER(dev2);

Unfortunately, this means that you have to declare 10 cn routines,
instead of 5. It also isn't possible to initialize cn_arg on beforehand.

I noticed this restriction when I was implementing some of the console
bits for my vt(4) driver in my newcons branch. I have a single set of cn
routines (termcn_*) which are shared by all vt(4) console devices.

In order to solve this, I'm adding a separate consdev_ops structure,
which contains all the function pointers. This structure is referenced
through consdev's cn_ops field.

While there, I'm removing CONS_DRIVER() and cn_checkc, which have been
deprecated for years. They weren't used throughout the source, until the
Xen console driver showed up. CONSOLE_DRIVER() has been changed to do
the right thing. It now declares both the consdev and consdev_ops
structure and ties them together. In other words: this change doesn't
change the KPI for drivers that used the regular way of declaring
console devices.

If drivers want to use multiple console devices, they can do this as
follows:

	static const struct consdev_ops mydriver_cnops = {
		.cn_probe	= mydriver_cnprobe,
		...
	};
	static struct mydriver_softc cons0_softc = {
		...
	};
	CONSOLE_DEVICE(cons0, mydriver_cnops, &cons0_softc);
	static struct mydriver_softc cons1_softc = {
		...
	};
	CONSOLE_DEVICE(cons1, mydriver_cnops, &cons1_softc);

Obtained from:	//depot/user/ed/newcons/...
2009-08-24 10:53:30 +00:00