Commit Graph

11475 Commits

Author SHA1 Message Date
alc
b4400267f9 Update a comment to reflect the previous change. 2009-10-25 02:48:29 +00:00
ru
0a59296a7e - Rename tunable kern.ipc.shmmaxpgs to kern.ipc.shmall.
- Explain the fuss when initializing shmmax.

PR:	75542 (mistakenly closed instead of PR 75541)
2009-10-24 19:00:58 +00:00
jhb
a661f652ad - Fix several off-by-one errors when using MAXCOMLEN. The p_comm[] and
td_name[] arrays are actually MAXCOMLEN + 1 in size and a few places that
  created shadow copies of these arrays were just using MAXCOMLEN.
- Prefer using sizeof() of an array type to explicit constants for the
  array length in a few places.
- Ensure that all of p_comm[] and td_name[] is always zero'd during
  execve() to guard against any possible information leaks.  Previously
  trailing garbage in p_comm[] could be leaked to userland in ktrace
  record headers via td_name[].

Reviewed by:	bde
2009-10-23 15:14:54 +00:00
jhb
deef9eece9 Don't bother copying the name of a kproc or kthread out into a temporary
array just to pass that array to printf().  kproc and kthread names are
NUL-terminated and can be printed using printf() directly.

Reviewed by:	bde
2009-10-23 15:09:51 +00:00
jhb
e3088a4b83 Set the devclass_t pointer specified in the DRIVER_MODULE() macro
sooner so it is always valid when a driver's identify routine is
called.  Previously, new-bus would attempt to create the devclass for
a newly loaded driver in two separate places, once in
devclass_add_driver(), and again after devclass_add_driver() returned
in driver_module_handler().  Only the second lookup attempted to set a
device class' parent and set the devclass_t pointer specified in the
DRIVER_MODULE() macro.  However, by the time it was executed, the
driver was already added to existing instances of the parent driver at
which point in time the new driver's identify routine would have been
invoked.  The fix is to merge the two attempts and only create the
devclass once in devclass_add_driver() including setting the
devclass_t pointer passed to DRIVER_MODULE() before the driver is
added to any existing bus devices.

Reported by:	avg
Reviewed by:	imp
MFC after:	2 weeks
2009-10-22 14:53:44 +00:00
marcel
51bb720939 o Introduce vm_sync_icache() for making the I-cache coherent with
the memory or D-cache, depending on the semantics of the platform.
    vm_sync_icache() is basically a wrapper around pmap_sync_icache(),
    that translates the vm_map_t argumument to pmap_t.
o   Introduce pmap_sync_icache() to all PMAP implementation. For powerpc
    it replaces the pmap_page_executable() function, added to solve
    the I-cache problem in uiomove_fromphys().
o   In proc_rwmem() call vm_sync_icache() when writing to a page that
    has execute permissions. This assures that when breakpoints are
    written, the I-cache will be coherent and the process will actually
    hit the breakpoint.
o   This also fixes the Book-E PMAP implementation that was missing
    necessary locking while trying to deal with the I-cache coherency
    in pmap_enter() (read: mmu_booke_enter_locked).

The key property of this change is that the I-cache is made coherent
*after* writes have been done. Doing it in the PMAP layer when adding
or changing a mapping means that the I-cache is made coherent *before*
any writes happen. The difference is key when the I-cache prefetches.
2009-10-21 18:38:02 +00:00
ru
8bb7f5309b Random number generator initialization cleanup:
- Introduce new SI_SUB_RANDOM point in boot sequence to make it
clear from where one may start using random(9).  It should be as
early as possible, so place it just after SI_SUB_CPU where we
have some randomness on most platforms via get_cyclecount().

- Move stack protector initialization to be after SI_SUB_RANDOM
as before this point we have no randomness at all.  This fixes
stack protector to actually protect stack with some random guard
value instead of a well-known one.

Note that this patch doesn't try to address arc4random(9) issues.
With current code, it will be implicitly seeded by stack protector
and hence will get the same entropy as random(9).  It will be
securely reseeded once /dev/random is feeded by some entropy from
userland.

Submitted by:	Maxim Dounin <mdounin@mdounin.ru>
MFC after:	3 days
2009-10-20 16:36:51 +00:00
ed
d92da0f1c0 Properly set the low watermarks when reducing the baud rate.
Now that buffers are deallocated lazily, we should not use
tty*q_getsize() to obtain the buffer size to calculate the low
watermarks. Doing this may cause the watermark to be placed outside the
typical buffer size.

This caused some regressions after my previous commit to the TTY code,
which allows pseudo-devices to resize the buffers as well.

Reported by:	yongari, dougb
MFC after:	1 week
2009-10-19 07:17:37 +00:00
ed
9be1c96abb Allow the buffer size to be configured for pseudo-like TTY devices.
Devices that don't implement param() (which means they don't support
hardware parameters such as flow control, baud rate) hardcode the baud
rate to TTYDEF_SPEED. This means the buffer size cannot be configured,
which is a little inconvenient when using canonical mode with big lines
of input, etc.

Make it adjustable, but do clamp it between B50 and B115200 to prevent
awkward buffer sizes. Remove the baud rate assignment from
/etc/gettytab. Trust the kernel to fill in a proper value.

Reported by:	Mikolaj Golub <to my trociny gmail com>
MFC after:	1 month
2009-10-18 19:48:53 +00:00
ed
48d5bb8030 Make lock devices work properly.
It turned out I did add the code to use the init state devices to set
the termios structure when opening the device, but it seems I totally
forgot to add the bits required to force the actual locking of flags
through the lock state devices.

Reported by:	ru
MFC after:	1 week (to be discussed)
2009-10-18 19:45:44 +00:00
kib
2eb5677d22 If ET_DYN binary has non-zero base address for some reason, honour it
and do not relocate the binary to ET_DYN_LOAD_ADDR. This allows for the
binary author to influence address map of the process. In particular,
when the binary is actually an interpeter, this allows to have almost
usual process address map.

Communicate the relocation bias of the mapping for interpeter-less
ET_DYN binary, that is interperter itself, in AT_BASE aux entry. This
way, rtld is able to find its dynamic structure and relocate itself.
Note that mapbase in the rtld is still wrong and requires further
fixing.

Reported and tested by:	rwatson
Discussed with:	kan
MFC after:	3 days
2009-10-18 12:57:48 +00:00
ed
04a930203d Print backspaces after echoing an EOF.
Applications like shells expect EOF to give no graphical output, while
our implementation prints ^D by default (tunable with stty echoctl).
Make the new implementation behave like the old TTY code. Print two
backspaces afterwards.

Reported by:	koitsu
MFC after:	1 month
2009-10-17 08:59:41 +00:00
jhb
f410b0c1a3 Use language more closely resembling English in a panic message.
Pointy hat to:	jhb
Submitted by:	pluknet
2009-10-15 18:51:19 +00:00
jhb
45688ed39d Add a facility for associating optional descriptions with active interrupt
handlers.  This is primarily intended as a way to allow devices that use
multiple interrupts (e.g. MSI) to meaningfully distinguish the various
interrupt handlers.
- Add a new BUS_DESCRIBE_INTR() method to the bus interface to associate
  a description with an active interrupt handler setup by BUS_SETUP_INTR.
  It has a default method (bus_generic_describe_intr()) which simply passes
  the request up to the parent device.
- Add a bus_describe_intr() wrapper around BUS_DESCRIBE_INTR() that supports
  printf(9) style formatting using var args.
- Reserve MAXCOMLEN bytes in the intr_handler structure to hold the name of
  an interrupt handler and copy the name passed to intr_event_add_handler()
  into that buffer instead of just saving the pointer to the name.
- Add a new intr_event_describe_handler() which appends a description string
  to an interrupt handler's name.
- Implement support for interrupt descriptions on amd64 and i386 by having
  the nexus(4) driver supply a custom bus_describe_intr method that invokes
  a new intr_describe() MD routine which in turn looks up the associated
  interrupt event and invokes intr_event_describe_handler().

Requested by:	many
Reviewed by:	scottl
MFC after:	2 weeks
2009-10-15 14:54:35 +00:00
jhb
f88b32f139 Fix a sign bug in the handling of nice priorities when computing the
interactive score for a thread.

Submitted by:	Taku YAMAMOTO  taku of tackymt.homeip.net
Reviewed by:	jeff
MFC after:	3 days
2009-10-15 11:41:12 +00:00
jkoshy
879694e8a1 Improve the description of sysctl "kern.sugid_coredump".
Submitted by:	Mel Flynn <mel.flynn+fbsd.hackers at mailing.thruhere.net>
		on -hackers
2009-10-12 15:49:48 +00:00
kib
65925abb9e Fix typo.
Submitted by:	rdivacky
MFC after:	1 month
2009-10-12 10:09:48 +00:00
kib
9ed435d37a Currently, when signal is delivered to the process and there is a thread
not blocking the signal, signal is placed on the thread sigqueue. If
the selected thread is in kernel executing thr_exit() or sigprocmask()
syscalls, then signal might be not delivered to usermode for arbitrary
amount of time, and for exiting thread it is lost.

Put process-directed signals to the process queue unconditionally,
selecting the thread to deliver the signal only by the thread returning
to usermode, since only then the thread can handle delivery of signal
reliably. For exiting thread or thread that has blocked some signals,
check whether the newly blocked signal is queued for the process, and
try to find a thread to wakeup for delivery, in reschedule_signal(). For
exiting thread, assume that all signals are blocked.

Change cursig() and postsig() to look both into the thread and process
signal queues. When there is a signal that thread returning to usermode
could consume, TDF_NEEDSIGCHK flag is not neccessary set now. Do
unlocked read of p_siglist and p_pendingcnt to check for queued signals.

Note that thread that has a signal unblocked might get spurious wakeup
and EINTR from the interruptible system call now, due to the possibility
of being selected by reschedule_signals(), while other thread returned
to usermode earlier and removed the signal from process queue. This
should not cause compliance issues, since the thread has not blocked a
signal and thus should be ready to receive it anyway.

Reported by:	Justin Teller <justin.teller gmail com>
Reviewed by:	davidxu, jilles
MFC after:	1 month
2009-10-11 16:49:30 +00:00
kib
31d083f28a Refine r195509, instead of checking that vnode type is VBAD, that is
set quite late in the revocation path, properly verify that vnode is
not doomed before calling VOP.

Reported and tested by:	Harald Schmalzbauer <h.schmalzbauer omnilan de>
MFC after:	3 days
2009-10-10 21:17:30 +00:00
kib
edf781a815 Map PIE binaries at non-zero base address.
Discussed with:	bz
Reviewed by:	kan
Tested by:	bz (i386, amd64), bsam (linux)
MFC after:	some time
2009-10-10 15:33:01 +00:00
kib
16ff64e8c6 Do not map segments of zero length.
Discussed with:	bz
Reviewed by:	kan
Tested by:	bz (i386, amd64), bsam (linux)
MFC after:	some time
2009-10-10 15:28:52 +00:00
kib
87a472715d Postpone dropping fp till both kq_global and kqueue mutexes are
unlocked. fdrop() closes file descriptor when reference count goes to
zero. Close method for vnodes locks the vnode, resulting in "sleepable
after non-sleepable". For pipes, pipe mutex is before kqueue lock,
causing LOR.

Reported and tested by:	pho
MFC after:	2 weeks
2009-10-10 14:56:34 +00:00
rwatson
60dba30ff0 Fix build on amd64, where sysctl arg1 is a pointer.
Reported by:	Mr Tinderbox
MFC after:	3 months
2009-10-05 22:23:12 +00:00
trasz
ef5e63d04a Fix NFSv4 ACLs on sparc64. Turns out that fuword(9) fetches 64 bits
instead of sizeof(int), and on sparc64 that resulted in fetching wrong
value for acl_maxcnt, which in turn caused __acl_get_link(2) to fail
with EINVAL.

PR:		sparc64/139304
Submitted by:	Dmitry Afanasiev <KOT at MATPOCKuH.Ru>
2009-10-05 19:56:56 +00:00
rwatson
8579778ae2 First cut at implementing SOCK_SEQPACKET support for UNIX (local) domain
sockets.  This allows for reliable bi-directional datagram communication
over UNIX domain sockets, in contrast to SOCK_DGRAM (M:N, unreliable) or
SOCK_STERAM (bi-directional bytestream).  Largely, this reuses existing
UNIX domain socket code.  This allows applications requiring record-
oriented semantics to do so reliably via local IPC.

Some implementation notes (also present in XXX comments):

- Currently we lack an sbappend variant able to do datagrams and control
  data without doing addresses, so we mark SOCK_SEQPACKET as PR_ADDR.
  Adding a new variant will solve this problem.

- UNIX domain sockets on FreeBSD provide back-pressure/flow control
  notification for stream sockets by manipulating the send socket
  buffer's size during pru_send and pru_rcvd.  This trick works less well
  for SOCK_SEQPACKET as sosend_generic() uses sb_hiwat not just to
  manage blocking, but also to determine maximum datagram size.  Fixing
  this requires rethinking how back-pressure is done for SOCK_SEQPACKET;
  in the mean time, it's possible to get EMSGSIZE when buffers fill,
  instead of blocking.

Discussed with:	benl
Reviewed by:	bz, rpaulo
MFC after:	3 months
Sponsored by:	Google
2009-10-05 14:49:16 +00:00
attilio
df8d61d246 When releasing a lockmgr held in shared way we need to use a write memory
barrier in order to avoid, on architectures which doesn't have strong
ordered writes, CPU instructions reordering.

Diagnosed by:	fabio
2009-10-03 15:02:55 +00:00
bz
a0d8f55f8a Print a warning in case we cannot add more brandinfo because
we would overflow the MAX_BRANDS sized array.

Reviewed by:	kib
MFC After:	1 month
2009-10-03 10:50:00 +00:00
rwatson
86b3bcad7d Don't comment on stream socket handling in sosend_dgram, since that's
not handled.

MFC after:	3 weeks
2009-10-02 21:31:15 +00:00
bz
bc660fe08f Add a mitigation feature that will prevent user mappings at
virtual address 0, limiting the ability to convert a kernel
NULL pointer dereference into a privilege escalation attack.

If the sysctl is set to 0 a newly started process will not be able
to map anything in the address range of the first page (0 to PAGE_SIZE).
This is the default. Already running processes are not affected by this.

You can either change the sysctl or the tunable from loader in case
you need to map at a virtual address of 0, for example when running
any of the extinct species of a set of a.out binaries, vm86 emulation, ..
In that case set security.bsd.map_at_zero="1".

Superseeds:		r197537
In collaboration with:	jhb, kib, alc
2009-10-02 17:48:51 +00:00
emaste
93e81ca098 In fill_kinfo_thread, copy the thread's name into struct kinfo_proc even
if it is empty.  Otherwise the previous thread's name would remain in the
struct and then be reported for this thread.

Submitted by:	Ryan Stone
MFC after:	1 week
2009-10-01 21:44:30 +00:00
trasz
d5661d631d Provide default implementation for VOP_ACCESS(9), so that filesystems which
want to provide VOP_ACCESSX(9) don't have to implement both.  Note that
this commit makes implementation of either of these two mandatory.

Reviewed by:	kib
2009-10-01 17:22:03 +00:00
kib
6f65ac4277 Do not dereference vp->v_mount without holding vnode lock and checking
that the vnode is not reclaimed.

Noted by:	Igor Sysoev <is rambler-co ru>
MFC after:	1 week
2009-10-01 12:50:26 +00:00
kib
605a0e085a Fix typo.
MFC after:	3 days
2009-10-01 12:46:58 +00:00
avg
78596c163e print machine in kernel boot version string
Discussed with:	gavin, kib, jhb
PR:		kern/126926
MFC after:	2 weeks
2009-10-01 10:53:12 +00:00
attilio
e9f2530ebf When releasing a read/shared lock we need to use a write memory barrier
in order to avoid, on architectures which doesn't have strong ordered
writes, CPU instructions reordering.

Diagnosed by:	fabio
Reviewed by:	jhb
Tested by:	Giovanni Trematerra
		<giovanni dot trematerra at gmail dot com>
2009-09-30 13:26:31 +00:00
avg
49524d648a print_caddr_t: drop incorrect __unused attribute from parameter
seems like a purely cosmetic change

Reviewed by:	jhb, kib
MFC after:	1 week
2009-09-30 11:14:13 +00:00
rwatson
47ee86367a Regenerate system call files following r197636. 2009-09-30 08:48:59 +00:00
rwatson
3d5e3df28c Reserve system call numbers for Capsicum security framework capabilities,
capability mode, and process descriptors: cap_new, cap_getrights, cap_enter,
cap_getmode, pdfork, pdkill, pdgetpid, and pdwait.

Obtained from:	TrustedBSD Project
Sponsored by:	Google
MFC after:	3 weeks
2009-09-30 08:46:01 +00:00
jamie
59de460f51 Set the prison in NFS anon and GSS SVC creds.
Reviewed by:	marcel
MFC after:	3 days
2009-09-28 18:07:16 +00:00
delphij
79f2f8c774 Add two new fcntls to enable/disable read-ahead:
- F_READAHEAD: specify the amount for sequential access.  The amount is
   specified in bytes and is rounded up to nearest block size.
 - F_RDAHEAD: Darwin compatible version that use 128KB as the sequential
   access size.

A third argument of zero disables the read-ahead behavior.

Please note that the read-ahead amount is also constrainted by sysctl
variable, vfs.read_max, which may need to be raised in order to better
utilize this feature.

Thanks Igor Sysoev for proposing the feature and submitting the original
version, and kib@ for his valuable comments.

Submitted by:	Igor Sysoev <is rambler-co ru>
Reviewed by:	kib@
MFC after:	1 month
2009-09-28 16:59:47 +00:00
delphij
fd5f08e3e8 Use correct sizeof() object for klist 'list'. Currently, struct klist
contained only SLIST_HEAD as its member, thus sizeof(struct klist) would
equal to sizeof(struct klist *), so this change makes the code more
correct in terms of semantics, but should be a no-op to compiler at this
time.

Reported by:	MQ <antinvidia at gmail com>
2009-09-28 10:22:46 +00:00
davidxu
14cbd7c194 In function do_rw_wrlock, when a writer got an error and before returning,
check if there are readers blocked by us via URWLOCK_WRITE_WAITERS flag,
and resume the readers. The error must be EAGAIN, otherwise there must
have memory problem, and nobody can rescue the buggy application.

The revision 197445 might be reverted.
2009-09-25 00:03:13 +00:00
mav
19afb02afb Do not call BUS_DRIVER_ADDED() for detached buses (attach failed) on
driver load. This fixes crash on atapicam module load on systems,
where some ata channels (usually ata1) was probed, but failed to attach.

Reviewed by:	jhb, imp
Tested by:	many
2009-09-24 17:03:32 +00:00
rdivacky
69bdbc1b60 Change unsigned foo to u_foo as required by style(9).
Requested by:	bde
Approved by:	ed (mentor)
2009-09-22 16:16:02 +00:00
trasz
7af86ad4d9 Add pieces of infrastructure required for NFSv4 ACL support in UFS.
Reviewed by:	rwatson
2009-09-22 15:15:03 +00:00
kib
9357da94fe Remove forward_roundrobin(), it is unused for quite some time.
Reviewed by:	jhb
MFC after:	1 week
2009-09-21 13:09:56 +00:00
tuexen
de3c71bd61 Get SCTP working in combination with VIMAGE.
Contains code from bz.
Approved by: rrs (mentor)
MFC after: 1 month.
2009-09-19 14:02:16 +00:00
alc
309c5ab06f Add a new sysctl for reporting all of the supported page sizes.
Reviewed by:	jhb
MFC after:	3 weeks
2009-09-18 17:04:57 +00:00
attilio
6aa8dd54a5 Don't allocate new unnecessary pages when devstat_alloc() looses the
run for re-acuiring the lock, but recheck if new pages are allocatable
from the pool and free the previously allocated ones.

Tested by:	pho, Giovanni Trematerra
		<giovanni dot trematerra at gmail dot com>
2009-09-18 13:48:38 +00:00
rdivacky
eefaeaf649 Fix the style of the previous commit.
Approved by:	ed (mentor, implicit)
2009-09-17 17:48:13 +00:00
rdivacky
1f00412058 Make these argument/variable unsigned as the defines for them don't fit
into signed 32bit integer.

Approved by:	ed (mentor, implicit)
Approved by:	sson
2009-09-17 17:41:28 +00:00
sson
95d4e7d075 Add EV_RECEIPT to kevents.
EV_RECEIPT is useful to disambiguating error conditions when multiple
events structures are passed to kevent(2).  The error code is returned
in the data field and EV_ERROR is set.

Approved by:	rwatson (co-mentor)
2009-09-16 03:49:54 +00:00
sson
a386443e51 Add the EV_DISPATCH flag to kevents.
When the EV_DISPATCH flag is used the event source will be disabled
immediately after the delivery of an event.   This is similar to the
EV_ONESHOT flag but it doesn't delete the event.

Approved by:	rwatson (co-mentor)
2009-09-16 03:37:39 +00:00
sson
7cb0718a03 Add EVFILT_USER to kevents.
Add user events support to kernel events which are not associated with any
kernel mechanism but are triggered by user level code.  This is useful for
adding user level events to an event handler that may also be monitoring
kernel events.

Approved by:	rwatson (co-mentor)
2009-09-16 03:30:12 +00:00
sson
297d4ab14b Add optional touch event filter hooks to kevents.
The touch event filter is called when a kernel event data is possibly
updated.  There are two hook points.  First, during a kevent() system
call.  Second, when an event has been triggered.

Approved by:	rwatson (co-mentor)
2009-09-16 03:15:57 +00:00
andre
3e3fe69f6b -Put the optimized soreceive_stream() under a compile time option called
TCP_SORECEIVE_STREAM for the time being.

Requested by:	brooks

Once compiled in make it easily switchable for testers by using a tuneable
 net.inet.tcp.soreceive_stream
and a corresponding read-only sysctl to report the current state.

Suggested by:	rwatson

MFC after:	2 days
-This line, and those below, will be ignored--
> Description of fields to fill in above:                     76 columns --|
> PR:            If a GNATS PR is affected by the change.
> Submitted by:  If someone else sent in the change.
> Reviewed by:   If someone else reviewed your modification.
> Approved by:   If you needed approval for this commit.
> Obtained from: If the change is from a third party.
> MFC after:     N [day[s]|week[s]|month[s]].  Request a reminder email.
> Security:      Vulnerability reference (one per line) or description.
> Empty fields above will be automatically removed.

M    sys/conf/options
M    sys/kern/uipc_socket.c
M    sys/netinet/tcp_subr.c
M    sys/netinet/tcp_usrreq.c
2009-09-15 22:23:45 +00:00
attilio
867bd3bd7f Fix sched_switch_migrate():
- In 8.x and above the run-queue locks are nomore shared even in the
  HTT case, so remove the special case.
- The deadlock explained in the removed comment here is still possible
  even with different locks, with the contribution of tdq_lock_pair().
  An explanation is here:
  (hypotesis: a thread needs to migrate on another CPU, thread1 is doing
  sched_switch_migrate() and thread2 is the one handling the sched_switch()
  request or in other words, thread1 is the thread that needs to migrate
  and thread2 is a thread that is going to be preempted, most likely an
  idle thread. Also, 'old' is referred to the context (in terms of
  run-queue and CPU) thread1 is leaving and 'new' is referred to the
  context thread1 is going into.  Finally, thread3 is doing tdq_idletd()
  or sched_balance() and definitively doing tdq_lock_pair())

  * thread1 blocks its td_lock. Now td_lock is 'blocked'
  * thread1 drops its old runqueue lock
  * thread1 acquires the new runqueue lock
  * thread1 adds itself to the new runqueue and sends an IPI_PREEMPT
    through tdq_notify() to the new CPU
  * thread1 drops the new lock
  * thread3, scanning the runqueues, locks the old lock
  * thread2 received the IPI_PREEMPT and does thread_lock() with td_lock
    pointing to the new runqueue
  * thread3 wants to acquire the new runqueue lock, but it can't because
    it is held by thread2 so it spins
  * thread1 wants to acquire old lock, but as long as it is held by
    thread3 it can't
  * thread2 going further, at some point wants to switchin in thread1,
    but it will wait forever because thread1->td_lock is in blocked state

This deadlock has been manifested mostly on 7.x and reported several time
on mailing lists under the voice 'spinlock held too long'.
Many thanks to des@ for having worked hard on producing suitable textdumps
and Jeff for help on the comment wording.

Reviewed by:	jeff
Reported by:	des, others
Tested by:	des, Giovanni Trematerra
		<giovanni dot trematerra at gmail dot com>
		(STABLE_7 based version)
2009-09-15 16:56:17 +00:00
attilio
47c9d3ff5e Revert r196779 in order to implement a different scheme for newbus locking
methodology.

Requested by:	imp
2009-09-13 15:08:19 +00:00
luigi
02e6d1c7f9 Make sure callouts are not processed one tick late.
The problem was introduced in SVN 180608/ rev 1.114 and affects
all users of callout_reset() (including select, usleep, setitimer).
A better fix probably involves replicating 'ticks' in the
struct callout_cpu; this commit is just a temporary thing so that
we can MFC it after a suitable test time and RE approval.

MFC after:	3 days
2009-09-12 21:44:34 +00:00
rwatson
53eaed07bc Use C99 initialization for struct filterops.
Obtained from:	Mac OS X
Sponsored by:	Apple Inc.
MFC after:	3 weeks
2009-09-12 20:03:45 +00:00
n_hibma
ae42cb4782 Add a comment on the consequences of reducing the poweroff delay 2009-09-10 18:24:59 +00:00
kib
91e6a5b3cc kern_select(9) copies fd_set in and out of userspace in quantities of
longs. Since 32bit processes longs are 4 bytes, 64bit kernel may copy in
or out 4 bytes more then the process expected.

Calculate the amount of bytes to copy taking into account size of fd_set
for the current process ABI.

Diagnosed and tested by:	Peter Jeremy <peterjeremy acm org>
Reviewed by:	jhb
MFC after:	1 week
2009-09-09 20:59:01 +00:00
kib
7a991968f2 Unlock the image vnode around the call of pmc PMC_FN_PROCESS_EXEC hook.
The hook calls vn_fullpath(9), that should not be executed with a vnode
lock held.

Reported by:	Bruce Cran <bruce cran org uk>
Tested by:	pho
MFC after:	3 days
2009-09-09 10:52:36 +00:00
kib
3119d26c95 In vfs_mark_atime(9), be resistent against reclaimed vnodes.
Assert that neccessary locks are taken, since vop might not be called.

Tested by:	pho
MFC after:	3 days
2009-09-09 10:51:50 +00:00
phk
5297261651 Revert previous commit and add myself to the list of people who should
know better than to commit with a cat in the area.
2009-09-08 13:19:05 +00:00
phk
2314521104 Add necessary include. 2009-09-08 13:16:55 +00:00
antoine
3aaee1a8e7 Change w_notrunning and w_stillcold from pointer to array so that sizeof
returns what is expected.

PR:		kern/138557
Discussed with:	brucec@
MFC after:	1 month
2009-09-06 13:31:05 +00:00
kib
14578a3276 In fhopen, vfs_ref() the mount point while vnode is unlocked, to prevent
vn_start_write(NULL, &mp) from operating on potentially freed or reused
struct mount *.

Remove unmatched vfs_rel() in cleanup.

Noted and reviewed by:	tegge
Tested by:	pho
MFC after:	3 days
2009-09-06 11:44:46 +00:00
ed
73faea2301 Move ptmx into pty(4).
Now that pty(4) is a loadable kernel module, I'd better move /dev/ptmx
in there as well. This means that pty(4) now provides almost all
pseudo-terminal compatibility code. This means it's very easy to test
whether applications use the proper library interfaces when allocating
pseudo-terminals (namely posix_openpt and openpty).
2009-09-06 10:27:45 +00:00
jamie
319f7fbbc6 Allow a jail's name to be the same as its jid (which is the default if no
name is specified), but still disallow other numeric names.

Reviewed by:	zec
Approved by:	bz (mentor)
MFC after:	3 days
2009-09-04 19:00:48 +00:00
attilio
ef39f869d5 Add intermediate states for attaching and detaching that will be
reused by the enhached newbus locking once it is checked in.
This change can be easilly MFCed to STABLE_8 at the appropriate moment.

Reviewed by:	jhb, scottl
Tested by:	Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
2009-09-03 13:40:41 +00:00
attilio
ce3b4baf3c Fix some bugs related to adaptive spinning:
In the lockmgr support:
- GIANT_RESTORE() is just called when the sleep finishes, so the current
  code can ends up into a giant unlock problem.  Fix it by appropriately
  call GIANT_RESTORE() when needed.  Note that this is not exactly ideal
  because for any interation of the adaptive spinning we drop and restore
  Giant, but the overhead should be not a factor.
- In the lock held in exclusive mode case, after the adaptive spinning is
  brought to completition, we should just retry to acquire the lock
  instead to fallthrough. Fix that.
- Fix a style nit

In the sx support:
- Call GIANT_SAVE() before than looping. This saves some overhead because
  in the current code GIANT_SAVE() is called several times.

Tested by:	Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
2009-09-02 17:33:51 +00:00
kib
32ffadd900 Fix mount reference leak when V_XSLEEP is specified to vn_start_write().
Submitted by:	tegge
2009-09-01 12:05:39 +00:00
kib
bae5df8cbb Reintroduce the r196640, after fixing the problem with my testing.
Remove the altkstacks, instead instantiate threads with kernel stack
allocated with the right size from the start. For the thread that has
kernel stack cached, verify that requested stack size is equial to the
actual, and reallocate the stack if sizes differ [1].

This fixes the bug introduced by r173361 that was committed several days
after r173004 and consisted of kthread_add(9) ignoring the non-default
kernel stack size.

Also, r173361 removed the caching of the kernel stacks for a non-first
thread in the process. Introduce separate kernel stack cache that keeps
some limited amount of preallocated kernel stacks to lower the latency
of thread allocation. Add vm_lowmem handler to prune the cache on
low memory condition. This way, system with reasonable amount of the
threads get lower latency of thread creation, while still not exhausting
significant portion of KVA for unused kstacks.

Submitted by:	peter [1]
Discussed with:	jhb, julian, peter
Reviewed by:	jhb
Tested by:	pho (and retested according to new test scenarious)
MFC after:	1 week
2009-09-01 11:41:51 +00:00
kib
eab33b2665 Make the mnt_writeopcount and mnt_secondary_writes counters,
used by the suspension code, not greater then mnt_ref reference
counter value. Increment mnt_ref together with write counter
in vn_start_write()/ vn_start_secondary_write(), releasing in
vn_finished_write/vn_finished_secondary_write().

Since r186197, unmount code requires that no writers occured after all
references are expired. We still could get write counter incremented
for freed or reused struct mount, but it seems to be innocent, since
corresponding vnode should be referenced and reclaimed then.

Reported by:	pho (last half a year), erwin
Reviewed by:	attilio
Tested by:	pho, erwin
MFC after:	1 week
2009-08-31 10:20:52 +00:00
bz
840afe36da Make sure FreeBSD binaries without .note.ABI-tag section work
correctly and do not match a colliding Debian GNU/kFreeBSD
brandinfo statements.
For this mark the Debian GNU/kFreeBSD brandinfo that it must have
an .note.ABI-tag section and ignore the old EI_OSABI brandinfo
when comparing a possibly colliding set of options.

Due to SYSINIT we add the brandinfo in a non-deterministic order,
so native FreeBSD is not always first. We may want to consider
to force native FreeBSD to come first as well.

The only way a problem could currently be noticed is when running an
i386 binary without the .note.ABI-tag on amd64 and the Debian GNU/kFreeBSD
brandinfo  was matched first,  as the fallback to ld-elf32.so.1 does
not exist in that case.

Reported and tested by:	ticso
In collaboration with:	kib
MFC after:		3 days
2009-08-30 14:38:17 +00:00
kib
d105721a22 Reverse r196640 and r196644 for now. 2009-08-29 21:53:08 +00:00
kib
7e88789784 Dispose the kernel stack of the proper thread.
Submitted by:	alc
MFC after:	1 week
2009-08-29 18:01:02 +00:00
kib
9e8ade6852 Remove the altkstacks, instead instantiate threads with kernel stack
allocated with the right size from the start. For the thread that has
kernel stack cached, verify that requested stack size is equial to the
actual, and reallocate the stack if sizes differ [1].

This fixes the bug introduced by r173361 that was committed several days
after r173004 and consisted of kthread_add(9) ignoring the non-default
kernel stack size.

Also, r173361 removed the caching of the kernel stacks for a non-first
thread in the process. Introduce separate kernel stack cache that keeps
some limited amount of preallocated kernel stacks to lower the latency
of thread allocation. Add vm_lowmem handler to prune the cache on
low memory condition. This way, system with reasonable amount of the
threads get lower latency of thread creation, while still not exhausting
significant portion of KVA for unused kstacks.

Submitted by:	peter [1]
Discussed with:	jhb, julian, peter
Reviewed by:	jhb
Tested by:	pho
MFC after:	1 week
2009-08-29 13:28:02 +00:00
jhb
7b069e86c6 Extend the device pager to support different memory attributes on different
pages in an object.
- Add a new variant of d_mmap() currently called d_mmap2() which accepts
  an additional in/out parameter that is the memory attribute to use for
  the requested page.
- A driver either uses d_mmap() or d_mmap2() for all requests but not both.
  The current implementation uses a flag in the cdevsw (D_MMAP2) to indicate
  that the driver provides a d_mmap2() handler instead of d_mmap().  This
  is done to make the change ABI compatible with existing drivers and
  MFC'able to 7 and 8.

Submitted by:	alc
MFC after:	1 month
2009-08-28 14:06:55 +00:00
jamie
c78efa488e Fix a LOR between allprison_lock and vnode locks by releasing
allprison_lock before releasing a prison's root vnode.

PR:		kern/138004
Reviewed by:	kib
Approved by:	bz (mentor)
MFC after:	3 days
2009-08-27 16:15:51 +00:00
marius
931a88ff68 Add a temporary workaround which just lets init die instead of
causing a panic if it is killed due to a unsolved stack overflow
seen very late during shutdown on sparc64 when the gmirror worker
process exists, which is a regression introduced in 8.0.

Reviewed by:	kib
MFC after:	3 days
2009-08-26 21:10:47 +00:00
kib
6e8f2df92e Honor the vfs.timestamp_precision sysctl settings for utimes(path, NULL)
and similar calls.

Obtained from:	Petr Salinger, Debian GNU/kFreeBSD, Debian bug #489894
MFC after:	3 days
2009-08-26 14:32:37 +00:00
jilles
0b9c3c2f3d Fix poll() on half-closed sockets, while retaining POLLHUP for fifos.
This reverts part of r196460, so that sockets only return POLLHUP if both
directions are closed/error. Fifos get POLLHUP by closing the unused
direction immediately after creating the sockets.

The tools/regression/poll/*poll.c tests now pass except for two other things:
- if POLLHUP is returned, POLLIN is always returned as well instead of only
  when there is data left in the buffer to be read
- fifo old/new reader distinction does not work the way POSIX specs it

Reviewed by:	kib, bde
2009-08-25 21:44:14 +00:00
imp
649ad6d51f Rather than havnig enabled/disabled, implement a max queue depth.
While usually not an issue, this firewalls bugs in the code that may
run us out of memory.

Fix a memory exhaustion in the case where devctl was disabled, but the
link was bouncing.  The check to queue was in the wrong place.

Implement a new sysctl hw.bus.devctl_queue to control the depth.  Make
compatibility hacks for hw.bus.devctl_disable to ease transition.

Reviewed by:	emaste@
Approved by:	re@ (kib)
MFC after:	asap
2009-08-25 06:25:59 +00:00
bz
ba7b3afabc Fix handling of .note.ABI-tag section for GNU systems [1].
Handle GNU/Linux according to LSB Core Specification 4.0,
Chapter 11. Object Format, 11.8. ABI note tag.

Also check the first word of desc, not only name, according to
glibc abi-tags specification to distinguish between Linux and
kFreeBSD.

Add explicit handling for Debian GNU/kFreeBSD, which runs
on our kernels as well [2].

In {amd64,i386}/trap.c, when checking osrel of the current process,
also check the ABI to not change the signal behaviour for Linux
binary processes, now that we save an osrel version for all three
from the lists above in struct proc [2].

These changes make it possible to run FreeBSD, Debian GNU/kFreeBSD
and Linux binaries on the same machine again for at least i386 and
amd64, and no longer break kFreeBSD which was detected as GNU(/Linux).

PR:		kern/135468
Submitted by:	dchagin [1] (initial patch)
Suggested by:	kib [2]
Tested by:	Petr Salinger (Petr.Salinger seznam.cz) for kFreeBSD
Reviewed by:	kib
MFC after:	3 days
2009-08-24 16:19:47 +00:00
ed
7669d1f2c1 Allow multiple console devices per driver without insane code duplication.
Say, a driver wants to have multiple console devices to pick from, you
would normally write down something like this:

	CONSOLE_DRIVER(dev1);
	CONSOLE_DRIVER(dev2);

Unfortunately, this means that you have to declare 10 cn routines,
instead of 5. It also isn't possible to initialize cn_arg on beforehand.

I noticed this restriction when I was implementing some of the console
bits for my vt(4) driver in my newcons branch. I have a single set of cn
routines (termcn_*) which are shared by all vt(4) console devices.

In order to solve this, I'm adding a separate consdev_ops structure,
which contains all the function pointers. This structure is referenced
through consdev's cn_ops field.

While there, I'm removing CONS_DRIVER() and cn_checkc, which have been
deprecated for years. They weren't used throughout the source, until the
Xen console driver showed up. CONSOLE_DRIVER() has been changed to do
the right thing. It now declares both the consdev and consdev_ops
structure and ties them together. In other words: this change doesn't
change the KPI for drivers that used the regular way of declaring
console devices.

If drivers want to use multiple console devices, they can do this as
follows:

	static const struct consdev_ops mydriver_cnops = {
		.cn_probe	= mydriver_cnprobe,
		...
	};
	static struct mydriver_softc cons0_softc = {
		...
	};
	CONSOLE_DEVICE(cons0, mydriver_cnops, &cons0_softc);
	static struct mydriver_softc cons1_softc = {
		...
	};
	CONSOLE_DEVICE(cons1, mydriver_cnops, &cons1_softc);

Obtained from:	//depot/user/ed/newcons/...
2009-08-24 10:53:30 +00:00
zec
927d43d574 When "jail -c vnet" request fails, the current code actually creates and
leaves behind an orphaned vnet.  This change ensures that such vnets get
released.

This change affects only options VIMAGE builds.

Submitted by:	jamie
Discussed with:	bz
Approved by:	re (rwatson), julian (mentor)
MFC after:	3 days
2009-08-24 10:16:19 +00:00
zec
7b5545dea0 When registering a protocol to an existing protocol domain via
pf_proto_register(), iterate over all existing vnets to call protosw_init()
and thus the appropriate .pr_init() handler in the context of each vnet.
NB in the future we probably want to separate pr_init() handlers into
two, i.e. per-vnet and global, functions.

This change has no impact on nooptions VIMAGE builds.

Approved by:	re (rwatson), julian (mentor)
MFC after:	3 days
2009-08-24 10:03:41 +00:00
rwatson
ef8d755d4d Rework global locks for interface list and index management, correcting
several critical bugs, including race conditions and lock order issues:

Replace the single rwlock, ifnet_lock, with two locks, an rwlock and an
sxlock.  Either can be held to stablize the lists and indexes, but both
are required to write.  This allows the list to be held stable in both
network interrupt contexts and sleepable user threads across sleeping
memory allocations or device driver interactions.  As before, writes to
the interface list must occur from sleepable contexts.

Reviewed by:	bz, julian
MFC after:	3 days
2009-08-23 20:40:19 +00:00
ed
3d8e6186e2 Allow pty(4) to be loaded as a kld.
Unfortunately, the wrappers that are present in pts(4) don't have the
mechanics to allow pty(4) to be unloaded safely, so I'm forcing this kld
to return EBUSY. This also means we have to enable some extra code in
pts(4) unconditionally.

Proposed by:	rwatson
2009-08-23 20:26:09 +00:00
kib
8b1803af93 Fix the conformance of poll(2) for sockets after r195423 by
returning POLLHUP instead of POLLIN for several cases. Now, the
tools/regression/poll results for FreeBSD are closer to that of the
Solaris and Linux.

Also, improve the POSIX conformance by explicitely clearing POLLOUT
when POLLHUP is reported in pollscan(), making the fix global.

Submitted by:	bde
Reviewed by:	rwatson
MFC after:	1 week
2009-08-23 12:44:15 +00:00
rpaulo
993a28caeb Constify prime numbers. 2009-08-23 09:55:06 +00:00
ed
503bd5d6c5 Add ttydisc_rint_simple().
I noticed several drivers in our tree don't actually care about parity
and framing, such as pts(4), snp(4) (and my partially finished console
driver). Instead of duplicating a lot of code, I think we'd better add a
utility function for those drivers to quickly process a buffer of input.

Also change pts(4) and snp(4) to use this function.
2009-08-23 08:04:40 +00:00
jhb
9137c5d8b4 This patch fixes two bugs in sglist(9) and improves robustness of the API via
better semantics if a request to append an address range to an existing list
fails.
- When cloning an sglist, properly set the length in the new sglist instead of
  leaving the new list empty.
- Properly compute the amount of data added to an sglist via
  _sglist_append_buf().  This allows sglist_consume_uio() to properly update
  uio_resid.
- When a request to append an address range to a scatter/gather list fails,
  restore the sglist to the state it had at the start of the function call
  instead of resetting it to an empty list.

Requested by:	np (3)
Approved by:	re (kib)
2009-08-21 02:59:07 +00:00
jhb
bb1c942f11 Change the 'resid' parameter to sglist_consume_uio() from an int to a
size_t to match the recent type change of the uio_resid member of struct
uio.

Approved by:	re (kib)
2009-08-20 19:23:58 +00:00
jhb
9b0755de9f Temporarily revert the new-bus locking for 8.0 release. It will be
reintroduced after HEAD is reopened for commits by re@.

Approved by:	re (kib), attilio
2009-08-20 19:17:53 +00:00
ed
c06ca7508d Small changes to the warning message generated by pty(4):
- Only print the warning once, instead of filling up the screen.
- Use the word "legacy" for the pty_warningcnt description, to prevent
  confusion.
- Use log() instead of printf().

Discussed with:	rwatson, jhb
Approved by:	re (kib)
2009-08-19 14:30:46 +00:00
pjd
099429aa32 Remove unused taskqueue_find() function.
Reviewed by:	dfr
Approved by:	re (kib)
2009-08-18 13:55:48 +00:00
attilio
e75d30c87f * Change the scope of the ASSERT_ATOMIC_LOAD() from a generic check to
a pointer-fetching specific operation check. Consequently, rename the
  operation ASSERT_ATOMIC_LOAD_PTR().
* Fix the implementation of ASSERT_ATOMIC_LOAD_PTR() by checking
  directly alignment on the word boundry, for all the given specific
  architectures. That's a bit too strict for some common case, but it
  assures safety.
* Add a comment explaining the scope of the macro
* Add a new stub in the lockmgr specific implementation

Tested by: marcel (initial version), marius
Reviewed by: rwatson, jhb (comment specific review)
Approved by: re (kib)
2009-08-17 16:17:21 +00:00