Commit Graph

11294 Commits

Author SHA1 Message Date
Bjoern A. Zeeb
be31e5e7b5 Make the in-kernel logic for the SIOCSIFVNET, SIOCSIFRVNET ioctls
(ifconfig ifN (-)vnet <jname|jid>) work correctly.

Move vi_if_move to if.c and split it up into two functions(*),
one for each ioctl.

In the reclaim case, correctly set the vnet before calling if_vmove.

Instead of silently allowing a move of an interface from the current
vnet to the current vnet, return an error. (*)

There is some duplicate interface name checking before actually moving
the interface between network stacks without locking and thus race
prone. Ideally if_vmove will correctly and automagically handle these
in the future.

Suggested by:	rwatson (*)
Approved by:	re (kib)
2009-07-26 11:29:26 +00:00
Jamie Gritton
7cbf72137f Some jail parameters (in particular, "ip4" and "ip6" for IP address
restrictions) were found to be inadequately described by a boolean.
Define a new parameter type with three values (disable, new, inherit)
to handle these and future cases.

Approved by:	re (kib), bz (mentor)
Discussed with:	rwatson
2009-07-25 14:48:57 +00:00
Brooks Davis
254d03c508 Introduce a new sysctl process mib, kern.proc.groups which adds the
ability to retrieve the group list of each process.

Modify procstat's -s option to query this mib when the kinfo_proc
reports that the field has been truncated.  If the mib does not exist,
fall back to the truncated list.

Reviewed by:	rwatson
Approved by:	re (kib)
MFC after:	2 weeks
2009-07-24 19:12:19 +00:00
Brooks Davis
1b5768be71 Revert the changes to struct kinfo_proc in r194498. Instead, fill
in up to 16 (KI_NGROUPS) values and steal a bit from ki_cr_flags
(all bits currently unused) to indicate overflow with the new flag
KI_CRF_GRP_OVERFLOW.

This fixes procstat -s.

Approved by: re (kib)
2009-07-24 15:03:10 +00:00
John Baldwin
013818111a Add a new type of VM object: OBJT_SG. An OBJT_SG object is very similar to
a device pager (OBJT_DEVICE) object in that it uses fictitious pages to
provide aliases to other memory addresses.  The primary difference is that
it uses an sglist(9) to determine the physical addresses for a given offset
into the object instead of invoking the d_mmap() method in a device driver.

Reviewed by:	alc
Approved by:	re (kensmith)
MFC after:	2 weeks
2009-07-24 13:50:29 +00:00
Robert Watson
d0728d7174 Introduce and use a sysinit-based initialization scheme for virtual
network stacks, VNET_SYSINIT:

- Add VNET_SYSINIT and VNET_SYSUNINIT macros to declare events that will
  occur each time a network stack is instantiated and destroyed.  In the
  !VIMAGE case, these are simply mapped into regular SYSINIT/SYSUNINIT.
  For the VIMAGE case, we instead use SYSINIT's to track their order and
  properties on registration, using them for each vnet when created/
  destroyed, or immediately on module load for already-started vnets.
- Remove vnet_modinfo mechanism that existed to serve this purpose
  previously, as well as its dependency scheme: we now just use the
  SYSINIT ordering scheme.
- Implement VNET_DOMAIN_SET() to allow protocol domains to declare that
  they want init functions to be called for each virtual network stack
  rather than just once at boot, compiling down to DOMAIN_SET() in the
  non-VIMAGE case.
- Walk all virtualized kernel subsystems and make use of these instead
  of modinfo or DOMAIN_SET() for init/uninit events.  In some cases,
  convert modular components from using modevent to using sysinit (where
  appropriate).  In some cases, do minor rejuggling of SYSINIT ordering
  to make room for or better manage events.

Portions submitted by:	jhb (VNET_SYSINIT), bz (cleanup)
Discussed with:		jhb, bz, julian, zec
Reviewed by:		bz
Approved by:		re (VIMAGE blanket)
2009-07-23 20:46:49 +00:00
Bjoern A. Zeeb
a08362ce46 sysctl_msec_to_ticks is used with both virtualized and
non-vrtiualized sysctls so we cannot used one common function.

Add a macro to convert the arg1 in the virtualized case to
vnet.h to not expose the maths to all over the code.

Add a wrapper for the single virtualized call, properly handling
arg1 and call the default implementation from there.

Convert the two over places to use the new macro.

Reviewed by:	rwatson
Approved by:	re (kib)
2009-07-21 21:58:55 +00:00
Rui Paulo
5b4d5f9ee2 Improve the printf message when a module failed to load. This gives the
user some clue about the possibility of a __FreeBSD_version mismatch.

Discussed with:	rwatson, jhb
Approved by:	re (kib)
2009-07-21 14:18:25 +00:00
Robert Watson
17ef1feb8a Add macros VNET_SETNAME and VNET_SYMPREFIX, and expose to userspace if
_WANT_VNET is defined.  This way we don't need separate definitions in
libkvm.

Reviewed by:	bz
Approved by:	re (vimage blanket)
2009-07-20 07:50:50 +00:00
Konstantin Belousov
7ac5806bfb When buffer write is failed, it is wrong for brelse() to invalidate
portion of the page that was written. Among other problems, this
page might be picked up by pagedaemon, with failed assertion in
vm_pageout_flush() about validity of the page.

Reported and tested by:	pho
Approved by:	re (kensmith)
MFC after:	3 weeks
2009-07-19 20:25:59 +00:00
Robert Watson
006e9db452 Normalize field naming for struct vnet, fix two debugging printfs that
print them.

Reviewed by:	bz
Approved by:	re (kensmith, kib)
2009-07-19 17:40:45 +00:00
Robert Watson
5ee847d3ac Reimplement and/or implement vnet list locking by replacing a mostly
unused custom mutex/condvar-based sleep locks with two locks: an
rwlock (for non-sleeping use) and sxlock (for sleeping use).  Either
acquired for read is sufficient to stabilize the vnet list, but both
must be acquired for write to modify the list.

Replace previous no-op read locking macros, used in various places
in the stack, with actual locking to prevent race conditions.  Callers
must declare when they may perform unbounded sleeps or not when
selecting how to lock.

Refactor vnet sysinits so that the vnet list and locks are initialized
before kernel modules are linked, as the kernel linker will use them
for modules loaded by the boot loader.

Update various consumers of these KPIs based on whether they may sleep
or not.

Reviewed by:	bz
Approved by:	re (kib)
2009-07-19 14:20:53 +00:00
Jamie Gritton
7afcbc18b3 Remove the interim vimage containers, struct vimage and struct procg,
and the ioctl-based interface that supported them.

Approved by:	re (kib), bz (mentor)
2009-07-17 14:48:21 +00:00
Robert Watson
cd2b95a237 r195699 introduced an assertion regarding when progbits data in kernel
modules was present, which turns out to be false in some situations.
Back out the assertion.

Reported by:	Luiz Otavio O Souza <lists.br at gmail.com>,
		Florian Smeets <flo at kasimir.com>
Approved by:	re (kensmith) (implicit)
2009-07-15 09:19:01 +00:00
Konstantin Belousov
f33a947b56 Add new msleep(9) flag PBDY that shall be specified together with
PCATCH, to indicate that thread shall not be stopped upon receipt of
SIGSTOP until it reaches the kernel->usermode boundary.

Also change thread_single(SINGLE_NO_EXIT) to only stop threads at
the user boundary unconditionally.

Tested by:	pho
Reviewed by:	jhb
Approved by:	re (kensmith)
2009-07-14 22:52:46 +00:00
Konstantin Belousov
79799053a7 Move the repeated code to calculate the number of the threads in the
process that still need to be suspended or exited from thread_single
into the new function calc_remaining().

Tested by:	pho
Reviewed by:	jhb
Approved by:	re (kensmith)
2009-07-14 22:51:31 +00:00
Konstantin Belousov
c8167830f9 When wakeup(9) is going to notify swapper, assert that wait channel is not
equal to &proc0. It shall be not, since proc0 stack is not swappable, and
kick_proc0() is wakeup(&proc0).

Reviewed by:	jhb
Approved by:	re (kensmith)
2009-07-14 22:50:41 +00:00
Robert Watson
eddfbb763d Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator.  Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...).  This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.

Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack.  Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory.  Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.

Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy.  Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address.  When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.

This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.

Bump __FreeBSD_version and update UPDATING.

Portions submitted by:  bz
Reviewed by:            bz, zec
Discussed with:         gnn, jamie, jeff, jhb, julian, sam
Suggested by:           peter
Approved by:            re (kensmith)
2009-07-14 22:48:30 +00:00
Alan Cox
3153e878dd Add support to the virtual memory system for configuring machine-
dependent memory attributes:

Rename vm_cache_mode_t to vm_memattr_t.  The new name reflects the
fact that there are machine-dependent memory attributes that have
nothing to do with controlling the cache's behavior.

Introduce vm_object_set_memattr() for setting the default memory
attributes that will be given to an object's pages.

Introduce and use pmap_page_{get,set}_memattr() for getting and
setting a page's machine-dependent memory attributes.  Add full
support for these functions on amd64 and i386 and stubs for them on
the other architectures.  The function pmap_page_set_memattr() is also
responsible for any other machine-dependent aspects of changing a
page's memory attributes, such as flushing the cache or updating the
direct map.  The uses include kmem_alloc_contig(), vm_page_alloc(),
and the device pager:

  kmem_alloc_contig() can now be used to allocate kernel memory with
  non-default memory attributes on amd64 and i386.

  vm_page_alloc() and the device pager will set the memory attributes
  for the real or fictitious page according to the object's default
  memory attributes.

Update the various pmap functions on amd64 and i386 that map pages to
incorporate each page's memory attributes in the mapping.

Notes: (1) Inherent to this design are safety features that prevent
the specification of inconsistent memory attributes by different
mappings on amd64 and i386.  In addition, the device pager provides a
warning when a device driver creates a fictitious page with memory
attributes that are inconsistent with the real page that the
fictitious page is an alias for. (2) Storing the machine-dependent
memory attributes for amd64 and i386 as a dedicated "int" in "struct
md_page" represents a compromise between space efficiency and the ease
of MFCing these changes to RELENG_7.

In collaboration with: jhb

Approved by:	re (kib)
2009-07-12 23:31:20 +00:00
Konstantin Belousov
7c6d401c75 The control terminal revocation at the session leader exit does not
correctly checks for reclaimed vnode, possibly calling VOP_REVOKE for
such vnode. If the terminal is already revoked, or devfs mount was
forcibly unmounted, the revocation of doomed ctty vnode causes panic.

Reported and tested by:	lstewart
Approved by:	re (kensmith)
MFC after:	2 weeks
2009-07-09 18:54:38 +00:00
Jamie Gritton
9dce97d788 Remove crcopy call from seteuid now that it calls crcopysafe.
Reviewed by:	brooks
Approved by:	re (kib), bz (mentor)
2009-07-08 21:45:48 +00:00
Edward Tomasz Napierala
e2b881bf03 Regenerate after lpathconf(2) addition.
Approved by:	re (kib)
2009-07-08 15:25:27 +00:00
Edward Tomasz Napierala
c38898116a There is an optimization in chmod(1), that makes it not to call chmod(2)
if the new file mode is the same as it was before; however, this
optimization must be disabled for filesystems that support NFSv4 ACLs.
Chmod uses pathconf(2) to determine whether this is the case - however,
pathconf(2) always follows symbolic links, while the 'chmod -h' doesn't.

This change adds lpathconf(3) to make it possible to solve that problem
in a clean way.

Reviewed by:	rwatson (earlier version)
Approved by:	re (kib)
2009-07-08 15:23:18 +00:00
Ed Schouten
6b53d5c0e7 Fix regressions in return events of poll() on TTYs.
As pointed out, POLLHUP should be generated, even if it hasn't been
specified on input. It is also not allowed to return both POLLOUT and
POLLHUP at the same time.

Reported by:	jilles
Approved by:	re (kib)
2009-07-08 10:21:52 +00:00
Mike Silbersack
347e22a68b Increase HZ_VM from 10 to 100. While 10 hz saves cpu time
under VM environments, it's too slow for FreeBSD to work
properly.  For example, ping at 10hz pings about every 600ms
instead of about every second.

Approved by:	re (kib)
2009-07-08 01:09:12 +00:00
Konstantin Belousov
7f5dff5064 Fix poll(2) and select(2) for named pipes to return "ready for read"
when all writers, observed by reader, exited. Use writer generation
counter for fifo, and store the snapshot of the fifo generation in the
f_seqcount field of struct file, that is otherwise unused for fifos.
Set FreeBSD-undocumented POLLINIGNEOF flag only when file f_seqcount is
equal to fifo' fi_wgen, and revert r89376.

Fix POLLINIGNEOF for sockets and pipes, and return POLLHUP for them.
Note that the patch does not fix not returning POLLHUP for fifos.

PR:	kern/94772
Submitted by:	bde (original version)
Reviewed by:	rwatson, jilles
Approved by:	re (kensmith)
MFC after:	6 weeks (might be)
2009-07-07 09:43:44 +00:00
Konstantin Belousov
f1eccd05ec In vn_vget_ino() and their inline equivalents, mnt_ref() the mount point
around the sequence that drop vnode lock and then busies the mount point.
Not having vlocked node or direct reference to the mp allows for the
forced unmount to proceed, making mp unmounted or reused.

Tested by:	pho
Reviewed by:	jeff
Approved by:	re (kensmith)
MFC after:	2 weeks
2009-07-02 18:02:55 +00:00
Jamie Gritton
f0899a3460 Call prison_check from vfs_suser rather than re-implementing it.
Approved by:	re (kib), bz (mentor)
2009-07-02 14:19:33 +00:00
Robert Watson
64c9a4d9ab Audit file descriptor and command arguments to ioctl(2).
Approved by:	re (audit argument blanket)
MFC after:	1 week
2009-07-02 09:16:25 +00:00
Robert Watson
2a5658382a Clean up a number of aspects of token generation from audit arguments to
system calls:

- Centralize generation of argument tokens for VM addresses in a macro,
  ADDR_TOKEN(), and properly encode 64-bit addresses in 64-bit arguments.
- Fix up argument numbers across a large number of syscalls so that they
  match the numeric argument into the system call.
- Don't audit the address argument to ioctl(2) or ptrace(2), but do keep
  generating tokens for mmap(2), minherit(2), since they relate to passing
  object access across execve(2).

Approved by:	re (audit argument blanket)
Obtained from:	TrustedBSD Project
MFC after:	1 week
2009-07-02 09:15:30 +00:00
Robert Watson
03f7b00438 For access(2) and eaccess(2), audit the requested access mode.
Approved by:	re (audit argument blanket)
MFC after:	3 days
2009-07-01 22:47:45 +00:00
Jeff Roberson
2141453eb4 - Use fd_lastfile + 1 as the upper bound on nd. This is more correct than
using the size of the descriptor array.
 - A lock is not needed to fetch fd_lastfile.  The results are stale the
   instant it is dropped.
 - Use a private mutex pool for select since the pool mutex is not used
   as a leaf.
 - Fetch the si_mtx pointer first before resorting to hashing to compute
   the mutex address.

Reviewed by:	McKusick
Approved by:	re (kib)
2009-07-01 20:43:46 +00:00
Robert Watson
15ca46f69d Audit file descriptor numbers for various socket-related system calls.
Approved by:	re (audit argument blanket)
MFC after:	3 days
2009-07-01 19:55:11 +00:00
Robert Watson
9e4c1521d5 Define missing audit argument macro AUDIT_ARG_SOCKET(), and
capture the domain, type, and protocol arguments to socket(2)
and socketpair(2).

Approved by:	re (audit argument blanket)
MFC after:	3 days
2009-07-01 18:54:49 +00:00
John Baldwin
cebc7fb16c Improve the handling of cpuset with interrupts.
- For x86, change the interrupt source method to assign an interrupt source
  to a specific CPU to return an error value instead of void, thus allowing
  it to fail.
- If moving an interrupt to a CPU fails due to a lack of IDT vectors in the
  destination CPU, fail the request with ENOSPC rather than panicing.
- For MSI interrupts on x86 (but not MSI-X), only allow cpuset to be used
  on the first interrupt in a group.  Moving the first interrupt in a group
  moves the entire group.
- Use the icu_lock to protect intr_next_cpu() on x86 instead of the
  intr_table_lock to fix a LOR introduced in the last set of MSI changes.
- Add a new privilege PRIV_SCHED_CPUSET_INTR for using cpuset with
  interrupts.  Previously, binding an interrupt to a CPU only performed a
  privilege check if the interrupt had an interrupt thread.  Interrupts
  without a thread could be bound by non-root users as a result.
- If an interrupt event's assign_cpu method fails, then restore the original
  cpuset mask for the associated interrupt thread.

Approved by:	re (kib)
2009-07-01 17:20:07 +00:00
Robert Watson
6d5a61563a When auditing unmount(2), capture FSID arguments as regular text strings
rather than as paths, which would lead to them being treated as relative
pathnames and hence confusingly converted into absolute pathnames.

Capture flags to unmount(2) via an argument token.

Approved by:	re (audit argument blanket)
MFC after:	3 days
2009-07-01 16:56:56 +00:00
Robert Watson
422d786676 Audit the file descriptor number passed to lseek(2).
Approved by:	re (kib)
MFC after:	3 days
2009-07-01 15:37:23 +00:00
Robert Watson
c5957d6bba Fix link(2) auditing: use the second audit record path for the new object
name.

Approved by:	re (kib)
MFC after:	3 days
2009-07-01 13:22:08 +00:00
Robert Watson
2ef24dde7c udit the 'options' argument to wait4(2).
Approved by:	re (kib)
MFC after:	3 days
2009-07-01 12:36:10 +00:00
Alan Cox
9785747f87 Remove a stale comment. The very same revision (r85511) that introduced
this comment also implemented the proposed change to the code.

Approved by:	re (kib)
2009-06-30 19:39:17 +00:00
Ed Maste
2dafac3976 Add FIONSPACE from NetBSD. FIONSPACE is provided so that programs may
easily determine how much space is left in the send queue; they do not
need to know the send queue size.

NetBSD revisions:
  sys_socket.c r1.41, 1.42
  filio.h r1.9

Obtained from:	NetBSD
Approved by:	re (kensmith)
2009-06-30 13:38:49 +00:00
Konstantin Belousov
a73034ef7f Free struct ucreds allocated in vfs_hang_addrlist() when deleting
the export element.
While there, remove register storage-class specifiers.

Reported and tested by:	pho
Reviewed by:	kan
Approved by:	re (kensmith)
2009-06-29 18:09:07 +00:00
Attilio Rao
5257ff7ee1 Don't assume a default (currently 15) value for preloaded klds when
loading hwpmc, but calculate at runtime and allocate the necessary space.
Also the current logic is wrong as it can lead to an endless loop.

Sponsored by:   Sandvine Incorporated
Reported by:    Ryan Stone <rstone at sandvine dot com>
Tested by:      Giovanni Trematerra
                <giovanni dot trematerra at gmail dot com>
Approved by:	re (kib)
2009-06-29 16:03:18 +00:00
Stanislav Sedov
fe1d3f15f6 - Turn the third (islocked) argument of the knote call into flags parameter.
Introduce the new flag KNF_NOKQLOCK to allow event callers to be called
  without KQ_LOCK mtx held.
- Modify VFS knote calls to always use KNF_NOKQLOCK flag.  This is required
  for ZFS as its getattr implementation may sleep.

Approved by:	re (rwatson)
Reviewed by:	kib
MFC after:	2 weeks
2009-06-28 21:49:43 +00:00
Ed Schouten
f9bb1cf010 Add FIONWRITE support to TTYs.
TTYs already supported TIOCOUTQ, but FIONWRITE seems to be a more
generic name for this.

Approved by:	re (kib)
2009-06-28 12:02:15 +00:00
Poul-Henning Kamp
bb520069ca There are a number of ways an application can check if there are
inbound data waiting on a filedescriptor, such as a pipe or a socket,
for instance by using select(2), poll(2), kqueue(2), ioctl(FIONREAD)
etc.

But we have no way of finding out if written data have yet to be
disposed of, for instance, transmitted (and ack'ed!) to some remote
host, or read by the applicantion at the far end of the pipe.

The closest we get, is calling shutdown(2) on a TCP socket in
non-blocking mode, but this has the undesirable sideeffect of
preventing future communication.

Add a complement to FIONREAD, called FIONWRITE, which returns the
number of bytes not yet properly disposed of.  Implement it for
all sockets.

Background:

A HTTP server will want to time out connections, if no new request
arrives within a certain period after the last transmitted response
has actually been sent (and ack'ed).

For a busy HTTP server, this timeout can be subsecond duration.

In order to signal to a load-balancer that the connection is truly
dead, TCP_RST will be the preferred method, as this avoids the need
for a RTT delay for FIN handshaking, with a client which, surprisingly
often, no longer at the remote IP number.

If a slow, distant client is being served a response which is big
enough to fill the window, but small enough to fit in the socket
buffer, the write(2) call will return immediately.

If the session timeout is armed at that time, all bytes in the
response may not have been transmitted by the time it fires.

FIONWRITE allows the timeout to check that no data is outstanding
on the connection, before it TCP_RST's it.

Input & Idea from: rwatson
Approved by:	re (kib)
2009-06-28 11:28:14 +00:00
Alan Cox
de0c3e0895 Correct a long-standing performance bug in cluster_rbuild(). Specifically,
in the case of a file system with a block size that is less than the page
size, cluster_rbuild() looks at too many of the page's valid bits.
Consequently, it may terminate prematurely, resulting in poor performance.

Reported by:	bde
Reviewed by:	tegge
Approved by:	re (kib)
2009-06-27 21:37:36 +00:00
Robert Watson
14961ba789 Replace AUDIT_ARG() with variable argument macros with a set more more
specific macros for each audit argument type.  This makes it easier to
follow call-graphs, especially for automated analysis tools (such as
fxr).

In MFC, we should leave the existing AUDIT_ARG() macros as they may be
used by third-party kernel modules.

Suggested by:	brooks
Approved by:	re (kib)
Obtained from:	TrustedBSD Project
MFC after:	1 week
2009-06-27 13:58:44 +00:00
Alan Cox
e999111ae7 This change is the next step in implementing the cache control functionality
required by video card drivers.  Specifically, this change introduces
vm_cache_mode_t with an appropriate VM_CACHE_DEFAULT definition on all
architectures.  In addition, this changes adds a vm_cache_mode_t parameter
to kmem_alloc_contig() and vm_phys_alloc_contig().  These will be the
interfaces for allocating mapped kernel memory and physical memory,
respectively, with non-default cache modes.

In collaboration with:	jhb
2009-06-26 04:47:43 +00:00
Konstantin Belousov
28fe6a3f3e In lf_iteratelocks_vnode, increment state->ls_threads around iterating
of the vnode advisory lock list. This prevents deallocation of state
while inside the loop.

Reported and tested by:	pho
MFC after:	2 weeks
2009-06-25 18:54:56 +00:00