Commit Graph

12291 Commits

Author SHA1 Message Date
attilio
bc4d32e80b MFC 2011-05-31 21:22:44 +00:00
attilio
a924571ff7 Fix KTR_CPUMASK in order to accept a string representing a cpuset_t.
This introduce all the underlying support for making this possible (via
the function cpusetobj_strscan() and keeps ktr_cpumask exported.  sparc64
implements its own assembly primitives for tracing events and needs to
properly check it.  Anyway the sparc64 logic is not implemented yet due
to lack of knowledge (by me) and time (by marius), but it is just a
matter of using ktr_cpumask when possible.

Tested and fixed by:	pluknet
Reviewed by:		marius
2011-05-31 20:48:58 +00:00
attilio
066c7ac96c Revert a change that crept in during MFC. 2011-05-31 20:23:33 +00:00
ken
0febb6df5e Fix apparent garbage in the message buffer.
While we have had a fix in place (options PRINTF_BUFR_SIZE=128) to fix
scrambled console output, the message buffer and syslog were still getting
log messages one character at a time.  While all of the characters still
made it into the log (courtesy of atomic operations), they were often
interleaved when there were multiple threads writing to the buffer at the
same time.

This fixes message buffer accesses to use buffering logic as well, so that
strings that are less than PRINTF_BUFR_SIZE will be put into the message
buffer atomically.  So now dmesg output should look the same as console
output.

subr_msgbuf.c:		Convert most message buffer calls to use a new spin
			lock instead of atomic variables in some places.

			Add a new routine, msgbuf_addstr(), that adds a
			NUL-terminated string to a message buffer.  This
			takes a priority argument, which allows us to
			eliminate some races (at least in the the string
			at a time case) that are present in the
			implementation of msglogchar().  (dangling and
			lastpri are static variables, and are subject to
			races when multiple callers are present.)

			msgbuf_addstr() also allows the caller to request
			that carriage returns be stripped out of the
			string.  This matches the behavior of msglogchar(),
			but in testing so far it doesn't appear that any
			newlines are being stripped out.  So the carriage
			return removal functionality may be a candidate
			for removal later on if further analysis shows
			that it isn't necessary.

subr_prf.c:		Add a new msglogstr() routine that calls
			msgbuf_logstr().

			Rename putcons() to putbuf().  This now handles
			buffered output to the message log as well as
			the console.  Also, remove the logic in putcons()
			(now putbuf()) that added a carriage return before
			a newline.  The console path was the only path that
			needed it, and cnputc() (called by cnputs())
			already adds a carriage return.  So this
			duplication resulted in kernel-generated console
			output lines ending in '\r''\r''\n'.

			Refactor putchar() to handle the new buffering
			scheme.

			Add buffering to log().

			Change log_console() to use msglogstr() instead of
			msglogchar().  Don't add extra newlines by default
			in log_console().  Hide that behavior behind a
			tunable/sysctl (kern.log_console_add_linefeed) for
			those who would like the old behavior.  The old
			behavior led to the insertion of extra newlines
			for log output for programs that print out a
			string, and then a trailing newline on a separate
			write.  (This is visible with dmesg -a.)

msgbuf.h:		Add a prototype for msgbuf_addstr().

			Add three new fields to struct msgbuf, msg_needsnl,
			msg_lastpri and msg_lock.  The first two are needed
			for log message functionality previously handled
			by msglogchar().  (Which is still active if
			buffering isn't enabled.)

			Include sys/lock.h and sys/mutex.h for the new
			mutex.

Reviewed by:	gibbs
2011-05-31 17:29:58 +00:00
nwhitehorn
a69e106b2f On multi-core, multi-threaded PPC systems, it is important that the threads
be brought up in the order they are enumerated in the device tree (in
particular, that thread 0 on each core be brought up first). The SLIST
through which we loop to start the CPUs has all of its entries added with
SLIST_INSERT_HEAD(), which means it is in reverse order of enumeration
and so AP startup would always fail in such situations (causing a machine
check or RTAS failure). Fix this by changing the SLIST into an STAILQ,
and inserting new CPUs at the end.

Reviewed by:	jhb
2011-05-31 15:11:43 +00:00
attilio
b1bf71d3c5 MFC 2011-05-31 14:18:10 +00:00
attilio
8dd6262cd3 MFC 2011-05-29 18:33:13 +00:00
trociny
1dfa9ab873 In soreceive_generic(), if MSG_WAITALL is set but the request is
larger than the receive buffer, we have to receive in sections.
When notifying the protocol that some data has been drained the
lock is released for a moment. Returning we block waiting for the
rest of data. There is a race, when data could arrive while the
lock was released and then the connection stalls in sbwait.

Fix this by checking for data before blocking and skip blocking
if there are some.

PR:		kern/154504
Reported by:	Andrey Simonenko <simon@comsys.ntu-kpi.kiev.ua>
Tested by:	Andrey Simonenko <simon@comsys.ntu-kpi.kiev.ua>
Reviewed by:	rwatson
Approved by:	kib (co-mentor)
MFC after:	2 weeks
2011-05-29 18:00:50 +00:00
attilio
55a3bf38a5 MFC 2011-05-29 00:59:38 +00:00
trasz
5499b0b9d5 Remove definitions for RACCT_FSIZE and RACCT_SBSIZE - these two are rather
performance-sensitive and not that useful, so I won't be merging them
before 9.0.
2011-05-27 19:57:58 +00:00
attilio
eefddaeed6 MFC 2011-05-27 16:09:10 +00:00
trasz
6a13eaa4d1 Fix support for RACCT_CORE by merging forgotten file. 2011-05-26 18:54:07 +00:00
attilio
867c6223e7 MFC 2011-05-26 17:38:00 +00:00
jhb
3c1a24d701 Silly spelling typos.
Submitted by:	"b. f."
2011-05-24 19:55:57 +00:00
jhb
7028e129fd Fix an issue with critical sections and SMP rendezvous handlers.
Specifically, a critical_exit() call that drops the nesting level to zero
has a brief window where the pending preemption flag is set and the
nesting level is set to zero.  This is done purposefully to avoid races
where a preemption scheduled by an interrupt could be lost otherwise (see
revision 144777).  However, this does mean that if an interrupt fires
during this window and enters and exits a critical section, it may preempt
from the interrupt context.  This is generally fine as the interrupt code
is careful to arrange critical sections so that they are not exited until
it is safe to preempt (e.g. interrupts EOI'd and masked if necessary).

However, the SMP rendezvous IPI handler does not quite follow this rule,
and in general a rendezvous can never be preempted.  Rendezvous handlers
are also not permitted to schedule threads to execute, so they will not
typically trigger preemptions.  SMP rendezvous handlers may use
spinlocks (carefully) such as the rm_cleanIPI() handler used in rmlocks,
but using a spinlock also enters and exits a critical section.  If the
interrupted top-half code is in the brief window of critical_exit() where
the nesting level is zero but a preemption is pending, then releasing the
spinlock can trigger a preemption.  Because we know that SMP rendezvous
handlers can never schedule a thread, we know that a critical_exit() in
an SMP rendezvous handler will only preempt in this edge case.  We also
know that the top-half thread will happily handle the deferred preemption
once the SMP rendezvous has completed, so the preemption will not be lost.

This makes it safe to employ a workaround where we use a nested critical
section in the SMP rendezvous code itself around rendezvous action
routines to prevent any preemptions during an SMP rendezvous.  The
workaround intentionally avoids checking for a deferred preemption
when leaving the critical section on the assumption that if there is a
pending preemption it will be handled by the interrupted top-half code.

Submitted by:	mlaier (variation specific to rm_cleanIPI())
Obtained from:	Isilon
MFC after:	1 week
2011-05-24 13:36:41 +00:00
jhb
4d0fe668f7 Update comments for DEVICE_PROBE() to reflect that BUS_PROBE_DEFAULT is
now the preferred typical return value from a probe routine.  Discourage
the use of 0 (BUS_PROBE_SPECIFIC) as it should be used very rarely.
Point the reader to the DEVICE_PROBE(9) manpage for more detailed notes
on possible probe return values.

Submitted by:	Philip Soeberg  philip-dev of soeberg net
2011-05-24 13:22:40 +00:00
jhb
d73862793b Simplify a stale assertion. We have not called mi_switch() from a nested
critical section during a preemption for several years.

MFC after:	1 week
2011-05-24 13:17:08 +00:00
attilio
9879530ca1 MFC 2011-05-23 23:58:02 +00:00
attilio
66305282ac Revert a patch that unvolountary sneaked in while I was MFCing. 2011-05-23 23:50:21 +00:00
ru
5a5a985b61 BKVASIZE was bumped to 16k more than a decade ago. 2011-05-23 19:59:01 +00:00
jh
fbe30c6e5c In init_dynamic_kenv(), ignore environment strings exceeding the
KENV_MNAMELEN + 1 + KENV_MVALLEN + 1 length limit to avoid buffer
overflow in getenv(). Currenly loader(8) doesn't limit the length of
environment strings.

PR:		kern/132104
MFC after:	1 month
2011-05-23 16:40:44 +00:00
attilio
6d7371f950 MFC 2011-05-23 01:17:30 +00:00
attilio
a8b367d89d Merge r221912 from largeSMP project branch:
Fix a long-standing bug in cpuset_thread0() where only the first part
of cs_mask is set full.

Submitted by:	anonymous
MFC after:	1 week
2011-05-22 21:35:03 +00:00
attilio
627bd73cdb MFC 2011-05-22 20:41:10 +00:00
attilio
08bcb681d2 Make cpusetobj_strprint() prepare the string in order to print the
least significant cpuset_t word at the outmost right part of the string
(more far from the beginning of it).  This follows the natural build of
bits rappresentation in the words.
2011-05-22 20:29:47 +00:00
rmacklem
fbb8a5e8ec Add a lock flags argument to the VFS_FHTOVP() file system
method, so that callers can indicate the minimum vnode
locking requirement. This will allow some file systems to choose
to return a LK_SHARED locked vnode when LK_SHARED is specified
for the flags argument. This patch only adds the flag. It
does not change any file system to use it and all callers
specify LK_EXCLUSIVE, so file system semantics are not changed.

Reviewed by:	kib
2011-05-22 01:07:54 +00:00
attilio
0372174d48 MFC 2011-05-19 22:55:37 +00:00
kib
8407c0a698 The CDP_ACTIVE flag is cleared at the beginning of destroy_devl(),
and destroy_devl() drops dev_mtx. The protection against the race
with dev_rel(), introduced in r163328, should be extended to cover
destroy_devl() calls for the children of the destroyed dev.

Reported and tested by:	joerg
MFC after:	1 week
2011-05-18 22:36:58 +00:00
attilio
0828d417d4 Fix mismerge.
Reported by:	pluknet
2011-05-18 15:50:12 +00:00
attilio
01e90e3193 Merge r221285 from largeSMP project:
- Remove the following sysctl:
  kern.sched.ipiwakeup.onecpu
  kern.sched.ipiwakeup.htt2

  Because they are absolutely obsolete.  Probabilly the whole wakeup
  forward mechanism should be revisited for a better fitting in modern
  hw, in the future.
- As map2 variable is no longer used rename map3 to map2
- Fix a string by making more informative the msg and removing the
  arguments passing.

Reviewed by:	julian
Tested by:	several
2011-05-17 22:14:00 +00:00
attilio
2cdf500faf MFC 2011-05-17 22:03:01 +00:00
jhb
8d84cd707e Fix a race in the SMP rendezvous code. Specifically, the write by the
last CPU to to finish the rendezvous action may become visible to
different CPUs at different times.  As a result, the CPU that initiated
the rendezvous may exit the rendezvous and drop the lock allowing another
rendezvous to be initiated on the same CPU or a different CPU.  In that
case the exit sentinel may be cleared before all CPUs have noticed causing
those CPUs to hang forever.

Workaround this by using a generation count to notice when this race
occurs and to exit the rendezvous in that case.

The problem was independently diagnosted by mlaier@ and avg@ as well.

Submitted by:	neel
Reviewed by:	avg, mlaier
Obtained from:	NetApp
MFC after:	1 week
2011-05-17 16:39:08 +00:00
phk
c0026c6642 Use memset() instead of bzero() and memcpy() instead of bcopy(), there
is no relevant difference for sbufs, and it increases portability of
the source code.

Split the actual initialization of the sbuf into a separate local
function, so that certain static code checkers can understand
what sbuf_new() does, thus eliminating on silly annoyance of
MISRA compliance testing.

Contributed by:		An anonymous company in the last business I
			expected sbufs to invade.
2011-05-17 11:04:50 +00:00
phk
99b5f98226 Don't expect PAGE_SIZE to exist on all platforms (It is a pretty arbitrary
choice of default size in the first place)

Reverse the order of arguments to the internal static sbuf_put_byte()
function to match everything else in this file.

Move sbuf_putc_func() inside the kernel version of sbuf_vprintf
where it belongs.

sbuf_putc() incorrectly used sbuf_putc_func() which supress NUL
characters, it should use sbuf_put_byte().

Make sbuf_finish() return -1 on error.

Minor stylistic nits fixed.
2011-05-17 06:36:32 +00:00
attilio
fd96a5afd1 Merge r221278 from largeSMP project:
idle_cpus_mask is just used in sched_4bsd, thus make it private for it.

Tested by:	several
2011-05-16 23:20:12 +00:00
attilio
d57a3c7c06 MFC 2011-05-16 16:34:03 +00:00
phk
9ed2621ed9 Change the length quantities of sbufs to be ssize_t rather than int.
Constify a couple of arguments.
2011-05-16 16:18:40 +00:00
avg
576b51ab8f better integrate cyclic module with clocksource/eventtimer subsystem
Now in the case when one-shot timers are used cyclic events should fire
closer to theier scheduled times.  As the cyclic is currently used only
to drive DTrace profile provider, this is the area where the change
makes a difference.

Reviewed by:	mav (earlier version, a while ago)
X-MFC after:	clocksource/eventtimer subsystem
2011-05-16 15:29:59 +00:00
attilio
c5a5c48e70 Fix a longstanding bug where only the first part of the cpumask was
correctly set full.

Submitted by:	anonymous
2011-05-14 19:36:12 +00:00
attilio
9309cc63ed Simplify the code here.
Submitted by:	jhb
2011-05-14 18:22:08 +00:00
attilio
d62a193525 MFC 2011-05-13 15:20:57 +00:00
mdf
5ffb572962 Correctly use INOUT for the offset/len parameters to vop_allocate. As
far as I can tell this is for documentation only at the moment.
2011-05-13 14:29:28 +00:00
mav
1881f29e6e Refactor Xen PV code to use new event timers subsystem. That uses one-shot
Xen timer and time counter to provide one-shot and periodic time events.

On my tests this reduces idle interruts rate down to about 30Hz, and accor-
ding to Xen VM Manager reduces host CPU load by three times comparing to
the previous periodic 100Hz clock. Also now, when needed, it is possible to
increase HZ rate without useless CPU burning during idle periods.

Now only ia64 and some ARMs left not migrated to the new event timers.
2011-05-13 12:39:37 +00:00
mdf
bbbc4c5455 Use a name instead of a magic number for kern_yield(9) when the priority
should not change.  Fetch the td_user_pri under the thread lock.  This
is probably not necessary but a magic number also seems preferable to
knowing the implementation details here.

Requested by:	Jason Behmer < jason DOT behmer AT isilon DOT com >
2011-05-13 05:27:58 +00:00
attilio
99e65551b9 MFC 2011-05-12 14:01:40 +00:00
stas
fc10099a3d - Do no try to drop a NULL filedesc pointer. 2011-05-12 10:56:33 +00:00
stas
5f9f795476 - Commit work from libprocstat project. These patches add support for runtime
file and processes information retrieval from the running kernel via sysctl
  in the form of new library, libprocstat.  The library also supports KVM backend
  for analyzing memory crash dumps.  Both procstat(1) and fstat(1) utilities have
  been modified to take advantage of the library (as the bonus point the fstat(1)
  utility no longer need superuser privileges to operate), and the procstat(1)
  utility is now able to display information from memory dumps as well.

  The newly introduced fuser(1) utility also uses this library and able to operate
  via sysctl and kvm backends.

  The library is by no means complete (e.g. KVM backend is missing vnode name
  resolution routines, and there're no manpages for the library itself) so I
  plan to improve it further.  I'm commiting it so it will get wider exposure
  and review.

  We won't be able to MFC this work as it relies on changes in HEAD, which
  was introduced some time ago, that break kernel ABI.  OTOH we may be able
  to merge the library with KVM backend if we really need it there.

Discussed with:	rwatson
2011-05-12 10:11:39 +00:00
attilio
cae315a375 MFC 2011-05-07 23:34:14 +00:00
jh
264f4e742a To avoid duplicated warning, move WITNESS_WARN() added in r221597 to the
branch which doesn't call malloc(9).

Suggested by:	kib
2011-05-07 17:59:07 +00:00
jh
f428a5f2b3 Add WITNESS_WARN() to getenv() to explicitly note that the function may
sleep. This helps to expose bugs when the requested environment variable
doesn't exist.
2011-05-07 11:10:58 +00:00
attilio
fe4de567b5 Commit the support for removing cpumask_t and replacing it directly with
cpuset_t objects.
That is going to offer the underlying support for a simple bump of
MAXCPU and then support for number of cpus > 32 (as it is today).

Right now, cpumask_t is an int, 32 bits on all our supported architecture.
cpumask_t on the other side is implemented as an array of longs, and
easilly extendible by definition.

The architectures touched by this commit are the following:
- amd64
- i386
- pc98
- arm
- ia64
- XEN

while the others are still missing.
Userland is believed to be fully converted with the changes contained
here.

Some technical notes:
- This commit may be considered an ABI nop for all the architectures
  different from amd64 and ia64 (and sparc64 in the future)
- per-cpu members, which are now converted to cpuset_t, needs to be
  accessed avoiding migration, because the size of cpuset_t should be
  considered unknown
- size of cpuset_t objects is different from kernel and userland (this is
  primirally done in order to leave some more space in userland to cope
  with KBI extensions). If you need to access kernel cpuset_t from the
  userland please refer to example in this patch on how to do that
  correctly (kgdb may be a good source, for example).
- Support for other architectures is going to be added soon
- Only MAXCPU for amd64 is bumped now

The patch has been tested by sbruno and Nicholas Esborn on opteron
4 x 12 pack CPUs. More testing on big SMP is expected to came soon.
pluknet tested the patch with his 8-ways on both amd64 and i386.

Tested by:	pluknet, sbruno, gianni, Nicholas Esborn
Reviewed by:	jeff, jhb, sbruno
2011-05-05 14:39:14 +00:00
attilio
0987be4d6b MFC 2011-05-04 15:45:23 +00:00
attilio
b29cc3952a MFC 2011-05-03 18:57:46 +00:00
ae
c39b1f9995 Add make_dev_alias_p() function. It is similar to make_dev_alias(),
but it may return an error like make_dev_p() does.

Reviewed by:	kib (previous version)
MFC after:	2 weeks
2011-05-03 18:54:18 +00:00
trasz
752ffacc69 Change the way rctl interfaces with jails by introducing prison_racct
structure, which acts as a proxy between them.  This makes jail rules
persistent, i.e. they can be added before jail gets created, and they
don't disappear when the jail gets destroyed.
2011-05-03 07:32:58 +00:00
attilio
7ac8b4739c - Remove the following sysctl:
kern.sched.ipiwakeup.onecpu
  kern.sched.ipiwakeup.htt2

  Because they are absolutely obsolete. Probabilly the whole wakeup
  forward mechanism should be revisited for a better fitting in modern
  hw.
- As map2 variable is no longer used rename map3 to map2
- Fix a string by making more informative the msg and removing the
  arguments passing

Approved by:	julian
2011-04-30 23:28:07 +00:00
attilio
d0e06d02bc idle_cpus_mask is just used in the SMP case and within sched_4BSD.
Declare appropriately.
2011-04-30 22:30:18 +00:00
jhb
deafe4e593 Add a new bus method, BUS_ADJUST_RESOURCE() that is intended to be a
wrapper around rman_adjust_resource().  Include a generic implementation,
bus_generic_adjust_resource() which passes the request up to the parent
bus.  There is currently no default implementation.  A
bus_adjust_resource() wrapper is provided for use in drivers.
2011-04-29 21:36:45 +00:00
jhb
22b381b721 Extend the rman(9) API to support altering an existing resource.
Specifically, these changes allow a resource to back a relocatable and
resizable resource such as the I/O window decoders in PCI-PCI bridges.
- rman_adjust_resource() can adjust the start and end address of an
  existing resource.  It only succeeds if the newly requested address
  space is already free.  It also supports shrinking a resource in
  which case the freed space will be marked unallocated in the rman.
- rman_first_free_region() and rman_last_free_region() return the
  start and end addresses for the first or last unallocated region in
  an rman, respectively.  This can be used to determine by how much
  the resource backing an rman must be adjusted to accomodate an
  allocation request that does not fit into the existing rman.

While here, document the rm_start and rm_end fields in struct rman,
rman_is_region_manager(), the bound argument to
rman_reserve_resource_bound(), and rman_init_from_resource().
2011-04-29 20:05:19 +00:00
jhb
08955ceac0 Change rman_manage_region() to actually honor the rm_start and rm_end
constraints on the rman and reject attempts to manage a region that is out
of range.
- Fix various places that set rm_end incorrectly (to ~0 or ~0u instead of
  ~0ul).
- To preserve existing behavior, change rman_init() to set rm_start and
  rm_end to allow managing the full range (0 to ~0ul) if they are not set by
  the caller when rman_init() is called.
2011-04-29 18:41:21 +00:00
attilio
d685681d59 Add the watchdogs patting during the (shutdown time) disk syncing and
disk dumping.
With the option SW_WATCHDOG on, these operations are doomed to let
watchdog fire, fi they take too long.

I implemented the stubs this way because I really want wdog_kern_*
KPI to not be dependant by SW_WATCHDOG being on (and really, the option
only enables watchdog activation in hardclock) and also avoid to
call them when not necessary (avoiding not-volountary watchdog
activations).

Sponsored by:	Sandvine Incorporated
Discussed with:	emaste, des
MFC after:	2 weeks
2011-04-28 16:02:05 +00:00
rstone
8a5b424a2c If the 4BSD scheduler tries to schedule a thread that has been pinned or
bound to an AP before SMP has started, the system will panic when we try
to touch per-CPU state for that AP because that state has not been
initialized yet.  Fix this in the same way as ULE: place all threads in
the global run queue before SMP has started.

Reviewed by:	jhb
MFC after:	1 month
2011-04-26 20:34:30 +00:00
kib
cff99f5d10 Implement the delayed task execution extension to the taskqueue
mechanism. The caller may specify a timeout in ticks after which the
task will be scheduled.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	jeff, jhb
MFC after:	1 month
2011-04-26 11:39:56 +00:00
jeff
14f1aafe26 - Catch up to falloc() changes.
- PHOLD() before using a task structure on the stack.
 - Fix a LOR between the sleepq lock and thread lock in _intr_drain().
2011-04-26 07:30:52 +00:00
rmacklem
872195caf9 Fix a LOR in vfs_busy() where, after msleeping, it would lock
the mutexes in the wrong order for the case where the
MBF_MNTLSTLOCK is set. I believe this did have the
potential for deadlock. For example, if multiple nfsd threads
called vfs_busyfs(), which calls vfs_busy() with MBF_MNTLSTLOCK.
Thanks go to pho for catching this during his testing.

Tested by:	pho
Submitted by:	kib
MFC after:	2 weeks
2011-04-23 11:22:48 +00:00
jh
91fe746b18 Utilize vfs_sanitizeopts() in vfs_mergeopts() to merge options. Because
vfs_sanitizeopts() can handle "ro" and "rw" options properly, there is
no more need to add "noro" in vfs_donmount() to cancel "ro".

This also fixes a problem of canceling options beginning with "no".
For example, "noatime" didn't cancel "nonoatime". Thus it was possible
that both "noatime" and "nonoatime" were active at the same time.

Reviewed by:	bde
2011-04-22 07:26:09 +00:00
mdf
597ae9f19b Allow VOP_ALLOCATE to be iterative, and have kern_posix_fallocate(9)
drive looping and potentially yielding.

Requested by:	kib
2011-04-19 16:36:24 +00:00
mdf
45c5f27863 Fix a copy/paste whitespace error. 2011-04-18 16:40:47 +00:00
mdf
b0f8474766 Regen. 2011-04-18 16:32:47 +00:00
mdf
9c9a32d97b Add the posix_fallocate(2) syscall. The default implementation in
vop_stdallocate() is filesystem agnostic and will run as slow as a
read/write loop in userspace; however, it serves to correctly
implement the functionality for filesystems that do not implement a
VOP_ALLOCATE.

Note that __FreeBSD_version was already bumped today to 900036 for any
ports which would like to use this function.

Also reserve space in the syscall table for posix_fadvise(2).

Reviewed by:	-arch (previous version)
2011-04-18 16:32:22 +00:00
jilles
a860eeb5be ktrace: Log the code for all signals (PSIG events).
The code provides information on how the signal was generated.

Formerly, the code was only logged for traps, much like only signal handlers
for traps received a meaningful si_code before FreeBSD 7.0.

In rare cases, no information is available and 0 is still logged.

MFC after:	1 week
2011-04-17 14:38:11 +00:00
dchagin
95da67cdb8 Remove malloc(9) return value checks when M_WAITOK is used.
MFC after:	2 Week
2011-04-16 16:20:51 +00:00
glebius
3a11690b71 Revert r194662, since it breaks ng_ksocket(4) and may break
other socket consumers with alternate sb_upcall.

PR:		kern/154676
Submitted by:	Arnaud Lacombe <lacombar gmail.com>
MFC after:	7 days
2011-04-14 14:54:22 +00:00
pluknet
b1187e01aa Remove stale M_ZOMBIE malloc type.
This type is unused since embedding p_ru into struct proc.

MFC after:	1 week
2011-04-14 14:25:47 +00:00
gavin
0b7551a5c9 Add a new DDB command, "show rmans", which will show the address and brief
details of each rman header, but not the contents of all rman structures
in the system.  This is especially useful on platforms where some rmans
have many thousands of entries in rmans, making scrolling through the
output of "show all rman" impractical.  Individual rmans can then be viewed
including their contents with "show rman 0xaddr" as usual.

Reviewed by:	jhb
2011-04-13 19:10:56 +00:00
pluknet
e48732e3d4 Staticize malloc types.
Approved by:	lstewart
MFC after:	1 week
2011-04-13 11:28:46 +00:00
lstewart
545f7c0ca7 Use the full and proper company name for Swinburne University of Technology
throughout the source tree.

Requested by:	Grenville Armitage, Director of CAIA at Swinburne University of
			Technology
MFC after:	3 days
2011-04-12 08:13:18 +00:00
trasz
e80aa2a5fe Rename a misnamed structure field (hr_loginclass), and reorder priv(9)
constants to match the order and naming of syscalls.  No functional changes.
2011-04-10 18:35:43 +00:00
kib
1b5526fe98 Some callers of proc_reparent() already have the parent process locked.
Detect the situation and avoid process lock recursion.

Reported by:	Fabian Keil <freebsd-listen fabiankeil de>
2011-04-10 17:07:02 +00:00
attilio
bacffe590f Reintroduce the fix already discussed in r216805 (please check its history
for a detailed explanation of the problems).

The only difference with the previous fix is in Solution2:
CPUBLOCK is no longer set when exiting from callout_reset_*() functions,
which avoid the deadlock (leading to r217161).
There is no need to CPUBLOCK there because the running-and-migrating
assumption is strong enough to avoid problems there.
Furthermore add a better !SMP compliancy (leading to shrinked code and
structures) and facility macros/functions.

Tested by:	gianni, pho, dim
MFC after:	3 weeks
2011-04-08 18:48:57 +00:00
trasz
a0192d37e6 Add RACCT_NOFILE accounting.
Sponsored by:	The FreeBSD Foundation
Reviewed by:	kib (earlier version)
2011-04-06 19:13:04 +00:00
trasz
97c31dedde Style fix.
Submitted by:	jhb@
2011-04-06 19:08:50 +00:00
trasz
d3c78eed8e Add accounting for SysV-related resources.
Sponsored by:	The FreeBSD Foundation
Reviewed by:	kib (earlier version)
2011-04-06 18:11:24 +00:00
jhb
96b1d8b6d7 Fix several places to ignore processes that are not yet fully constructed.
MFC after:	1 week
2011-04-06 17:47:22 +00:00
trasz
d6f4192036 Add ucred pointer to the SysV-related memory structures. This is required
for racct.

Note that after this commit, ipcs(1) needs to be rebuilt.  Otherwise, it will
fail with "ipcs: sysctlbyname: kern.ipc.msqids: Cannot allocate memory".

Sponsored by:	The FreeBSD Foundation
Reviewed by:	kib (earlier version)
2011-04-06 16:59:54 +00:00
trasz
92bec9b84c Add accounting for most of the memory-related resources.
Sponsored by:	The FreeBSD Foundation
Reviewed by:	kib (earlier version)
2011-04-05 20:23:59 +00:00
trasz
fffd1b22a5 Add missing stubs. 2011-04-05 19:50:34 +00:00
pluknet
6d33997006 Remove malloc type M_NETADDR unused since splitting into vfs_subr.c
and vfs_export.c.

MFC after:	1 week
2011-04-04 16:23:01 +00:00
trasz
400f21cacb Add accounting for RACCT_NPTS. 2011-04-02 15:02:42 +00:00
kib
eb730d92e4 After the r219999 is merged to stable/8, rename fallocf(9) to falloc(9)
and remove the falloc() version that lacks flag argument. This is done
to reduce the KPI bloat.

Requested by:	jhb
X-MFC-note:	do not
2011-04-01 13:28:34 +00:00
kib
7c2eaa21fe Add support for executing the FreeBSD 1/i386 a.out binaries on amd64.
In particular:
- implement compat shims for old stat(2) variants and ogetdirentries(2);
- implement delivery of signals with ancient stack frame layout and
  corresponding sigreturn(2);
- implement old getpagesize(2);
- provide a user-mode trampoline and LDT call gate for lcall $7,$0;
- port a.out image activator and connect it to the build as a module
  on amd64.

The changes are hidden under COMPAT_43.

MFC after:   1 month
2011-04-01 11:16:29 +00:00
trasz
4c83b1bba4 Enable accounting for RACCT_NPROC and RACCT_NTHR.
Sponsored by:	The FreeBSD Foundation
Reviewed by:	kib (earlier version)
2011-03-31 19:22:11 +00:00
trasz
596c078ed8 Notify racct when process credentials change.
Sponsored by:	The FreeBSD Foundation
Reviewed by:	kib (earlier version)
2011-03-31 18:12:04 +00:00
fabient
e0588db8d2 Clearing the flag when preempting will let the preempted thread run
too much time. This can finish in a scheduler deadlock with ping-pong
between two threads.

One sample of this is:
- device lapic (to have a preemption point on critical_exit())
- options DEVICE_POLLING with HZ>1499 (to have lapic freq = hardclock freq)
- running a cpu intensive task (that does not enter the kernel)
- only one CPU on SMP or no SMP.

As requested by jhb@ 4BSD have received the same type of fix instead of
propagating the flag to the new thread.

Reviewed by:	jhb, jeff
MFC after:	1 month
2011-03-31 13:59:47 +00:00
trasz
3adbd8337d Regenerate. 2011-03-30 17:59:54 +00:00
trasz
2f99052d80 Add rctl. It's used by racct to take user-configurable actions based
on the set of rules it maintains and the current resource usage.  It also
privides userland API to manage that ruleset.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	kib (earlier version)
2011-03-30 17:48:15 +00:00
kib
5c02dd1e9a Provide compat32 shims for kldstat(2).
Requested and tested by:	jpaetzel
MFC after:	1 week
2011-03-30 14:46:12 +00:00
trasz
8d3dbe6760 Remove pointless (always true) KASSERTs.
Submitted by:	pjd
2011-03-29 19:19:10 +00:00
trasz
b8d3e8755d Add racct. It's an API to keep per-process, per-jail, per-loginclass
and per-loginclass resource accounting information, to be used by the new
resource limits code.  It's connected to the build, but the code that
actually calls the new functions will come later.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	kib (earlier version)
2011-03-29 17:47:25 +00:00
kib
7b2d837e2c Fix the check for vm_map_remove() error.
Pointed out by:	alc
MFC after:	2 weeks
2011-03-28 19:44:54 +00:00
kib
0529b2424a Trim white spaces, adjust style.
MFC after:	2 weeks
2011-03-28 13:28:23 +00:00
kib
52182b9c4a Handle zero length in copyout_unmap().
Submitted by:	John Wehle <john feith com>
MFC after:	2 weeks
2011-03-28 13:21:26 +00:00
kib
7b58e8da9b Promote ksyms_map() and ksyms_unmap() to general facility
copyout_map() and copyout_unmap() interfaces.

Submitted by:	John Wehle <john feith com>, nox
MFC after:	2 weeks
2011-03-28 12:48:33 +00:00
jh
b6e26fa641 Fix some style issues in r219925.
Reported by:	bde
MFC after:	1 month
2011-03-26 17:17:24 +00:00
kib
fc2bd01611 Add O_CLOEXEC flag to open(2) and fhopen(2).
The new function fallocf(9), that is renamed falloc(9) with added
flag argument, is provided to facilitate the merge to stable branch.

Reviewed by:	jhb
MFC after:	1 week
2011-03-25 14:00:36 +00:00
jhb
c7ac62aecd Fix some locking nits with the p_state field of struct proc:
- Hold the proc lock while changing the state from PRS_NEW to PRS_NORMAL
  in fork to honor the locking requirements.  While here, expand the scope
  of the PROC_LOCK() on the new process (p2) to avoid some LORs.  Previously
  the code was locking the new child process (p2) after it had locked the
  parent process (p1).  However, when locking two processes, the safe order
  is to lock the child first, then the parent.
- Fix various places that were checking p_state against PRS_NEW without
  having the process locked to use PROC_LOCK().  Every place was already
  locking the process, just after the PRS_NEW check.
- Remove or reduce the use of PROC_SLOCK() for places that were checking
  p_state against PRS_NEW.  The PROC_LOCK() alone is sufficient for reading
  the current state.
- Reorder fill_kinfo_proc() slightly so it only acquires PROC_SLOCK() once.

MFC after:	1 week
2011-03-24 18:40:11 +00:00
jh
fc3bd08464 Recognize "ro", "rdonly", "norw", "rw" and "noro" as equal options in
vfs_equalopts(). This allows vfs_sanitizeopts() to filter redundant
occurrences of these options. It was possible that for example both "ro"
and "rw" options became active concurrently.

PR:		kern/133614
Discussed on:	freebsd-hackers
MFC after:	1 month
2011-03-23 17:56:38 +00:00
alc
c84b8f6e0c Modestly increase the maximum allowed size of the kmem map on i386.
Also, express this new maximum as a fraction of the kernel's address
space size rather than a constant so that increasing KVA_PAGES will
automatically increase this maximum.  As a side-effect of this change,
kern.maxvnodes will automatically increase by a proportional amount.

While I'm here ensure that this change doesn't result in an unintended
increase in maxpipekva on i386.  Calculate maxpipekva based upon the
size of the kernel address space and the amount of physical memory
instead of the size of the kmem map.  The memory backing pipes is not
allocated from the kmem map.  It is allocated from its own submap of
the kernel map.  In short, it has no real connection to the kmem map.
(In fact, the commit messages for the maxpipekva auto-sizing talk
about using the kernel map size, cf. r117325 and r117391, even though
the implementation actually used the kmem map size.)  Although the
calculation is now done differently, the resulting value for
maxpipekva should remain almost the same on i386.  However, on amd64,
the value will be reduced by 2/3.  This is intentional.  The recent
change to VM_KMEM_SIZE_SCALE on amd64 for the benefit of ZFS also had
the unnecessary side-effect of increasing maxpipekva.  This change is
effectively restoring maxpipekva on amd64 to its prior value.

Eliminate init_param3() since it is no longer used.
2011-03-23 16:38:29 +00:00
jhb
df83f2ffe5 Small style fix. 2011-03-23 13:44:32 +00:00
trasz
359437c29d Make UFS use PSARC/2010/029 NFSv4 ACL semantics by default, bringing
it in line with ZFSv28.

X-MFC-After:	ZFSv28.
2011-03-22 19:52:29 +00:00
trasz
eb401e64c1 Move the code around so that libc behaviour does not depend on a variable
that was supposed to be kernel-only.  There should be no functional changes.
2011-03-22 17:44:07 +00:00
jeff
2d7d8c05e7 - Merge changes to the base system to support OFED. These include
a wider arg2 for sysctl, updates to vlan code, IFT_INFINIBAND,
   and other miscellaneous small features.
2011-03-21 09:40:01 +00:00
alc
e498ba0188 Update a comment. The sending process has not mapped the buffer pages
since before r127501.  Strictly speaking, the buffer pages are not
"wired".  They remain in the paging queues.  However, they are pinned in
memory using vm_page_hold().
2011-03-20 15:04:43 +00:00
ivoras
dc9c77949c The hardware has caught up; improvements are now observed even at 128,
but stay conservative and bump read_max to "only" 64 (it will probably be
a good idea to increase this to 128 after the next major release).
2011-03-16 16:22:59 +00:00
avg
666906fcd7 add DTrace systrace support for linux32 and freebsd32 on amd64 syscalls
This commits makes necessary changes in syscall/sysent generation
infrastructure.

PR:		kern/152822
Submitted by:	Artem Belevich <fbsdlist@src.cx>
Reviewed by:	jhb (ealier version)
MFC after:	3 weeks
2011-03-12 08:51:43 +00:00
dchagin
69b8756d3d Extend struct sysvec with new method sv_schedtail, which is used for an
explicit process at fork trampoline path instead of eventhadler(schedtail)
invocation for each child process.

Remove eventhandler(schedtail) code and change linux ABI to use newly added
sysvec method.

While here replace explicit comparing of module sysentvec structure with the
newly created process sysentvec to detect the linux ABI.

Discussed with:	kib

MFC after:	2 Week
2011-03-08 19:01:45 +00:00
jhb
269e1daa8f When constructing a new cpuset, apply the parent cpuset's mask to the new
set's mask rather than the root mask.  This was causing the root mask to
be modified incorrectly.

Reviewed by:	jeff
MFC after:	1 week
2011-03-08 14:18:21 +00:00
kib
4d0733e0f8 Do not assert buffer lock in VFS_STRATEGY() when kernel already paniced.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2011-03-08 11:50:59 +00:00
kib
f3c6442b93 The execution of the shebang script requires putting interpreter path,
possible option and script path in the place of argv[0] supplied to
execve(2).  It is possible and valid for the substitution to be shorter
then the argv[0].

Avoid signed underflow in this case.

Submitted by:	Devon H. O'Dell <devon.odell gmail com>
PR:	kern/155321
MFC after:	1 week
2011-03-06 22:59:30 +00:00
trasz
0d90d0ceb0 Temporarily revert r219272; it breaks acl_is_trivial_np(3). 2011-03-06 20:12:09 +00:00
dchagin
de259f37cb Style(9) fix.
Fix indentation in comment, double ';' in variable declaration.

MFC after:	1 Week
2011-03-05 20:54:17 +00:00
dchagin
99b120c8f2 Partially reworked r219042.
The reason for this is a bug at ktrops() where process dereferenced
without having a lock. This might cause a panic if ktrace was runned
with -p flag and the specified process exited between the dropping
a lock and writing sv_flags.

Since it is impossible to acquire sx lock while holding mtx switch
to use asynchronous enqueuerequest() instead of writerequest().

Rename ktr_getrequest_ne() to more understandable name [1].

Requested by:	jhb [1]

MFC after:	1 Week
2011-03-05 20:36:42 +00:00
trasz
1618438630 Export login class information via kinfo and make it possible to view
it using "ps -o class".
2011-03-05 14:41:49 +00:00
trasz
0525662d59 Regenerate. 2011-03-05 12:46:24 +00:00
trasz
62f6a13e39 Add two new system calls, setloginclass(2) and getloginclass(2). This makes
it possible for the kernel to track login class the process is assigned to,
which is required for RCTL.  This change also make setusercontext(3) call
setloginclass(2) and makes it possible to retrieve current login class using
id(1).

Reviewed by:	kib (as part of a larger patch)
2011-03-05 12:40:35 +00:00
trasz
e78a2a669a Make UFS use PSARC/2010/029 NFSv4 ACL semantics by default, just like
ZFSv28 does.

MFC after:	2 months
2011-03-04 19:53:07 +00:00
netchild
b66d64d436 - Add a FEATURE for capsicum (security_capabilities).
- Rename mac FEATURE to security_mac.

Discussed with:	rwatson
2011-03-04 09:03:54 +00:00
trasz
758a56db95 Make "struct pts_softc" point to ucred instead of uidinfo. This is no-op,
required for resource containers.

Reviewed by:	kib (as part of a larger patch), ed
2011-03-03 17:33:22 +00:00
jhb
d64a5d112c Similar to 189574, properly handle subclasses of bus drivers when deleting
a driver during kldunload.  Specifically, recursively walk the tree of
subclasses of a given driver attachment's bus device class detaching all
instances of that driver for each class and its subclasses.

Reported by:	bschmidt
Reviewed by:	imp
MFC after:	1 week
2011-03-01 14:43:37 +00:00
rwatson
f1981d366a Continue introducing Capsicum capability mode support:
If a system call wasn't listed in capabilities.conf, return ECAPMODE at
syscall entry.

Reviewed by:	anderson
Discussed with:	benl, kris, pjd
Sponsored by:	Google, Inc.
Obtained from:	Capsicum Project
MFC after:	3 months
2011-03-01 13:32:07 +00:00
rwatson
9c02915234 Regenerate system call files following addition of cap_enter(2),
cap_getmode(2), and capabilities.conf.

Reviewed by:	anderson
Discussed with:	benl, kris, pjd
Obtained from:	Capsicum Project
Sponsored by:	Google, Inc.
MFC after:	3 months
2011-03-01 13:30:23 +00:00
rwatson
fa27828ce8 Continue to introduce Capsicum Capability Mode support:
Add a new system call flag, SYF_CAPENABLED, which indicates that a
particular system call is available in capability mode.

Add a new configuration file, kern/capabilities.conf (similar files
may be introduced for other ABIs in the future), which enumerates
system calls that are available in capability mode.  When a new
system call is added to syscalls.master, it will also need to be
added here (if needed).  Teach sysent parts to use this file to set
values for SYF_CAPENABLED for the native ABI.

Reviewed by:	anderson
Discussed with:	benl, kris, pjd
Obtained from:	Capsicum Project
MFC after:	3 months
2011-03-01 13:28:27 +00:00
rwatson
6894aabcb5 Add initial support for Capsicum's Capability Mode to the FreeBSD kernel,
compiled conditionally on options CAPABILITIES:

Add a new credential flag, CRED_FLAG_CAPMODE, which indicates that a
subject (typically a process) is in capability mode.

Add two new system calls, cap_enter(2) and cap_getmode(2), which allow
setting and querying (but never clearing) the flag.

Export the capability mode flag via process information sysctls.

Sponsored by:	Google, Inc.
Reviewed by:	anderson
Discussed with:	benl, kris, pjd
Obtained from:	Capsicum Project
MFC after:	3 months
2011-03-01 13:23:37 +00:00
dchagin
44c082b543 Introduce preliminary support of the show description of the ABI of
traced process by adding two new events which records value of process
sv_flags to the trace file at process creation/execing/exiting time.

MFC after:	1 Month.
2011-02-25 22:05:33 +00:00
dchagin
049c9dd26f ktrace_resize_pool() locking slightly reworked:
1) do not take a lock around the single atomic operation.
2) do not lose the invariant of lock by dropping/acquiring
   ktrace_mtx around free() or malloc().

MFC after:	1 Month.
2011-02-25 22:03:28 +00:00
netchild
b8896acc71 Make the description of the feature consistent with another similar
description for another feature.

Noticed by:	trasz
2011-02-25 12:46:43 +00:00
netchild
cc4128c6b1 Add some FEATURE macros for various features (AUDIT/CAM/IPC/KTR/MAC/NFS/NTP/
PMC/SYSV/...).

No FreeBSD version bump, the userland application to query the features will
be committed last and can serve as an indication of the availablility if
needed.

Sponsored by:   Google Summer of Code 2010
Submitted by:   kibab
Reviewed by:    arch@ (parts by rwatson, trasz, jhb)
X-MFC after:    to be determined in last commit with code from this project
2011-02-25 10:11:01 +00:00
pluknet
b7b945fd78 Clean up the now unused #include statement.
Approved by:	kib (mentor)
MFC after:	1 week
X-MFC with:	r218972
2011-02-23 18:22:40 +00:00
kib
b083c8ec96 Move the max_threads_per_proc and max_threads_hits variables to the
file where they are used. Declare the kern.threads sysctl node at the
same location. Since no external use for the variables exists, make them
static.

Discussed with:	dchagin
MFC after:	1 week
2011-02-23 13:50:24 +00:00
jhb
5d87a422b4 Revert previous change, the existing check was correct.
Pointy hat to:	jhb
2011-02-23 13:25:42 +00:00
jhb
3eb951ea57 Expose the umtx_key structure and API to the rest of the kernel.
MFC after:	3 days
2011-02-23 13:19:14 +00:00
jhb
902a44bea0 Fix off-by-one error in check against max_threads_per_proc.
Submitted by:	arundel
MFC after:	1 week
2011-02-23 12:56:25 +00:00
brucec
6d9b42b486 Fix typos - remove duplicate "the".
PR:	bin/154928
Submitted by:	Eitan Adler <lists at eitanadler.com>
MFC after: 	3 days
2011-02-21 09:01:34 +00:00
jh
d4c0c6675e Don't restore old mount options and flags if VFS_MOUNT(9) succeeds but
vfs_export() fails. Restoring old options and flags after successful
VFS_MOUNT(9) call may cause the file system internal state to become
inconsistent with mount options and flags. Specifically the FFS super
block fs_ronly field and the MNT_RDONLY flag may get out of sync.

PR:		kern/133614
Discussed on:	freebsd-hackers
2011-02-19 14:27:14 +00:00
mdf
baf9dec697 Modify kdb_trap() so that it re-calls the dbbe_trap function as long as
the debugger back-end has changed.  This means that switching from ddb
to gdb no longer requires a "step" which can be dangerous on an
already-crashed kernel.

Also add a capability to get from the gdb back-end back to ddb, by
typing ^C in the console window.

While here, simplify kdb_sysctl_available() by using
sbuf_new_for_sysctl(), and use strlcpy() instead of strncpy() since the
strlcpy semantic is desired.

MFC after:	1 month
2011-02-18 22:25:11 +00:00
bz
b9b7d3e93a Mfp4 CH=177274,177280,177284-177285,177297,177324-177325
VNET socket push back:
  try to minimize the number of places where we have to switch vnets
  and narrow down the time we stay switched.  Add assertions to the
  socket code to catch possibly unset vnets as seen in r204147.

  While this reduces the number of vnet recursion in some places like
  NFS, POSIX local sockets and some netgraph, .. recursions are
  impossible to fix.

  The current expectations are documented at the beginning of
  uipc_socket.c along with the other information there.

  Sponsored by: The FreeBSD Foundation
  Sponsored by: CK Software GmbH
  Reviewed by:  jhb
  Tested by:    zec

Tested by:	Mikolaj Golub (to.my.trociny gmail.com)
MFC after:	2 weeks
2011-02-16 21:29:13 +00:00
bz
51beb8d124 Mfp4 CH=177256:
Catch a set vnet upon return to user space. This usually
  means return paths with CURVNET_RESTORE() missing.

  If VNET_DEBUG is turned on we can even tell the function
  that did the CURVNET_SET() which is really helpful; else
  we print "N/A".

  Sponsored by: The FreeBSD Foundation
  Sponsored by: CK Software GmbH
  Reviewed by:  jhb

MFC after:	11 days
2011-02-14 20:49:37 +00:00
deischen
cc0a83ec21 Allow the SO_SETFIB socket option to select the default (0)
routing table.

Reviewed by:	julian
2011-02-13 00:14:13 +00:00
alc
060dcf42aa Retire VFS_BIO_DEBUG. Convert those checks that were still valid into
KASSERT()s and eliminate the rest.

Replace excessive printf()s and a panic() in bufdone_finish() with a
KASSERT() in vm_page_io_finish().

Reviewed by:	kib
2011-02-12 01:00:00 +00:00
jmallett
96da23f472 With smp_topo_none, set cg_mask to all_cpus rather than setting the mp_ncpus
low bits.

Submitted by:	Bhanu Prakash
Reviewed by:	jeffr
2011-02-11 22:43:10 +00:00