Commit Graph

12836 Commits

Author SHA1 Message Date
Garrett Wollman
48b5c7410f Fix spelling of the function name in two assertion messages. 2012-10-02 18:38:05 +00:00
Eitan Adler
8dbce2a343 Provide a generic way to disable devices at boot time
PR:		kern/119202
Requested by:	peterj
Reviewed by:	sbruno, jhb
Approved by:	cperciva
MFC after:	1 week
2012-10-02 03:33:41 +00:00
Pawel Jakub Dawidek
55711729f3 - Enforce CAP_MKFIFO on mkfifoat(2), not on mknodat(2). Without this change
mkfifoat(2) was not restricted.
- Introduce CAP_MKNOD and enforce it on mknodat(2).

Sponsored by:	FreeBSD Foundation
MFC after:	2 weeks
2012-10-01 05:43:24 +00:00
Konstantin Belousov
877d24ac8a Fix the mis-handling of the VV_TEXT on the nullfs vnodes.
If you have a binary on a filesystem which is also mounted over by
nullfs, you could execute the binary from the lower filesystem, or
from the nullfs mount. When executed from lower filesystem, the lower
vnode gets VV_TEXT flag set, and the file cannot be modified while the
binary is active. But, if executed as the nullfs alias, only the
nullfs vnode gets VV_TEXT set, and you still can open the lower vnode
for write.

Add a set of VOPs for the VV_TEXT query, set and clear operations,
which are correctly bypassed to lower vnode.

Tested by:	pho (previous version)
MFC after:	2 weeks
2012-09-28 11:25:02 +00:00
Matthew D Fleming
fc8fdae0df Fix up kernel sources to be ready for a 64-bit ino_t.
Original code by:	Gleb Kurtsou
2012-09-27 23:30:49 +00:00
Pawel Jakub Dawidek
c8e781f6e0 Revert r240931, as the previous comment was actually in sync with POSIX.
I have to note that POSIX is simply stupid in how it describes O_EXEC/fexecve
and friends. Yes, not only inconsistent, but stupid.

In the open(2) description, O_RDONLY flag is described as:

	O_RDONLY	Open for reading only.

Taken from:

	http://pubs.opengroup.org/onlinepubs/9699919799/functions/open.html

Note "for reading only". Not "for reading or executing"!

In the fexecve(2) description you can find:

	The fexecve() function shall fail if:

	[EBADF]
		The fd argument is not a valid file descriptor open for executing.

Taken from:

	http://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html

As you can see the function shall fail if the file was not open with O_EXEC!

And yet, if you look closer you can find this mess in the exec.html:

	Since execute permission is checked by fexecve(), the file description
	fd need not have been opened with the O_EXEC flag.

Yes, O_EXEC flag doesn't have to be specified after all. You can open a file
with O_RDONLY and you still be able to fexecve(2) it.
2012-09-27 16:43:23 +00:00
Mikolaj Golub
47813f5d94 Kernel and modules have "set_vnet" linker set, where virtualized
global variables are placed. When a module is loaded by link_elf
linker its variables from "set_vnet" linker set are copied to the
kernel "set_vnet" ("modspace") and all references to these variables
inside the module are relocated accordingly.

The issue is when a module is loaded that has references to global
variables from another, previously loaded module: these references are
not relocated so an invalid address is used when the module tries to
access the variable. The example is V_layer3_chain, defined in ipfw
module and accessed from ipfw_nat.

The same issue is with DPCPU variables, which use "set_pcpu" linker
set.

Fix this making the link_elf linker on a module load recognize
"external" DPCPU/VNET variables defined in the previously loaded
modules and relocate them accordingly. For this set_pcpu_list and
set_vnet_list are used, where the addresses of modules' "set_pcpu" and
"set_vnet" linker sets are stored.

Note, archs that use link_elf_obj (amd64) were not affected by this
issue.

Reviewed by:	jhb, julian, zec (initial version)
MFC after:	1 month
2012-09-27 14:55:15 +00:00
Konstantin Belousov
94cb35459d Make the updates of the tid ring buffer' head and tail pointers
explicit by moving them into separate statements from the buffer
element accesses.

Requested by:	jhb
MFC after:	3 days
2012-09-26 09:25:11 +00:00
Pawel Jakub Dawidek
28f865b0b1 Fix freebsd32_kmq_timedreceive() and freebsd32_kmq_timedsend() to use
getmq_read() and getmq_write() respectively, just like sys_kmq_timedreceive()
and sys_kmq_timedsend().

Sponsored by:	FreeBSD Foundation
MFC after:	2 weeks
2012-09-25 22:15:59 +00:00
Pawel Jakub Dawidek
8c706ce0d0 vn_write() always expects FOF_OFFSET flag, which is asserted at the begining,
so there is no need to check for it.

Sponsored by:	FreeBSD Foundation
MFC after:	2 weeks
2012-09-25 21:31:17 +00:00
Pawel Jakub Dawidek
3a038c4d68 We cannot open file for reading and executing (O_RDONLY | O_EXEC).
Well, in theory we can pass those two flags, because O_RDONLY is 0,
but we won't be able to read from a descriptor opened with O_EXEC.

Update the comment.

Sponsored by:	FreeBSD Foundation
MFC after:	2 weeks
2012-09-25 21:11:40 +00:00
Pawel Jakub Dawidek
5c3e5c7f03 Require CAP_DELETE on directory descriptor for unlinkat(2).
Sponsored by:	FreeBSD Foundation
MFC after:	2 weeks
2012-09-25 21:00:36 +00:00
Pawel Jakub Dawidek
cffcbad2bf Require CAP_CREATE on directory descriptor for symlinkat(2).
Sponsored by:	FreeBSD Foundation
MFC after:	2 weeks
2012-09-25 20:59:12 +00:00
Pawel Jakub Dawidek
d2e166e654 Require CAP_CREATE on directory descriptor for linkat(2).
Sponsored by:	FreeBSD Foundation
MFC after:	2 weeks
2012-09-25 20:58:15 +00:00
Pawel Jakub Dawidek
1159429db8 O_EXEC flag is not part of the O_ACCMODE mask, check it separately.
If O_EXEC is provided don't require CAP_READ/CAP_WRITE, as O_EXEC
is mutually exclusive to O_RDONLY/O_WRONLY/O_RDWR.

Without this change CAP_FEXECVE capability right is not enforced.

Sponsored by:	FreeBSD Foundation
MFC after:	3 days
2012-09-25 20:48:49 +00:00
George V. Neville-Neil
0bf9cb917c Change the module name for the I/O provider to "kernel" from
"genunix"  This will requires us to modify externally created
DTrace scripts but makes logical sense for FreeBSD.

Requested by: rpaulo
MFC after:	2 weeks
2012-09-25 19:16:28 +00:00
John Baldwin
d95dca1d08 Add optional entropy harvesting for software interrupts in swi_sched()
as controlled by kern.random.sys.harvest.swi.  SWI harvesting feeds into
the interrupt FIFO and each event is estimated as providing a single bit of
entropy.

Reviewed by:	markm, obrien
MFC after:	2 weeks
2012-09-25 14:55:46 +00:00
Konstantin Belousov
787a64ddd2 Do not skip two elements of the tid_buffer when reusing the buffer
slot. This eventually results in exhaustion of the tid space, causing
new threads get tid -1 as identifier.

The bad effect of having the thread id equal to -1 is that
UMTX_OP_UMUTEX_WAIT returns EFAULT for a lock owned by such thread,
because casuword cannot distinguish between literal value -1 read from
the address and -1 returned as an indication of faulted
access. _thr_umutex_lock() helper from libthr does not check for
errors from _umtx_op_err(2), causing an infinite loop in
mutex_lock_sleep().

We observed the JVM processes hanging and consuming enormous amount of
system time on machines with approximately 100 days uptime.

Reported by:	Mykola Dzham <freebsd levsha org ua>
MFC after:	1 week
2012-09-22 12:17:09 +00:00
Eitan Adler
96240c89f0 Correct double "the the"
Approved by:	cperciva
MFC after:	3 days
2012-09-14 21:28:56 +00:00
Andriy Gapon
e87fc7cf7b sched_ule: fix inverted condition in reporting of priority lending via ktr
Reviewed by:	kan
MFC after:	1 week
2012-09-14 19:55:28 +00:00
Attilio Rao
0a15e5d30d Remove all the checks on curthread != NULL with the exception of some MD
trap checks (eg. printtrap()).

Generally this check is not needed anymore, as there is not a legitimate
case where curthread != NULL, after pcpu 0 area has been properly
initialized.

Reviewed by:	bde, jhb
MFC after:	1 week
2012-09-13 22:26:22 +00:00
John Baldwin
0f14f15b62 Ignore stop and continue signals sent to an exiting process. Stop signals
set p_xstat to the signal that triggered the stop, but p_xstat is also
used to hold the exit status of an exiting process.  Without this change,
a stop signal that arrived after a process was marked P_WEXIT but before
it was marked a zombie would overwrite the exit status with the stop signal
number.

Reviewed by:	kib
MFC after:	1 week
2012-09-13 15:51:18 +00:00
Attilio Rao
e3ae0dfe69 Improve check coverage about idle threads.
Idle threads are not allowed to acquire any lock but spinlocks.
Deny any attempt to do so by panicing at the locking operation
when INVARIANTS is on. Then, remove the check on blocking on a
turnstile.
The check in sleepqueues is left because they are not allowed to use
tsleep() either which could happen still.

Reviewed by:	bde, jhb, kib
MFC after:	1 week
2012-09-12 22:10:53 +00:00
Attilio Rao
faa1082aa2 Tweak the commit message in case of panic for sleeping from threads
with TDP_NOSLEEPING on.

The current message has no informations on the thread and wchan
involed, which may be useful in case where dumps have mangled dwarf
informations.

Reported by:    kib
Reviewed by:	bde, jhb, kib
MFC after:	1 week
2012-09-12 22:05:54 +00:00
Konstantin Belousov
bcd5bb8e57 Add a facility for vgone() to inform the set of subscribed mounts
about vnode reclamation. Typical use is for the bypass mounts like
nullfs to get a notification about lower vnode going away.

Now, vgone() calls new VFS op vfs_reclaim_lowervp() with an argument
lowervp which is reclaimed. It is possible to register several
reclamation event listeners, to correctly handle the case of several
nullfs mounts over the same directory.

For the filesystem not having nullfs mounts over it, the overhead
added is a single mount interlock lock/unlock in the vnode reclamation
path.

In collaboration with:	pho
MFC after:	3 weeks
2012-09-09 19:17:15 +00:00
Konstantin Belousov
84c3cd4f19 Add MNTK_LOOKUP_EXCL_DOTDOT struct mount flag, which specifies to the
lookup code that dotdot lookups shall override any shared lock
requests with the exclusive one. The flag is useful for filesystems
which sometimes need to upgrade shared lock to exclusive inside the
VOP_LOOKUP or later, which cannot be done safely for dotdot, due to
dvp also locked and causing LOR.

In collaboration with:	    pho
MFC after:	3 weeks
2012-09-09 19:11:52 +00:00
Attilio Rao
16cbf13b53 Move the checks for td_pinned, td_critnest, TDP_NOFAULTING and
TDP_NOSLEEPING leaking from syscallret() to userret() so that also
trap handling is covered. Also, the check on td_locks is not duplicated
between the two functions.

Reported by:	avg
Reviewed by:	kib
MFC after:	1 week
2012-09-08 18:35:15 +00:00
Attilio Rao
fbe18392a1 Move PT_UPDATED_FLUSH() before td_locks check in order to have more
coverage also in the XEN case.

Reviewed by:	kib
MFC after:	1 week
2012-09-08 18:29:53 +00:00
Attilio Rao
324e57150d userret() already checks for td_locks when INVARIANTS is enabled, so
there is no need to check if Giant is acquired after it.

Reviewed by:	kib
MFC after:	1 week
2012-09-08 18:27:11 +00:00
Gleb Smirnoff
aaf6343576 Supply the pr_ctloutput method for local datagram sockets,
so that setsockopt() and getsockopt() work on them.

This makes 'tools/regression/sockets/unix_cmsg -t dgram'
more successful.
2012-09-07 21:06:54 +00:00
John Baldwin
773e3b7dda A few whitespace and comment fixes. 2012-09-07 15:10:46 +00:00
Aleksandr Rybalko
1bccd8638e Style fixes.
Suggested by:   mdf
Approved by:	adrian (menthor)
2012-09-04 23:16:55 +00:00
Aleksandr Rybalko
6a8dada257 Add missing braces.
Approved by:	bschmidt (while mentor offline)
Pointed by:     gcooper
Pointy hat to:  ray
2012-09-03 09:46:46 +00:00
Andrey Zonov
ceb0f71506 - Mark some sysctls with CTLFLAG_TUN flag instead of CTLFLAG_RDTUN.
Pointed out by:	avg
Approved by:	kib (mentor)
MFC after:	1 week
2012-09-03 09:26:56 +00:00
Aleksandr Rybalko
70da14c4bb Add kern.hintmode sysctl variable to show current state of hints:
0 - loader hints in environment only;
1 - static hints only
2 - fallback mode (Dynamic KENV with fallback to kernel environment)
Add kern.hintmode write handler, accept only value 2. That will switch
static KENV to dynamic. So it will be possible to change device hints.

Approved by:	adrian (mentor)
2012-09-03 08:52:05 +00:00
Andrey Zonov
c3927cd956 - Make kern.maxtsiz, kern.dfldsiz, kern.maxdsiz, kern.dflssiz, kern.maxssiz
and kern.sgrowsiz sysctls writable.

Approved by:	kib (mentor)
2012-09-02 17:39:02 +00:00
Mikolaj Golub
bb9f214f64 In soreceive_generic() remove the optimization for the case when
MSG_WAITALL is set, and it is possible to do the entire receive
operation at once if we block (resid <= hiwat). Actually it might make
the recv(2) with MSG_WAITALL flag get stuck when there is enough space
in the receiver buffer to satisfy the request but not enough to open
the window closed previously due to the buffer being full.

The issue can be reproduced using the following scenario:

On the sender side do 2 send(2) requests:

1) data of size much smaller than SOBUF_SIZE (e.g. SOBUF_SIZE / 10);
2) data of size equal to SOBUF_SIZE.

On the receiver side do 2 recv(2) requests with MSG_WAITALL flag set:

1) recv() data of SOBUF_SIZE / 10 size;
2) recv() data of SOBUF_SIZE size;

We totally fill the receiver buffer with one SOBUF_SIZE/10 size request
and partial SOBUF_SIZE request. When the first request is processed we
get SOBUF_SIZE/10 free space. It is just enough to receive the rest of
bytes for the second request, and soreceive_generic() blocks in the
part that is a subject of this change waiting for the rest. But the
window was closed when the buffer was filled and to avoid silly window
syndrome it opens only when available space is larger than sb_hiwat/4
or maxseg. So it is stuck and pending data is only sent via TCP window
probes.

Discussed with:	kib (long ago)
MFC after:	2 weeks
2012-09-02 07:33:52 +00:00
Mikolaj Golub
2ad099fcb1 In soreceive_generic() when checking if the type of mbuf has changed
check it for MT_CONTROL type too, otherwise the assertion
"m->m_type == MT_DATA" below may be triggered by the following scenario:

- the sender sends some data (MT_DATA) and then a file descriptor
  (MT_CONTROL);
- the receiver calls recv(2) with a MSG_WAITALL asking for data larger
  than the receive buffer (uio_resid > hiwat).

MFC after:	2 week
2012-09-02 07:29:37 +00:00
Pawel Jakub Dawidek
707641ec28 Fix panic in procdesc that can be triggered in the following scenario:
1. Process A pdfork(2)s process B.
2. Process A passes process descriptor of B to unrelated process C.
3. Hit CTRL+C to terminate process A. Process B is also terminated
   with SIGINT.
4. init(8) collects status of process B.
5. Process C closes process descriptor associated with process B.

When we have such order of events, init(8), by collecting status of
process B, will call procdesc_reap(). This function sets pd_proc to NULL.

Now when process C calls close on this process descriptor,
procdesc_close() is called. Unfortunately procdesc_close() assumes that
pd_proc points at a valid proc structure, but it was set to NULL earlier,
so the kernel panics.

The patch also adds setting 'p->p_procdesc' to NULL in procdesc_reap(),
which I think should be done.

MFC after:	1 week
2012-09-01 11:21:56 +00:00
Attilio Rao
d4a2ab8c07 Post r222812 KTR_CPUMASK started being initialized only as a tunable
handler and not more statically.

Unfortunately, it seems that this is not ideal for new platform bringup
and boot low level development (which needs ktr_cpumask to be effective
before tunables can be setup).

Because of this, add a way to statically initialize cpusets, by passing
an list of initializers, divided by commas. Also, provide a way to enforce
an all-set mask, for above mentioned initializers.

This imposes some differences on how KTR_CPUMASK is setup now as a
kernel option, and in particular this makes the words specifications
backward wrt. what is currently in -CURRENT. In order to avoid mismatches
between KTR_CPUMASK definition and other way to setup the mask
(tunable, sysctl) and to print it, change the ordering how
cpusetobj_print() and cpusetobj_scan() acquire the words belonging
to the set.
Please give a look to sys/conf/NOTES in order to understand how the
new format is supposed to work.

Also, ktr manpages will be updated shortly by gjb which volountereed
for this.

This patch won't be merged because it changes a POLA (at least
from the theoretical standpoint) and this is however a patch that
proves to be effective only in development environments.

Requested by:	rpaulo
Reviewed by:	jeff, rpaulo
2012-08-30 21:22:47 +00:00
Marius Strobl
bf38cf8ab3 - Unlike cache invalidation and TLB demapping IPIs, reading registers from
other CPUs doesn't require locking so get rid of it. As the latter is used
  for the timecounter on certain machine models, using a spin lock in this
  case can lead to a deadlock with the upcoming callout(9) rework.
- Merge r134227/r167250 from x86:
  Avoid cross-IPI SMP deadlock by using the smp_ipi_mtx spin lock not only
  for smp_rendezvous_cpus() but also for the MD cache invalidation and TLB
  demapping IPIs.
- Mark some unused function arguments as such.

MFC after:	1 week
2012-08-29 16:56:50 +00:00
Ed Schouten
fa4dd27847 Remove unused SI_* flags.
The SI_DEVOPEN, SI_CONSOPEN and SI_CANDELETE flags are not used by any
piece of code in the tree.
2012-08-28 19:30:29 +00:00
John Baldwin
10f0ab3933 Shorten the name of the fast SWI taskqueue to "fast taskq" so that
it fits.

Reported by:	lev
MFC after:	1 week
2012-08-28 13:35:37 +00:00
Navdeep Parhar
812302c3eb Allow nmbjumbop, nmbjumbo9, and nmbjumbo16 to be set directly via loader
tunables.

MFC after:	1 month
2012-08-23 21:32:02 +00:00
Konstantin Belousov
258f94423b Provide some compat32 shims for sysctl vfs.conflist. It is required
for getvfsbyname(3) operation when called from 32bit process, and
getvfsbyname(3) is used by recent bsdtar import.

Reported by:	many
Tested by:	David Naylor <naylor.b.david@gmail.com>
MFC after:	5 days
2012-08-22 20:05:34 +00:00
John Baldwin
7e690c1f79 Assert that system calls do not leak a pinned thread (via sched_pin()) to
userland.
2012-08-22 20:02:42 +00:00
John Baldwin
dda66918e6 Fix a typo. 2012-08-22 20:01:57 +00:00
John Baldwin
ba96d2d816 Mark the idle threads as non-sleepable and also assert that an idle
thread never blocks on a turnstile.
2012-08-22 20:01:38 +00:00
John Baldwin
e6bdd477fc Fix the 'show witness' DDB command to honor db_pager_quit. 2012-08-22 20:00:41 +00:00
John Baldwin
6f7d0018b0 Add a BUS_CHILD_DELETED() method that a bus can hook to allow it to cleanup
any bus-specific state (such as ivars) when a child device is deleted.

Requested by:	kan
MFC after:	1 month
2012-08-21 18:13:09 +00:00