13597 Commits

Author SHA1 Message Date
pjd
667d7255be Fix panic in ktrcapfail() when no capability rights are passed.
While here, correct all consumers to pass NULL instead of 0 as we pass
capability rights as pointers now, not uint64_t.

Reported by:	Daniel Peyrolon
Tested by:	Daniel Peyrolon
Approved by:	re (marius)
2013-09-18 19:26:08 +00:00
rdivacky
fae003a069 Revert r255672, it has some serious flaws, leaking file references etc.
Approved by:	re (delphij)
2013-09-18 18:48:33 +00:00
rdivacky
d57db3eead Implement epoll support in Linuxulator. This is a tiny wrapper around kqueue
to implement epoll subset of functionality. The kqueue user data are 32bit
on i386 which is not enough for epoll user data so this patch overrides
kqueue fileops to maintain enough space in struct file.

Initial patch developed by me in 2007 and then extended and finished
by Yuri Victorovich.

Approved by:    re (delphij)
Sponsored by:   Google Summer of Code
Submitted by:   Yuri Victorovich <yuri at rawbw dot com>
Tested by:      Yuri Victorovich <yuri at rawbw dot com>
2013-09-18 17:56:04 +00:00
glebius
ab92040a7b Fix assertion in sendfile_readpage() to assert only the validity
of requested amount of data in a page. Move assertion down below
object unlock.

Approved by:	re (kib)
Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2013-09-17 06:37:21 +00:00
kib
6796656333 Remove zero-copy sockets code. It only worked for anonymous memory,
and the equivalent functionality is now provided by sendfile(2) over
posix shared memory filedescriptor.

Remove the cow member of struct vm_page, and rearrange the remaining
members.  While there, make hold_count unsigned.

Requested and reviewed by:	alc
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Approved by:	re (delphij)
2013-09-16 06:25:54 +00:00
kib
36c84de38f Use TAILQ instead of STAILQ for kqeueue filedescriptors to ensure constant
time removal on kqueue close.

Reported and tested by:	pho
Reviewed by:	jmg
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (delphij)
2013-09-13 19:50:50 +00:00
kib
4c87f20f3a When opening or closing fifo, ensure that the vnode is locked
exclusively.  Filesystems are assumed to disable shared locking for
the fifo vnode locks, but some do not.

Reported and tested by:	olgeni
Discussed with:	avg
Sponsored by:   The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (glebius)
2013-09-13 06:52:23 +00:00
kib
b3080da236 Reduce the scope of the proctree_lock. If several processes cause
continuous calls to the uprintf(9), the proctree_lock could be
shared-locked for indefinite amount of time, starving exclusive
requests. Since proctree_lock is needed for fork() and exit(), this
effectively stops the machine.

While there, do the similar reduction for tprintf(9).

Reported and tested by: pho
Reviewed by:    ed
Sponsored by:   The FreeBSD Foundation
MFC after:	1 week
Approved by:	re (glebius)
2013-09-13 06:39:10 +00:00
jhb
5dbaab99c0 Regen.
Approved by:	re (kib)
2013-09-12 18:03:51 +00:00
jhb
e0689d5d63 Fix the type of the idtype argument to wait6() in syscalls.master.
(Accidentally missed this in the previous commit)

Approved by:	re (kib)
MFC after:	1 week
2013-09-12 18:01:13 +00:00
jhb
d31f6d6b64 Fix the type of the idtype argument to wait6() in syscalls.master.
Approved by:	re (kib)
MFC after:	1 week
2013-09-12 17:52:18 +00:00
glebius
3d45194f2c Provide pr_ctloutput method for AF_LOCAL/SOCK_SEQPACKET sockets.
This makes setsockopt() on them working.

Reported by:	Yuri <yuri rawbw.com>
Approved by:	re (kib)
2013-09-11 18:22:30 +00:00
kib
4d92de31b2 Fix build with gcc.
Build-tested by:	gjb
Approved by:	re (glebius)
2013-09-11 17:31:22 +00:00
kib
55cc853a4d Implement sendfile(2) for the posix shared memory segment file descriptor,
in addition to the regular files.

Requested by:	alc
Discussed with:	emaste
Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
Approved by:	re (hrs)
2013-09-11 06:41:15 +00:00
des
21e6fd796b Fix the length calculation for the final block of a sendfile(2)
transmission which could be tricked into rounding up to the nearest
page size, leaking up to a page of kernel memory.  [13:11]

In IPv6 and NetATM, stop SIOCSIFADDR, SIOCSIFBRDADDR, SIOCSIFDSTADDR
and SIOCSIFNETMASK at the socket layer rather than pass them on to the
link layer without validation or credential checks.  [SA-13:12]

Prevent cross-mount hardlinks between different nullfs mounts of the
same underlying filesystem.  [SA-13:13]

Security:	CVE-2013-5666
Security:	FreeBSD-SA-13:11.sendfile
Security:	CVE-2013-5691
Security:	FreeBSD-SA-13:12.ifioctl
Security:	CVE-2013-5710
Security:	FreeBSD-SA-13:13.nullfs
Approved by:	re
2013-09-10 10:05:59 +00:00
jhb
04bb6e10cd Add a mmap flag (MAP_32BIT) on 64-bit platforms to request that a mapping use
an address in the first 2GB of the process's address space.  This flag should
have the same semantics as the same flag on Linux.

To facilitate this, add a new parameter to vm_map_find() that specifies an
optional maximum virtual address.  While here, fix several callers of
vm_map_find() to use a VMFS_* constant for the findspace argument instead of
TRUE and FALSE.

Reviewed by:	alc
Approved by:	re (kib)
2013-09-09 18:11:59 +00:00
delphij
7917506a10 In r243868, the error message buffer errmsg have been changed from
an on-stack array to a pointer and therefore sizeof(errmsg) would
become 4 or 8 bytes depending on the architecture.

Fix this by using ERRMSGL in place of sizeof().

Submitted by:	J David <j.david.lists@gmail.com>
MFC after:	3 days
Approved by:	re (kib)
2013-09-09 05:01:18 +00:00
kib
56cc686058 Drain for the xbusy state for two places which potentially do
pmap_remove_all(). Not doing the drain allows the pmap_enter() to
proceed in parallel, making the pmap_remove_all() effects void.

The race results in an invalidated page mapped wired by usermode.

Reported and tested by:	pho
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
Approved by:	re (glebius)
2013-09-08 17:51:22 +00:00
pjd
f16777c9dc Sort properly. 2013-09-07 19:16:02 +00:00
pjd
e781fc782f Fix panic in cap_rights_is_valid() when invalid rights are provided -
the right_to_index() function should assert correctness in this case.

Improve other assertions.

Reported by:	pho
Tested by:	pho
2013-09-07 19:03:16 +00:00
mav
e372a1314d Micro-optimize cpu_search(), allowing compiler to use more efficient inline
ffsl() implementation, when it is available, instead of homegrown iteration.

On dual-E5645 amd64 system (2x6x2 cores) under heavy I/O load that reduces
time spent inside cpu_search() from 19% to 13%, while IOPS increased by 5%.
2013-09-07 15:16:30 +00:00
markm
b41e1125b0 MFC 2013-09-07 07:58:29 +00:00
np
24e0bcea20 Add a vtprintf. It is to tprintf what vprintf is to printf.
Reviewed by:	kib
2013-09-07 07:53:21 +00:00
markm
9d67aa8bff MFC 2013-09-06 17:42:12 +00:00
jamie
d13d69ef17 Keep PRIV_KMEM_READ permitted inside jails as it is on the outside. 2013-09-06 17:32:29 +00:00
hiren
0493244548 Fixing a small typo.
Reviewed by:	gjb
Approved by:	sbruno (mentor)
2013-09-05 18:18:23 +00:00
kib
a08e5f97bc The vm_pageout_flush() functions sbusies pages in the passed pages
run.  After that, the pager put method is called, usually translated
to VOP_WRITE().  For the filesystems which use buffer cache,
bufwrite() sbusies the buffer pages again, waiting for the xbusy state
to drain.  The later is done in vfs_drain_busy_pages(), which is
called with the buffer pages already sbusied (by vm_pageout_flush()).

Since vfs_drain_busy_pages() can only wait for one page at the time,
and during the wait, the object lock is dropped, previous pages in the
buffer must be protected from other threads busying them.  Up to the
moment, it was done by xbusying the pages, that is incompatible with
the sbusy state in the new implementation of busy.  Switch to sbusy.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
2013-09-05 12:56:08 +00:00
pjd
e6563595d7 The fget() function now takes pointer to cap_rights_t, so change 0 to NULL. 2013-09-05 11:59:23 +00:00
pjd
1c7defb76e Handle cases where capability rights are not provided.
Reported by:	kib
2013-09-05 11:58:12 +00:00
glebius
12cfeea19a Fix !CAPABILITIES build. 2013-09-05 10:24:09 +00:00
pjd
ad93dafc6a Correct the logic broken in my last commit.
Reported by:	tijl
2013-09-05 09:36:19 +00:00
sbruno
bf3232d8bb Restore builds on architectures that don't support CAPABILITIES (mips). 2013-09-05 03:46:44 +00:00
sbruno
c1754372ea This looks like a typo that breaks the build. Yell at me if this isn't the
intended declaration.
2013-09-05 03:36:57 +00:00
pjd
add7315b85 Style fixes. 2013-09-05 00:19:30 +00:00
pjd
36e3a50584 Style fixes. Most fixes are about not treating integers and pointers as
booleans.
2013-09-05 00:17:38 +00:00
pjd
d1a65cb7ef Regenerate after r255219.
Sponsored by:	The FreeBSD Foundation
2013-09-05 00:11:59 +00:00
pjd
029a6f5d92 Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.

The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.

The structure definition looks like this:

	struct cap_rights {
		uint64_t	cr_rights[CAP_RIGHTS_VERSION + 2];
	};

The initial CAP_RIGHTS_VERSION is 0.

The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.

The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.

To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.

	#define	CAP_PDKILL	CAPRIGHT(1, 0x0000000000000800ULL)

We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:

	#define	CAP_LOOKUP	CAPRIGHT(0, 0x0000000000000400ULL)
	#define	CAP_FCHMOD	CAPRIGHT(0, 0x0000000000002000ULL)

	#define	CAP_FCHMODAT	(CAP_FCHMOD | CAP_LOOKUP)

There is new API to manage the new cap_rights_t structure:

	cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
	void cap_rights_set(cap_rights_t *rights, ...);
	void cap_rights_clear(cap_rights_t *rights, ...);
	bool cap_rights_is_set(const cap_rights_t *rights, ...);

	bool cap_rights_is_valid(const cap_rights_t *rights);
	void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
	void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
	bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);

Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:

	cap_rights_t rights;

	cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);

There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:

	#define	cap_rights_set(rights, ...)				\
		__cap_rights_set((rights), __VA_ARGS__, 0ULL)
	void __cap_rights_set(cap_rights_t *rights, ...);

Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:

	cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);

Providing several rights that belongs to the same array's element this way is
correct, but is not advised. It should only be used for aliases definition.

This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.

Sponsored by:	The FreeBSD Foundation
2013-09-05 00:09:56 +00:00
jhb
00a02e74da Trim a couple of panic messages. 2013-09-04 11:52:28 +00:00
davide
3e4464bbe3 Fix socket buffer timeouts precision using the new sbintime_t KPI instead
of relying on the tvtohz() workaround. The latter has been introduced
lately by jhb@ (r254699) in order to have a fix that can be backported
to STABLE.

Reported by:	Vitja Makarov <vitja.makarov at gmail dot com>
Reviewed by:	jhb (earlier version)
2013-09-01 23:34:53 +00:00
ray
c555be1512 MFC @r255128. 2013-09-01 23:06:28 +00:00
rmacklem
8d06f831a7 Forced dismounts of NFS mounts can fail when thread(s) are stuck
waiting for an RPC reply from the server while holding the mount
point busy (mnt_lockref incremented). This happens because dounmount()
msleep()s waiting for mnt_lockref to become 0, before calling
VFS_UNMOUNT(). This patch adds a new VFS operation called VFS_PURGE(),
which the NFS client implements as purging RPCs in progress. Making
this call before checking mnt_lockref fixes the problem, by ensuring
that the VOP_xxx() calls will fail and unbusy the mount point.

Reported by:	sbruno
Reviewed by:	kib
MFC after:	2 weeks
2013-09-01 23:02:59 +00:00
markm
ff7909302f MFC 2013-08-30 11:38:34 +00:00
hselasky
5a68ee8757 Simplify pause_sbt() logic. Don't call DELAY() if remainder is less
than or equal to zero.
2013-08-30 10:39:56 +00:00
kib
dc9c173247 Move the definition of the struct unrhdr into a separate header file,
to allow embedding the struct.  Add init_unrhdr(9) initializer, which
sets up preallocated unrhdr.

Reviewed by:	alc
Tested by:	pho, bf
2013-08-30 07:37:45 +00:00
ken
6165f30980 Fix some issues in change 254760 pointed out by Bruce Evans:
- Remove excessive parenthesis
 - Use KNF continuation indentation
 - Cut down on excessive continuation lines
 - More consistent style in messages
 - Use uprintf() instead of printf()

Submitted by:	bde
2013-08-29 16:41:40 +00:00
jhb
5875a10cf6 Don't return an error for socket timeouts that are too large. Just
cap them to INT_MAX ticks instead.

PR:		kern/181416 (r254699 really)
Requested by:	bde
MFC after:	2 weeks
2013-08-29 15:59:05 +00:00
andre
73f239a63e Pad m_hdr on 32bit architectures to to prevent alignment and padding
problems with the way MLEN, MHLEN, and struct mbuf are set up.

CTASSERT's are provided to detect such issues at compile time in the
future.

The #define MLEN and MHLEN calculation do not take actual compiler-
induced alignment and padding inside the complete struct mbuf into
account.  Accordingly appropriate attention is required when changing
members of struct mbuf.

Ideally one would calculate MLEN as (MSIZE - sizeof(((struct mbuf *)0)->m_hdr)
but that doesn't work as the compiler refuses to operate on an as of
yet incomplete structure.

In particular ARM 32bit has more strict alignment requirements which
caused 4 bytes of padding between m_hdr and pkthdr in struct mbuf
because of the 64bit members in pkthdr.  This wasn't picked up by MLEN
and MHLEN causing an overflow of the mbuf provided data storage by
overestimating its size.

I386 didn't show this problem because it handles unaligned access just
fine, albeit at a small performance penalty.

On 64bit architectures the struct mbuf layout is 64bit aligned in all
places.

Reported by:	Thomas Skibo <ThomasSkibo-at-sbcglobal-dot-net>
Tested by:	tuexen, ian, Thomas Skibo (extended patch)
Sponsored by:	The FreeBSD Foundation
2013-08-27 20:52:02 +00:00
kib
9426e190e7 When allocating a pbuf for the cluster write, do not sleep waiting
for the available pbuf when passed vnode is backing md(4). Other i/o
directed to the same md device might already hold pbufs, and then we
could deadlock since only our progress can free a pbuf needed for
wakeup.

Obtained from:	projects/vm6
Reminded and tested by:	pho
MFC after:	1 week
2013-08-27 01:31:12 +00:00
will
a52b9ca1d3 Add the ability to display the default FIB number for a process to the
ps(1) utility, e.g. "ps -O fib".

bin/ps/keyword.c:
	Add the "fib" keyword and default its column name to "FIB".

bin/ps/ps.1:
	Add "fib" as a supported keyword.

sys/compat/freebsd32/freebsd32.h:
sys/kern/kern_proc.c:
sys/sys/user.h:
	Add the default fib number for a process (p->p_fibnum)
	to the user land accessible process data of struct kinfo_proc.

Submitted by:	Oliver Fromme <olli@fromme.com>, gibbs
2013-08-26 23:48:21 +00:00
jmg
f4ae1f5e6d fix up some comments and a white space issue...
MFC after:	3 days
2013-08-26 18:53:19 +00:00