13212 Commits

Author SHA1 Message Date
Marius Strobl
db9066f798 - Use strdup(9) instead of reimplementing it.
- Use __DECONST instead of strange casts.
- Reduce code duplication and simplify name2oid().

PR:		176373
Submitted by:	Christoph Mallon
MFC after:	1 week
2013-03-01 18:49:14 +00:00
Konstantin Belousov
58248e57ab Make the default implementation of the VOP_VPTOCNP() fail if the
directory entry, matched by the inode number, is ".".

NFSv4 client might instantiate the distinct vnodes which have the same
inode number, since single v4 export can be combined from several
filesystems on the server.  For instance, a case when the nested
server mount point is exactly one directory below the top of the
export, causes directory and its parent to have the same inode number
2.  The vop_stdvptocnp() algorithm then returns "." as the name of the
lower directory.

Filtering out the "." entry with ENOENT works around this behaviour;
the error forces getcwd(3) to fall back to usermode implementation,
which compares both st_dev and st_ino.

Based on the submission by:	rmacklem
Tested by:	rmacklem
MFC after:	1 week
2013-03-01 18:40:14 +00:00
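The user-mode fallback mentioned in the commit above identifies a directory by the pair (st_dev, st_ino) rather than by inode number alone. A minimal userland sketch of that comparison follows; the helper name `same_file` is illustrative and is not taken from libc or from the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>

/*
 * Illustrative helper (not a libc or kernel function): report whether two
 * paths name the same file.  Inode numbers are only unique within one
 * device, so both st_dev and st_ino must match.
 */
bool
same_file(const char *a, const char *b)
{
	struct stat sa, sb;

	assert(a != NULL && b != NULL);
	if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
		return (false);
	return (sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino);
}
```

Comparing st_ino alone would report a match for two directories that merely share an inode number on different filesystems, which is the situation the commit describes.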
Davide Italiano
e234a588cb MFcalloutng:
Style fixes.
2013-02-28 16:22:49 +00:00
Alexander Motin
fdc5dd2d2f MFcalloutng:
Switch eventtimers(9) from using struct bintime to sbintime_t.
Even before this not a single driver really supported full dynamic range of
struct bintime even in theory, not to mention the practical inexpediency.
This change legitimizes the status quo and cleans up the code.
2013-02-28 13:46:03 +00:00
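To illustrate the narrower range that the commit above refers to, here is a userland sketch of converting a seconds-plus-64-bit-fraction value into a single 64-bit fixed-point value with 32 fractional bits. The `demo_` names are stand-ins invented for this example, and the 32.32 layout is stated here as an assumption about sbintime_t rather than something the commit message spells out.

```c
#include <assert.h>
#include <stdint.h>

/* Userland stand-ins; the names are hypothetical, not the kernel's. */
struct demo_bintime {
	int64_t  sec;	/* whole seconds */
	uint64_t frac;	/* fraction of a second, in units of 2^-64 s */
};

typedef int64_t demo_sbintime_t;	/* assumed 32.32 fixed point */

#define	DEMO_SBT_1S	((demo_sbintime_t)1 << 32)

demo_sbintime_t
demo_bttosbt(struct demo_bintime bt)
{
	/* The result only has room for 32 bits of whole seconds. */
	assert(bt.sec >= INT32_MIN && bt.sec <= INT32_MAX);
	/* Keep the top 32 bits of the fraction; the low 32 bits are dropped. */
	return ((demo_sbintime_t)((uint64_t)bt.sec << 32) +
	    (demo_sbintime_t)(bt.frac >> 32));
}
```

The conversion loses both range and sub-nanosecond precision, which is the trade-off the commit accepts.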
Davide Italiano
acccf7d8b4 MFcalloutng:
When CPU becomes idle, cpu_idleclock() calculates time to the next timer
event in order to reprogram hw timer. Return that time in sbintime_t to
the caller and pass it to acpi_cpu_idle(), where it can be used as one
more factor (quite precise) to estimate further sleep time and choose
optimal sleep state. This is a preparatory change for further callout
improvements that will be committed in the coming days.

The commit is not targeted for MFC.
2013-02-28 10:46:54 +00:00
Konstantin Belousov
20f4e3e158 Make recursive getblk() slightly more useful. Keep the buffer state
intact if getblk() is done on the already owned buffer.  Exit from
brelse() early when the lock recursion is detected, otherwise brelse()
might prematurely destroy the buffer under some circumstances.

Sponsored by:	The FreeBSD Foundation
Noted by:	mckusick
Tested by:	pho
MFC after:	2 weeks
2013-02-27 07:34:09 +00:00
Alexander Motin
1af19ee4a2 Add support for good old 8192Hz profiling clock to software PMC.
Reviewed by:	fabient
2013-02-26 18:13:42 +00:00
Attilio Rao
590f9303e5 Merge from vmobj-rwlock branch:
Remove unused inclusion of vm/vm_pager.h and vm/vnode_pager.h.

Sponsored by:	EMC / Isilon storage division
Tested by:	pho
Reviewed by:	alc
2013-02-26 01:00:11 +00:00
Pawel Jakub Dawidek
1d59211b2e Style.
Suggested by:	kib
2013-02-25 20:51:29 +00:00
Pawel Jakub Dawidek
893365e42d After r237012, the fdgrowtable() doesn't drop the filedesc lock anymore,
so update a stale comment.

Reviewed by:	kib, keramida
2013-02-25 20:50:08 +00:00
John Baldwin
593efaf9f7 Further refine the handling of stop signals in the NFS client. The
changes in r246417 were incomplete as they did not add explicit calls to
sigdeferstop() around all the places that previously passed SBDRY to
_sleep().  In addition, nfs_getcacheblk() could trigger a write RPC from
getblk() resulting in sigdeferstop() recursing.  Rather than manually
deferring stop signals in specific places, change the VFS_*() and VOP_*()
methods to defer stop signals for filesystems which request this behavior
via a new VFCF_SBDRY flag.  Note that this has to be a VFC flag rather than
a MNTK flag so that it works properly with VFS_MOUNT() when the mount is
not yet fully constructed.  For now, only the NFS clients set this new
flag in VFS_SET().

A few other related changes:
- Add an assertion to ensure that TDF_SBDRY doesn't leak to userland.
- When a lookup request uses VOP_READLINK() to follow a symlink, mark
  the request as being on behalf of the thread performing the lookup
  (cnp_thread) rather than using a NULL thread pointer.  This causes
  NFS to properly handle signals during this VOP on an interruptible
  mount.

PR:		kern/176179
Reported by:	Russell Cattelan (sigdeferstop() recursion)
Reviewed by:	kib
MFC after:	1 month
2013-02-21 19:02:50 +00:00
Jamie Gritton
ffc72591b1 Don't worry if a module is already loaded when looking for a fstype to mount
(possible in a race condition).

Reviewed by:	kib
MFC after:	1 week
2013-02-21 02:41:37 +00:00
John Baldwin
353374b525 Fix a few typos.
2013-02-19 16:35:27 +00:00
Pawel Jakub Dawidek
b2e054b0d4 Update the comment: we do show the backtrace of misbehaving thread.
2013-02-17 21:37:32 +00:00
Pawel Jakub Dawidek
f0ad2ecb9c Style.
2013-02-17 11:56:36 +00:00
Pawel Jakub Dawidek
8e1d51ab40 - Require CAP_FSYNC capability right when opening a file with O_SYNC or O_FSYNC
flags.
- While here simplify check for locking flags.

Sponsored by:	The FreeBSD Foundation
2013-02-17 11:53:51 +00:00
Pawel Jakub Dawidek
11b0cfe3cd Remove redundant parenthesis.
2013-02-17 11:49:21 +00:00
Pawel Jakub Dawidek
49549b1894 Remove redundant space.
2013-02-17 11:48:16 +00:00
Pawel Jakub Dawidek
6c08be2b88 Add break to the default case.
2013-02-17 11:47:58 +00:00
Pawel Jakub Dawidek
4881a5950e Don't treat pointers as booleans.
2013-02-17 11:47:30 +00:00
Pawel Jakub Dawidek
de26549841 Remove redundant parenthesis.
2013-02-17 11:47:01 +00:00
Kirk McKusick
2bc1a1fe5c Add barrier write capability to the VFS buffer interface. A barrier
write is a disk write request that tells the disk that the buffer
being written must be committed to the media along with any writes
that preceded it before any future blocks may be written to the drive.

Barrier writes are provided by adding the functions bbarrierwrite
(bwrite with barrier) and babarrierwrite (bawrite with barrier).

Following a bbarrierwrite the client knows that the requested buffer
is on the media. It does not ensure that buffers written before that
buffer are on the media. It only ensures that buffers written before
that buffer will get to the media before any buffers written after
that buffer. A flush command must be sent to the disk to ensure that
all earlier written buffers are on the media.

Reviewed by: kib
Tested by:   Peter Holm
2013-02-16 14:51:30 +00:00
Ian Lepore
a1137de941 Add PPS_CANWAIT support for time_pps_fetch(). This adds support for all three
blocking modes described in section 3.4.3 of RFC 2783, allowing the caller
to retrieve the most recent values without blocking, to block for a specified
time, or to block forever.

Reviewed by:	discussion on hackers@
2013-02-15 18:30:32 +00:00
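The three blocking modes named in the commit above are selected through the timeout argument. The sketch below classifies a timeout value under the assumption that a NULL pointer means "block forever", a zero timespec means "do not block", and anything else means "block for the given time"; that mapping is my reading of RFC 2783 and is not quoted from the commit. The enum and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical labels for the three modes; not part of the PPS API. */
enum demo_pps_wait { DEMO_PPS_NOWAIT, DEMO_PPS_TIMED, DEMO_PPS_FOREVER };

enum demo_pps_wait
demo_pps_wait_mode(const struct timespec *timeout)
{
	if (timeout == NULL)
		return (DEMO_PPS_FOREVER);	/* assumed: no timeout given */
	assert(timeout->tv_nsec >= 0 && timeout->tv_nsec < 1000000000L);
	if (timeout->tv_sec == 0 && timeout->tv_nsec == 0)
		return (DEMO_PPS_NOWAIT);	/* assumed: zero timeout polls */
	return (DEMO_PPS_TIMED);
}
```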
Sergey Kandaurov
d7ffa24831 vn_io_faults_cnt:
- use u_long consistently
- use SYSCTL_ULONG to match the type of variable

Reviewed by:	kib
MFC after:	1 week
2013-02-15 14:22:05 +00:00
Sergey Kandaurov
ab15d8039e Add support of passing SCM_BINTIME ancillary data object for PF_LOCAL
sockets.

PR:		kern/175883
Submitted by:	Andrey Simonenko <simon@comsys.ntu-kpi.kiev.ua>
Discussed with:	glebius, phk
MFC after:	2 weeks
2013-02-15 13:00:20 +00:00
Ian Lepore
74938cbb7f Make the F_READAHEAD option to fcntl(2) work as documented: a value of zero
now disables read-ahead.  It used to effectively restore the system default
readahead heuristic if it had been changed; a negative value now restores
the default.

Reviewed by:	kib
2013-02-13 15:09:16 +00:00
Konstantin Belousov
dd0b4fb6d5 Reform the busdma API so that new types may be added without modifying
every architecture's busdma_machdep.c.  It is done by unifying the
bus_dmamap_load_buffer() routines so that they may be called from MI
code.  The MD busdma is then given a chance to do any final processing
in the complete() callback.

The cam changes unify the bus_dmamap_load* handling in cam drivers.

The arm and mips implementations are updated to track virtual
addresses for sync().  Previously this was done in a type specific
way.  Now it is done in a generic way by recording the list of
virtuals in the map.

Submitted by:	jeff (sponsored by EMC/Isilon)
Reviewed by:	kan (previous version), scottl,
	mjacob (isp(4), no objections for target mode changes)
Discussed with:	     ian (arm changes)
Tested by:	marius (sparc64), mips (jmallet), isci(4) on x86 (jharris),
	amd64 (Fabian Keil <freebsd-listen@fabiankeil.de>)
2013-02-12 16:57:20 +00:00
Marius Strobl
18716f9f4b Update comments to reflect r246689.
2013-02-11 23:05:10 +00:00
Marius Strobl
bdc5f0172e Make SYSCTL_{LONG,QUAD,ULONG,UQUAD}(9) work as advertised and also handle
constant values.

Reviewed by:	kib
MFC after:	3 days
2013-02-11 21:50:00 +00:00
Konstantin Belousov
2871baa49a Remove the ia64-specific code fragment, whose effect is more cleanly
done by the call to trans_prot() function a line before.

Discussed with:	Oliver Pinter <oliver.pntr@gmail.com>
MFC after:	1 week
2013-02-10 20:08:33 +00:00
Andriy Gapon
c43b08dc6c ktr: correctly handle possible wrap-around in the boot buffer
Older entries should be 'before' newer entries in the new buffer too
and there should be no zero-filled gap between them.

Pointed out by:	jhb
MFC after:	3 days
X-MFC with:	r246282
2013-02-08 07:29:07 +00:00
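The ordering requirement in the commit above (older entries first, no gap) is the usual problem of linearizing a ring buffer. The following sketch shows the idea on a plain array of integers; the function name, the `next` write index, and the `wrapped` flag are assumptions made for the example and do not mirror the ktr data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the contents of a ring of n slots into dst in chronological order.
 * "next" is the slot that would be written next; "wrapped" says whether
 * the ring has already overwritten its oldest entries.  Returns the number
 * of entries copied.  dst must have room for n entries.
 */
size_t
demo_ring_linearize(const int *src, size_t n, size_t next, bool wrapped,
    int *dst)
{
	size_t tail;

	assert(n > 0 && next < n);
	if (!wrapped) {
		/* Nothing was overwritten: slots [0, next) are already in order. */
		memcpy(dst, src, next * sizeof(*src));
		return (next);
	}
	/* Oldest entries live in [next, n); newest in [0, next). */
	tail = n - next;
	memcpy(dst, src + next, tail * sizeof(*src));
	memcpy(dst + tail, src, next * sizeof(*src));
	return (n);
}
```

Copying the two halves back to back is what keeps the destination free of a zero-filled gap between old and new entries.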
Konstantin Belousov
888d4d4f86 When vforked child is traced, the debugging events are not generated
until child performs exec().  The behaviour is reasonable when a
debugger is the real parent, because the parent is stopped until
exec(), and sending a debugging event to the debugger would deadlock
both parent and child.

On the other hand, when debugger is not the parent of the vforked
child, not sending debugging signals makes it impossible to debug
across vfork.

Fix the issue by declining generating debug signals only when vfork()
was done and child called ptrace(PT_TRACEME).  Set a new process flag
P_PPTRACE from the attach code for PT_TRACEME, if P_PPWAIT flag is
set, which indicates that the process was created with vfork() and
has not yet execed. Check P_PPTRACE from issignal(), instead of
refusing the trace outright for the P_PPWAIT case.  The scope of
P_PPTRACE is exactly contained in the scope of P_PPWAIT.

Found and tested by:  zont
Reviewed by:	pluknet
MFC after:	2 weeks
2013-02-07 15:34:22 +00:00
Konstantin Belousov
2ca4998342 Stop translating the ERESTART error from the open(2) into EINTR.
Posix requires that open(2) is restartable for SA_RESTART.

For non-posix objects, in particular, devfs nodes, still disable
automatic restart of the opens. The open call to a driver could have
significant side effects for the hardware.

Noted and reviewed by:	jilles
Discussed with:	bde
MFC after:	2 weeks
2013-02-07 14:53:33 +00:00
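From the application side, the restart behaviour discussed in the commit above is requested by installing a signal handler with the SA_RESTART flag. The sketch below shows that setup with the standard sigaction(2) interface; the `demo_` function names are invented for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

static void
demo_handler(int signo)
{
	(void)signo;	/* nothing to do; the handler only needs to exist */
}

/* Install demo_handler for signo so that interrupted calls are restarted. */
int
demo_install_restartable(int signo)
{
	struct sigaction sa;

	assert(signo > 0);
	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = demo_handler;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = SA_RESTART;
	return (sigaction(signo, &sa, NULL));
}
```

With such a handler installed, a restartable call interrupted by the signal is resumed by the system instead of failing with EINTR; per the commit, opens of devfs nodes remain an exception.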
Neel Natu
dae3dc73f6 If an interrupt event's assign_cpu method fails, then restore the original
cpuset mask for the associated interrupt thread.

The text used above is verbatim from r195249 and the code should now be
in line with the intent of that commit.
2013-02-07 06:48:47 +00:00
Pawel Jakub Dawidek
fbda3d5dae Audit sockaddr argument for bind(2), connect(2), accept(2), sendto(2) and
recvfrom(2) syscalls.

Sponsored by:	The FreeBSD Foundation
2013-02-07 00:36:00 +00:00
Pawel Jakub Dawidek
82b316b377 Minor style tweaks.
2013-02-07 00:27:11 +00:00
John Baldwin
a120a7a3cd Rework the handling of stop signals in the NFS client. The changes in
195702, 195703, and 195821 prevented a thread from suspending while holding
locks inside of NFS by forcing the thread to fail sleeps with EINTR or
ERESTART but defer the thread suspension to the user boundary.  However,
this had the effect that stopping a process during an NFS request could
abort the request and trigger EINTR errors that were visible to userland
processes (previously the thread would have suspended and completed the
request once it was resumed).

This change instead effectively masks stop signals while in the NFS client.
It uses the existing TDF_SBDRY flag to effect this since SIGSTOP cannot
be masked directly.  Also, instead of setting PBDRY on individual sleeps,
the NFS client now sets the TDF_SBDRY flag around each NFS request and
stop signals are masked for all sleeps during that region (the previous
change missed sleeps in lockmgr locks).  The end result is that stop
signals sent to threads performing an NFS request are completely
ignored until after the NFS request has finished processing and the
thread prepares to return to userland.  This restores the behavior of
stop signals being transparent to userland processes while still
preventing threads from suspending while holding NFS locks.

Reviewed by:	kib
MFC after:	1 month
2013-02-06 17:06:51 +00:00
Sergey Kandaurov
23c053d6a2 Prezero the acl structure which is to be copied to usermode, to avoid
leakage of the previous content of padding and uninitialized fields.

Reported by:	Ilia Noskov <noskov@nic.ru>
Reviewed by:	kib
MFC after:	1 week
2013-02-06 15:18:46 +00:00
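The prezeroing technique from the commit above can be sketched in userland as follows. The structure and function names are hypothetical; the point is only that the whole object is cleared before individual fields are set, so padding bytes and fields that are never assigned cannot carry stale data to the reader.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record; the char member is followed by padding on typical ABIs. */
struct demo_record {
	char tag;
	long value;
};

void
demo_record_fill(struct demo_record *out, char tag, long value)
{
	assert(out != NULL);
	/* Clear everything first, including bytes no field assignment touches. */
	memset(out, 0, sizeof(*out));
	out->tag = tag;
	out->value = value;
}
```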
Sergey Kandaurov
51dc4fea4c Remove reference to the rlist code from comments, and fix a typo visible
in the resulting change.

Reviewed by:	kib
MFC after:	1 week
2013-02-05 20:08:33 +00:00
Andriy Gapon
c8199bc955 ktr: prevent possible footshooting with KTR_ENTRIES and KTR_BOOT_ENTRIES
Suggested by:	adrian
MFC after:	14 days
X-MFC with:	r246282
2013-02-04 21:58:57 +00:00
Andriy Gapon
f85ed12497 ktr: copy content from the early static buffer if KTR_ENTRIES !=
KTR_BOOT_ENTRIES

Reported by:	glebius, jhb
Pointyhat to:	avg
MFC after:	14 days
X-MFC with:	r246282
2013-02-04 21:50:55 +00:00
Marius Strobl
94bfd5b1a0 Try to improve r242655 take III: move these SYSCTLs describing the kernel
map, which is defined and initialized in vm/vm_kern.c, to the latter.

Submitted by:	alc
2013-02-04 09:35:48 +00:00
Marius Strobl
e8cbe54bc4 Further improve r242655 and supply VM_{MIN,MAX}_KERNEL_ADDRESS as constant
values to SYSCTL_ULONG(9) where possible.

Submitted by:	bde
2013-02-03 21:43:55 +00:00
Andriy Gapon
36b7dde416 allow for large KTR_ENTRIES values by allocating ktr_buf using malloc(9)
Only during very early boot, before malloc(9) is functional (SI_SUB_KMEM),
the static ktr_buf_init is used.  Size of the static buffer is determined
by a new kernel option KTR_BOOT_ENTRIES.  Its default value is 1024.

This commit builds on top of r243046.

Reviewed by:	alc
MFC after:	17 days
2013-02-03 09:57:39 +00:00
Andriy Gapon
8eede5c4d9 fix some fat-fingering in r246246
Submitted by:	mjg
Pointyhat to:	avg
MFC after:	5 days
X-MFC with:	r246246
2013-02-02 14:19:50 +00:00
Andriy Gapon
bfdcb3bcba print compiler version in the kernel banner
And provide kernel compiler version as a sysctl as well.
This is useful while we have gcc and clang cohabitation.
This could be even more useful when we have support
for external toolchains.

In cooperation with:	mjg
MFC after:		13 days
2013-02-02 11:58:35 +00:00
Grzegorz Bernacki
2d7d16429c Get time of next event from other cores only if SMP is already started.
Reviewed by: mav
Obtained from: Semihalf
2013-02-01 11:39:03 +00:00
Pawel Jakub Dawidek
4bbe7b0c20 Now that MPSAFE flag is gone, we can arrange code a bit better.
2013-01-31 22:20:05 +00:00
Pawel Jakub Dawidek
b108953c6f Remove leftover label after Giant removal from VFS.
2013-01-31 22:15:41 +00:00
Pawel Jakub Dawidek
a2c496ebb9 Remove label that was accidentally moved during Giant removal from VFS.
2013-01-31 22:14:16 +00:00
Pawel Jakub Dawidek
9e2677fd6d Simplify code a bit. This is leftover after Giant removal from VFS.
2013-01-31 22:12:48 +00:00
Konstantin Belousov
538375d42e The case of pid == WAIT_MYPGRP for the kern_wait() is already handled
in kern_wait6(), which is called by kern_wait().  Remove the redundant
check, introduced in r243136, and add a comment noting this, to make
the code less confusing.

The blank lines are added to properly delineate the scope of the
preceding comments.

Noted by:	"Jukka A. Ukkonen" <jau@iki.fi>
MFC after:	1 week
2013-01-30 13:14:15 +00:00
John Baldwin
a8df530ddc Mark 'ticks', 'time_second', and 'time_uptime' as volatile to prevent the
compiler from caching their values in tight loops.

Reviewed by:	bde
MFC after:	1 week
2013-01-28 19:38:13 +00:00
Gleb Smirnoff
29110f87a6 - Move large functions m_getjcl() and m_get2() to kern/uipc_mbuf.c
- style(9) fixes to mbuf.h

Reviewed by:	bde
2013-01-24 09:29:41 +00:00
John Baldwin
75d774e36a Fix a typo.
2013-01-23 14:37:05 +00:00
Andre Oppermann
371407162b Move the mbuf memory limit calculations from init_param2() to
tunable_mbinit() where it is next to where it is used later.

Change the sysinit level of tunable_mbinit() from SI_SUB_TUNABLES
to SI_SUB_KMEM after the VM is running.  This allows using better
methods to determine the physical and virtual memory that is
effectively available to the kernel.

Update comments.

In a second step it can be merged into mbuf_init().
2013-01-17 21:28:31 +00:00
Alfred Perlstein
17ebe960a6 Do not autotune ncallout to be greater than 18508.
When maxusers was unrestricted and maxfiles was allowed to autotune
much higher the result was that ncallout which was based on maxfiles
and maxproc grew much higher than was needed.

To fix this clip autotuning to the same number we would get with
the old maxusers algorithm which would stop scaling at 384
maxusers.

Growing ncallout higher is not likely to be needed since most consumers
of timeout(9) are gone and any higher value for ncallout causes the
callwheel hashes to be much larger than will even be needed for
most applications.

MFC after:      1 month

Reviewed by:	mav
2013-01-15 19:26:17 +00:00
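The limit of 18508 in the commit above can be reproduced by arithmetic. The sketch assumes the historical sizing formulas maxproc = 20 + 16 * maxusers, maxfiles = 2 * maxproc and ncallout = 16 + maxproc + maxfiles; these formulas are not stated in the commit message and are given here as an assumption, and the function names are invented.

```c
#include <assert.h>

#define	DEMO_NCALLOUT_MAX	18508

/* ncallout as the old maxusers-based sizing would have computed it (assumed formulas). */
int
demo_old_ncallout(int maxusers)
{
	int maxproc = 20 + 16 * maxusers;
	int maxfiles = 2 * maxproc;

	return (16 + maxproc + maxfiles);
}

/* Autotuned ncallout, clipped to the value the old algorithm topped out at. */
int
demo_clip_ncallout(int maxproc, int maxfiles)
{
	int n;

	assert(maxproc >= 0 && maxfiles >= 0);
	n = 16 + maxproc + maxfiles;
	return (n < DEMO_NCALLOUT_MAX ? n : DEMO_NCALLOUT_MAX);
}
```

Under those assumptions, maxusers = 384 gives maxproc = 6164 and maxfiles = 12328, so 16 + 6164 + 12328 = 18508, which matches the cap named in the commit.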
Andrey Zonov
0165f660c1 - Detect when we are in KVM.
Silence on:	emulation
Approved by:	kib (mentor)
MFC after:	1 week
2013-01-15 14:05:59 +00:00
Konstantin Belousov
10b4bb0b33 Add a trivial comment to record the proper commit log for r245407:
Set the v_hash for a new vnode in the getnewvnode() to the value
calculated based on the vnode structure address.  Filesystems using
vfs_hash_insert() override the v_hash using the standard formula of
(inode_number + mnt_hashseed).  For other filesystems, the
initialization allows the vfs_hash_index() to provide useful hash too.

Suggested, reviewed and tested by:	peter
Sponsored by:	The FreeBSD Foundation
MFC after:	5 days
2013-01-14 05:52:23 +00:00
Konstantin Belousov
a41df84820 diff --git a/sys/kern/vfs_subr.c b/sys/kern/vfs_subr.c
index 7c243b6..0bdaf36 100644
--- a/sys/kern/vfs_subr.c
+++ b/sys/kern/vfs_subr.c
@@ -279,6 +279,7 @@ SYSCTL_INT(_debug, OID_AUTO, vnlru_nowhere, CTLFLAG_RW,
 #define VSHOULDFREE(vp) (!((vp)->v_iflag & VI_FREE) && !(vp)->v_holdcnt)
 #define VSHOULDBUSY(vp) (((vp)->v_iflag & VI_FREE) && (vp)->v_holdcnt)

+static int vnsz2log;

 /*
  * Initialize the vnode management data structures.
@@ -293,6 +294,7 @@ SYSCTL_INT(_debug, OID_AUTO, vnlru_nowhere, CTLFLAG_RW,
 static void
 vntblinit(void *dummy __unused)
 {
+	u_int i;
 	int physvnodes, virtvnodes;

 	/*
@@ -332,6 +334,9 @@ vntblinit(void *dummy __unused)
 	syncer_maxdelay = syncer_mask + 1;
 	mtx_init(&sync_mtx, "Syncer mtx", NULL, MTX_DEF);
 	cv_init(&sync_wakeup, "syncer");
+	for (i = 1; i <= sizeof(struct vnode); i <<= 1)
+		vnsz2log++;
+	vnsz2log--;
 }
 SYSINIT(vfs, SI_SUB_VFS, SI_ORDER_FIRST, vntblinit, NULL);

@@ -1067,6 +1072,14 @@ alloc:
 	}
 	rangelock_init(&vp->v_rl);

+	/*
+	 * For the filesystems which do not use vfs_hash_insert(),
+	 * still initialize v_hash to have vfs_hash_index() useful.
+	 * E.g., nullfs uses vfs_hash_index() on the lower vnode for
+	 * its own hashing.
+	 */
+	vp->v_hash = (uintptr_t)vp >> vnsz2log;
+
 	*vpp = vp;
 	return (0);
 }
2013-01-14 05:42:54 +00:00
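The loop in the diff above computes the floor of the base-2 logarithm of the structure size, and the hash then discards that many low bits of the vnode address. A standalone sketch of both steps follows; the `demo_` names are invented, and the rationale in the comments (low address bits carry little information for objects of that size) is my interpretation rather than a statement from the commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* floor(log2(size)), computed with the same count-then-decrement loop as the diff. */
unsigned int
demo_floor_log2(size_t size)
{
	unsigned int log = 0;
	size_t i;

	assert(size > 0 && size <= SIZE_MAX / 2);	/* keeps i <<= 1 from wrapping */
	for (i = 1; i <= size; i <<= 1)
		log++;
	return (log - 1);
}

/*
 * Hash a pointer by dropping its low "shift" bits.  Distinct objects of
 * 2^shift bytes or more cannot be closer together than that, so those low
 * bits are presumed to add little to the hash.
 */
uintptr_t
demo_ptr_hash(const void *p, unsigned int shift)
{
	return ((uintptr_t)p >> shift);
}
```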
Konstantin Belousov
f6af8e375c Add exported vfs_hash_index() function, which calculates the canonical
pre-masked hash for the given vnode.  The function assumes that
vp->v_hash is initialized by the filesystem vnode instantiation
function.  At the moment, it is only done if filesystem uses
vfs_hash_insert().

Reviewed by:	peter
Tested by:	peter, pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	5 days
2013-01-14 05:41:40 +00:00
Konstantin Belousov
7b982bc831 Rename vfs_hash_index() to vfs_hash_bucket().
Reviewed by:	peter
Tested by:	peter, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	5 days
2013-01-14 05:40:21 +00:00
Konstantin Belousov
ddd6b3fc33 Add flags argument to vfs_write_resume() and remove
vfs_write_resume_flags().

Sponsored by:	The FreeBSD Foundation
2013-01-11 06:08:32 +00:00
Mateusz Guzik
43287e2753 lockmgr: unlock interlock (if requested) when dealing with upgrade/downgrade
requests for LK_NOSHARE locks, just like for shared locks.

PR:		kern/174969
Reviewed by:	attilio
MFC after:	1 week
2013-01-06 21:47:59 +00:00
Konstantin Belousov
a545089ed5 Protect the p->p_pgrp dereference with the process lock.
MFC after:	3 days
2013-01-06 15:10:10 +00:00
Neel Natu
a09580e7a3 Teach the kernel to recognize that it is executing inside a bhyve virtual
machine.

Obtained from:	NetApp
2013-01-05 19:18:50 +00:00
Benjamin Kaduk
5e9723e271 Fix some minor inaccuracies introduced in r243251.
Also correct the comment in kern_synch.c which was the source of the
problematic text.

Reviewed by:	kib (previous version)
Approved by:	hrs (mentor)
2013-01-05 00:23:26 +00:00
David Xu
eea8d86d4d Revert revision 244760 because strncpy pads trailing space with zero,
this prevents kernel data from being leaked.

Noticed by: Joerg Sonnenberger <joerg at britannica dot bec dot de>
2013-01-04 11:11:12 +00:00
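The property the commit above relies on is that strncpy() fills the unused remainder of the destination with NUL bytes, so no earlier buffer contents survive the copy. A small userland sketch, with an invented helper name:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy src into a fixed-size field.  Any bytes of dst beyond the copied
 * string are overwritten with NULs, so old contents of dst cannot leak.
 */
void
demo_copy_padded(char *dst, size_t dstlen, const char *src)
{
	assert(dst != NULL && src != NULL && dstlen > 0);
	/* strncpy() pads with NUL bytes when src is shorter than the limit. */
	strncpy(dst, src, dstlen - 1);
	/* strncpy() does not terminate on truncation, so do it explicitly. */
	dst[dstlen - 1] = '\0';
}
```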
Konstantin Belousov
d1c5e3f8b0 Remove the deprecated MNT_VNODE_FOREACH interface. Use the
MNT_VNODE_FOREACH_ALL instead.
2013-01-03 19:02:52 +00:00
Konstantin Belousov
f99cb34c4f The process_deferred_inactive() function locks the vnodes of the ufs
mount, which means that it must not be called while the snaplock is
owned.  The vfs_write_resume(9) does call the function as the
VFS_SUSP_CLEAN() method, which is too early and falls into the region
still protected by snaplock.

Add yet another flag for the vfs_write_resume_flags() to avoid calling
suspension cleanup handler after the suspend is lifted, and use it in
the ffs_snapshot() call to vfs_write_resume.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2013-01-01 16:14:48 +00:00
Konstantin Belousov
91e9474552 Make it possible to atomically resume writes on the mount and account
the write start, by adding a variation of the vfs_write_resume(9)
which accepts flags.

Use the new function to prevent a deadlock between parallel suspension
and snapshotting a UFS mount.  The ffs_snapshot() code performed
vfs_write_resume() followed by vn_start_write() while owning the
snaplock.  If the suspension intervened between resume and
vn_start_write(), the deadlock occurred after the suspending thread
tried to lock the snaplock, most typically during the write in the
ffs_copyonwrite().

Reported and tested by:	Andreas Longwitz <longwitz@incore.de>
Reviewed by:	mckusick
MFC after:	2 weeks
X-MFC-note:	make the vfs_write_resume(9) function a macro after the MFC,
	in HEAD
2012-12-28 23:08:30 +00:00
Oleksandr Tymoshenko
7fc3ae51f3 Fix build on ARM (and probably other platforms)
2012-12-28 06:52:53 +00:00
David Xu
9d4bf0db7c Use strlcpy to NULL-terminate error message even if user provided a short
buffer.
2012-12-28 02:43:33 +00:00
Attilio Rao
c92c859b7b Fixup r244240: mp_ncpus will be 1 also in the !SMP and smp_disabled=1
case. There is no point in optimizing further the code and use a TRUE
literal for a path that does heavyweight stuff anyway (like lock acq),
at the price of obfuscated code.

Use the appropriate check where necessary and remove a macro.

Sponsored by:	EMC / Isilon storage division
MFC after:	3 days
2012-12-26 15:20:32 +00:00
Konstantin Belousov
ad9789f6db Do not force a writer to the devfs file to drain the buffer writes.
Requested and tested by:	Ian Lepore <freebsd@damnhippie.dyndns.org>
MFC after:	2 weeks
2012-12-23 22:43:27 +00:00
Jaakko Heinonen
b1e1f725e7 Reject spaces and double quotation marks in device names. devctl(4)
and devd(8) can't handle names with such characters properly.

PR:		bin/144736, kern/161912
Discussed with:	imp, kib, pjd
2012-12-22 13:33:28 +00:00
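The check described in the commit above amounts to scanning a name for two forbidden characters. The following userland sketch illustrates it; the function name is hypothetical, and rejecting the empty string is an extra assumption of the example, not something the commit states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Return true if name contains neither a space nor a double quotation mark. */
bool
demo_devname_ok(const char *name)
{
	assert(name != NULL);
	if (*name == '\0')
		return (false);	/* assumption of this sketch: empty names are invalid */
	for (; *name != '\0'; name++) {
		if (*name == ' ' || *name == '"')
			return (false);
	}
	return (true);
}
```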
Attilio Rao
cd2fe4e632 Fixup r240424: On entering KDB backends, the hijacked thread to run
interrupt context can still be idlethread. At that point, without the
panic condition, it can still happen that idlethread then will try to
acquire some locks to carry on some operations.

Skip the idlethread check on block/sleep lock operations when KDB is
active.

Reported by:	jh
Tested by:	jh
MFC after:	1 week
2012-12-22 09:37:34 +00:00
Attilio Rao
b1308d72c2 Fixup r218424: uio_yield() was scaling directly to userland priority.
When kern_yield() was introduced with the possibility to specify
a new priority, the behaviour changed by not lowering priority at all
in the consumers, making the yielding mechanism highly ineffective for
high priority kthreads like bufdaemon, syncer, vlrudaemon, etc.
There is no evidence that consumers could bear such a change in
semantics, and this situation could finally lead to bugs similar to the
ones fixed in r244240.
Re-specify userland pri for kthreads involved.

Tested by:	pho
Reviewed by:	kib, mdf
MFC after:	1 week
2012-12-21 13:14:12 +00:00
Dag-Erling Smørgrav
b5471c918f Rewrite fdgrowtable() so common mortals can actually understand what
it does and how, and add comments describing the data structures and
explaining how they are managed.
2012-12-20 20:18:27 +00:00
Olivier Houchard
05d9035003 Create an architecture-agnostic buffer pool manager that uses uma(9) to
manage a set of power-of-2 sized buffers for bus_dmamem_alloc().

This allows the caller to provide the back-end uma allocator,
allowing full control of the memory pages backing the pool.  For
convenience, it provides an optional builtin allocator that provides pages
allocated with the VM_MEMATTR_UNCACHEABLE attribute, for managing pools of
DMA buffers for BUS_DMA_COHERENT or BUS_DMA_NOCACHE.

This also allows the caller to specify a minimum alignment, and it ensures
that all buffers start on a boundary and have a length that's a multiple of
that value, to avoid using buffers that trigger partial cache line flushes.

Submitted by:	Ian Lepore <freebsd@damnhippie.dyndns.org>
2012-12-20 00:34:54 +00:00
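The sizing rule described in the commit above (power-of-2 buffer sizes that are also multiples of a minimum alignment) can be sketched as a rounding function. The function name is invented, and the assumption that the minimum alignment is itself a power of two is mine.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Pick the buffer size for a request: the smallest power of two that is at
 * least "request" bytes and at least "min_align" bytes.  Starting from a
 * power-of-two min_align and doubling keeps the result a multiple of it.
 */
size_t
demo_bucket_size(size_t request, size_t min_align)
{
	size_t size;

	/* Assumption: min_align is a nonzero power of two. */
	assert(min_align != 0 && (min_align & (min_align - 1)) == 0);
	assert(request <= ((size_t)-1 >> 1));	/* keeps the doubling from wrapping */
	size = min_align;
	while (size < request)
		size <<= 1;
	return (size);
}
```

Because every size is a multiple of the alignment, a buffer that starts on an aligned boundary also ends on one, which is how partial cache line flushes are avoided.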
Pawel Jakub Dawidek
c345faea5a Replace expand_name() function with corefile_open() function, which not
only returns name, but also vnode of corefile to use.

This simplifies the code and closes few races, especially in %I handling.

Reviewed by:	kib
Obtained from:	WHEEL Systems
2012-12-19 23:59:48 +00:00
Pawel Jakub Dawidek
22a5d85aa9 Use correct file permissions when looking for available core file if
kern.corefile contains %I.

Obtained from:	WHEEL Systems
2012-12-19 23:40:02 +00:00
Jeff Roberson
4c44811c9d - Add new machine parsable KTR macros for timing events.
- Use this new format to automatically handle syscalls and VOPs.  This
   changes the earlier format but is still human readable.

Sponsored by:	EMC / Isilon Storage Division
2012-12-19 20:10:00 +00:00
Jeff Roberson
5b39d5c739 - Correctly handle EWOULDBLOCK in quiesce_cpus
Discussed with:	mav
2012-12-19 20:08:06 +00:00
Pawel Jakub Dawidek
07a8e07896 The 'flags' argument can be modified in vn_open_cred(), so we need to
set it for every loop iteration.

Pointed out by:	kib
2012-12-19 12:14:08 +00:00
Pawel Jakub Dawidek
cc58032c44 Do not audit paths we try when kern.corefile contains %I.
Obtained from:	WHEEL Systems
2012-12-19 12:12:53 +00:00
Pawel Jakub Dawidek
29146f1a7a Style cleanups.
2012-12-19 12:10:14 +00:00
Pawel Jakub Dawidek
086053a370 The expand_name() function isn't called with the process lock held anymore,
so we can safely use malloc(M_WAITOK) now.

Pointed out by:	kib
2012-12-19 12:00:09 +00:00
Mateusz Guzik
af3c786c47 prison_racct_detach can be called for not fully initialized jail, so make it check that the jail has racct before doing anything
PR:		kern/174436
Reviewed by:	trasz
MFC after:	3 days
2012-12-18 18:34:36 +00:00
Andrey Zonov
5eb0d2838c - Add sysctl to allow unprivileged users to call mlock(2)-family system
calls and turn it on.
- Do not allow calling them inside a jail. [1]

Pointed out by:	trasz [1]
Reviewed by:	avg
Approved by:	kib (mentor)
MFC after:	1 week
2012-12-18 07:36:45 +00:00
Pawel Jakub Dawidek
f06f465db7 Minor style tweaks.
Obtained from:	WHEEL Systems
2012-12-17 10:51:22 +00:00
Pawel Jakub Dawidek
c52ff61196 Better variables naming in expand_name() to be more consistent with coredump().
Obtained from:	WHEEL Systems
2012-12-17 10:48:10 +00:00
Pawel Jakub Dawidek
dd57ce87eb Move expand_name() after process lock is released.
This fixes a panic where we hold a mutex (process lock) and try to obtain a
sleepable lock (vnode lock in expand_name()). The panic could occur when %I was used
in kern.corefile.

Additionally we avoid expand_name() overhead when coredumps are disabled.

Obtained from:	WHEEL Systems
2012-12-16 14:53:27 +00:00
Pawel Jakub Dawidek
2ce1b32df2 Don't add audit record when coredumps are disabled or name cannot be expanded.
Discussed with:	rwatson
Obtained from:	WHEEL Systems
2012-12-16 14:24:59 +00:00
Pawel Jakub Dawidek
7e73ee85ab Make the check easier to read.
Obtained from:	WHEEL Systems
2012-12-16 14:14:18 +00:00
Pawel Jakub Dawidek
b039f8c2aa Use 'cred' variable.
Obtained from:	WHEEL Systems
2012-12-16 13:56:38 +00:00
Konstantin Belousov
14df601e47 When mnt_vnode_next_active iterator cannot lock the next vnode and
yields, specify the user priority for the yield.  Otherwise, a
higher-priority (kernel) thread could fall into the priority-inversion
with the thread owning the mutex lock.

On single-processor machines or UP kernels, do not loop adaptively
when the next vnode cannot be locked, instead yield unconditionally.

Restructure the iteration initializer and the iterator to remove code
duplication.  Put the code to fetch and lock a vnode next to the
current marker, into the mnt_vnode_next_active() function, and use it
instead of repeating the loop.

Reported by:	hrs, rmacklem
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2012-12-15 02:04:46 +00:00
Konstantin Belousov
d4015944e7 Remove a special case for XEN, which is erroneous and makes vfork(2)
behaviour differ from the documented one, only on XEN.  If there are
any issues with XEN pmap left, they should be fixed in pmap.

MFC after:	2 weeks
2012-12-15 02:02:11 +00:00
Rick Macklem
f1c4014cd5 The group list for a non-default export entry (a host/subnet one)
was being copied from the wrong place. This patch fixes that.
This could cause access failures for mapped users, when the group
permissions were needed.

PR:		147998
Submitted by:	Christopher Key (cjk32 at cam.ac.uk)
MFC after:	2 weeks
2012-12-14 21:49:06 +00:00
Alfred Perlstein
15d32bd543 Cleanup more of the kassert_panic.
fix compile warnings on !amd64 and NULL derefs that would happen
if kassert_panic() would return.
2012-12-11 07:08:14 +00:00
Alfred Perlstein
c2c5ede903 Fix WITNESS when INVARIANT_SUPPORT is defined.
This fixes tinderbox breakage from r244105.

Pointed out by: adrian
2012-12-11 05:59:16 +00:00
Alfred Perlstein
6b6bd3b704 Switch the hardwired WITNESS panics to kassert_panic.
This is an ongoing effort to provide runtime debug information
useful in the field that does not panic existing installations.

This gives us the flexibility needed when shipping images to a
potentially large audience with WITNESS enabled without worrying
about formerly non-fatal LORs hurting a release.

Sponsored by: iXsystems
2012-12-11 01:23:50 +00:00
Alfred Perlstein
d3bfafb4f6 back out half of 244098.
kern.bootfile needs to be rw for installkernel.

Pointed out by: kib, flo
2012-12-11 00:10:20 +00:00
Alfred Perlstein
a94053ba39 allow KASSERT to enter KDB.
2012-12-10 23:11:26 +00:00
Alfred Perlstein
d06cadae1e make sysctls kern.{bootfile,conftxt} read-only
MFC after:	1 month
2012-12-10 23:09:55 +00:00
Konstantin Belousov
686ffcaceb Do not yield while owning a mutex. The Giant reacquire in the
kern_yield() is problematic then.

The owned mutex is the mount interlock, and it is in fact not needed
to guarantee the stability of the mount list of active vnodes, so fix
the issue by only taking the mount interlock for MNT_REF and
MNT_REL operations.

While there, augment the unconditional yield by some amount of
spinning [1].

Reported and tested by:	pho
Reviewed by:	attilio
Submitted by:	attilio [1]
MFC after:	3 days
2012-12-10 20:44:09 +00:00
Andre Oppermann
0060bab556 Prevent long type overflow of realmem calculation on ILP32 by forcing
calculation to be in quad_t space.  Fix style issue with second parameter
to qmin().

Reported by:	alc
Reviewed by:	bde, alc
2012-12-10 12:19:03 +00:00
Konstantin Belousov
5d439a2957 Do not ignore zero address, possibly returned by the vm_map_find()
call.  The function indicates a failure by the TRUE return value.  To
be extra safe, assert that the return value from the following
vm_map_insert() indicates success.

Fix style issues in the nearby lines, reformulate the comment.

Reviewed by:	alc (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2012-12-10 05:14:04 +00:00
Konstantin Belousov
17cb8cfc31 Remove useless comment.
MFC after:	3 days
2012-12-09 20:34:11 +00:00
Konstantin Belousov
796fa4fb86 Fix typo.
MFC after:	3 days
2012-12-09 20:26:51 +00:00
Attilio Rao
e68ccbe85e Add a comment on why inlining critical_enter() may not be a good idea
for the general case.

Reviewed by:	bde
MFC after:	1 week
2012-12-09 04:54:22 +00:00
Pawel Jakub Dawidek
6e0b674628 Configure UMA warnings for the following zones:
- unp_zone: kern.ipc.maxsockets limit reached
- socket_zone: kern.ipc.maxsockets limit reached
- zone_mbuf: kern.ipc.nmbufs limit reached
- zone_clust: kern.ipc.nmbclusters limit reached
- zone_jumbop: kern.ipc.nmbjumbop limit reached
- zone_jumbo9: kern.ipc.nmbjumbo9 limit reached
- zone_jumbo16: kern.ipc.nmbjumbo16 limit reached

Note that those warnings are printed no more often than every five minutes and can
be globally turned off by setting sysctl/tunable vm.zone_warnings to 0.

Discussed on:	arch
Obtained from:	WHEEL Systems
MFC after:	2 weeks
2012-12-07 22:30:30 +00:00
Pawel Jakub Dawidek
45fe0bf7e4 Make use of the fact that uma_zone_set_max(9) already returns actual limit set. 2012-12-07 22:23:53 +00:00
Pawel Jakub Dawidek
4007b61cde More style cleanups. 2012-12-07 22:22:04 +00:00
Pawel Jakub Dawidek
b0b1402537 Style cleanups. 2012-12-07 22:19:41 +00:00
Pawel Jakub Dawidek
94b0ae5d62 - Make socket_zone static - it is used only in this file.
- Update maxsockets on uma_zone_set_max().

Obtained from:	WHEEL Systems
2012-12-07 22:15:51 +00:00
Pawel Jakub Dawidek
68412f4179 Style cleanups. 2012-12-07 22:13:33 +00:00
Pawel Jakub Dawidek
0b746181a2 There is no need anymore to include vm/uma.h after r241726.
Obtained from:	WHEEL Systems
2012-12-07 22:05:42 +00:00
Alfred Perlstein
3945a96431 Allow KASSERT to log instead of panic.
This is to allow debug images to be used without taking down the
system when non-fatal asserts are hit.

The following sysctls are added:

debug.kassert.warn_only: 1 = log, 0 = panic

debug.kassert.do_ktr: set to a ktr mask for logging via KTR

debug.kassert.do_log: 1 = log, 0 = quiet

debug.kassert.warnings: stats, number of kasserts hit

debug.kassert.log_panic_at:
  number of kasserts before we actually panic, 0 = never

debug.kassert.log_pps_limit: pps limit for log messages

debug.kassert.log_mute_at: stop warning after N kasserts, 0 = never stop

debug.kassert.kassert: set this sysctl to trigger a kassert

Discussed with: scottl, gnn, marcel
Sponsored by: iXsystems
2012-12-07 08:25:08 +00:00
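The interaction between debug.kassert.warn_only and debug.kassert.log_panic_at in 3945a96431 can be sketched as a small user-space model. This is only an illustration: the struct, enum and function names are hypothetical, and treating log_panic_at as "panic once this many asserts have been hit" is an assumption about the threshold, not the committed logic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of three of the sysctls listed in the commit. */
struct sketch_kassert_cfg {
	bool		warn_only;	/* debug.kassert.warn_only */
	unsigned	log_panic_at;	/* debug.kassert.log_panic_at, 0 = never */
	unsigned	warnings;	/* debug.kassert.warnings (counter) */
};

enum sketch_action { SKETCH_LOG, SKETCH_PANIC };

/* Decide what a failed assertion does under the given configuration. */
static enum sketch_action
sketch_kassert_failed(struct sketch_kassert_cfg *cfg)
{
	assert(cfg != NULL);
	if (!cfg->warn_only)
		return (SKETCH_PANIC);		/* classic KASSERT behaviour */
	cfg->warnings++;
	/* Assumed semantics: panic once the counter reaches the limit. */
	if (cfg->log_panic_at != 0 && cfg->warnings >= cfg->log_panic_at)
		return (SKETCH_PANIC);
	return (SKETCH_LOG);
}
```

With warn_only set and log_panic_at left at 0, every failed assertion in this model is logged and counted and none of them panics.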
Alfred Perlstein
3356d129ad Use uint instead of int for flags exported via sysctl. 2012-12-07 05:55:48 +00:00
Kevin Lo
b08d12d9be - according to POSIX, make socket(2) return EAFNOSUPPORT rather than
EPROTONOSUPPORT if the address family is not supported.
- introduce pffinddomain() to find a domain by family and use it as
  appropriate.

Reviewed by:	glebius
2012-12-07 02:22:48 +00:00
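The errno distinction made in b08d12d9be can be sketched as follows. The domain table, the family and protocol numbers, and the names sketch_finddomain() and sketch_socket_check() are hypothetical and exist only for this illustration; the real pffinddomain() walks the kernel's domain list.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified domain: a family plus the protocols it offers. */
struct sketch_domain {
	int		family;
	const int	*protos;
	size_t		nprotos;
};

static const int sketch_inet_protos[] = { 6, 17 };
static const struct sketch_domain sketch_domains[] = {
	{ 2, sketch_inet_protos, 2 },
};

/* Find a domain by address family; NULL if the family is unknown. */
static const struct sketch_domain *
sketch_finddomain(int family)
{
	size_t i;

	for (i = 0; i < sizeof(sketch_domains) / sizeof(sketch_domains[0]); i++)
		if (sketch_domains[i].family == family)
			return (&sketch_domains[i]);
	return (NULL);
}

/* Returns 0, EAFNOSUPPORT (unknown family) or EPROTONOSUPPORT. */
static int
sketch_socket_check(int family, int proto)
{
	const struct sketch_domain *dp = sketch_finddomain(family);
	size_t i;

	if (dp == NULL)
		return (EAFNOSUPPORT);
	assert(dp->protos != NULL);
	for (i = 0; i < dp->nprotos; i++)
		if (dp->protos[i] == proto)
			return (0);
	return (EPROTONOSUPPORT);
}
```

Looking the family up first is what lets the caller tell "no such address family" apart from "family known, protocol not offered".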
David Xu
3f6bad0181 Eliminate superfluous code. 2012-12-06 06:29:08 +00:00
Attilio Rao
bdf9120c16 Fixup r243901:
- As the comment reports, CALLOUT_LOCAL_ALLOC cannot be checked
  directly from the callout flags but might be checked by a cached
  value.  Hence, do so before actually removing the callout, when
  needed, in softclock_call_cc().
- In softclock_call_cc() also add a comment in the waiting and deferred
  migration case explaining that the dereference should be safe
  because of the migration dereference invariants.

Additionally:
- In softclock_call_cc(), for the deferred migration case, move all the
  accesses to callout structure after the comment stating the callout
  must not be destroyed.
- For consistency with this last tweak, use cached c_flags for the
  KASSERT() in the deferred migration case.  It is not strictly necessary
  but this way all the callout accesses happen after the above mentioned
  comment, improving consistency.

Pointy hat to:	me
Sponsored by:	Isilon Systems / EMC Corporation
Reviewed by:	kib
MFC after:	2 weeks
X-MFC:		243901
2012-12-05 22:32:12 +00:00
Konstantin Belousov
eb8a718686 The softclock_call_cc() is executing with the callout already removed
from the callwheel. Calculate the cc->cc_next before removing the
callout, otherwise the code followed the invalid tailq links.  After
this, make softclock_call_cc() return void, since it always returns
cc->cc_next, which is immediately available to the softclock()
anyway. This also allows eliminating a label under #ifdef SMP.

Remove the assignment of cc->cc_next from callout_cc_del(), since the
function is called with the callout already removed from callwheel.

If cancelling the migration, also clear the CALLOUT_DFRMIGRATION flag.

Postpone the free of the timeout(9) allocated callouts after the
migration checks are done.

Add some more strict asserts about the state of the callout in
callout_call_cc().

Reviewed by:	attilio
Reported and tested by:	pho (previous version)
MFC after:	2 weeks
2012-12-05 19:02:22 +00:00
Attilio Rao
1c7d98d0df Check for lockmgr recursion in case of disown and downgrade and panic
also in !debugging kernel rather than having "undefined" behaviour.

Tested by:	avg
MFC after:	1 week
2012-12-05 15:11:01 +00:00
Gleb Smirnoff
eb1b1807af Mechanically substitute flags from historic mbuf allocator with
malloc(9) flags within sys.

Exceptions:

- sys/contrib not touched
- sys/mbuf.h edited manually
2012-12-05 08:04:20 +00:00
Konstantin Belousov
f7e50ea722 Fix a race between kern_setitimer() and realitexpire(), where the
callout is started before kern_setitimer() acquires process mutex, but
loses a race and kern_setitimer() gets the process mutex before the
callout.  Then, assuming that new specified struct itimerval has
it_interval zero, but it_value non-zero, the callout, after it starts
executing again, clears p->p_realtimer.it_value, but kern_setitimer()
already rescheduled the callout.

As the result of the race, both p_realtimer is zero, and the callout
is rescheduled. Then, in the exit1(), the exit code sees that it_value
is zero and does not even try to stop the callout. This allows the
struct proc to be reused and eventually the armed callout is
re-initialized.  The consequence is the corrupted callwheel tailq.

Use process mutex to interlock the callout start, which fixes the race.

Reported and tested by:	pho
Reviewed by:	jhb
MFC after:	2 weeks
2012-12-04 20:49:39 +00:00
Konstantin Belousov
9bdf6ccab3 Do not allocate buffer of the 255 bytes length on the stack.
Reported and tested by:	sig6247@gmail.com
MFC after:	1 week
2012-12-04 20:49:04 +00:00
Alfred Perlstein
922314f018 replace bit shifting loop with 1<<fls(n), improve comments.
Reviewed by: davide
2012-12-04 05:28:20 +00:00
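The `1<<fls(n)` idiom named in 922314f018 can be sketched in portable C. fls() returns the 1-based index of the most significant set bit, or 0 for an input of 0; since it is not available in every libc, sketch_fls() below is a hypothetical stand-in, and sketch_pow2_above() is likewise an illustrative name rather than anything from the commit.

```c
#include <assert.h>
#include <limits.h>

/* Stand-in for fls(): 1-based index of the highest set bit, 0 if none. */
static int
sketch_fls(unsigned int n)
{
	int bit = 0;

	while (n != 0) {
		bit++;
		n >>= 1;
	}
	return (bit);
}

/* 1 << fls(n): the smallest power of two strictly greater than n. */
static unsigned int
sketch_pow2_above(unsigned int n)
{
	int bit = sketch_fls(n);

	/* The shift is only defined while the result still fits. */
	assert(bit < (int)(sizeof(n) * CHAR_BIT));
	return (1u << bit);
}
```

Note that the result is strictly greater than n, so an input that is already a power of two is doubled.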
Konstantin Belousov
07840861b1 The vnode_free_list_mtx is required unconditionally when iterating
over the active list. The mount interlock is not enough to guarantee
the validity of the tailq link pointers. The __mnt_vnode_next_active()
and __mnt_vnode_first_active() active list iterator helper functions
did not provide the necessary stability for the list, allowing the
iterators to pick garbage.

This was uncovered after the r243599 made the active list iterators
non-nop.

Since a vnode interlock is before the vnode_free_list_mtx, obtain the
vnode ilock in the non-blocking manner when under vnode_free_list_mtx,
and restart iteration after the yield if the lock attempt failed.

Assert that a vnode found on the list is active, and assert that the
helpers return the vnode with interlock owned.

Reported and tested by:	pho
MFC after:	1 week
2012-12-03 22:15:16 +00:00
Pawel Jakub Dawidek
8909f88d28 Fix one more compilation issue. 2012-12-01 08:59:36 +00:00
Pawel Jakub Dawidek
499f0f4d55 IFp4 @208451:
Fix path handling for *at() syscalls.

Before the change directory descriptor was totally ignored,
so the relative path argument was appended to current working
directory path and not to the path provided by descriptor, thus
wrong paths were stored in audit logs.

Now that we use directory descriptor in vfs_lookup, move
AUDIT_ARG_UPATH1() and AUDIT_ARG_UPATH2() calls to the place where
we hold file descriptors table lock, so we are sure paths will
be resolved according to the same directory in audit record and
in actual operation.

Sponsored by:	FreeBSD Foundation (auditdistd)
Reviewed by:	rwatson
MFC after:	2 weeks
2012-11-30 23:18:49 +00:00
Pawel Jakub Dawidek
e1216d1335 IFp4 @208450:
Remove redundant call to AUDIT_ARG_UPATH1().
Path will be remembered by the following NDINIT(AUDITVNODE1) call.

Sponsored by:	FreeBSD Foundation (auditdistd)
MFC after:	2 weeks
2012-11-30 22:49:28 +00:00
Andre Oppermann
df905a2bd3 Using a long is the wrong type to represent the realmem and maxmbufmem
variable as they may overflow on i386/PAE and i386 with > 2GB RAM.

Use 64bit quad_t instead.  It has broader kernel infrastructure support
with TUNABLE_QUAD_FETCH() and qmin/qmax() than other available types.

Pointed out by:	alc, bde
2012-11-29 07:30:42 +00:00
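The overflow described in df905a2bd3 can be reproduced in user space. The sketch below emulates a 32-bit long with uint32_t (so the wraparound is well defined) and compares it with a 64-bit calculation; the function names and the 4096-byte page size used in the example are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of memory computed in 32 bits: the product wraps modulo 2^32. */
static uint32_t
sketch_realmem_ilp32(uint32_t pages, uint32_t page_size)
{
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	return (pages * page_size);
}

/* Same calculation widened to 64 bits before multiplying. */
static int64_t
sketch_realmem_quad(uint32_t pages, uint32_t page_size)
{
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	return ((int64_t)pages * page_size);
}
```

For 1048576 pages of 4096 bytes (4GB) the 32-bit product wraps to 0, while the widened version yields 4294967296. At 3GB the 32-bit value still fits unsigned, but it exceeds what a signed 32-bit long can hold, which matches the "> 2GB RAM" case in the message.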
Andre Oppermann
416a434cd0 Complete r243631 by applying the remainder of kern_mbuf.c that got
lost while merging into the commit tree.

MFC after:	1 month
X-MFC-with:	r243631
2012-11-27 23:16:56 +00:00
Andre Oppermann
358c7f47da Fix r243627 by testing against the head socket instead of the socket
just created.

MFC after:	1 week
X-MFC-with:	r243627
2012-11-27 22:35:48 +00:00
Andre Oppermann
ead46972a4 Base the mbuf related limits on the available physical memory or
kernel memory, whichever is lower.  The overall mbuf related memory
limit must be set so that mbufs (and clusters of various sizes)
can't exhaust physical RAM or KVM.

The limit is set to half of the physical RAM or KVM (whichever is
lower) as the baseline.  In any normal scenario we want to leave
at least half of the physmem/kvm for other kernel functions and
userspace to prevent it from swapping too easily.  Via a tunable
kern.maxmbufmem the limit can be upped to at most 3/4 of physmem/kvm.

At the same time divorce maxfiles from maxusers and set maxfiles to
physpages / 8 with a floor based on maxusers.  This way busy servers
can make use of the significantly increased mbuf limits with a much
larger number of open sockets.

Tidy up ordering in init_param2() and check up on some users of
those values calculated here.

Out of the overall mbuf memory limit 2K clusters and 4K (page size)
clusters get 1/4 each because these are the most heavily used mbuf
sizes.  2K clusters are used for MTU 1500 ethernet inbound packets.
4K clusters are used whenever possible for sends on sockets and thus
outbound packets.  The larger cluster sizes of 9K and 16K are limited
to 1/6 of the overall mbuf memory limit.  When jumbo MTU's are used
these large clusters will end up only on the inbound path.  They are
not used on outbound, there it's still 4K.  Yes, that will stay that
way because otherwise we run into lots of complications in the
stack.  And it really isn't a problem, so don't make a scene.

Normal mbufs (256B) weren't limited at all previously.  This was
problematic as there are certain places in the kernel that on
allocation failure of clusters try to piece together their packet
from smaller mbufs.

The mbuf limit is the number of all other mbuf sizes together plus
some more to allow for standalone mbufs (ACK for example) and to
send off a copy of a cluster.  Unfortunately there isn't a way to
set an overall limit for all mbuf memory together as UMA doesn't
support such a limiting.

NB: Every cluster also has an mbuf associated with it.

Two examples on the revised mbuf sizing limits:

1GB KVM:
 512MB limit for mbufs
 419,430 mbufs
  65,536 2K mbuf clusters
  32,768 4K mbuf clusters
   9,709 9K mbuf clusters
   5,461 16K mbuf clusters

16GB RAM:
 8GB limit for mbufs
 33,554,432 mbufs
  1,048,576 2K mbuf clusters
    524,288 4K mbuf clusters
    155,344 9K mbuf clusters
     87,381 16K mbuf clusters

These defaults should be sufficient for even the most demanding
network loads.

MFC after:	1 month
2012-11-27 21:19:58 +00:00
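The cluster figures in the two examples of ead46972a4 follow from the fractions given in the message: half of the smaller of physical memory and KVM as the overall limit, 1/4 of that each for 2K and 4K clusters, and 1/6 each for 9K and 16K clusters. A minimal sketch of that arithmetic, with hypothetical names and with 9K and 16K taken as 9 * 1024 and 16 * 1024 bytes, is below. The plain mbuf count is left out because it depends on details beyond the fractions stated here.

```c
#include <assert.h>
#include <stdint.h>

/* Cluster sizes in bytes (9K and 16K assumed to be multiples of 1024). */
enum { CL_2K = 2048, CL_4K = 4096, CL_9K = 9 * 1024, CL_16K = 16 * 1024 };

struct sketch_mbuf_limits {
	int64_t maxmbufmem;	/* overall limit in bytes */
	int64_t clusters_2k;
	int64_t clusters_4k;
	int64_t clusters_9k;
	int64_t clusters_16k;
};

static struct sketch_mbuf_limits
sketch_mbuf_sizing(int64_t physmem_bytes, int64_t kvm_bytes)
{
	struct sketch_mbuf_limits l;
	int64_t base;

	assert(physmem_bytes > 0 && kvm_bytes > 0);
	/* Baseline: half of physical memory or KVM, whichever is lower. */
	base = physmem_bytes < kvm_bytes ? physmem_bytes : kvm_bytes;
	l.maxmbufmem = base / 2;
	/* 2K and 4K clusters get 1/4 of the limit each. */
	l.clusters_2k = l.maxmbufmem / 4 / CL_2K;
	l.clusters_4k = l.maxmbufmem / 4 / CL_4K;
	/* 9K and 16K clusters get 1/6 of the limit each. */
	l.clusters_9k = l.maxmbufmem / 6 / CL_9K;
	l.clusters_16k = l.maxmbufmem / 6 / CL_16K;
	return (l);
}
```

Calling sketch_mbuf_sizing() with 1GB of KVM (and more physical memory than that) gives 65,536 / 32,768 / 9,709 / 5,461 clusters, matching the first example above.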
Andre Oppermann
2c3142c82c Fix a race on listen socket teardown where while draining the
accept queues a new socket/connection may be added to the queue
due to a race on the ACCEPT_LOCK.

The submitted patch is slightly changed in comments, teardown
and locking order and extended with KASSERT's.

Submitted by:	Vijay Singh <vijju.singh-at-gmail-dot-com>
Found by:	His team.
MFC after:	1 week
2012-11-27 20:04:52 +00:00
Pawel Jakub Dawidek
b0c9d4d70e Add kern.capmode_coredump sysctl/tunable to allow processes in capability mode
to dump core.

Reviewed by:	rwatson
Obtained from:	WHEEL Systems
MFC after:	2 weeks
2012-11-27 10:38:11 +00:00
Pawel Jakub Dawidek
f121e3e81d - Add NOCAPCHECK flag to namei that allows lookup to work even if the process
is in capability mode.
- Add VN_OPEN_NOCAPCHECK flag for vn_open_cred() that will be converted into
  NOCAPCHECK namei flag.

This functionality will be used to enable core dumps for sandboxed processes.

Reviewed by:	rwatson
Obtained from:	WHEEL Systems
MFC after:	2 weeks
2012-11-27 10:32:35 +00:00
Pawel Jakub Dawidek
90b2202145 Regenerate after r243610. 2012-11-27 10:25:03 +00:00
Pawel Jakub Dawidek
8890f5d020 Allow to use kill(2) in capability mode, but a process can send a signal only
to itself. For example abort(3) at first tries to do kill(getpid(), SIGABRT)
which was failing in capability mode, so the code was falling back to exit(1).

Reviewed by:	rwatson
Obtained from:	WHEEL Systems
MFC after:	2 weeks
2012-11-27 10:22:40 +00:00
Pawel Jakub Dawidek
b62d05fcf9 Allow to modify kern.sugid_coredump and kern.corefile from loader.conf.
Obtained from:	WHEEL Systems
2012-11-27 10:16:48 +00:00
Pawel Jakub Dawidek
c320984687 More style fixes. 2012-11-27 10:15:58 +00:00
Pawel Jakub Dawidek
23c6445a4b Style fixes (mostly whitespaces). 2012-11-27 10:11:54 +00:00
David Xu
3da9ab75f4 Take first active vnode correctly.
Reviewed by:	kib
MFC after:	3 days
2012-11-27 06:07:58 +00:00
Pawel Jakub Dawidek
4f66641749 Look for zombie process only if we were given process id.
Reviewed by:	kib
MFC after:	2 weeks
X-MFC-after-or-with:	243142
2012-11-25 19:31:42 +00:00
Andriy Gapon
6898bee9a9 remove stop_scheduler_on_panic knob
There have not been any complaints about the default behavior, so there
is no need to keep a knob that enables the worse alternative.

Now that the hard-stopping of other CPUs is the only behavior, the panic_cpu
spinlock-like logic can be dropped, because only a single CPU is
supposed to win stop_cpus_hard(other_cpus) race and proceed past that
call.

MFC after:	1 month
2012-11-25 14:22:08 +00:00
Andriy Gapon
6b991098a7 assert_vop_locked: make the assertion race-free and more efficient
this is really a minor improvement for the sake of correctness

MFC after:	6 days
2012-11-24 13:11:47 +00:00
Andriy Gapon
4f15bb6730 remove vop_lookup_pre and vop_lookup_post
Suggested by:	kib
MFC after:	5 days
2012-11-22 10:36:10 +00:00
Konstantin Belousov
daee0f0b0b Schedule garbage collection run for the in-flight rights passed over
the unix domain sockets to the next tick, coalescing the serial calls
until the collection fires.  The thought is that more work for the
collector could arise in the near time, allowing to clean more and not
spend too much CPU on repeated collection when there is no garbage.

Currently the collection task is fired immediately upon unix domain
socket close if there are any rights in flight, which caused excessive
CPU usage and too long blocking of the threads waiting for
unp_list_lock and unp_link_rwlock in write mode.

Robert noted that it would be nice if we could find some heuristic by
which we decide whether to run GC a bit more quickly.  E.g., if the
number of UNIX domain sockets is close to its resource limit, but not
quite.

Reported and tested by:	Markus Gebert <markus.gebert@hostpoint.ch>
Reviewed by:	rwatson
MFC after:	2 weeks
2012-11-20 15:45:48 +00:00
Konstantin Belousov
b7c8d2f2f5 Add a special meaning to the negative ticks argument for
taskqueue_enqueue_timeout().  Do not rearm the callout if it is
already armed and the ticks is negative.  Otherwise rearm it to fire
in abs(ticks) ticks in the future.

The intended use is to call taskqueue_enqueue_timeout() for the given
timeout_task with the same negative ticks argument.  As a result, the
task is scheduled to execute not further than abs(ticks) ticks in
future, and the consequent enqueues are coalesced until the already
scheduled task is finished.

Reviewed by:	rwatson
Tested by:	Markus Gebert <markus.gebert@hostpoint.ch>
MFC after:	2 weeks
2012-11-20 15:33:48 +00:00
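The negative-ticks rule from b7c8d2f2f5 can be modelled in a few lines of user-space C. This is a sketch of the rearm decision only: the struct and function names are hypothetical, time is a plain integer tick counter, and the ticks == 0 case is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model of a timeout task: only the callout state is kept. */
struct sketch_timeout_task {
	bool	armed;
	int	deadline;	/* absolute tick at which the task fires */
};

/* Returns true if this call (re)armed the callout. */
static bool
sketch_enqueue_timeout(struct sketch_timeout_task *t, int now, int ticks)
{
	assert(t != NULL);
	/* Negative ticks: leave an already armed callout untouched. */
	if (ticks < 0 && t->armed)
		return (false);
	/* Otherwise (re)arm to fire abs(ticks) ticks from now. */
	t->armed = true;
	t->deadline = now + abs(ticks);
	return (true);
}
```

Repeated calls with the same negative value therefore coalesce: the first call sets the deadline and later calls leave it alone, so the task runs no later than abs(ticks) ticks after the first enqueue.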
Attilio Rao
973b795b64 insmntque() is always called with the lock held in exclusive mode,
then:
- assume the lock is held in exclusive mode and remove a moot check
  about the lock acquisition.
- in the destructor remove !MPSAFE specific chunk.

Reviewed by:	kib
MFC after:	2 weeks
2012-11-19 20:43:19 +00:00
Andriy Gapon
ab49c952d9 assert_vop_locked should treat LK_EXCLOTHER as the not locked case
... from a perspective of the current thread.

Spotted by:	mjg
Discussed with:	kib
MFC after:	18 days
2012-11-19 11:35:56 +00:00
Andriy Gapon
c496727c54 vnode_if: fix locking protocol description for lookup and cachedlookup
Also remove the checks from vop_lookup_pre and vop_lookup_post, which
are now completely redundant (before this change they were partially
redundant).

Discussed with:	kib
MFC after:	10 days
2012-11-19 11:32:56 +00:00
Mateusz Guzik
dd103d4d06 Fix possible fp reference leak in posix_openpt
Reviewed by:	ed
Approved by:	trasz (mentor)
MFC after:	3 days
2012-11-18 15:48:34 +00:00
Gleb Smirnoff
716963cb5d Update comment. 2012-11-16 14:00:54 +00:00
Konstantin Belousov
134eb42e24 In pget(9), if PGET_NOTWEXIT flag is not specified, also search the
zombie list for the pid. This allows several kern.proc sysctls to
report useful information for zombies.

Hold the allproc_lock around all searches instead of relocking it.
Remove private pfind_locked() from the new nfs client code.

Requested and reviewed by:	pjd
Tested by:	pho
MFC after:	3 weeks
2012-11-16 08:25:06 +00:00
Konstantin Belousov
ea293f3f1d Restore the proper handling of the pid 0 for waitpid(2).
Fix the style around.

Reported and reviewed by:	bde (previous version)
MFC after:	28 days
2012-11-16 06:32:38 +00:00
Konstantin Belousov
a2a8559624 Style fixes for r242958.
Reported and reviewed by:	bde
MFC after:	28 days
2012-11-16 06:22:14 +00:00
Edward Tomasz Napierala
baf85d0a22 Improve KASSERT messages in racct, to make it clear which resource
caused the problem.

Submitted by:	mjg
2012-11-15 15:55:49 +00:00
Edward Tomasz Napierala
84c9193ba0 Fix kassert that's not really valid for %CPU accounting. The problem
here is race between decaying the resource usage in containers, and updating
per-process usage; basically, the former may cause per-container usage
to get smaller than per-process usage.

Submitted by:	Rudo Tomori
2012-11-15 14:11:34 +00:00
Alexander Motin
2fd4047f32 Fix bug in r242852 that prevented CPU from becoming idle if kernel built
without SMP support.
2012-11-15 14:10:51 +00:00
Jeff Roberson
28d91af30f - Implement run-time expansion of the KTR buffer via sysctl.
- Implement a function to ensure that all preempted threads have switched
   back out at least once.  Use this to make sure there are no stale
   references to the old ktr_buf or the lock profiling buffers before
   updating them.

Reviewed by:	marius (sparc64 parts), attilio (earlier patch)
Sponsored by:	EMC / Isilon Storage Division
2012-11-15 00:51:57 +00:00
Baptiste Daroussin
6f0a5dea71 Style fix
MFC after:	1 day
2012-11-14 10:33:12 +00:00
Baptiste Daroussin
6f68699fbd return ERANGE if the buffer is too small to contain the login as documented in
the manpage

Reviewed by:	cognet, kib
MFC after:	1 month
2012-11-14 10:32:12 +00:00
Mateusz Guzik
4419a8a88c enterpgrp: get rid of pgrp2 variable and use KASSERT directly on pgfind result.
pgrp2 was used only for debugging, but pgrp2 = pgfind(..) was present in compiled code even for kernels without INVARIANTS

Approved by:	trasz (mentor)
MFC after:	1 week
2012-11-13 22:01:25 +00:00
Konstantin Belousov
552e993580 Regen 2012-11-13 12:53:41 +00:00
Konstantin Belousov
f13b5a0f01 Add the wait6(2) system call. It takes POSIX waitid()-like process
designator to select a process which is waited for. The system call
optionally returns siginfo_t which would be otherwise provided to
SIGCHLD handler, as well as extended structure accounting for child
and cumulative grandchild resource usage.

Allow to get the current rusage information for non-exited processes
as well, similar to Solaris.

The explicit WEXITED flag is required to wait for exited processes,
allowing for more fine-grained control of the events the waiter is
interested in.

Fix the handling of siginfo for WNOWAIT option for all wait*(2)
family, by not removing the queued signal state.

PR:	standards/170346
Submitted by:	"Jukka A. Ukkonen" <jau@iki.fi>
MFC after:	1 month
2012-11-13 12:52:31 +00:00
Edward Tomasz Napierala
84590fd8e5 Don't divide by zero.
Tested by:	swills
2012-11-13 11:29:08 +00:00
Alexander Motin
2c27cb3a34 Several optimizations to sched_idletd():
 - Do not try to steal load from other CPUs if there were no context switches
on this CPU (i.e. it was idle all the time and woke up just for bus mastering
or TLB shootdown). If current CPU was idle, then it is quite unlikely that some
other CPU has load to steal.  Under high I/O rate, when TLB shootdowns cause
numerous CPU wakeups, on 24-CPU system load stealing code may consume up to
25% of all CPU time without giving any benefits.
 - Change code that implements spinning for load to restart spin in case of
context switch.  Previous code periodically called cpu_idle() even under
high interrupt/context switch rate.
 - Raise spinning threshold to 10KHz, where it gives at least some effect
that may be worth the consumed power.

Reviewed by:	jeff@
2012-11-10 07:02:57 +00:00
Alfred Perlstein
79f62ed690 Allow maxusers to scale on machines with large address space.
Some hooks are added to clamp down maxusers and nmbclusters for
small address space systems.

VM_MAX_AUTOTUNE_MAXUSERS - the max maxusers that will be autotuned based on
physical memory.
VM_MAX_AUTOTUNE_NMBCLUSTERS - max nmbclusters based on physical memory.

These are set to the old values on i386 to preserve the clamping that was
being done to all arches.

Another macro VM_AUTOTUNE_NMBCLUSTERS is provided to allow an override
for the calculation on a MD basis.  Currently no arch defines this.

Reviewed by: peter
MFC after: 2 weeks
2012-11-10 02:08:40 +00:00
Attilio Rao
bc2258da88 Complete MPSAFE VFS interface and remove MNTK_MPSAFE flag.
Porters should refer to __FreeBSD_version 1000021 for this change as
it may have happened at the same timeframe.
2012-11-09 18:02:25 +00:00
Marius Strobl
c882264c95 Make r242655 build on sparc64. While at it, make vm_{max,min}_kernel_address
vm_offset_t as they should be.
2012-11-08 08:10:32 +00:00
Jeff Roberson
5e5c387373 - Change ULE to use dynamic slice sizes for the timeshare queue in order
to further reduce latency for threads in this queue.  This should help
   as threads transition from realtime to timeshare.  The latency is
   bound to a max of sched_slice until we have more than sched_slice / 6
   threads runnable.  Then the min slice is allotted to all threads and
   latency becomes (nthreads - 1) * min_slice.

Discussed with: mav
2012-11-08 01:46:47 +00:00
Kevin Lo
0f5e7edc14 Fix typo; s/ouput/output 2012-11-07 07:00:59 +00:00
Alfred Perlstein
fc6874bcbb export VM_MIN_KERNEL_ADDRESS and VM_MAX_KERNEL_ADDRESS via sysctl.
On several platforms they are determined by too many nested #defines to be
easily discernible.  This will aid in development of auto-tuning.
2012-11-06 04:10:32 +00:00
Konstantin Belousov
76fd782cd9 A clarification to the behaviour of the active vnode list management
regarding the vnode page cleaning.

In collaboration with:	pho
MFC after:	1 week
2012-11-05 16:40:42 +00:00
Konstantin Belousov
90af57930c Add decoding of the missed MNT_KERN_ flags to ddb "show mount" command.
MFC after:	3 weeks
2012-11-04 13:33:13 +00:00
Konstantin Belousov
fb81941575 Add decoding of the missed VI_ and VV_ flags to ddb "show vnode" command.
MFC after:	3 days
2012-11-04 13:32:45 +00:00
Konstantin Belousov
df3161c7df Order the enumeration of the MNT_ flags to be the same as the order of
their definitions.

MFC after:	3 days
2012-11-04 13:31:41 +00:00
Ed Schouten
305921c48e Add tty_set_winsize().
This removes some of the signalling magic from the Syscons driver and
puts it in the TTY layer, where it belongs.
2012-11-03 22:21:37 +00:00
Attilio Rao
19d4153329 Merge r242395,242483 from mutex implementation:
give rwlock(9) the ability to crunch different type of structures, with
the only constraint that they have a lock cookie named rw_lock.
This name, then, becomes reserved from the struct that wants to use
the rwlock(9) KPI and other locking primitives cannot reuse it for
their members.

Namely such structs are the current struct rwlock and the new struct
rwlock_padalign. The new structure will define an object which has the
same layout of a struct rwlock but will be allocated in areas aligned
to the cache line size and will be as big as a cache line.

For further details check comments on above mentioned revisions.

Reviewed by:	jimharris, jeff
2012-11-03 15:57:37 +00:00
Alfred Perlstein
5a3a8ec037 Merge 242488, better use of strlcpy.
Submitted by:	Eric van Gyzen <eric@vangyzen.net>
2012-11-02 18:57:38 +00:00
Konstantin Belousov
140dedb81c The r241025 fixed the case when a binary, executed from nullfs mount,
was still possible to open for write from the lower filesystem.  There
is a symmetric situation where the binary could already have file
descriptors opened for write, but it can be executed from the nullfs
overlay.

Handle the issue by passing one v_writecount reference to the lower
vnode if nullfs vnode has non-zero v_writecount.  Note that only one
write reference can be donated, since nullfs only keeps one use
reference on the lower vnode.  Always use the lower vnode v_writecount
for the checks.

Introduce the VOP_GET_WRITECOUNT to read v_writecount, which is
currently always bypassed to the lower vnode, and VOP_ADD_WRITECOUNT
to manipulate the v_writecount value, which manages a single bypass
reference to the lower vnode.  Calling the VOPs instead of directly
accessing v_writecount provides the fix described in the previous
paragraph.

Tested by:	pho
MFC after:	3 weeks
2012-11-02 13:56:36 +00:00
Alfred Perlstein
bad7e7f3dd Provide a device name in the sysctl tree for programs to query the
state of crashdump target devices.

This will be used to add a "-l" (ell) flag to dumpon(8) to list the
currently configured dumpdev.

Reviewed by:	phk
2012-11-01 17:01:05 +00:00
Attilio Rao
4ceaf45de5 Rework the known mutexes that benefit from staying on their own
cache line to use struct mtx_padalign instead of manual frobbing.

The sole exception being nvme and sfxge drivers, where the author
redefined CACHE_LINE_SIZE manually, so they need to be analyzed and
dealt with separately.

Reviewed by:	jimharris, alc
2012-10-31 18:07:18 +00:00
Jim Harris
84e7a2ebb7 Pad and align the callout_cpu mtx to its own cacheline to reduce false
sharing especially on the default CPU 0 callout_cpu structure.

This will be followed up by attilio@ with a conversion to the new struct
mtx_padalign but doing this manual conversion first gives an easy MFC
candidate since mtx_padalign is a more extensive system change.

Sponsored by:	Intel
Reviewed by:	jeff, attilio
MFC after:	1 week
2012-10-31 17:12:12 +00:00
Attilio Rao
7f44c61839 Give mtx(9) the ability to crunch different type of structures, with the
only constraint that they have a lock cookie named mtx_lock.
This name, then, becomes reserved from the struct that wants to use the
mtx(9) KPI and other locking primitives cannot reuse it for their
members.

Namely such structs are the current struct mtx and the new
struct mtx_padalign.  The new structure will define an object which has
the same layout as a struct mtx but will be allocated in
areas aligned to the cache line size and will be as big as a cache line.

This is supposed to give higher performance for highly contended mutexes
both spin or sleep (because of the adaptive spinning), where the cache
line contention results in too much traffic on the system bus.

The struct mtx_padalign can be used in a completely transparent way
with the mtx(9) KPI.

At the moment, a possibility to MFC the patch should be carefully
evaluated because this patch breaks the low level KPI
(not its representation though).

Discussed with:	jhb
Reviewed by:	jeff, andre
Reviewed by:	mdf (earlier version)
Tested by:	jimharris
2012-10-31 13:38:56 +00:00
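The idea behind struct mtx_padalign in 7f44c61839 (two struct types sharing a lock cookie member of the same name, one of them padded and aligned to a cache line) can be sketched with C11 alignment support. All names below are hypothetical, the 64-byte cache line is an assumption, and the acquire helper is a single-threaded toy rather than a real lock.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

#define SKETCH_CACHE_LINE_SIZE	64	/* assumed; really per-architecture */

/* Plain lock: only the cookie, free to share a cache line with neighbours. */
struct sketch_lock {
	volatile uintptr_t lock_cookie;
};

/* Padded variant: same member name, aligned and sized to a cache line. */
struct sketch_lock_padalign {
	alignas(SKETCH_CACHE_LINE_SIZE) volatile uintptr_t lock_cookie;
};

static_assert(sizeof(struct sketch_lock_padalign) == SKETCH_CACHE_LINE_SIZE,
    "padded lock must fill exactly one cache line");

/* Toy, non-atomic acquire that operates on the cookie alone. */
static int
sketch_try_acquire_cookie(volatile uintptr_t *cookie)
{
	if (*cookie != 0)
		return (0);
	*cookie = 1;
	return (1);
}

/* Works on any struct that has a member named lock_cookie. */
#define	sketch_try_acquire(l)	sketch_try_acquire_cookie(&(l)->lock_cookie)
```

Because the macro only names the member, both struct types pass through the same interface unchanged, which is also why the member name has to be reserved for this purpose.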
Attilio Rao
5584e91718 Fixup r240246: hwpmc needs to retain the pinning until ASTs are not
executed. This means past the point where userret() is generally
executed.

Skip the td_pinned check if a callchain tracing is currently happening
and add a more robust check to pmc_capture_user_callchain() in order to
catch td_pinned leak past ast() in hwpmc case.

Reported and tested by:	fabient
MFC after:	1 week
X-MFC:	r240246
2012-10-30 15:10:50 +00:00
Attilio Rao
a049aa05c9 tdq_lock_pair() already does spinlock_enter() so migration is not
possible in sched_balance_pair(). Remove redundant sched_pin().

Reviewed by:	marius, jeff
2012-10-30 12:25:52 +00:00
Andre Oppermann
e8ad36aba4 In soreceive_stream() don't drop an already dequeued mbuf chain by
overwriting the return mbuf pointer with newly received data after
a loop.  Instead append the new mbuf chain to the existing one.

Fix up sb_lastrecord when dequeuing mbuf's so that sbappend_stream()
doesn't get confused.

For the remainder copy case in the mbuf delivery part deduct the
copied length len instead of the whole mbuf length.  Additionally
don't depend on 'n' being available which isn't true in the
case of MSG_PEEK.

Fix the MSG_WAITALL case by comparing against sb_hiwat.  Before
it was looping for every receive as sb_lowat normally is zero.
Add comment about issue with (MSG_WAITALL | MSG_PEEK) which isn't
properly handled.

Submitted by:	trociny (except for the change in last paragraph)
2012-10-29 12:31:12 +00:00
Andre Oppermann
fdd1b7f52a Add logging for socket attach failures in sonewconn() during accept(2).
Include the pointer to the PCB so it can be attributed to a particular
application by corresponding it to "netstat -A" output.

MFC after:	2 weeks
2012-10-29 12:14:57 +00:00
Kevin Lo
a2c36a0234 Since the macro dtom() has been removed, fix comments about the dtom.
Reviewed by:	glebius
2012-10-29 10:04:28 +00:00
Andre Oppermann
14d7c5b11c Improve m_cat() by being able to also merge contents from M_EXT
mbuf's by doing proper testing with M_WRITABLE().

In m_collapse() replace an incomplete manual check for M_RDONLY
with the M_WRITABLE() macro that also tests for shared buffers
and other cases that make a particular mbuf immutable.

MFC after:	2 weeks
2012-10-28 18:38:51 +00:00
Davide Italiano
ba4be2110a The fields of struct timespec32 should be int32_t and not uint32_t.
Make this change.
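
A minimal sketch of why the signedness matters, assuming a compat layer that widens 32-bit fields into 64-bit ones. The struct and function names below are hypothetical stand-ins, not the kernel's definitions. A tv_sec of -1 survives widening only when the 32-bit field is signed:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the 32-bit layout, in both variants. */
struct ts32_signed   { int32_t  tv_sec; int32_t  tv_nsec; };
struct ts32_unsigned { uint32_t tv_sec; uint32_t tv_nsec; };

/* Widen the seconds field to 64 bits; the sign is extended. */
static int64_t
sec_from_signed(struct ts32_signed ts)
{
	return ((int64_t)ts.tv_sec);
}

/* Same widening from an unsigned field; the value is zero-extended. */
static int64_t
sec_from_unsigned(struct ts32_unsigned ts)
{
	return ((int64_t)ts.tv_sec);
}
```

With the signed layout a value of -1 second stays -1 after widening; with the unsigned layout the same bit pattern becomes 4294967295.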

Reviewed by:	bde, davidxu
Tested by:	pho
MFC after:	1 week
2012-10-27 23:42:41 +00:00
Edward Tomasz Napierala
36af98697d Add CPU percentage limit enforcement to RCTL. The resource name is "pcpu".
It was implemented by Rudolf Tomori during Google Summer of Code 2012.
2012-10-26 16:01:08 +00:00
Ed Schouten
1da7bb41ed Correct SIGTTIN handling.
In the old TTY layer, SIGTTIN was correctly handled like this:

	while (data should be read) {
		send SIGTTIN if not foreground process group
		read data
	}

In the new TTY layer, however, this behaviour was changed, based on a
false interpretation of the standard:

	send SIGTTIN if not foreground process group
	while (data should be read) {
		read data
	}

Correct this by pushing tty_wait_background() into the ttydisc_read_*()
functions.
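
The difference between the two loops can be sketched with a small user-space simulation. Everything here is hypothetical: the caller is treated as being in the foreground process group only while fewer than fg_chunks chunks have been read, and a "signal" simply ends the read instead of stopping the process.

```c
#include <assert.h>
#include <stdbool.h>

struct read_result {
	int chunks_read;	/* chunks delivered to the caller */
	int sigttin_sent;	/* 1 if the simulated SIGTTIN fired */
};

/* Hypothetical rule: foreground until fg_chunks chunks were read. */
static bool
in_foreground(int chunks_read, int fg_chunks)
{
	return (chunks_read < fg_chunks);
}

/* Check once before the loop (the behaviour being corrected). */
static struct read_result
read_check_once(int chunks, int fg_chunks)
{
	struct read_result r = { 0, 0 };

	if (!in_foreground(r.chunks_read, fg_chunks)) {
		r.sigttin_sent = 1;
		return (r);
	}
	while (r.chunks_read < chunks)
		r.chunks_read++;
	return (r);
}

/* Check on every iteration (the old, correct behaviour). */
static struct read_result
read_check_each(int chunks, int fg_chunks)
{
	struct read_result r = { 0, 0 };

	while (r.chunks_read < chunks) {
		if (!in_foreground(r.chunks_read, fg_chunks)) {
			r.sigttin_sent = 1;
			break;
		}
		r.chunks_read++;
	}
	return (r);
}
```

In this sketch, a reader that loses foreground status after the first of three chunks keeps reading all three under read_check_once(), while read_check_each() stops after one chunk and raises the simulated signal.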

Reported by:	koitsu
PR:		kern/173010
MFC after:	2 weeks
2012-10-25 09:05:21 +00:00
Alfred Perlstein
7b6d92c0a0 Allow autotune maxusers > 384 on 64 bit machines
Default installs on large-memory machines with multiple 10gigE interfaces
were not being given enough mbufs to do full-bandwidth TCP or NFS traffic.

To keep the value somewhat reasonable, we scale back the number of
maxusers by 1/6 past the 384 point.  This gives us enough mbufs for most
of our pretty basic 10gigE line-speed tests to complete.
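
A minimal sketch of the scaling described above. It assumes that "by 1/6" means only one sixth of the amount above 384 is kept; the function and macro names are hypothetical and this is not the code from the commit.

```c
#include <assert.h>

#define	MAXUSERS_KNEE		384	/* threshold named in the message */
#define	MAXUSERS_DIVISOR	6	/* assumption: keep 1/6 of the excess */

/* Return the autotuned value after damping growth past the knee. */
static int
scale_maxusers(int raw)
{
	if (raw <= MAXUSERS_KNEE)
		return (raw);
	return (MAXUSERS_KNEE + (raw - MAXUSERS_KNEE) / MAXUSERS_DIVISOR);
}
```

Under this reading, a raw value of 984 is reduced to 484, while values at or below 384 pass through unchanged.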
2012-10-25 01:46:20 +00:00
Jim Harris
39f819e2fc Pad tdq_lock to avoid false sharing with tdq_load and tdq_cpu_idle.
This enables CPU searches (which read tdq_load) to operate independently
of any contention on the spinlock.  Some scheduler-intensive workloads
running on an 8C single-socket SNB Xeon show considerable improvement with
this change (2-3% perf improvement, 5-6% decrease in CPU util).
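
The padding idea can be sketched in plain C11 as follows. The struct, its field types, and the 64-byte cache line size are assumptions for illustration only, not the actual struct tdq layout.

```c
#include <assert.h>
#include <stddef.h>

#define	CACHE_LINE	64	/* assumed cache line size */

/* Hypothetical stand-in for the spinlock. */
struct fake_lock {
	volatile int word;
};

struct tdq_sketch {
	/* The lock starts a cache line of its own... */
	_Alignas(CACHE_LINE) struct fake_lock lock;
	/* ...and the padding fills the rest of that line. */
	char pad[CACHE_LINE - sizeof(struct fake_lock)];
	/* Read-mostly fields begin on the next cache line. */
	volatile int load;
	volatile int cpu_idle;
};

/* Compile-time check that the lock and load do not share a line. */
static_assert(offsetof(struct tdq_sketch, load) == CACHE_LINE,
    "load must start on its own cache line");
```

With this layout, a CPU spinning on the lock dirties only the first cache line, so other CPUs reading load or cpu_idle are not forced to refetch their line.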

Sponsored by:	Intel
Reviewed by:	jeff
2012-10-24 18:36:41 +00:00