freebsd-dev

Author	SHA1	Message	Date
Hiren Panchasara	86a996e6bd	There are times when it would be really nice to have a record of the last few packets and/or state transitions from each TCP socket. That would help with narrowing down certain problems we see in the field that are hard to reproduce without understanding the history of how we got into a certain state. This change provides just that. It saves copies of the last N packets in a list in the tcpcb. When the tcpcb is destroyed, the list is freed. I thought this was likely to be more performance-friendly than saving copies of the tcpcb. Plus, with the packets, you should be able to reverse-engineer what happened to the tcpcb. To enable the feature, you will need to compile a kernel with the TCPPCAP option. Even then, the feature defaults to being deactivated. You can activate it by setting a positive value for the number of captured packets. You can do that on either a global basis or on a per-socket basis (via a setsockopt call). There is no way to get the packets out of the kernel other than using kmem or getting a coredump. I thought that would help some of the legal/privacy concerns regarding such a feature. However, it should be possible to add a future effort to export them in PCAP format. I tested this at low scale, and found that there were no mbuf leaks and the peak mbuf usage appeared to be unchanged with and without the feature. The main performance concern I can envision is the number of mbufs that would be used on systems with a large number of sockets. If you save five packets per direction per socket and have 3,000 sockets, that will consume at least 30,000 mbufs just to keep these packets. I tried to reduce the concerns associated with this by limiting the number of clusters (not mbufs) that could be used for this feature. Again, in my testing, that appears to work correctly. Differential Revision: D3100 Submitted by: Jonathan Looney <jlooney at juniper dot net> Reviewed by: gnn, hiren	2015-10-14 00:35:37 +00:00
Edward Tomasz Napierala	92001b9497	Change the default setting of kern.ipc.shm_allow_removed from 0 to 1. This removes the need for manually changing this flag for Google Chrome users. It also improves compatibility with Linux applications running under Linuxulator compatibility layer, and possibly also helps in porting software from Linux. Generally speaking, the flag allows applications to create the shared memory segment, attach it, remove it, and then continue to use it and to reattach it later. This means that the kernel will automatically "clean up" after the application exits. It could be argued that it's against POSIX. However, SUSv3 says this about IPC_RMID: "Remove the shared memory identifier specified by shmid from the system and destroy the shared memory segment and shmid_ds data structure associated with it." From my reading, we break it in any case by deferring removal of the segment until it's detached; we won't break it any more by also deferring removal of the identifier. This is the behaviour exhibited by Linux since... probably always, and also by OpenBSD since the following commit: revision 1.54 date: 2011/10/27 07:56:28; author: robert; state: Exp; lines: +3 -8; Allow segments to be used even after they were marked for deletion with the IPC_RMID flag. This is permitted as an extension beyond the standards and this is similar to what other operating systems like linux do. MFC after: 1 month Relnotes: yes Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D3603	2015-10-10 09:29:47 +00:00
Edward Tomasz Napierala	b9a5c7b595	Provide better debug message on kernel module name clash. Reviewed by: kib@ MFC after: 1 month Sponsored by: The FreeBSD Foundation	2015-10-10 09:21:55 +00:00
Edward Tomasz Napierala	8d90e66066	Remove root_mount_wait(). It's not used anywhere. Reviewed by: bapt@ MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D3787	2015-10-09 12:11:37 +00:00
Konstantin Belousov	4b48959f9f	Enforce the maxproc limitation before allocating struct proc, initial struct thread and kernel stack for the thread. Otherwise, a load similar to a fork bomb would exhaust KVA and possibly kmem, mostly due to the struct proc being type-stable. The nprocs counter is changed from being protected by allproc_lock sx to be an atomic variable. Note that ddb/db_ps.c:db_ps() use of nprocs was unsafe before, and is still unsafe, but it seems that the only possible undesired consequence is the harmless warning printed when allproc linked list length does not match nprocs. Diagnosed by: Svatopluk Kraus <onwahe@gmail.com> Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-10-08 11:07:09 +00:00
Fabien Thomas	78e79434d2	Fix r283998 that broke mapin events for hwpmc. Reviewed by: jhb Sponsored by: Stormshield	2015-10-08 09:54:33 +00:00
Gleb Smirnoff	e40e8705db	Fix regression from r248371. We need to copy the packet header to the new mbuf. Unlike in the pre-r248371 code, assert that M_PKTHDR is set only on the first mbuf. Reported & tested by: Andriy Voskoboinyk <s3erios gmail.com> Sponsored by: Nginx, Inc.	2015-10-07 12:40:00 +00:00
John Baldwin	189ac973de	Fix various edge cases related to system call tracing. - Always set td_dbg_sc_* when P_TRACED is set on system call entry even if the debugger is not tracing system call entries. This ensures the fields are valid when reporting other stops that occur at system call boundaries such as for PT_FOLLOW_FORKS or when only tracing system call exits. - Set TDB_SCX when reporting the stop for a new child process in fork_return(). This causes the event to be reported as a system call exit. - Report a system call exit event in fork_return() for new threads in a traced process. - Copy td_dbg_sc_* to new threads instead of zeroing. This ensures that td_dbg_sc_code in particular will report the system call that created the new thread or process when it reports a system call exit event in fork_return(). - Add new ptrace tests to verify that new child processes and threads report system call exit events with a valid pl_syscall_code via PT_LWPINFO. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D3822	2015-10-06 19:29:05 +00:00
Conrad Meyer	e6b95927f3	Fix core corruption caused by race in note_procstat_vmmap This fix is spiritually similar to r287442 and was discovered thanks to the KASSERT added in that revision. NT_PROCSTAT_VMMAP output length, when packing kinfo structs, is tied to the length of filenames corresponding to vnodes in the process' vm map via vn_fullpath. As vnodes may move during coredump, this is racy. We do not remove the race, only prevent it from causing coredump corruption. - Add a sysctl, kern.coredump_pack_vmmapinfo, to allow users to disable kinfo packing for PROCSTAT_VMMAP notes. This avoids VMMAP corruption and truncation, even if names change, at the cost of up to PATH_MAX bytes per mapped object. The new sysctl is documented in core.5. - Fix note_procstat_vmmap to self-limit in the second pass. This addresses corruption, at the cost of sometimes producing a truncated result. - Fix PROCSTAT_VMMAP consumers libutil (and libprocstat, via copy-paste) to grok the new zero padding. Reported by: pho (https://people.freebsd.org/~pho/stress/log/datamove4-2.txt) Relnotes: yes Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D3824	2015-10-06 18:07:00 +00:00
Gleb Smirnoff	640082d498	Remove debugging variable from r143761.	2015-10-06 09:43:49 +00:00
John Baldwin	3edd0ffffe	Include additional info in ptrace(2) KTR traces: - The new PC value and signal passed to PT_CONTINUE, PT_DETACH, PT_SYSCALL, and PT_TO_SC[EX]. - The system call code returned via PT_LWPINFO. MFC after: 1 week	2015-10-05 21:36:53 +00:00
Mark Johnston	403ec61cbb	Revert r288628 and instead fix a discrepancy between the posix_fadvise(2) man page and POSIX: posix_fadvise(2) returns an error number on failure. Reported by: jilles MFC after: 1 week	2015-10-03 22:27:14 +00:00
Mark Johnston	a7713f7631	The return value of posix_fadvise(2) is just an error status, so sys_posix_fadvise() should simply return the errno (or 0) to syscallenter() rather than setting a return value. MFC after: 1 week	2015-10-03 19:37:41 +00:00
Alan Cox	acada7aef0	Perform a single batched update to the object's paging-in-progress count rather than updating it for each page.	2015-10-03 17:04:52 +00:00
Poul-Henning Kamp	d58b610faa	Fail the sbuf if vsnprintf(3) fails.	2015-10-02 09:23:14 +00:00
Mark Johnston	0a19cfd454	Ensure that vop_stdadvise() does not call getblk() on vnodes that have an empty bufobj. Otherwise, vnodes belonging to filesystems that do not use the buffer cache may trigger assertion failures. Reported by: Fabien Keil	2015-10-01 16:34:53 +00:00
Colin Percival	2eb0015ab7	Disable suspend when we're shutting down. This solves the "tell FreeBSD to shut down; close laptop lid" scenario which otherwise tended to end with a laptop overheating or the battery dying. The implementation uses a new sysctl, kern.suspend_blocked; init(8) sets this while rc.suspend runs, and the ACPI sleep code ignores requests while the sysctl is set. Discussed on: freebsd-acpi (35 emails) MFC after: 1 week	2015-10-01 10:52:26 +00:00
Mark Johnston	3138cd3670	As a step towards the elimination of PG_CACHED pages, rework the handling of POSIX_FADV_DONTNEED so that it causes the backing pages to be moved to the head of the inactive queue instead of being cached. This affects the implementation of POSIX_FADV_NOREUSE as well, since it works by applying POSIX_FADV_DONTNEED to file ranges after they have been read or written. At that point the corresponding buffers may still be dirty, so the previous implementation would coalesce successive ranges and apply POSIX_FADV_DONTNEED to the result, ensuring that pages backing the dirty buffers would eventually be cached. To preserve this behaviour in an efficient manner, this change adds a new buf flag, B_NOREUSE, which causes the pages backing a VMIO buf to be placed at the head of the inactive queue when the buf is released. POSIX_FADV_NOREUSE then works by setting this flag in bufs that underlie the specified range. Reviewed by: alc, kib Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D3726	2015-09-30 23:06:29 +00:00
Andriy Gapon	2f2f522b5d	Save some bytes by using the more concise SDT_PROBE<n> instead of SDT_PROBE. SDT_PROBE requires 5 parameters, whereas SDT_PROBE<n> requires n parameters, where n is typically smaller than 5. Perhaps SDT_PROBE should be made a private implementation detail. MFC after: 20 days	2015-09-28 12:14:16 +00:00
Jeff Roberson	4615830db2	- Collapse vfs_vmio_truncate & vfs_vmio_release into a single function. - Allow vfs_vmio_invalidate() to free the pages, leaving us with a single loop and bufobj lock when B_NOCACHE/B_INVAL is used. - Eliminate the special B_ASYNC handling on free that has not been relevant for some time. - Remove the extraneous page busy from vfs_vmio_truncate(). Reviewed by: kib Tested by: pho Sponsored by: EMC / Isilon storage division	2015-09-27 05:16:06 +00:00
Mark Johnston	0a805de6f3	Remove a check for a condition that is always false by a preceding KASSERT that was added in r144704.	2015-09-26 22:26:55 +00:00
Mark Johnston	d925c2e800	Fix argument ordering in vn_printf(). MFC after: 3 days	2015-09-26 22:16:54 +00:00
Conrad Meyer	2f1c4e0ebf	sbuf: Process more than one char at a time. Revamp sbuf_put_byte() to sbuf_put_bytes() in the obvious fashion and fix up callers. Add a thin shim around sbuf_put_bytes() with the old ABI to avoid ugly changes to some callers. Reviewed by: jhb, markj Obtained from: Dan Sledz Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D3717	2015-09-25 18:37:14 +00:00
Konstantin Belousov	b2557db607	Use per-cpu values for base and last in tc_cpu_ticks(). The values are updated locklessly; each CPU writes its own view of the timecounter state. The critical section is there for safety; callers of tc_cpu_ticks() are supposed to have already entered a critical section or to own a spinlock. The change fixes sporadic reports of too-high values for the (W)CPU on platforms that do not provide a cpu ticker and use tc_cpu_ticks(), in particular, arm*. Diagnosed and reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-09-25 13:03:57 +00:00
Mateusz Guzik	3c44a3495f	kqueue: simplify kern_kqueue by not refing/unrefing creds too early No functional changes.	2015-09-23 12:45:08 +00:00
Jeff Roberson	589c956a5a	- Fix a nonsense reordering that somehow slipped into my last diff. Reported by: pho	2015-09-23 07:44:07 +00:00
Jeff Roberson	8264830c95	Some refactoring of the buf/vm interface. - Eliminate bogus page replacement that is inconsistently applied in the invalidation loop in brelse. This has been a no-op in modern times as biodone() is responsible for cleaning up after bogus pages. This would've spammed the console with printfs at a minimum. - Allow the compiler and human readers alike to reason about allocbuf() by splitting it into constituent parts. - Separate the VM manipulating and buf manipulating code in brelse() and bufdone() so that the intentions are clear. This makes it evident that there are several duplicated buf pages loops that will be consolidated at a later time. Reviewed by: kib Tested by: pho Sponsored by: EMC / Isilon Storage Division	2015-09-22 23:57:52 +00:00
Alan Cox	15aaea7892	Change vm_page_unwire() such that it (1) accepts PQ_NONE as the specified queue and (2) returns a Boolean indicating whether the page's wire count transitioned to zero. Exploit this change in vfs_vmio_release() to avoid pointlessly enqueueing a page that is about to be freed. (An earlier version of this change was developed by attilio@ and kmacy@. Any errors in this version are my own.) Reviewed by: kib Sponsored by: EMC / Isilon Storage Division	2015-09-22 18:16:52 +00:00
Hans Petter Selasky	c55f4c9445	Revert r287780 until more developers have their say. Differential Revision: https://reviews.freebsd.org/D3521 Requested by: gnn	2015-09-22 06:51:55 +00:00
Bryan Drewery	6c5c24c98c	vfs_mountroot_shuffle() never returns non-zero.	2015-09-22 03:34:07 +00:00
Konstantin Belousov	1f57d8c66b	Ensure that maxproc does not exceed pid_max, at the time of boot. Changes to kern.pid_max mib after the boot can break this relation. The maxfiles value was calculated by the MAXFILES formula based on maxproc value, but this change decouples them, and MAXFILES now references maxusers. Without manual tuning, the maxfiles default value remains as it was prior to this commit. But for systems which have tuned maxproc and rely on maxfiles to adjust, additional reconfiguration is needed. Reported by: rwatson Reviewed by: emaste Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-09-21 15:02:59 +00:00
Konstantin Belousov	cff8c6f2d1	Add support for weak symbols to the kernel linkers. It means that linkers no longer raise an error when undefined weak symbols are found, but relocate as if the symbol value was 0. Note that we do not repeat the mistake of the userspace dynamic linker of making the symbol lookup prefer a non-weak symbol definition over the weak one, if both are available. In fact, the kernel linker uses the first definition found, and ignores duplicates. The signature of the elf_lookup() and elf_obj_lookup() functions changed to split the result/error code and the symbol address returned. Otherwise, it is impossible to return a zero address as the symbol value to the MD relocation code. This explains the mechanical changes in elf_machdep.c sources. The powerpc64 R_PPC_JMP_SLOT handler did not check the error from the lookup() call; the patch leaves the code as is (untested). Reported by: glebius Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-09-20 01:27:59 +00:00
Edward Tomasz Napierala	0d3d0cc358	Kernel part of reroot support - a way to change rootfs without reboot. Note that the mountlist manipulations are somewhat fragile, and not very pretty. The reason for this is to avoid changing vfs_mountroot(), which is (obviously) rather mission-critical, but not very well documented, and thus hard to test properly. It might be possible to rework it to use its own simple root mount mechanism instead of vfs_mountroot(). Reviewed by: kib@ MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D2698	2015-09-18 17:32:22 +00:00
John Baldwin	bdd64116b0	Always clear TDB_USERWR before fetching system call arguments. The TDB_USERWR flag may still be set after a debugger detaches from a process via PT_DETACH. Previously the flag would never be cleared forcing a double fetch of the system call arguments for each system call. Note that the flag cannot be cleared at PT_DETACH time in case one of the threads in the process is currently stopped in syscallenter() and the debugger has modified the arguments for that pending system call before detaching. Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D3678	2015-09-16 20:55:00 +00:00
John Baldwin	4295bec16a	When a process group leader exits, all of the processes in the group are sent SIGHUP and SIGCONT if any of the processes are stopped. Currently this behavior is triggered for any type of process stop including ptrace() stops and transient stops for single threading during exit() and execve(). Thus, if a debugger is attached to a process in a group when the leader exits, the entire group can be HUPed. Instead, only send the signals if a process in the group is stopped due to SIGSTOP. PR: 201149 Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D3681	2015-09-16 16:40:07 +00:00
Mateusz Guzik	7665e341ca	sysctl: switch sysctllock to a sleepable rmlock, take 2. This restores r285125. The previous attempt was reverted due to a bug in rmlocks, which has been fixed since r287833.	2015-09-15 23:06:56 +00:00
John Baldwin	e89d5f43da	Threads holding a read lock of a sleepable rm lock are not permitted to sleep. The rmlock implementation enforces this by disabling sleeping when a read lock is acquired. To simplify the implementation, sleeping is disabled for most of the duration of rm_rlock. However, it doesn't need to be disabled until the lock is acquired. If a sleepable rm lock is contested, then rm_rlock may need to acquire the backing sx lock. This tripped the overly-broad assertion. Fix by relaxing the assertion around the call to sx_xlock(). Reported by: mjg Reviewed by: kib, mjg MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D3324	2015-09-15 22:16:21 +00:00
Conrad Meyer	55d33667ee	kevent(2): Note DOOMED vnodes with NOTE_REVOKE In poll mode, check for and wake VBAD vnodes. (Vnodes that are VBAD at registration will never be woken by the RECLAIM trigger.) Add post-VOP_RECLAIM hook to trigger notes on vnode reclamation. (Vnodes that were fine at registration but are vgoned while being monitored should signal waiters.) Reviewed by: kib Approved by: markj (mentor) Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D3675	2015-09-15 20:22:30 +00:00
Hans Petter Selasky	9acc0eafd7	Implement callout_drain_async(), inspired by the projects/hps_head branch. This function is used to drain a callout via a callback instead of blocking the caller until the drain is complete. Refer to the callout_drain_async() manual page for a detailed description. Limitation: If a lock is used with the callout, the callout can only be drained asynchronously one time unless the callout_init_mtx() function is called again. This limitation is not present in projects/hps_head and will require more invasive changes to the timeout code, which was not in the scope of this patch. Differential Revision: https://reviews.freebsd.org/D3521 Reviewed by: wblock MFC after: 1 month	2015-09-14 10:52:26 +00:00
Warner Losh	7297c5e535	bufdonebio is now unused. Retire it too.	2015-09-11 04:20:04 +00:00
Mark Johnston	610141cebb	Add stack_save_td_running(), a function to trace the kernel stack of a running thread. It is currently implemented only on amd64 and i386; on these architectures, it is implemented by raising an NMI on the CPU on which the target thread is currently running. Unlike stack_save_td(), it may fail, for example if the thread is running in user mode. This change also modifies the kern.proc.kstack sysctl to use this function, so that stacks of running threads are shown in the output of "procstat -kk". This is handy for debugging threads that are stuck in a busy loop. Reviewed by: bdrewery, jhb, kib Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D3256	2015-09-11 03:54:37 +00:00
Warner Losh	ad8d57a99d	dev_strategy and dev_strategy_csw are unused since r281825. Remove them. Differential Revision: https://reviews.freebsd.org/D3620	2015-09-11 00:38:58 +00:00
Adrian Chadd	32766cd281	Also make kern.maxfilesperproc a boot time tunable. Auto-tuning threshold discussions aside, it turns out that if you want to lower this on say, rather memory-packed machines, you either set maxusers or kern.maxfiles, or you set it in sysctl. The former is a non-exact way to tune this; the latter doesn't actually affect anything in the startup scripts. This first occurred because I wondered why the hell screen would take upwards of 10 seconds to spawn a new screen. I then found python doing the same thing during fork/exec of child processes - it calls close() on each FD up to the current openfiles limit. On a 1TB machine this is like, 26 million FDs per process. Ugh. So: * This allows it to be set early in /boot/loader.conf; * It can be used to work around the ridiculous situation of screen, python, etc doing a close() on potentially millions of FDs even though you only have four open. Tested: * 4GB, 32GB, 64GB, 128GB, 384GB, 1TB systems with autotune, ensuring screen and python forking doesn't result in some pretty hilariously bad behaviour. TODO: * Note that the default login.conf sets openfiles-cur to unlimited, effectively obeying kern.maxfilesperproc. Perhaps we should fix this. * .. and even if we do, we need to also ensure that daemons get a soft limit of something reasonable and capped - they can request more FDs themselves. MFC after: 1 week Sponsored by: Norse Corp, Inc.	2015-09-10 04:05:58 +00:00
Konstantin Belousov	9e18c9eb27	For open("name", O_DIRECTORY | O_CREAT), do not try to create the named node, open(2) cannot create directories. But do allow the flag combination to succeed if the directory already exists. Declare the open("name", O_DIRECTORY | O_CREAT | O_EXCL) always invalid for the same reason, since open(2) cannot create directory. Note that there is an argument that O_DIRECTORY | O_CREAT should be invalid always, regardless of the target directory existence or O_EXCL. The current fix is conservative and allows the call to succeed in the situation where it succeeded before the patch. Reported by: Tom Ridge <freebsd@tom-ridge.com> Reviewed by: rwatson PR: 202892 Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-09-09 19:31:08 +00:00
Mateusz Guzik	9af8c8b72b	fd: make rights a mandatory argument to fgetvp_rights The only caller already always passes rights.	2015-09-07 20:05:56 +00:00
Mateusz Guzik	d7832811a7	fd: make the common case in filecaps_copy work lockless The filedesc lock is only needed if ioctls caps are present, which is a rare situation. This is a step towards reducing the scope of the filedesc lock.	2015-09-07 20:02:56 +00:00
Conrad Meyer	bcb60d52e6	Follow-up to r287442: Move sysctl to compiled-once file Avoid duplicate sysctl nodes. Found by: tijl Approved by: markj (mentor) Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D3586	2015-09-07 16:44:28 +00:00
Kirk McKusick	17518b1a2b	Track changes to kern.maxvnodes and appropriately increase or decrease the size of the name cache hash table (mapping file names to vnodes) and the vnode hash table (mapping mount point and inode number to vnode). An appropriate locking strategy is the key to changing hash table sizes while they are in active use. Reviewed by: kib Tested by: Peter Holm Differential Revision: https://reviews.freebsd.org/D2265 MFC after: 2 weeks	2015-09-06 05:50:51 +00:00
Xin LI	28ffe927c2	Expose an interface to determine if an ACE is inherited. Submitted by: sef Reviewed by: trasz MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D3540	2015-09-04 00:14:20 +00:00
Conrad Meyer	14bdbaf2e4	Detect badly behaved coredump note helpers Coredump notes depend on being able to invoke dump routines twice; once in a dry-run mode to get the size of the note, and another to actually emit the note to the corefile. When a note helper emits a different length section the second time around than the length it requested the first time, the kernel produces a corrupt coredump. NT_PROCSTAT_FILES output length, when packing kinfo structs, is tied to the length of filenames corresponding to vnodes in the process' fd table via vn_fullpath. As vnodes may move around during dump, this is racy. So: - Detect badly behaved notes in putnote() and pad underfilled notes. - Add a fail point, debug.fail_point.fill_kinfo_vnode__random_path to exercise the NT_PROCSTAT_FILES corruption. It simply picks random lengths to expand or truncate paths to in fo_fill_kinfo_vnode(). - Add a sysctl, kern.coredump_pack_fileinfo, to allow users to disable kinfo packing for PROCSTAT_FILES notes. This should avoid both FILES note corruption and truncation, even if filenames change, at the cost of about 1 kiB in padding bloat per open fd. Document the new sysctl in core.5. - Fix note_procstat_files to self-limit in the 2nd pass. Since sometimes this will result in a short write, pad up to our advertised size. This addresses note corruption, at the risk of sometimes truncating the last several fd info entries. - Fix NT_PROCSTAT_FILES consumers libutil and libprocstat to grok the zero padding. With suggestions from: bjk, jhb, kib, wblock Approved by: markj (mentor) Relnotes: yes Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D3548	2015-09-03 20:32:10 +00:00
Mateusz Guzik	7e8f566c0c	fd: remove UMA_ZONE_ZINIT argument from Files zone Originally it was added in order to prevent trashing of objects with INVARIANTS enabled. The same effect is now provided with mere UMA_ZONE_NOFREE. This reverts r286921. Discussed with: kib	2015-09-02 23:14:39 +00:00
John Baldwin	ded3e7f08e	The 'sa' argument to syscallret() is not unused.	2015-09-01 22:28:23 +00:00
John Baldwin	183b68f74f	Export current system call code and argument count for system call entry and exit events. procfs stop events for system call tracing report these values (argument count for system call entry and code for system call exit), but ptrace() does not provide this information. (Note that while the system call code can be determined in an ABI-specific manner during system call entry, it is not generally available during system call exit.) The values are exported via new fields at the end of struct ptrace_lwpinfo available via PT_LWPINFO. Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D3536	2015-09-01 22:24:54 +00:00
Konstantin Belousov	6ae26d06dc	Exit notification for EVFILT_PROC removes knote from the knlist. In particular, this invalidates the knote kn_link linkage, making the SLIST_FOREACH() loop accessing undefined values (e.g. trashed by QUEUE_MACRO_DEBUG). If the knote is freed by other thread when kq lock is released or when influx is cleared, e.g. by knote_scan() for kqueue owning the knote, the iteration step would access freed memory. Use SLIST_FOREACH_SAFE() to fix iteration. Diagnosed by: avg Tested by: avg, lstewart, pawel Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-09-01 14:05:29 +00:00
Konstantin Belousov	78b9afe121	Clean up the kqueue use of the uma KPI. Explain why it is fine to not check for M_NOWAIT failures in kqueue_register(). Remove unneeded check for NULL result from waitable allocation in kqueue_scan(). uma_free(9) handles NULL argument correctly, remove checks for NULL. Remove useless cast and adjust style in knote_alloc(). Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-09-01 13:21:32 +00:00
Andriy Gapon	378d5c6c89	callout_reset: fix a reversed check for cc_exec_cancel. The typo was introduced in r278469 / `344ecf88af`. As a result of the bug there was a timing window where callout_reset() would fail to cancel a concurrent execution of a callout that is about to start and would schedule the callout again. The callout would fire more times than it is scheduled. That would happen even if the callout is initialized with a lock. For example, the bug triggered the "Stray timeout" assertion in taskqueue_timeout_func(). MFC after: 5 days	2015-09-01 09:27:14 +00:00
Konstantin Belousov	8d830e02f5	Use P1B_PRIO_MAX to designate max posix priority for the RR/FIFO scheduler types. It was intended to be used there; compare with the min value, and with the test for correctness in ksched_setscheduler(). Note that P1B_PRIO_MAX and RTP_PRIO_MAX have the same numerical value, so the change is cosmetic. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-08-30 18:02:57 +00:00
Konstantin Belousov	44e629f18d	Remove single-use macros obfuscating malloc(9) and free(9) calls. Style. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-08-30 17:58:11 +00:00
Julien Charbon	2ea3089cb1	Revert r286880: Although this change made sense at first, it turns out that it helps only the TCP timers' callout(9) usage. As the benefit for other callout(9) usages did not reach a consensus, the historical usage should prevail. Differential Revision: https://reviews.freebsd.org/D3078	2015-08-30 13:44:46 +00:00
Warner Losh	de830d432c	Remove now obsolete comment. MFC After: 2 days	2015-08-28 20:06:58 +00:00
Warner Losh	3f27281613	Per overwhelming sentiment in the code review, use FEATURE instead. Differential Revision: https://reviews.freebsd.org/D3488 MFC After: 2 days	2015-08-28 19:53:19 +00:00
Ed Schouten	bc1ace0b96	Decompose linkat()/renameat() rights to source and target. To make it easier to understand how Capsicum interacts with linkat() and renameat(), rename the rights to CAP_{LINK,RENAME}AT_{SOURCE,TARGET}. This also addresses a shortcoming in Capsicum, where it isn't possible to disable linking to files stored in a directory. Creating hardlinks essentially makes it possible to access files with additional rights. Reviewed by: rwatson, wblock Differential Revision: https://reviews.freebsd.org/D3411	2015-08-27 15:16:41 +00:00
Julien Charbon	cd252ea74d	Silence a compilation warning on callout_stop()	2015-08-27 10:43:35 +00:00
Julien Charbon	682d0e15b5	In callout_stop(), do not forget to initialize not_running variable. Thanks to hselasky for noticing that. Differential Revision: https://reviews.freebsd.org/D3078 (Updated) Submitted by: hselasky Pointy hat to: jch	2015-08-27 08:58:03 +00:00
Julien Charbon	0cfae4b4bc	In callout_stop(), if a callout is both pending and currently being serviced, return 0 (fail), but this applies only to mpsafe callouts. Thanks to hselasky for finding this. Differential Revision: https://reviews.freebsd.org/D3078 (Updated) Submitted by: hselasky Reviewed by: jch	2015-08-27 08:15:32 +00:00
Marcel Moolenaar	898b510468	An error of -1 from parse_mount() indicates that the specification was invalid. Don't trigger a mount failure (which by default means a panic), but instead just move on to the next directive in the configuration. This typically has us ask for the root mount. PR: 163245	2015-08-27 04:25:27 +00:00
Warner Losh	135342777c	When the kernel is compiled with INVARIANTS, export that as debug.invariants. Differential Revision: https://reviews.freebsd.org/D3488 MFC after: 3 days	2015-08-26 23:58:03 +00:00
George V. Neville-Neil	57031f7912	Summary: Add the interactivity equations to the header comment for our interactivity calculation routine. Suggested by: rwatson	2015-08-26 16:36:41 +00:00
Edward Tomasz Napierala	c9ba65040f	Make vfs_unmountall() unmount /dev after /, not before. The only reason this didn't result in an unclean shutdown is that devfs ignores MNT_FORCE flag. Reviewed by: kib@ MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D3467	2015-08-24 13:18:13 +00:00
Edward Tomasz Napierala	6e572e084b	After r286237 it should be fine to call vgone(9) on a busy GEOM vnode; remove KASSERT that would prevent forced devfs unmount from working. MFC after: 1 month Sponsored by: The FreeBSD Foundation	2015-08-23 14:53:54 +00:00
Roger Pau Monné	e8234cfef6	preload_search_info: make sure mod is set Add a check to preload_search_info to make sure mod is set. Most of the callers of preload_search_info don't check that the mod parameter is set, which can cause page faults. While at it, remove some now unnecessary checks before calling preload_search_info. Sponsored by: Citrix Systems R&D Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D3440	2015-08-21 15:57:57 +00:00
Konstantin Belousov	41d50cd6b7	If a process becomes a reaper (procctl(PROC_REAP_ACQUIRE)) while already having some children, the children's reaper is not reset to the parent. This allows for the situation where the reaper has children but no descendants, and the too-strict asserts in reap_status() fire. Remove the wrong asserts, add some clarification for the situation to the procctl(2) REAP_STATUS. Reported and tested by: feld Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-08-20 22:44:26 +00:00
Konstantin Belousov	fe5ec54b50	fget_unlocked() depends on the freed struct file f_count field being zero. The file_zone is no-free, but r284861 added trashing of the freed memory. The most visible manifestations of the issue were 'memory modified after free' panics for the file zone, triggered from falloc_noinstall(). Add UMA_ZONE_ZINIT flag to turn off trashing. Mjg noted that it makes sense to not trash freed memory for any no-free zone, which will be done later. Reported and tested by: pho Discussed with: mjg Sponsored by: The FreeBSD Foundation	2015-08-19 11:53:32 +00:00
Julien Charbon	a1e6f8ff27	callout_stop() should return 0 (fail) when the callout is currently being serviced and indeed unstoppable. A scenario to reproduce this case is: - the callout is being serviced and at same time, - callout_reset() is called on this callout that sets the CALLOUT_PENDING flag and at same time, - callout_stop() is called on this callout and returns 1 (success) even if the callout is indeed currently running and unstoppable. This issue was caught up while making r284245 (D2763) workaround, and was discussed at BSDCan 2015. Once applied the r284245 workaround is not needed anymore and will be reverted. Differential Revision: https://reviews.freebsd.org/D3078 Reviewed by: jhb Sponsored by: Verisign, Inc.	2015-08-18 10:15:09 +00:00
Rui Paulo	aea3463e34	genassym.sh: call nm(1) with NMFLAGS.	2015-08-14 22:57:13 +00:00
Ian Lepore	e8bac3f240	If a specific timecounter has been chosen via sysctl, and a new timecounter with higher quality registers (presumably in a module that has just been loaded), do not undo the user's choice by switching to the new timecounter. Document that behavior, and also the fact that there is no way to unregister a timecounter (and thus no way to unload a module containing one).	2015-08-12 20:50:20 +00:00
Mariusz Zaborski	53bf545ddb	When the wait*(2) syscalls wait for any process (P_ALL), they should ignore processes created with the pdfork(2) syscall. PR: 201054 Approved by: pjd (mentor) Discussed with: emaste, rwatson	2015-08-12 20:08:54 +00:00
Ed Schouten	880e2c6c52	Perform cleanups in response to D3307. - Document the kern_kevent_anonymous() function. - Add assertions to ensure that we don't silently leave the kqueue linked from a file descriptor table. Reviewed by: jmg Differential Revision: https://reviews.freebsd.org/D3364	2015-08-12 17:46:26 +00:00
Ed Schouten	8c43e4ccfa	Properly return ENOTDIR when calling *at() on a non-vnode. We already properly return ENOTDIR when calling *at() on a non-directory vnode, but it turns out that if you call it on a socket, we see EINVAL. Patch up namei to properly translate this to ENOTDIR.	2015-08-12 16:17:00 +00:00
Ed Schouten	f3fe76ecd8	Unignore signals when starting CloudABI processes. As CloudABI processes cannot adjust their signal handlers, we need to make sure that we start up CloudABI processes with consistent signal masks. Though the POSIX standard signal behavior is all right, we do need to make sure that we ignore SIGPIPE, as it would otherwise be hard to interact with pipes and sockets. Extend execsigs() to iterate over ps_sigignore and call sigdflt() for each of the ignored signals. Reviewed by: kib Obtained from: https://github.com/NuxiNL/freebsd Differential Revision: https://reviews.freebsd.org/D3365	2015-08-12 11:30:31 +00:00
Ed Schouten	e26f6b5f6b	Add support for anonymous kqueues. CloudABI's polling system calls merge the concept of one-shot polling (poll, select) and stateful polling (kqueue). They share the same data structures. Extend FreeBSD's kqueue to provide support for waiting for events on an anonymous kqueue. Unlike stateful polling, there is no need to support timeouts, as an additional timer event could be used instead. Furthermore, it makes no sense to use a different number of input and output kevents. Merge this into a single argument. Obtained from: https://github.com/NuxiNL/freebsd Differential Revision: https://reviews.freebsd.org/D3307	2015-08-11 13:47:23 +00:00
Ed Schouten	aa04a06df5	Introduce kern_cap_rights_limit(). The existing sys_cap_rights_limit() expects that a cap_rights_t object lives in userspace. It is therefore hard to call into it from kernelspace. Move the interesting bits of sys_cap_rights_limit() into kern_cap_rights_limit(), so that we can call into it from the CloudABI compatibility layer. Obtained from: https://github.com/NuxiNL/freebsd Differential Revision: https://reviews.freebsd.org/D3314	2015-08-11 08:43:50 +00:00
Konstantin Belousov	edc8222303	Make kstack_pages a tunable on arm, x86, and powerpc. On i386, the initial thread stack is not adjusted by the tunable, the stack is allocated too early to get access to the kernel environment. See TD0_KSTACK_PAGES for the thread0 stack sizing on i386. The tunable was tested on x86 only. From the visual inspection, it seems that it might work on arm and powerpc. The arm USPACE_SVC_STACK_TOP and powerpc USPACE macros seem to be already incorrect for the threads with non-default kstack size. I only changed the macros to use variable instead of constant, since I cannot test. On arm64, mips and sparc64, some static data structures are sized by KSTACK_PAGES, so the tunable is disabled. Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-08-10 17:18:21 +00:00
Alexander V. Chernikov	0cbefd30cb	Add const-qualifiers for source mbuf argument in m_dup(), m_copym(), m_dup_pkthdr() and m_tag_copy_chain().	2015-08-08 15:50:46 +00:00
Ian Lepore	721b581722	Only process the PPS event types currently enabled in pps_params.mode. This makes the PPS API behave correctly, but isn't ideal -- we still end up capturing PPS data for non-enabled edges, we just don't process the data into an event that becomes visible outside of kern_tc. That's because the event type isn't passed to pps_capture(), so it can't do the filtering. Any solution for capture filtering is going to require touching every driver.	2015-08-07 23:31:31 +00:00
Ian Lepore	6f7a9f7c8d	RFC 2783 requires a status of ETIMEDOUT, not EWOULDBLOCK, on a timeout.	2015-08-07 21:14:19 +00:00
Ed Schouten	a2034cc98a	Allow the creation of kqueues with a restricted set of Capsicum rights. On CloudABI we want to create file descriptors with just the minimal set of Capsicum rights in place. The reason for this is that it makes it easier to obtain uniform behaviour across different operating systems. By explicitly whitelisting the operations, we can return consistent error codes, but also prevent applications from depending on OS-specific behaviour. Extend kern_kqueue() to take an additional struct filecaps that is passed on to falloc_caps(). Update the existing consumers to pass in NULL. Differential Revision: https://reviews.freebsd.org/D3259	2015-08-05 07:36:50 +00:00
Ed Schouten	2433a4eb04	Make it possible to implement poll(2) on top of kqueue(2). It looks like EVFILT_READ and EVFILT_WRITE trigger under the same conditions as poll()'s POLLRDNORM and POLLWRNORM as described by POSIX. The only difference is that POLLRDNORM has to be triggered on regular files unconditionally, whereas EVFILT_READ only triggers when not EOF. Introduce a new flag, NOTE_FILE_POLL, that can be used to make EVFILT_READ and EVFILT_WRITE behave identically to poll(). This flag will be used by cloudlibc's poll() function. Reviewed by: jmg Differential Revision: https://reviews.freebsd.org/D3303	2015-08-05 07:34:29 +00:00
Konstantin Belousov	35dfc644f5	Copy the fencing of the algorithm to do lock-less update and reading of the timehands, from the kern_tc.c implementation to vdso. Add comments giving hints where to look for the algorithm explanation. To compensate the removal of rmb() in userspace binuptime(), add explicit lfence instruction before rdtsc. On i386, add usual complications to detect SSE2 presence; assume that old CPUs which do not implement SSE2 also execute rdtsc almost in order. Reviewed by: alc, bde (previous version) Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-08-04 12:33:51 +00:00
Edward Tomasz Napierala	57a73b26e0	Mark vgonel() as static. It was already declared static earlier; no idea why compilers don't warn about this. MFC after: 1 month Sponsored by: The FreeBSD Foundation	2015-08-04 08:51:56 +00:00
Ed Schouten	dc4b532479	Fix bad arithmetic in umtx_key_get() to compute object offset. It looks like umtx_key_get() has the addition and subtraction the wrong way around, meaning that it fails to match in certain cases. This causes the cloudlibc unit tests to deadlock in certain cases. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D3287	2015-08-04 06:01:13 +00:00
Ed Schouten	52942c1eae	Add missing const keyword to function parameter. The umtx_key_get() function does not dereference the address of the userspace object. The pointer can safely be const.	2015-08-03 21:11:33 +00:00
John Baldwin	92de34df2c	kgdb uses td_oncpu to determine if a thread is running and should use a pcb from stoppcbs[] rather than the thread's PCB. However, exited threads retained td_oncpu from the last time they ran, and newborn threads had their CPU fields cleared to zero during fork and thread creation since they are in the set of fields zeroed when threads are setup. To fix, explicitly update the CPU fields for exiting threads in sched_throw() to reflect the switch out and reset the CPU fields for new threads in sched_fork_thread() to NOCPU. Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D3193	2015-08-03 20:43:36 +00:00
Ed Schouten	39f5ebb774	Add sysent flag to switch to capabilities mode on startup. CloudABI processes should run in capabilities mode automatically. There is no need to switch manually (e.g., by calling cap_enter()). Add a flag, SV_CAPSICUM, that can be used to call into cap_enter() during execve(). Reviewed by: kib	2015-08-03 13:41:47 +00:00
Mark Johnston	ce1c953ee0	Don't modify curthread->td_locks unless INVARIANTS is enabled. This field is only used in a KASSERT that verifies that no locks are held when returning to user mode. Moreover, the td_locks accounting is only correct when LOCK_DEBUG > 0, which is implied by INVARIANTS. Reviewed by: jhb MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D3205	2015-08-02 00:03:08 +00:00
John Baldwin	98685dc8af	Clear P_TRACED before reparenting a detached process back to its original parent. Otherwise the debuggee will be set as an orphan of the debugger. Add tests for tracing forks via PT_FOLLOW_FORK. Reviewed by: kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D2809	2015-08-01 16:27:52 +00:00
Ed Schouten	7ee1b208c3	Add kern_shm_open(). This allows you to specify the capabilities that the new file descriptor should have. This allows us to create shared memory objects that only have the rights we're interested in. The idea behind restricting the rights is that it makes it a lot easier for CloudABI to get consistent behaviour across different operating systems. We only need to make sure that a shared memory implementation consistently implements the operations that are whitelisted. Approved by: kib Obtained from: https://github.com/NuxiNL/freebsd	2015-08-01 07:21:14 +00:00
Ed Schouten	6236e71bfe	Fix accidental line wrapping introduced in r286122.	2015-07-31 10:46:45 +00:00
Ed Schouten	367a13f905	Limit rights on process descriptors. On CloudABI, the rights bits returned by cap_rights_get() match up with the operations that you can actually perform on the file descriptor. Limiting the rights is good, because it makes it easier to get uniform behaviour across different operating systems. If process descriptors on FreeBSD would suddenly gain support for any new file operation, this wouldn't become exposed to CloudABI processes without first extending the rights. Extend fork1() to gain a 'struct filecaps' argument that allows you to construct process descriptors with custom rights. Use this in cloudabi_sys_proc_fork() to limit the rights to just fstat() and pdwait(). Obtained from: https://github.com/NuxiNL/freebsd	2015-07-31 10:21:58 +00:00
Konstantin Belousov	8917728875	vn_io_fault() handling of the LOR for i/o into the file-backed buffers has observable overhead when the buffer pages are not resident or not mapped. The overhead comes at least from two factors, one is the additional work needed to detect the situation, prepare and execute the rollbacks. Another is the consequence of the i/o splitting into the batches of the held pages, causing filesystems to see a series of smaller i/o requests instead of a single large request. Note that the expected case of the resident i/o buffer does not expose these issues. Provide a prefaulting for the userspace i/o buffers, disabled by default. I am careful not to enable prefaulting by default for now, since it would be detrimental for applications which speculatively pass extra-large buffers of anonymous memory to avoid dealing with buffer sizing (if such apps exist). Found and tested by: bde, emaste Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-07-31 04:12:51 +00:00
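
Note on 35dfc644f5: the "lock-less update and reading of the timehands" it refers to is a generation-count scheme, where a reader retries whenever the generation was zero or changed during its reads. What follows is only a rough userspace sketch of that idea, not the kern_tc.c or vdso code. It assumes a single writer, collapses the kernel's set of timehands into one record, and uses C11 atomics in place of the kernel's own fence primitives. The names toy_timehands, toy_th_update and toy_th_read are made up for illustration.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a timehands record. */
struct toy_timehands {
	_Atomic uint32_t gen;	/* 0 means "update in progress" */
	_Atomic uint64_t sec;
	_Atomic uint64_t frac;
};

/* Writer (assumed to be the only one): invalidate, update, republish. */
void
toy_th_update(struct toy_timehands *th, uint64_t sec, uint64_t frac)
{
	uint32_t ogen = atomic_load_explicit(&th->gen, memory_order_relaxed);

	atomic_store_explicit(&th->gen, 0, memory_order_relaxed);
	/* Order the invalidation before the data stores. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&th->sec, sec, memory_order_relaxed);
	atomic_store_explicit(&th->frac, frac, memory_order_relaxed);
	if (++ogen == 0)	/* never publish generation 0 */
		ogen = 1;
	/* Publish the new generation after the data stores. */
	atomic_store_explicit(&th->gen, ogen, memory_order_release);
}

/* Reader: retry until a stable, non-zero generation brackets the reads. */
void
toy_th_read(struct toy_timehands *th, uint64_t *sec, uint64_t *frac)
{
	uint32_t gen;

	do {
		gen = atomic_load_explicit(&th->gen, memory_order_acquire);
		*sec = atomic_load_explicit(&th->sec, memory_order_relaxed);
		*frac = atomic_load_explicit(&th->frac, memory_order_relaxed);
		/* Order the data loads before the generation re-check. */
		atomic_thread_fence(memory_order_acquire);
	} while (gen == 0 ||
	    gen != atomic_load_explicit(&th->gen, memory_order_relaxed));
}
```

In this sketch a zero-initialised record counts as "not yet published", so toy_th_read() spins until toy_th_update() has been called at least once. The fences are what the commit message is about: without them the data accesses could be reordered around the generation checks and a reader could accept a torn value.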

14552 Commits