freebsd-dev

Author	SHA1	Message	Date
Konstantin Belousov	6c775eb64e	Allow PT_INTERP and PT_NOTES segments to be located anywhere in the executable image. Keep one page (arbitrary) limit on the max allowed size of the PT_NOTES. The ELF image activators still require that program headers of the executable are fully contained in the first page of the image file. Reviewed by: emaste, jhb Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D3871	2015-10-14 18:27:35 +00:00
Jeff Roberson	21fae96123	Parallelize the buffer cache and rewrite getnewbuf(). This results in a 8x performance improvement in a micro benchmark on a 4 socket machine. - Get buffer headers from a per-cpu uma cache that sits in from of the free queue. - Use a per-cpu quantum cache in vmem to eliminate contention for kva. - Use multiple clean queues according to buffer cache size to eliminate clean queue lock contention. - Introduce a bufspace daemon that attempts to prevent getnewbuf() callers from blocking or doing direct recycling. - Close some bufspace allocation races that could lead to endless recycling. - Further the transition to a more modern style of small functions grouped by prefix in order to improve growing complexity. Sponsored by: EMC / Isilon Reviewed by: kib Tested by: pho	2015-10-14 02:10:07 +00:00
Hiren Panchasara	86a996e6bd	There are times when it would be really nice to have a record of the last few packets and/or state transitions from each TCP socket. That would help with narrowing down certain problems we see in the field that are hard to reproduce without understanding the history of how we got into a certain state. This change provides just that. It saves copies of the last N packets in a list in the tcpcb. When the tcpcb is destroyed, the list is freed. I thought this was likely to be more performance-friendly than saving copies of the tcpcb. Plus, with the packets, you should be able to reverse-engineer what happened to the tcpcb. To enable the feature, you will need to compile a kernel with the TCPPCAP option. Even then, the feature defaults to being deactivated. You can activate it by setting a positive value for the number of captured packets. You can do that on either a global basis or on a per-socket basis (via a setsockopt call). There is no way to get the packets out of the kernel other than using kmem or getting a coredump. I thought that would help some of the legal/privacy concerns regarding such a feature. However, it should be possible to add a future effort to export them in PCAP format. I tested this at low scale, and found that there were no mbuf leaks and the peak mbuf usage appeared to be unchanged with and without the feature. The main performance concern I can envision is the number of mbufs that would be used on systems with a large number of sockets. If you save five packets per direction per socket and have 3,000 sockets, that will consume at least 30,000 mbufs just to keep these packets. I tried to reduce the concerns associated with this by limiting the number of clusters (not mbufs) that could be used for this feature. Again, in my testing, that appears to work correctly. Differential Revision: D3100 Submitted by: Jonathan Looney <jlooney at juniper dot net> Reviewed by: gnn, hiren	2015-10-14 00:35:37 +00:00
Edward Tomasz Napierala	92001b9497	Change the default setting of kern.ipc.shm_allow_removed from 0 to 1. This removes the need for manually changing this flag for Google Chrome users. It also improves compatibility with Linux applications running under Linuxulator compatibility layer, and possibly also helps in porting software from Linux. Generally speaking, the flag allows applications to create the shared memory segment, attach it, remove it, and then continue to use it and to reattach it later. This means that the kernel will automatically "clean up" after the application exits. It could be argued that it's against POSIX. However, SUSv3 says this about IPC_RMID: "Remove the shared memory identifier specified by shmid from the system and destroy the shared memory segment and shmid_ds data structure associated with it." From my reading, we break it in any case by deferring removal of the segment until it's detached; we won't break it any more by also deferring removal of the identifier. This is the behaviour exhibited by Linux since... probably always, and also by OpenBSD since the following commit: revision 1.54 date: 2011/10/27 07:56:28; author: robert; state: Exp; lines: +3 -8; Allow segments to be used even after they were marked for deletion with the IPC_RMID flag. This is permitted as an extension beyond the standards and this is similar to what other operating systems like linux do. MFC after: 1 month Relnotes: yes Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D3603	2015-10-10 09:29:47 +00:00
Edward Tomasz Napierala	b9a5c7b595	Provide better debug message on kernel module name clash. Reviewed by: kib@ MFC after: 1 month Sponsored by: The FreeBSD Foundation	2015-10-10 09:21:55 +00:00
Edward Tomasz Napierala	8d90e66066	Remove root_mount_wait(). It's not used anywhere. Reviewed by: bapt@ MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D3787	2015-10-09 12:11:37 +00:00
Konstantin Belousov	4b48959f9f	Enforce the maxproc limitation before allocating struct proc, initial struct thread and kernel stack for the thread. Otherwise, a load similar to a fork bomb would exhaust KVA and possibly kmem, mostly due to the struct proc being type-stable. The nprocs counter is changed from being protected by allproc_lock sx to be an atomic variable. Note that ddb/db_ps.c:db_ps() use of nprocs was unsafe before, and is still unsafe, but it seems that the only possible undesired consequence is the harmless warning printed when allproc linked list length does not match nprocs. Diagnosed by: Svatopluk Kraus <onwahe@gmail.com> Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-10-08 11:07:09 +00:00
Fabien Thomas	78e79434d2	Fix r283998 that broke mapin events for hwpmc. Reviewed by: jhb Sponsored by: Stormshield	2015-10-08 09:54:33 +00:00
Gleb Smirnoff	e40e8705db	Fix regression from r248371. We need to copy packet header to new mbuf. Unlike in the pre-r248371 code, assert that M_PKTHDR is set only on a first mbuf. Reported & tested by: Andriy Voskoboinyk <s3erios gmail.com> Sponsored by: Nginx, Inc.	2015-10-07 12:40:00 +00:00
John Baldwin	189ac973de	Fix various edge cases related to system call tracing. - Always set td_dbg_sc_* when P_TRACED is set on system call entry even if the debugger is not tracing system call entries. This ensures the fields are valid when reporting other stops that occur at system call boundaries such as for PT_FOLLOW_FORKS or when only tracing system call exits. - Set TDB_SCX when reporting the stop for a new child process in fork_return(). This causes the event to be reported as a system call exit. - Report a system call exit event in fork_return() for new threads in a traced process. - Copy td_dbg_sc_* to new threads instead of zeroing. This ensures that td_dbg_sc_code in particular will report the system call that created the new thread or process when it reports a system call exit event in fork_return(). - Add new ptrace tests to verify that new child processes and threads report system call exit events with a valid pl_syscall_code via PT_LWPINFO. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D3822	2015-10-06 19:29:05 +00:00
Conrad Meyer	e6b95927f3	Fix core corruption caused by race in note_procstat_vmmap This fix is spiritually similar to r287442 and was discovered thanks to the KASSERT added in that revision. NT_PROCSTAT_VMMAP output length, when packing kinfo structs, is tied to the length of filenames corresponding to vnodes in the process' vm map via vn_fullpath. As vnodes may move during coredump, this is racy. We do not remove the race, only prevent it from causing coredump corruption. - Add a sysctl, kern.coredump_pack_vmmapinfo, to allow users to disable kinfo packing for PROCSTAT_VMMAP notes. This avoids VMMAP corruption and truncation, even if names change, at the cost of up to PATH_MAX bytes per mapped object. The new sysctl is documented in core.5. - Fix note_procstat_vmmap to self-limit in the second pass. This addresses corruption, at the cost of sometimes producing a truncated result. - Fix PROCSTAT_VMMAP consumers libutil (and libprocstat, via copy-paste) to grok the new zero padding. Reported by: pho (https://people.freebsd.org/~pho/stress/log/datamove4-2.txt) Relnotes: yes Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D3824	2015-10-06 18:07:00 +00:00
Gleb Smirnoff	640082d498	Remove debugging variable from r143761.	2015-10-06 09:43:49 +00:00
John Baldwin	3edd0ffffe	Include additional info in ptrace(2) KTR traces: - The new PC value and signal passed to PT_CONTINUE, PT_DETACH, PT_SYSCALL, and PT_TO_SC[EX]. - The system call code returned via PT_LWPINFO. MFC after: 1 week	2015-10-05 21:36:53 +00:00
Mark Johnston	403ec61cbb	Revert r288628 and instead fix a discrepancy between the posix_fadvise(2) man page and POSIX: posix_fadvise(2) returns an error number on failure. Reported by: jilles MFC after: 1 week	2015-10-03 22:27:14 +00:00
Mark Johnston	a7713f7631	The return value of posix_fadvise(2) is just an error status, so sys_posix_fadvise() should simply return the errno (or 0) to syscallenter() rather than setting a return value. MFC after: 1 week	2015-10-03 19:37:41 +00:00
Alan Cox	acada7aef0	Perform a single batched update to the object's paging-in-progress count rather than updating it for each page.	2015-10-03 17:04:52 +00:00
Poul-Henning Kamp	d58b610faa	Fail the sbuf if vsnprintf(3) fails.	2015-10-02 09:23:14 +00:00
Mark Johnston	0a19cfd454	Ensure that vop_stdadvise() does not call getblk() on vnodes that have an empty bufobj. Otherwise, vnodes belonging to filesystems that do not use the buffer cache may trigger assertion failures. Reported by: Fabien Keil	2015-10-01 16:34:53 +00:00
Colin Percival	2eb0015ab7	Disable suspend when we're shutting down. This solves the "tell FreeBSD to shut down; close laptop lid" scenario which otherwise tended to end with a laptop overheating or the battery dying. The implementation uses a new sysctl, kern.suspend_blocked; init(8) sets this while rc.suspend runs, and the ACPI sleep code ignores requests while the sysctl is set. Discussed on: freebsd-acpi (35 emails) MFC after: 1 week	2015-10-01 10:52:26 +00:00
Mark Johnston	3138cd3670	As a step towards the elimination of PG_CACHED pages, rework the handling of POSIX_FADV_DONTNEED so that it causes the backing pages to be moved to the head of the inactive queue instead of being cached. This affects the implementation of POSIX_FADV_NOREUSE as well, since it works by applying POSIX_FADV_DONTNEED to file ranges after they have been read or written. At that point the corresponding buffers may still be dirty, so the previous implementation would coalesce successive ranges and apply POSIX_FADV_DONTNEED to the result, ensuring that pages backing the dirty buffers would eventually be cached. To preserve this behaviour in an efficient manner, this change adds a new buf flag, B_NOREUSE, which causes the pages backing a VMIO buf to be placed at the head of the inactive queue when the buf is released. POSIX_FADV_NOREUSE then works by setting this flag in bufs that underlie the specified range. Reviewed by: alc, kib Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D3726	2015-09-30 23:06:29 +00:00
Andriy Gapon	2f2f522b5d	save some bytes by using more concise SDT_PROBE<n> instead of SDT_PROBE SDT_PROBE requires 5 parameters whereas SDT_PROBE<n> requires n parameters where n is typically smaller than 5. Perhaps SDT_PROBE should be made a private implementation detail. MFC after: 20 days	2015-09-28 12:14:16 +00:00
Jeff Roberson	4615830db2	- Collapse vfs_vmio_truncate & vfs_vmio_release into a single function. - Allow vfs_vmio_invalidate() to free the pages, leaving us with a single loop and bufobj lock when B_NOCACHE/B_INVAL is used. - Eliminate the special B_ASYNC handling on free that has not been relevant for some time. - Remove the extraneous page busy from vfs_vmio_truncate(). Reviewed by: kib Tested by: pho Sponsored by: EMC / Isilon storage division	2015-09-27 05:16:06 +00:00
Mark Johnston	0a805de6f3	Remove a check for a condition that is always false by a preceding KASSERT that was added in r144704.	2015-09-26 22:26:55 +00:00
Mark Johnston	d925c2e800	Fix argument ordering in vn_printf(). MFC after: 3 days	2015-09-26 22:16:54 +00:00
Conrad Meyer	2f1c4e0ebf	sbuf: Process more than one char at a time Revamp sbuf_put_byte() to sbuf_put_bytes() in the obvious fashion and fixup callers. Add a thin shim around sbuf_put_bytes() with the old ABI to avoid ugly changes to some callers. Reviewed by: jhb, markj Obtained from: Dan Sledz Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D3717	2015-09-25 18:37:14 +00:00
Konstantin Belousov	b2557db607	Use per-cpu values for base and last in tc_cpu_ticks(). The values are updated lockess, different CPUs write its own view of timecounter state. The critical section is done for safety, callers of tc_cpu_ticks() are supposed to already enter critical section, or to own a spinlock. The change fixes sporadical reports of too high values reported for the (W)CPU on platforms that do not provide cpu ticker and use tc_cpu_ticks(), in particular, arm*. Diagnosed and reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-09-25 13:03:57 +00:00
Mateusz Guzik	3c44a3495f	kqueue: simplify kern_kqueue by not refing/unrefing creds too early No functional changes.	2015-09-23 12:45:08 +00:00
Jeff Roberson	589c956a5a	- Fix a nonsense reordering that somehow slipped into my last diff. Reported by: pho	2015-09-23 07:44:07 +00:00
Jeff Roberson	8264830c95	Some refactoring of the buf/vm interface. - Eliminate bogus page replacement that is inconsistently applied in the invalidation loop in brelse. This has been a no-op in modern times as biodone() is responsible for cleaning up after bogus pages. This would've spammed the console with printfs at a minimum. - Allow the compiler and human readers alike to reason about allocbuf() by splitting it into constituent parts. - Separate the VM manipulating and buf manipulating code in brelse() and bufdone() so that the intentions are clear. This makes it evident that there are several duplicated buf pages loops that will be consolidated at a later time. Reviewed by: kib Tested by: pho Sponsored by: EMC / Isilon Storage Division	2015-09-22 23:57:52 +00:00
Alan Cox	15aaea7892	Change vm_page_unwire() such that it (1) accepts PQ_NONE as the specified queue and (2) returns a Boolean indicating whether the page's wire count transitioned to zero. Exploit this change in vfs_vmio_release() to avoid pointlessly enqueueing a page that is about to be freed. (An earlier version of this change was developed by attilio@ and kmacy@. Any errors in this version are my own.) Reviewed by: kib Sponsored by: EMC / Isilon Storage Division	2015-09-22 18:16:52 +00:00
Hans Petter Selasky	c55f4c9445	Revert r287780 until more developers have their say. Differential Revision: https://reviews.freebsd.org/D3521 Requested by: gnn	2015-09-22 06:51:55 +00:00
Bryan Drewery	6c5c24c98c	vfs_mountroot_shuffle() never returns non-zero.	2015-09-22 03:34:07 +00:00
Konstantin Belousov	1f57d8c66b	Ensure that maxproc does not exceed pid_max, at the time of boot. Changes to kern.pid_max mib after the boot can break this relation. The maxfiles value was calculated by the MAXFILES formula based on maxproc value, but this change decouples them, and MAXFILES now references maxusers. Without manual tuning, the maxfiles default value remains as it was prior to this commit. But for systems which have tuned maxproc and rely on maxfiles to adjust, additional reconfiguration is needed. Reported by: rwatson Reviewed by: emaste Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-09-21 15:02:59 +00:00
Konstantin Belousov	cff8c6f2d1	Add support for weak symbols to the kernel linkers. It means that linkers no longer raise an error when undefined weak symbols are found, but relocate as if the symbol value was 0. Note that we do not repeat the mistake of userspace dynamic linker of making the symbol lookup prefer non-weak symbol definition over the weak one, if both are available. In fact, kernel linker uses the first definition found, and ignores duplicates. Signature of the elf_lookup() and elf_obj_lookup() functions changed to split result/error code and the symbol address returned. Otherwise, it is impossible to return zero address as the symbol value, to MD relocation code. This explains the mechanical changes in elf_machdep.c sources. The powerpc64 R_PPC_JMP_SLOT handler did not checked error from the lookup() call, the patch leaves the code as is (untested). Reported by: glebius Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-09-20 01:27:59 +00:00
Edward Tomasz Napierala	0d3d0cc358	Kernel part of reroot support - a way to change rootfs without reboot. Note that the mountlist manipulations are somewhat fragile, and not very pretty. The reason for this is to avoid changing vfs_mountroot(), which is (obviously) rather mission-critical, but not very well documented, and thus hard to test properly. It might be possible to rework it to use its own simple root mount mechanism instead of vfs_mountroot(). Reviewed by: kib@ MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D2698	2015-09-18 17:32:22 +00:00
John Baldwin	bdd64116b0	Always clear TDB_USERWR before fetching system call arguments. The TDB_USERWR flag may still be set after a debugger detaches from a process via PT_DETACH. Previously the flag would never be cleared forcing a double fetch of the system call arguments for each system call. Note that the flag cannot be cleared at PT_DETACH time in case one of the threads in the process is currently stopped in syscallenter() and the debugger has modified the arguments for that pending system call before detaching. Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D3678	2015-09-16 20:55:00 +00:00
John Baldwin	4295bec16a	When a process group leader exits, all of the processes in the group are sent SIGHUP and SIGCONT if any of the processes are stopped. Currently this behavior is triggered for any type of process stop including ptrace() stops and transient stops for single threading during exit() and execve(). Thus, if a debugger is attached to a process in a group when the leader exits, the entire group can be HUPed. Instead, only send the signals if a process in the group is stopped due to SIGSTOP. PR: 201149 Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D3681	2015-09-16 16:40:07 +00:00
Mateusz Guzik	7665e341ca	sysctl: switch sysctllock to a sleepable rmlock, take 2 This restores r285125. Previous attempt was reverted due to a bug in rmlocks, which is fixed since r287833.	2015-09-15 23:06:56 +00:00
John Baldwin	e89d5f43da	Threads holding a read lock of a sleepable rm lock are not permitted to sleep. The rmlock implementation enforces this by disabling sleeping when a read lock is acquired. To simplify the implementation, sleeping is disabled for most of the duration of rm_rlock. However, it doesn't need to be disabled until the lock is acquired. If a sleepable rm lock is contested, then rm_rlock may need to acquire the backing sx lock. This tripped the overly-broad assertion. Fix by relaxing the assertion around the call to sx_xlock(). Reported by: mjg Reviewed by: kib, mjg MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D3324	2015-09-15 22:16:21 +00:00
Conrad Meyer	55d33667ee	kevent(2): Note DOOMED vnodes with NOTE_REVOKE In poll mode, check for and wake VBAD vnodes. (Vnodes that are VBAD at registration will never be woken by the RECLAIM trigger.) Add post-VOP_RECLAIM hook to trigger notes on vnode reclamation. (Vnodes that were fine at registration but are vgoned while being monitored should signal waiters.) Reviewed by: kib Approved by: markj (mentor) Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D3675	2015-09-15 20:22:30 +00:00
Hans Petter Selasky	9acc0eafd7	Implement callout_drain_async(), inspired by the projects/hps_head branch. This function is used to drain a callout via a callback instead of blocking the caller until the drain is complete. Refer to the callout_drain_async() manual page for a detailed description. Limitation: If a lock is used with the callout, the callout can only be drained asynchronously one time unless the callout_init_mtx() function is called again. This limitation is not present in projects/hps_head and will require more invasive changes to the timeout code, which was not in the scope of this patch. Differential Revision: https://reviews.freebsd.org/D3521 Reviewed by: wblock MFC after: 1 month	2015-09-14 10:52:26 +00:00
Warner Losh	7297c5e535	bufdonebio is now unused. Retire it too.	2015-09-11 04:20:04 +00:00
Mark Johnston	610141cebb	Add stack_save_td_running(), a function to trace the kernel stack of a running thread. It is currently implemented only on amd64 and i386; on these architectures, it is implemented by raising an NMI on the CPU on which the target thread is currently running. Unlike stack_save_td(), it may fail, for example if the thread is running in user mode. This change also modifies the kern.proc.kstack sysctl to use this function, so that stacks of running threads are shown in the output of "procstat -kk". This is handy for debugging threads that are stuck in a busy loop. Reviewed by: bdrewery, jhb, kib Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D3256	2015-09-11 03:54:37 +00:00
Warner Losh	ad8d57a99d	dev_strategy and dev_strategy_csw are unused since r281825. Remove them. Differential Revision: https://reviews.freebsd.org/D3620	2015-09-11 00:38:58 +00:00
Adrian Chadd	32766cd281	Also make kern.maxfilesperproc a boot time tunable. Auto-tuning threshold discussions aside, it turns out that if you want to lower this on say, rather memory-packed machines, you either set maxusers or kern.maxfiles, or you set it in sysctl. The former is a non-exact way to tune this; the latter doesn't actually affect anything in the startup scripts. This first occured because I wondered why the hell screen would take upwards of 10 seconds to spawn a new screen. I then found python doing the same thing during fork/exec of child processes - it calls close() on each FD up to the current openfiles limit. On a 1TB machine this is like, 26 million FDs per process. Ugh. So: * This allows it to be set early in /boot/loader.conf; * It can be used to work around the ridiculous situation of screen, python, etc doing a close() on potentially millions of FDs even though you only have four open. Tested: * 4GB, 32GB, 64GB, 128GB, 384GB, 1TB systems with autotune, ensuring screen and python forking doesn't result in some pretty hilariously bad behaviour. TODO: * Note that the default login.conf sets openfiles-cur to unlimited, effectively obeying kern.maxfilesperproc. Perhaps we should fix this. * .. and even if we do, we need to also ensure that daemons get a soft limit of something reasonable and capped - they can request more FDs themselves. MFC after: 1 week Sponsored by: Norse Corp, Inc.	2015-09-10 04:05:58 +00:00
Konstantin Belousov	9e18c9eb27	For open("name", O_DIRECTORY \| O_CREAT), do not try to create the named node, open(2) cannot create directories. But do allow the flag combination to succeed if the directory already exists. Declare the open("name", O_DIRECTORY \| O_CREAT \| O_EXCL) always invalid for the same reason, since open(2) cannot create directory. Note that there is an argument that O_DIRECTORY \| O_CREAT should be invalid always, regardless of the target directory existence or O_EXCL. The current fix is conservative and allows the call to succeed in the situation where it succeeded before the patch. Reported by: Tom Ridge <freebsd@tom-ridge.com> Reviewed by: rwatson PR: 202892 Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-09-09 19:31:08 +00:00
Mateusz Guzik	9af8c8b72b	fd: make rights a mandatory argument to fgetvp_rights The only caller already always passes rights.	2015-09-07 20:05:56 +00:00
Mateusz Guzik	d7832811a7	fd: make the common case in filecaps_copy work lockless The filedesc lock is only needed if ioctls caps are present, which is a rare situation. This is a step towards reducing the scope of the filedesc lock.	2015-09-07 20:02:56 +00:00
Conrad Meyer	bcb60d52e6	Follow-up to r287442: Move sysctl to compiled-once file Avoid duplicate sysctl nodes. Found by: tijl Approved by: markj (mentor) Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D3586	2015-09-07 16:44:28 +00:00
Kirk McKusick	17518b1a2b	Track changes to kern.maxvnodes and appropriately increase or decrease the size of the name cache hash table (mapping file names to vnodes) and the vnode hash table (mapping mount point and inode number to vnode). An appropriate locking strategy is the key to changing hash table sizes while they are in active use. Reviewed by: kib Tested by: Peter Holm Differential Revision: https://reviews.freebsd.org/D2265 MFC after: 2 weeks	2015-09-06 05:50:51 +00:00

1 2 3 4 5 ...

14504 Commits