freebsd-skq

Author	SHA1	Message	Date
adrian	64b681fbd5	Don't call wakeup if we're just returning reserved space; just return the reservation and wait for more space to appear. Submitted by: jeff Reviewed by: kib	2015-12-16 00:13:16 +00:00
jamie	b78d6a91e2	Fix jail name checking that disallowed anything that starts with '0'. The intention was to just limit leading zeroes on numeric names. That check is now improved to also catch the leading spaces and '+' that strtoul can pass through. PR: 204897 MFC after: 3 days	2015-12-15 17:25:00 +00:00
trasz	9d2d111f78	Tweak comments. MFC after: 1 month Sponsored by: The FreeBSD Foundation	2015-12-13 11:30:36 +00:00
trasz	6751d261c4	Actually make the 'amount' argument to racct_adjust_resource() signed, as it was always supposed to be. MFC after: 1 month Sponsored by: The FreeBSD Foundation	2015-12-13 11:21:13 +00:00
trasz	3430b87794	Avoid useless relocking. MFC after: 1 month Sponsored by: The FreeBSD Foundation	2015-12-13 11:08:29 +00:00
markj	98fd4878e0	Don't make assertions about td_critnest when the scheduler is stopped. A panicking thread always executes with a critical section held, so any attempt to allocate or free memory while dumping will otherwise cause a second panic. This can occur, for example, if xpt_polled_action() completes non-dump I/O that was pending at the time of the panic. The fact that this can occur is itself a bug, but asserting in this case does little but reduce the reliability of kernel dumps. Suggested by: kib Reported by: pho	2015-12-11 20:05:07 +00:00
imp	b4d51d26ba	Create the MDT_PNP_INFO metadata record to communicate PNP info about modules. External agents may use this data to automatically load those modules. Differential Review: https://reviews.freebsd.org/D3461	2015-12-11 05:27:53 +00:00
smh	4a58b9436f	Don't use 0 for pointer comparison Use NULL instead of 0 for comparison with panicstr. MFC after: 1 week Sponsored by: Multiplay	2015-12-08 18:38:33 +00:00
markj	f734f97f4e	Add helper functions proc_readmem() and proc_writemem(). These helper functions can be used to read in or write a buffer from or to an arbitrary process' address space. Without them, this can only be done using proc_rwmem(), which requires the caller to fill out a uio. This is onerous and results in code duplication; the new functions provide a simpler interface which is sufficient for most existing callers of proc_rwmem(). This change also adds a manual page for proc_rwmem() and the new functions. Reviewed by: jhb, kib Differential Revision: https://reviews.freebsd.org/D4245	2015-12-07 21:33:15 +00:00
emaste	0d1c50f494	Replace magic value ELF note type with NT_FREEBSD_ABI_TAG As of r291909 elf_common.h provides a definition. Suggested by: kib Sponsored by: The FreeBSD Foundation	2015-12-07 18:43:27 +00:00
kib	80e8626b43	Add support for usermode (vdso-like) gettimeofday(2) and clock_gettime(2) on ARMv7 and ARMv8 systems which have architectural generic timer hardware. It is similar how the RDTSC timer is used in userspace on x86. Fix a permission problem where generic timer access from EL0 (or userspace on v7) was not properly initialized on APs. For ARMv7, mark the stack non-executable. The shared page is added for all arms (including ARMv8 64bit), and the signal trampoline code is moved to the page. Reviewed by: andrew Discussed with: emaste, mmel Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D4209	2015-12-07 12:20:26 +00:00
mckusick	1a9ecd3df9	We need to zero out the clustering variables in a freed vnode structure. For completeness add a VNASSERT that there are no threads waiting on a range lock (this was previously checked on every vnode free). Reported by; Rick Macklem Fix from: Mateusz Guzik PR: 204949	2015-12-04 03:54:18 +00:00
ken	d0f081c521	Add asynchronous command support to the pass(4) driver, and the new camdd(8) utility. CCBs may be queued to the driver via the new CAMIOQUEUE ioctl, and completed CCBs may be retrieved via the CAMIOGET ioctl. User processes can use poll(2) or kevent(2) to get notification when I/O has completed. While the existing CAMIOCOMMAND blocking ioctl interface only supports user virtual data pointers in a CCB (generally only one per CCB), the new CAMIOQUEUE ioctl supports user virtual and physical address pointers, as well as user virtual and physical scatter/gather lists. This allows user applications to have more flexibility in their data handling operations. Kernel memory for data transferred via the queued interface is allocated from the zone allocator in MAXPHYS sized chunks, and user data is copied in and out. This is likely faster than the vmapbuf()/vunmapbuf() method used by the CAMIOCOMMAND ioctl in configurations with many processors (there are more TLB shootdowns caused by the mapping/unmapping operation) but may not be as fast as running with unmapped I/O. The new memory handling model for user requests also allows applications to send CCBs with request sizes that are larger than MAXPHYS. The pass(4) driver now limits queued requests to the I/O size listed by the SIM driver in the maxio field in the Path Inquiry (XPT_PATH_INQ) CCB. There are some things things would be good to add: 1. Come up with a way to do unmapped I/O on multiple buffers. Currently the unmapped I/O interface operates on a struct bio, which includes only one address and length. It would be nice to be able to send an unmapped scatter/gather list down to busdma. This would allow eliminating the copy we currently do for data. 2. Add an ioctl to list currently outstanding CCBs in the various queues. 3. Add an ioctl to cancel a request, or use the XPT_ABORT CCB to do that. 4. Test physical address support. Virtual pointers and scatter gather lists have been tested, but I have not yet tested physical addresses or scatter/gather lists. 5. Investigate multiple queue support. At the moment there is one queue of commands per pass(4) device. If multiple processes open the device, they will submit I/O into the same queue and get events for the same completions. This is probably the right model for most applications, but it is something that could be changed later on. Also, add a new utility, camdd(8) that uses the asynchronous pass(4) driver interface. This utility is intended to be a basic data transfer/copy utility, a simple benchmark utility, and an example of how to use the asynchronous pass(4) interface. It can copy data to and from pass(4) devices using any target queue depth, starting offset and blocksize for the input and ouptut devices. It currently only supports SCSI devices, but could be easily extended to support ATA devices. It can also copy data to and from regular files, block devices, tape devices, pipes, stdin, and stdout. It does not support queueing multiple commands to any of those targets, since it uses the standard read(2)/write(2)/writev(2)/readv(2) system calls. The I/O is done by two threads, one for the reader and one for the writer. The reader thread sends completed read requests to the writer thread in strictly sequential order, even if they complete out of order. That could be modified later on for random I/O patterns or slightly out of order I/O. camdd(8) uses kqueue(2)/kevent(2) to get I/O completion events from the pass(4) driver and also to send request notifications internally. For pass(4) devcies, camdd(8) uses a single buffer (CAM_DATA_VADDR) per CAM CCB on the reading side, and a scatter/gather list (CAM_DATA_SG) on the writing side. In addition to testing both interfaces, this makes any potential reblocking of I/O easier. No data is copied between the reader and the writer, but rather the reader's buffers are split into multiple I/O requests or combined into a single I/O request depending on the input and output blocksize. For the file I/O path, camdd(8) also uses a single buffer (read(2), write(2), pread(2) or pwrite(2)) on reads, and a scatter/gather list (readv(2), writev(2), preadv(2), pwritev(2)) on writes. Things that would be nice to do for camdd(8) eventually: 1. Add support for I/O pattern generation. Patterns like all zeros, all ones, LBA-based patterns, random patterns, etc. Right Now you can always use /dev/zero, /dev/random, etc. 2. Add support for a "sink" mode, so we do only reads with no writes. Right now, you can use /dev/null. 3. Add support for automatic queue depth probing, so that we can figure out the right queue depth on the input and output side for maximum throughput. At the moment it defaults to 6. 4. Add support for SATA device passthrough I/O. 5. Add support for random LBAs and/or lengths on the input and output sides. 6. Track average per-I/O latency and busy time. The busy time and latency could also feed in to the automatic queue depth determination. sys/cam/scsi/scsi_pass.h: Define two new ioctls, CAMIOQUEUE and CAMIOGET, that queue and fetch asynchronous CAM CCBs respectively. Although these ioctls do not have a declared argument, they both take a union ccb pointer. If we declare a size here, the ioctl code in sys/kern/sys_generic.c will malloc and free a buffer for either the CCB or the CCB pointer (depending on how it is declared). Since we have to keep a copy of the CCB (which is fairly large) anyway, having the ioctl malloc and free a CCB for each call is wasteful. sys/cam/scsi/scsi_pass.c: Add asynchronous CCB support. Add two new ioctls, CAMIOQUEUE and CAMIOGET. CAMIOQUEUE adds a CCB to the incoming queue. The CCB is executed immediately (and moved to the active queue) if it is an immediate CCB, but otherwise it will be executed in passstart() when a CCB is available from the transport layer. When CCBs are completed (because they are immediate or passdone() if they are queued), they are put on the done queue. If we get the final close on the device before all pending I/O is complete, all active I/O is moved to the abandoned queue and we increment the peripheral reference count so that the peripheral driver instance doesn't go away before all pending I/O is done. The new passcreatezone() function is called on the first call to the CAMIOQUEUE ioctl on a given device to allocate the UMA zones for I/O requests and S/G list buffers. This may be good to move off to a taskqueue at some point. The new passmemsetup() function allocates memory and scatter/gather lists to hold the user's data, and copies in any data that needs to be written. For virtual pointers (CAM_DATA_VADDR), the kernel buffer is malloced from the new pass(4) driver malloc bucket. For virtual scatter/gather lists (CAM_DATA_SG), buffers are allocated from a new per-pass(9) UMA zone in MAXPHYS-sized chunks. Physical pointers are passed in unchanged. We have support for up to 16 scatter/gather segments (for the user and kernel S/G lists) in the default struct pass_io_req, so requests with longer S/G lists require an extra kernel malloc. The new passcopysglist() function copies a user scatter/gather list to a kernel scatter/gather list. The number of elements in each list may be different, but (obviously) the amount of data stored has to be identical. The new passmemdone() function copies data out for the CAM_DATA_VADDR and CAM_DATA_SG cases. The new passiocleanup() function restores data pointers in user CCBs and frees memory. Add new functions to support kqueue(2)/kevent(2): passreadfilt() tells kevent whether or not the done queue is empty. passkqfilter() adds a knote to our list. passreadfiltdetach() removes a knote from our list. Add a new function, passpoll(), for poll(2)/select(2) to use. Add devstat(9) support for the queued CCB path. sys/cam/ata/ata_da.c: Add support for the BIO_VLIST bio type. sys/cam/cam_ccb.h: Add a new enumeration for the xflags field in the CCB header. (This doesn't change the CCB header, just adds an enumeration to use.) sys/cam/cam_xpt.c: Add a new function, xpt_setup_ccb_flags(), that allows specifying CCB flags. sys/cam/cam_xpt.h: Add a prototype for xpt_setup_ccb_flags(). sys/cam/scsi/scsi_da.c: Add support for BIO_VLIST. sys/dev/md/md.c: Add BIO_VLIST support to md(4). sys/geom/geom_disk.c: Add BIO_VLIST support to the GEOM disk class. Re-factor the I/O size limiting code in g_disk_start() a bit. sys/kern/subr_bus_dma.c: Change _bus_dmamap_load_vlist() to take a starting offset and length. Add a new function, _bus_dmamap_load_pages(), that will load a list of physical pages starting at an offset. Update _bus_dmamap_load_bio() to allow loading BIO_VLIST bios. Allow unmapped I/O to start at an offset. sys/kern/subr_uio.c: Add two new functions, physcopyin_vlist() and physcopyout_vlist(). sys/pc98/include/bus.h: Guard kernel-only parts of the pc98 machine/bus.h header with #ifdef _KERNEL. This allows userland programs to include <machine/bus.h> to get the definition of bus_addr_t and bus_size_t. sys/sys/bio.h: Add a new bio flag, BIO_VLIST. sys/sys/uio.h: Add prototypes for physcopyin_vlist() and physcopyout_vlist(). share/man/man4/pass.4: Document the CAMIOQUEUE and CAMIOGET ioctls. usr.sbin/Makefile: Add camdd. usr.sbin/camdd/Makefile: Add a makefile for camdd(8). usr.sbin/camdd/camdd.8: Man page for camdd(8). usr.sbin/camdd/camdd.c: The new camdd(8) utility. Sponsored by: Spectra Logic MFC after: 1 week	2015-12-03 20:54:55 +00:00
mckusick	25671cd0d5	We need to zero out the union of pointers in a freed vnode structure. PR: 204949 Fix from: Mateusz Guzik Tested by: Jason Unovitch	2015-12-03 02:04:22 +00:00
nwhitehorn	635273ca5b	Missed header_supported call from r291020: make really, really sure the brand likes the executable.	2015-12-01 17:00:31 +00:00
mjg	101bd0f093	capsicum: plug spurious memset in __cap_rights_init Reviewed by: pjd	2015-12-01 02:48:42 +00:00
mckusick	6f0b4b3366	As the kernel allocates and frees vnodes, it fully initializes them on every allocation and fully releases them on every free. These are not trivial costs: it starts by zeroing a large structure then initializes a mutex, a lock manager lock, an rw lock, four lists, and six pointers. And looking at vfs.vnodes_created, these operations are being done millions of times an hour on a busy machine. As a performance optimization, this code update uses the uma_init and uma_fini routines to do these initializations and cleanups only as the vnodes enter and leave the vnode_zone. With this change the initializations are only done kern.maxvnodes times at system startup and then only rarely again. The frees are done only if the vnode_zone shrinks which never happens in practice. For those curious about the avoided work, look at the vnode_init() and vnode_fini() functions in kern/vfs_subr.c to see the code that has been removed from the main vnode allocation/free path. Reviewed by: kib Tested by: Peter Holm	2015-11-29 21:42:26 +00:00
kib	ee461b4bba	Remove sv_prepsyscall, sv_sigsize and sv_sigtbl members of the struct sysent. sv_prepsyscall is unused. sv_sigsize and sv_sigtbl translate signal number from the FreeBSD namespace into the ABI domain. It is only utilized on i386 for iBCS2 binaries. The issue with this approach is that signals for iBCS2 were delivered with the FreeBSD signal frame layout, which does not follow iBCS2. The same note is true for any other potential user if sv_sigtbl. In other words, if ABI needs signal number translation, it really needs custom sv_sendsig method instead. Sponsored by: The FreeBSD Foundation	2015-11-28 08:49:07 +00:00
kib	d58541e227	Remove VI_AGE vnode iflag, it is unused. Noted by: bde Sponsored by: The FreeBSD Foundation	2015-11-27 01:45:40 +00:00
kib	3cb64bd1ce	Move the comment about resident pages preventing vnode from leaving active list, into the header comment for vdrop(), which is the function that decides whether to leave the vnode on the list. Note that dirty page write-out in vinactive() is asynchronous. Discussed with: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-11-27 01:16:35 +00:00
ae	da001d5bf7	Check that hhk_helper pointer isn't NULL before access. It isn't forbidden to use NULL pointer for hook_helper in hookinfo structure when hhook_add_hook() adds new helper hook.	2015-11-25 07:14:58 +00:00
kib	896302b6a8	Rework the vnode cache recycling to meet free and unused vnodes targets. See the comment above wantfreevnodes variable for the description of the algorithm. The vfs.vlru_alloc_cache_src sysctl is removed. New code frees namecache sources as the last chance to satisfy the highest watermark, instead of selecting the source vnodes randomly. This provides good enough behaviour to keep vn_fullpath() working in most situations. The filesystem layout with deep trees, where the removed knob was required, is thus handled automatically. Submitted by: bde Discussed with: mckusick Tested by: pho MFC after: 1 month	2015-11-24 09:45:36 +00:00
markj	9b7d03a6f4	The buffer passed to an sbuf drain callback is not necessarily null-terminated, so don't assume that it is. Reported by: pho X-MFC-With: r291059	2015-11-23 18:45:35 +00:00
kib	e0c4faece4	Split kerne timekeep ABI structure vdso_sv_tk out of the struct sysentvec. This allows the timekeep data to be shared between similar ABIs which cannot share sysentvec. Make the timekeep_push_vdso() tick callback to the timekeep structures instead of sysentvecs. If several sysentvec share the vdso_sv_tk structure, we would update the userspace data several times on each tick, without the change. Only allocate vdso_sv_tk in the exec_sysvec_init() sysinit when sysentvec is marked with the new SV_TIMEKEEP flag. This saves allocation and update of unneeded vdso_sv_tk for ABIs which do not provide userspace gettimeofday yet, which are PowerPCs arches right now. Make vdso_sv_tk allocator public, namely split out and export alloc_sv_tk() and alloc_sv_tk_compat32(). ABIs which share timekeep data now can allocate it manually and share as appropriate. Requested by: nwhitehorn Tested by: nwhitehorn, pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-11-23 07:09:35 +00:00
glebius	6460e3db4e	Remove remnants of the old NFS from vnode pager. Reviewed by: kib Sponsored by: Netflix	2015-11-20 23:52:27 +00:00
trasz	a6632d64f5	The freebsd4_getfsstat() was broken in r281551 to always return 0 on success. All versions of getfsstat(3) are supposed to return the number of [o]statfs structs in the array that was copied out. Also fix missing bounds checking and signed comparison of unsigned types. Submitted by: bde@ MFC after: 1 month Sponsored by: The FreeBSD Foundation	2015-11-20 14:08:12 +00:00
jtl	f2aa140123	Consistently enforce the restriction against calling malloc/free when in a critical section. uma_zalloc_arg()/uma_zalloc_free() may acquire a sleepable lock on the zone. The malloc() family of functions may call uma_zalloc_arg() or uma_zalloc_free(). The malloc(9) man page currently claims that free() will never sleep. It also implies that the malloc() family of functions will not sleep when called with M_NOWAIT. However, it is more correct to say that these functions will not sleep indefinitely. Indeed, they may acquire a sleepable lock. However, a developer may overlook this restriction because the WITNESS check that catches attempts to call the malloc() family of functions within a critical section is inconsistenly applied. This change clarifies the language of the malloc(9) man page to clarify the restriction against calling the malloc() family of functions while in a critical section or holding a spin lock. It also adds KASSERTs at appropriate points to make the enforcement of this restriction more consistent. PR: 204633 Differential Revision: https://reviews.freebsd.org/D4197 Reviewed by: markj Approved by: gnn (mentor) Sponsored by: Juniper Networks	2015-11-19 14:04:53 +00:00
markj	7b874e569f	Remove a commented-out debug print. MFC after: 1 week	2015-11-19 05:58:51 +00:00
markj	7df27c4620	Add support for a configurable output channel to witness(4). This is useful in environments where system configuration is performed by automated interaction with the system console, since unexpected witness output makes such automation difficult. With this change, the new debug.witness.output_channel sysctl allows one to specify that witness output is to be printed to the kernel log (using log(9)) rather than the console. Reviewed by: cem, jhb MFC after: 2 weeks Relnotes: yes Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D4183	2015-11-19 05:56:59 +00:00
markj	afc7726a46	Add vlog(9). Reviewed by: cem, jhb MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D4183	2015-11-19 05:50:22 +00:00
nwhitehorn	2b225aeb0b	Extend r270123 to run the brand info's header_supported() routine for branded as well as unbranded binaries. This will be required to add support for the new ELFv2 ABI on powerpc64, which is distinguished from ELFv1 by the contents of the ELF header's flags field. Reviewed by: imp MFC after: 2 weeks	2015-11-18 17:03:22 +00:00
marius	7965ca51f7	- Unbreak dumpsys(9) on sparc64 after r276772 - While at it, arrange #ifndefs in kern_dump.c more intelligently; it's rather confusing to have multiple competing and/or unused functions in the kernel.	2015-11-16 23:02:33 +00:00
trasz	ae1d62cec8	Speed up rctl operation with large rulesets, by holding the lock during iteration instead of relocking it for each traversed rule. Reviewed by: mjg@ MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D4110	2015-11-15 12:10:51 +00:00
rrs	dc494194a2	This fixes several places where callout_stops return is examined. The new return codes of -1 were mistakenly being considered "true". Callout_stop now returns -1 to indicate the callout had either already completed or was not running and 0 to indicate it could not be stopped. Also update the manual page to make it more consistent no non-zero in the callout_stop or callout_reset descriptions. MFC after: 1 Month with associated callout change.	2015-11-13 22:51:35 +00:00
jhb	c1d9f70889	Export various helper variables describing the layout and size of certain kernel structures for use by debuggers. This mostly aids in examining cores from a kernel without debug symbols as a debugger can infer these values if debug symbols are available. One set of variables describes the layout of 'struct linker_file' to walk the list of loaded kernel modules. A second set of variables describes the layout of 'struct proc' and 'struct thread' to walk the list of processes in the kernel and the threads in each process. The 'pcb_size' variable is used to index into the stoppcbs[] array. The 'vm_maxuser_address' is used to distinguish kernel virtual addresses from user addresses. This doesn't have to be perfect, and 'vm_maxuser_address' is a cheap and simple way to differentiate kernel pointers from simple values like TIDs and PIDs. While here, annotate the fields in struct pcb used by kgdb on amd64 and i386 to note that their ABI should be preserved. Annotations for other platforms will be added in the future. Reviewed by: kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D3773	2015-11-12 22:00:59 +00:00
rrs	24a4335f25	Add new async_drain to the callout system. This is so-far not used but should be used by TCP for sure in its cleanup of the IN-PCB (will be coming shortly). Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D4076	2015-11-10 14:49:32 +00:00
jpaetzel	22e9b6f780	Fix a bug in the CPU % limiting code If you attempt to set a pcpu limit that is higher than 110% using rctl (for instance, you want a jail to be able to use 2 cores on your system so you set pcpu to 200%) the thing you are trying to limit becomes unthrottled. PR: 189870 Submitted by: dustinwenz@ebureau.com Reviewed by: trasz MFC after: 1 week	2015-11-10 14:14:32 +00:00
trasz	425e227d02	Make naming more consistent; no functional changes. MFC after: 1 month Sponsored by: The FreeBSD Foundation	2015-11-08 18:11:24 +00:00
trasz	183c511071	Speed up rctl(8) rule retrieval; the difference shows mostly in "rctl -n", as otherwise most of the time is spent resolving UIDs to names. Reviewed by: mjg@ MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D4059	2015-11-08 18:08:31 +00:00
tijl	4e8b6b4a06	Since r289279 bufinit() uses mp_ncpus, but some architectures set this variable during mp_start() which is too late. Move this to mp_setmaxid() where other architectures set it and move x86 assertions to MI code. Reviewed by: kib (x86 part)	2015-11-08 14:26:50 +00:00
markj	a7bb6eb720	- Consistently use PROC_ASSERT_HELD() to verify that a process' hold count is non-zero. - Include the process address in the PROC_ASSERT_HELD() and PROC_ASSERT_NOT_HELD() assertion messages so that the corresponding process can be found easily when debugging. MFC after: 1 week	2015-11-08 01:38:56 +00:00
cem	c56fbbf66e	Flesh out sysctl types further (follow-up of r290475) Use the right intmax_t type instead of intptr_t in a few remaining places. Add support for CTLFLAG_TUN for the new fixed with types. Bruce will be upset that the new handlers silently truncate tuned quad-sized inputs, but so do all of the existing handlers. Add the new types to debug_dump_node, for whatever use that is. Bump FreeBSD_version again, for good measure. We are changing SYSCTL_HANDLER_ARGS and a member of struct sysctl_oid to intmax_t. Correct the sysctl typed NULL values for the fixed-width types. (Hat tip: hps@.) Suggested by: hps (partial) Sponsored by: EMC / Isilon Storage Division	2015-11-07 18:26:32 +00:00
adrian	4c139fdba3	Add a sched_yield() to work around low memory conditions in the current code. Things seem to get stuck in low memory conditions where no bufs are available, the reclamation path is called to wakeup the daemon, but no sleeping is done. Because of this, we are stuck in a tight loop in the current process and never run said reclamation path. This was introduced in r289279 . This is only a temporary workaround to restore system usefulness until the more permanent solutions can be found. Tested: * Carambola2, 64MB (and 32MB by manual config.)	2015-11-07 04:04:00 +00:00
cem	65f14c5699	Round out SYSCTL macros to the full set of fixed-width types Add S8, S16, S32, and U32 types; add SYSCTL() macros for them, as well as for the existing 64-bit types. (While SYSCTLQUAD and UQUAD macros already exist, they do not take the same sort of 'val' parameter that the other macros do.) Clean up the documented "types" in the sysctl.9 document. (These are macros and thus not real types, but the manual page documents intent.) The sysctl_add_oid(9) arg2 has been bumped from intptr_t to intmax_t to accommodate 64-bit types on 32-bit pointer architectures. This is just the kernel support piece; the userspace sysctl(1) support will follow in a later patch. Submitted by: Ravi Pokala <rpokala@panasas.com> Reviewed by: cem Relnotes: no Sponsored by: Panasas Differential Revision: https://reviews.freebsd.org/D4091	2015-11-07 01:43:01 +00:00
mjg	bc4e60465f	fd: implement kern.proc.nfds sysctl Intended purpose is to provide an equivalent of OpenBSD's getdtablecount syscall for the compat library..	2015-11-07 00:18:14 +00:00
jhb	78dafbc6fb	When dumping an rman in DDB, include the RID of each resource. Submitted by: Ravi Pokala (rpokala@panasas.com) Reviewed by: imp MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D4086	2015-11-05 23:12:23 +00:00
markj	70321eeb6d	Have elf_lookup() return an error if the specified non-weak symbol could not be found. Otherwise, relocations against such symbols will be silently ignored instead of causing an error to be raised. Reviewed by: kib MFC after: 1 week	2015-11-03 03:29:35 +00:00
ngie	c5650de671	Define `fhard` in pps_event(..) only when PPS_SYNC is defined to mute an -Wunused-but-set-variable warning Reported by: FreeBSD_HEAD_amd64_gcc4.9 jenkins job Sponsored by: EMC / Isilon Storage Division	2015-11-02 03:14:37 +00:00
ngie	c5d7b522c7	Define `compress` in `__elfN(coredump)` when #ifdef GZIO is true to mute an -Wunused-but-set-variable warning Reported by: FreeBSD_HEAD_amd64_gcc4.9 jenkins job Sponsored by: EMC / Isilon Storage Division	2015-11-02 01:47:26 +00:00
imp	6471aad35d	The error classification from lower layers is a poor indicator of whether an error is recoverable. Always re-dirty the buffer on errors from write requests. The invalidation we used to do for errors not EIO doesn't need to be done for a device that's really gone, since that's done in a different path. Reviewed by: mckusick@, kib@	2015-10-31 04:53:07 +00:00
kib	979d5cd34d	Minor (and incomplete) style cleanup. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-10-30 20:47:42 +00:00
kib	eddc63a998	Also mark compat32 umtx op table as constant. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-10-30 19:32:30 +00:00
kib	25e819cedd	Use C99 array initialization, which also makes the code self-documented, and eases addition of new ops. For the similar reasons, eliminate UMTX_OP_MAX. nitems() handles the only use of the symbol. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-10-30 19:20:40 +00:00
trasz	e52d6f1e4a	After r290196, the kernel won't wait for stuff like gmirror nodes if they are not required for mounting rootfs. However, it's possible that some setups try to mount them in mountcritlocal (ie from fstab). Export the list of current root mount holds using a new sysctl, vfs.root_mount_hold, and make mountcritlocal retry if "mount -a" fails and the list is not empty. MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D3709	2015-10-30 15:52:10 +00:00
trasz	3b970d7ccc	Make root mount wait mechanism smarter, by making it wait only if the root device doesn't yet exist. Reviewed by: kib@, marcel@ MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D3709	2015-10-30 15:35:04 +00:00
bdrewery	7d7f09674b	getnewbuf: Initialize bp to avoid uninitialized pointer dereference and brelse(). This came in recently in r289279. Coverity CID: 1331561	2015-10-29 19:02:24 +00:00
hselasky	881f337ccc	Add missing NULL check in physio(). When destroying a character device the si_devsw field is set to NULL before all references are gone, to indicate the character device is going away. This can cause a NULL-dereference fault inside physio(). The callers of physio() should own a thread reference on the cdev and if si_devsw is seen as non-NULL, it is usable during the execution of the function. Else an ENXIO error code is returned. Reviewed by: kib MFC after: 2 weeks	2015-10-29 13:53:37 +00:00
imp	20050b3d3c	Add a note to the effect that BUS_ADD_CHILD calls device_add_child_ordered to add the child. device_add_child_ordered doesn't call BUS_ADD_CHILD.	2015-10-28 18:53:18 +00:00
mckusick	485c0ddc22	Bring the tags and links entries for amd64 up to date. Based on how out of date it is, I doubt that anyone other than me and my code-reading students still use it.	2015-10-27 22:59:24 +00:00
pjd	10f57f392f	The aio_waitcomplete(2) syscall should not sleep when the given timeout is 0. Without this change it was sleeping for one tick. Maybe not a big deal, but it makes share/dtrace/blocking script to report that. Reviewed by: jhb Differential Revision: https://reviews.freebsd.org/D3814 Sponsored by: Wheel Systems, http://wheelsystems.com	2015-10-25 18:48:09 +00:00
cem	51e3af66cf	Sysctl: Add common support for U8, U16 types Sponsored by: EMC / Isilon Storage Division	2015-10-22 23:03:06 +00:00
jhb	88ad316e08	Missing regen after last change to sys/kern/syscalls.master.	2015-10-22 21:30:39 +00:00
jhb	9740ac3060	Rename remaining linux32 symbols such as linux_sysent[] and linux_syscallnames[] from linux_* to linux32_* to avoid conflicts with linux64.ko. While here, add support for linux64 binaries to systrace. - Update NOPROTO entries in amd64/linux/syscalls.master to match the main table to fix systrace build. - Add a special case for union l_semun arguments to the systrace generation. - The systrace_linux32 module now only builds the systrace_linux32.ko. module on amd64. - Add a new systrace_linux module that builds on both i386 and amd64. For i386 it builds the existing systrace_linux.ko. For amd64 it builds a systrace_linux.ko for 64-bit binaries. Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D3954	2015-10-22 21:28:20 +00:00
ed	8be8acd7af	Add a way to distinguish between forking and thread creation in schedtail. For CloudABI we need to initialize the registers of new threads differently based on whether the thread got created through a fork or through simple thread creation. Add a flag, TDP_FORKING, that is set by do_fork() and cleared by fork_exit(). This can be tested against in schedtail. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D3973	2015-10-22 09:33:34 +00:00
kib	f2fc8e1816	Trim spaces at end of line to record the proper commit message for r289660: Do not allow to execute ptrace(PT_TRACE_ME) when the process is already traced. Do not allow to execute ptrace(PT_TRACE_ME) when there is no parent which can trace the process, i.e. when the parent is already init. Note that after the PT_TRACE_ME request the process is unkillable and non-continuable until a debugger is attached, or parent is killed, the later clears P_TRACED state. Since init clearly would not debug the caller, and cannot be killed, disallow creation of unkillable processes. Reviewed by: jhb, pho Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D3908	2015-10-20 20:38:20 +00:00
kib	c517f862f2	Mark struct thread zone as type-stable. When establishing the locking state for several lock types (including blockable mutexes and sx) failed, locking primitives try to spin while the owner thread is running. The spinning loop performs the test for running condition by dereferencing the owner->td_state field of the owner thread. If the owner thread exited while spinner was put off the processor, it is harmless to access reused struct thread owner, since in some near future the current processor would notice the owner change and make appropriate progress. But it could be that the page which carried the freed struct thread was unmapped, then we fault (this cannot happen on amd64). For now, disallowing free of the struct thread seems to be good enough, and tests which create a lot of threads once, did not demonstrated regressions. Reviewed by: jhb, pho Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D3908	2015-10-20 20:29:21 +00:00
kib	791509a17b	Reviewed by: jhb, pho Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D3908	2015-10-20 20:22:57 +00:00
kib	f8f0302151	No need to dereference struct proc to pids when comparing processes for equality. Reviewed by: jhb, pho Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-10-20 20:12:42 +00:00
jhb	6db15b38fe	Switch pl_child_pid from int to pid_t. Reviewed by: emaste, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D3857	2015-10-20 17:58:21 +00:00
ian	e4cc0fc7e4	Fix printf format to allow for bus_size_t not being u_long on all platforms.	2015-10-20 03:25:17 +00:00
ngie	42e86731a7	Replace /dev/acd0 with /dev/cd1 atapicd(4) has been removed since r249083, and if a system has more than one optical drive, it will likely be /dev/cd1 Update mount.conf(8) to reflect the change in behavior MFC after: never Sponsored by: EMC / Isilon Storage Division	2015-10-17 08:51:10 +00:00
kib	0fdc52ff38	If falloc_caps() failed, cleanup needs to be performed. This is a bug in r289026. Sponsored by: The FreeBSD Foundation	2015-10-16 14:55:39 +00:00
kib	a6091af923	Allow PT_INTERP and PT_NOTES segments to be located anywhere in the executable image. Keep one page (arbitrary) limit on the max allowed size of the PT_NOTES. The ELF image activators still require that program headers of the executable are fully contained in the first page of the image file. Reviewed by: emaste, jhb Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D3871	2015-10-14 18:27:35 +00:00
jeff	4402204d47	Parallelize the buffer cache and rewrite getnewbuf(). This results in a 8x performance improvement in a micro benchmark on a 4 socket machine. - Get buffer headers from a per-cpu uma cache that sits in from of the free queue. - Use a per-cpu quantum cache in vmem to eliminate contention for kva. - Use multiple clean queues according to buffer cache size to eliminate clean queue lock contention. - Introduce a bufspace daemon that attempts to prevent getnewbuf() callers from blocking or doing direct recycling. - Close some bufspace allocation races that could lead to endless recycling. - Further the transition to a more modern style of small functions grouped by prefix in order to improve growing complexity. Sponsored by: EMC / Isilon Reviewed by: kib Tested by: pho	2015-10-14 02:10:07 +00:00
hiren	0d12306188	There are times when it would be really nice to have a record of the last few packets and/or state transitions from each TCP socket. That would help with narrowing down certain problems we see in the field that are hard to reproduce without understanding the history of how we got into a certain state. This change provides just that. It saves copies of the last N packets in a list in the tcpcb. When the tcpcb is destroyed, the list is freed. I thought this was likely to be more performance-friendly than saving copies of the tcpcb. Plus, with the packets, you should be able to reverse-engineer what happened to the tcpcb. To enable the feature, you will need to compile a kernel with the TCPPCAP option. Even then, the feature defaults to being deactivated. You can activate it by setting a positive value for the number of captured packets. You can do that on either a global basis or on a per-socket basis (via a setsockopt call). There is no way to get the packets out of the kernel other than using kmem or getting a coredump. I thought that would help some of the legal/privacy concerns regarding such a feature. However, it should be possible to add a future effort to export them in PCAP format. I tested this at low scale, and found that there were no mbuf leaks and the peak mbuf usage appeared to be unchanged with and without the feature. The main performance concern I can envision is the number of mbufs that would be used on systems with a large number of sockets. If you save five packets per direction per socket and have 3,000 sockets, that will consume at least 30,000 mbufs just to keep these packets. I tried to reduce the concerns associated with this by limiting the number of clusters (not mbufs) that could be used for this feature. Again, in my testing, that appears to work correctly. Differential Revision: D3100 Submitted by: Jonathan Looney <jlooney at juniper dot net> Reviewed by: gnn, hiren	2015-10-14 00:35:37 +00:00
trasz	3e6041333a	Change the default setting of kern.ipc.shm_allow_removed from 0 to 1. This removes the need for manually changing this flag for Google Chrome users. It also improves compatibility with Linux applications running under Linuxulator compatibility layer, and possibly also helps in porting software from Linux. Generally speaking, the flag allows applications to create the shared memory segment, attach it, remove it, and then continue to use it and to reattach it later. This means that the kernel will automatically "clean up" after the application exits. It could be argued that it's against POSIX. However, SUSv3 says this about IPC_RMID: "Remove the shared memory identifier specified by shmid from the system and destroy the shared memory segment and shmid_ds data structure associated with it." From my reading, we break it in any case by deferring removal of the segment until it's detached; we won't break it any more by also deferring removal of the identifier. This is the behaviour exhibited by Linux since... probably always, and also by OpenBSD since the following commit: revision 1.54 date: 2011/10/27 07:56:28; author: robert; state: Exp; lines: +3 -8; Allow segments to be used even after they were marked for deletion with the IPC_RMID flag. This is permitted as an extension beyond the standards and this is similar to what other operating systems like linux do. MFC after: 1 month Relnotes: yes Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D3603	2015-10-10 09:29:47 +00:00
trasz	887351d9dc	Provide better debug message on kernel module name clash. Reviewed by: kib@ MFC after: 1 month Sponsored by: The FreeBSD Foundation	2015-10-10 09:21:55 +00:00
trasz	9b6b02c02f	Remove root_mount_wait(). It's not used anywhere. Reviewed by: bapt@ MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D3787	2015-10-09 12:11:37 +00:00
kib	753bf3d282	Enforce the maxproc limitation before allocating struct proc, initial struct thread and kernel stack for the thread. Otherwise, a load similar to a fork bomb would exhaust KVA and possibly kmem, mostly due to the struct proc being type-stable. The nprocs counter is changed from being protected by allproc_lock sx to be an atomic variable. Note that ddb/db_ps.c:db_ps() use of nprocs was unsafe before, and is still unsafe, but it seems that the only possible undesired consequence is the harmless warning printed when allproc linked list length does not match nprocs. Diagnosed by: Svatopluk Kraus <onwahe@gmail.com> Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-10-08 11:07:09 +00:00
fabient	368842a107	Fix r283998 that broke mapin events for hwpmc. Reviewed by: jhb Sponsored by: Stormshield	2015-10-08 09:54:33 +00:00
glebius	daa604f294	Fix regression from r248371. We need to copy packet header to new mbuf. Unlike in the pre-r248371 code, assert that M_PKTHDR is set only on a first mbuf. Reported & tested by: Andriy Voskoboinyk <s3erios gmail.com> Sponsored by: Nginx, Inc.	2015-10-07 12:40:00 +00:00
jhb	b86a33b09a	Fix various edge cases related to system call tracing. - Always set td_dbg_sc_* when P_TRACED is set on system call entry even if the debugger is not tracing system call entries. This ensures the fields are valid when reporting other stops that occur at system call boundaries such as for PT_FOLLOW_FORKS or when only tracing system call exits. - Set TDB_SCX when reporting the stop for a new child process in fork_return(). This causes the event to be reported as a system call exit. - Report a system call exit event in fork_return() for new threads in a traced process. - Copy td_dbg_sc_* to new threads instead of zeroing. This ensures that td_dbg_sc_code in particular will report the system call that created the new thread or process when it reports a system call exit event in fork_return(). - Add new ptrace tests to verify that new child processes and threads report system call exit events with a valid pl_syscall_code via PT_LWPINFO. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D3822	2015-10-06 19:29:05 +00:00
cem	9c1e214f79	Fix core corruption caused by race in note_procstat_vmmap This fix is spiritually similar to r287442 and was discovered thanks to the KASSERT added in that revision. NT_PROCSTAT_VMMAP output length, when packing kinfo structs, is tied to the length of filenames corresponding to vnodes in the process' vm map via vn_fullpath. As vnodes may move during coredump, this is racy. We do not remove the race, only prevent it from causing coredump corruption. - Add a sysctl, kern.coredump_pack_vmmapinfo, to allow users to disable kinfo packing for PROCSTAT_VMMAP notes. This avoids VMMAP corruption and truncation, even if names change, at the cost of up to PATH_MAX bytes per mapped object. The new sysctl is documented in core.5. - Fix note_procstat_vmmap to self-limit in the second pass. This addresses corruption, at the cost of sometimes producing a truncated result. - Fix PROCSTAT_VMMAP consumers libutil (and libprocstat, via copy-paste) to grok the new zero padding. Reported by: pho (https://people.freebsd.org/~pho/stress/log/datamove4-2.txt) Relnotes: yes Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D3824	2015-10-06 18:07:00 +00:00
glebius	94c16d891e	Remove debugging variable from r143761.	2015-10-06 09:43:49 +00:00
jhb	e2e358d7ec	Include additional info in ptrace(2) KTR traces: - The new PC value and signal passed to PT_CONTINUE, PT_DETACH, PT_SYSCALL, and PT_TO_SC[EX]. - The system call code returned via PT_LWPINFO. MFC after: 1 week	2015-10-05 21:36:53 +00:00
markj	d6b5b38ff0	Revert r288628 and instead fix a discrepancy between the posix_fadvise(2) man page and POSIX: posix_fadvise(2) returns an error number on failure. Reported by: jilles MFC after: 1 week	2015-10-03 22:27:14 +00:00
markj	f73def85e6	The return value of posix_fadvise(2) is just an error status, so sys_posix_fadvise() should simply return the errno (or 0) to syscallenter() rather than setting a return value. MFC after: 1 week	2015-10-03 19:37:41 +00:00
alc	df033a6930	Perform a single batched update to the object's paging-in-progress count rather than updating it for each page.	2015-10-03 17:04:52 +00:00
phk	0c2e89c505	Fail the sbuf if vsnprintf(3) fails.	2015-10-02 09:23:14 +00:00
markj	49cededc7e	Ensure that vop_stdadvise() does not call getblk() on vnodes that have an empty bufobj. Otherwise, vnodes belonging to filesystems that do not use the buffer cache may trigger assertion failures. Reported by: Fabien Keil	2015-10-01 16:34:53 +00:00
cperciva	8e136c4370	Disable suspend when we're shutting down. This solves the "tell FreeBSD to shut down; close laptop lid" scenario which otherwise tended to end with a laptop overheating or the battery dying. The implementation uses a new sysctl, kern.suspend_blocked; init(8) sets this while rc.suspend runs, and the ACPI sleep code ignores requests while the sysctl is set. Discussed on: freebsd-acpi (35 emails) MFC after: 1 week	2015-10-01 10:52:26 +00:00
markj	6348241c12	As a step towards the elimination of PG_CACHED pages, rework the handling of POSIX_FADV_DONTNEED so that it causes the backing pages to be moved to the head of the inactive queue instead of being cached. This affects the implementation of POSIX_FADV_NOREUSE as well, since it works by applying POSIX_FADV_DONTNEED to file ranges after they have been read or written. At that point the corresponding buffers may still be dirty, so the previous implementation would coalesce successive ranges and apply POSIX_FADV_DONTNEED to the result, ensuring that pages backing the dirty buffers would eventually be cached. To preserve this behaviour in an efficient manner, this change adds a new buf flag, B_NOREUSE, which causes the pages backing a VMIO buf to be placed at the head of the inactive queue when the buf is released. POSIX_FADV_NOREUSE then works by setting this flag in bufs that underlie the specified range. Reviewed by: alc, kib Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D3726	2015-09-30 23:06:29 +00:00
avg	425c0bb088	save some bytes by using more concise SDT_PROBE<n> instead of SDT_PROBE SDT_PROBE requires 5 parameters whereas SDT_PROBE<n> requires n parameters where n is typically smaller than 5. Perhaps SDT_PROBE should be made a private implementation detail. MFC after: 20 days	2015-09-28 12:14:16 +00:00
jeff	5ba0fc23d6	- Collapse vfs_vmio_truncate & vfs_vmio_release into a single function. - Allow vfs_vmio_invalidate() to free the pages, leaving us with a single loop and bufobj lock when B_NOCACHE/B_INVAL is used. - Eliminate the special B_ASYNC handling on free that has not been relevant for some time. - Remove the extraneous page busy from vfs_vmio_truncate(). Reviewed by: kib Tested by: pho Sponsored by: EMC / Isilon storage division	2015-09-27 05:16:06 +00:00
markj	bb2ac24a51	Remove a check for a condition that is always false by a preceding KASSERT that was added in r144704.	2015-09-26 22:26:55 +00:00
markj	3eefaf542b	Fix argument ordering in vn_printf(). MFC after: 3 days	2015-09-26 22:16:54 +00:00
cem	d13a26b53a	sbuf: Process more than one char at a time Revamp sbuf_put_byte() to sbuf_put_bytes() in the obvious fashion and fixup callers. Add a thin shim around sbuf_put_bytes() with the old ABI to avoid ugly changes to some callers. Reviewed by: jhb, markj Obtained from: Dan Sledz Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D3717	2015-09-25 18:37:14 +00:00
kib	a20643c9c7	Use per-cpu values for base and last in tc_cpu_ticks(). The values are updated lockess, different CPUs write its own view of timecounter state. The critical section is done for safety, callers of tc_cpu_ticks() are supposed to already enter critical section, or to own a spinlock. The change fixes sporadical reports of too high values reported for the (W)CPU on platforms that do not provide cpu ticker and use tc_cpu_ticks(), in particular, arm*. Diagnosed and reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-09-25 13:03:57 +00:00
mjg	93991d4618	kqueue: simplify kern_kqueue by not refing/unrefing creds too early No functional changes.	2015-09-23 12:45:08 +00:00
jeff	9a362d63de	- Fix a nonsense reordering that somehow slipped into my last diff. Reported by: pho	2015-09-23 07:44:07 +00:00
jeff	b17d09da88	Some refactoring of the buf/vm interface. - Eliminate bogus page replacement that is inconsistently applied in the invalidation loop in brelse. This has been a no-op in modern times as biodone() is responsible for cleaning up after bogus pages. This would've spammed the console with printfs at a minimum. - Allow the compiler and human readers alike to reason about allocbuf() by splitting it into constituent parts. - Separate the VM manipulating and buf manipulating code in brelse() and bufdone() so that the intentions are clear. This makes it evident that there are several duplicated buf pages loops that will be consolidated at a later time. Reviewed by: kib Tested by: pho Sponsored by: EMC / Isilon Storage Division	2015-09-22 23:57:52 +00:00
alc	7567a2924b	Change vm_page_unwire() such that it (1) accepts PQ_NONE as the specified queue and (2) returns a Boolean indicating whether the page's wire count transitioned to zero. Exploit this change in vfs_vmio_release() to avoid pointlessly enqueueing a page that is about to be freed. (An earlier version of this change was developed by attilio@ and kmacy@. Any errors in this version are my own.) Reviewed by: kib Sponsored by: EMC / Isilon Storage Division	2015-09-22 18:16:52 +00:00
hselasky	d198450d08	Revert r287780 until more developers have their say. Differential Revision: https://reviews.freebsd.org/D3521 Requested by: gnn	2015-09-22 06:51:55 +00:00
bdrewery	2be5ce0af5	vfs_mountroot_shuffle() never returns non-zero.	2015-09-22 03:34:07 +00:00
kib	9704d25bc3	Ensure that maxproc does not exceed pid_max, at the time of boot. Changes to kern.pid_max mib after the boot can break this relation. The maxfiles value was calculated by the MAXFILES formula based on maxproc value, but this change decouples them, and MAXFILES now references maxusers. Without manual tuning, the maxfiles default value remains as it was prior to this commit. But for systems which have tuned maxproc and rely on maxfiles to adjust, additional reconfiguration is needed. Reported by: rwatson Reviewed by: emaste Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-09-21 15:02:59 +00:00
kib	518734671f	Add support for weak symbols to the kernel linkers. It means that linkers no longer raise an error when undefined weak symbols are found, but relocate as if the symbol value was 0. Note that we do not repeat the mistake of userspace dynamic linker of making the symbol lookup prefer non-weak symbol definition over the weak one, if both are available. In fact, kernel linker uses the first definition found, and ignores duplicates. Signature of the elf_lookup() and elf_obj_lookup() functions changed to split result/error code and the symbol address returned. Otherwise, it is impossible to return zero address as the symbol value, to MD relocation code. This explains the mechanical changes in elf_machdep.c sources. The powerpc64 R_PPC_JMP_SLOT handler did not checked error from the lookup() call, the patch leaves the code as is (untested). Reported by: glebius Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-09-20 01:27:59 +00:00
trasz	bd8e12dd02	Kernel part of reroot support - a way to change rootfs without reboot. Note that the mountlist manipulations are somewhat fragile, and not very pretty. The reason for this is to avoid changing vfs_mountroot(), which is (obviously) rather mission-critical, but not very well documented, and thus hard to test properly. It might be possible to rework it to use its own simple root mount mechanism instead of vfs_mountroot(). Reviewed by: kib@ MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D2698	2015-09-18 17:32:22 +00:00
jhb	9d7ff3e89e	Always clear TDB_USERWR before fetching system call arguments. The TDB_USERWR flag may still be set after a debugger detaches from a process via PT_DETACH. Previously the flag would never be cleared forcing a double fetch of the system call arguments for each system call. Note that the flag cannot be cleared at PT_DETACH time in case one of the threads in the process is currently stopped in syscallenter() and the debugger has modified the arguments for that pending system call before detaching. Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D3678	2015-09-16 20:55:00 +00:00
jhb	d32c347344	When a process group leader exits, all of the processes in the group are sent SIGHUP and SIGCONT if any of the processes are stopped. Currently this behavior is triggered for any type of process stop including ptrace() stops and transient stops for single threading during exit() and execve(). Thus, if a debugger is attached to a process in a group when the leader exits, the entire group can be HUPed. Instead, only send the signals if a process in the group is stopped due to SIGSTOP. PR: 201149 Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D3681	2015-09-16 16:40:07 +00:00
mjg	8ada4c3bbf	sysctl: switch sysctllock to a sleepable rmlock, take 2 This restores r285125. Previous attempt was reverted due to a bug in rmlocks, which is fixed since r287833.	2015-09-15 23:06:56 +00:00
jhb	d7f36d77c9	Threads holding a read lock of a sleepable rm lock are not permitted to sleep. The rmlock implementation enforces this by disabling sleeping when a read lock is acquired. To simplify the implementation, sleeping is disabled for most of the duration of rm_rlock. However, it doesn't need to be disabled until the lock is acquired. If a sleepable rm lock is contested, then rm_rlock may need to acquire the backing sx lock. This tripped the overly-broad assertion. Fix by relaxing the assertion around the call to sx_xlock(). Reported by: mjg Reviewed by: kib, mjg MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D3324	2015-09-15 22:16:21 +00:00
cem	9fe341127e	kevent(2): Note DOOMED vnodes with NOTE_REVOKE In poll mode, check for and wake VBAD vnodes. (Vnodes that are VBAD at registration will never be woken by the RECLAIM trigger.) Add post-VOP_RECLAIM hook to trigger notes on vnode reclamation. (Vnodes that were fine at registration but are vgoned while being monitored should signal waiters.) Reviewed by: kib Approved by: markj (mentor) Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D3675	2015-09-15 20:22:30 +00:00
hselasky	1a99c0928d	Implement callout_drain_async(), inspired by the projects/hps_head branch. This function is used to drain a callout via a callback instead of blocking the caller until the drain is complete. Refer to the callout_drain_async() manual page for a detailed description. Limitation: If a lock is used with the callout, the callout can only be drained asynchronously one time unless the callout_init_mtx() function is called again. This limitation is not present in projects/hps_head and will require more invasive changes to the timeout code, which was not in the scope of this patch. Differential Revision: https://reviews.freebsd.org/D3521 Reviewed by: wblock MFC after: 1 month	2015-09-14 10:52:26 +00:00
imp	dbff665874	bufdonebio is now unused. Retire it too.	2015-09-11 04:20:04 +00:00
markj	e8967c8bd9	Add stack_save_td_running(), a function to trace the kernel stack of a running thread. It is currently implemented only on amd64 and i386; on these architectures, it is implemented by raising an NMI on the CPU on which the target thread is currently running. Unlike stack_save_td(), it may fail, for example if the thread is running in user mode. This change also modifies the kern.proc.kstack sysctl to use this function, so that stacks of running threads are shown in the output of "procstat -kk". This is handy for debugging threads that are stuck in a busy loop. Reviewed by: bdrewery, jhb, kib Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D3256	2015-09-11 03:54:37 +00:00
imp	9fcb42ef36	dev_strategy and dev_strategy_csw are unused since r281825. Remove them. Differential Revision: https://reviews.freebsd.org/D3620	2015-09-11 00:38:58 +00:00
adrian	73868b5389	Also make kern.maxfilesperproc a boot time tunable. Auto-tuning threshold discussions aside, it turns out that if you want to lower this on say, rather memory-packed machines, you either set maxusers or kern.maxfiles, or you set it in sysctl. The former is a non-exact way to tune this; the latter doesn't actually affect anything in the startup scripts. This first occured because I wondered why the hell screen would take upwards of 10 seconds to spawn a new screen. I then found python doing the same thing during fork/exec of child processes - it calls close() on each FD up to the current openfiles limit. On a 1TB machine this is like, 26 million FDs per process. Ugh. So: * This allows it to be set early in /boot/loader.conf; * It can be used to work around the ridiculous situation of screen, python, etc doing a close() on potentially millions of FDs even though you only have four open. Tested: * 4GB, 32GB, 64GB, 128GB, 384GB, 1TB systems with autotune, ensuring screen and python forking doesn't result in some pretty hilariously bad behaviour. TODO: * Note that the default login.conf sets openfiles-cur to unlimited, effectively obeying kern.maxfilesperproc. Perhaps we should fix this. * .. and even if we do, we need to also ensure that daemons get a soft limit of something reasonable and capped - they can request more FDs themselves. MFC after: 1 week Sponsored by: Norse Corp, Inc.	2015-09-10 04:05:58 +00:00
kib	9f66ca5a18	For open("name", O_DIRECTORY \| O_CREAT), do not try to create the named node, open(2) cannot create directories. But do allow the flag combination to succeed if the directory already exists. Declare the open("name", O_DIRECTORY \| O_CREAT \| O_EXCL) always invalid for the same reason, since open(2) cannot create directory. Note that there is an argument that O_DIRECTORY \| O_CREAT should be invalid always, regardless of the target directory existence or O_EXCL. The current fix is conservative and allows the call to succeed in the situation where it succeeded before the patch. Reported by: Tom Ridge <freebsd@tom-ridge.com> Reviewed by: rwatson PR: 202892 Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-09-09 19:31:08 +00:00
mjg	519b1a7211	fd: make rights a mandatory argument to fgetvp_rights The only caller already always passes rights.	2015-09-07 20:05:56 +00:00
mjg	cc8534cb73	fd: make the common case in filecaps_copy work lockless The filedesc lock is only needed if ioctls caps are present, which is a rare situation. This is a step towards reducing the scope of the filedesc lock.	2015-09-07 20:02:56 +00:00
cem	a8fae65d69	Follow-up to r287442: Move sysctl to compiled-once file Avoid duplicate sysctl nodes. Found by: tijl Approved by: markj (mentor) Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D3586	2015-09-07 16:44:28 +00:00
mckusick	17357462a0	Track changes to kern.maxvnodes and appropriately increase or decrease the size of the name cache hash table (mapping file names to vnodes) and the vnode hash table (mapping mount point and inode number to vnode). An appropriate locking strategy is the key to changing hash table sizes while they are in active use. Reviewed by: kib Tested by: Peter Holm Differential Revision: https://reviews.freebsd.org/D2265 MFC after: 2 weeks	2015-09-06 05:50:51 +00:00
delphij	db0a2c953d	Expose an interface to determine if an ACE is inherited. Submitted by: sef Reviewed by: trasz MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D3540	2015-09-04 00:14:20 +00:00
cem	f96df638b8	Detect badly behaved coredump note helpers Coredump notes depend on being able to invoke dump routines twice; once in a dry-run mode to get the size of the note, and another to actually emit the note to the corefile. When a note helper emits a different length section the second time around than the length it requested the first time, the kernel produces a corrupt coredump. NT_PROCSTAT_FILES output length, when packing kinfo structs, is tied to the length of filenames corresponding to vnodes in the process' fd table via vn_fullpath. As vnodes may move around during dump, this is racy. So: - Detect badly behaved notes in putnote() and pad underfilled notes. - Add a fail point, debug.fail_point.fill_kinfo_vnode__random_path to exercise the NT_PROCSTAT_FILES corruption. It simply picks random lengths to expand or truncate paths to in fo_fill_kinfo_vnode(). - Add a sysctl, kern.coredump_pack_fileinfo, to allow users to disable kinfo packing for PROCSTAT_FILES notes. This should avoid both FILES note corruption and truncation, even if filenames change, at the cost of about 1 kiB in padding bloat per open fd. Document the new sysctl in core.5. - Fix note_procstat_files to self-limit in the 2nd pass. Since sometimes this will result in a short write, pad up to our advertised size. This addresses note corruption, at the risk of sometimes truncating the last several fd info entries. - Fix NT_PROCSTAT_FILES consumers libutil and libprocstat to grok the zero padding. With suggestions from: bjk, jhb, kib, wblock Approved by: markj (mentor) Relnotes: yes Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D3548	2015-09-03 20:32:10 +00:00
mjg	0c1fc3bcd2	fd: remove UMA_ZONE_ZINIT argument from Files zone Originally it was added in order to prevent trashing of objects with INVARIANTS enabled. The same effect is now provided with mere UMA_ZONE_NOFREE. This reverts r286921. Discussed with: kib	2015-09-02 23:14:39 +00:00
jhb	0e4166b711	The 'sa' argument to syscallret() is not unused.	2015-09-01 22:28:23 +00:00
jhb	5c94ee0044	Export current system call code and argument count for system call entry and exit events. procfs stop events for system call tracing report these values (argument count for system call entry and code for system call exit), but ptrace() does not provide this information. (Note that while the system call code can be determined in an ABI-specific manner during system call entry, it is not generally available during system call exit.) The values are exported via new fields at the end of struct ptrace_lwpinfo available via PT_LWPINFO. Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D3536	2015-09-01 22:24:54 +00:00
kib	7f517dc1f3	Exit notification for EVFILT_PROC removes knote from the knlist. In particular, this invalidates the knote kn_link linkage, making the SLIST_FOREACH() loop accessing undefined values (e.g. trashed by QUEUE_MACRO_DEBUG). If the knote is freed by other thread when kq lock is released or when influx is cleared, e.g. by knote_scan() for kqueue owning the knote, the iteration step would access freed memory. Use SLIST_FOREACH_SAFE() to fix iteration. Diagnosed by: avg Tested by: avg, lstewart, pawel Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-09-01 14:05:29 +00:00
kib	cb97f9b456	Clean up the kqueue use of the uma KPI. Explain why it is fine to not check for M_NOWAIT failures in kqueue_register(). Remove unneeded check for NULL result from waitable allocation in kqueue_scan(). uma_free(9) handles NULL argument correctly, remove checks for NULL. Remove useless cast and adjust style in knote_alloc(). Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-09-01 13:21:32 +00:00
avg	a047794b25	callout_reset: fix a reversed check for cc_exec_cancel The typo was introduced in r278469 / `344ecf88af`. As a result of the bug there was a timing window where callout_reset() would fail to cancel a concurrent execution of a callout that is about to start and would schedule the callout again. The callout would fire more times than it is scheduled. That would happen even if the callout is initialized with a lock. For example, the bug triggered the "Stray timeout" assertion in taskqueue_timeout_func(). MFC after: 5 days	2015-09-01 09:27:14 +00:00
kib	b1635150b8	Use P1B_PRIO_MAX to designate max posix priority for the RR/FIFO scheduler types. It was intended to be used there, compare with the min value, and with the test for correctness in ksched_setscheduler(). Note that P1B_PRIO_MAX and RTP_PRIO_MAX do have the same numerical values, the change is cosmetical. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-08-30 18:02:57 +00:00
kib	5b7cf788aa	Remove single-use macros obfuscating malloc(9) and free(9) calls. Style. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-08-30 17:58:11 +00:00
jch	f011b0c89e	Revert r286880: If at first this change made sense, it turns out it helps only the TCP timers callout(9) usage. As the benefit for others callout(9) usages did not reach a consensus the historical usage should prevail. Differential Revision: https://reviews.freebsd.org/D3078	2015-08-30 13:44:46 +00:00
imp	ae67ea5282	Remove now obsolete comment. MFC After: 2 days	2015-08-28 20:06:58 +00:00
imp	48cf3d6ef2	Per overwhelming sentiment in the code review, use FEATURE instead. Differential Revision: https://reviews.freebsd.org/D3488 MFC After: 2 days	2015-08-28 19:53:19 +00:00
ed	066f63003b	Decompose linkat()/renameat() rights to source and target. To make it easier to understand how Capsicum interacts with linkat() and renameat(), rename the rights to CAP_{LINK,RENAME}AT_{SOURCE,TARGET}. This also addresses a shortcoming in Capsicum, where it isn't possible to disable linking to files stored in a directory. Creating hardlinks essentially makes it possible to access files with additional rights. Reviewed by: rwatson, wblock Differential Revision: https://reviews.freebsd.org/D3411	2015-08-27 15:16:41 +00:00
jch	761d6ba633	Silent a compilation warning on callout_stop()	2015-08-27 10:43:35 +00:00
jch	4015a832a3	In callout_stop(), do not forget to initialize not_running variable. Thanks to hselasky for noticing that. Differential Revision: https://reviews.freebsd.org/D3078 (Updated) Submitted by: hselasky Pointy hat to: jch	2015-08-27 08:58:03 +00:00
jch	433b16759b	In callout_stop(), if a callout is both pending and currently being serviced return 0 (fail) but it is applicable only mpsafe callouts. Thanks to hselasky for finding this. Differential Revision: https://reviews.freebsd.org/D3078 (Updated) Submitted by: hselasky Reviewed by: jch	2015-08-27 08:15:32 +00:00
marcel	22e802147b	An error of -1 from parse_mount() indicates that the specification was invalid. Don't trigger a mount failure (which by default means a panic), but instead just move on to the next directive in the configuration. This typically has us ask for the root mount. PR: 163245	2015-08-27 04:25:27 +00:00
imp	b82b5280a1	When the kernel is compiled with INVARIANTS, export that as debug.invariants. Differential Revision: https://reviews.freebsd.org/D3488 MFC after: 3 days	2015-08-26 23:58:03 +00:00
gnn	304a5ae6cb	Summary: Add the interactivity equations to the header comment for our interactivity calculation routine. Suggested by: rwatson	2015-08-26 16:36:41 +00:00
trasz	c1207ee9b8	Make vfs_unmountall() unmount /dev after /, not before. The only reason this didn't result in an unclean shutdown is that devfs ignores MNT_FORCE flag. Reviewed by: kib@ MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D3467	2015-08-24 13:18:13 +00:00
trasz	9e31188bdc	After r286237 it should be fine to call vgone(9) on a busy GEOM vnode; remove KASSERT that would prevent forced devfs unmount from working. MFC after: 1 month Sponsored by: The FreeBSD Foundation	2015-08-23 14:53:54 +00:00
royger	5b319fbe38	preload_search_info: make sure mod is set Add a check to preload_search_info to make sure mod is set. Most of the callers of preload_search_info don't check that the mod parameter is set, which can cause page faults. While at it, remove some now unnecessary checks before calling preload_search_info. Sponsored by: Citrix Systems R&D Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D3440	2015-08-21 15:57:57 +00:00
kib	fd746f6246	If process becomes reaper (procctl(PROC_REAP_ACQUIRE)) while already having some children, the children' reaper is not reset to the parent. This allows for the situation where reaper has children but not descendands and the too strict asserts in the reap_status() fire. Remove the wrong asserts, add some clarification for the situation to the procctl(2) REAP_STATUS. Reported and tested by: feld Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-08-20 22:44:26 +00:00
kib	730b59f53e	fget_unlocked() depends on the freed struct file f_count field being zero. The file_zone if no-free, but r284861 added trashing of the freed memory. Most visible manifestation of the issue were 'memory modified after free' panics for the file zone, triggered from falloc_noinstall(). Add UMA_ZONE_ZINIT flag to turn off trashing. Mjg noted that it makes sense to not trash freed memory for any non-free zone, which will be done later. Reported and tested by: pho Discussed with: mjg Sponsored by: The FreeBSD Foundation	2015-08-19 11:53:32 +00:00
jch	7242654eab	callout_stop() should return 0 (fail) when the callout is currently being serviced and indeed unstoppable. A scenario to reproduce this case is: - the callout is being serviced and at same time, - callout_reset() is called on this callout that sets the CALLOUT_PENDING flag and at same time, - callout_stop() is called on this callout and returns 1 (success) even if the callout is indeed currently running and unstoppable. This issue was caught up while making r284245 (D2763) workaround, and was discussed at BSDCan 2015. Once applied the r284245 workaround is not needed anymore and will be reverted. Differential Revision: https://reviews.freebsd.org/D3078 Reviewed by: jhb Sponsored by: Verisign, Inc.	2015-08-18 10:15:09 +00:00
rpaulo	32e921b2a6	genassym.sh: call nm(1) with NMFLAGS.	2015-08-14 22:57:13 +00:00
ian	e065188170	If a specific timecounter has been chosen via sysctl, and a new timecounter with higher quality registers (presumably in a module that has just been loaded), do not undo the user's choice by switching to the new timecounter. Document that behavior, and also the fact that there is no way to unregister a timecounter (and thus no way to unload a module containing one).	2015-08-12 20:50:20 +00:00

1 2 3 4 5 ...

14703 Commits