freebsd-nq

Author	SHA1	Message	Date
John Baldwin	24150d37d3	- Fix a couple of inverted panic messages for shared/exclusive mismatches of a lock within a single thread. - Fix handling of interlocks in WITNESS by properly requiring the interlock to be held exactly once if it is specified.	2013-06-03 17:41:11 +00:00
John Baldwin	95d28652af	- Handle the recursed/not recursed flags with RA_RLOCKED in rw_assert(). - Tweak a panic message.	2013-06-03 17:38:57 +00:00
Konstantin Belousov	d39116f5d5	Be more generous when donating the current thread time to the owner of the vnode lock while iterating over the free vnode list. Instead of yielding, pause for 1 tick. The change is reported to help in some virtualized environments. Submitted by: Roger Pau Monn? <roger.pau@citrix.com> Discussed with: jilles Tested by: pho MFC after: 2 weeks	2013-06-03 17:36:43 +00:00
Konstantin Belousov	1e65d73c74	Do not map the shared page COW. If the process wired its address space, fork(2) would cause shadowing of the physical object and copying of the shared page into private copy, effectively preventing updates for the exported timehands structure and stopping the clock. Specify the maximum allowed permissions for the page to be read and execute, preventing write from the user mode. Reported and tested by: <huanghwh@yahoo.com> Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2013-06-03 04:32:53 +00:00
Konstantin Belousov	92fab43f7f	When auto-sizing the buffer cache, limit the amount of physical memory used as the estimation of size, to 32GB. This provides around 100K of buffer headers and corresponding KVA for buffer map at the peak. Sizing the cache larger is not useful, also resulting in the wasting and exhausting of KVA for large machines. Reported and tested by: bdrewery Sponsored by: The FreeBSD Foundation	2013-06-03 04:16:48 +00:00
Alan Cox	39a4cd0cec	Reduce the scope of the VM object locking in brelse(). In my tests, this change reduced the total number of VM object lock acquisitions by brelse() by 74%. Sponsored by: EMC / Isilon Storage Division	2013-06-02 16:18:03 +00:00
Marius Strobl	0ad17e4b32	Move an assertion to the right spot; only bus_dmamap_load_mbuf(9) requires a pkthdr being present but that's not the case for either _bus_dmamap_load_mbuf_sg() or bus_dmamap_load_mbuf_sg(9). Reported by: sbruno MFC after: 1 week	2013-06-01 11:42:47 +00:00
John Baldwin	3d4c503cf0	Style fixes to vn_ioctl(). Suggested by: bde	2013-05-31 16:15:22 +00:00
Jeff Roberson	22a722605d	- Convert the bufobj lock to rwlock. - Use a shared bufobj lock in getblk() and inmem(). - Convert softdep's lk to rwlock to match the bufobj lock. - Move INFREECNT to b_flags and protect it with the buf lock. - Remove unnecessary locking around bremfree() and BKGRDINPROG. Sponsored by: EMC / Isilon Storage Division Discussed with: mckusick, kib, mdf	2013-05-31 00:43:41 +00:00
Julian Elischer	4591f0d339	Initialising the new fibnum field to a known value turns out to be a GOOD IDEA (TM). Apparently MOST users set this (e.g. tcp and friends) but there are a few users that just assume that it is a sensible value but then go on to read it. These include SCTP, pf and the FLOWTABLE option (and maybe others).	2013-05-24 02:18:37 +00:00
Lawrence Stewart	7639c9be45	Ensure alq's shutdown_pre_sync event handler is deregistered on module unload to avoid a dangling pointer and eventual panic on system shutdown. Reported by: Ali <comnetboy at gmail.com> Tested by: Ali <comnetboy at gmail.com> MFC after: 1 week	2013-05-24 00:49:12 +00:00
Pawel Jakub Dawidek	92981fdf9e	Use proper malloc type for ioctls white-list. Reported by: pho Tested by: pho	2013-05-23 21:07:26 +00:00
Luigi Rizzo	4b62214f4a	Increase the (arbitrary) limit for the number of packets per tick from 1k to 20k The previous value was good 10 years ago, but not anymore now. More importantly, lots of good surprises: polling is incredibly effective under virtualization, and not only prevents livelock but also saves most of the VM exit overhead in receive mode. Using polling, a FreeBSD instance under qemu-kvm remains perfectly responsive even when bombed with 10 Mpps over an emulated e1000, and happily processes 1.7 Mpps through ipfw. Note that some incompatibilities still remain: e.g. polling is not (yet) compatible with netmap, and seems to freeze the guest when kern.polling.idle_poll=1 MFC after: 3 days	2013-05-22 16:32:18 +00:00
Mateusz Guzik	ecbb2a1819	passing fd over unix socket: fix a corner case where caller wants to pass no descriptors. Previously the kernel would leak memory and try to free a potentially arbitrary pointer. Reviewed by: pjd	2013-05-21 21:58:00 +00:00
Attilio Rao	bed927ee17	vm_object locking is not needed there as pages are already wired. Sponsored by: EMC / Isilon storage division Submitted by: alc	2013-05-21 20:54:03 +00:00
Konstantin Belousov	f85769eb75	Regenerate.	2013-05-21 11:41:08 +00:00
Konstantin Belousov	48947eccee	Fix the wait6(2) on 32bit architectures and for the compat32, by using the right type for the argument in syscalls.master. Also fix the posix_fallocate(2) and posix_fadvise(2) compat32 syscalls on the architectures which require padding of the 64bit argument. Noted and reviewed by: jhb Pointy hat to: kib MFC after: 1 week	2013-05-21 11:40:16 +00:00
Pawel Jakub Dawidek	9b9ff7d390	Style nits.	2013-05-19 23:30:24 +00:00
Pawel Jakub Dawidek	9b1040a574	Use SDT_PROBE1() instead of SDT_PROBE().	2013-05-19 23:29:22 +00:00
Jamie Gritton	761d2bb5b9	Refine the "nojail" rc keyword, adding "nojailvnet" for files that don't apply to most jails but do apply to vnet jails. This includes adding a new sysctl "security.jail.vnet" to identify vnet jails. PR: conf/149050 Submitted by: mdodd MFC after: 3 days	2013-05-19 04:10:34 +00:00
Attilio Rao	e3ed7ff03f	Use readlocking now that assertions on vm_page_lookup() are relaxed. Sponsored by: EMC / Isilon storage division Reviewed by: alc Tested by: flo, pho	2013-05-17 20:03:55 +00:00
Jaakko Heinonen	c532f8c4d6	A library function shall not set errno to 0. Reviewed by: mdf	2013-05-16 18:13:10 +00:00
Jeff Roberson	f2cc1285c2	- Add a new general purpose path-compressed radix trie which can be used with any structure containing a uint64_t index. The tree code auto-generates type safe wrappers. - Eliminate the buf splay and replace it with pctrie. This is not only significantly faster with large files but also allows for the possibility of shared locking. Reviewed by: alc, attilio Sponsored by: EMC / Isilon Storage Division	2013-05-12 04:05:01 +00:00
Konstantin Belousov	0fc6daa72d	- Fix nullfs vnode reference leak in nullfs_reclaim_lowervp(). The null_hashget() obtains the reference on the nullfs vnode, which must be dropped. - Fix a wart which existed from the introduction of the nullfs caching, do not unlock lower vnode in the nullfs_reclaim_lowervp(). It should be innocent, but now it is also formally safe. Inform the nullfs_reclaim() about this using the NULLV_NOUNLOCK flag set on nullfs inode. - Add a callback to the upper filesystems for the lower vnode unlinking. When inactivating a nullfs vnode, check if the lower vnode was unlinked, indicated by nullfs flag NULLV_DROP or VV_NOSYNC on the lower vnode, and reclaim upper vnode if so. This allows nullfs to purge cached vnodes for the unlinked lower vnode, avoiding excessive caching. Reported by: G??ran L??wkrantz <goran.lowkrantz@ismobile.com> Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2013-05-11 11:17:44 +00:00
Eitan Adler	7a2b450ff8	Fxi a bunch of typos. PR: misc/174625 Submitted by: Jeremy Chadwick <jdc@koitsu.org>	2013-05-10 16:41:26 +00:00
Marcel Moolenaar	e63091ea6c	Add option WITNESS_NO_VNODE to suppress printing LORs between VNODE locks. To support this, VNODE locks are created with the LK_IS_VNODE flag. This flag is propagated down using the LO_IS_VNODE flag. Note that WITNESS still records the LOR. Only the printing and the optional entering into the kernel debugger is bypassed with the WITNESS_NO_VNODE option.	2013-05-09 16:28:18 +00:00
Konstantin Belousov	3328431dec	Item 1 in r248830 causes earlier exits from the sendfile(2), before all requested data was sent. The reason is that xfsize <= 0 condition must not be tested at all if space == loopbytes. Otherwise, the done is set to 1, and sendfile(2) is aborted too early. Instead of moving the condition to exiting the inner loop after the xfersize check, directly check for the completed transfer before the testing of the available space in the socket buffer, and revert item 1 of r248830. It is arguably another bug to sleep waiting for socket buffer space (or return EAGAIN for non-blocking socket) if all bytes are already transferred. Reported by: pho Discussed with: scottl, gibbs Tested by: scottl (stable/9 backport), pho	2013-05-09 16:05:51 +00:00
Andre Oppermann	6753da1356	When the accept queue is full print the number of already pending new connections instead of by how many we're over the limit, which is always 1. Noticed by: jmallet MFC after: 1 week	2013-05-08 14:13:14 +00:00
Scott Long	ab8f55b9fd	Add a sysctl vfs.read_min to complement the exiting vfs.read_max. It defaults to 1, meaning that it's off. When read-ahead is enabled on a file, the vfs cluster code deliberately breaks a read into 2 I/O transactions; one to satisfy the actual read, and one to perform read-ahead. This makes sense in low-latency circumstances, but often produces unbalanced i/o transactions that penalize disks. By setting vfs.read_min, we can tell the algorithm to fetch a larger transaction that what we asked for, achieving the same effect as the read-ahead but without the doubled, unbalanced transaction and the slightly lower latency. This significantly helps our workloads with video streaming. Submitted by: emax Reviewed by: kib Obtained from: Netflix	2013-05-07 08:16:21 +00:00
Andre Oppermann	f89d4c3acf	Back out r249318, r249320 and r249327 due to a heisenbug most likely related to a race condition in the ipi_hash_lock with the exact cause currently unknown but under investigation.	2013-05-06 16:42:18 +00:00
Matthew D Fleming	770b41b349	Add missing vdrop() in error case. Submitted by: Fahad (mohd.fahadullah@isilon.com) MFC after: 1 week	2013-05-04 18:38:16 +00:00
John Baldwin	958aa57537	Similar to 233760 and 236717, export some more useful info about the kernel-based POSIX semaphore descriptors to userland via procstat(1) and fstat(1): - Change sem file descriptors to track the pathname they are associated with and add a ksem_info() method to copy the path out to a caller-supplied buffer. - Use the fo_stat() method of shared memory objects and ksem_info() to export the path, mode, and value of a semaphore via struct kinfo_file. - Add a struct semstat to the libprocstat(3) interface along with a procstat_get_sem_info() to export the mode and value of a semaphore. - Teach fstat about semaphores and to display their path, mode, and value. MFC after: 2 weeks	2013-05-03 21:11:57 +00:00
John Baldwin	dfa66c01ae	Fix FIONREAD on regular files. The computed result was being ignored and it was being passed down to VOP_IOCTL() where it promptly resulted in ENOTTY due to a missing else for the past 8 years. While here, use a shared vnode lock while fetching the current file's size. MFC after: 1 week	2013-05-03 19:08:58 +00:00
Jilles Tjoelker	b201f4a0dc	Regenerate files for pipe2().	2013-05-01 22:45:04 +00:00
Jilles Tjoelker	dc570d5e56	Add pipe2() system call. The pipe2() function is similar to pipe() but allows setting FD_CLOEXEC and O_NONBLOCK (on both sides) as part of the function. If p points to two writable ints, pipe2(p, 0) is equivalent to pipe(p). If the pointer is not valid, behaviour differs: pipe2() writes into the array from the kernel like socketpair() does, while pipe() writes into the array from an architecture-specific assembler wrapper. Reviewed by: kan, kib	2013-05-01 22:42:42 +00:00
Jilles Tjoelker	1bf6b724f1	Regenerate files for accept4().	2013-05-01 20:12:58 +00:00
Jilles Tjoelker	da7d2afb6d	Add accept4() system call. The accept4() function, compared to accept(), allows setting the new file descriptor atomically close-on-exec and explicitly controlling the non-blocking status on the new socket. (Note that the latter point means that accept() is not equivalent to any form of accept4().) The linuxulator's accept4 implementation leaves a race window where the new file descriptor is not close-on-exec because it calls sys_accept(). This implementation leaves no such race window (by using falloc() flags). The linuxulator could be fixed and simplified by using the new code. Like accept(), accept4() is async-signal-safe, a cancellation point and permitted in capability mode.	2013-05-01 20:10:21 +00:00
Mikolaj Golub	1b8388cde9	Introduce a constant, ELF_NOTE_ROUNDSIZE, which evidently declare our intention to use 4-byte padding for elf notes. MFC after: 3 weeks	2013-05-01 14:59:16 +00:00
Jilles Tjoelker	cd31b6dd08	socket: Make shutdown() wake up a blocked accept(). A blocking accept (and some other operations) waits on &so->so_timeo. Once it wakes up, it will detect the SBS_CANTRCVMORE bit. The error from accept() is [ECONNABORTED] which is not the nicest one -- the thread calling accept() needs to know out-of-band what is happening. A spurious wakeup on so->so_timeo appears harmless (sleep retried) except when lingering on close (SO_LINGER, and in that case there is no descriptor to call shutdown() on) so this should be fairly safe. A shutdown() already woke up a blocked accept() for TCP sockets, but not for Unix domain sockets. This fix is generic for all domains. This patch was sent to -hackers@ and -net@ on April 5. MFC after: 2 weeks	2013-04-30 15:06:30 +00:00
Konstantin Belousov	3d31767952	Eliminate the layering violation in the kern_sendfile(). When quering the file size, use VOP_GETATTR() instead of accessing vnode vm_object un_pager.vnp.vnp_size. Take the shared vnode lock earlier to cover the added VOP_GETATTR() call and, as consequence, the whole internal sendfile loop. Reduce vm object lock scope to not protect the local calculations. Note that this is the last misuse of the vnp_size in the tree, the others were removed from the ELF image activator by r230246. Reviewed by: alc Tested by: pho, bf (previous version) MFC after: 1 week	2013-04-28 19:12:09 +00:00
Andre Oppermann	2ebcc8ac4a	Base the calculation of maxmbufmem in part on kmem_map size instead of kernel_map size to prevent kernel memory exhaustion by mbufs and a subsequent panic on physical page allocation failure. On architectures without a direct map all mbuf memory (except for jumbo mbufs larger than PAGE_SIZE) comes from kmem_map. It is the limiting factor hence. For architectures with a direct map using the size of kmem_map is a good proxy of available kernel memory as well. If it is much smaller the mbuf limit may be sub-optimal but remains reasonable, while avoiding panics under exhaustion. The overall mbuf memory limit calculation may be reconsidered again later, however due to the many different mbuf sizes and different backing KVM maps it is a tricky subject. Found by: pho's new network stress test Pointed out by: alc (kmem_map instead of kernel_map) Tested by: pho	2013-04-24 13:54:55 +00:00
Jaakko Heinonen	a208417c41	Include PID in the error message which is printed when the maxproc limit is exceeded. Improve formatting of the message while here. PR: kern/60550 Submitted by: Lowell Gilbert, bde	2013-04-19 15:19:29 +00:00
Gleb Smirnoff	14658a80fe	Don't compare unsigned socklen_t against < 0. Reviewed by: jhb	2013-04-19 13:40:13 +00:00
Jilles Tjoelker	1e367efa8b	sem: Restart the POSIX sem_* calls after signals with SA_RESTART set. Programs often do not expect an [EINTR] return from sem_wait() and POSIX only allows it if the signal was installed without SA_RESTART. The timeout in sem_timedwait() is absolute so it can be restarted normally. The umtx call can be invoked with a relative timeout and in that case [ERESTART] must be changed to [EINTR]. However, libc does not do this. The old POSIX semaphore implementation did this correctly (before r249566), unlike the new umtx one. It may be desirable to avoid [EINTR] completely, which matches the pthread functions and is explicitly permitted by POSIX. However, the kernel must return [EINTR] at least for signals with SA_RESTART clear, otherwise pthread cancellation will not abort a semaphore wait. In this commit, only restore the 8.x behaviour which is also permitted by POSIX. Discussed with: jhb MFC after: 1 week	2013-04-19 10:16:00 +00:00
Gleb Smirnoff	8f779cc541	On non-ACPI i386 mp_ncpus is initialized at SI_SUB_CPU, and this prevents us from creating UMA_ZONE_PCPU zones earlier. As bandaid shift initialization of counter(9) zone later. Reviewed by: kib Reported & tested by: Lytochkin Boris <lytboris gmail.com>	2013-04-17 18:43:33 +00:00
Gabor Kovesdan	a2098fea6d	- Correct mispellings of the word necessary Submitted by: Christoph Mallon <christoph.mallon@gmx.de> (via private mail)	2013-04-17 11:42:40 +00:00
Gabor Kovesdan	ab3f6b347e	- Correct mispellings of the word occurrence Submitted by: Christoph Mallon <christoph.mallon@gmx.de> (via private mail)	2013-04-17 11:40:10 +00:00
Warner Losh	11c601447d	r249408 and r249436 cause a NULL pointer dereference on the CUBIEBOARD since it doesn't set the kernel envrionment at all. Work around this by making sure kern_envp is non-NULL before dereferencing it.	2013-04-16 22:09:08 +00:00
John Baldwin	8916af883c	- Document that sem_wait() can fail with EINTR if it is interrupted by a signal. - Fix the old ksem implementation for POSIX semaphores to not restart sem_wait() or sem_timedwait() if interrupted by a signal. MFC after: 1 week	2013-04-16 20:26:31 +00:00
Mikolaj Golub	f1fca82ed5	Add a new set of notes to a process core dump to store procstat data. The notes format is a header of sizeof(int), which stores the size of the corresponding data structure to provide some versioning, and data in the format as it is returned by a related sysctl call. The userland tools (procstat(1)) will be taught to extract this data, providing additional info for postmortem analysis. PR: kern/173723 Suggested by: jhb Discussed with: jhb, kib Reviewed by: jhb (initial version), kib MFC after: 1 month	2013-04-16 19:19:14 +00:00

1 2 3 4 5 ...

13240 Commits