freebsd-nq

Author	SHA1	Message	Date
Konstantin Belousov	d86c1f0dc1	i386 4/4G split. The change makes the user and kernel address spaces on i386 independent, giving each almost the full 4G of usable virtual addresses except for one PDE at top used for trampoline and per-CPU trampoline stacks, and system structures that must be always mapped, namely IDT, GDT, common TSS and LDT, and process-private TSS and LDT if allocated. By using 1:1 mapping for the kernel text and data, it appeared possible to eliminate assembler part of the locore.S which bootstraps initial page table and KPTmap. The code is rewritten in C and moved into the pmap_cold(). The comment in vmparam.h explains the KVA layout. There is no PCID mechanism available in protected mode, so each kernel/user switch forth and back completely flushes the TLB, except for the trampoline PTD region. The TLB invalidations for userspace becomes trivial, because IPI handlers switch page tables. On the other hand, context switches no longer need to reload %cr3. copyout(9) was rewritten to use vm_fault_quick_hold(). An issue for new copyout(9) is compatibility with wiring user buffers around sysctl handlers. This explains two kind of locks for copyout ptes and accounting of the vslock() calls. The vm_fault_quick_hold() AKA slow path, is only tried after the 'fast path' failed, which temporary changes mapping to the userspace and copies the data to/from small per-cpu buffer in the trampoline. If a page fault occurs during the copy, it is short-circuit by exception.s to not even reach C code. The change was motivated by the need to implement the Meltdown mitigation, but instead of KPTI the full split is done. The i386 architecture already shows the sizing problems, in particular, it is impossible to link clang and lld with debugging. I expect that the issues due to the virtual address space limits would only exaggerate and the split gives more liveness to the platform. Tested by: pho Discussed with: bde Sponsored by: The FreeBSD Foundation MFC after: 1 month Differential revision: https://reviews.freebsd.org/D14633	2018-04-13 20:30:49 +00:00
Ed Maste	b7feabf906	Use C99 designated initializers for struct execsw It it makes use slightly more clear and facilitates grepping.	2018-03-13 13:09:10 +00:00
Pedro F. Giffuni	8a36da99de	sys/kern: adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 2-Clause license, however the tool I was using misidentified many licenses so this was mostly a manual - error prone - task. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, superceed or replace the license texts.	2017-11-27 15:20:12 +00:00
Konstantin Belousov	d3c968bf84	Revert r323722. A better fix will be committed shortly, as well as some still useful bits of the reverted revision. The problem with the committed fix is that there are still issues with returning from NMI, when NMI interrupted kernel in a moment where the kernel segments selectors were still not loaded into registers. If this happens, the NMI return would loose the userspace selectors because r323722 does not reload segment registers on return to kernel mode. Fixing the problem is complicated. Since an alternative approach to handle the original bug exists, it makes sence to stop adding more complexity. Discussed with: bde Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-09-28 08:38:24 +00:00
Konstantin Belousov	5efe338f3d	Fix handling of the segment registers on i386. Suppose that userspace is executing with the non-standard segment descriptors. Then, until exception or interrupt handler executed SET_KERNEL_SEGS, kernel is still executing with user %ds, %es and %fs. If an interrupt occurs in this window, the interrupt handler is executed unsafely, relying on usability of the usermode registers. If the interrupt results in the context switch on return, the contamination of the kernel state spreads to the thread we switched to. As result, kernel data accesses might fault or, if only the base is changed, completely messed up. More, if the user segment was allocated in LDT, another thread might mark the descriptor as invalid before doreti code tried to reload them. In this case kernel panics. The issue exists for all exception entry points which use trap gate, and thus do not automatically disable interrupts on entry, and for lcall_handler. Fix is two-fold: first, we need to disable interrupts for all kernel entries, changing the IDT descriptor types from trap gate to interrupt gate. Interrupts are re-enabled not earlier than the kernel segments are loaded into the segment registers. Second, we only load the segment registers from the trap frame when returning to usermode. For the later, all interrupt return paths must happen through the doreti common code. There is no way to disable interrupts on call gate, which is the supposed mode of servicing for lcall $7,$0 syscalls. Change the LDT descriptor 0 into a code segment type and point it to the userspace trampoline which redirects the syscall to int $0x80. All the measures make the segment register handling similar to that of amd64. We do not apply amd64 optimizations of not reloading segment registers on return from the syscall. Reported by: Maxime Villard <max@m00nbsd.net> Tested by: pho (the non-lcall part) Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D12402	2017-09-18 20:22:42 +00:00
Dmitry Chagin	038c720553	Implement vsyscall hack. Prior to 2.13 glibc uses vsyscall instead of vdso. An upcoming linux_base-c6 needs it. Differential Revision: https://reviews.freebsd.org/D1090 Reviewed by: kib, trasz MFC after: 1 week	2016-01-09 20:18:53 +00:00
Konstantin Belousov	724f4b62b0	Remove sv_prepsyscall, sv_sigsize and sv_sigtbl members of the struct sysent. sv_prepsyscall is unused. sv_sigsize and sv_sigtbl translate signal number from the FreeBSD namespace into the ABI domain. It is only utilized on i386 for iBCS2 binaries. The issue with this approach is that signals for iBCS2 were delivered with the FreeBSD signal frame layout, which does not follow iBCS2. The same note is true for any other potential user if sv_sigtbl. In other words, if ABI needs signal number translation, it really needs custom sv_sendsig method instead. Sponsored by: The FreeBSD Foundation	2015-11-28 08:49:07 +00:00
Mateusz Guzik	f6f6d24062	Implement lockless resource limits. Use the same scheme implemented to manage credentials. Code needing to look at process's credentials (as opposed to thred's) is provided with *_proc variants of relevant functions. Places which possibly had to take the proc lock anyway still use the proc pointer to access limits.	2015-06-10 10:48:12 +00:00
Dmitry Chagin	91d1786f65	In preparation for switching linuxulator to the use the native 1:1 threads add a hook for cleaning thread resources before the thread die. Differential Revision: https://reviews.freebsd.org/D1038	2015-05-24 14:51:29 +00:00
Konstantin Belousov	481af8b933	Cosmetics: define FREEBSD32_MINUSER and AOUT32_MINUSER for struct sysentvec .sv_minuser. Also improve style. Submitted by: Oliver Pinter <oliver.pinter@gmail.com> MFC after: 1 week	2012-07-22 13:41:45 +00:00
Warner Losh	f4dc9a4015	Remove an old hack I noticed years ago, but never committed.	2012-06-28 07:33:43 +00:00
Konstantin Belousov	508462ed1b	Do not trash the argv[0] pointer for an a.out process on amd64. Found with the binary provided by joerg.	2011-06-16 22:00:59 +00:00
Konstantin Belousov	5a888c066f	Fix silly typo that resulted in the a.out process stack to end at ~200MB instead of 3GB on amd64.	2011-06-16 21:59:16 +00:00
Edward Tomasz Napierala	1ba5ad4210	Add accounting for most of the memory-related resources. Sponsored by: The FreeBSD Foundation Reviewed by: kib (earlier version)	2011-04-05 20:23:59 +00:00
Konstantin Belousov	7332c129e0	Add support for executing the FreeBSD 1/i386 a.out binaries on amd64. In particular: - implement compat shims for old stat(2) variants and ogetdirentries(2); - implement delivery of signals with ancient stack frame layout and corresponding sigreturn(2); - implement old getpagesize(2); - provide a user-mode trampoline and LDT call gate for lcall $7,$0; - port a.out image activator and connect it to the build as a module on amd64. The changes are hidden under COMPAT_43. MFC after: 1 month	2011-04-01 11:16:29 +00:00
Dmitry Chagin	e5d81ef1b5	Extend struct sysvec with new method sv_schedtail, which is used for an explicit process at fork trampoline path instead of eventhadler(schedtail) invocation for each child process. Remove eventhandler(schedtail) code and change linux ABI to use newly added sysvec method. While here replace explicit comparing of module sysentvec structure with the newly created process sysentvec to detect the linux ABI. Discussed with: kib MFC after: 2 Week	2011-03-08 19:01:45 +00:00
Konstantin Belousov	afe1a68827	Reorganize syscall entry and leave handling. Extend struct sysvec with three new elements: sv_fetch_syscall_args - the method to fetch syscall arguments from usermode into struct syscall_args. The structure is machine-depended (this might be reconsidered after all architectures are converted). sv_set_syscall_retval - the method to set a return value for usermode from the syscall. It is a generalization of cpu_set_syscall_retval(9) to allow ABIs to override the way to set a return value. sv_syscallnames - the table of syscall names. Use sv_set_syscall_retval in kern_sigsuspend() instead of hardcoding the call to cpu_set_syscall_retval(). The new functions syscallenter(9) and syscallret(9) are provided that use sv_syscall pointers and contain the common repeated code from the syscall() implementations for the architecture-specific syscall trap handlers. Syscallenter() fetches arguments, calls syscall implementation from ABI sysent table, and set up return frame. The end of syscall bookkeeping is done by syscallret(). Take advantage of single place for MI syscall handling code and implement ptrace_lwpinfo pl_flags PL_FLAG_SCE, PL_FLAG_SCX and PL_FLAG_EXEC. The SCE and SCX flags notify the debugger that the thread is stopped at syscall entry or return point respectively. The EXEC flag augments SCX and notifies debugger that the process address space was changed by one of exec(2)-family syscalls. The i386, amd64, sparc64, sun4v, powerpc and ia64 syscall()s are changed to use syscallenter()/syscallret(). MIPS and arm are not converted and use the mostly unchanged syscall() implementation. Reviewed by: jhb, marcel, marius, nwhitehorn, stas Tested by: marcel (ia64), marius (sparc64), nwhitehorn (powerpc), stas (mips) MFC after: 1 month	2010-05-23 18:32:02 +00:00
Konstantin Belousov	b4cf0e62f4	Add sv_flags field to struct sysentvec with intention to provide description of the ABI of the currently executing image. Change some places to test the flags instead of explicit comparing with address of known sysentvec structures to determine ABI features. Discussed with: dchagin, imp, jhb, peter	2008-11-22 12:36:15 +00:00
Konstantin Belousov	a8d403e102	Change the static struct sysentvec and struct Elf_Brandinfo initializers to the C99 style. At least, it is easier to read sysent definitions that way, and search for the actual instances of sigcode etc. Explicitely initialize sysentvec.sv_maxssiz that was missed in most sysvecs. No objection from: jhb MFC after: 1 month	2008-09-24 10:14:37 +00:00
Attilio Rao	22db15c06f	VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in conjuction with 'thread' argument passing which is always curthread. Remove the unuseful extra-argument and pass explicitly curthread to lower layer functions, when necessary. KPI results broken by this change, which should affect several ports, so version bumping and manpage update will be further committed. Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>	2008-01-13 14:44:15 +00:00
Attilio Rao	cb05b60a89	vn_lock() is currently only used with the 'curthread' passed as argument. Remove this argument and pass curthread directly to underlying VOP_LOCK1() VFS method. This modify makes the code cleaner and in particular remove an annoying dependence helping next lockmgr() cleanup. KPI results, obviously, changed. Manpage and FreeBSD_version will be updated through further commits. As a side note, would be valuable to say that next commits will address a similar cleanup about VFS methods, in particular vop_lock1 and vop_unlock. Tested by: Diego Sardina <siarodx at gmail dot com>, Andrea Di Pasquale <whyx dot it at gmail dot com>	2008-01-10 01:10:58 +00:00
Konstantin Belousov	89b57fcf01	Fix for the panic("vm_thread_new: kstack allocation failed") and silent NULL pointer dereference in the i386 and sparc64 pmap_pinit() when the kmem_alloc_nofault() failed to allocate address space. Both functions now return error instead of panicing or dereferencing NULL. As consequence, vmspace_exec() and vmspace_unshare() returns the errno int. struct vmspace arg was added to vm_forkproc() to avoid dealing with failed allocation when most of the fork1() job is already done. The kernel stack for the thread is now set up in the thread_alloc(), that itself may return NULL. Also, allocation of the first process thread is performed in the fork1() to properly deal with stack allocation failure. proc_linkup() is separated into proc_linkup() called from fork1(), and proc_linkup0(), that is used to set up the kernel process (was known as swapper). In collaboration with: Peter Holm Reviewed by: jhb	2007-11-05 11:36:16 +00:00
Alan Cox	41634e2e8d	Correct two vm object reference leaks in error cases. Submitted by: davidxu	2006-03-16 08:51:59 +00:00
Alan Cox	60bb39431a	Maintain the lock on the vnode for most of exec_elfN_imgact(). Specifically, it is required for the I/O that may be performed by elfN_load_section(). Avoid an obscure deadlock in the a.out, elf, and gzip image activators. Add a comment describing why the deadlock does not occur in the common case and how it might occur in less usual circumstances. Eliminate an unused variable from exec_aout_imgact(). In collaboration with: tegge	2005-12-24 04:57:50 +00:00
Jeff Roberson	e11a45c9e5	- Neither of our image formats require Giant now that the vm and vfs have been locked.	2005-05-03 10:51:38 +00:00
Maxim Sobolev	610ecfe035	o Split out kernel part of execve(2) syscall into two parts: one that copies arguments into the kernel space and one that operates completely in the kernel space; o use kernel-only version of execve(2) to kill another stackgap in linuxlator/i386. Obtained from: DragonFlyBSD (partially) MFC after: 2 weeks	2005-01-29 23:12:00 +00:00
Warner Losh	9454b2d864	/* -> /*- for copyright notices, minor format tweaks as necessary	2005-01-06 23:35:40 +00:00
David Schultz	1eecfae3e5	Axe a.out core dump support. Neither older gdb binaries nor current bfd sources understand the present format.	2004-11-27 06:46:59 +00:00
David Schultz	0ef5c36ff1	Maintain the broken state of backwards compatibilty for a.out (and PECOFF!) core dumps. None of the old versions of gdb I tried were able to read a.out core dumps before or after this change. Reviewed by: arch@	2004-11-20 02:32:04 +00:00
Tim J. Robbins	f99619a0dc	Change the types of vn_rdwr_inchunks()'s len and aresid arguments to size_t and size_t *, respectively. Update callers for the new interface. This is a better fix for overflows that occurred when dumping segments larger than 2GB to core files.	2004-06-05 02:18:28 +00:00
John Baldwin	91d5354a2c	Locking for the per-process resource limits structure. - struct plimit includes a mutex to protect a reference count. The plimit structure is treated similarly to struct ucred in that is is always copy on write, so having a reference to a structure is sufficient to read from it without needing a further lock. - The proc lock protects the p_limit pointer and must be held while reading limits from a process to keep the limit structure from changing out from under you while reading from it. - Various global limits that are ints are not protected by a lock since int writes are atomic on all the archs we support and thus a lock wouldn't buy us anything. - All accesses to individual resource limits from a process are abstracted behind a simple lim_rlimit(), lim_max(), and lim_cur() API that return either an rlimit, or the current or max individual limit of the specified resource from a process. - dosetrlimit() was renamed to kern_setrlimit() to match existing style of other similar syscall helper functions. - The alpha OSF/1 compat layer no longer calls getrlimit() and setrlimit() (it didn't used the stackgap when it should have) but uses lim_rlimit() and kern_setrlimit() instead. - The svr4 compat no longer uses the stackgap for resource limits calls, but uses lim_rlimit() and kern_setrlimit() instead. - The ibcs2 compat no longer uses the stackgap for resource limits. It also no longer uses the stackgap for accessing sysctl's for the ibcs2_sysconf() syscall but uses kernel_sysctl() instead. As a result, ibcs2_sysconf() no longer needs Giant. - The p_rlimit macro no longer exists. Submitted by: mtm (mostly, I only did a few cleanups and catchups) Tested on: i386 Compiled on: alpha, amd64	2004-02-04 21:52:57 +00:00
Peter Wemm	c460ac3a00	Add sysentvec->sv_fixlimits() hook so that we can catch cases on 64 bit systems where the data/stack/etc limits are too big for a 32 bit process. Move the 5 or so identical instances of ELF_RTLD_ADDR() into imgact_elf.c. Supply an ia32_fixlimits function. Export the clip/default values to sysctl under the compat.ia32 heirarchy. Have mmap(0, ...) respect the current p->p_limits[RLIMIT_DATA].rlim_max value rather than the sysctl tweakable variable. This allows mmap to place mappings at sensible locations when limits have been reduced. Have the imgact_elf.c ld-elf.so.1 placement algorithm use the same method as mmap(0, ...) now does. Note that we cannot remove all references to the sysctl tweakable maxdsiz etc variables because /etc/login.conf specifies a datasize of 'unlimited'. And that causes exec etc to fail since it can no longer find space to mmap things.	2003-09-25 01:10:26 +00:00
David E. O'Brien	677b542ea2	Use __FBSDID().	2003-06-11 00:56:59 +00:00
Warner Losh	a163d034fa	Back out M_* changes, per decision of the TRB. Approved by: trb	2003-02-19 05:47:46 +00:00
Alfred Perlstein	44956c9863	Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0. Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.	2003-01-21 08:56:16 +00:00
Jake Burkholder	05ba50f522	Use the fields in the sysentvec and in the vm map header in place of the constants VM_MIN_ADDRESS, VM_MAXUSER_ADDRESS, USRSTACK and PS_STRINGS. This is mainly so that they can be variable even for the native abi, based on different machine types. Get stack protections from the sysentvec too. This makes it trivial to map the stack non-executable for certain abis, on machines that support it.	2002-09-21 22:07:17 +00:00
Bruce Evans	e5d6cd0c98	Include <sys/malloc.h> instead of depending on namespace pollution 2 layers deep in <sys/proc.h> or <sys/vnode.h>. Removed unused includes. Sorted includes.	2002-09-10 11:57:02 +00:00
Peter Wemm	a9f9df5daf	Tidy up some loose ends that bde pointed out. caddr_t bad, ok? Move fill_kinfo_proc to before we copy the results instead of after the copy and too late. There is still more to do here.	2002-09-07 22:31:44 +00:00
Peter Wemm	99a17113cd	The true value of how the kernel was configured for KSTACK_PAGES was not available at module compile time. Do not #include the bogus opt_kstack_pages.h at this point and instead refer to the variables that are also exported via sysctl.	2002-09-07 22:15:47 +00:00
Peter Wemm	710ded3ac5	Collect the a.out coredump code into the calling functions. XXX why does pecoff dump in a.out format?	2002-09-07 01:23:51 +00:00
Jake Burkholder	f36ba45234	Added fields for VM_MIN_ADDRESS, PS_STRINGS and stack protections to sysentvec. Initialized all fields of all sysentvecs, which will allow them to be used instead of constants in more places. Provided stack fixup routines for emulations that previously used the default.	2002-09-01 21:41:24 +00:00
Robert Watson	9ca435893b	In order to better support flexible and extensible access control, make a series of modifications to the credential arguments relating to file read and write operations to cliarfy which credential is used for what: - Change fo_read() and fo_write() to accept "active_cred" instead of "cred", and change the semantics of consumers of fo_read() and fo_write() to pass the active credential of the thread requesting an operation rather than the cached file cred. The cached file cred is still available in fo_read() and fo_write() consumers via fp->f_cred. These changes largely in sys_generic.c. For each implementation of fo_read() and fo_write(), update cred usage to reflect this change and maintain current semantics: - badfo_readwrite() unchanged - kqueue_read/write() unchanged pipe_read/write() now authorize MAC using active_cred rather than td->td_ucred - soo_read/write() unchanged - vn_read/write() now authorize MAC using active_cred but VOP_READ/WRITE() with fp->f_cred Modify vn_rdwr() to accept two credential arguments instead of a single credential: active_cred and file_cred. Use active_cred for MAC authorization, and select a credential for use in VOP_READ/WRITE() based on whether file_cred is NULL or not. If file_cred is provided, authorize the VOP using that cred, otherwise the active credential, matching current semantics. Modify current vn_rdwr() consumers to pass a file_cred if used in the context of a struct file, and to always pass active_cred. When vn_rdwr() is used without a file_cred, pass NOCRED. These changes should maintain current semantics for read/write, but avoid a redundant passing of fp->f_cred, as well as making it more clear what the origin of each credential is in file descriptor read/write operations. Follow-up commits will make similar changes to other file descriptor operations, and modify the MAC framework to pass both credentials to MAC policy modules so they can implement either semantic for revocation. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs	2002-08-15 20:55:08 +00:00
Jeff Roberson	619eb6e579	- Hold the vnode lock throughout execve. - Set VV_TEXT in the top level execve code. - Fixup the image activators to deal with the newly locked vnode.	2002-08-13 06:55:28 +00:00
Jeff Roberson	e6e370a7fe	- Replace v_flag with v_iflag and v_vflag - v_vflag is protected by the vnode lock and is used when synchronization with VOP calls is needed. - v_iflag is protected by interlock and is used for dealing with vnode management issues. These flags include X/O LOCK, FREE, DOOMED, etc. - All accesses to v_iflag and v_vflag have either been locked or marked with mp_fixme's. - Many ASSERT_VOP_LOCKED calls have been added where the locking was not clear. - Many functions in vfs_subr.c were restructured to provide for stronger locking. Idea stolen from: BSD/OS	2002-08-04 10:29:36 +00:00
Peter Wemm	3ebc124838	Infrastructure tweaks to allow having both an Elf32 and an Elf64 executable handler in the kernel at the same time. Also, allow for the exec_new_vmspace() code to build a different sized vmspace depending on the executable environment. This is a big help for execing i386 binaries on ia64. The ELF exec code grows the ability to map partial pages when there is a page size difference, eg: emulating 4K pages on 8K or 16K hardware pages. Flesh out the i386 emulation support for ia64. At this point, the only binary that I know of that fails is cvsup, because the cvsup runtime tries to execute code in pages not marked executable. Obtained from: dfr (mostly, many tweaks from me).	2002-07-20 02:56:12 +00:00
Jeff Roberson	0b2ed1aef7	Clean up execve locking: - Grab the vnode object early in exec when we still have the vnode lock. - Cache the object in the image_params. - Make use of the cached object in imgact_*.c	2002-07-06 07:00:01 +00:00
John Baldwin	65c9b4303b	- Change fill_kinfo_proc() to require that the process is locked when it is called. - Change sysctl_out_proc() to require that the process is locked when it is called and to drop the lock before it returns. If this proves too complex we can change sysctl_out_proc() to simply acquire the lock at the very end and have the calling code drop the lock right after it returns. - Lock the process we are going to export before the p_cansee() in the loop in sysctl_kern_proc() and hold the lock until we call sysctl_out_proc(). - Don't call p_cansee() on the process about to be exported twice in the aforementioned loop.	2002-04-09 20:10:46 +00:00
Alfred Perlstein	4d77a549fe	Remove __P.	2002-03-19 21:25:46 +00:00
John Baldwin	a854ed9893	Simple p_ucred -> td_ucred changes to start using the per-thread ucred reference.	2002-02-27 18:32:23 +00:00
Paul Saab	cbc89bfbfe	Make MAXTSIZ, DFLDSIZ, MAXDSIZ, DFLSSIZ, MAXSSIZ, SGROWSIZ loader tunable. Reviewed by: peter MFC after: 2 weeks	2001-10-10 23:06:54 +00:00

1 2 3

123 Commits