Windows DRIVER_OBJECT and DEVICE_OBJECT mechanism so that we can
simulate driver stacking.
In Windows, each loaded driver image is attached to a DRIVER_OBJECT
structure. Windows uses the registry to match up a given vendor/device
ID combination with a corresponding DRIVER_OBJECT. When a driver image
is first loaded, its DriverEntry() routine is invoked, which sets up
the AddDevice() function pointer in the DRIVER_OBJECT and creates
a dispatch table (based on IRP major codes). When a Windows bus driver
detects a new device, it creates a Physical Device Object (PDO) for
it. This is a DEVICE_OBJECT structure, with semantics analagous to
that of a device_t in FreeBSD. The Windows PNP manager will invoke
the driver's AddDevice() function and pass it pointers to the DRIVER_OBJECT
and the PDO.
The AddDevice() function then creates a new DRIVER_OBJECT structure of
its own. This is known as the Functional Device Object (FDO) and
corresponds roughly to a private softc instance. The driver uses
IoAttachDeviceToDeviceStack() to add this device object to the
driver stack for this PDO. Subsequent drivers (called filter drivers
in Windows-speak) can be loaded which add themselves to the stack.
When someone issues an IRP to a device, it travel along the stack
passing through several possible filter drivers until it reaches
the functional driver (which actually knows how to talk to the hardware)
at which point it will be completed. This is how Windows achieves
driver layering.
Project Evil now simulates most of this. if_ndis now has a modevent
handler which will use MOD_LOAD and MOD_UNLOAD events to drive the
creation and destruction of DRIVER_OBJECTs. (The load event also
does the relocation/dynalinking of the image.) We don't have a registry,
so the DRIVER_OBJECTS are stored in a linked list for now. Eventually,
the list entry will contain the vendor/device ID list extracted from
the .INF file. When ndis_probe() is called and detectes a supported
device, it will create a PDO for the device instance and attach it
to the DRIVER_OBJECT just as in Windows. ndis_attach() will then call
our NdisAddDevice() handler to create the FDO. The NDIS miniport block
is now a device extension hung off the FDO, just as it is in Windows.
The miniport characteristics table is now an extension hung off the
DRIVER_OBJECT as well (the characteristics are the same for all devices
handled by a given driver, so they don't need to be per-instance.)
We also do an IoAttachDeviceToDeviceStack() to put the FDO on the
stack for the PDO. There are a couple of fake bus drivers created
for the PCI and pccard buses. Eventually, there will be one for USB,
which will actually accept USB IRP.s
Things should still work just as before, only now we do things in
the proper order and maintain the correct framework to support passing
IRPs between drivers.
Various changes:
- corrected the comments about IRQL handling in subr_hal.c to more
accurately reflect reality
- update ndiscvt to make the drv_data symbol in ndis_driver_data.h a
global so that if_ndis_pci.o and/or if_ndis_pccard.o can see it.
- Obtain the softc pointer from the miniport block by referencing
the PDO rather than a private pointer of our own (nmb_ifp is no
longer used)
- implement IoAttachDeviceToDeviceStack(), IoDetachDevice(),
IoGetAttachedDevice(), IoAllocateDriverObjectExtension(),
IoGetDriverObjectExtension(), IoCreateDevice(), IoDeleteDevice(),
IoAllocateIrp(), IoReuseIrp(), IoMakeAssociatedIrp(), IoFreeIrp(),
IoInitializeIrp()
- fix a few mistakes in the driver_object and device_object definitions
- add a new module, kern_windrv.c, to handle the driver registration
and relocation/dynalinkign duties (which don't really belong in
kern_ndis.c).
- made ndis_block and ndis_chars in the ndis_softc stucture pointers
and modified all references to it
- fixed NdisMRegisterMiniport() and NdisInitializeWrapper() so they
work correctly with the new driver_object mechanism
- changed ndis_attach() to call NdisAddDevice() instead of ndis_load_driver()
(which is now deprecated)
- used ExAllocatePoolWithTag()/ExFreePool() in lookaside list routines
instead of kludged up alloc/free routines
- added kern_windrv.c to sys/modules/ndis/Makefile and files.i386.
the semantics in that the returned filename to use is now a kernel
pointer rather than a user space pointer. This required changing the
arguments to the CHECKALT*() macros some and changing the various system
calls that used pathnames to use the kern_foo() functions that can accept
kernel space filename pointers instead of calling the system call
directly.
- Use kern_open(), kern_access(), kern_msgctl(), kern_execve(),
kern_mkfifo(), kern_mknod(), kern_statfs(), kern_fstatfs(),
kern_setitimer(), kern_stat(), kern_lstat(), kern_fstat(), kern_utimes(),
kern_pathconf(), and kern_unlink().
duplicating the contents of the same functions inline.
- Consolidate common code to convert a BSD statfs struct to a Linux struct
into a static worker function.
structure in the struct pointed to by the 3rd argument for IPC_STAT and
get rid of the 4th argument. The old way returned a pointer into the
kernel array that the calling function would then access afterwards
without holding the appropriate locks and doing non-lock-safe things like
copyout() with the data anyways. This change removes that unsafeness and
resulting race conditions as well as simplifying the interface.
- Implement kern_foo wrappers for stat(), lstat(), fstat(), statfs(),
fstatfs(), and fhstatfs(). Use these wrappers to cut out a lot of
code duplication for freebsd4 and netbsd compatability system calls.
- Add a new lookup function kern_alternate_path() that looks up a filename
under an alternate prefix and determines which filename should be used.
This is basically a more general version of linux_emul_convpath() that
can be shared by all the ABIs thus allowing for further reduction of
code duplication.
providing special version of CDIOCREADSUBCHANNEL ioctl(), which assumes that
result has to be placed into kernel space not user space. In the long run
more generic solution has to be designed WRT emulating various ioctl()s
that operate on userspace buffers, but right now there is only one such
ioctl() is emulated, so that it makes little sense.
MFC after: 2 weeks
copies arguments into the kernel space and one that operates
completely in the kernel space;
o use kernel-only version of execve(2) to kill another stackgap in
linuxlator/i386.
Obtained from: DragonFlyBSD (partially)
MFC after: 2 weeks
from the userland and pushes results back and the second which does
actual processing. Use the latter to eliminate stackgap in the linux wrapper
of that syscall.
MFC after: 2 weeks
pops data from the userland and pushes results back and the second which does
actual processing. Use the latter to eliminate stackgap in the linux wrappers
of those syscalls.
MFC after: 2 weeks
attributes in casts (i.e. foo = (__stdcall sometype)bar). This only
happens in two places where we need to set up function pointers, so
work around the problem with some void pointer magic.
USB device support):
- Convert all of my locally chosen function names to their actual
Windows equivalents, where applicable. This is a big no-op change
since it doesn't affect functionality, but it helps avoid a bit
of confusion (it's now a lot easier to see which functions are
emulated Windows API routines and which are just locally defined).
- Turn ndis_buffer into an mdl, like it should have been. The structure
is the same, but now it belongs to the subr_ntoskrnl module.
- Implement a bunch of MDL handling macros from Windows and use them where
applicable.
- Correct the implementation of IoFreeMdl().
- Properly implement IoAllocateMdl() and MmBuildMdlForNonPagedPool().
- Add the definitions for struct irp and struct driver_object.
- Add IMPORT_FUNC() and IMPORT_FUNC_MAP() macros to make formatting
the module function tables a little cleaner. (Should also help
with AMD64 support later on.)
- Fix if_ndis.c to use KeRaiseIrql() and KeLowerIrql() instead of
the previous calls to hal_raise_irql() and hal_lower_irql() which
have been renamed.
The function renaming generated a lot of churn here, but there should
be very little operational effect.
calls MiniportQueryInformation(), it will return NDIS_STATUS_PENDING.
When this happens, ndis_get_info() will sleep waiting for a completion
event. If two threads call ndis_get_info() and both end up having to
sleep, they will both end up waiting on the same wait channel, which
can cause a panic in sleepq_add() if INVARIANTS are turned on.
Fix this by having ndis_get_info() use a common mutex rather than
using the process mutex with PROC_LOCK(). Also do the same for
ndis_set_info(). Note that Pierre's original patch also made ndis_thsuspend()
use the new mutex, but ndis_thsuspend() shouldn't need this since
it will make each thread that calls it sleep on a unique wait channel.
Also, it occured to me that we probably don't want to enter
MiniportQueryInformation() or MiniportSetInformation() from more
than one thread at any given time, so now we acquire a Windows
spinlock before calling either of them. The Microsoft documentation
says that MiniportQueryInformation() and MiniportSetInformation()
are called at DISPATCH_LEVEL, and previously we would call
KeRaiseIrql() to set the IRQL to DISPATCH_LEVEL before entering
either routine, but this only guarantees mutual exclusion on
uniprocessor machines. To make it SMP safe, we need to use a real
spinlock. For now, I'm abusing the spinlock embedded in the
NDIS_MINIPORT_BLOCK structure for this purpose. (This may need to be
applied to some of the other routines in kern_ndis.c at a later date.)
Export ntoskrnl_init_lock() (KeInitializeSpinlock()) from subr_ntoskrnl.c
since we need to use in in kern_ndis.c, and since it's technically part
of the Windows kernel DDK API along with the other spinlock routines. Use
it in subr_ndis.c too rather than frobbing the spinlock directly.
- Mark mount, unmount and nmount MPSAFE.
- Add a stub for _umtx_op().
- Mark open(), link(), unlink(), and freebsd32_sigaction() MPSAFE.
Pointy hats to: several
Use this in all the places where sleeping with the lock held is not
an issue.
The distinction will become significant once we finalize the exact
lock-type to use for this kind of case.
sysctl routines and state. Add some code to use it for signalling the need
to downconvert a data structure to 32 bits on a 64 bit OS when requested by
a 32 bit app.
I tried to do this in a generic abi wrapper that intercepted the sysctl
oid's, or looked up the format string etc, but it was a real can of worms
that turned into a fragile mess before I even got it partially working.
With this, we can now run 'sysctl -a' on a 32 bit sysctl binary and have
it not abort. Things like netstat, ps, etc have a long way to go.
This also fixes a bug in the kern.ps_strings and kern.usrstack hacks.
These do matter very much because they are used by libc_r and other things.
After some discussion the best option seems to be to signal the thread's
death from within the kernel. This requires that thr_exit() take an
argument.
Discussed with: davidxu, deischen, marcel
MFC after: 3 days
the raw values including for child process statistics and only compute the
system and user timevals on demand.
- Fix the various kern_wait() syscall wrappers to only pass in a rusage
pointer if they are going to use the result.
- Add a kern_getrusage() function for the ABI syscalls to use so that they
don't have to play stackgap games to call getrusage().
- Fix the svr4_sys_times() syscall to just call calcru() to calculate the
times it needs rather than calling getrusage() twice with associated
stackgap, etc.
- Add a new rusage_ext structure to store raw time stats such as tick counts
for user, system, and interrupt time as well as a bintime of the total
runtime. A new p_rux field in struct proc replaces the same inline fields
from struct proc (i.e. p_[isu]ticks, p_[isu]u, and p_runtime). A new p_crux
field in struct proc contains the "raw" child time usage statistics.
ruadd() has been changed to handle adding the associated rusage_ext
structures as well as the values in rusage. Effectively, the values in
rusage_ext replace the ru_utime and ru_stime values in struct rusage. These
two fields in struct rusage are no longer used in the kernel.
- calcru() has been split into a static worker function calcru1() that
calculates appropriate timevals for user and system time as well as updating
the rux_[isu]u fields of a passed in rusage_ext structure. calcru() uses a
copy of the process' p_rux structure to compute the timevals after updating
the runtime appropriately if any of the threads in that process are
currently executing. It also now only locks sched_lock internally while
doing the rux_runtime fixup. calcru() now only requires the caller to
hold the proc lock and calcru1() only requires the proc lock internally.
calcru() also no longer allows callers to ask for an interrupt timeval
since none of them actually did.
- calcru() now correctly handles threads executing on other CPUs.
- A new calccru() function computes the child system and user timevals by
calling calcru1() on p_crux. Note that this means that any code that wants
child times must now call this function rather than reading from p_cru
directly. This function also requires the proc lock.
- This finishes the locking for rusage and friends so some of the Giant locks
in exit1() and kern_wait() are now gone.
- The locking in ttyinfo() has been tweaked so that a shared lock of the
proctree lock is used to protect the process group rather than the process
group lock. By holding this lock until the end of the function we now
ensure that the process/thread that we pick to dump info about will no
longer vanish while we are trying to output its info to the console.
Submitted by: bde (mostly)
MFC after: 1 month
These normally only manifest if the ndis compat module is statically
compiled into a kernel image by way of 'options NDISAPI'.
Submitted by: Dmitri Nikulin
Approved by: wpaul
PR: kern/71449
MFC after: 1 week
directly. This removes a few more users of the stackgap and also marks
the syscalls using these wrappers MP safe where appropriate.
Tested on: i386 with linux acroread5
Compiled on: i386, alpha LINT
valid; otherwise a caller could trick us into changing any 32-bit word
in kernel memory to LINUX_SOL_SOCKET (0x00000001) if its previous value
is SOL_SOCKET (0x0000ffff).
MFC after: 3 days