freebsd-skq

Author	SHA1	Message	Date
Robert Watson	8e937462f4	Make if_grow static -- it's not used outside of if.c, and with the internals destined to change, it's better if it remains that way. MFC after: 3 days	2009-08-24 12:52:05 +00:00
Marko Zec	52db6805ea	When moving ifnets from one vnet to another, and the ifnet has ifaddresses of AF_LINK type which thus have an embedded if_index "backpointer", we must update that if_index backpointer to reflect the new if_index that our ifnet just got assigned. This change affects only options VIMAGE builds. Submitted by: bz Reviewed by: bz Approved by: re (rwatson), julian (mentor)	2009-08-24 10:14:09 +00:00
Robert Watson	77dfcdc445	Rework global locks for interface list and index management, correcting several critical bugs, including race conditions and lock order issues: Replace the single rwlock, ifnet_lock, with two locks, an rwlock and an sxlock. Either can be held to stablize the lists and indexes, but both are required to write. This allows the list to be held stable in both network interrupt contexts and sleepable user threads across sleeping memory allocations or device driver interactions. As before, writes to the interface list must occur from sleepable contexts. Reviewed by: bz, julian MFC after: 3 days	2009-08-23 20:40:19 +00:00
Marko Zec	9abb486279	Appease VNET_DEBUG - in if_vmove we temporarily switch i.e. recurse from one vnet to another which is OK, so no need to flood the console with warnings here. Approved by: re (rwatson), julian (mentor)	2009-08-14 22:46:45 +00:00
Robert Watson	530c006014	Merge the remainder of kern_vimage.c and vimage.h into vnet.c and vnet.h, we now use jails (rather than vimages) as the abstraction for virtualization management, and what remained was specific to virtual network stacks. Minor cleanups are done in the process, and comments updated to reflect these changes. Reviewed by: bz Approved by: re (vimage blanket)	2009-08-01 19:26:27 +00:00
Bjoern A. Zeeb	be31e5e7b5	Make the in-kernel logic for the SIOCSIFVNET, SIOCSIFRVNET ioctls (ifconfig ifN (-)vnet <jname\|jid>) work correctly. Move vi_if_move to if.c and split it up into two functions(), one for each ioctl. In the reclaim case, correctly set the vnet before calling if_vmove. Instead of silently allowing a move of an interface from the current vnet to the current vnet, return an error. () There is some duplicate interface name checking before actually moving the interface between network stacks without locking and thus race prone. Ideally if_vmove will correctly and automagically handle these in the future. Suggested by: rwatson (*) Approved by: re (kib)	2009-07-26 11:29:26 +00:00
Robert Watson	d0728d7174	Introduce and use a sysinit-based initialization scheme for virtual network stacks, VNET_SYSINIT: - Add VNET_SYSINIT and VNET_SYSUNINIT macros to declare events that will occur each time a network stack is instantiated and destroyed. In the !VIMAGE case, these are simply mapped into regular SYSINIT/SYSUNINIT. For the VIMAGE case, we instead use SYSINIT's to track their order and properties on registration, using them for each vnet when created/ destroyed, or immediately on module load for already-started vnets. - Remove vnet_modinfo mechanism that existed to serve this purpose previously, as well as its dependency scheme: we now just use the SYSINIT ordering scheme. - Implement VNET_DOMAIN_SET() to allow protocol domains to declare that they want init functions to be called for each virtual network stack rather than just once at boot, compiling down to DOMAIN_SET() in the non-VIMAGE case. - Walk all virtualized kernel subsystems and make use of these instead of modinfo or DOMAIN_SET() for init/uninit events. In some cases, convert modular components from using modevent to using sysinit (where appropriate). In some cases, do minor rejuggling of SYSINIT ordering to make room for or better manage events. Portions submitted by: jhb (VNET_SYSINIT), bz (cleanup) Discussed with: jhb, bz, julian, zec Reviewed by: bz Approved by: re (VIMAGE blanket)	2009-07-23 20:46:49 +00:00
Robert Watson	006e9db452	Normalize field naming for struct vnet, fix two debugging printfs that print them. Reviewed by: bz Approved by: re (kensmith, kib)	2009-07-19 17:40:45 +00:00
Robert Watson	5ee847d3ac	Reimplement and/or implement vnet list locking by replacing a mostly unused custom mutex/condvar-based sleep locks with two locks: an rwlock (for non-sleeping use) and sxlock (for sleeping use). Either acquired for read is sufficient to stabilize the vnet list, but both must be acquired for write to modify the list. Replace previous no-op read locking macros, used in various places in the stack, with actual locking to prevent race conditions. Callers must declare when they may perform unbounded sleeps or not when selecting how to lock. Refactor vnet sysinits so that the vnet list and locks are initialized before kernel modules are linked, as the kernel linker will use them for modules loaded by the boot loader. Update various consumers of these KPIs based on whether they may sleep or not. Reviewed by: bz Approved by: re (kib)	2009-07-19 14:20:53 +00:00
Jamie Gritton	7afcbc18b3	Remove the interim vimage containers, struct vimage and struct procg, and the ioctl-based interface that supported them. Approved by: re (kib), bz (mentor)	2009-07-17 14:48:21 +00:00
Robert Watson	1e77c1056a	Remove unused VNET_SET() and related macros; only VNET_GET() is ever actually used. Rename VNET_GET() to VNET() to shorten variable references. Discussed with: bz, julian Reviewed by: bz Approved by: re (kensmith, kib)	2009-07-16 21:13:04 +00:00
Robert Watson	eddfbb763d	Build on Jeff Roberson's linker-set based dynamic per-CPU allocator (DPCPU), as suggested by Peter Wemm, and implement a new per-virtual network stack memory allocator. Modify vnet to use the allocator instead of monolithic global container structures (vinet, ...). This change solves many binary compatibility problems associated with VIMAGE, and restores ELF symbols for virtualized global variables. Each virtualized global variable exists as a "reference copy", and also once per virtual network stack. Virtualized global variables are tagged at compile-time, placing the in a special linker set, which is loaded into a contiguous region of kernel memory. Virtualized global variables in the base kernel are linked as normal, but those in modules are copied and relocated to a reserved portion of the kernel's vnet region with the help of a the kernel linker. Virtualized global variables exist in per-vnet memory set up when the network stack instance is created, and are initialized statically from the reference copy. Run-time access occurs via an accessor macro, which converts from the current vnet and requested symbol to a per-vnet address. When "options VIMAGE" is not compiled into the kernel, normal global ELF symbols will be used instead and indirection is avoided. This change restores static initialization for network stack global variables, restores support for non-global symbols and types, eliminates the need for many subsystem constructors, eliminates large per-subsystem structures that caused many binary compatibility issues both for monitoring applications (netstat) and kernel modules, removes the per-function INIT_VNET_*() macros throughout the stack, eliminates the need for vnet_symmap ksym(2) munging, and eliminates duplicate definitions of virtualized globals under VIMAGE_GLOBALS. Bump __FreeBSD_version and update UPDATING. Portions submitted by: bz Reviewed by: bz, zec Discussed with: gnn, jamie, jeff, jhb, julian, sam Suggested by: peter Approved by: re (kensmith)	2009-07-14 22:48:30 +00:00
Brooks Davis	6cb7f168db	Remove support for the /dev/net/* per-interface devices. They serve little purpose and are unused in the base system. The IOCTL functionality is entirely duplicated and routing sockets provide a richer interface than the kqueue functionality. Further, it is not practical for these devices to be made sensible in the face of VIMAGE. Bump __FreeBSD_version on the off chance that there is any code out there that actually uses this stuff. Reviewed by: rwatson Discussed with: bz, zec Approved by: re@ (kensmith)	2009-06-29 19:46:29 +00:00
Robert Watson	395cbe82d2	Remove unnecessary include of kdb.h that snuck in during ifaddr refcount work. Reported by: pluknet <pluknet at gmail.com> Approved by: re (kib)	2009-06-27 10:30:28 +00:00
Robert Watson	f9ef96ca71	Define four wrapper functions for interface address locking, if_addr_rlock() and if_addr_runlock() for regular address lists, and if_maddr_rlock() and if_maddr_runlock() for multicast address lists. We will use these in various kernel modules to avoid encoding specific type and locking strategy information into modules that currently use IF_ADDR_LOCK() and IF_ADDR_UNLOCK() directly. MFC after: 6 weeks	2009-06-26 00:36:47 +00:00
Robert Watson	3baaf2974d	In if_setlladdr(), use IF_ADDR_LOCK() and ifaddr references to improve the safety of link layer address manipulation. MFC after: 6 weeks	2009-06-24 10:36:48 +00:00
Robert Watson	8c0fec805f	Modify most routines returning 'struct ifaddr *' to return references rather than pointers, requiring callers to properly dispose of those references. The following routines now return references: ifaddr_byindex ifa_ifwithaddr ifa_ifwithbroadaddr ifa_ifwithdstaddr ifa_ifwithnet ifaof_ifpforaddr ifa_ifwithroute ifa_ifwithroute_fib rt_getifa rt_getifa_fib IFP_TO_IA ip_rtaddr in6_ifawithifp in6ifa_ifpforlinklocal in6ifa_ifpwithaddr in6_ifadd carp_iamatch6 ip6_getdstifaddr Remove unused macro which didn't have required referencing: IFP_TO_IA6 This closes many small races in which changes to interface or address lists while an ifaddr was in use could lead to use of freed memory (etc). In a few cases, add missing if_addr_list locking required to safely acquire references. Because of a lack of deep copying support, we accept a race in which an in6_ifaddr pointed to by mbuf tags and extracted with ip6_getdstifaddr() doesn't hold a reference while in transmit. Once we have mbuf tag deep copy support, this can be fixed. Reviewed by: bz Obtained from: Apple, Inc. (portions) MFC after: 6 weeks (portions)	2009-06-23 20:19:09 +00:00
Bjoern A. Zeeb	a877d0cffa	Remove duplicate #include <net/route.h> from the middle of the file.	2009-06-23 13:16:16 +00:00
Bjoern A. Zeeb	b58ea5f310	Move virtualization of routing related variables into their own Vimage module, which had been there already but now is stateful. All variables are now file local; so this further limits the global spreading of routing related things throughout the kernel. Add a missing function local variable in case of MPATHing. Reviewed by: zec	2009-06-22 17:48:16 +00:00
Robert Watson	8896f83a58	Add a new function, ifa_ifwithaddr_check(), which rather than returning a pointer to an ifaddr matching the passed socket address, returns a boolean indicating whether one was present. In the (near) future, ifa_ifwithaddr() will return a referenced ifaddr rather than a raw ifaddr pointer, and the new wrapper will allow callers that care only about the boolean condition to avoid having to free that reference. MFC after: 3 weeks	2009-06-22 10:59:34 +00:00
Bjoern A. Zeeb	bed56bb51b	After the update to fxp(4) in r194573 we should no longer need this DELAY(100) hack introduced in r56938. Thanks to: yongari MFC after: 6 weeks X-MFC note: not before the fxp(4) changes	2009-06-22 10:27:20 +00:00
Robert Watson	1099f828b3	Clean up common ifaddr management: - Unify reference count and lock initialization in a single function, ifa_init(). - Move tear-down from a macro (IFAFREE) to a function ifa_free(). - Move reference count bump from a macro (IFAREF) to a function ifa_ref(). - Instead of using a u_int protected by a mutex to refcount(9) for reference count management. The ifa_mtx is now used for exactly one ioctl, and possibly should be removed. MFC after: 3 weeks	2009-06-21 19:30:33 +00:00
Roman Divacky	e40bae9a45	Switch cmd argument to u_long. This matches what if_ethersubr.c does and allows the code to compile cleanly on amd64 with clang. Reviewed by: rwatson Approved by: ed (mentor)	2009-06-21 10:29:31 +00:00
Sam Leffler	d659538f72	r193336 moved ifq_detach to if_free which broke if_alloc followed by if_free (w/o doing if_attach); move ifq_attach to if_alloc and rename ifq_attach/detach to ifq_init/ifq_delete to better identify their purpose Reviewed by: jhb, kmacy	2009-06-15 19:50:03 +00:00
Jamie Gritton	679e13901c	Manage vnets via the jail system. If a jail is given the boolean parameter "vnet" when it is created, a new vnet instance will be created along with the jail. Networks interfaces can be moved between prisons with an ioctl similar to the one that moves them between vimages. For now vnets will co-exist under both jails and vimages, but soon struct vimage will be going away. Reviewed by: zec, julian Approved by: bz (mentor)	2009-06-15 18:59:29 +00:00
Bjoern A. Zeeb	259d2d5431	carp(4) allows people to share a set of IP addresses and can only use IPv4/v6 for inter-node communication (according to my reading). Properly wrap the carp callouts in INET \|\| INET6 and refelect this in sys/conf/files as well. While in theory this should be ok, it might be a bit optimistic to think that carp could build with inet6 only[1]. Discussed with: mlaier [1]	2009-06-11 10:26:38 +00:00
Konstantin Belousov	d8b0556c6d	Adapt vfs kqfilter to the shared vnode lock used by zfs write vop. Use vnode interlock to protect the knote fields [1]. The locking assumes that shared vnode lock is held, thus we get exclusive access to knote either by exclusive vnode lock protection, or by shared vnode lock + vnode interlock. Do not use kl_locked() method to assert either lock ownership or the fact that curthread does not own the lock. For shared locks, ownership is not recorded, e.g. VOP_ISLOCKED can return LK_SHARED for the shared lock not owned by curthread, causing false positives in kqueue subsystem assertions about knlist lock. Remove kl_locked method from knlist lock vector, and add two separate assertion methods kl_assert_locked and kl_assert_unlocked, that are supposed to use proper asserts. Change knlist_init accordingly. Add convenience function knlist_init_mtx to reduce number of arguments for typical knlist initialization. Submitted by: jhb [1] Noted by: jhb [2] Reviewed by: jhb Tested by: rnoland	2009-06-10 20:59:32 +00:00
Bjoern A. Zeeb	8d8bc0182e	After r193232 rt_tables in vnet.h are no longer indirectly dependent on the ROUTETABLES kernel option thus there is no need to include opt_route.h anymore in all consumers of vnet.h and no longer depend on it for module builds. Remove the hidden include in flowtable.h as well and leave the two explicit #includes in ip_input.c and ip_output.c.	2009-06-08 19:57:35 +00:00
Marko Zec	bc29160df3	Introduce an infrastructure for dismantling vnet instances. Vnet modules and protocol domains may now register destructor functions to clean up and release per-module state. The destructor mechanisms can be triggered by invoking "vimage -d", or a future equivalent command which will be provided via the new jail framework. While this patch introduces numerous placeholder destructor functions, many of those are currently incomplete, thus leaking memory or (even worse) failing to stop all running timers. Many of such issues are already known and will be incrementaly fixed over the next weeks in smaller incremental commits. Apart from introducing new fields in structs ifnet, domain, protosw and vnet_net, which requires the kernel and modules to be rebuilt, this change should have no impact on nooptions VIMAGE builds, since vnet destructors can only be called in VIMAGE kernels. Moreover, destructor functions should be in general compiled in only in options VIMAGE builds, except for kernel modules which can be safely kldunloaded at run time. Bump __FreeBSD_version to 800097. Reviewed by: bz, julian Approved by: rwatson, kib (re), julian (mentor)	2009-06-08 17:15:40 +00:00
Robert Watson	bcf11e8d00	Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC and used in a large number of files, but also because an increasing number of incorrect uses of MAC calls were sneaking in due to copy-and-paste of MAC-aware code without the associated opt_mac.h include. Discussed with: pjd	2009-06-05 14:55:22 +00:00
Sam Leffler	c9dd371765	move ifq_detach from if_detach to if_free; this permits callers to reference if_snd in the period between detach+free which helps simplify detach code Reviewed by: jhb, rwatson	2009-06-02 18:53:21 +00:00
Bjoern A. Zeeb	c2c2a7c11e	Convert the two dimensional array to be malloced and introduce an accessor function to get the correct rnh pointer back. Update netstat to get the correct pointer using kvm_read() as well. This not only fixes the ABI problem depending on the kernel option but also permits the tunable to overwrite the kernel option at boot time up to MAXFIBS, enlarging the number of FIBs without having to recompile. So people could just use GENERIC now. Reviewed by: julian, rwatson, zec X-MFC: not possible	2009-06-01 15:49:42 +00:00
Marko Zec	feb08d06b9	Introduce an interm userland-kernel API for creating vnets and assigning ifnets from one vnet to another. Deletion of vnets is not yet supported. The interface is implemented as an ioctl extension so that no syscalls had to be introduced. This should be acceptable given that the new interface will be used for a short / interim period only, until the new jail management framwork gains the capability of managing vnets. This method for managing vimages / vnets has been in use for the past 7 years without any observable issues. The userland tool to be used in conjunction with the interim API can be found in p4: //depot/projects/vimage-commit2/src/usr.sbin/vimage/... and will most probably never get commited to svn. While here, bump copyright notices in kern_vimage.c and vimage.h to cover work done in year 2009. Approved by: julian (mentor) Discussed with: bz, rwatson	2009-05-31 12:10:04 +00:00
Marko Zec	67da1f3d8d	Set ifp->if_afdata_initialized to 0 while holding IF_AFDATA_LOCK on ifp, not after the lock has been released. Reviewed by: bz Discussed with: rwatson	2009-05-22 22:22:21 +00:00
Marko Zec	e0c14af9b3	Introduce the if_vmove() function, which will be used in the future for reassigning ifnets from one vnet to another. if_vmove() works by calling a restricted subset of actions normally executed by if_detach() on an ifnet in the current vnet, and then switches to the target vnet and executes an appropriate subset of if_attach() actions there. if_attach() and if_detach() have become wrapper functions around if_attach_internal() and if_detach_internal(), where the later variants have an additional argument, a flag indicating whether a full attach or detach sequence is to be executed, or only a restricted subset suitable for moving an ifnet from one vnet to another. Hence, if_vmove() will not call if_detach() and if_attach() directly, but will call the if_detach_internal() and if_attach_internal() variants instead, with the vmove flag set. While here, staticize ifnet_setbyindex() since it is not referenced from outside of sys/net/if.c. Also rename ifccnt field in struct vimage to ifcnt, and do some minor whitespace garbage collection where appropriate. This change should have no functional impact on nooptions VIMAGE kernel builds. Reviewed by: bz, rwatson, brooks? Approved by: julian (mentor)	2009-05-22 22:09:00 +00:00
Marko Zec	21ca7b57bd	Change the curvnet variable from a global const struct vnet , previously always pointing to the default vnet context, to a dynamically changing thread-local one. The currvnet context should be set on entry to networking code via CURVNET_SET() macros, and reverted to previous state via CURVNET_RESTORE(). Recursions on curvnet are permitted, though strongly discuouraged. This change should have no functional impact on nooptions VIMAGE kernel builds, where CURVNET_ macros expand to whitespace. The curthread->td_vnet (aka curvnet) variable's purpose is to be an indicator of the vnet context in which the current network-related operation takes place, in case we cannot deduce the current vnet context from any other source, such as by looking at mbuf's m->m_pkthdr.rcvif->if_vnet, sockets's so->so_vnet etc. Moreover, so far curvnet has turned out to be an invaluable consistency checking aid: it helps to catch cases when sockets, ifnets or any other vnet-aware structures may have leaked from one vnet to another. The exact placement of the CURVNET_SET() / CURVNET_RESTORE() macros was a result of an empirical iterative process, whith an aim to reduce recursions on CURVNET_SET() to a minimum, while still reducing the scope of CURVNET_SET() to networking only operations - the alternative would be calling CURVNET_SET() on each system call entry. In general, curvnet has to be set in three typicall cases: when processing socket-related requests from userspace or from within the kernel; when processing inbound traffic flowing from device drivers to upper layers of the networking stack, and when executing timer-driven networking functions. This change also introduces a DDB subcommand to show the list of all vnet instances. Approved by: julian (mentor)	2009-05-05 10:56:12 +00:00
Marko Zec	f6dfe47a14	Permit buiding kernels with options VIMAGE, restricted to only a single active network stack instance. Turning on options VIMAGE at compile time yields the following changes relative to default kernel build: 1) V_ accessor macros for virtualized variables resolve to structure fields via base pointers, instead of being resolved as fields in global structs or plain global variables. As an example, V_ifnet becomes: options VIMAGE: ((struct vnet_net ) vnet_net)->_ifnet default build: vnet_net_0._ifnet options VIMAGE_GLOBALS: ifnet 2) INIT_VNET_ macros will declare and set up base pointers to be used by V_ accessor macros, instead of resolving to whitespace: INIT_VNET_NET(ifp->if_vnet); becomes struct vnet_net vnet_net = (ifp->if_vnet)->mod_data[VNET_MOD_NET]; 3) Memory for vnet modules registered via vnet_mod_register() is now allocated at run time in sys/kern/kern_vimage.c, instead of per vnet module structs being declared as globals. If required, vnet modules can now request the framework to provide them with allocated bzeroed memory by filling in the vmi_size field in their vmi_modinfo structures. 4) structs socket, ifnet, inpcbinfo, tcpcb and syncache_head are extended to hold a pointer to the parent vnet. options VIMAGE builds will fill in those fields as required. 5) curvnet is introduced as a new global variable in options VIMAGE builds, always pointing to the default and only struct vnet. 6) struct sysctl_oid has been extended with additional two fields to store major and minor virtualization module identifiers, oid_v_subs and oid_v_mod. SYSCTL_V_ family of macros will fill in those fields accordingly, and store the offset in the appropriate vnet container struct in oid_arg1. In sysctl handlers dealing with virtualized sysctls, the SYSCTL_RESOLVE_V_ARG1() macro will compute the address of the target variable and make it available in arg1 variable for further processing. Unused fields in structs vnet_inet, vnet_inet6 and vnet_ipfw have been deleted. Reviewed by: bz, rwatson Approved by: julian (mentor)	2009-04-30 13:36:26 +00:00
Robert Watson	8bd015a1ca	As with ifnet_byindex_ref(), don't return IFF_DYING interfaces from ifunit_ref(). ifunit() continues to return them. MFC after: 3 weeks	2009-04-23 15:56:01 +00:00
Robert Watson	6064c5d362	Add ifunit_ref(), a version of ifunit(), that returns not just an interface pointer, but also a reference to it. Modify ifioctl() to use ifunit_ref(), holding the reference until all ioctls, etc, have completed. This closes a class of reader-writer races in which interfaces could be removed during long-running ioctls, leading to crashes. Many other consumers of ifunit() should now use ifunit_ref() to avoid similar races. MFC after: 3 weeks	2009-04-23 13:08:47 +00:00
Robert Watson	111c6b617b	During if_detach(), invoke if_dead() to set the ifnet's function pointers to "dead" implementations that no-op rather than invoking the device driver. This would generally be unexpected and possibly quite badly handled by most device drivers after if_detach() has completed. Reviewed by: bms MFC after: 3 weeks	2009-04-23 11:51:53 +00:00
Robert Watson	d6f157ea9a	Move portions of data structure initialization from if_attach() to if_alloc(), and portions of data structure destruction from if_detach() to if_free(). These changes leave more of the struct ifnet in a safe-to-access condition between alloc and attach, and between detach and free, and focus on attach/detach as stack usage events rather than data structure initialization. Affected fields include the linkstate task queue, if_afdata lock, address lists, kqueue state, and MAC labels. ifq_attach() ifq_detach() are not moved as ifq_attach() may use a queue length set by the device driver between if_alloc() and if_attach(). MFC after: 3 weeks	2009-04-23 10:59:40 +00:00
Robert Watson	242a8e72eb	Add a new interface flag, IFF_DYING, which is set when a device driver calls if_free(), and remains set if the refcount is elevated. IF_DYING skips the bit in the if_flags bitmask previously used by IFF_NEEDSGIANT, so that an MFC can be done without changing which bit is used, as IFF_NEEDSGIANT is still present in 7.x. ifnet_byindex_ref() checks for IFF_DYING and returns NULL if it is set, preventing new references from by acquired by index, preventing monitoring sysctls from seeing it. Other lookup mechanisms currently do not check IFF_DYING, but may need to in the future. MFC after: 3 weeks	2009-04-23 09:32:30 +00:00
Robert Watson	27d37320ec	Start to address a number of races relating to use of ifnet pointers after the corresponding interface has been destroyed: (1) Add an ifnet refcount, ifp->if_refcount. Initialize it to 1 in if_alloc(), and modify if_free_type() to decrement and check the refcount. (2) Add new if_ref() and if_rele() interfaces to allow kernel code walking global interface lists to release IFNET_[RW]LOCK() yet keep the ifnet stable. Currently, if_rele() is a no-op wrapper around if_free(), but this may change in the future. (3) Add new ifnet field, if_alloctype, which caches the type passed to if_alloc(), but unlike if_type, won't be changed by drivers. This allows asynchronous free's of the interface after the driver has released it to still use the right type. Use that instead of the type passed to if_free_type(), but assert that they are the same (might have to rethink this if that doesn't work out). (4) Add a new ifnet_byindex_ref(), which looks up an interface by index and returns a reference rather than a pointer to it. (5) Fix if_alloc() to fully initialize the if_addr_mtx before hooking up the ifnet to global lists. (6) Modify sysctls in if_mib.c to use ifnet_byindex_ref() and release the ifnet when done. When this change is MFC'd, it will need to replace if_ispare fields rather than adding new fields in order to avoid breaking the binary interface. Once this change is MFC'd, if_free_type() should be removed, as its 'type' argument is now optional. This refcount is not appropriate for counting mbuf pkthdr references, and also not for counting entry into the device driver via ifnet function pointers. An rmlock may be appropriate for the latter. Rather, this is about ensuring data structure stability when reaching an ifnet via global ifnet lists and tables followed by copy in or out of userspace. MFC after: 3 weeks Reported by: mdtancsa Reviewed by: brooks	2009-04-21 22:43:32 +00:00
Robert Watson	ab5ed8a5aa	Acquire the interface address list lock over some iterations over if_addrhead. This closes some reader-writer races associated with the address list. MFC after: 2 weeks	2009-04-21 19:06:47 +00:00
Kip Macy	7cc5b47fb3	export if_qflush for use by driver if_qflush routines only set ifp->if_{transmit, qflush} if not already set KASSERT that neither or both are set	2009-04-16 23:05:10 +00:00
Marko Zec	4d79e3d5e8	In the !VIMAGE_GLOBALS case, make sure not to call vnet_net_iattach() both via the vnet_mod_register() framework and then directly, but only once. Reviewed by: bz Approved by: julian (mentor)	2009-04-15 18:15:29 +00:00
Kip Macy	aee3056f64	call default if_qflush on ifq if default method isn't used by the driver	2009-04-14 03:17:44 +00:00
Marko Zec	bfe1aba468	Introduce vnet module registration / initialization framework with dependency tracking and ordering enforcement. With this change, per-vnet initialization functions introduced with r190787 are no longer directly called from traditional initialization functions (which cc in most cases inlined to pre-r190787 code), but are instead registered via the vnet framework first, and are invoked only after all prerequisite modules have been initialized. In the long run, this framework should allow us to both initialize and dismantle multiple vnet instances in a correct order. The problem this change aims to solve is how to replay the initialization sequence of various network stack components, which have been traditionally triggered via different mechanisms (SYSINIT, protosw). Note that this initialization sequence was and still can be subtly different depending on whether certain pieces of code have been statically compiled into the kernel, loaded as modules by boot loader, or kldloaded at run time. The approach is simple - we record the initialization sequence established by the traditional mechanisms whenever vnet_mod_register() is called for a particular vnet module. The vnet_mod_register_multi() variant allows a single initializer function to be registered multiple times but with different arguments - currently this is only used in kern/uipc_domain.c by net_add_domain() with different struct domain * as arguments, which allows for protosw-registered initialization routines to be invoked in a correct order by the new vnet initialization framework. For the purpose of identifying vnet modules, each vnet module has to have a unique ID, which is statically assigned in sys/vimage.h. Dynamic assignment of vnet module IDs is not supported yet. A vnet module may specify a single prerequisite module at registration time by filling in the vmi_dependson field of its vnet_modinfo struct with the ID of the module it depends on. Unless specified otherwise, all vnet modules depend on VNET_MOD_NET (container for ifnet list head, rt_tables etc.), which thus has to and will always be initialized first. The framework will panic if it detects any unresolved dependencies before completing system initialization. Detection of unresolved dependencies for vnet modules registered after boot (kldloaded modules) is not provided. Note that the fact that each module can specify only a single prerequisite may become problematic in the long run. In particular, INET6 depends on INET being already instantiated, due to TCP / UDP structures residing in INET container. IPSEC also depends on INET, which will in turn additionally complicate making INET6-only kernel configs a reality. The entire registration framework can be compiled out by turning on the VIMAGE_GLOBALS kernel config option. Reviewed by: bz Approved by: julian (mentor)	2009-04-11 05:58:58 +00:00
Max Laier	8623f9fd7a	Follow up for r190895 It's not only the "all" group that is affected, but all groups on the given interface. PR: kern/130977, kern/131310 MFC after: 3 days (%vnet)	2009-04-10 19:16:14 +00:00
Max Laier	876d5c038f	Remove interfaces from IFG_ALL on detach. This cures a couple of pf panics when using the "self" keyword in tables or as ()-style host address and fixes "ifconfig -g all" output. PR: kern/130977, kern/131310 Submitted by: Mikolaj Golub MFC after: 3 days	2009-04-10 14:41:51 +00:00
Marko Zec	1ed81b739e	First pass at separating per-vnet initializer functions from existing functions for initializing global state. At this stage, the new per-vnet initializer functions are directly called from the existing global initialization code, which should in most cases result in compiler inlining those new functions, hence yielding a near-zero functional change. Modify the existing initializer functions which are invoked via protosw, like ip_init() et. al., to allow them to be invoked multiple times, i.e. per each vnet. Global state, if any, is initialized only if such functions are called within the context of vnet0, which will be determined via the IS_DEFAULT_VNET(curvnet) check (currently always true). While here, V_irtualize a few remaining global UMA zones used by net/netinet/netipsec networking code. While it is not yet clear to me or anybody else whether this is the right thing to do, at this stage this makes the code more readable, and makes it easier to track uncollected UMA-zone-backed objects on vnet removal. In the long run, it's quite possible that some form of shared use of UMA zone pools among multiple vnets should be considered. Bump __FreeBSD_version due to changes in layout of structs vnet_ipfw, vnet_inet and vnet_net. Approved by: julian (mentor)	2009-04-06 22:29:41 +00:00
Sam Leffler	a51f44a7a6	enable setting the mac address of 802.11 devices	2009-03-28 17:36:56 +00:00
Jamie Gritton	bc3977f122	Call the interface's if_ioctl from ifioctl(), if the protocol didn't handle the ioctl. There are other paths that already call it, but this allows for a non-interface socket (like AF_LOCAL which ifconfig now uses) to use a broader class of interface ioctls. Approved by: bz (mentor), rwatson	2009-03-20 13:41:23 +00:00
Robert Watson	e5adda3d51	Remove IFF_NEEDSGIANT, a compatibility infrastructure introduced in FreeBSD 5.x to allow network device drivers to run with Giant despite the network stack being Giant-free. This significantly simplifies calls into ioctl() on network interfaces, especially in the multicast code, as well as eliminates deferred invocation of interface if_start routines. Disable the build on device drivers still depending on IFF_NEEDSGIANT as they no longer compile. They will be removed in a few weeks if they haven't been made MPSAFE in that time. Disabled drivers: if_ar if_axe if_aue if_cdce if_cue if_kue if_ray if_rue if_rum if_sr if_udav if_ural if_zyd Drivers that were already disabled because of tty changes: if_ppp if_sl Discussed on: arch@	2009-03-15 14:21:05 +00:00
Sam Leffler	c0c9ea90a8	remove stray ;	2009-03-14 17:54:58 +00:00
Bjoern A. Zeeb	33553d6e99	For all files including net/vnet.h directly include opt_route.h and net/route.h. Remove the hidden include of opt_route.h and net/route.h from net/vnet.h. We need to make sure that both opt_route.h and net/route.h are included before net/vnet.h because of the way MRT figures out the number of FIBs from the kernel option. If we do not, we end up with the default number of 1 when including net/vnet.h and array sizes are wrong. This does not change the list of files which depend on opt_route.h but we can identify them now more easily.	2009-02-27 14:12:05 +00:00
Jamie Gritton	b89e82dd87	Standardize the various prison_foo_ip[46] functions and prison_if to return zero on success and an error code otherwise. The possible errors are EADDRNOTAVAIL if an address being checked for doesn't match the prison, and EAFNOSUPPORT if the prison doesn't have any addresses in that address family. For most callers of these functions, use the returned error code instead of e.g. a hard-coded EADDRNOTAVAIL or EINVAL. Always include a jailed() check in these functions, where a non-jailed cred always returns success (and makes no changes). Remove the explicit jailed() checks that preceded many of the function calls. Approved by: bz (mentor)	2009-02-05 14:06:09 +00:00
John Baldwin	eb322a6f77	Only start the if_slowtimo timer (which drives the if_watchdog methods of network interfaces) if we have at least one interface with an if_watchdog routine. MFC after: 2 weeks	2009-01-23 20:53:01 +00:00
Kip Macy	6241d13a1b	if_rtdel is always called with the RADIX_NODE_HEAD lock held	2008-12-18 09:59:24 +00:00
Kip Macy	d24c444ca0	add ifnet_byindex_locked to allow for use of IFNET_RLOCK	2008-12-18 04:50:44 +00:00
Kip Macy	c368cff776	avoid trying to acquire a shared lock while holding an exclusive lock by making the ifnet lock acquisition exclusive	2008-12-17 04:33:52 +00:00
Kip Macy	991f8615e4	convert ifnet and afdata locks from mutexes to rwlocks	2008-12-17 00:11:56 +00:00
Qing Li	6e6b3f7cbc	This main goals of this project are: 1. separating L2 tables (ARP, NDP) from the L3 routing tables 2. removing as much locking dependencies among these layers as possible to allow for some parallelism in the search operations 3. simplify the logic in the routing code, The most notable end result is the obsolescent of the route cloning (RTF_CLONING) concept, which translated into code reduction in both IPv4 ARP and IPv6 NDP related modules, and size reduction in struct rtentry{}. The change in design obsoletes the semantics of RTF_CLONING, RTF_WASCLONE and RTF_LLINFO routing flags. The userland applications such as "arp" and "ndp" have been modified to reflect those changes. The output from "netstat -r" shows only the routing entries. Quite a few developers have contributed to this project in the past: Glebius Smirnoff, Luigi Rizzo, Alessandro Cerri, and Andre Oppermann. And most recently: - Kip Macy revised the locking code completely, thus completing the last piece of the puzzle, Kip has also been conducting active functional testing - Sam Leffler has helped me improving/refactoring the code, and provided valuable reviews - Julian Elischer setup the perforce tree for me and has helped me maintaining that branch before the svn conversion	2008-12-15 06:10:57 +00:00
Bjoern A. Zeeb	40eb85e75e	Whitespace changes only - tabs must have been converted to spaces somehow, when moving the code from p4 to svn. Sponsored by: The FreeBSD Foundation	2008-12-11 15:42:59 +00:00
Marko Zec	385195c062	Conditionally compile out V_ globals while instantiating the appropriate container structures, depending on VIMAGE_GLOBALS compile time option. Make VIMAGE_GLOBALS a new compile-time option, which by default will not be defined, resulting in instatiations of global variables selected for V_irtualization (enclosed in #ifdef VIMAGE_GLOBALS blocks) to be effectively compiled out. Instantiate new global container structures to hold V_irtualized variables: vnet_net_0, vnet_inet_0, vnet_inet6_0, vnet_ipsec_0, vnet_netgraph_0, and vnet_gif_0. Update the VSYM() macro so that depending on VIMAGE_GLOBALS the V_ macros resolve either to the original globals, or to fields inside container structures, i.e. effectively #ifdef VIMAGE_GLOBALS #define V_rt_tables rt_tables #else #define V_rt_tables vnet_net_0._rt_tables #endif Update SYSCTL_V_*() macros to operate either on globals or on fields inside container structs. Extend the internal kldsym() lookups with the ability to resolve selected fields inside the virtualization container structs. This applies only to the fields which are explicitly registered for kldsym() visibility via VNET_MOD_DECLARE() and vnet_mod_register(), currently this is done only in sys/net/if.c. Fix a few broken instances of MODULE_GLOBAL() macro use in SCTP code, and modify the MODULE_GLOBAL() macro to resolve to V_ macros, which in turn result in proper code being generated depending on VIMAGE_GLOBALS. De-virtualize local static variables in sys/contrib/pf/net/pf_subr.c which were prematurely V_irtualized by automated V_ prepending scripts during earlier merging steps. PF virtualization will be done separately, most probably after next PF import. Convert a few variable initializations at instantiation to initialization in init functions, most notably in ipfw. Also convert TUNABLE_INT() initializers for V_ variables to TUNABLE_FETCH_INT() in initializer functions. Discussed at: devsummit Strassburg Reviewed by: bz, julian Approved by: julian (mentor) Obtained from: //depot/projects/vimage-commit2/... X-MFC after: never Sponsored by: NLnet Foundation, The FreeBSD Foundation	2008-12-10 23:12:39 +00:00
Bjoern A. Zeeb	21b14a75f6	It does not make much sense to include net/route.h twice. Remove one #include.	2008-12-09 21:09:05 +00:00
Bjoern A. Zeeb	653735c44c	Add rwlock.h (and lock.h for that) to keep no-INET kernels compiling after RADIX_NODE_HEAD_{,UN}LOCK() were added. Must have been "learned" by pollution before (most likely: route.h -> radix.h -> rwlock.h)	2008-12-09 20:05:58 +00:00
Bjoern A. Zeeb	4b79449e2f	Rather than using hidden includes (with cicular dependencies), directly include only the header files needed. This reduces the unneeded spamming of various headers into lots of files. For now, this leaves us with very few modules including vnet.h and thus needing to depend on opt_route.h. Reviewed by: brooks, gnn, des, zec, imp Sponsored by: The FreeBSD Foundation	2008-12-02 21:37:28 +00:00
Bjoern A. Zeeb	413628a7e3	MFp4: Bring in updated jail support from bz_jail branch. This enhances the current jail implementation to permit multiple addresses per jail. In addtion to IPv4, IPv6 is supported as well. Due to updated checks it is even possible to have jails without an IP address at all, which basically gives one a chroot with restricted process view, no networking,.. SCTP support was updated and supports IPv6 in jails as well. Cpuset support permits jails to be bound to specific processor sets after creation. Jails can have an unrestricted (no duplicate protection, etc.) name in addition to the hostname. The jail name cannot be changed from within a jail and is considered to be used for management purposes or as audit-token in the future. DDB 'show jails' command was added to aid debugging. Proper compat support permits 32bit jail binaries to be used on 64bit systems to manage jails. Also backward compatibility was preserved where possible: for jail v1 syscalls, as well as with user space management utilities. Both jail as well as prison version were updated for the new features. A gap was intentionally left as the intermediate versions had been used by various patches floating around the last years. Bump __FreeBSD_version for the afore mentioned and in kernel changes. Special thanks to: - Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches and Olivier Houchard (cognet) for initial single-IPv6 patches. - Jeff Roberson (jeff) and Randall Stewart (rrs) for their help, ideas and review on cpuset and SCTP support. - Robert Watson (rwatson) for lots and lots of help, discussions, suggestions and review of most of the patch at various stages. - John Baldwin (jhb) for his help. - Simon L. Nielsen (simon) as early adopter testing changes on cluster machines as well as all the testers and people who provided feedback the last months on freebsd-jail and other channels. - My employer, CK Software GmbH, for the support so I could work on this. Reviewed by: (see above) MFC after: 3 months (this is just so that I get the mail) X-MFC Before: 7.2-RELEASE if possible	2008-11-29 14:32:14 +00:00
Marko Zec	97021c2464	Merge more of currently non-functional (i.e. resolving to whitespace) macros from p4/vimage branch. Do a better job at enclosing all instantiations of globals scheduled for virtualization in #ifdef VIMAGE_GLOBALS blocks. De-virtualize and mark as const saorder_state_alive and saorder_state_any arrays from ipsec code, given that they are never updated at runtime, so virtualizing them would be pointless. Reviewed by: bz, julian Approved by: julian (mentor) Obtained from: //depot/projects/vimage-commit2/... X-MFC after: never Sponsored by: NLnet Foundation, The FreeBSD Foundation	2008-11-26 22:32:07 +00:00
Sam Leffler	1444358966	use consistent style	2008-11-24 17:34:00 +00:00
Kip Macy	db7f0b974f	- bump __FreeBSD version to reflect added buf_ring, memory barriers, and ifnet functions - add memory barriers to <machine/atomic.h> - update drivers to only conditionally define their own - add lockless producer / consumer ring buffer - remove ring buffer implementation from cxgb and update its callers - add if_transmit(struct ifnet ifp, struct mbuf m) to ifnet to allow drivers to efficiently manage multiple hardware queues (i.e. not serialize all packets through one ifq) - expose if_qflush to allow drivers to flush any driver managed queues This work was supported by Bitgravity Inc. and Chelsio Inc.	2008-11-22 05:55:56 +00:00
Marko Zec	44e33a0758	Change the initialization methodology for global variables scheduled for virtualization. Instead of initializing the affected global variables at instatiation, assign initial values to them in initializer functions. As a rule, initialization at instatiation for such variables should never be introduced again from now on. Furthermore, enclose all instantiations of such global variables in #ifdef VIMAGE_GLOBALS blocks. Essentialy, this change should have zero functional impact. In the next phase of merging network stack virtualization infrastructure from p4/vimage branch, the new initialization methology will allow us to switch between using global variables and their counterparts residing in virtualization containers with minimum code churn, and in the long run allow us to intialize multiple instances of such container structures. Discussed at: devsummit Strassburg Reviewed by: bz, julian Approved by: julian (mentor) Obtained from: //depot/projects/vimage-commit2/... X-MFC after: never Sponsored by: NLnet Foundation, The FreeBSD Foundation	2008-11-19 09:39:34 +00:00
Bjoern A. Zeeb	5a97c9d46c	Include if_arp.h for IFP2AC so that the netgraph parts in if.c are happy even if compiled without INET or INET6. MFC after: 2 months	2008-11-06 15:26:09 +00:00
Dag-Erling Smørgrav	1ede983cc9	Retire the MALLOC and FREE macros. They are an abomination unto style(9). MFC after: 3 months	2008-10-23 15:53:51 +00:00
Marko Zec	8b615593fc	Step 1.5 of importing the network stack virtualization infrastructure from the vimage project, as per plan established at devsummit 08/08: http://wiki.freebsd.org/Image/Notes200808DevSummit Introduce INIT_VNET_() initializer macros, VNET_FOREACH() iterator macros, and CURVNET_SET() context setting macros, all currently resolving to NOPs. Prepare for virtualization of selected SYSCTL objects by introducing a family of SYSCTL_V_() macros, currently resolving to their global counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT(). Move selected #defines from sys/sys/vimage.h to newly introduced header files specific to virtualized subsystems (sys/net/vnet.h, sys/netinet/vinet.h etc.). All the changes are verified to have zero functional impact at this point in time by doing MD5 comparision between pre- and post-change object files(). () netipsec/keysock.c did not validate depending on compile time options. Implemented by: julian, bz, brooks, zec Reviewed by: julian, bz, brooks, kris, rwatson, ... Approved by: julian (mentor) Obtained from: //depot/projects/vimage-commit2/... X-MFC after: never Sponsored by: NLnet Foundation, The FreeBSD Foundation	2008-10-02 15:37:58 +00:00
Ed Schouten	6bfa9a2d66	Replace all calls to minor() with dev2unit(). After I removed all the unit2minor()/minor2unit() calls from the kernel yesterday, I realised calling minor() everywhere is quite confusing. Character devices now only have the ability to store a unit number, not a minor number. Remove the confusion by using dev2unit() everywhere. This commit could also be considered as a bug fix. A lot of drivers call minor(), while they should actually be calling dev2unit(). In -CURRENT this isn't a problem, but it turns out we never had any problem reports related to that issue in the past. I suspect not many people connect more than 256 pieces of the same hardware. Reviewed by: kib	2008-09-27 08:51:18 +00:00
Ed Schouten	d3ce832719	Remove unit2minor() use from kernel code. When I changed kern_conf.c three months ago I made device unit numbers equal to (unneeded) device minor numbers. We used to require bitshifting, because there were eight bits in the middle that were reserved for a device major number. Not very long after I turned dev2unit(), minor(), unit2minor() and minor2unit() into macro's. The unit2minor() and minor2unit() macro's were no-ops. We'd better not remove these four macro's from the kernel, because there is a lot of (external) code that may still depend on them. For now it's harmless to remove all invocations of unit2minor() and minor2unit(). Reviewed by: kib	2008-09-26 14:19:52 +00:00
Bjoern A. Zeeb	f0c042211b	Make the checks for ptp interfaces in ifa_ifwithdstaddr() and ifa_ifwithnet() look more similar by comparing the pointer to NULL in both cases. MFC after: 3 months	2008-08-24 11:03:43 +00:00
Andrew Thompson	516993d48e	ifnet_setbyindex() is only used locally, go back to being static.	2008-08-20 05:00:18 +00:00
Julian Elischer	ac957cd271	A bunch of formatting fixes brough to light by, or created by the Vimage commit a few days ago.	2008-08-20 01:05:56 +00:00
Bjoern A. Zeeb	603724d3ab	Commit step 1 of the vimage project, (network stack) virtualization work done by Marko Zec (zec@). This is the first in a series of commits over the course of the next few weeks. Mark all uses of global variables to be virtualized with a V_ prefix. Use macros to map them back to their global names for now, so this is a NOP change only. We hope to have caught at least 85-90% of what is needed so we do not invalidate a lot of outstanding patches again. Obtained from: //depot/projects/vimage-commit2/... Reviewed by: brooks, des, ed, mav, julian, jamie, kris, rwatson, zec, ... (various people I forgot, different versions) md5 (with a bit of help) Sponsored by: NLnet Foundation, The FreeBSD Foundation X-MFC after: never V_Commit_Message_Reviewed_By: more people than the patch	2008-08-17 23:27:27 +00:00
Robert Watson	02f4879d3a	Introduce locking around use of ifindex_table, whose use was previously unsynchronized. While races were extremely rare, we've now had a couple of reports of panics in environments involving large numbers of IPSEC tunnels being added very quickly on an active system. - Add accessor functions ifnet_byindex(), ifaddr_byindex(), ifdev_byindex() to replace existing accessor macros. These functions now acquire the ifnet lock before derefencing the table. - Add IFNET_WLOCK_ASSERT(). - Add static accessor functions ifnet_setbyindex(), ifdev_setbyindex(), which set values in the table either asserting of acquiring the ifnet lock. - Use accessor functions throughout if.c to modify and read ifindex_table. - Rework ifnet attach/detach to lock around ifindex_table modification. Note that these changes simply close races around use of ifindex_table, and make no attempt to solve the probem of disappearing ifnets. Further refinement of this work, including with respect to ifindex_table resizing, is still required. In a future change, the ifnet lock should be converted from a mutex to an rwlock in order to reduce contention. Reviewed and tested by: brooks	2008-06-26 23:05:28 +00:00
Brooks Davis	d94ccb096b	The if_check() function performed three actions: - verified that the ifp->if_snd.ifq_mtx was initalized for all attached interfaces. This was pointless because it was initalized for all interfaces in if_attach() so I've removed it. - Checked that ifp->if_snd.ifq_maxlen is initalized and set it to ifqmaxlen if unset. This makes more sense in if_attach() so I moved it there. - The first call of if_slowtimo(). Delete if_check() and call if_slowtimo() directly from the SYSINIT().	2008-05-17 03:38:13 +00:00
Julian Elischer	8b07e49a00	Add code to allow the system to handle multiple routing tables. This particular implementation is designed to be fully backwards compatible and to be MFC-able to 7.x (and 6.x) Currently the only protocol that can make use of the multiple tables is IPv4 Similar functionality exists in OpenBSD and Linux. From my notes: ----- One thing where FreeBSD has been falling behind, and which by chance I have some time to work on is "policy based routing", which allows different packet streams to be routed by more than just the destination address. Constraints: ------------ I want to make some form of this available in the 6.x tree (and by extension 7.x) , but FreeBSD in general needs it so I might as well do it in -current and back port the portions I need. One of the ways that this can be done is to have the ability to instantiate multiple kernel routing tables (which I will now refer to as "Forwarding Information Bases" or "FIBs" for political correctness reasons). Which FIB a particular packet uses to make the next hop decision can be decided by a number of mechanisms. The policies these mechanisms implement are the "Policies" referred to in "Policy based routing". One of the constraints I have if I try to back port this work to 6.x is that it must be implemented as a EXTENSION to the existing ABIs in 6.x so that third party applications do not need to be recompiled in timespan of the branch. This first version will not have some of the bells and whistles that will come with later versions. It will, for example, be limited to 16 tables in the first commit. Implementation method, Compatible version. (part 1) ------------------------------- For this reason I have implemented a "sufficient subset" of a multiple routing table solution in Perforce, and back-ported it to 6.x. (also in Perforce though not always caught up with what I have done in -current/P4). The subset allows a number of FIBs to be defined at compile time (8 is sufficient for my purposes in 6.x) and implements the changes needed to allow IPV4 to use them. I have not done the changes for ipv6 simply because I do not need it, and I do not have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it. Other protocol families are left untouched and should there be users with proprietary protocol families, they should continue to work and be oblivious to the existence of the extra FIBs. To understand how this is done, one must know that the current FIB code starts everything off with a single dimensional array of pointers to FIB head structures (One per protocol family), each of which in turn points to the trie of routes available to that family. The basic change in the ABI compatible version of the change is to extent that array to be a 2 dimensional array, so that instead of protocol family X looking at rt_tables[X] for the table it needs, it looks at rt_tables[Y][X] when for all protocol families except ipv4 Y is always 0. Code that is unaware of the change always just sees the first row of the table, which of course looks just like the one dimensional array that existed before. The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign() are all maintained, but refer only to the first row of the array, so that existing callers in proprietary protocols can continue to do the "right thing". Some new entry points are added, for the exclusive use of ipv4 code called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(), which have an extra argument which refers the code to the correct row. In addition, there are some new entry points (currently called rtalloc_fib() and friends) that check the Address family being looked up and call either rtalloc() (and friends) if the protocol is not IPv4 forcing the action to row 0 or to the appropriate row if it IS IPv4 (and that info is available). These are for calling from code that is not specific to any particular protocol. The way these are implemented would change in the non ABI preserving code to be added later. One feature of the first version of the code is that for ipv4, the interface routes show up automatically on all the FIBs, so that no matter what FIB you select you always have the basic direct attached hosts available to you. (rtinit() does this automatically). You CAN delete an interface route from one FIB should you want to but by default it's there. ARP information is also available in each FIB. It's assumed that the same machine would have the same MAC address, regardless of which FIB you are using to get to it. This brings us as to how the correct FIB is selected for an outgoing IPV4 packet. Firstly, all packets have a FIB associated with them. if nothing has been done to change it, it will be FIB 0. The FIB is changed in the following ways. Packets fall into one of a number of classes. 1/ locally generated packets, coming from a socket/PCB. Such packets select a FIB from a number associated with the socket/PCB. This in turn is inherited from the process, but can be changed by a socket option. The process in turn inherits it on fork. I have written a utility call setfib that acts a bit like nice.. setfib -3 ping target.example.com # will use fib 3 for ping. It is an obvious extension to make it a property of a jail but I have not done so. It can be achieved by combining the setfib and jail commands. 2/ packets received on an interface for forwarding. By default these packets would use table 0, (or possibly a number settable in a sysctl(not yet)). but prior to routing the firewall can inspect them (see below). (possibly in the future you may be able to associate a FIB with packets received on an interface.. An ifconfig arg, but not yet.) 3/ packets inspected by a packet classifier, which can arbitrarily associate a fib with it on a packet by packet basis. A fib assigned to a packet by a packet classifier (such as ipfw) would over-ride a fib associated by a more default source. (such as cases 1 or 2). 4/ a tcp listen socket associated with a fib will generate accept sockets that are associated with that same fib. 5/ Packets generated in response to some other packet (e.g. reset or icmp packets). These should use the FIB associated with the packet being reponded to. 6/ Packets generated during encapsulation. gif, tun and other tunnel interfaces will encapsulate using the FIB that was in effect withthe proces that set up the tunnel. thus setfib 1 ifconfig gif0 [tunnel instructions] will set the fib for the tunnel to use to be fib 1. Routing messages would be associated with their process, and thus select one FIB or another. messages from the kernel would be associated with the fib they refer to and would only be received by a routing socket associated with that fib. (not yet implemented) In addition Netstat has been edited to be able to cope with the fact that the array is now 2 dimensional. (It looks in system memory using libkvm (!)). Old versions of netstat see only the first FIB. In addition two sysctls are added to give: a) the number of FIBs compiled in (active) b) the default FIB of the calling process. Early testing experience: ------------------------- Basically our (IronPort's) appliance does this functionality already using ipfw fwd but that method has some drawbacks. For example, It can't fully simulate a routing table because it can't influence the socket's choice of local address when a connect() is done. Testing during the generating of these changes has been remarkably smooth so far. Multiple tables have co-existed with no notable side effects, and packets have been routes accordingly. ipfw has grown 2 new keywords: setfib N ip from anay to any count ip from any to any fib N In pf there seems to be a requirement to be able to give symbolic names to the fibs but I do not have that capacity. I am not sure if it is required. SCTP has interestingly enough built in support for this, called VRFs in Cisco parlance. it will be interesting to see how that handles it when it suddenly actually does something. Where to next: -------------------- After committing the ABI compatible version and MFCing it, I'd like to proceed in a forward direction in -current. this will result in some roto-tilling in the routing code. Firstly: the current code's idea of having a separate tree per protocol family, all of the same format, and pointed to by the 1 dimensional array is a bit silly. Especially when one considers that there is code that makes assumptions about every protocol having the same internal structures there. Some protocols don't WANT that sort of structure. (for example the whole idea of a netmask is foreign to appletalk). This needs to be made opaque to the external code. My suggested first change is to add routing method pointers to the 'domain' structure, along with information pointing the data. instead of having an array of pointers to uniform structures, there would be an array pointing to the 'domain' structures for each protocol address domain (protocol family), and the methods this reached would be called. The methods would have an argument that gives FIB number, but the protocol would be free to ignore it. When the ABI can be changed it raises the possibilty of the addition of a fib entry into the "struct route". Currently, the structure contains the sockaddr of the desination, and the resulting fib entry. To make this work fully, one could add a fib number so that given an address and a fib, one can find the third element, the fib entry. Interaction with the ARP layer/ LL layer would need to be revisited as well. Qing Li has been working on this already. This work was sponsored by Ironport Systems/Cisco Reviewed by: several including rwatson, bz and mlair (parts each) Obtained from: Ironport systems/Cisco	2008-05-09 23:03:00 +00:00
Brooks Davis	ae0615f633	Delay the global registration of the struct ifnet in if_alloc() until after we're certain the allocation will entierly succeed. This fixes a leak in a fairly unlikely case. Reported by: vijay singh <vijjus at rocketmail dot com> MFC after: 1 week	2008-04-19 22:04:51 +00:00
Sam Leffler	fb27dd1db3	expose if_purgemaddrs, it will be used by the vap code unless someone redesigns the mcast support code in the next few weeks MFC after: 3 weeks	2008-03-25 21:23:32 +00:00
Robert Watson	237fdd787b	In keeping with style(9)'s recommendations on macros, use a ';' after each SYSINIT() macro invocation. This makes a number of lightweight C parsers much happier with the FreeBSD kernel source, including cflow's prcc and lxr. MFC after: 1 month Discussed with: imp, rink	2008-03-16 10:58:09 +00:00
Robert Watson	b9175c4556	Move IFF_NEEDSGIANT warning from if_ethersubr.c to if.c so it is displayed for all network interfaces, not just ethernet-like ones. Upgrade it to a louder WARNING and be explicit that the flag is obsolete. Support for IFF_NEEDSGIANT will be removed in a few months (see arch@ for details) and will not appear in 8.0. Upgrade if_watchdog to a WARNING.	2008-03-07 16:00:44 +00:00
Robert Watson	30d239bc4c	Merge first in a series of TrustedBSD MAC Framework KPI changes from Mac OS X Leopard--rationalize naming for entry points to the following general forms: mac_<object>_<method/action> mac_<object>_check_<method/action> The previous naming scheme was inconsistent and mostly reversed from the new scheme. Also, make object types more consistent and remove spaces from object types that contain multiple parts ("posix_sem" -> "posixsem") to make mechanical parsing easier. Introduce a new "netinet" object type for certain IPv4/IPv6-related methods. Also simplify, slightly, some entry point names. All MAC policy modules will need to be recompiled, and modules not updates as part of this commit will need to be modified to conform to the new KPI. Sponsored by: SPARTA (original patches against Mac OS X) Obtained from: TrustedBSD Project, Apple Computer	2007-10-24 19:04:04 +00:00
Robert Watson	33d2bb9ca3	First in a series of changes to remove the now-unused Giant compatibility framework for non-MPSAFE network protocols: - Remove debug_mpsafenet variable, sysctl, and tunable. - Remove NET_NEEDS_GIANT() and associate SYSINITSs used by it to force debug.mpsafenet=0 if non-MPSAFE protocols are compiled into the kernel. - Remove logic to automatically flag interrupt handlers as non-MPSAFE if debug.mpsafenet is set for an INTR_TYPE_NET handler. - Remove logic to automatically flag netisr handlers as non-MPSAFE if debug.mpsafenet is set. - Remove references in a few subsystems, including NFS and Cronyx drivers, which keyed off debug_mpsafenet to determine various aspects of their own locking behavior. - Convert NET_LOCK_GIANT(), NET_UNLOCK_GIANT(), and NET_ASSERT_GIANT into no-op's, as their entire behavior was determined by the value in debug_mpsafenet. - Alias NET_CALLOUT_MPSAFE to CALLOUT_MPSAFE. Many remaining references to NET_.*_GIANT() and NET_CALLOUT_MPSAFE are still present in subsystems, and will be removed in followup commits. Reviewed by: bz, jhb Approved by: re (kensmith)	2007-07-27 11:59:57 +00:00
Brooks Davis	a45cbf12c8	Update the comments on if_alloc(), if_free(), if_free_type(), and if_attach. Remove a comment about pre-3.0 network drivers from if_attach(). Be a bit more consistant about whitespace near comments.	2007-05-16 19:59:01 +00:00
Andrew Thompson	18242d3b09	Rename the trunk(4) driver to lagg(4) as it is too similar to vlan trunking. The name trunk is misused as the networking term trunk means carrying multiple VLANs over a single connection. The IEEE standard for link aggregation (802.3 section 3) does not talk about 'trunk' at all while it is used throughout IEEE 802.1Q in describing vlans. The lagg(4) driver provides link aggregation, failover and fault tolerance. Discussed on: current@	2007-04-17 00:35:11 +00:00
Andrew Thompson	b47888ceba	Add the trunk(4) driver for providing link aggregation, failover and fault tolerance. This driver allows aggregation of multiple network interfaces as one virtual interface using a number of different protocols/algorithms. failover - Sends traffic through the secondary port if the master becomes inactive. fec - Supports Cisco Fast EtherChannel. lacp - Supports the IEEE 802.3ad Link Aggregation Control Protocol (LACP) and the Marker Protocol. loadbalance - Static loadbalancing using an outgoing hash. roundrobin - Distributes outgoing traffic using a round-robin scheduler through all active ports. This code was obtained from OpenBSD and this also includes 802.3ad LACP support from agr(4) in NetBSD.	2007-04-10 00:27:25 +00:00
Bruce M Simpson	75ae0c016b	Fix a case where hardware removal of an interface caused an attempt to announce an ll_ifma which has gone away. Add a KASSERT to catch regressions. Bug found by: Tom Uffner	2007-03-27 16:11:28 +00:00
Bruce M Simpson	5896d12465	Fix tinderbox; ng_ether needs to see if_findmulti().	2007-03-20 03:15:43 +00:00
Bruce M Simpson	ec002fee99	Implement reference counting for ifmultiaddr, in_multi, and in6_multi structures. Detect when ifnet instances are detached from the network stack and perform appropriate cleanup to prevent memory leaks. This has been implemented in such a way as to be backwards ABI compatible. Kernel consumers are changed to use if_delmulti_ifma(); in_delmulti() is unable to detect interface removal by design, as it performs searches on structures which are removed with the interface. With this architectural change, the panics FreeBSD users have experienced with carp and pfsync should be resolved. Obtained from: p4 branch bms_netdev Reviewed by: andre Sponsored by: Garance A Drosehn Idea from: NetBSD MFC after: 1 month	2007-03-20 00:36:10 +00:00
Bruce M Simpson	40d8a30241	Fix a bug in if_findmulti(), whereby it would not find (and thus delete) a link-layer multicast group membership. Such memberships are needed in order to support protocols such as IS-IS without putting the interface into PROMISC or ALLMULTI modes. sa_equal() is not OK for comparing sockaddr_dl as it has deeper structure than a simple byte array, so add sa_dl_equal() and use that instead. Reviewed by: rwatson Verified with: /usr/sbin/mtest Bug found by: Jouke Witteveen MFC after: 2 weeks	2007-02-22 00:14:02 +00:00
Gleb Smirnoff	c18ffdc87d	The recent issues with em(4) interface has shown that the old 4.4BSD if_watchdog/if_timer interface doesn't fit modern SMP network stack design. Device drivers that need watchdog to monitor their hardware should implement it theirselves. Eventually the if_watchdog/if_timer API will be removed. For now, warn that driver uses it. Reviewed by: scottl	2006-11-30 15:02:01 +00:00
Robert Watson	acd3428b7d	Sweep kernel replacing suser(9) calls with priv(9) calls, assigning specific privilege names to a broad range of privileges. These may require some future tweaking. Sponsored by: nCircle Network Security, Inc. Obtained from: TrustedBSD Project Discussed on: arch@ Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri, Alex Lyashkov <umka at sevcity dot net>, Skip Ford <skip dot ford at verizon dot net>, Antoine Brodin <antoine dot brodin at laposte dot net>	2006-11-06 13:42:10 +00:00

1 2 3 4 5 ...

411 Commits