freebsd-skq/sys/netpfil/ipfw/dummynet.txt

861 lines
38 KiB
Plaintext
Raw Normal View History

Bring in the most recent version of ipfw and dummynet, developed and tested over the past two months in the ipfw3-head branch. This also happens to be the same code available in the Linux and Windows ports of ipfw and dummynet. The major enhancement is a completely restructured version of dummynet, with support for different packet scheduling algorithms (loadable at runtime), faster queue/pipe lookup, and a much cleaner internal architecture and kernel/userland ABI which simplifies future extensions. In addition to the existing schedulers (FIFO and WF2Q+), we include a Deficit Round Robin (DRR or RR for brevity) scheduler, and a new, very fast version of WF2Q+ called QFQ. Some test code is also present (in sys/netinet/ipfw/test) that lets you build and test schedulers in userland. Also, we have added a compatibility layer that understands requests from the RELENG_7 and RELENG_8 versions of the /sbin/ipfw binaries, and replies correctly (at least, it does its best; sometimes you just cannot tell who sent the request and how to answer). The compatibility layer should make it possible to MFC this code in a relatively short time. Some minor glitches (e.g. handling of ipfw set enable/disable, and a workaround for a bug in RELENG_7's /sbin/ipfw) will be fixed with separate commits. CREDITS: This work has been partly supported by the ONELAB2 project, and mostly developed by Riccardo Panicucci and myself. The code for the qfq scheduler is mostly from Fabio Checconi, and Marta Carbone and Francesco Magno have helped with testing, debugging and some bug fixes.
2010-03-02 17:40:48 +00:00
#
# $FreeBSD$
#
Notes on the internal structure of dummynet (2010 version)
by Riccardo Panicucci and Luigi Rizzo
Work supported by the EC project ONELAB2
*********
* INDEX *
*********
Implementation of new dummynet
Internal structure
Files
Packet arrival
The reconfiguration routine
dummynet_task()
Configuration
Add a pipe
Add a scheduler
Add a flowset
Listing object
Delete of object
Delete a pipe
Delete a flowset
Delete a scheduler
Compatibility with FreeBSD7.2 and FreeBSD 8 ipfw binary
ip_dummynet_glue.c
ip_fw_glue.c
How to configure dummynet
How to implement a new scheduler
OPEN ISSUES
------------------------------
20100131 deleting RR causes infinite loop
presumably in the rr_free_queue() call -- seems to hang
forever when deleting a live flow
------------------------------
Dummynet is a traffic shaper and network emulator. Packets are
selected by an external filter such as ipfw, and passed to the emulator
with a tag such as "pipe 10" or "queue 5" which tells what to
do with the packet. As an example
ipfw add queue 5 icmp from 10.0.0.2 to all
All packets with the same tag belong to a "flowset", or a set
of flows which can be further partitioned according to a mask.
Flowsets are then passed to a scheduler for processing. The
association of flowsets and schedulers is configurable e.g.
ipfw queue 5 config sched 10 weight 3 flow_mask xxxx
ipfw queue 8 config sched 10 weight 1 ...
ipfw queue 3 config sched 20 weight 1 ...
"sched 10" represents one or more scheduler instances,
selected through a mask on the 5-tuple itself.
ipfw sched 20 config type FIFO sched_mask yyy ...
There are in fact two masks applied to each packet:
+ the "sched_mask" sends packets arriving to a scheduler_id to
one of many instances.
+ the "flow_mask" together with the flowset_id is used to
collect packets into independent flows on each scheduler.
As an example, we can have
ipfw queue 5 config sched 10 flow_mask src-ip 0x000000ff
ipfw sched 10 config type WF2Q+ sched_mask src-ip 0xffffff00
means that sched 10 will have one instance per /24 source subnet,
and within that, each individual source will be a flow.
Internal structure
-----------------
Dummynet-related data is split into several data structures,
part of them constituting the userland-kernel API, and others
specific to the kernel.
NOTE: for up-to-date details please look at the relevant source
headers (ip_dummynet.h, ip_dn_private.h, dn_sched.h)
USERLAND-KERNEL API (ip_dummynet.h)
struct dn_link:
contains data about the physical link such as
bandwidth, delay, burst size;
Bring in the most recent version of ipfw and dummynet, developed and tested over the past two months in the ipfw3-head branch. This also happens to be the same code available in the Linux and Windows ports of ipfw and dummynet. The major enhancement is a completely restructured version of dummynet, with support for different packet scheduling algorithms (loadable at runtime), faster queue/pipe lookup, and a much cleaner internal architecture and kernel/userland ABI which simplifies future extensions. In addition to the existing schedulers (FIFO and WF2Q+), we include a Deficit Round Robin (DRR or RR for brevity) scheduler, and a new, very fast version of WF2Q+ called QFQ. Some test code is also present (in sys/netinet/ipfw/test) that lets you build and test schedulers in userland. Also, we have added a compatibility layer that understands requests from the RELENG_7 and RELENG_8 versions of the /sbin/ipfw binaries, and replies correctly (at least, it does its best; sometimes you just cannot tell who sent the request and how to answer). The compatibility layer should make it possible to MFC this code in a relatively short time. Some minor glitches (e.g. handling of ipfw set enable/disable, and a workaround for a bug in RELENG_7's /sbin/ipfw) will be fixed with separate commits. CREDITS: This work has been partly supported by the ONELAB2 project, and mostly developed by Riccardo Panicucci and myself. The code for the qfq scheduler is mostly from Fabio Checconi, and Marta Carbone and Francesco Magno have helped with testing, debugging and some bug fixes.
2010-03-02 17:40:48 +00:00
struct dn_fs:
describes a flowset, i.e. a template for queues.
Main parameters are the scheduler we attach to, a flow_mask,
buckets, queue size, plr, weight, and other scheduler-specific
parameters.
struct dn_flow
contains information on a flow, including masks and
statistics
struct dn_sch:
defines a scheduler (and a link attached to it).
Parameters include scheduler type, sched_mask, number of
buckets, and possibly other scheduler-specific parameters,
struct dn_profile:
fields to simulate a delay profile
KERNEL REPRESENTATION (ip_dn_private.h)
struct mq
a queue of mbufs with head and tail.
struct dn_queue
individual queue of packets, created by a flowset using
flow_mask and attached to a scheduler instance selected
through sched_mask.
A dn_queue has a pointer to the dn_fsk (which in turn counts
how many queues point to it), a pointer to the
dn_sch_inst it attaches to, and is in a hash table in the
flowset. scheduler instances also should store queues in
their own containers used for scheduling (lists, trees, etc.)
CREATE: done on packet arrivals when a flow matches a flowset.
DELETE: done only when deleting the parent dn_sch_inst
or draining memory.
struct dn_fsk
includes a dn_fs; a pointer to the dn_schk; a link field
for the list of dn_fsk attached to the same scheduler,
or for the unlinked list;
a refcount for the number of queues pointing to it;
The dn_fsk is in a hash table, fshash.
CREATE: done on configuration commands.
DELETE: on configuration commands.
struct dn_sch_inst
a scheduler instance, created from a dn_schk applying sched_mask.
Contains a delay line, a reference to the parent, and scheduler-
specific info. Both dn_sch_inst and its delay line can be in the
evheap if they have events to be processed.
CREATE: created from a dn_schk applying sched_mask
DELETE: configuration command delete a scheduler which in turn
sweeps the hash table of instances deleting them
struct dn_schk
includes dn_sch, dn_link, a pointer to dn_profile,
a hash table of dn_sch_inst, a list of dn_fsk
attached to it.
CREATE: configuration command. If there are flowsets that
refer to this number, they are attached and moved
to the hash table
DELETE: manual, see dn_sch_inst
fshash schedhash
+---------------+ sched +--------------+
| sched-------------------->| NEW_SCHK|
-<----*sch_chain |<-----------------*fsk_list |
|NEW_FSK |<----. | [dn_link] |
+---------------+ | +--------------+
|qht (hash) | | | siht(hash) |
| [dn_queue] | | | [dn_si] |
| [dn_queue] | | | [dn_si] |
| ... | | | ... |
| +--------+ | | | +---------+ |
| |dn_queue| | | | |dn_si | |
| | fs *----------' | | | |
| | si *---------------------->| | |
| +---------+ | | +---------+ |
+---------------+ +--------------+
The following global data structures contain all
schedulers and flowsets.
- schedhash[x]: contains all scheduler templates in the system.
Looked up only on manual configurations, where flowsets
are attached to matching schedulers.
We have one entry per 'sched X config' command
(plus one for each 'pipe X config').
- fshash[x]: contains all flowsets.
We do a lookup on this for each packet.
We have one entry for each 'queue X config'
(plus one for each 'pipe X config').
Additionally, a list that contains all unlinked flowset:
- fsu: contains flowset that are not linked with any scheduler.
Flowset are put in this list when they refer to a non
existing scheduler.
We don't need an efficient data structure as we never search
here on a packet arrivals.
Scheduler instances and the delay lines associated with each scheduler
instance need to be woken up at certain times. Because we have many
such objects, we keep them in a priority heap (system_heap).
Almost all objects in this implementation are preceded by a structure
(struct dn_id) which makes it easier to identify them.
Files
-----
The dummynet code is split in several files.
All kernel code is in sys/netpfil/ipfw except ip_dummynet.h
Bring in the most recent version of ipfw and dummynet, developed and tested over the past two months in the ipfw3-head branch. This also happens to be the same code available in the Linux and Windows ports of ipfw and dummynet. The major enhancement is a completely restructured version of dummynet, with support for different packet scheduling algorithms (loadable at runtime), faster queue/pipe lookup, and a much cleaner internal architecture and kernel/userland ABI which simplifies future extensions. In addition to the existing schedulers (FIFO and WF2Q+), we include a Deficit Round Robin (DRR or RR for brevity) scheduler, and a new, very fast version of WF2Q+ called QFQ. Some test code is also present (in sys/netinet/ipfw/test) that lets you build and test schedulers in userland. Also, we have added a compatibility layer that understands requests from the RELENG_7 and RELENG_8 versions of the /sbin/ipfw binaries, and replies correctly (at least, it does its best; sometimes you just cannot tell who sent the request and how to answer). The compatibility layer should make it possible to MFC this code in a relatively short time. Some minor glitches (e.g. handling of ipfw set enable/disable, and a workaround for a bug in RELENG_7's /sbin/ipfw) will be fixed with separate commits. CREDITS: This work has been partly supported by the ONELAB2 project, and mostly developed by Riccardo Panicucci and myself. The code for the qfq scheduler is mostly from Fabio Checconi, and Marta Carbone and Francesco Magno have helped with testing, debugging and some bug fixes.
2010-03-02 17:40:48 +00:00
All userland code is in sbin/ipfw.
Files are
- sys/netpfil/ip_dummynet.h defines the kernel-userland API
Bring in the most recent version of ipfw and dummynet, developed and tested over the past two months in the ipfw3-head branch. This also happens to be the same code available in the Linux and Windows ports of ipfw and dummynet. The major enhancement is a completely restructured version of dummynet, with support for different packet scheduling algorithms (loadable at runtime), faster queue/pipe lookup, and a much cleaner internal architecture and kernel/userland ABI which simplifies future extensions. In addition to the existing schedulers (FIFO and WF2Q+), we include a Deficit Round Robin (DRR or RR for brevity) scheduler, and a new, very fast version of WF2Q+ called QFQ. Some test code is also present (in sys/netinet/ipfw/test) that lets you build and test schedulers in userland. Also, we have added a compatibility layer that understands requests from the RELENG_7 and RELENG_8 versions of the /sbin/ipfw binaries, and replies correctly (at least, it does its best; sometimes you just cannot tell who sent the request and how to answer). The compatibility layer should make it possible to MFC this code in a relatively short time. Some minor glitches (e.g. handling of ipfw set enable/disable, and a workaround for a bug in RELENG_7's /sbin/ipfw) will be fixed with separate commits. CREDITS: This work has been partly supported by the ONELAB2 project, and mostly developed by Riccardo Panicucci and myself. The code for the qfq scheduler is mostly from Fabio Checconi, and Marta Carbone and Francesco Magno have helped with testing, debugging and some bug fixes.
2010-03-02 17:40:48 +00:00
- ip_dn_private.h contains the kernel-specific APIs
and data structures
- dn_sched.h defines the scheduler API
- ip_dummynet.c cointains module glue and sockopt handlers, with all
functions to configure and list objects.
- ip_dn_io.c contains the functions directly related to packet processing,
and run in the critical path. It also contains some functions
exported to the schedulers.
- dn_heap.[ch] implement a binary heap and a generic hash table
- dn_sched_* implement the various scheduler modules
- dummynet.c is the file used to implement the user side of dummynet.
It contains the function to parsing command line, and functions to
show the output of dummynet objects.
Moreover, there are two new file (ip_dummynet_glue.c and ip_fw_glue.c) that
are used to allow compatibility with the "ipfw" binary from FreeBSD 7.2 and
FreeBSD 8.
LOCKING
=======
At the moment the entire processing occurs under a single lock
which is expected to be acquired in exclusive mode
DN_BH_WLOCK() / DN_BH_WUNLOCK().
In perspective we aim at the following:
- the 'busy' flag, 'pending' list and all structures modified by packet
arrivals and departures are protected by the BH_WLOCK.
This is normally acquired in exclusive mode by the packet processing
functions for short sections of code (exception -- the timer).
If 'busy' is not set, we can do regular packet processing.
If 'busy' is set, no pieces can be accessed.
We must enqueue the packet on 'pending' and return immediately.
- the 'busy' flag is set/cleared by long sections of code as follows:
UH_WLOCK(); KASSERT(busy == 0);
BH_WLOCK(); busy=1; BH_WUNLOCK();
... do processing ...
BH_WLOCK(); busy=0; drain_queue(pending); BH_WUNLOCK();
UH_WUNLOCK();
this normally happens when the upper half has something heavy
to do. The prologue and epilogue are not in the critical path.
- the main containers (fshash, schedhash, ...) are protected by
UH_WLOCK.
Packet processing
=================
A packet enters dummynet through dummynet_io(). We first lookup
the flowset number in fshash using dn_ht_find(), then find the scheduler
instance using ipdn_si_find(), then possibly identify the correct
queue with ipdn_q_find().
If successful, we call the scheduler's enqueue function(), and
if needed start I/O on the link calling serve_sched().
If the packet can be returned immediately, this is done by
leaving *m0 set. Otherwise, the packet is absorbed by dummynet
and we simply return, possibly with some appropriate error code.
Reconfiguration
---------------
Reconfiguration is the complex part of the system because we need to
keep track of the various objects and containers.
At the moment we do not use reference counts for objects so all
processing must be done under a lock.
The main entry points for configuration is the ip_dn_ctl() handler
for the IP_DUMMYNET3 sockopt (others are provided only for backward
compatibility). Modifications to the configuration call do_config().
The argument is a sequence of blocks each starting with a struct dn_id
which specifies its content.
The first dn_id must contain as obj.id the DN_API_VERSION
The obj.type is DN_CMD_CONFIG (followed by actual objects),
DN_CMD_DELETE (with the correct subtype and list of objects), or
DN_CMD_FLUSH.
DN_CMD_CONFIG is followed by objects to add/reconfigure. In general,
if an object already exists it is reconfigured, otherwise it is
created in a way that keeps the structure consistent.
We have the following objects in the system, normally numbered with
an identifier N between 1 and 65535. For certain objects we have
"shadow" copies numbered I+NMAX and I+ 2*NMAX which are used to
implement certain backward compatibility features.
In general we have the following linking
TRADITIONAL DUMMYNET QUEUES "queue N config ... pipe M ..."
corresponds to a dn_fs object numbered N
TRADITIONAL DUMMYNET PIPES "pipe N config ..."
dn_fs N+2*NMAX --> dn_sch N+NMAX type FIFO --> dn_link N+NMAX
GENERIC SCHEDULER "sched N config ... "
[dn_fs N+NMAX] --> dn_sch N --> dn_link N
The flowset N+NMAX is created only if the scheduler is not
of type MULTIQUEUE.
DELAY PROFILE "pipe N config profile ..."
it is always attached to an existing dn_link N
Because traditional dummynet pipes actually configure both a
'standalone' instance and one that can be used by queues,
we do the following:
"pipe N config ..." configures:
dn_sched N type WF2Q+
dn_sched N+NMAX type FIFO
dn_fs N+2NMAX attached to dn_sched N+NMAX
dn_pipe N
dn_pipe N+NMAX
"queue N config" configures
dn_fs N
"sched N config" configures
dn_sched N type as desired
dn_fs N+NMAX attached to dn_sched N
dummynet_task()
===============
The dummynet_task() function is the main dummynet processing function and is
Bring in the most recent version of ipfw and dummynet, developed and tested over the past two months in the ipfw3-head branch. This also happens to be the same code available in the Linux and Windows ports of ipfw and dummynet. The major enhancement is a completely restructured version of dummynet, with support for different packet scheduling algorithms (loadable at runtime), faster queue/pipe lookup, and a much cleaner internal architecture and kernel/userland ABI which simplifies future extensions. In addition to the existing schedulers (FIFO and WF2Q+), we include a Deficit Round Robin (DRR or RR for brevity) scheduler, and a new, very fast version of WF2Q+ called QFQ. Some test code is also present (in sys/netinet/ipfw/test) that lets you build and test schedulers in userland. Also, we have added a compatibility layer that understands requests from the RELENG_7 and RELENG_8 versions of the /sbin/ipfw binaries, and replies correctly (at least, it does its best; sometimes you just cannot tell who sent the request and how to answer). The compatibility layer should make it possible to MFC this code in a relatively short time. Some minor glitches (e.g. handling of ipfw set enable/disable, and a workaround for a bug in RELENG_7's /sbin/ipfw) will be fixed with separate commits. CREDITS: This work has been partly supported by the ONELAB2 project, and mostly developed by Riccardo Panicucci and myself. The code for the qfq scheduler is mostly from Fabio Checconi, and Marta Carbone and Francesco Magno have helped with testing, debugging and some bug fixes.
2010-03-02 17:40:48 +00:00
called every tick. This function first calculate the new current time, then
it checks if it is the time to wake up object from the system_heap comparing
the current time and the key of the heap. Two types of object (really the
heap contains pointer to objects) are in the
system_heap:
- scheduler instance: if a scheduler instance is waked up, the dequeue()
function is called until it has credit. If the dequeue() returns packets,
the scheduler instance is inserted in the heap with a new key depending of
the data that will be send out. If the scheduler instance remains with
some credit, it means that is hasn't other packet to send and so the
instance is no longer inserted in the heap.
If the scheduler instance extracted from the heap has the DELETE flag set,
the dequeue() is not called and the instance is destroyed now.
- delay line: when extracting a delay line, the function transmit_event() is
called to send out packet from delay line.
If the scheduler instance associated with this delay line doesn't exists,
the delay line will be delete now.
Configuration
=============
To create a pipe, queue or scheduler, the user should type commands like:
"ipfw pipe x config"
"ipfw queue y config pipe x"
"ipfw pipe x config sched <type>"
The userland side of dummynet will prepare a buffer contains data to pass to
kernel side.
The buffer contains all struct needed to configure an object. In more detail,
to configure a pipe all three structs (dn_link, dn_sch, dn_fs) are needed,
plus the delay profile struct if the pipe has a delay profile.
If configuring a scheduler only the struct dn_sch is wrote in the buffer,
while if configuring a flowset only the dn_fs struct is wrote.
The first struct in the buffer contains the type of command request, that is
if it is configuring a pipe, a queue, or a scheduler. Then there are structs
need to configure the object, and finally there is the struct that mark
the end of the buffer.
To support the insertion of pipe and queue using the old syntax, when adding
a pipe it's necessary to create a FIFO flowset and a FIFO scheduler, which
have a number x + DN_PIPEOFFSET.
Add a pipe
----------
A pipe is only a template for a link.
If the pipe already exists, parameters are updated. If a delay profile exists
it is deleted and a new one is created.
If the pipe doesn't exist a new one is created. After the creation, the
flowset unlinked list is scanned to see if there are some flowset that would
be linked with this pipe. If so, these flowset will be of wf2q+ type (for
compatibility) and a new wf2q+ scheduler is created now.
Add a scheduler
---------------
If the scheduler already exists, and the type and the mask are the same, the
scheduler is simply reconfigured calling the config_scheduler() scheduler
function with the RECONFIGURE flag active.
If the type or the mask differ, it is necessary to delete the old scheduler
and create a new one.
If the scheduler doesn't exists, a new one is created. If the scheduler has
a mask, the hash table is created to store pointers to scheduler instances.
When a new scheduler is created, it is necessary to scan the unlinked
flowset list to search eventually flowset that would be linked with this
scheduler number. If some are found, flowsets became of the type of this
scheduler and they are configured properly.
Add a flowset
-------------
Flowset pointers are store in the system in two list. The unlinked flowset list
contains all flowset that aren't linked with a scheduler, the flowset list
contains flowset linked to a scheduler, and so they have a type.
When adding a new flowset, first it is checked if the flowset exists (that is,
it is in the flowset list) and if it doesn't exists a new flowset is created
and added to unlinked flowset list if the scheduler which the flowset would be
linked doesn't exists, or added in the flowset list and configured properly if
the scheduler exists. If the flowset (before to be created) was in the
unlinked flowset list, it is removed and deleted, and then recreated.
If the flowset exists, to allow reconfiguration of this flowset, the
scheduler number and types must match with the one in memory. If this isn't
so, the flowset is deleted and a new one will be created. Really, the flowset
it isn't deleted now, but it is removed from flowset list and it will be
deleted later because there could be some queues that are using it.
Listing of object
=================
The user can request a list of object present in dummynet through the command
"ipfw [-v] pipe|queue [x] list|show"
The kernel side of dummynet send a buffer to user side that contains all
pipe, all scheduler, all flowset, plus all scheduler instances and all queues.
The dummynet user land will format the output and show only the relevant
information.
The buffer sent start with all pipe from the system. The entire struct dn_link
is passed, except the delay_profile struct that is useless in user space.
After pipes, all flowset are wrote in the buffer. The struct contains
scheduler flowset specific data is linked with the flowset writing the
'obj' id of the extension into the 'alg_fs' pointer.
Then schedulers are wrote. If a scheduler has one or more scheduler instance,
these are linked to the parent scheduler writing the id of the parent in the
'ptr_sched' pointer. If a scheduler instance has queues, there are wrote in
the buffer and linked thorugh the 'obj' and 'sched_inst' pointer.
Finally, flowsets in the unlinked flowset list are write in the buffer, and
then a struct gen in saved in the buffer to mark the last struct in the buffer.
Delete of object
================
An object is usually removed by user through a command like
"ipfw pipe|queue x delete". XXX sched?
ipfw pass to the kernel a struct gen that contains the type and the number
of the object to remove
Delete of pipe x
----------------
A pipe can be deleted by the user through the command 'ipfw pipe x delete'.
Bring in the most recent version of ipfw and dummynet, developed and tested over the past two months in the ipfw3-head branch. This also happens to be the same code available in the Linux and Windows ports of ipfw and dummynet. The major enhancement is a completely restructured version of dummynet, with support for different packet scheduling algorithms (loadable at runtime), faster queue/pipe lookup, and a much cleaner internal architecture and kernel/userland ABI which simplifies future extensions. In addition to the existing schedulers (FIFO and WF2Q+), we include a Deficit Round Robin (DRR or RR for brevity) scheduler, and a new, very fast version of WF2Q+ called QFQ. Some test code is also present (in sys/netinet/ipfw/test) that lets you build and test schedulers in userland. Also, we have added a compatibility layer that understands requests from the RELENG_7 and RELENG_8 versions of the /sbin/ipfw binaries, and replies correctly (at least, it does its best; sometimes you just cannot tell who sent the request and how to answer). The compatibility layer should make it possible to MFC this code in a relatively short time. Some minor glitches (e.g. handling of ipfw set enable/disable, and a workaround for a bug in RELENG_7's /sbin/ipfw) will be fixed with separate commits. CREDITS: This work has been partly supported by the ONELAB2 project, and mostly developed by Riccardo Panicucci and myself. The code for the qfq scheduler is mostly from Fabio Checconi, and Marta Carbone and Francesco Magno have helped with testing, debugging and some bug fixes.
2010-03-02 17:40:48 +00:00
To delete a pipe, the pipe is removed from the pipe list, and then deleted.
Also the scheduler associated with this pipe should be deleted.
For compatibility with old dummynet syntax, the associated FIFO scheduler and
FIFO flowset must be deleted.
Delete of flowset x
-------------------
To remove a flowset, we must be sure that is no longer referenced by any object.
Bring in the most recent version of ipfw and dummynet, developed and tested over the past two months in the ipfw3-head branch. This also happens to be the same code available in the Linux and Windows ports of ipfw and dummynet. The major enhancement is a completely restructured version of dummynet, with support for different packet scheduling algorithms (loadable at runtime), faster queue/pipe lookup, and a much cleaner internal architecture and kernel/userland ABI which simplifies future extensions. In addition to the existing schedulers (FIFO and WF2Q+), we include a Deficit Round Robin (DRR or RR for brevity) scheduler, and a new, very fast version of WF2Q+ called QFQ. Some test code is also present (in sys/netinet/ipfw/test) that lets you build and test schedulers in userland. Also, we have added a compatibility layer that understands requests from the RELENG_7 and RELENG_8 versions of the /sbin/ipfw binaries, and replies correctly (at least, it does its best; sometimes you just cannot tell who sent the request and how to answer). The compatibility layer should make it possible to MFC this code in a relatively short time. Some minor glitches (e.g. handling of ipfw set enable/disable, and a workaround for a bug in RELENG_7's /sbin/ipfw) will be fixed with separate commits. CREDITS: This work has been partly supported by the ONELAB2 project, and mostly developed by Riccardo Panicucci and myself. The code for the qfq scheduler is mostly from Fabio Checconi, and Marta Carbone and Francesco Magno have helped with testing, debugging and some bug fixes.
2010-03-02 17:40:48 +00:00
If the flowset to remove is in the unlinked flowset list, there is not any
issue, the flowset can be safely removed calling a free() (the flowset
extension is not yet created if the flowset is in this list).
If the flowset is in the flowset list, first we remove from it so new packet
are discarded when arrive. Next, the flowset is marked as delete.
Now we must check if some queue is using this flowset.
To do this, a counter (active_f) is provided. This counter indicate how many
queues exist using this flowset.
The active_f counter is automatically incremented when a queue is created
and decremented when a queue is deleted.
If the counter is 0, the flowset can be safely deleted, and the delete_alg_fs()
scheduler function is called before deallocate memory.
If the counter is not 0, the flowset remain in memory until the counter become
zero. When a queue is delete (by dn_delete_queue() function) it is checked if
the linked flowset is deleting and if so the counter is decrementing. If the
counter reaches 0, the flowset is deleted.
The deletion of a queue can be done only by the scheduler, or when the scheduler
is destroyed.
Delete of scheduler x
---------------------
To delete a scheduler we must be sure that any scheduler instance of this type
are in the system_heap. To do so, a counter (inst_counter) is provided.
This counter is managed by the system: it is incremented every time it is
inserted in the system_heap, and decremented every time it is extracted from it.
To delete the scheduler, first we remove it from the scheduler list, so new
packet are discarded when they arrive, and mark the scheduler as deleting.
If the counter is 0, we can remove the scheduler safely calling the
really_deletescheduler() function. This function will scan all scheduler
instances and call the delete_scheduler_instance() function that will delete
the instance. When all instance are deleted, the scheduler template is
deleted calling the delete_scheduler_template(). If the delay line associate
with the scheduler is empty, it is deleted now, else it will be deleted when
it will became empy.
If the counter was not 0, we wait for it. Every time the dummynet_task()
function extract a scheduler from the system_heap, the counter is decremented.
If the scheduler has the delete flag enabled the dequeue() is not called and
delete_scheduler_instance() is called to delete the instance.
Obviously this scheduler instance is no longer inserted in the system_heap.
Bring in the most recent version of ipfw and dummynet, developed and tested over the past two months in the ipfw3-head branch. This also happens to be the same code available in the Linux and Windows ports of ipfw and dummynet. The major enhancement is a completely restructured version of dummynet, with support for different packet scheduling algorithms (loadable at runtime), faster queue/pipe lookup, and a much cleaner internal architecture and kernel/userland ABI which simplifies future extensions. In addition to the existing schedulers (FIFO and WF2Q+), we include a Deficit Round Robin (DRR or RR for brevity) scheduler, and a new, very fast version of WF2Q+ called QFQ. Some test code is also present (in sys/netinet/ipfw/test) that lets you build and test schedulers in userland. Also, we have added a compatibility layer that understands requests from the RELENG_7 and RELENG_8 versions of the /sbin/ipfw binaries, and replies correctly (at least, it does its best; sometimes you just cannot tell who sent the request and how to answer). The compatibility layer should make it possible to MFC this code in a relatively short time. Some minor glitches (e.g. handling of ipfw set enable/disable, and a workaround for a bug in RELENG_7's /sbin/ipfw) will be fixed with separate commits. CREDITS: This work has been partly supported by the ONELAB2 project, and mostly developed by Riccardo Panicucci and myself. The code for the qfq scheduler is mostly from Fabio Checconi, and Marta Carbone and Francesco Magno have helped with testing, debugging and some bug fixes.
2010-03-02 17:40:48 +00:00
If the counter reaches 0, the delete_scheduler_template() function is called
all memory is released.
NOTE: Flowsets that belong to this scheduler are not deleted, so if a new
scheduler with the same number is inserted will use these flowsets.
To do so, the best approach would be insert these flowset in the
unlinked flowset list, but doing this now will be very expensive.
So flowsets will remain in memory and linked with a scheduler that no
longer exists until a packet belonging to this flowset arrives. When
this packet arrives, the reconfigure() function is called because the
generation number mismatch with one contains in the flowset and so
the flowset will be moved into the flowset unlinked list, or will be
linked with the new scheduler if a new one was created.
COMPATIBILITY WITH FREEBSD 7.2 AND FREEBSD 8 'IPFW' BINARY
==========================================================
Dummynet is not compatible with old ipfw binary because internal structs are
changed. Moreover, the old ipfw binary is not compatible with new kernels
because the struct that represents a firewall rule has changed. So, if a user
install a new kernel on a FreeBSD 7.2, the ipfw (and possibly many other
commands) will not work.
New dummynet uses a new socket option: IP_DUMMYNET3, used for both set and get.
The old option can be used to allow compatibility with the 'ipfw' binary of
older version (tested with 7.2 and 8.0) of FreeBSD.
Two file are provided for this purpose:
- ip_dummynet_glue.c translates old dummynet requests to the new ones,
- ip_fw_glue.c converts the rule format between 7.2 and 8 versions.
Let see in detail these two files.
IP_DUMMYNET_GLUE.C
------------------
The internal structs of new dummynet are very different from the original.
Because of there are some difference from between dummynet in FreeBSD 7.2 and
dummynet in FreeBSD 8 (the FreeBSD 8 version includes support to pipe delay
profile and burst option), I have to include both header files. I copied
the revision 191715 (for version 7.2) and the revision 196045 (for version 8)
and I appended a number to each struct to mark them.
The main function of this file is ip_dummynet_compat() that is called by
ip_dn_ctl() when it receive a request of old socket option.
A global variabile ('is7') store the version of 'ipfw' that FreeBSD is using.
This variable is set every time a request of configuration is done, because
with this request we receive a buffer of which size depending of ipfw version.
Because of in general the first action is a configuration, this variable is
usually set accordly. If the first action is a request of listing of pipes
or queues, the system cannot know the version of ipfw, and we suppose that
version 7.2 is used. If version is wrong, the output can be senseless, but
the application should not crash.
There are four request for old dummynet:
- IP_DUMMYNET_FLUSH: the flush options have no parameter, so simply the
dummynet_flush() function is called;
- IP_DUMMYNET_DEL: the delete option need to be translate.
It is only necessary to extract the number and the type of the object
(pipe or queue) to delete from the buffer received and build a new struct
gen contains the right parameters, then call the delete_object() function;
- IP_DUMMYNET_CONFIGURE: the configure command receive a buffer depending of
the ipfw version. After the properly extraction of all data, that depends
by the ipfw version used, new structures are filled and then the dummynet
config_link() function is properly called. Note that the 7.2 version does
not support some parameter as burst or delay profile.
- IP_DUMMYNET_GET: The get command should send to the ipfw the correct buffer
depending of its version. There are two function that build the
corrected buffer, ip_dummynet_get7() and ip_dummynet_get8(). These
functions reproduce the buffer exactly as 'ipfw' expect. The only difference
is that the weight parameter for a queue is no longer sent by dummynet and so
Bring in the most recent version of ipfw and dummynet, developed and tested over the past two months in the ipfw3-head branch. This also happens to be the same code available in the Linux and Windows ports of ipfw and dummynet. The major enhancement is a completely restructured version of dummynet, with support for different packet scheduling algorithms (loadable at runtime), faster queue/pipe lookup, and a much cleaner internal architecture and kernel/userland ABI which simplifies future extensions. In addition to the existing schedulers (FIFO and WF2Q+), we include a Deficit Round Robin (DRR or RR for brevity) scheduler, and a new, very fast version of WF2Q+ called QFQ. Some test code is also present (in sys/netinet/ipfw/test) that lets you build and test schedulers in userland. Also, we have added a compatibility layer that understands requests from the RELENG_7 and RELENG_8 versions of the /sbin/ipfw binaries, and replies correctly (at least, it does its best; sometimes you just cannot tell who sent the request and how to answer). The compatibility layer should make it possible to MFC this code in a relatively short time. Some minor glitches (e.g. handling of ipfw set enable/disable, and a workaround for a bug in RELENG_7's /sbin/ipfw) will be fixed with separate commits. CREDITS: This work has been partly supported by the ONELAB2 project, and mostly developed by Riccardo Panicucci and myself. The code for the qfq scheduler is mostly from Fabio Checconi, and Marta Carbone and Francesco Magno have helped with testing, debugging and some bug fixes.
2010-03-02 17:40:48 +00:00
it is set to 0.
Moreover, because of the internal structure has changed, the bucket size
of a queue could not be correct, because now all flowset share the hash
table.
If the version of ipfw is wrong, the output could be senseless or truncated,
but the application should not crash.
IP_FW_GLUE.C
------------
The ipfw binary also is used to add rules to FreeBSD firewall. Because of the
struct ip_fw is changed from FreeBsd 7.2 to FreeBSD 8, it is necessary
to write some glue code to allow use ipfw from FreeBSD 7.2 with the kernel
provided with FreeBSD 8.
This file contains two functions to convert a rule from FreeBSD 7.2 format to
FreeBSD 8 format, and viceversa.
The conversion should be done when a rule passes from userspace to kernel space
and viceversa.
I have to modify the ip_fw2.c file to manage these two case, and added a
variable (is7) to store the ipfw version used, using an approach like the
previous file:
- when a new rule is added (option IP_FW_ADD) the is7 variable is set if the
size of the rule received correspond to FreeBSD 7.2 ipfw version. If so, the
Bring in the most recent version of ipfw and dummynet, developed and tested over the past two months in the ipfw3-head branch. This also happens to be the same code available in the Linux and Windows ports of ipfw and dummynet. The major enhancement is a completely restructured version of dummynet, with support for different packet scheduling algorithms (loadable at runtime), faster queue/pipe lookup, and a much cleaner internal architecture and kernel/userland ABI which simplifies future extensions. In addition to the existing schedulers (FIFO and WF2Q+), we include a Deficit Round Robin (DRR or RR for brevity) scheduler, and a new, very fast version of WF2Q+ called QFQ. Some test code is also present (in sys/netinet/ipfw/test) that lets you build and test schedulers in userland. Also, we have added a compatibility layer that understands requests from the RELENG_7 and RELENG_8 versions of the /sbin/ipfw binaries, and replies correctly (at least, it does its best; sometimes you just cannot tell who sent the request and how to answer). The compatibility layer should make it possible to MFC this code in a relatively short time. Some minor glitches (e.g. handling of ipfw set enable/disable, and a workaround for a bug in RELENG_7's /sbin/ipfw) will be fixed with separate commits. CREDITS: This work has been partly supported by the ONELAB2 project, and mostly developed by Riccardo Panicucci and myself. The code for the qfq scheduler is mostly from Fabio Checconi, and Marta Carbone and Francesco Magno have helped with testing, debugging and some bug fixes.
2010-03-02 17:40:48 +00:00
rule is converted to version 8 calling the function convert_rule_to_8().
Moreover, after the insertion of the rule, the rule is now reconverted to
version 7 because the ipfw binary will print it.
- when the user request a list of rules (option IP_FW_GET) the is7 variable
should be set correctly because we suppose that a configure command was done,
else we suppose that the FreeBSD version is 8. The function ipfw_getrules()
in ip_fw2.c file return all rules, eventually converted to version 7 (if
the is7 is set) to the ipfw binary.
The conversion of a rule is quite simple. The only difference between the
two structures (struct ip_fw) is that in the new there is a new field
(uint32_t id). So, I copy the entire rule in a buffer and the copy the rule in
the right position in the new (or old) struct. The size of commands are not
changed, and the copy is done into a cicle.
How to configure dummynet
=========================
It is possible to configure dummynet through two main commands:
'ipfw pipe' and 'ipfw queue'.
To allow compatibility with old version, it is possible configure dummynet
using the old command syntax. Doing so, obviously, it is only possible to
configure a FIFO scheduler or a wf2q+ scheduler.
A new command, 'ipfw pipe x config sched <type>' is supported to add a new
scheduler to the system.
- ipfw pipe x config ...
create a new pipe with the link parameters
create a new scheduler fifo (x + offset)
create a new flowset fifo (x + offset)
the mask is eventually stored in the FIFO scheduler
- ipfw queue y config pipe x ...
create a new flowset y linked to sched x.
The type of flowset depends by the specified scheduler.
If the scheduler does not exist, this flowset is inserted in a special
list and will be not active.
If pipe x exists and sched does not exist, a new wf2q+ scheduler is
created and the flowset will be linked to this new scheduler (this is
done for compatibility with old syntax).
- ipfw pipe x config sched <type> ...
create a new scheduler x of type <type>.
Search into the flowset unlinked list if there are some flowset that
should be linked with this new scheduler.
- ipfw pipe x delete
delete the pipe x
delete the scheduler fifo (x + offset)
delete the scheduler x
delete the flowset fifo (x + offset)
- ipfw queue x delete
delete the flowset x
- ipfw sched x delete ///XXX
delete the scheduler x
Follow now some examples to how configure dummynet:
- Ex1:
ipfw pipe 10 config bw 1M delay 15 // create a pipe with band and delay
A FIFO flowset and scheduler is
also created
ipfw queue 5 config pipe 10 weight 56 // create a flowset. This flowset
will be of wf2q+ because a pipe 10
exists. Moreover, the wf2q+
scheduler is created now.
- Ex2:
ipfw queue 5 config pipe 10 weight 56 // Create a flowset. Scheduler 10
does not exist, so this flowset
is inserted in the unlinked
flowset list.
ipfw pipe 10 config bw... // Create a pipe, a FIFO flowset and scheduler.
Because of a flowset with 'pipe 10' exists,
a wf2q+ scheduler is created now and that
flowset is linked with this sceduler.
- Ex3:
ipfw pipe 10 config bw... // Create a pipe, a FIFO flowset and scheduler.
ipfw pipe 10 config sched rr // Create a scheduler of type RR, linked to
pipe 10
ipfw queue 5 config pipe 10 weight 56 // Create a flowset 5. This flowset
will belong to scheduler 10 and
it is of type RR
- Ex4:
ipfw pipe 10 config sched rr // Create a scheduler of type RR, linked to
pipe 10 (not exist yet)
ipfw pipe 10 config bw... // Create a pipe, a FIFO flowset and scheduler.
ipfw queue 5 config pipe 10 weight 56 // Create a flowset 5.This flowset
will belong to scheduler 10 and
it is of type RR
ipfw pipe 10 config sched wf2q+ // Modify the type of scheduler 10. It
becomes a wf2q+ scheduler.
When a new packet of flowset 5 arrives,
the flowset 5 becomes to wf2q+ type.
How to implement a new scheduler
================================
In dummynet, a scheduler algorithm is represented by two main structs, some
functions and other minor structs.
- A struct dn_sch_xyz (where xyz is the 'type' of scheduler algorithm
implemented) contains data relative to scheduler, as global parameter that
are common to all instances of the scheduler
- A struct dn_sch_inst_xyz contains data relative to a single scheduler
instance, as local status variable depending for example by flows that
are linked with the scheduler, and so on.
To add a scheduler to dummynet, the user should type a command like:
'ipfw pipe x config sched <type> [mask ... ...]'
This command creates a new struct dn_sch_xyz of type <type>, and
store the optional parameter in that struct.
The parameter mask determines how many scheduler instance of this
scheduler may exist. For example, it is possible to divide traffic
depending on the source port (or destination, or ip address...),
so that every scheduler instance act as an independent scheduler.
If the mask is not set, all traffic goes to the same instance.
When a packet arrives to a scheduler, the system search the corrected
scheduler instance, and if it does not exist it is created now (the
struct dn_sch_inst_xyz is allocated by the system, and the scheduler
fills the field correctly). It is a task of the scheduler to create
the struct that contains all queues for a scheduler instance.
Dummynet provides some function to create an hash table to store
queues, but the schedule algorithm can choice the own struct.
To link a flow to a scheduler, the user should type a command like:
'ipfw queue z config pipe x [mask... ...]'
This command creates a new 'dn_fs' struct that will be inserted
in the system. If the scheduler x exists, this flowset will be
linked to that scheduler and the flowset type become the same as
the scheduler type. At this point, the function create_alg_fs_xyz()
is called to allow store eventually parameter for the flowset that
depend by scheduler (for example the 'weight' parameter for a wf2q+
scheduler, or some priority...). A parameter mask can be used for
a flowset. If the mask parameter is set, the scheduler instance can
separate packet according to its flow id (src and dst ip, ports...)
and assign it to a separate queue. This is done by the scheduler,
so it can ignore the mask if it wants.
See now the two main structs:
struct dn_sch_xyz {
struct gen g; /* important the name g */
/* global params */
};
struct dn_sch_inst_xyz {
struct gen g; /* important the name g */
/* params of the instance */
};
It is important to embed the struct gen as first parameter. The struct gen
contains some values that the scheduler instance must fill (the 'type' of
scheduler, the 'len' of the struct...)
The function create_scheduler_xyz() should be implemented to initialize global
parameters in the first struct, and if memory allocation is done it is
mandatory to implement the delete_scheduler_template() function to free that
memory.
The function create_scheduler_instance_xyz() must be implemented even if the
scheduler instance does not use extra parameters. In this function the struct
gen fields must be filled with corrected infos. The
delete_scheduler_instance_xyz() function must bu implemented if the instance
has allocated some memory in the previous function.
To store data belonging to a flowset the follow struct is used:
struct alg_fs_xyz {
struct gen g;
/* fill correctly the gen struct
g.subtype = DN_XYZ;
g.len = sizeof(struct alg_fs_xyz)
...
*/
/* params for the flow */
};
The create_alg_fs_xyz() function is mandatory, because it must fill the struct
gen, but the delete_alg_fs_xyz() is mandatory only if the previous function
has allocated some memory.
A struct dn_queue contains packets belonging to a queue and some statistical
data. The scheduler could have to store data in this struct, so it must define
a dn_queue_xyz struct:
struct dn_queue_xyz {
struct dn_queue q;
/* parameter for a queue */
}
All structures are allocated by the system. To do so, the scheduler must
set the size of its structs in the scheduler descriptor:
scheduler_size: sizeof(dn_sch_xyz)
scheduler_i_size: sizeof(dn_sch_inst_xyz)
flowset_size: sizeof(alg_fs_xyz)
queue_size: sizeof(dn_queue_xyz);
The scheduler_size could be 0, but other struct must have at least a struct gen.
After the definition of structs, it is necessary to implement the
scheduler functions.
- int (*config_scheduler)(char *command, void *sch, int reconfigure);
Configure a scheduler, or reconfigure if 'reconfigure' == 1.
This function performs additional allocation and initialization of global
parameter for this scheduler.
If memory is allocated here, the delete_scheduler_template() function
should be implemented to remove this memory.
- int (*delete_scheduler_template)(void* sch);
Delete a scheduler template. This function is mandatory if the scheduler
uses extra data respect the struct dn_sch.
- int (*create_scheduler_instance)(void *s);
Create a new scheduler instance. The system allocate the necessary memory
and the schedulet can access it using the 's' pointer.
The scheduler instance stores all queues, and to do this can use the
hash table provided by the system.
- int (*delete_scheduler_instance)(void *s);
Delete a scheduler instance. It is important to free memory allocated
by create_scheduler_instance() function. The memory allocated by system
is freed by the system itself. The struct contains all queue also has
to be deleted.
- int (*enqueue)(void *s, struct gen *f, struct mbuf *m,
struct ipfw_flow_id *id);
Called when a packet arrives. The packet 'm' belongs to the scheduler
instance 's', has a flowset 'f' and the flowid 'id' has already been
masked. The enqueue() must call dn_queue_packet(q, m) function to really
enqueue packet in the queue q. The queue 'q' is chosen by the scheduler
and if it does not exist should be created calling the dn_create_queue()
function. If the schedule want to drop the packet, it must call the
dn_drop_packet() function and then return 1.
- struct mbuf * (*dequeue)(void *s);
Called when the timer expires (or when a packet arrives and the scheduler
instance is idle).
This function is called when at least a packet can be send out. The
scheduler choices the packet and returns it; if no packet are in the
schedulerinstance, the function must return NULL.
Before return a packet, it is important to call the function
dn_return_packet() to update some statistic of the queue and update the
queue counters.
- int (*drain_queue)(void *s, int flag);
The system request to scheduler to delete all queues that is not using
to free memory. The flag parameter indicate if a queue must be deleted
even if it is active.
- int (*create_alg_fs)(char *command, struct gen *g, int reconfigure);
It is called when a flowset is linked with a scheduler. This is done
when the scheduler is defined, so we can know the type of flowset.
The function initialize the flowset paramenter parsing the command
line. The parameter will be stored in the g struct that have the right
size allocated by the system. If the reconfigure flag is set, it means
that the flowset is reconfiguring
- int (*delete_alg_fs)(struct gen *f);
It is called when a flowset is deleting. Must remove the memory allocate
by the create_alg_fs() function.
- int (*create_queue_alg)(struct dn_queue *q, struct gen *f);
Called when a queue is created. The function should link the queue
to the struct used by the scheduler instance to store all queues.
- int (*delete_queue_alg)(struct dn_queue *q);
Called when a queue is deleting. The function should remove extra data
and update the struct contains all queues in the scheduler instance.
The struct scheduler represent the scheduler descriptor that is passed to
dummynet when a scheduler module is loaded.
2012-08-07 07:52:25 +00:00
This struct contains the type of scheduler, the length of all structs and
Bring in the most recent version of ipfw and dummynet, developed and tested over the past two months in the ipfw3-head branch. This also happens to be the same code available in the Linux and Windows ports of ipfw and dummynet. The major enhancement is a completely restructured version of dummynet, with support for different packet scheduling algorithms (loadable at runtime), faster queue/pipe lookup, and a much cleaner internal architecture and kernel/userland ABI which simplifies future extensions. In addition to the existing schedulers (FIFO and WF2Q+), we include a Deficit Round Robin (DRR or RR for brevity) scheduler, and a new, very fast version of WF2Q+ called QFQ. Some test code is also present (in sys/netinet/ipfw/test) that lets you build and test schedulers in userland. Also, we have added a compatibility layer that understands requests from the RELENG_7 and RELENG_8 versions of the /sbin/ipfw binaries, and replies correctly (at least, it does its best; sometimes you just cannot tell who sent the request and how to answer). The compatibility layer should make it possible to MFC this code in a relatively short time. Some minor glitches (e.g. handling of ipfw set enable/disable, and a workaround for a bug in RELENG_7's /sbin/ipfw) will be fixed with separate commits. CREDITS: This work has been partly supported by the ONELAB2 project, and mostly developed by Riccardo Panicucci and myself. The code for the qfq scheduler is mostly from Fabio Checconi, and Marta Carbone and Francesco Magno have helped with testing, debugging and some bug fixes.
2010-03-02 17:40:48 +00:00
all function pointers.
If a function is not implemented should be initialize to NULL. Some functions
are mandatory, other are mandatory if some memory should be freed.
Mandatory functions:
- create_scheduler_instance()
- enqueue()
- dequeue()
- create_alg_fs()
- drain_queue()
Optional functions:
- config_scheduler()
- create_queue_alg()
Mandatory functions if the corresponding create...() has allocated memory:
- delete_scheduler_template()
- delete_scheduler_instance()
- delete_alg_fs()
- delete_queue_alg()