User's Guide to PSO Stack Software
Revision: 1.1

1. Introduction

This document describes the PSO Stack software. Users are encouraged to read other books on TCP/IP and BSD Unix kernel software, such as those by Richard Stevens, Kirk McKusick, and others. The PSO Stack is a port of a 4.4 BSD Lite based operating system (NetBSD), which is extensively documented in various books available in technical bookstores. In particular, Richard Stevens' TCP/IP books give a very detailed source code level description of the BSD TCP/IP stack software. For more general BSD Unix kernel level issues, McKusick et al. have written a definitive book on the subject.

This document discusses only issues that are specific to the PSO Stack port itself. Information that is general or generic in nature, such as the TCP/IP architecture, the implementation details of the BSD TCP/IP source code, or the BSD Unix kernel architecture, is not covered. The user of PSO Stack is expected to have general knowledge of TCP/IP and the BSD Unix kernel architecture.

In the following sections, references are made to a specific realtime OS, VxWorks(TM), produced by Wind River Systems of Alameda, CA, USA. While the first port of PSO Stack has been targeted at this specific RTOS, the way the port is done allows easy portability to other RTOS architectures. All RTOS specific features are encapsulated so that the stack is independent of the underlying RTOS primitives and APIs. For this reason, any reference to VxWorks can be read as a reference to the generic term RTOS. Likewise, the VxWorks term task can be read as the generic term thread.

1.1 Overview

The PSO Stack consists of two parts. The first part is a relatively straightforward port of the C source code modules that implement TCP/IP, the socket API, and other I/O primitives. These replace the original (VxWorks) TCP/IP modules and provide similar functionality.
The second part consists of original C source code modules that emulate enough functionality to glue the first part onto an existing RTOS. These include not only API related interfaces but also operational entities required for runtime support, such as the threads that provide context for network device driver I/O to the rest of the PSO Stack and the RTOS. Other support functions, such as replacement memory allocation routines, are included to improve the behavior of the PSO Stack software.

The software was originally taken from the NetBSD distribution (1.3.3) and compiled using the VxWorks/Tornado cross compilers. Cross compilation with these GCC based tools yields libraries that can be linked with target code (the VxWorks BSP code and the original target library shipped with VxWorks). Some of the routines in the original VxWorks library need to be replaced with PSO Stack routines. This is accomplished by making a copy of the original library, deleting the existing routines from it, and linking the final BSP code against both the VxWorks library and the new PSO Stack library.

Most of the kernel files are from NetBSD 1.3.3 and a later release snapshot (Feb, 1999). The library functions under the lib directory are from the NetBSD 1.3.2 CD-ROM. NetBSD 1.3.X is based partly on BSD 4.4, so all of the networking files are derived from 4.4 BSD software.

The PSO Stack libraries have been developed for a variety of host and target environments, but not all combinations have been tested due to lack of available resources. The following combinations are initially supported.

Targets: Sparc/Solaris 2.X VxSim, SA110 (StrongARM) EBSA285, PC (pc486)
Hosts: Sparc/Solaris 2.7, PC/Windows 95

1.2 Scope of the software

The PSO Stack contents reflect a subset of the BSD Unix kernel software. The directory structure is similar to that of the BSD Unix kernel source tree.
The directories kern, arch, dev, include, net, netinet, opt, sys, and vm are all identical to the BSD kernel top level directories. Only the necessary portions of these directories are ported. In addition, new directories specific to PSO Stack are added: pso, lib, and vxworks.

The 'lib' directory is derived from the BSD user level library software, together with a separate library that provides better memory allocation and freeing, which comes from Doug Lea's public domain distribution of the malloc library. The 'pso' directory contains PSO Stack specific implementations of the functions that glue the BSD software to the RTOS being used. The 'vxworks' directory contains the corresponding VxWorks specific parts of the software that are used by the 'pso' directory and the main BSD software. The software under 'vxworks' provides all the glue needed for the VxWorks port of PSO Stack. PSO Stack can be ported to a new RTOS, for example pSOS, by creating a new directory for it that interfaces to the portable 'pso' interface.

Functionally, this collection of software provides a complete replacement for the existing VxWorks TCP/IP, socket API, device driver interface for network devices, malloc, and free. Its purpose is to give users a solid TCP/IP software product, with complete source code, that works with an existing RTOS environment.

2. Changes made to original 4.4 BSD software

Notable changes made for VxWorks are surrounded by 'ifndef ORIG'. Sometimes 'ifdef ORIG' is used instead. ORIG means original BSD code, so in our environment ORIG is not defined. Everything that depends on ORIG not being defined is code newly added by PSO Systems Inc. If you grep for ORIG, you will see the changes from the original BSD code.
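
The ORIG convention can be illustrated with a small sketch. The structure, its fields, and the function name below are made-up placeholders for illustration only; the only thing taken from the text is that ORIG is left undefined, so the 'ifndef ORIG' parts are the ones that get compiled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical structure, not the real BSD or PSO Stack definition. */
struct demo_socket {
    int so_state;           /* present in both variants */
#ifndef ORIG
    void *so_sem;           /* placeholder for a field added by the port */
#endif
};

/* Initializes the structure; 'sem' is only stored in the ported variant. */
void demo_socket_init(struct demo_socket *so, void *sem)
{
    assert(so != NULL);
    so->so_state = 0;
#ifndef ORIG
    so->so_sem = sem;       /* compiled because ORIG is not defined */
#else
    (void)sem;              /* original BSD path has no such field */
#endif
}
```

Compiling the same file with -DORIG would select the other branches, which is why a search for ORIG finds every place where the port departs from the BSD code.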

2.1 Socket API related issues

Unix system calls, such as the socket related routines, depend on architecture specific system call interfaces that rely on "trap" based entry into the kernel, so some modifications were made to accommodate this transition. Although some RTOS system calls are implemented similarly via "trap" interfaces, VxWorks and others (e.g. Nucleus) do not have the concept of system calls that "trap" into kernel mode. These operating systems run in a single address space and in a single privilege mode (protected mode). All function calls are simply that: function calls. Even system calls are therefore just calls to subroutines.

The original system call interface relies on the fact that the user level library calls that implement the socket API (for example) "trap" into the kernel to make a system call. The kernel resident implementation of this "trap" handling executes the function, obtaining the user supplied arguments through a pointer, and returns the value. This interface had to be rewritten and simplified. Most of this is done in sys_socket.c and uipc_syscalls.c. The use of syscallarg() and the argument pointer 'uap' is specific to the Unix kernel. The VxWorks version does away with these conventions and expects direct argument passing via the normal C function call interface.

The function falloc() has been rewritten so that it interfaces with the RTOS to obtain a new object that can be used to map a file descriptor. Correspondingly, getsock() has been modified to return the socket pointer for a given file descriptor. The function fd_get_value() returns the value stored in the file descriptor. A file descriptor for a socket, for example, has an associated socket pointer stored in it. This is specific to sockets. The RTOS may implement other file descriptors that are not related to sockets, and those file descriptors will use the value differently.
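
The descriptor-to-socket mapping and the direct argument passing described above might look roughly like the following sketch. All demo_ names, the fixed-size table, and the signatures are assumptions made for illustration; the real falloc(), getsock(), and fd_get_value() prototypes are not given in this document.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_FD 8

struct demo_socket { int so_type; };

/* One opaque value slot per descriptor; NULL marks a free slot. */
static void *demo_fd_table[DEMO_MAX_FD];

/* Allocate a descriptor and store an opaque value in it (falloc-like). */
int demo_fd_alloc(void *value)
{
    int fd;

    assert(value != NULL);
    for (fd = 0; fd < DEMO_MAX_FD; fd++) {
        if (demo_fd_table[fd] == NULL) {
            demo_fd_table[fd] = value;
            return fd;
        }
    }
    return -1;                      /* table full */
}

/* Return the value stored in a descriptor (fd_get_value-like). */
void *demo_fd_get_value(int fd)
{
    if (fd < 0 || fd >= DEMO_MAX_FD)
        return NULL;
    return demo_fd_table[fd];
}

/* Map a descriptor back to its socket (getsock-like). */
struct demo_socket *demo_getsock(int fd)
{
    return (struct demo_socket *)demo_fd_get_value(fd);
}

/* Arguments arrive as ordinary C parameters, with no 'uap' pointer. */
int demo_getsocktype(int fd, int *type_out)
{
    struct demo_socket *so = demo_getsock(fd);

    if (so == NULL || type_out == NULL)
        return -1;
    *type_out = so->so_type;
    return 0;
}
```

Because the stored value is an opaque pointer, the same table could hold non-socket objects whose owners interpret the value in their own way.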

In VxWorks, iosFdSet() and iosFdValue() are used to set (store) and get the socket pointer value for a given 'ios' file descriptor.

The socket API is functionally equivalent except in signal behavior. The original VxWorks socket API does not support the asynchronous I/O signal SIGIO because it is difficult to implement in VxWorks, and the PSO Stack socket API has the same limitation. In multi-threaded environments the use of SIGIO is unnecessary, since most notification is done via semaphores and thread switching is fast enough to dedicate a thread to each I/O channel.

The 'struct socket' (the internal representation of a socket, as used by the kernel software) contains additions that are necessary for this port. These are defined in socketvar.h. The changes are mostly due to the channels employed by the timeout/wait mechanism in the kernel. The BSD Unix sleep/wakeup mechanism is converted to use semaphores in VxWorks, and corresponding changes had to be made to the data structures to hold semaphore values. This change is further propagated to the arguments of the tsleep call.

There are a few PSO specific additions to the socket API. One extends support for facilities that are either signal based (SIGIO, for example) or nonexistent in the original code. The original VxWorks socket API does not support asynchronous SIGIO delivery when an I/O event is available for a socket. This omission is partly due to the way the VxWorks signal mechanism is implemented, and partly due to the way the VxWorks designers intended the system to be used. However, it is occasionally necessary to employ completely asynchronous I/O over sockets, especially in embedded realtime environments. Thus PSO has added an extension to the socket API that allows users to register callback routines, which the protocol and socket layer software trigger when needed.
These occasions are mostly equivalent to the places where the original Unix software intended signal delivery (SIGIO or SIGPIPE). Users indicate which types of events to monitor via a flag passed during registration. Based on this flag, the stack software calls the user's registered callback routines.

Another addition, similar to VxWorks' own addition to the original BSD socket API, concerns connection establishment with a timeout. In the original BSD API, the connect() call does not take a timeout. Since TCP can take a long time to connect, and even longer for a connection attempt to fail and the connect() call to return to the user, a timeout mechanism for connection establishment is necessary in realtime environments. Simply hanging on a connection attempt for a long, non-deterministic amount of time is not compatible with such environments. For this reason a new API call that takes an additional timeout argument has been created. This call operates on a socket by first marking it as non-blocking and then performing select() with a timeout until the connection attempt is deemed either successful or failed.

Beyond these minor changes, the socket layer API is compatible with the original BSD software, and porting code from a BSD Unix based implementation should be very easy.

2.2 TCP and IP protocols

Most of the TCP and IP software is preserved in its original condition. However, a few things in the TCP stack code had to be modified to accommodate multi-threaded runtime environments. In a preemptive multi-threaded environment like VxWorks, some of the code that works in the Unix kernel (BSD Unix) does not work. In tcp_input.c, for example, there is a place where the state transition to the disconnected state happens after other modules have been notified of the disconnection.
This does not work in multi-threaded environments because as soon as the notification is sent via semaphore, the thread pending on that condition can run immediately. The state transition has to be marked before that happens. Otherwise the state is one step behind the rest of the code, which might start running in a different thread context, and malfunction of the stack code can (and does) happen. The specific change in tcp_input.c marks the state of the TCP connection as TIME_WAIT before telling others that it is disconnected. The newer 4.4 BSD code already has many fixes that address similar problems which existed in the 4.3 release. Namely, the bugs caused by code that wakes up readers before appending data, which would show up in multi-threaded environments, have already been fixed in the original 4.4 distribution.

The IP level code is the same as the original, except that ip_id (the ID for each IP datagram) is generated from the timer tick via timer_tick_get(), which is specific to the PSO Stack environment.

2.3 mbuf and memory resource issues

The original VxWorks memory allocation and free routines have bad fragmentation behavior that can leave the system memory pool severely fragmented after a lot of allocation and freeing. To address this problem a better memory allocator has been ported. It is located in lib/libc/stdlib/malloc.c. This version of malloc contains two different algorithms. One is the BSD based Kingsley "bucket" allocator, which has some unique fragmentation behavior. The other is Doug Lea's well tested allocator, which tries to minimize fragmentation while keeping speed and space requirements in check. USE_BSD and USE_DL are the ifdefs used to enable one or the other. USE_DL is turned on by default since it seems to behave better in heavily networked environments.

The mbuf code has been left largely untouched. The NetBSD implementation of mbuf has a lot of enhancements over the original 4.4 BSD.
It already includes all of the enhancements that are usually applied to accommodate a single address space memory system like that of VxWorks. In the past, porting BSD mbuf code to VxWorks required a lot of changes, but the NetBSD code already addresses the issues in the way cluster mbufs are handled. For example, MFREE() already knows how to detect the freeing of cluster mbufs (marked M_EXT), and the mbuf data structure already has proper place holders for the cluster related 'free routine' callback that is placed there by the allocator. This kind of functionality usually had to be added to the existing source base when porting mbuf code from Unix to VxWorks. The reason is that Unix code tended to assume a VM architecture where cluster mbufs were allocated out of pages of memory that could be copied easily by "flipping" page reference bits. Different Unix systems implemented this differently, and often cluster mbuf support did not allow allocation of clusters of different sizes at all. The NetBSD implementation of cluster mbufs solves all of these problems through the use of ext_size and reference pointers, along with new macros like MEXTADD().

Currently the mbuf system in this release does not set up a static pool of memory separate from the rest of the system. This is in part intentional. Carving out a big chunk of system memory for mbufs alone tends to degrade the efficiency of memory usage. Since the new malloc has very good speed and fragmentation behavior, this should work out well. However, if this is not desired for any reason, it is easy to create a separate pool of memory for mbuf allocation. Changes need to be made to mbinit(), which sets up pools. Currently pool_init() is minimal: it does not set up any real pools. Instead it records the unit of allocation required, which is later used to allocate memory from the system pool dynamically.
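
The minimal pool behavior described above, where initialization only records the unit of allocation and each request is served from the system allocator, could be sketched as follows. The demo_pool names and fields are hypothetical and do not reproduce the actual pool_init() or pool_get() signatures.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical pool descriptor: it only remembers the item size. */
struct demo_pool {
    size_t item_size;       /* unit of allocation recorded at init time */
    unsigned long nget;     /* items currently handed out */
};

/* Record the unit of allocation; no memory is reserved here. */
void demo_pool_init(struct demo_pool *pp, size_t item_size)
{
    assert(pp != NULL && item_size > 0);
    pp->item_size = item_size;
    pp->nget = 0;
}

/* Serve a request directly from the system allocator. */
void *demo_pool_get(struct demo_pool *pp)
{
    void *item;

    assert(pp != NULL);
    item = malloc(pp->item_size);
    if (item != NULL)
        pp->nget++;
    return item;
}

/* Return an item to the system allocator. */
void demo_pool_put(struct demo_pool *pp, void *item)
{
    assert(pp != NULL);
    if (item != NULL) {
        free(item);
        pp->nget--;
    }
}
```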

To get a separate pool, pool_init() can be made to allocate a large block of the requested size from the system malloc pool, and pool_get() requests can then be satisfied out of this new pool.

2.4 The device driver issues

The PSO Stack as supplied comes with a working loopback device driver, which is used to demonstrate the proper function of the stack software. When integrating real device drivers, a number of additional steps are needed. Although PSO Stack includes ported software that handles PCI devices (probing, configuring, etc.), more code must be incorporated to actually interface with the many PCI style devices available commercially. It is recommended that users port existing, working drivers from VxWorks to PSO Stack. The only changes required are to the mbuf related code and to the way interrupts are queued to the thread level software.

The mbuf related code needs to change because the definition of mbuf, as well as the associated macros for handling mbufs, have changed. Functions such as build_cluster() and do_protocol_with_type() are VxWorks specific and are no longer required by the PSO Stack. Any use of such calls needs to be replaced. build_cluster() can be replaced with the MEXT related macros that are used to allocate and construct cluster mbufs. The do_protocol() routines can be replaced either by direct calls to ether_input() or by hand-written code based on the code inside ether_input() (as supplied in PSO Stack net/if_ethersubr.c). If your driver references netJobAdd() (some do, some don't), the call needs to be replaced by schednetisr() (see if_ethersubr.c).

PSO Stack handles incoming interrupts a little differently than the original VxWorks. Both attempt to minimize interrupt latency by doing a minimal amount of work at interrupt level (namely, queueing further work to be performed later at thread level). However, they differ in the way threads are used.
VxWorks uses a single overloaded thread (netTask) to do everything network related, including handling incoming interrupts from one or more network devices. PSO Stack uses a dedicated thread for each interface, and each thread can have a different priority. This allows, for example, prioritizing packet processing and controlling data flows based on the type of network device attached to the system. It also avoids the congestion that netTask often suffers from. Additionally, netTask can hang because of a single non-fatal error in any of a number of areas in the network code and drivers. Unlike netTask, PSO Stack continues to function even when one of the devices dies.

3. Replacements for VxWorks facilities

3.1 netTask replacements

For each network device a separate thread is created to handle device specific events (such as interrupts). This is done from each device driver via init_netisr_handler(). The body of the code that implements the thread is passed as a function pointer to init_netisr_handler(), which creates the thread. This means that each device driver can customize what its own thread does. Note that even the loopback driver (if_loop.c) creates a thread for its work. The loopback driver does not use real hardware interrupts at all, since it is a pseudo driver that simply loops packets over a queue, but it does use a thread of its own to hand off the work of handling a packet sent by an application thread. An application thread hands off the packet to be delivered via the socket API. The packet is queued and eventually handled in some other thread's context (the reader), but the work of handing the queued packet to the reader (via the IP -> TCP stack and back up to the socket API on the reader side) is done by the device driver thread.

netTask in VxWorks also handles things other than interrupts from device drivers.
It serves as the central context for all network related functions, for delayed execution of function calls, and as a general context provider for all protocol related software. PSO Stack provides a generic message handling thread, implemented in vxworks/vxworks_port.c as queue_message_thread. This thread can be used for all non-interrupt networking functions that require a context. It is also possible to create another thread with its own priority to further subdivide this work. Currently queue_message_thread (whose body is defined in queue_message_loop()) is mainly used for the various timers triggered by the timeout interrupt handler (VxWorks watchdog timers). The interrupt level code that handles timeouts queues messages containing function pointers, which are then executed at thread level in the queue_message_thread context.

To summarize, the functions of netTask are subdivided into multiple threads. Interrupt handling (deferred, thread level handling of interrupt events) is done in network device driver specific threads, each of which is created by the device driver writer (or porter). One such thread is created per active instance of a network device driver of any kind. Almost all other work that requires an independent thread context (such as timer event execution at thread level) is done via queue_message_thread.

3.2 Other support code replacements

Two further categories of replacements for existing facilities are made. The first consists of functional replacements: routines that actually emulate the original functionality of the routine in the underlying RTOS. The second consists of stubs for routines that are no longer needed (or not yet needed). These are null functions that exist only so that the object modules link successfully into a usable executable. They serve only to satisfy the linker.
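
The difference between the two categories can be shown with a short sketch. The routine names and signatures here are invented placeholders and are not the actual VxWorks prototypes; the sketch only contrasts a replacement that forwards to new code with a null function that does nothing.

```c
#include <assert.h>

static int demo_stack_ready;        /* set once initialization has run */

/* Stands in for the replacement stack's real initialization work. */
static int demo_stack_init(void)
{
    assert(!demo_stack_ready);      /* must run at most once */
    demo_stack_ready = 1;
    return 0;
}

/*
 * Category 1: functional replacement. The entry point keeps the role
 * the RTOS start-up code expects, but the body calls the new code.
 */
int demoNetLibInit(void)
{
    if (demo_stack_ready)
        return 0;                   /* already initialized */
    return demo_stack_init();
}

/*
 * Category 2: stub. A null function that reports success and exists
 * only so that references to it resolve at link time.
 */
int demoShowInit(void)
{
    return 0;
}

/* Lets a caller check whether initialization has happened. */
int demo_stack_is_ready(void)
{
    return demo_stack_ready;
}
```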

The first category includes routines such as netLibInit() and sockLibInit(). The call to netLibInit() is made by the start-up code within VxWorks that is included as part of the BSP source tree (under target/config/all). This call is meant to initialize the network software, so it is a good place for PSO Stack to call init_bsd_compat(), which initializes the mutexes required by the PSO Stack and calls a sequence of routines that establish the initial runtime conditions, similar to the way BSD Unix initialization happens. The difference is that only network related functions are initialized: mbinit(), soinit(), ifinit(), and domaininit(). In other words, the mbuf subsystem, the socket subsystem, the network interface driver subsystem, and the protocol domain subsystem, including protocol specific initialization, are set up here. The routine sockLibInit() is equivalent to a call to init_socket_lib(). This creates a hook into the VxWorks I/O subsystem (ioLib and iosLib) to allow socket objects. After this point, sockets appear as I/O objects that can be read, written, and ioctl'ed.

The second category includes a number of routines (not a complete list) that are either not needed for now or simply unimplemented: ipAttach, ipLibInit, rawIpLibInit, rawLibInit, udpLibInit, udpShowInit, tcpLibInit, tcpShowInit, icmpLibInit, igmpLibInit, netShowInit, ifMaskSet, ifAddrSet, bsdSockLibInit, sockLibAdd, connectWithTimeout, ifAddrGet, routeAdd, netJobAdd, ulipInit, elcdetach, ultradetach, slipInit, eiattach, elcattach, ultraattach, eexattach, eltattach, eneattach. It may be necessary to stub out more or fewer routines depending on the version of your VxWorks development environment.

3.3 Functional replacements

Some routines are not replaced but are implemented under different names. These include routines such as ifconfig() and netstat() in PSO Stack versus ifAddrSet(), icmpstatShow(), etc. in VxWorks.
The PSO Stack tries to be more BSD Unix compatible in these kinds of support routines, since more people are familiar with the Unix style commands normally used to set up and control networking facilities. To set a network device's IP address, for example, one would call ifconfig("dev0", "addr", inet_addr("123.33.2.38")) instead of the non-standard routine ifAddrSet(). This is a minor difference, but well worth the change, because a lot of original BSD code can be reused as a result.

Examples of some of these calls and how they are used can be found in the file pso.c, which is included as part of the target BSP directory addition. In pso.c the first routine, pso_init(), calls a number of routines to initialize the stack. init_bsd_compat() and init_socket_lib() are called to initialize the BSD compatibility and socket API facilities. After that, the loopback device driver can be initialized by calling loattach(). Note that this initialization sequence is similar to VxWorks' own network initialization in usrNetwork.c, but much simpler. Users can change pso_init(), for example by adding initialization for their own device drivers.

3.4 PSO Additions

There are two directories under the main tree that are created by PSO, called pso and vxworks. The pso directory contains the RTOS independent implementation of interfaces to the underlying RTOS (e.g. VxWorks), as well as additional code that fits the BSD software into a traditional RTOS paradigm. The vxworks directory implements the API required by PSO Stack specifically for the VxWorks RTOS.

Users should pay attention to the way data structures that are unique to VxWorks are exposed to the PSO Stack software. Since the original BSD Unix software and VxWorks share similar concepts, some data structures and data structure names clash if both systems' include files are used at the same time.
This poses a dilemma for programmers who port BSD Unix based software to VxWorks. On one hand, the data structures unique to VxWorks must be preserved for binary compatibility at the function call level. On the other hand, knowing too much detail about VxWorks data structures requires including VxWorks header files, which can clash with BSD header files. PSO Stack handles this problem through the way it uses VxWorks header files: the only place they are included is in files under the vxworks subdirectory. Any data structure elements exposed outside the vxworks directory are exposed as void pointers. For example, the semaphore structure is exposed as a void pointer.

This problem is acutely illustrated by the way ioctl routines are implemented. Because the definitions of the ioctl 'commands' (e.g. FIONBIO) are system specific, files that reference such commands must take care to avoid conflicts between BSD and VxWorks. The same problem has influenced the way VxWorks select() support is incorporated into PSO Stack. VxWorks implements select() via the ioctls FIOSELECT and FIOUNSELECT. These ioctls, and the way they are used, are unique to VxWorks. The file vxworks/vxworks_port.c exists because of this issue.

Note that it is possible to bypass the VxWorks resident select support by replacing the VxWorks I/O subsystem with BSD I/O interfaces altogether. For PSO Stack we have chosen not to do this. Our goal has been to preserve as much of the BSD software as possible while using as much of the existing VxWorks facilities as possible. (If this were not our goal, why bother using VxWorks at all? One could simply use the BSD kernel as an embedded OS.)

4. Packet forwarding and ATM specific considerations

4.1 Buffer handling

When a packet arrives from an interface, it is sent up to the IP layer code, which either forwards the packet or keeps it. If the packet is destined for the local host, it is consumed locally. Otherwise the packet is forwarded by the routing code at the IP layer.

Reception and IP layer handling happen in the context of the driver specific thread, which is created by the user. This thread, which is created when init_netisr_handler() is called in the device driver initialization routine, provides thread level context for handling reception and other interrupt events at thread level rather than interrupt level (to minimize interrupt latency). When it receives a packet, a BSD Unix driver typically calls ether_input(). It may also call ipintr() directly, as if_loop.c (the loopback network driver) does. ether_input() eventually calls ipintr() as well, after decoding the Ethernet header and performing minimal sanity checks. Drivers written for VxWorks may also call ether_input(), or they may call do_protocol() (or sometimes do_protocol_with_type()) instead. do_protocol() performs a function similar to ether_input().

Eventually the packet ends up in the IP layer code (via ipintr()), which either consumes the packet locally by sending it up to the upper layer protocols (UDP or TCP) or forwards it via ip_forward(). ip_forward() determines routing information via rtalloc(). The driver thread provides the context for all of the routing code, and the forwarded packet is queued (or sent immediately) via ip_output(), which sends it out to the device resolved by the routing table lookup.

When forwarding packets between ATM and Ethernet devices, one has to take into account that the MTU for ATM is substantially larger than for Ethernet (typically 8K vs. 1.5K bytes, although the ATM MTU can theoretically be as large as 64K). To avoid copying data as much as possible, the driver writer should examine the code path that carries the data from ATM to Ethernet and vice versa.
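
To make the size mismatch concrete, the sketch below counts how many pieces a large datagram turns into on a smaller-MTU link. It assumes ordinary IPv4 fragmentation with a 20-byte header and no options, which this document does not spell out; the function name and constant are made up for illustration.

```c
#include <assert.h>

#define DEMO_IP_HDR_LEN 20u     /* IPv4 header without options (assumed) */

/*
 * Number of IPv4 fragments needed to carry a datagram of total length
 * dgram_len (header included) over a link with the given MTU. Every
 * fragment except the last must carry a multiple of 8 data bytes.
 */
unsigned demo_fragment_count(unsigned dgram_len, unsigned mtu)
{
    unsigned payload, per_frag;

    assert(dgram_len >= DEMO_IP_HDR_LEN);
    assert(mtu >= DEMO_IP_HDR_LEN + 8u);

    if (dgram_len <= mtu)
        return 1;                                   /* fits as is */

    payload  = dgram_len - DEMO_IP_HDR_LEN;         /* data bytes to carry */
    per_frag = (mtu - DEMO_IP_HDR_LEN) & ~7u;       /* round down to 8 */
    return (payload + per_frag - 1u) / per_frag;    /* ceiling division */
}
```

With these assumptions, demo_fragment_count(8192, 1500) yields 6, so one 8K ATM buffer ends up feeding six Ethernet frames.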

In an ideal situation, large pieces of driver memory (8K for ATM) are initially loaned to the ATM device by the driver. The ATM device DMAs data into these buffers and interrupts the driver's interrupt handler, which notifies the driver thread via semaphore (schednetisr()). The driver thread performs the input side protocol work as described above, and possibly forwards the packet out to the Ethernet device via ip_output(). The mbuf passed to ip_output(), which at this point goes down to the Ethernet driver's output routine, is a "cluster" mbuf that holds a pointer to the data portion. The data portion is a piece of memory containing the IP datagram, allocated by the input device driver (ATM). After the packet has been sent by the output device (Ethernet), the mbuf is freed, at which point the free routine registered with the cluster mbuf is called. The free routine was registered when the cluster mbuf was created (see the MEXTADD macro in sys/mbuf.h), with a pointer to an internal function that resides in the device driver (ATM in this example).

Since the MTU sizes differ, the 8K buffer needs to be split into multiple buffers that are given to the Ethernet device for output. Avoiding copying during this splitting phase can be tricky, since multiple Ethernet buffers belong to one ATM buffer. For packet forwarding that properly avoids copying, the Ethernet driver needs to be optimized to handle these buffer size differences.

4.2 Thread context switching

In hard realtime environments where response time is critical, typical preemptive, strictly priority driven scheduling behavior is essential. Minimal interrupt latency and fast response are critical to realtime system behavior. This is not always the most important factor in other embedded environments, where throughput is the highest concern.
For example, in an embedded packet router or file server it is sometimes better to optimize CPU usage for maximum throughput while minimizing the overhead per unit of work.

Realtime response is typically achieved with a thread based architecture that has fast thread context switching and strict priorities that reflect the event handling priorities. The thread switching overhead in these environments can sometimes hinder overall throughput for a given hardware performance capability. In the worst case, the system may spend too much time context switching and not enough time doing throughput related work.

For this reason, many throughput oriented systems such as routers run a very simple loop as the realtime kernel. In these systems the whole runtime is governed by a hand optimized loop that services events in sequence, in an order determined by careful tuning and experimentation. These systems tend to be either completely single threaded or based on cooperative multi-threading. The latter is a system where multiple thread contexts are supported (as in preemptive systems), but thread scheduling happens only synchronously, when a thread asks to yield the CPU to other threads. That is, a thread can run only when the currently running thread explicitly yields the CPU. In both kinds of environment, thread switching overhead is minimized. In the first case there is no switching. In the second case switching happens only when it is needed. The system still benefits from the thread abstraction and priorities, but context switches are hand-tuned so that they happen only when they need to.

It may be necessary to explore cooperative multi-tasking in the context of the VxWorks runtime. Users can create, within the VxWorks environment, a subsystem that runs in cooperative fashion.
This is possible because the VxWorks runtime is very simple and can be viewed as one large program running in a single address space. Preemptive scheduling does take place, but you can always bypass it (i.e. if a task that currently holds the CPU at a given priority wants to run forever and does not yield the CPU, no lower priority thread will run).

5. Strategies for optimizing device drivers

The number one overhead in networking I/O is data copying. Avoiding data copying is the quickest way to optimize any network device driver. Many network drivers employ the idea of "loaning" buffers to avoid copying them. By pre-allocating the data structures used for incoming packets and loaning them out to the network interface hardware, the CPU avoids having to copy the data out of device memory into mbufs. Instead the network hardware can DMA the data directly into the loaned buffer, which is pointed to by the mbuf header. The data DMA'ed this way is already in mbuf format and can be passed up to the IP stack without copying. Similar things are done for outgoing packets. The mbufs contain clusters that can be copied virtually via references (that is, the bytes in the data buffer are not physically copied). The mbuf data can then be given (loaned) to the output engine of the network hardware, which DMAs the data directly out of the buffer space. When transmission is complete, an interrupt is generated so that such mbufs can be recycled. When loaning mbufs to devices and upper layer protocols, it is important for the device driver writer to keep in mind the upper limit for the number of mbufs to be loaned out. Loaning more mbufs out does not always mean better performance. A good upper limit for the loaned buffer count can be determined experimentally, by trying various configurations and running extensive tests, and then used for the driver.
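The upper limit on loaned buffers can be tracked with a simple counter. The following is only a minimal sketch of that idea: the names drv_max_loaned, struct drv_loan_state, drv_try_loan() and drv_loan_returned() are hypothetical and are not part of PSO Stack or VxWorks, and a real driver would also have to protect the counter against concurrent access.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical limit on receive buffers on loan at any one time. */
int drv_max_loaned = 32;

/* Hypothetical per-driver bookkeeping (not a PSO Stack structure). */
struct drv_loan_state {
    int loaned;   /* buffers currently held by upper layers */
    int copied;   /* packets that had to be copied instead of loaned */
};

/*
 * Decide whether one more buffer may be loaned upward.
 * Returns 1 if the loan is allowed, 0 if the limit has been reached
 * and the caller should fall back to copying the data.
 */
int drv_try_loan(struct drv_loan_state *st)
{
    assert(st != NULL);
    if (st->loaned >= drv_max_loaned) {
        st->copied++;
        return 0;
    }
    st->loaned++;
    return 1;
}

/* To be called when a loaned buffer is handed back to the driver. */
void drv_loan_returned(struct drv_loan_state *st)
{
    assert(st != NULL && st->loaned > 0);
    st->loaned--;
}
```

The receive path would call drv_try_loan() once per incoming frame, and drv_loan_returned() would be called from the routine that runs when a loaned buffer is freed.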
It might be prudent to make this a tunable global variable. Not all drivers are written to be this flexible, and depending on the original VxWorks driver (if you are porting one), you might need to refine this aspect. Another issue to be aware of is that an application which behaves erratically can cause the loaning mechanism to lock up. For example, if an application thread which should be reading data off of the socket level buffer queue somehow hangs and stops reading packets from the queue (or does not read from the queue in a timely fashion), then buffers will be queued up to the maximum amount allowed. This maximum can be as large as 64K bytes, as set via the socket options for the socket level buffers (there are two: SO_RCVBUF for the receive side and SO_SNDBUF for the send side). Imagine that several application threads hang like this, exhausting the mbufs that have been loaned out, which are never returned to the driver because the application threads do not read them. The buffers are returned to the driver when an application thread reads the data into an application specific buffer and the mbuf is then freed. Freeing the mbuf causes the callback routine in the driver to be called for loaned mbufs. When this cycle of loaning and freeing, and thus recycling, buffers does not happen as planned, the driver can end up using too much buffer space, and may hang depending on how it is written. It is therefore advised that the driver limit the number of buffers that are loaned to the application threads. There should be a maximum threshold over which the driver no longer loans out buffers at all. Above that threshold the driver should instead copy incoming data into new mbufs that are given to the upper layers, recycle the loaned mbufs immediately, and give them back to the device.

6. Implementing zero-copy

As a packet travels from the network wire all the way up to the application, there can be many places where the data is copied from one buffer to another, sometimes needlessly. It is possible to minimize this data copy overhead at various layers. We are mostly concerned with two places: driver level and socket level data copies. The code in between these layers (IP, TCP, UDP, etc.) is already very well optimized. Any additional optimizations in these protocol areas can take a long time to implement and debug, yielding questionable benefits, if any.

6.1 Device driver layer zero copy strategy

The basic data structure used for buffering is the mbuf. In PSO Stack, which is derived from the NetBSD Unix implementation, the mbuf mechanism has many advanced features that are useful for implementing copy avoidance. In particular, this mbuf implementation supports cluster mbufs that are not actually copied when m_copy is called. Instead a reference is recorded inside the cluster mbuf data structure for the 'copy', and the cluster is not fully freed until all such references are released (all holders of copies/references must free their instance of the mbuf). When the mbuf is finally freed, the free routine function pointer embedded within the cluster mbuf structure is used to free the data portion according to the allocator's policy. For example, if the allocator of the data portion of a cluster mbuf was a device driver, the driver will have its own internal routine that knows what to do when the buffer is freed. In other words, the driver allocates the buffer and loans it out to others, and when it is eventually freed, the driver's own internal function is called via the pointer. So, one obvious place for copy avoidance optimization is the device driver layer. This can be done via careful use of cluster mbufs for both input and output paths.
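The reference and free-callback behavior described above can be illustrated with a small, self-contained model. This is a sketch under stated assumptions, not the code in sys/mbuf.h: struct xbuf, xbuf_attach(), xbuf_copy(), xbuf_free() and drv_rx_buf_free() are hypothetical names, and a plain counter is used here where the real mbuf code may track references differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-in for the external part of a cluster mbuf. */
struct xbuf_ext {
    char  *data;                                     /* allocator-owned storage */
    size_t size;
    int    refs;                                     /* headers sharing the data */
    void (*free_fn)(char *data, size_t size, void *arg);
    void  *arg;                                      /* passed back to free_fn */
};

/* Simplified stand-in for an mbuf header pointing at external data. */
struct xbuf {
    struct xbuf_ext *ext;
};

/* Wrap allocator-owned storage and register its free routine. */
struct xbuf *xbuf_attach(char *data, size_t size,
                         void (*free_fn)(char *, size_t, void *), void *arg)
{
    struct xbuf *m = malloc(sizeof *m);
    struct xbuf_ext *e = malloc(sizeof *e);
    if (m == NULL || e == NULL) {
        free(m);
        free(e);
        return NULL;
    }
    e->data = data;
    e->size = size;
    e->refs = 1;
    e->free_fn = free_fn;
    e->arg = arg;
    m->ext = e;
    return m;
}

/* "Copy" by reference: a new header shares the same data, no bytes move. */
struct xbuf *xbuf_copy(struct xbuf *m)
{
    struct xbuf *c = malloc(sizeof *c);
    if (c == NULL)
        return NULL;
    c->ext = m->ext;
    c->ext->refs++;
    return c;
}

/* Free one header; the allocator's routine runs only for the last one. */
void xbuf_free(struct xbuf *m)
{
    struct xbuf_ext *e = m->ext;
    assert(e->refs > 0);
    if (--e->refs == 0) {
        e->free_fn(e->data, e->size, e->arg);
        free(e);
    }
    free(m);
}

/* Example driver-side routine: counts buffers handed back to the driver. */
void drv_rx_buf_free(char *data, size_t size, void *arg)
{
    (void)data;
    (void)size;
    ++*(int *)arg;
}
```

In this model, freeing the original header while a 'copy' is still outstanding does not invoke drv_rx_buf_free(); the driver gets its buffer back only when the last holder frees its header.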
For input paths, the driver typically allocates buffers large enough for the device's MTU and loans them out to the device's receive data structure (typically a linked list or ring buffer of some sort specific to the hardware architecture). The device DMAs incoming frames into these buffers and interrupts the driver's handler routine. The driver attaches a small mbuf header that points to this buffer, constructing a cluster mbuf out of the DMA'ed loan buffer, and passes it up to the upper layer. On the output side, the upper layer code sends down an arbitrarily complex chain of one or more mbufs. The sizes of the mbufs in the chain vary, especially if TCP is sending packets down. For output side buffer loaning to work effectively, the hardware has to support data chaining and arbitrary packet start boundaries. With data chaining, software builds a linked list of buffers rather than one contiguous buffer, and gives the list to the device for output. The device then DMAs data out of the chain of buffers, gathering an MTU's worth (or less) of bytes for each output frame. The device has to support arbitrary start boundaries instead of requiring strict alignment such as long word alignment. Each element of the output packet chain must be able to start at any byte boundary without hanging the hardware output state machine.

6.2 Socket layer zero copy strategy

This section is relevant only if you are writing thread applications or daemons based on the user level socket API. To preserve the socket API, VxWorks performs an extra socket level copy of data from the buffers queued at the socket queue into user buffers. This is unavoidable if one has to adhere strictly to a socket API compatible with the original BSD software. However, for efficiency reasons, there have been various attempts in VxWorks to support socket level copy avoidance.
One recent implementation, called zbuf, attempts to alleviate the copy overhead at socket level, but has resulted in variable behavior; sometimes the overhead is lower, sometimes it is higher. Overall, zbuf fails to deliver. Besides, the complexity of the interface requires more extensive API and data structure changes than otherwise needed. A sensible alternative is to bypass the socket API altogether and directly use the uipc_socket.c level interface, which is what the socket library interface itself uses. By interfacing directly to the layer below the socket API, users can pass mbuf data structures to read and write data on any given socket. Direct access to mbufs such as this is possible because VxWorks is a single address space OS that runs as if it were one giant program. Direct access to mbufs allows applications to use the same strategy employed by the rest of the protocol kernel code to avoid data copying. There is no need to copy back and forth between a user specific malloc'ed piece of memory and an mbuf. For example, instead of calling sendto(), application threads can call sosend(). If you look at the source code for sendto(), it is clear that sendto() eventually calls sosend() anyway (i.e. so->so_send). By bypassing the translation that takes place to make the socket API conform to the BSD socket library, you can directly call the mbuf based API specified in uipc_socket.c.

7. Strategies for optimizing IP checksumming

The second highest overhead in the protocol stack is the checksum calculation. Optimizing the IP checksum routine can easily yield a modest performance boost. The portable C version of in_cksum() is written in a way that already performs IP checksums efficiently. However, it is likely possible to improve on this by recoding the checksum routines in assembly language. Most of the NetBSD ports have assembly optimized checksum routines. These routines can be ported pretty much as is.
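For readers unfamiliar with the calculation being optimized, here is a minimal portable sketch of the Internet checksum (16-bit one's-complement sum) over a flat buffer. It is not the PSO Stack in_cksum(), which operates on mbuf chains and is considerably more optimized; the name cksum_simple() is hypothetical and the byte-at-a-time loop is chosen for clarity, not speed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Compute the 16-bit one's-complement checksum of len bytes.
 * Bytes are combined as big-endian 16-bit words, so the result is
 * returned in host order and does not depend on CPU endianness.
 * The 32-bit accumulator is only folded at the end, which is safe for
 * buffers up to 64K bytes (the maximum IP datagram size).
 */
uint16_t cksum_simple(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    assert(data != NULL || len == 0);

    while (len > 1) {
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len -= 2;
    }
    if (len == 1)                       /* odd trailing byte, zero padded */
        sum += (uint32_t)data[0] << 8;

    while (sum >> 16)                   /* fold carries back into 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}
```

A useful property for testing: running the routine over a header whose checksum field already holds the correct value yields 0.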
The fact that PSO Stack does not currently contain an optimized assembly version of the IP checksum code for all target CPU architectures does not mean that none is available. Users are encouraged to research existing BSD code for further optimization. For example, the StrongARM port of the BSD in_checksum() code uses the same C based implementation augmented with inline assembly code for several key macros, such as ADD64, ADD32, etc. This is a good strategy since you only need to rewrite those macros in assembly, and the C routine which drives the macros stays the same. The main things to watch out for are: unrolling the loop for better efficiency, using the largest possible arithmetic operations (for example, use 32 bit add-with-carry instead of 16 bit versions if at all possible), and keeping the behavior of the instruction and data caches in mind to avoid unnecessary cache flushes. Another approach (taken by the Linux networking code, but not in BSD) is to combine data copying with checksumming. This is often called checksum while copying. Its advantage is that the data is traversed only once, in a single loop, which reduces the overhead. Users may wish to look into this as well. Remember, the first thing to optimize in TCP/IP networking code is removal of unnecessary data copying. The second is IP checksumming.

8. Building and merging PSO Stack with existing VxWorks BSP

8.1 Reducing VxWorks original library

Before linking the bsdsys libraries into an existing VxWorks BSP environment, you need to make a copy of the existing VxWorks main library, which is normally linked with your BSP support code to produce a loadable and runnable VxWorks image. Make a copy of the library file, which is normally located under the target/lib/ directory and called libXXXgnuvx.a where XXX is the CPU architecture (for example, libMC68000gnuvx.a). The modification required is to delete some of the objects from the copy. You should back up the original library.
The files to be deleted are (not a complete list and you may need to hand tune this):

    if_bp.o if_cpm.o if_egl.o if_ei.o if_eitp.o if_enp.o
    if_ex.o if_fn.o if_ilac.o if_ln.o if_lnsgi.o if_lnPci.o
    if_loop.o if_med.o if_nic.o if_sl.o if_sm.o if_sn.o
    smNetLib.o smNetShow.o if_eihk.o if_elc.o if_dc.o if_ultra.o
    if_eex.o if_fei.o if_elt.o if_ene.o if_ulip.o if_es.o
    if_nicEvb.o if_esmc.o if.o if_ether.o if_subr.o in.o
    in_pcb.o in_proto.o ip_icmp.o ip_input.o ip_output.o raw_cb.o
    raw_ip.o raw_usrreq.o route.o sys_socket.o tcp_debug.o tcp_input.o
    tcp_output.o tcp_subr.o tcp_timer.o tcp_usrreq.o udp_usrreq.o uipc_mbuf.o
    uipc_sock.o uipc_sock2.o unixLib.o if_ppp.o pppShow.o ifLib.o
    inetLib.o netLib.o netShow.o routeLib.o sockLib.o zbufSockLib.o
    mbufSockLib.o bsdSockLib.o memLib.o memPartLib.o memShow.o ipProto.o
    ipLib.o udpLib.o udpShow.o tcpLib.o tcpShow.o icmpLib.o
    igmpLib.o etherLib.o dec21x4xEnd.o

8.2 Compiling the PSO Stack libraries

The PSO Stack libraries lib_pso_bsd.a (the main PSO Stack library) and lib_bsd_c.a (libc related add-ons) are created when the PSO Stack distribution is un-tarred and make is run in the directory. First, un-tar the distribution and set up the environment. The distribution currently supports three targets (Solaris/VxSim, PC, StrongARM SA110). Depending on your target environment, you need to set up a few things. The directory machine needs to be a symbolic link to one of the arch directories. For example, VxSim/Solaris uses arch/sparc/include (you should make a symlink named machine that points to arch/sparc/include, e.g. "ln -s arch/sparc/include machine"). Similarly for PC, "ln -s arch/i386/include machine" and for SA110 "ln -s arch/arm32/include machine". You should also set the various environment variables used by the toolchain, as described in the setup.sh files. There is one for each target: setup.sh.arm32 is for SA110, setup.sh.i386 is for PC, setup.sh.sparc is for VxSim/Solaris.
If you are using /bin/bash as your shell you can source these setup files via ". ./setup.sh.sparc" for example. If you are using another shell you will need to change the syntax (for example, for csh you should change "XXX=value; export XXX" to "setenv XXX value"). This step sets up all the paths and other parts of the environment. The last setup step is to make a symlink for the defs.mk file. There are three: defs.mk.arm32, defs.mk.i386 and defs.mk.sparc. Depending on your target, you should symlink one of these to defs.mk. For example, for VxSim/Solaris, you should "ln -s defs.mk.sparc defs.mk". The defs.mk file is included by the Makefiles in the various directories. It is the place where all information common to the Makefiles is kept. For example, CFLAGS is defined there. Once setup is complete, you should do "make depend" followed by "make". Sometimes the make will fail, depending on your target type. For example, when building for the VxSim/Solaris target, source files related to unsupported features such as PCI bus support and PCI devices (the DEC 21143 ethernet and Intel 82558 ethernet device drivers, both of which are PCI) will not compile cleanly. When you see that files under dev/pci/ (e.g. if_de.c, if_fxp.c, etc.) do not compile, it is likely that these are not supported on your target. VxSim running on Solaris does not support any direct hardware access, so PCI is not supported. You should take references to these files out of the Makefiles as needed. A later release of PSO Stack will have better configuration for different targets to avoid this problem.

8.3 Building the runnable VxWorks target image

The VxWorks target can be built as a "standalone" image which contains all the symbols (the symbols can later be stripped out).
Instead of linking with the normal VxWorks library, you should link in the reduced library along with the PSO Stack libraries you built. Doing this replaces some of the internal routines VxWorks uses for networking. If the link fails because of missing routines or variables, this is likely due to a version mismatch. In most cases, the missing symbols can be stubbed out (see the vxworks/vxworks_port.c file). Initialization of the PSO Stack modules can be done as in the provided pso.c file, which can be linked in as part of your BSP. Some examples, such as the ttcp benchmark program and the blaster/blastee programs, are provided to give you an initial run of the PSO Stack.

9. Mutex protection of various modules (including TCP/IP stack)

Unlike Unix, a system such as VxWorks which attempts to minimize interrupt latency must take care not to spend too much time with interrupts locked out. The Unix strategy of using interrupt lockouts for mutual exclusion is not feasible for this reason. A system's interrupt latency is bounded by the longest continuous segment of code that runs with interrupts locked. Locking interrupts out prevents the system from responding to hardware events in a timely fashion. VxWorks, therefore, uses semaphores for mutual exclusion. By avoiding interrupt lockouts the latency is minimized. Furthermore, VxWorks tries to spend as little time as possible in interrupt level code such as interrupt handlers. Most VxWorks interrupt handlers spend a minimal amount of time acknowledging the interrupt and queueing a routine to be performed later by another thread. The idea is to do the absolute minimum to acknowledge the interrupt and not spend any more time processing it in the interrupt level handler. Instead a message containing a function pointer and arguments is constructed and sent to a thread that is responsible for carrying out the "real" work of servicing the request made by the hardware interrupt.
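The deferral pattern described above (a message holding a function pointer and argument, posted at interrupt level and executed later by a thread) can be modeled as a small ring queue. This is an illustrative sketch only: struct jobq, jobq_post(), jobq_drain() and rx_service() are hypothetical names, not VxWorks or PSO Stack routines, and the semaphore or message queue that would wake the servicing thread, as well as any interrupt-safe synchronization, is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

#define JOBQ_SIZE 8u   /* hypothetical queue depth; must be a power of two */

typedef void (*job_fn)(void *arg);

struct job {
    job_fn fn;         /* routine to run at thread level */
    void  *arg;        /* argument handed to the routine */
};

struct jobq {
    struct job slots[JOBQ_SIZE];
    unsigned head;     /* index of the next job to run */
    unsigned tail;     /* index of the next free slot */
};

/*
 * Interrupt level side: only record what has to be done.
 * Returns 1 on success, 0 if the queue is full.
 */
int jobq_post(struct jobq *q, job_fn fn, void *arg)
{
    assert(q != NULL && fn != NULL);
    if (q->tail - q->head == JOBQ_SIZE)
        return 0;
    q->slots[q->tail % JOBQ_SIZE].fn = fn;
    q->slots[q->tail % JOBQ_SIZE].arg = arg;
    q->tail++;
    return 1;
}

/*
 * Thread level side: run every pending job in posting order.
 * Returns the number of jobs that were executed.
 */
int jobq_drain(struct jobq *q)
{
    int n = 0;
    assert(q != NULL);
    while (q->head != q->tail) {
        struct job j = q->slots[q->head % JOBQ_SIZE];
        q->head++;
        j.fn(j.arg);
        n++;
    }
    return n;
}

/* Example job: stands in for the "real" servicing work. */
void rx_service(void *arg)
{
    ++*(int *)arg;
}
```

The interrupt handler's cost is limited to one jobq_post() call; everything inside rx_service() runs later, when the servicing thread calls jobq_drain().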
This work is done in thread level code, not interrupt level code. Since a minimal amount of time is spent at interrupt level, interrupt latency is minimized.

10. VxWorks 'select' support

VxWorks' own select() support is based on the FIOSELECT and FIOUNSELECT ioctl commands that are unique to VxWorks. Unix implements select support as an integral part of its I/O system design. VxWorks supports select on a per-device basis; depending on whether it handles FIOSELECT, a device may or may not support selection. The PSO Stack follows the original VxWorks selection support because it is nearly impossible to implement it otherwise and still remain compatible with other VxWorks I/O devices which support selection in the original fashion. The support code for selection is spread over three files, due to header file conflicts and other complexities: vxworks/vxworks_iotcl.c, vxworks/vxworks_port.c and pso/bsd_compat.c. The socket device, which is registered as a VxWorks I/O device, implements an ioctl routine which has added support for FIOSELECT/FIOUNSELECT. From the application point of view, the PSO Stack implementation of the select() call is functionally equivalent to VxWorks' native version.

11. Testing strategies

11.1 Driver testing

Network device driver writers can start by unit testing their software. When unit testing looks good and the driver starts to function properly, the driver can be tested further using the ttcp and blaster programs included in PSO Stack. These programs can exhaustively test the robustness of the driver, as well as benchmark its performance. These are the tools that have been used extensively to test the drivers in VxWorks. Additionally, debugging facilities such as a network analyser can help. It is often sufficient to just use snoop or tcpdump on Unix (Solaris or Linux) machines. However, sometimes it may be necessary to have professional tools for tracking link level errors and generating packets.
11.2 Protocol stack testing

The ttcp and blaster programs are good at testing most of the stack profile that is commonly used in networking software. However, there is no substitute for porting as many BSD style networking applications as possible to verify the proper function of the PSO Stack. For example, proper ICMP protocol behavior can be tested by porting the ping program.

Provided by Hwa-Jin Bae, bae@Mail.com, Piedmont California

All modifications to original BSD software placed under original BSD license.