User's Guide to PSO Stack Software

		Revision: 1.1


1. Introduction

This document describes the PSO Stack software.  Users are encouraged to read
other books on TCP/IP and BSD Unix kernel software, such as those by Richard
Stevens and by Kirk McKusick et al.  The PSO Stack is a port of a 4.4 BSD Lite
based operating system (NetBSD), which is extensively documented in various
books available in technical bookstores.  In particular, Richard Stevens'
TCP/IP books give a very detailed source code level description of the BSD
TCP/IP stack software.  For other, more general BSD Unix kernel level issues,
McKusick et al. have written a definitive book on the subject.

This document will only discuss issues that are specific to the PSO Stack port
itself.  Information that is general or generic in nature, such as the TCP/IP
architecture, the implementation details of the BSD TCP/IP source code, or the
BSD Unix kernel architecture, will not be covered.  The user of PSO Stack is
expected to have general knowledge of the TCP/IP and BSD Unix kernel
architecture.

In the following sections, references are made to a specific realtime OS,
VxWorks(TM), produced by Wind River Systems of Alameda, CA, USA.  While the
first port of PSO Stack has been targeted at this specific RTOS, the way this
port is done allows easy portability to other RTOS architectures.  All RTOS
specific features are encapsulated so that the rest of the stack is
independent of the underlying RTOS primitives and APIs.  For this reason, any
reference to VxWorks can be read as the generic term RTOS.  Likewise, the
VxWorks term task can be read as the generic term thread.


1.1 Overview

The PSO Stack consists of two parts.  The first part is a relatively
straightforward port of the C source code modules that implement TCP/IP, the
socket API and other I/O primitives.  These replace the original (VxWorks)
TCP/IP modules and provide similar functionality.  The second part consists of
original C source code modules that emulate enough functionality to glue the
first part onto an existing RTOS.  These include not only API related
interfaces, but also operational entities that are required for runtime
support, such as the various threads that provide context for the network
device driver I/O to the rest of the PSO Stack and RTOS.  Other support
functions, such as replacement memory allocation routines, are included to
improve the behavior of the PSO Stack software.

Originally the software from the NetBSD distribution (1.3.3) was taken and
compiled using the VxWorks/Tornado cross compilers.  Cross compilation using
these GCC based tools yields libraries that can be linked with target code
(the VxWorks BSP code and the original target library shipped with VxWorks).
Some of the routines in the original VxWorks library need to be replaced with
PSO Stack routines.  This is accomplished by making a copy of the original
library, deleting the existing routines from it, and linking the final BSP
code against both the trimmed copy of the VxWorks library and the new PSO
Stack library.

Most of the kernel files are from NetBSD 1.3.3 and a later release snapshot
(Feb, 1999).  The library functions under the lib directory are from the
NetBSD 1.3.2 CD-ROM.  NetBSD 1.3.X is based partly on 4.4 BSD.  All of the
networking files are therefore derived from 4.4 BSD software.

The PSO Stack libraries have been developed for a variety of host and target
environments, but not all combinations are tested due to lack of available
resources.  The following combinations are initially supported.

    Targets: Sparc/Solaris 2.X VxSim, SA110 (StrongARM) EBSA285, PC (pc486)
    Hosts:   Sparc/Solaris 2.7, PC/Windows 95


1.2 Scope of the software


The PSO Stack contents reflect a subset of BSD Unix kernel software.  The
directory structure is similar to that of the BSD Unix kernel source tree.
The directories kern, arch, dev, include, net, netinet, opt, sys, and vm have
the same names as the BSD kernel top level directories.  Only the necessary
portions of these directories are ported over.

In addition to these, new directories specific to PSO Stack are added.  These
are: pso, lib and vxworks.  The 'lib' directory is derived from BSD user level
library software, together with a separate library for better memory
allocation and freeing, which comes from Doug Lea's public domain distribution
of his malloc library.  The 'pso' directory contains PSO Stack specific
implementations of the functions that are used to glue the BSD software to
the RTOS being used.  The 'vxworks' directory contains the corresponding RTOS
specific parts of the software that are used by the 'pso' directory and the
main BSD software.  The software under 'vxworks' provides all the glue needed
for the VxWorks port of PSO Stack.  It is possible to port PSO Stack to a new
RTOS, for example pSOS, by creating a new directory for it which interfaces
to the portable 'pso' interface.

Functionally, this collection of software provides a complete replacement for
the existing VxWorks TCP/IP, socket API, device driver interface for network
devices, malloc and free.  Its intended purpose is to provide users with a
solid TCP/IP software product, with complete source code, which can work with
an existing RTOS environment.



2. Changes made to original 4.4 BSD software

Notable changes made for VxWorks are surrounded by '#ifndef ORIG'; in some
places '#ifdef ORIG' is used instead.  ORIG means original BSD code, so in our
environment ORIG is never defined.  Everything that is compiled only when ORIG
is not defined was newly added by PSO Systems Inc.  If you grep for ORIG you
will therefore see the changes from the original BSD code.
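
The ORIG convention can be pictured with a small sketch.  The function name
and the strings below are made up for illustration and do not come from the
PSO Stack sources; only the use of the ORIG symbol follows the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical function showing the ORIG guard.  ORIG is never defined
 * in a PSO Stack build, so the #ifndef branch is the one that compiles.
 */
static void describe_build(char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
#ifndef ORIG
    /* Code added for the port lives on this side of the guard. */
    snprintf(buf, len, "PSO Stack code path");
#else
    /* The untouched BSD code is kept here for reference. */
    snprintf(buf, len, "original BSD code path");
#endif
}
```

Compiling the same file with -DORIG would select the other branch, which is
what makes the guard a convenient marker when comparing against the BSD code.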


2.1 Socket API related issues

Unix system calls, such as the socket related routines, depend on
architecture specific system call interfaces which rely on "trap" based
mechanisms, so some modifications were made to accommodate the move to an
RTOS.  Although some RTOS system calls are implemented similarly via "trap"
interfaces, VxWorks and others (e.g. Nucleus) do not have the concept of
system calls via "traps" into kernel mode.  These operating systems run in a
single address space, in a single privilege mode (protected mode).  All
function calls are simply that -- function calls.  Therefore even system
calls are simply calls to subroutines.

The original system call interface relies on the fact that the user level
library calls that implement the socket API (for example) will "trap" into
the kernel to make a system call.  The kernel resident implementation of this
"trap" handling executes the function, obtaining the user supplied arguments
through an argument pointer, and returns the value.  This interface had to be
rewritten and simplified.

Most of this is done in sys_socket.c and uipc_syscalls.c.  The use of
syscallarg() and the argument pointer 'uap' is specific to the Unix kernel.
The VxWorks version does away with these conventions and expects direct
argument passing via the normal C function call interface.
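
The difference between the two calling conventions can be sketched as
follows.  This is a simplified illustration, not code from uipc_syscalls.c:
the structure, the function names and the placeholder do_listen() are all
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the Unix convention: the arguments arrive
 * packed in a structure that the trap handler points 'uap' at. */
struct listen_args_sketch {
    int s;
    int backlog;
};

/* Placeholder for the real socket layer work. */
static int do_listen(int s, int backlog)
{
    return (s >= 0 && backlog >= 0) ? 0 : -1;
}

/* Unix kernel style: unpack the arguments from 'uap'. */
static int sys_listen_unix_style(const struct listen_args_sketch *uap)
{
    assert(uap != NULL);
    return do_listen(uap->s, uap->backlog);
}

/* RTOS style as described above: an ordinary C function call. */
static int listen_direct(int s, int backlog)
{
    return do_listen(s, backlog);
}
```

Both entry points reach the same worker; the second simply removes the
packing and unpacking step that only exists to cross the trap boundary.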

The function falloc() has been rewritten so that it interfaces with the RTOS
to obtain a new object that can be used to map a file descriptor.
Correspondingly, getsock() has been modified to return the socket pointer for
a given file descriptor.  The function fd_get_value() returns the value
stored in the file descriptor.  A socket file descriptor, for example, has an
associated socket pointer stored within it.  This is specific to sockets; the
RTOS may implement other file descriptors that are not related to sockets,
and those file descriptors will use the value differently.  In VxWorks,
iosFdSet() and iosFdValue() are used to set (store) and get the socket
pointer value for a given 'ios' file descriptor.
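
A minimal sketch of the descriptor-to-socket mapping follows.  It uses a
plain array in place of the RTOS descriptor table, and every name ending in
_sketch is hypothetical; the real falloc(), getsock() and fd_get_value()
have their own signatures, which are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_FD 16

struct socket_sketch { int so_state; };

/* Stand-in for the RTOS descriptor table: one opaque value per slot. */
static void *fd_values[SKETCH_MAX_FD];

/* falloc()-like step: claim the first free slot for a new socket. */
static int falloc_sketch(struct socket_sketch *so)
{
    assert(so != NULL);
    for (int fd = 0; fd < SKETCH_MAX_FD; fd++) {
        if (fd_values[fd] == NULL) {
            fd_values[fd] = so;
            return fd;
        }
    }
    return -1;  /* table full */
}

/* fd_get_value()-like step: return whatever is stored in the slot. */
static void *fd_get_value_sketch(int fd)
{
    if (fd < 0 || fd >= SKETCH_MAX_FD)
        return NULL;
    return fd_values[fd];
}

/* getsock()-like step: for socket descriptors the value is the socket. */
static struct socket_sketch *getsock_sketch(int fd)
{
    return (struct socket_sketch *)fd_get_value_sketch(fd);
}
```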

The socket API is functionally equivalent except in signal behavior.  The
original VxWorks socket API does not support the asynchronous I/O signal SIGIO
because it is difficult to implement in VxWorks, and the PSO Stack socket API
has the same limitation.  In multi-threaded environments, use of SIGIO is
usually unnecessary, since most notification is done via semaphores and thread
switching is fast enough to dedicate a thread to each I/O channel.

The 'struct socket' (the internal representation of a socket, as used by the
kernel software) contains additions to the data structure that are necessary
for this port.  These are defined in the socketvar.h file.  The changes are
mostly due to the channels that are employed by the timeout/wait mechanism in
the kernel.  The BSD Unix sleep/wakeup mechanism is converted to use
semaphores in VxWorks, and corresponding changes had to be made to the data
structures to hold semaphore values.  This change is further propagated to
the arguments of the tsleep call.

There are a few PSO specific additions to the socket API.  One has to do with
extending support for facilities that are either signal based (SIGIO, for
example) or nonexistent in the original code.  The original VxWorks socket API
does not support asynchronous SIGIO delivery when an I/O event is available
for a socket.  This omission is due partly to the way the VxWorks signal
mechanism is implemented, and partly to the way the VxWorks designers intended
the system to be used.  However, it is occasionally necessary to employ
completely asynchronous I/O over sockets, especially in embedded real-time
environments.  Thus, PSO has added an extension to the socket API that allows
users to register callback routines that are triggered by the protocol and
socket layer software when there is a need to do so.  These occasions are
mostly equivalent to the places where the original Unix software intended
signal delivery (SIGIO or SIGPIPE).  Users indicate what types of events to
monitor via a flag passed during the registration process.  Based on this
flag, the stack software will trigger a call to the user's registered callback
routines.
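
The registration mechanism can be sketched roughly as below.  The actual PSO
Stack call names, flag names and callback signature are not given in this
document, so everything here (SK_EVT_*, sock_cb_register(), and so on) is an
assumption made for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event flags. */
#define SK_EVT_READABLE 0x1u
#define SK_EVT_WRITABLE 0x2u
#define SK_EVT_BROKEN   0x4u  /* where Unix would have raised SIGPIPE */

typedef void (*sock_event_cb)(int fd, unsigned events, void *arg);

struct sock_cb_sketch {
    unsigned      mask;  /* events the user asked to monitor */
    sock_event_cb cb;
    void         *arg;
};

/* Registration: remember the callback and the events of interest. */
static void sock_cb_register(struct sock_cb_sketch *s, unsigned mask,
                             sock_event_cb cb, void *arg)
{
    assert(s != NULL && cb != NULL);
    s->mask = mask;
    s->cb = cb;
    s->arg = arg;
}

/* Called by the stack where the Unix code would have posted a signal. */
static void sock_cb_notify(const struct sock_cb_sketch *s, int fd,
                           unsigned events)
{
    unsigned hit = events & s->mask;

    if (s->cb != NULL && hit != 0)
        s->cb(fd, hit, s->arg);
}

/* Example user callback: count how many notifications were delivered. */
static void count_events(int fd, unsigned events, void *arg)
{
    (void)fd;
    (void)events;
    ++*(int *)arg;
}
```

Events that were not requested in the mask are filtered out before the
callback runs, which mirrors the role of the registration flag.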

Another addition, similar to VxWorks' own addition to the original BSD socket
API, has to do with connection establishment with a timeout.  In the original
BSD API, the connect() call does not take a timeout.  Since TCP can take a
long time to connect, and longer still for a connection attempt to fail and
the connect() call to return to the user, it is necessary to have a timeout
mechanism associated with connection establishment in realtime environments.
Simply hanging on a connection attempt for a long, non-deterministic amount of
time is not compatible with realtime environments.  For this reason a new API
call which takes an additional timeout argument is created.  This new call
operates on a socket by first marking it as non-blocking, and then performing
select() with a timeout until the connection attempt is deemed either
successful or failed.

Beyond these minor changes, the socket layer API is compatible with the
original BSD software, and porting code from a BSD Unix based implementation
should be very easy to do.




2.2 TCP and IP protocols

Most of the TCP and IP software is preserved in its original condition.
However, a few things in the TCP stack code had to be modified to accommodate
the multi-threaded runtime environment.  In a preemptive multi-threaded
environment like VxWorks, some of the code that works correctly in the Unix
kernel (BSD Unix) does not work.

In tcp_input.c, for example, there is a place where the state transition to
the disconnected state happens after other modules have been notified of the
disconnection.  This does not work in multi-threaded environments because, as
soon as the notification is sent via semaphore, the thread pending on that
condition can run immediately.  The state transition has to be marked before
that happens.  Otherwise, the state is one step behind the rest of the code
that might start running in a different thread context, and malfunction of
the stack code can (and does) happen.

The specific code in tcp_input.c has to do with marking the state of the TCP
connection as TIME_WAIT before telling others that it is disconnected.  The
newer 4.4 BSD code already has many fixes that address similar problems which
existed in the 4.3 release.  Namely, the bugs that occurred in multi-threaded
environments due to code which wakes up the readers before appending data
have already been fixed in the original 4.4 distribution.
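
The ordering problem can be modelled in a few lines.  This is a single
threaded simulation, not the tcp_input.c code: the notification function
stands in for giving a semaphore and records what a waiter would see if it
were scheduled at that instant.  The prior state FIN_WAIT_2 and all names
are assumptions made for the sketch.

```c
#include <assert.h>
#include <stddef.h>

enum tcp_state_sketch { ST_FIN_WAIT_2, ST_TIME_WAIT };

struct conn_sketch {
    enum tcp_state_sketch state;
    enum tcp_state_sketch seen_by_waiter;  /* what the woken thread saw */
};

/* Stand-in for giving a semaphore.  On a preemptive RTOS the waiter may
 * run right here, before the caller executes its next statement. */
static void notify_disconnected(struct conn_sketch *c)
{
    c->seen_by_waiter = c->state;
}

/* Ordering that is only safe when the waiter cannot preempt us. */
static void fin_received_unsafe(struct conn_sketch *c)
{
    assert(c != NULL);
    notify_disconnected(c);
    c->state = ST_TIME_WAIT;
}

/* Ordering needed under preemption: mark the state, then notify. */
static void fin_received_safe(struct conn_sketch *c)
{
    assert(c != NULL);
    c->state = ST_TIME_WAIT;
    notify_disconnected(c);
}
```

With the unsafe ordering the waiter observes the stale state; with the safe
ordering it observes TIME_WAIT.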

The IP level code is the same as original except the ip_id (the ID for each IP
datagram) is generated based on the timer tick via timer_tick_get() specific
to PSO Stack environment.

2.3 mbuf and memory resource issues

Original VxWorks' memory allocation and free routines have bad fragmentation
behavior that can leave the system memory pool in severely fragmented state
after a lot of allocation and freeing.  To address this problem a better
memory allocator/freer is ported.  This is located under
lib/libc/stdlib/malloc.c.  

This version of malloc contains two different algorithms.  One is the BSD
based Kingsley "bucket" allocator, which has its own distinctive
fragmentation behavior.  The other is Doug Lea's well tested allocator, which
tries to minimize fragmentation while keeping speed and space overhead low.
USE_BSD and USE_DL are the ifdefs used to enable one or the other.  USE_DL is
turned on by default since it seems to behave better in network-heavy
environments.

The mbuf code has been left pretty much untouched.  The NetBSD implementation
of mbuf has a lot of enhancements over the original 4.4 BSD.  It already
includes all of the enhancements that are usually applied to accommodate
single address space memory systems like VxWorks.  In the past, a lot of
changes were required when porting BSD mbuf code to VxWorks, but the NetBSD
code already addresses all the issues with respect to the way cluster mbufs
are handled.  For example, MFREE() already knows how to detect freeing of
cluster mbufs (marked M_EXT), and the mbuf data structure already has proper
place holders for the cluster related 'free routine' callback that is placed
there by the allocator.  This kind of functionality usually had to be added
to the existing source base when porting mbuf code from Unix to VxWorks.
This is because Unix code tended to assume a VM architecture where cluster
mbufs were allocated out of pages of memory that can be copied cheaply by
"flipping" page reference bits.  Different Unix systems tended to implement
this differently, and often cluster mbuf support did not allow allocation of
different size clusters at all.  The NetBSD implementation of cluster mbufs
solves all of these problems via the use of ext_size and reference pointers,
along with new macros like MEXTADD().
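
The idea of external storage with a registered free routine can be sketched
with a much simplified structure.  This is not the NetBSD struct mbuf and
does not use the real MEXTADD() or MFREE() macros; the field and function
names below are hypothetical and only mirror the concepts named above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MF_EXT 0x1  /* plays the role of M_EXT in this sketch */

struct mbuf_sketch {
    int     flags;
    char   *data;
    size_t  len;
    /* External storage bookkeeping. */
    char   *ext_buf;
    size_t  ext_size;
    void  (*ext_free)(char *buf, size_t size, void *arg);
    void   *ext_arg;
};

/* Attach caller owned storage, in the spirit of MEXTADD(). */
static void mbuf_ext_add(struct mbuf_sketch *m, char *buf, size_t size,
                         void (*free_fn)(char *, size_t, void *), void *arg)
{
    assert(m != NULL && buf != NULL);
    m->flags |= MF_EXT;
    m->data = buf;
    m->len = 0;
    m->ext_buf = buf;
    m->ext_size = size;
    m->ext_free = free_fn;
    m->ext_arg = arg;
}

/* Free path: external storage goes back through the registered routine,
 * then the (heap allocated) header itself is released. */
static void mbuf_free_sketch(struct mbuf_sketch *m)
{
    if ((m->flags & MF_EXT) && m->ext_free != NULL)
        m->ext_free(m->ext_buf, m->ext_size, m->ext_arg);
    free(m);
}

/* Example driver free routine: count the buffers handed back. */
static void driver_buf_returned(char *buf, size_t size, void *arg)
{
    (void)buf;
    (void)size;
    ++*(int *)arg;
}
```

Because the size travels with the buffer, clusters of different sizes can
share the same free path.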

Currently the mbuf system in this release does not set up a static pool of
memory separate from the rest of the system.  This is in part intentional.
Having to carve out a big chunk of system memory for mbufs only tends to
degrade the efficiency of memory usage.  Since the new malloc has very good
speed and fragmentation behavior, this should work out well.

However, if for any reason this is not desired, it is easy to create a
separate pool of memory for mbuf allocation.  Changes need to be made to
mbinit(), which sets up pools.  Currently pool_init() is minimal -- it does
not set up any real pools.  Instead it records the unit of allocation
required, which is later used to allocate memory from the system pool
dynamically.  To get a separate pool, pool_init() can be made to allocate a
large block of the requested size from the system malloc pool, and
pool_get() requests can later be satisfied out of this new pool.
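
One way such a separate pool could be built is sketched below: a single
large block is taken from malloc and carved into fixed size items kept on a
free list.  The names and signatures are hypothetical and do not match the
real pool_init() and pool_get(); a real version would also need locking
against other threads and interrupt level code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct pool_sketch {
    char  *base;       /* one large block taken from malloc */
    void  *free_list;  /* singly linked list of free items */
    size_t item_size;
};

static int pool_init_sketch(struct pool_sketch *p, size_t item_size,
                            size_t nitems)
{
    assert(p != NULL && nitems > 0);

    /* A free item stores the "next" pointer in its first bytes, so
     * round the item size up to a multiple of the pointer size. */
    if (item_size < sizeof(void *))
        item_size = sizeof(void *);
    item_size = (item_size + sizeof(void *) - 1)
                / sizeof(void *) * sizeof(void *);

    p->base = malloc(item_size * nitems);
    if (p->base == NULL)
        return -1;
    p->item_size = item_size;
    p->free_list = NULL;
    for (size_t i = 0; i < nitems; i++) {
        void *item = p->base + i * item_size;
        *(void **)item = p->free_list;
        p->free_list = item;
    }
    return 0;
}

/* Returns NULL when the pool is exhausted. */
static void *pool_get_sketch(struct pool_sketch *p)
{
    void *item = p->free_list;

    if (item != NULL)
        p->free_list = *(void **)item;
    return item;
}

static void pool_put_sketch(struct pool_sketch *p, void *item)
{
    *(void **)item = p->free_list;
    p->free_list = item;
}
```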


2.4 The device driver issues

The PSO Stack as supplied comes with a working loopback device driver which is
used to demonstrate the proper function of the stack software.  When
integrating real device drivers, a number of additional steps need to happen.
Although PSO Stack includes ported software that handles PCI devices (probing,
configuring, etc.), more code needs to be incorporated to actually interface
with the many PCI style devices available commercially.  It is recommended
that users port existing working drivers from VxWorks to PSO Stack.  The only
changes required are to the mbuf related code and to the way interrupts are
queued to the thread level software.  The mbuf related code needs to change
because the definitions of mbuf, as well as the associated macros for handling
mbufs, have changed.

Functions such as build_cluster() and do_protocol_with_type() are VxWorks
specific and no longer required by the PSO Stack.  Any use of such calls will
need to be replaced.  build_cluster() can be replaced with the MEXT related
macros that are used to allocate and construct cluster mbufs.  The
do_protocol() routines can be replaced either by direct calls to ether_input()
or by code written by hand based on the code inside ether_input() (as supplied
in PSO Stack net/if_ethersubr.c).  If your driver specifically references
netJobAdd() (some do, some don't), it needs to be replaced by schednetisr()
(see if_ethersubr.c).

PSO Stack handles incoming interrupts a little differently than the original
VxWorks.  Both attempt to minimize interrupt latency by doing a minimal amount
of work at interrupt level (that of queueing further work to be performed
later at thread level).  However, they differ in the way threads are used.
VxWorks uses a single overloaded thread (netTask) to do everything network
related, including handling incoming interrupts from one or more network
devices.  PSO Stack uses a dedicated thread for each interface, and each can
have a different priority.  This allows, for example, prioritization of
packet processing and control of data flows based on the type of network
devices attached to a system.  It also does not suffer from the congestion
that netTask often suffers from.  Additionally, netTask can hang because of a
single non-fatal error in any of a number of areas in the network code and
drivers.  Unlike netTask, PSO Stack will continue to function even when one
of the devices dies.



3. Replacements for VxWorks facilities

3.1 netTask replacements

For each of the network devices a separate thread is created to handle device
specific events (such as interrupts).  This is done from each of the device
drivers via init_netisr_handler().  The body of the code which implements the
thread is passed as a function pointer to init_netisr_handler(), which creates
it as a thread.  This means that each device driver is able to customize what
its own thread is supposed to do.

Note that even the loopback driver (if_loop.c) creates a thread for its work.
The loopback driver doesn't use real hardware interrupts at all, since it is a
pseudo driver that simply loops the packets over a queue, but it does use a
thread of its own to hand off the work of handling a packet that was sent by
an application thread.  An application thread hands off the packet to be
delivered via the socket API.  The packet sent is queued and eventually
handled in some other thread's context (the reader), but the work of handing
over the queued packet to the reader (via the IP -> TCP stack and back up to
the socket API on the reader side) is done by the device driver thread.

The VxWorks netTask also handles things other than interrupts from device
drivers.  It serves as the central context for all network related functions,
delayed execution of function calls, and general context provider for all
protocol related software.  PSO Stack provides a generic message handling
thread, implemented in vxworks/vxworks_port.c as queue_message_thread.  It is
possible to use this thread for all non-interrupt related networking functions
which require context.  It is also possible to create another thread to
further subdivide this work, with each part served by an independent thread
with its own priority.  Currently queue_message_thread (whose body is defined
in queue_message_loop()) is mainly used for various timers triggered by the
timeout interrupt handler (in VxWorks, watchdog timers).  The interrupt level
code that handles timeouts queues messages containing function pointers,
which are then executed at thread level in the queue_message_thread context.
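
The hand-off of function pointers from interrupt level to thread level can
be sketched as a small ring of messages.  This is not the code from
vxworks/vxworks_port.c; the names are hypothetical, and a real version would
block the thread on an RTOS message queue or semaphore and protect the ring
against concurrent access.

```c
#include <assert.h>
#include <stddef.h>

#define QMSG_SLOTS 8  /* power of two, so index wrap-around stays valid */

typedef void (*qmsg_fn)(void *arg);

struct qmsg {
    qmsg_fn fn;
    void   *arg;
};

struct qmsg_ring {
    struct qmsg slots[QMSG_SLOTS];
    unsigned    head;  /* next slot to run */
    unsigned    tail;  /* next slot to fill */
};

/* Interrupt level side: record the work and return immediately. */
static int qmsg_post(struct qmsg_ring *q, qmsg_fn fn, void *arg)
{
    assert(q != NULL && fn != NULL);
    if (q->tail - q->head == QMSG_SLOTS)
        return -1;  /* ring full */
    q->slots[q->tail % QMSG_SLOTS].fn = fn;
    q->slots[q->tail % QMSG_SLOTS].arg = arg;
    q->tail++;
    return 0;
}

/* Thread level side: one pass of the loop body.  Returns the number of
 * queued functions that were executed. */
static int qmsg_run_pending(struct qmsg_ring *q)
{
    int n = 0;

    while (q->head != q->tail) {
        struct qmsg m = q->slots[q->head % QMSG_SLOTS];

        q->head++;
        m.fn(m.arg);
        n++;
    }
    return n;
}

/* Example timeout handler body to run at thread level. */
static void tick_handler(void *arg)
{
    ++*(int *)arg;
}
```

Nothing runs at post time; the handlers execute only when the thread side
drains the ring, which is the point of deferring the work.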

To summarize, the netTask's functions are subdivided into multiple threads.
The interrupt handling (deferred thread level interrupt event handling) is
done in network device driver specific threads, each of which is created by
the device driver writer (or porter).  One such thread is created per active
instance of a network device driver of any kind.  Almost all other work that
requires an independent thread context (such as timer event execution at
thread level) is done via queue_message_thread.


3.2 Other support code replacements

Two categories of further replacements of existing facilities are made.
The first has to do with functional replacements.  This includes routines that
actually attempt to emulate the original functionality of the routine in the
underlying RTOS.  The second has to do with stubbing out routines that are no
longer needed (or not yet needed).  These are simply null functions that
exist only to satisfy the linker, so that the object modules link
successfully into a usable executable.

The first category includes such routines as netLibInit() and sockLibInit().
The call to netLibInit() is made by the start-up phase code within VxWorks
that is included as part of the BSP source tree (under target/config/all).
This call is meant to initialize network software, so it is a good place for
PSO Stack to call init_bsd_compat(), which initializes the mutexes required in
the PSO Stack and calls a sequence of routines that establish initial runtime
conditions, similar to the way BSD Unix initialization happens.  The
difference is that we confine ourselves to network related functions such as
mbinit(), soinit(), ifinit() and domaininit().  In other words, the mbuf
subsystem, socket subsystem, network interface driver subsystem, and protocol
domain subsystem, including protocol specific initializations, are initialized
here.  The routine sockLibInit() is equivalent to a call to init_socket_lib().
This creates a hook into the VxWorks I/O subsystem (ioLib and iosLib) to allow
socket objects.  After this point, sockets appear as I/O objects that can be
read, written, and ioctl'ed.

The second category includes a number of routines (not a complete list) that
are either not needed for now or simply unimplemented: ipAttach, ipLibInit,
rawIpLibInit, rawLibInit, udpLibInit, udpShowInit, tcpLibInit, tcpShowInit,
icmpLibInit, igmpLibInit, netShowInit, ifMaskSet, ifAddrSet, bsdSockLibInit,
sockLibAdd, connectWithTimeout, ifAddrGet, routeAdd, netJobAdd, ulipInit,
elcdetach, ultradetach, slipInit, eiattach, elcattach, ultraattach,
eexattach, eltattach, eneattach.  It may be necessary to stub out more or
fewer routines depending on the version of your VxWorks development
environment.



3.3 Functional replacements

Some routines are not replaced but implemented under different names.  These
include such routines as ifconfig() and netstat() (in PSO Stack) vs.
ifAddrSet(), icmpstatShow(), etc. (in VxWorks).  The PSO Stack tries to be
more BSD Unix compatible in these kinds of support routines, since more people
are familiar with the Unix style commands that are normally used to set up and
control the networking facilities.  To set a network device's IP address, for
example, one would call ifconfig("dev0", "addr", inet_addr("123.33.2.38"))
instead of calling the non-standard routine ifAddrSet().  This is a minor
difference, but well worth changing.  A lot of original BSD code can be reused
as a result.

Examples of some of these calls and how they are used can be found in the
pso.c file, which is included as part of the target BSP directory addition.
In pso.c, the first routine, pso_init(), calls a number of routines to
initialize the stack.  init_bsd_compat() and init_socket_lib() are called to
initialize the BSD compatibility and socket API related facilities.  Then it
is possible to initialize the loopback device driver by calling the
loattach() routine.  Note that this sequence of initialization is similar to
VxWorks' own network initialization in usrNetwork.c, but much simpler.  Users
can change pso_init() by adding more initialization, for user specific device
drivers, for example.
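
The initialization order described above can be sketched as follows.  This
is not the contents of pso.c: the routines are replaced by local stand-ins
that only record the order of the calls, their signatures are assumptions,
and the interface name "lo0" and the loopback address are illustrative
choices.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Records the order in which the stand-in routines were called. */
static char init_trace[128];

static void trace_step(const char *step)
{
    size_t used = strlen(init_trace);

    snprintf(init_trace + used, sizeof(init_trace) - used, "%s;", step);
}

/* Stand-ins for the PSO Stack routines named in the text.  The real
 * routines do the actual work; these signatures are assumed. */
static void init_bsd_compat(void) { trace_step("bsd_compat"); }
static void init_socket_lib(void) { trace_step("socket_lib"); }
static void loattach(void)        { trace_step("loattach"); }

static int ifconfig(const char *ifname, const char *what, in_addr_t value)
{
    assert(ifname != NULL && what != NULL);
    (void)value;
    trace_step("ifconfig");
    return 0;
}

/* Sketch of a pso_init()-like sequence: BSD compatibility layer first,
 * then the socket library, then the loopback driver and its address. */
static int pso_init_sketch(void)
{
    init_bsd_compat();
    init_socket_lib();
    loattach();
    return ifconfig("lo0", "addr", inet_addr("127.0.0.1"));
}
```

A user specific driver would be attached and configured in the same way,
after the loopback driver.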



3.4 PSO Additions


There are two directories under the main tree that are created by PSO.  These
are called pso and vxworks.  The pso directory contains the RTOS independent
implementation of interfaces to the underlying RTOS (e.g. VxWorks), as well as
additional code that interfaces with the BSD software to fit it into a
traditional RTOS paradigm.  The vxworks directory implements the API required
by PSO Stack for the VxWorks RTOS specifically.

Users should pay attention to the way data structures that are unique to
VxWorks are exposed to the PSO Stack software.  Since the original BSD Unix
software and VxWorks share similar concepts, there are data structures and
names of data structures that clash if both systems' include files are used
at the same time.  This poses a dilemma for programmers who port software
based on BSD Unix to VxWorks.  On one hand, the data structures unique to
VxWorks must be preserved for binary compatibility at the function call level.
On the other hand, knowing too much detail of VxWorks data structures requires
inclusion of VxWorks header files that can clash with BSD header files.

The way PSO Stack uses VxWorks header files handles this problem.  The only
place VxWorks header files are used is inside files under vxworks
subdirectory.  Any exposed data structure elements outside vxworks directory
are exposed as void pointers.  For example, the semaphore structure is exposed
as void pointer.  
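
The void pointer approach can be sketched in a single file that shows both
sides of the boundary.  The names (pso_sem_t, pso_sem_create() and so on)
and the toy struct native_sem are hypothetical; in the real port the second
half would live under the vxworks directory and use the VxWorks semaphore
calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* --- What a portable header might expose (no RTOS headers needed) --- */
typedef void *pso_sem_t;

static pso_sem_t pso_sem_create(int initial);
static int       pso_sem_take(pso_sem_t sem);
static void      pso_sem_give(pso_sem_t sem);
static void      pso_sem_delete(pso_sem_t sem);

/* --- What the RTOS specific file might contain --- */
struct native_sem { int count; };  /* stands in for the RTOS's own type */

static pso_sem_t pso_sem_create(int initial)
{
    struct native_sem *s = malloc(sizeof *s);

    if (s != NULL)
        s->count = initial;
    return s;
}

static int pso_sem_take(pso_sem_t sem)
{
    struct native_sem *s = sem;

    assert(s != NULL);
    if (s->count == 0)
        return -1;  /* a real port would block here instead */
    s->count--;
    return 0;
}

static void pso_sem_give(pso_sem_t sem)
{
    struct native_sem *s = sem;

    assert(s != NULL);
    s->count++;
}

static void pso_sem_delete(pso_sem_t sem)
{
    free(sem);
}
```

Code outside the RTOS specific file only ever sees pso_sem_t, so it never
needs to include a header that defines the native structure.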

This problem is acutely illustrated in the way ioctl routines are implemented.
Due to the system specific definitions of the ioctl 'commands' (e.g. FIONBIO),
files that reference such commands must take care to avoid conflicts between
BSD and VxWorks.  This problem has influenced the way VxWorks select() support
is incorporated into PSO Stack.  VxWorks implements select() via the ioctls
FIOSELECT and FIOUNSELECT.  These ioctls, and the way they are used, are
unique to VxWorks.  The vxworks/vxworks_port.c file exists because of this
issue.

Note that it is possible to ignore VxWorks' resident select support by
replacing VxWorks I/O subsystem with BSD I/O interfaces altogether.  For PSO
Stack software, we have chosen not to do this.  Our goal has been to preserve
as much of the BSD software while using as much of existing VxWorks facilities
as possible.  (If this were not our goal, why bother using VxWorks at all?
One could simply use BSD kernel as embedded OS.)




4. Packet forwarding and ATM specific considerations


4.1 Buffer handling

When a packet arrives from an interface, it is sent up to the IP layer code,
which either forwards the packet or keeps it.  If the packet is destined for
the local host, it is consumed locally.  Otherwise, forwarding of the packet
happens via the routing code at the IP layer.

The reception and IP layer handling happen in the context of the driver
specific thread which is created by the user.  This thread, which is created
when init_netisr_handler() is called in the device driver initialization
routine, provides thread level context for handling reception and other
interrupt events at thread level rather than at interrupt level (to minimize
interrupt latency).  When it receives a packet, it will typically call
ether_input(), as BSD Unix drivers do.  It may also call ipintr() directly,
as in the case of if_loop.c (the loopback network driver).  ether_input()
will eventually call ipintr() as well, after decoding the ethernet header
information and performing a minimal set of sanity checks.  In VxWorks
environments, drivers that are written for VxWorks may also call
ether_input(), or they may call do_protocol() instead (or sometimes
do_protocol_with_type()).  do_protocol() performs a similar function to
ether_input().  Eventually the packet ends up in the IP layer code (via
ipintr()), which will either consume the packet locally by sending it up to
the upper layer protocols (UDP or TCP), or forward the IP packet out via
ip_forward().  ip_forward() determines routing information via rtalloc().

The driver thread provides context for all of the routing level code and the
forwarded packet gets queued (or sent immediately) via ip_output() which will
send the packet out to the device that is resolved via routing table lookup. 

When forwarding packets between ATM and ethernet devices, one has to take
into account the fact that the MTU for ATM is substantially larger than that
for ethernet (typically 8K vs. 1.5K bytes, and the ATM MTU can theoretically
be as large as 64K).  To avoid copying data as much as possible, the driver
writer should examine the code path that carries the data from ATM to
ethernet and vice versa.

In an ideal situation, large pieces of memory owned by the driver (8K for
ATM) will initially be loaned to the ATM device by the driver.  The ATM device
will DMA data into these buffers and interrupt the driver's interrupt handler,
which will notify the driver thread via semaphore (schednetisr()).  The driver
thread will perform the input side work of the protocol as described above,
and possibly forward the packet out to the ethernet device via ip_output().
The mbuf passed to ip_output(), which goes down to the ethernet driver's
output routine at this point, will be a "cluster" mbuf which has a pointer to
the data portion.  The data portion here is a piece of memory, allocated by
the input device (ATM), that contains the IP datagram content.  Once the
packet has been sent by the output device (ethernet), the mbuf will be freed,
at which point the free routine registered with this cluster mbuf will be
called.  The free routine is registered when the cluster mbuf is created (see
the MEXTADD macro in sys/mbuf.h), with a pointer to an internal function which
resides in the device driver (ATM in this example).  Since the MTU sizes
differ, the 8K buffer needs to be split into multiple buffers and given to
the ethernet device for output.  Avoiding copying during this splitting phase
can be tricky, since multiple ethernet buffers belong to one ATM buffer.  For
optimal, copy-avoiding packet forwarding, the ethernet driver will need to be
optimized to handle these buffer size differences properly.
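
The buffer ownership side of this splitting problem can be sketched with a
reference count on the loaned buffer.  This is only an illustration with
hypothetical names: it describes the large buffer as several pieces without
copying data, and returns the buffer to its owner when the last piece is
done.  It deliberately ignores the protocol side of the problem (each IP
fragment needs its own header) and any locking.

```c
#include <assert.h>
#include <stddef.h>

struct loan_buf {              /* one large buffer loaned by the ATM driver */
    char  *base;
    size_t size;
    int    refs;               /* ethernet sized pieces still outstanding */
    void (*give_back)(struct loan_buf *);
};

struct frag {                  /* one piece handed to the ethernet driver */
    struct loan_buf *owner;
    char            *data;
    size_t           len;
};

/* Describe the first 'used' bytes of 'lb' as pieces of at most 'mtu'
 * bytes, without copying.  Returns the number of pieces, or -1 if 'out'
 * has too few entries. */
static int split_no_copy(struct loan_buf *lb, size_t used, size_t mtu,
                         struct frag *out, int max_out)
{
    int n = 0;

    assert(lb != NULL && out != NULL && mtu > 0 && max_out >= 0);
    assert(used <= lb->size);
    if ((used + mtu - 1) / mtu > (size_t)max_out)
        return -1;
    for (size_t off = 0; off < used; off += mtu) {
        out[n].owner = lb;
        out[n].data = lb->base + off;
        out[n].len = (used - off < mtu) ? used - off : mtu;
        lb->refs++;
        n++;
    }
    return n;
}

/* Called as each piece finishes transmitting.  The last one returns the
 * whole buffer to the driver that loaned it. */
static void frag_done(struct frag *f)
{
    if (--f->owner->refs == 0 && f->owner->give_back != NULL)
        f->owner->give_back(f->owner);
}

/* Example owner callback: count the buffers that came back. */
static int returned_count;

static void count_return(struct loan_buf *lb)
{
    (void)lb;
    returned_count++;
}
```

For an 8192 byte buffer and a 1500 byte limit this yields six pieces, and
the owner's routine runs once, after the sixth piece completes.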


4.2 Thread context switching

In hard realtime environments where response time is critical, typical
preemptive, strictly priority driven scheduling behavior is essential.
Minimal interrupt latency and response time are critical to realtime system
behavior.  This is not always the most important factor in other embedded
environments where throughput is of the highest concern.  For example, in an
embedded packet router or file server environment it is sometimes better to
optimize for the best CPU usage, maximizing throughput while minimizing the
overhead per unit of work.  Realtime response is typically achieved with a
thread based architecture which has fast thread context switching, with
strict priorities that reflect the event handling priorities.  The thread
switching overhead in these environments can sometimes hinder overall
throughput, given fixed hardware performance capability.  In the worst case,
the system may spend too much time context switching, and not enough time
doing the throughput related work.

For this reason, many throughput oriented systems such as routers tend to
run a very simple loop as the realtime kernel.  In these systems, the whole
runtime is governed by a hand optimized loop which services events in
sequence, the order of which is determined by careful tuning and
experimenting.  These systems tend to be either completely single threaded or
based on cooperative multi-threading.  The latter is a system where multiple
thread contexts are supported (as in preemptive systems), but thread
scheduling only happens synchronously, when a thread requests to yield the
CPU to other threads.  That is, a thread can only run when the thread that is
currently running explicitly yields the CPU.  In both of these environments,
thread switching overhead is minimized.  In the first case, there is no
switching.  In the second case, switching happens only when it is needed:
the system still benefits from thread abstraction and priorities, but the
context switches are hand-tuned so that they only happen when they need to.
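
The cooperative model described above can be sketched as a run loop in which
each "thread" is a step function that does a bounded piece of work and then
returns, the return being its explicit yield.  This is only an illustration
under simplifying assumptions: there are no separate thread stacks, and all
names (coop_task, coop_run_once, coop_run, drain_one, COOP_MAX_PRIO) are
hypothetical and not part of PSO Stack or VxWorks.

```c
#include <assert.h>
#include <stddef.h>

#define COOP_MAX_PRIO 3   /* hypothetical: priorities run from 0 to 3 */

/* A cooperative "thread": returns nonzero while it still has work pending.
 * Returning from the step function is the explicit yield of the CPU. */
typedef int (*coop_step_fn)(void *arg);

struct coop_task {
    coop_step_fn step;    /* work routine */
    void        *arg;     /* private state of the task */
    int          prio;    /* larger value is serviced earlier in a pass */
};

/* One pass of the loop: service every task once, highest priority first.
 * Returns the number of tasks that reported more work. */
int coop_run_once(struct coop_task *tasks, size_t ntasks)
{
    int pending = 0;
    int prio;
    size_t i;

    assert(tasks != NULL || ntasks == 0);
    for (prio = COOP_MAX_PRIO; prio >= 0; prio--) {
        for (i = 0; i < ntasks; i++) {
            if (tasks[i].prio == prio && tasks[i].step(tasks[i].arg))
                pending++;
        }
    }
    return pending;
}

/* Keep making passes until no task has work left. */
void coop_run(struct coop_task *tasks, size_t ntasks)
{
    while (coop_run_once(tasks, ntasks) > 0)
        continue;
}

/* Example step: consume one unit from a counter of pending work. */
int drain_one(void *arg)
{
    int *remaining = arg;

    if (*remaining > 0)
        (*remaining)--;
    return *remaining > 0;
}
```

A switch happens only where a step function returns, so the cost and the
order of servicing are fully under the control of whoever tunes the loop.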

It may be worthwhile to explore cooperative multi-tasking in the context of
the VxWorks runtime.  Users can create, within the VxWorks environment, a
subsystem that runs in cooperative fashion.  This is possible because the
VxWorks runtime is very simple and can be viewed as just a large program
running in a single address space.  Preemptive scheduling is still going on,
but you can always bypass it (i.e. if a task that currently holds the CPU at
a given priority wants to run forever and does not yield the CPU, no lower
priority thread will run).



5. Strategies for optimizing device drivers


The number one overhead in networking I/O is data copying.  Avoiding data
copying is the quickest way to optimize any network device driver.  Many
network drivers employ the idea of "loaning" buffers to avoid copying them.
By pre-allocating the data structures used for incoming packets and loaning
them out to the network interface hardware, the CPU avoids having to copy
the data out of device memory into mbufs.  Instead the network hardware can
DMA the data directly into the loaned buffer, which is pointed to by the mbuf
header.  The DMA'ed data is therefore already in mbuf format and can be
passed upwards to the IP stack without copying.  Similar things are done for
outgoing packets.  The mbufs contain clusters that can be copied virtually
via references (that is, the bytes in the data buffer are not physically
copied).  The mbuf data can then be given (loaned) to the output engine of
the network hardware, which will DMA the data directly out of the buffer
space.  When transmission is complete, an interrupt is generated so that
such mbufs can be recycled.

When loaning mbufs to devices and upper layer protocols, it is important
for the device driver writer to keep in mind the upper limit for the
number of mbufs to be loaned out.  Loaning out more mbufs does not
always mean better performance.  A good upper limit for the loaned buffer
count can be determined experimentally, by trying various configurations
and running extensive tests, and then used for the driver.  It might be
prudent to make this a tunable global variable.  Not all drivers are written
to be this flexible, and depending on the original VxWorks driver (if you
are porting one), you might need to refine this aspect.

Another issue to be aware of is that an application which behaves
erratically can cause the loaning mechanism to lock up.  For example, if an
application thread which should be reading data off of the socket level
buffer queue hangs and stops reading packets from the queue (or does not
read from the queue in timely fashion), then buffers will be queued up to
the maximum amount allowed.  This maximum amount can be as large as 64k
bytes, as determined via the socket options for the socket level buffers
(there are two: SO_RCVBUF for the receive side and SO_SNDBUF for the send
side).  Imagine several application threads hanging like this; together
they can exhaust the mbufs that have been loaned out, which are never
returned to the driver because the application threads do not read them.
The buffers are returned to the driver when an application thread reads the
data into an application specific buffer and the mbuf is then freed.
Freeing the mbuf causes the callback routine in the driver to be called for
the loaned mbufs.  When this mechanism of loaning and freeing, and thus
recycling, buffers does not happen as planned, the driver can suffer from
using too much buffer space, and can sometimes hang depending on how it is
written.  It is therefore advised that the driver limit the number of
buffers that are loaned to the application threads.  There should be a
maximum threshold over which the driver will no longer loan out buffers at
all.  Above that threshold the driver should instead copy incoming data into
new mbufs that are given to the upper layers, then immediately recycle the
loaned mbufs and give them back to the device.
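
The threshold policy described above can be sketched as follows.  This is a
minimal model, not driver code from PSO Stack: plain byte buffers stand in
for mbufs, and the names rx_loan_max, rx_pkt, rx_deliver, rx_release and
rx_loans_in_use are hypothetical.  While fewer than rx_loan_max buffers are
outstanding the DMA buffer itself is loaned upwards; above the limit the
data is copied so that the DMA buffer can be handed straight back to the
device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define RX_BUF_SIZE 2048

/* Tunable upper limit on buffers loaned to upper layers (hypothetical). */
int rx_loan_max = 4;

static int rx_loans_outstanding;

/* What the upper layer receives for one incoming frame. */
struct rx_pkt {
    unsigned char *data;
    size_t         len;
    int            is_loan;  /* 1: driver's DMA buffer, 0: private copy */
};

/* Deliver a frame that the device has DMA'ed into dmabuf.
 * Returns 0 on success, -1 if the frame had to be dropped. */
int rx_deliver(unsigned char *dmabuf, size_t len, struct rx_pkt *out)
{
    assert(len <= RX_BUF_SIZE);
    if (rx_loans_outstanding < rx_loan_max) {
        /* Below the threshold: loan the DMA buffer, no copy. */
        rx_loans_outstanding++;
        out->data = dmabuf;
        out->len = len;
        out->is_loan = 1;
        return 0;
    }
    /* Over the threshold: copy, so dmabuf can be recycled at once. */
    out->data = malloc(len ? len : 1);
    if (out->data == NULL)
        return -1;
    memcpy(out->data, dmabuf, len);
    out->len = len;
    out->is_loan = 0;
    return 0;
}

/* Plays the role of the free callback run when the upper layer is done. */
void rx_release(struct rx_pkt *pkt)
{
    if (pkt->is_loan) {
        assert(rx_loans_outstanding > 0);
        rx_loans_outstanding--;   /* buffer may go back to the device */
    } else {
        free(pkt->data);
    }
    pkt->data = NULL;
}

int rx_loans_in_use(void)
{
    return rx_loans_outstanding;
}
```

Keeping the limit in a global variable follows the suggestion above that
the threshold be tunable; a good value still has to be found by testing.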


6. Implementing zero-copy


As a packet travels from the network wire all the way up to the application,
there can be many places where the data is copied from one buffer to another,
sometimes needlessly.  It is possible to minimize this data copy overhead
at various layers.

We are mostly concerned with two places: driver level and socket level
data copies.  The code in between these layers (IP, TCP, UDP, etc.) is
already very well optimized.  Any additional optimizations in these protocol
areas can take a long time to implement and debug, yielding questionable
benefits, if any.

6.1 Device driver layer zero copy strategy

The basic data structure used for buffering is the mbuf.  In PSO Stack,
which is derived from the NetBSD Unix implementation, the mbuf mechanism has
a lot of advanced features that are useful for implementing copy avoidance.
In particular, this mbuf implementation supports cluster mbufs whose data is
not actually copied when m_copy is called.  Instead a reference is
recorded inside the cluster mbuf data structure for the 'copy', and the data
is not freed until all such references are released (all holders
of copies/references must free their instance of the mbuf).  When the last
reference is finally freed, the free routine function pointer embedded within
the cluster mbuf structure is used to free the cluster mbuf data portion
in a flexible way according to the allocator's policy.  For example, if
the allocator of the data portion of a cluster mbuf was a device driver,
the driver will have its own internal routine that knows what to do
when the buffer is freed.  In other words, the driver will allocate the
buffer and loan it out to others, and eventually, when it is freed, the
driver's own internal function will be called via the pointer.
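
The reference counting and free callback behavior described above can be
modelled in a few lines.  The structures below are deliberately simplified
stand-ins and are NOT the real struct mbuf or the MEXTADD macro from
sys/mbuf.h; every name here (ext_store, mini_mbuf, mini_ext_attach,
mini_ref_copy, mini_free, drv_buf_recycle) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef void (*ext_free_fn)(unsigned char *buf, size_t size, void *arg);

/* Shared description of externally owned storage (the "cluster"). */
struct ext_store {
    unsigned char *buf;
    size_t         size;
    int            refcnt;    /* number of headers pointing at buf */
    ext_free_fn    ext_free;  /* owner's routine, run on last free */
    void          *ext_arg;
};

/* Small header that points into the shared storage. */
struct mini_mbuf {
    struct ext_store *ext;
    unsigned char    *data;
    size_t            len;
};

/* Wrap a driver owned buffer and register the driver's free routine. */
struct mini_mbuf *mini_ext_attach(unsigned char *buf, size_t size,
                                  ext_free_fn fn, void *arg)
{
    struct mini_mbuf *m = malloc(sizeof *m);
    struct ext_store *e = malloc(sizeof *e);

    if (m == NULL || e == NULL) {
        free(m);
        free(e);
        return NULL;
    }
    e->buf = buf;
    e->size = size;
    e->refcnt = 1;
    e->ext_free = fn;
    e->ext_arg = arg;
    m->ext = e;
    m->data = buf;
    m->len = size;
    return m;
}

/* "Copy" len bytes at offset off by taking a reference, not by copying. */
struct mini_mbuf *mini_ref_copy(const struct mini_mbuf *m, size_t off,
                                size_t len)
{
    struct mini_mbuf *c;

    assert(off <= m->len && len <= m->len - off);
    c = malloc(sizeof *c);
    if (c == NULL)
        return NULL;
    c->ext = m->ext;
    c->ext->refcnt++;
    c->data = m->data + off;
    c->len = len;
    return c;
}

/* Drop one reference; the owner's routine runs only on the last one. */
void mini_free(struct mini_mbuf *m)
{
    struct ext_store *e = m->ext;

    if (--e->refcnt == 0) {
        e->ext_free(e->buf, e->size, e->ext_arg);
        free(e);
    }
    free(m);
}

/* Example driver free routine: just count recycled buffers. */
void drv_buf_recycle(unsigned char *buf, size_t size, void *arg)
{
    (void)buf;
    (void)size;
    (*(int *)arg)++;
}
```

In this model the driver's routine is called exactly once, after both the
original header and the reference 'copy' have been freed, whichever of the
two is freed last.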

So, one obvious place for copy avoidance optimization is the device
driver layer.  This can be done via careful use of cluster mbufs for
both input and output paths.  For the input path, the driver typically
allocates buffers large enough for the device's MTU and loans them out
to the device's receive data structure (typically a linked list or ring
buffer of some sort, specific to the hardware architecture).  The device
will DMA incoming frames of data into these buffers and interrupt the
driver's handler routine.  The driver will attach a small mbuf header
that points to this buffer, constructing a cluster mbuf out of the
DMA'ed loan buffer, and pass it up to the upper layer.  For the output side,
the upper layer code will send down an arbitrarily complex chain of
one or more mbufs.  The sizes of the mbufs in the chain are
variable, especially if TCP is sending packets down.  For the output
side buffer loaning to work effectively the hardware has to support
data chaining and arbitrary packet start boundaries.  Data chaining
is where software can build a linked list of buffers, rather than one
contiguous buffer, and give the list to the device for output.  The device
then DMA's data out of the chain of buffers, filling in an MTU's worth (or
less) of bytes as appropriate for framing the output data.  The device has to
support arbitrary start boundaries instead of requiring strict data
alignment such as long word alignment.  The beginning of each
element of the output packet chain must be able to start at any
byte boundary without hanging up the hardware output state machine.
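
The output side data chaining described above amounts to translating an
mbuf chain into a list of (address, length) descriptors without touching
the data.  The sketch below assumes a made-up descriptor layout and a
simplified chain element; struct chain_mbuf, struct tx_desc and
tx_build_desc are hypothetical names, and real hardware descriptors would
also carry physical addresses and ownership bits.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified chain element: one piece of the outgoing packet. */
struct chain_mbuf {
    struct chain_mbuf *next;
    unsigned char     *data;  /* may start at any byte boundary */
    size_t             len;
};

/* Hypothetical transmit descriptor understood by the device. */
struct tx_desc {
    unsigned char *addr;
    size_t         len;
    int            last;      /* set on the final piece of the frame */
};

/* Fill ring[] from the chain without copying any packet data.
 * Returns the number of descriptors used, or -1 if the chain needs
 * more than max descriptors (the caller would then fall back to
 * copying the chain into one buffer). */
int tx_build_desc(const struct chain_mbuf *m, struct tx_desc *ring, int max)
{
    int n = 0;

    assert(ring != NULL && max >= 0);
    for (; m != NULL; m = m->next) {
        if (m->len == 0)
            continue;         /* empty mbufs need no descriptor */
        if (n == max)
            return -1;
        ring[n].addr = m->data;
        ring[n].len = m->len;
        ring[n].last = 0;
        n++;
    }
    if (n > 0)
        ring[n - 1].last = 1;
    return n;
}
```

Nothing in the loop aligns or pads the pieces, which is exactly why the
device must accept arbitrary start boundaries for this approach to work.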


6.2  Socket layer zero copy strategy

This section is relevant only if you are writing user level socket API
based thread applications or daemons.

To preserve the socket API, VxWorks performs an extra socket level copy
of data from the buffers queued at the socket queue into user buffers.
This is unavoidable if one has to adhere strictly to a socket API
compatible with the original BSD software.  However, for efficiency reasons,
there have been various attempts in VxWorks to support socket level
copy avoidance.  One recent implementation, called zbuf, attempts to
alleviate the copy overhead at socket level, but has resulted in
variable behavior; sometimes the overhead is lower, sometimes it is higher.
Overall, zbuf fails to deliver.  Besides, the complexity of the interface
requires more extensive API and data structure changes than otherwise needed.

A sensible alternative is to bypass the socket API altogether and directly
use the uipc_socket.c level interface, which is what the socket library
interface itself uses.  By interfacing directly to a layer lower than the
socket API, users can pass mbuf data structures to read and write data
on any given socket.  Direct access to mbufs such as this is possible because
VxWorks is a single address space OS that runs as if it were one giant
program.  Direct access to mbufs allows users and applications to use
the same strategy employed by the rest of the protocol kernel code to
avoid data copying.  There is no requirement to copy back and forth between
a user specific malloc'ed piece of memory and an mbuf.  For example,
instead of calling sendto(), even application threads can call sosend().
If you look at the source code for sendto(), it is clear that
sendto() eventually calls sosend() anyway (i.e. so->so_send).  By
bypassing the translation that takes place to make the socket API
conform to the BSD socket library, you can directly call the mbuf based API
specified in uipc_socket.c.


7. Strategies for optimizing IP checksumming

The second highest overhead in the protocol stack is the checksum calculation.
Optimizing the IP checksum routine can easily yield a minor performance
boost.

The portable C version of in_cksum() is written in a way that already performs
IP checksums in a very efficient manner.  However, it is likely possible to
improve on this by recoding the checksum routines in assembly language.  Most
of the NetBSD ports have assembly optimized checksum routines.  These routines
can be ported pretty much as is.  The fact that PSO Stack does not
currently contain an optimized assembly version of the IP checksum code for
all target CPU architectures does not mean that one is not available.  Users
are encouraged to research existing BSD code for further optimization.  For
example, the StrongARM port of the BSD in_cksum() code uses the same C based
implementation augmented with inline assembly code for several key macros,
such as ADD64, ADD32, etc.  This is a good strategy since you only need to
write the equivalent assembly for those macros, while the C routine which
drives the macros stays the same.

The main things to watch out for are: unrolling the loop for better
efficiency, using the largest possible arithmetic operations (for example,
use 32 bit add-with-carry instead of 16 bit versions if at all possible), and
keeping the behavior of the instruction and data caches in mind to avoid
unnecessary cache flushes.
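
As an illustration of the first two points, the sketch below computes the
Internet checksum in portable C, adding 32 bits at a time into a 64 bit
accumulator with a loop unrolled to 8 bytes per iteration, and folding the
carries only at the end.  It is not the in_cksum() from the PSO Stack or
NetBSD sources; the names cksum_partial, cksum_fold and in_cksum_sketch are
hypothetical, and bytes are assembled explicitly in network order so that
the result does not depend on host byte order.  An assembly version would
use add-with-carry instructions instead of the wide accumulator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a wide one's complement sum down to 16 bits (end-around carry). */
uint16_t cksum_fold(uint64_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Add len bytes at p to sum.  Carries are deferred until cksum_fold(). */
uint64_t cksum_partial(const unsigned char *p, size_t len, uint64_t sum)
{
    /* Unrolled main loop: two 32 bit additions (8 bytes) per pass. */
    while (len >= 8) {
        sum += ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
               ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
        sum += ((uint32_t)p[4] << 24) | ((uint32_t)p[5] << 16) |
               ((uint32_t)p[6] << 8)  |  (uint32_t)p[7];
        p += 8;
        len -= 8;
    }
    /* Remaining 16 bit words. */
    while (len >= 2) {
        sum += ((uint32_t)p[0] << 8) | (uint32_t)p[1];
        p += 2;
        len -= 2;
    }
    /* A trailing odd byte is padded with a zero byte on the right. */
    if (len)
        sum += (uint32_t)p[0] << 8;
    return sum;
}

/* One's complement of the folded sum, as a host order number. */
uint16_t in_cksum_sketch(const void *data, size_t len)
{
    assert(data != NULL || len == 0);
    return (uint16_t)~cksum_fold(cksum_partial(data, len, 0));
}
```

Adding 32 bit words and folding afterwards gives the same result as adding
16 bit words one at a time, because 2^16 is congruent to 1 modulo 65535.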

Another approach (taken by the Linux networking code, but not in BSD) is to
combine data copying with checksumming.  This is often called checksum while
copying.  Its advantage is that the data is accessed in a single pass
through the loop instead of two, which reduces the overhead.  Users may wish
to look into this as well.
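
A minimal sketch of checksum while copying is shown below.  It is a plain
16 bit version written for clarity rather than speed, it is not taken from
the Linux or BSD sources, and the name copy_and_cksum is hypothetical.  The
point is only that each source byte is read once and used for both the
copy and the sum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy len bytes from src to dst and return the Internet checksum of
 * the copied data (as a host order number), reading src only once. */
uint16_t copy_and_cksum(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    uint32_t sum = 0;

    assert(len <= 65535);   /* keeps the 32 bit accumulator in range */
    while (len >= 2) {
        d[0] = s[0];
        d[1] = s[1];
        sum += ((uint32_t)s[0] << 8) | (uint32_t)s[1];
        d += 2;
        s += 2;
        len -= 2;
    }
    if (len) {              /* trailing odd byte, zero padded */
        d[0] = s[0];
        sum += (uint32_t)s[0] << 8;
    }
    while (sum >> 16)       /* fold the deferred carries */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

This only helps where a copy is unavoidable in the first place, for example
a copy between a user buffer and an mbuf at the socket layer.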

Remember, the first thing to optimize in TCP/IP networking code is removal of
unnecessary data copying.  The second is IP checksumming.


8. Building and merging PSO Stack with existing VxWorks BSP

8.1 Reducing VxWorks original library

Before linking the bsdsys libraries into an existing VxWorks BSP environment,
you need to make a copy of the existing VxWorks main library, which is
normally linked with your BSP support code to produce a loadable and runnable
final VxWorks image.  The library file is normally located under the
target/lib/ directory and called libXXXgnuvx.a, where XXX is the CPU
architecture (for example, libMC68000gnuvx.a).  The required modification
consists of deleting some of the objects from the copy.  You should back up
the original library.  The files to be deleted are (this is not a complete
list and you may need to hand tune it):

if_bp.o if_cpm.o if_egl.o if_ei.o if_eitp.o if_enp.o if_ex.o if_fn.o if_ilac.o
if_ln.o if_lnsgi.o if_lnPci.o if_loop.o if_med.o if_nic.o if_sl.o if_sm.o
if_sn.o smNetLib.o smNetShow.o if_eihk.o if_elc.o if_dc.o if_ultra.o if_eex.o
if_fei.o if_elt.o if_ene.o if_ulip.o if_es.o if_nicEvb.o if_esmc.o if.o
if_ether.o if_subr.o in.o in_pcb.o in_proto.o ip_icmp.o ip_input.o ip_output.o
raw_cb.o raw_ip.o raw_usrreq.o route.o sys_socket.o tcp_debug.o tcp_input.o
tcp_output.o tcp_subr.o tcp_timer.o tcp_usrreq.o udp_usrreq.o uipc_mbuf.o
uipc_sock.o uipc_sock2.o unixLib.o if_ppp.o pppShow.o ifLib.o inetLib.o
netLib.o netShow.o routeLib.o sockLib.o zbufSockLib.o mbufSockLib.o
bsdSockLib.o memLib.o memPartLib.o memShow.o ipProto.o ipLib.o udpLib.o
udpShow.o tcpLib.o tcpShow.o icmpLib.o igmpLib.o etherLib.o dec21x4xEnd.o


8.2 Compiling the PSO Stack libraries

The PSO Stack libraries lib_pso_bsd.a (the main PSO Stack library) and
lib_bsd_c.a (libc related add-ons) are created when the PSO Stack
distribution is un-tarred and make is run in the directory.

First, un-tar the distribution and set up the environment.  The distribution
currently supports three targets (Solaris/VxSim, PC, StrongARM SA110), and
a few things need to be set up depending on your target.

The directory "machine" needs to be a symbolic link to one of the arch
directories.  For example, VxSim/Solaris uses arch/sparc/include, so you
should make a symlink named machine that points to it ("ln -s
arch/sparc/include machine").  Similarly, use "ln -s arch/i386/include
machine" for PC and "ln -s arch/arm32/include machine" for SA110.

You should also import the environment variables used by the toolchain, as
described in the setup.sh files.  There is one for each target:
setup.sh.arm32 is for SA110, setup.sh.i386 is for PC, and setup.sh.sparc is
for VxSim/Solaris.  If you are using /bin/bash as your shell you can source
these setup files via ". ./setup.sh.sparc", for example.  If you are using
another shell you will need to change the syntax (for example, for csh you
should change "XXX=value; export XXX" to "setenv XXX value").  This step
sets up all the paths and other environment settings.

The last setup step is to make a symlink for the defs.mk file.  There are
three: defs.mk.arm32, defs.mk.i386 and defs.mk.sparc.  Depending on your
target, you should symlink one of these to defs.mk.  For example, for
VxSim/Solaris, you should "ln -s defs.mk.sparc defs.mk".  The defs.mk file
is included by the Makefiles in the various directories.  It is the place
where all information common to those Makefiles is kept.  For example,
CFLAGS is defined there.

Once setup is complete, you should do "make depend" followed by "make".

Sometimes the make will fail, depending on your target type.  For example,
when building for the VxSim/Solaris target, source files that depend on
unsupported features such as PCI bus support and PCI devices (the DEC 21143
and Intel 82558 ethernet device drivers, both of which are PCI) will not
compile cleanly.  When you see that files under dev/pci/ (e.g. if_de.c,
if_fxp.c, etc.) do not compile, it is likely that these are not supported on
your target.  VxSim running on Solaris does not support any direct hardware
access, so PCI is not supported there.  You should take references to these
files out of the Makefiles as needed.  A later release of PSO Stack will
have better configuration for different targets to avoid this problem.


8.3  Building the runnable VxWorks target image

The VxWorks target can be built as a "standalone" image which contains
all the symbols (the symbols can later be stripped out).  Instead of
linking with the normal VxWorks library, you should link in the
reduced library along with the PSO Stack libraries you created.  Doing
this replaces some of the internal routines VxWorks uses for networking.

If you have trouble with missing routines or variables, it is likely due to
a version mismatch.  In most cases, the missing symbols can be stubbed out
(see the vxworks/vxworks_port.c file).

Initialization of the PSO Stack modules can be done as in the provided pso.c
file, which can be linked in as part of your BSP.  Some example programs,
such as the ttcp benchmark program and the blaster/blastee programs, are
provided to give you an initial run of the PSO Stack.


9. Mutex protection of various modules (including TCP/IP stack)

Unlike Unix, a system such as VxWorks which attempts to minimize interrupt
latencies must take care not to spend too much time with interrupts locked out.
The Unix strategy of using interrupt lockouts for mutual exclusion is not
feasible because of this.  A system's interrupt latency is bounded by the
longest continuous segment of code that runs with interrupts locked.  Locking
interrupts out prevents the system from responding to hardware events in a
timely fashion.

VxWorks, therefore, uses semaphores for mutual exclusion.  By avoiding
interrupt lockouts the latency is minimized.  Furthermore, VxWorks tries to
spend as little time as possible in interrupt level code such as interrupt
handlers.  Most VxWorks interrupt handlers spend a minimal amount of time
acknowledging the interrupt and queueing a routine to be performed later by
another thread.  The idea is to do the absolute minimum needed to acknowledge
the interrupt and not spend any more time processing it in the interrupt
level handler.  Instead a message containing a function pointer and arguments
is constructed and sent to a thread that is responsible for carrying out the
"real" work of servicing the request raised by the hardware interrupt.  This
work is done in thread level code, not interrupt level code.  Since a minimal
amount of time is spent at interrupt level, interrupt latency is minimized.
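
The deferral pattern described above can be sketched with a small queue of
(function pointer, argument) messages.  This is a host side model only: the
names isr_defer, worker_drain, rx_process and JOBQ_SIZE are hypothetical,
and on a real target the queue would be an RTOS message queue or similar
primitive that is safe to post to from interrupt level, with the worker
thread blocking on it rather than polling.

```c
#include <assert.h>
#include <stddef.h>

#define JOBQ_SIZE 8   /* hypothetical queue depth */

/* One deferred request: what to call and with which argument. */
struct job {
    void (*fn)(void *arg);
    void  *arg;
};

static struct job jobq[JOBQ_SIZE];
static volatile unsigned jobq_head;   /* advanced by the interrupt side */
static volatile unsigned jobq_tail;   /* advanced by the worker thread */

/* Interrupt level: acknowledge the device (not shown), queue the real
 * work and return at once.  Returns -1 if the queue is full. */
int isr_defer(void (*fn)(void *arg), void *arg)
{
    assert(fn != NULL);
    if (jobq_head - jobq_tail == JOBQ_SIZE)
        return -1;
    jobq[jobq_head % JOBQ_SIZE].fn = fn;
    jobq[jobq_head % JOBQ_SIZE].arg = arg;
    jobq_head++;
    return 0;
}

/* Thread level: run every queued request.  Returns how many were run. */
int worker_drain(void)
{
    int n = 0;

    while (jobq_tail != jobq_head) {
        struct job j = jobq[jobq_tail % JOBQ_SIZE];

        jobq_tail++;
        j.fn(j.arg);          /* the "real" work, outside the ISR */
        n++;
    }
    return n;
}

/* Example deferred routine: count packets handled at thread level. */
void rx_process(void *arg)
{
    (*(int *)arg)++;
}
```

The interrupt side does a bounded, constant amount of work regardless of
how expensive the deferred routine is, which is what keeps latency low.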


10.  VxWorks 'select' support

VxWorks' own select() support is based on the FIOSELECT and FIOUNSELECT ioctl
commands, which are unique to VxWorks.  Unix implements select support
as an integral part of its I/O system design.  VxWorks supports select
on a per-device basis; depending on whether it handles FIOSELECT, a device
may or may not support selection.  The PSO Stack follows the original VxWorks
selection support because it is nearly impossible to implement it otherwise
and still remain compatible with the other I/O devices in VxWorks which
support selection in the original fashion.

The support code for selection is distributed across three different files,
due to header file conflicts and other complexities.  The files are
vxworks/vxworks_iotcl.c, vxworks/vxworks_port.c and pso/bsd_compat.c.
The socket device, which is registered as a VxWorks I/O device, implements
an ioctl routine which has added support for FIOSELECT/FIOUNSELECT.

Functionally, the PSO Stack implementation of the select() call is equivalent
to VxWorks' native version from the application's point of view.

11. Testing strategies

11.1 Driver testing

Network device driver writers can start by unit testing their software.
When unit testing looks good and the driver starts to function properly, it
is possible to test the driver further using the ttcp and blaster programs
included in PSO Stack.  These programs can exhaustively test the robustness
of the driver, as well as benchmark its performance.  These are the tools
that have been used extensively to test the drivers in VxWorks.

Additionally, debugging facilities such as a network analyser can help.
It is often sufficient to just use snoop or tcpdump on Unix (Solaris or Linux)
machines.  However, sometimes it may be necessary to have professional
tools for tracking link level errors and for packet generation.

11.2  Protocol stack testing

The ttcp and blaster programs are good at testing most of the stack profile
that is commonly used in networking software.  However, there is no
substitute for porting as many of the BSD style networking applications as
possible to verify the proper function of the PSO Stack.  For example, proper
ICMP protocol behavior can be tested by porting the ping program.


Provided by    Hwa-Jin Bae, bae@Mail.com, Piedmont California
All modifications to original BSD software placed under original BSD license.