3.6 Distributed vectors
Distributed vectors are widely used by subroutines which manipulate the analysis control vector, including much of the analysis and the Hessian singular vector code. They provide a convenient way to parallelize code which manipulates entire vectors rather than accessing individual elements. Their main advantage is that they hide message passing from the user.
The basis of the parallelization method is the introduction into the IFS of a fortran 90 derived type called a `distributed_vector'. Variables of this type may be thought of as a pointers to one dimensional arrays whose contents are spread over the available processors.
Fortran 90 allows the programmer to define what is meant by assignment, the arithmetic operators, and generic functions, when they are applied to derived types. This feature has been used to allow distributed vectors to appear in expressions and as arguments to the dot product routine. For example, if d1 and d2 are distributed vectors of (say) 10000 elements each, and r1 is a one-dimensional array real array, also of 10000 elements, then the following are legal fortran 90 statements:
r1(:) = d1
d1 = (3.0*d1 -5.5) / DOT_PRODUCT(d1,d2)
What is not immediately obvious is that both statements perform message passing. The first statement copies all the elements of d1 into the array r1. Since many elements of d1 are stored on other processors, they must be sent to the local processor. Similarly, the generic DOT_PRODUCT subroutine has been extended to allow one or both arguments to be distributed vectors, in which case a parallel algorithm is used.
The second statement above illustrates another feature of distributed vector syntax. The arithmetic operations are written as if they apply to the whole vector. However, each processor will perform operations only on the locally held part of the vector, thereby distributing the work.
3.6.2 Using distributed vectors
The following statement declares d1 and d2 to be distributed vectors:
TYPE (distributed_vector) :: d1,d2
(Note that any of the usual qualifiers (INTENT, DIMENSION, ALLOCATABLE, etc.) may appear in the type statement.)
As with any pointer, we must allocate space for the data before we use a distributed vector, and remember to deallocate the space when we have finished with it. The following statements allocate and then deallocate a distributed vector of 10000 elements:
CALL allocate_vector (d1,10000)
CALL deallocate_vector (d1)
Once allocated, distributed vectors may appear in arithmetic expressions with other distributed vectors, with real scalars, or with one dimensional real arrays which are either of the same length as the full vector, or of the same length as the part of the vector which is held on the local processor. The result of such an expression is a real array of the size of the locally held part of the vector.
A distributed vector may appear on either side of an assignment statement. If it appears on the left hand side, then the right hand side must evaluate to a real scalar, to another distributed vector of the same length, to a one dimensional real array of the same length as the full vector, or to a one dimensional array of the same length as the part of the vector which is held locally. If the right hand side of an assignment statement is a distributed vector, then the left hand side must be either another distributed vector, or a real one dimensional array of the same length as the full vector.
A distributed vector may appear as one argument of the DOT_PRODUCT function. The other argument must be either another distributed vector of the same length, or a real array of the same length as the full vector. The functions SUM and MAXVAL may also be called with distributed vector arguments.
A sub-vector may be extracted from a distributed vector into a local array using the function dvsection:
r(i:j) = dvsection (d1,i,j)
The routines gather_vector and scatter_vector may be used to copy between distributed vectors and arrays which are defined on a single processor (for example the IO processor). For examples of the use of these subroutines, see GETMINI and SAVMINI.
3.6.3 Optimizing distributed vector code
The overloaded assignment and arithmetic operators are intended to make it easy to convert existing code, and to make the parallelized code easy to read. However, it should be noted that each assignment and each arithmetic operation results in a subroutine call. This is potentially expensive and inhibits some compiler optimizations. For this reason, an alternative syntax may be used to code frequently executed expressions involving distributed vectors. This syntax is less elegant, but is more efficient. It is illustrated by the following example.
Consider the following statement, where d1 and d2 are distributed vectors of 10000 elements each, and where r1 is a one dimensional real array, also of 10000 elements:
d1 = d2 + r1(:)
This statement results in two subroutine calls: one for the addition and one for the assignment. The following statement is equivalent, but does not result in subroutine calls:
d1%local(:) = d2%local(:) + r1(d1%local_start:d1%local_end)
The notation is explained as follows. Each distributed vector is a structure containing a pointer to the locally held part of the vector, and three integers holding the full length of the vector and the indices of the first and last elements of the locally held part. As illustrated in the example above, the components of a distributed vector are accessible using the "%" notation as (respectively) %local, %global_length, %local_start and %local_end.
There is an implicit assumption that statements involving distributed vectors will be executed by all PEs. Be particularly careful in parallelizing IF statements. Care should also be taken in coding loops which use a mixture of distributed vectors and real arrays. It is easy to generate large amounts of message passing by frequent copying between arrays and distributed vectors.
Another potential danger is a DOT_PRODUCT statement whose arguments are both expressions. For example:
dot = DOT_PRODUCT (d1+5.0 , d2-2.0)
This will not work as expected. Both "d1+5.0" and "d2-2.0" will evaluate to real arrays of the length of the locally held part of d1 and d2. Each processor will apply the standard DOT_PRODUCT routine to its local part of "d1+5.0" and "d2-2.0" and each will calculate a different value for "dot".
This problem arises from the fact that the result of an expression involving a distributed vector is a real array rather than a distributed vector. Unfortunately, it is not possible to make such expressions return distributed vectors. The reason is rather subtle. Consider the following statement, where d1, d2 and d3 are distributed vectors:
d1 = (d2+d3)*d1 + (d2+d3)
The compiler will recognize that (d2+d3) is a common expression, and will generate code of the form:
temp = d2+d3
d1 = dtemp*d1 + temp
If arithmetic operations on distributed vectors returned distributed vectors, then the compiler-generated temporary vector temp would also be a distributed vector. Unfortunately, there is no way in fortran 90 to tell the compiler that allocate_vector and deallocate_vector must be called when temp is created and destroyed. (It would be possible to do this in an object-oriented language, such as C++.)